Coding agents make a software background optional, but expertise still decides who succeeds

An analysis of roughly 400,000 Claude Code sessions finds a clear split of labour: people decide what to build, the agent decides how. Command of a domain, not the ability to write code, is what makes sessions succeed.

Published June 2026

70% of planning decisions ('what to do') are made by the person, while Claude makes about 80% of execution decisions

2× Expert-rated sessions reach verified success more than twice as often as novice-rated ones

+25% rise in the estimated value of the typical task across the seven-month window

33% → 19% fall in the share of sessions spent fixing broken code from October to April

People decide what, Claude decides how

The report rests on a privacy-preserving analysis of about 400,000 interactive Claude Code sessions from roughly 235,000 people between October 2025 and April 2026. No researcher reads individual transcripts. Instead, classifiers read each session and label it by what work was attempted, who appears to be doing it, and whether it succeeded. Those labels are then checked against automatically recorded telemetry, such as whether any lines of code changed.

The headline pattern is a clean division of labour. A decision-attribution classifier lists the meaningful decisions in a session, splits them into planning (what to do, which approach, what counts as done) and execution (which files to touch, what code to write, which commands to run), and assigns each to the person or to Claude. On average people make about 70% of the planning decisions and only about 20% of the execution decisions. People decide what to build, and the agent decides how to build it.
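As a rough illustration of the tallying step, the sketch below computes the person's share of planning and execution decisions for a single session. The `Decision` record, its field values, and the sample session are hypothetical; the report's classifier and its output format are not described here, and the published figures are averages across sessions rather than a single tally.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One labelled decision (hypothetical record format)."""
    kind: str   # "planning" or "execution"
    actor: str  # "person" or "claude"


def person_share(decisions: list[Decision]) -> dict[str, float]:
    """Return, for each decision kind, the fraction made by the person."""
    totals = Counter(d.kind for d in decisions)
    by_person = Counter(d.kind for d in decisions if d.actor == "person")
    return {kind: by_person[kind] / totals[kind] for kind in totals}


# Made-up session with 10 planning and 10 execution decisions.
session = (
    [Decision("planning", "person")] * 7
    + [Decision("planning", "claude")] * 3
    + [Decision("execution", "person")] * 2
    + [Decision("execution", "claude")] * 8
)

shares = person_share(session)
print(shares)  # {'planning': 0.7, 'execution': 0.2}
```

The sample counts were chosen to mirror the reported averages, so the output is illustrative only.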

How much the agent does between check-ins depends on who is directing it. A typical session runs about four turns, and each user prompt sets off roughly ten Claude actions and about 2,400 words of output; some prompts set off well over a hundred actions. The more expertise a person brings, the longer those chains get.

Expertise	Actions per prompt	Words of output per prompt
Novice	~5	~600
Expert	~12	~3,200

Hitzig et al. (2026), Figure 3. Approximate activity set off by a single user prompt, by rated expertise. Values are geometric means.
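The table reports geometric means, which are less sensitive than arithmetic means to the occasional prompt that sets off a very long chain of actions. A minimal sketch of the difference, using made-up action counts rather than data from the report:

```python
import math
import statistics

# Hypothetical action counts for five prompts; one long chain skews the data.
actions_per_prompt = [3, 5, 8, 12, 150]

arithmetic = statistics.mean(actions_per_prompt)
geometric = statistics.geometric_mean(actions_per_prompt)

# Equivalent definition: exponentiate the mean of the logarithms.
via_logs = math.exp(statistics.fmean([math.log(x) for x in actions_per_prompt]))

print(f"arithmetic mean: {arithmetic:.1f}")
print(f"geometric mean:  {geometric:.1f}")
```

With these sample values the single 150-action prompt pulls the arithmetic mean far above what most prompts look like, while the geometric mean stays close to the bulk of the data.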

Success climbs with expertise, then flattens

Reconstructed from Hitzig et al. (2026), Figure 5. Verified success requires a successful judgment plus hard evidence such as a passing test suite or a matching commit. The 28–33% verified range covers intermediate-rated sessions and above; the curve is drawn to the reported rates, with most of the gain coming between novice and intermediate.
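The definition of verified success in the caption can be read as a simple conjunction: a success judgment, plus at least one piece of hard evidence. The sketch below encodes that reading. The `SessionLabels` fields are hypothetical names, and the caption lists passing tests and matching commits only as examples of hard evidence, so the real criteria may be broader.

```python
from dataclasses import dataclass


@dataclass
class SessionLabels:
    """Hypothetical per-session labels; field names are assumptions."""
    judged_successful: bool  # the session was judged a success
    tests_passed: bool       # hard evidence: a passing test suite
    matching_commit: bool    # hard evidence: a commit matching the work


def is_verified_success(s: SessionLabels) -> bool:
    """Success judgment AND at least one piece of hard evidence."""
    hard_evidence = s.tests_passed or s.matching_commit
    return s.judged_successful and hard_evidence


# Made-up sessions for illustration.
examples = [
    SessionLabels(True, True, False),   # judged success, tests pass: verified
    SessionLabels(True, False, False),  # judged success, no evidence: not verified
    SessionLabels(False, True, False),  # evidence without a success judgment: not verified
    SessionLabels(True, False, True),   # judged success, matching commit: verified
]

verified_rate = sum(is_verified_success(s) for s in examples) / len(examples)
print(verified_rate)  # 0.5
```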

The returns to expertise

Expertise here is task-specific, and it is not the same thing as a job title. A senior engineer asking their first Rust question counts as a beginner at Rust. An accountant who has never written Python, but tells Claude exactly which reconciliation rules a script must enforce and catches the edge case it mishandles at month-end, counts as an expert at that task. The classifier reads how precisely directions are framed, what the user asks Claude to verify, and who tends to correct whom.

On every measure of success, more expertise means better odds, and most of the gain is front-loaded. A novice-rated session reaches verified success 15% of the time and at least partial success 77% of the time. Sessions rated intermediate or higher reach verified success 28–33% of the time and partial success 91–92% of the time. The jump from novice to intermediate is large; the step from intermediate to expert is modest.

The gap is sharpest when things go wrong. Among sessions that hit verified trouble, the share ending in verified success rises from 4% for novices to 15% for experts, and partial success rises from 60% to around 80%. Novices are also far more likely to walk away empty-handed: 19% of novice sessions are abandoned with zero lines of code written, against 5–7% for everyone else. Part of the value of expertise is simply the ability to steer the agent back on course.

Expertise	Verified success	At least partial success
Novice	4%	60%
Expert	15%	80–81%

Hitzig et al. (2026), Figure 5, restricted to sessions that hit verified trouble (errors, failed tests, repeated attempts, or user pushback). Rates are adjusted by comparing sessions with the same work mode, value band, month, subject, and occupation type.
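The caption says rates are adjusted by comparing sessions that share the same work mode, value band, month, subject, and occupation type, without spelling out the estimator. One common way to do that kind of like-for-like comparison is to compute the expert-novice gap within each stratum and then average the gaps, weighted by stratum size. The sketch below shows that approach under stated assumptions: the function name, the input format, and the weighting are illustrative choices, not the authors' method.

```python
from collections import defaultdict


def adjusted_gap(sessions):
    """Size-weighted average of within-stratum (expert - novice) success gaps.

    `sessions` is an iterable of (stratum, group, success) tuples, where
    `group` is "novice" or "expert". In practice a stratum could be a tuple
    such as (work_mode, value_band, month, subject, occupation_type).
    """
    # For each stratum and group, track [successes, total].
    cells = defaultdict(lambda: {"novice": [0, 0], "expert": [0, 0]})
    for stratum, group, success in sessions:
        cell = cells[stratum][group]
        cell[0] += int(success)
        cell[1] += 1

    weighted_sum, total_weight = 0.0, 0
    for groups in cells.values():
        (nov_ok, nov_n), (exp_ok, exp_n) = groups["novice"], groups["expert"]
        if nov_n == 0 or exp_n == 0:
            continue  # no like-for-like comparison possible in this stratum
        weight = nov_n + exp_n
        weighted_sum += weight * (exp_ok / exp_n - nov_ok / nov_n)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else float("nan")


# Made-up data: strata "A" and "B" contain both groups, "C" only novices.
data = (
    [("A", "novice", s) for s in (True, False, False, False)]
    + [("A", "expert", s) for s in (True, True, False, False)]
    + [("B", "novice", s) for s in (False, False)]
    + [("B", "expert", s) for s in (True, False)]
    + [("C", "novice", True)]
)

gap = adjusted_gap(data)
print(f"adjusted expert-novice gap: {gap:.3f}")
```

Strata that contain only one group, such as "C" here, drop out of the comparison, which is the usual cost of matching on many variables at once.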

The work keeps moving up the stack

What people ask of Claude Code shifted over the seven months. The clearest move is away from cleaning up broken code and towards the work that surrounds it. Fixing fell from 33% to 19% of sessions, while operating software grew from 14% to 21%, and writing documents and analysing data together roughly doubled from about 10% to 20%.

The tasks also grew more valuable. When each session's worth is estimated by what the equivalent work would fetch on a freelance marketplace, the value of the average session rose by about 27% between October and April, with building, operating, and fixing tasks all up by roughly a third. These figures are coarse and best read as relative changes, not literal dollar amounts.

Occupation matters less than you might expect. When each user's profession is inferred from session context, every one of the ten largest occupation groups lands within seven points of software engineers on success, and management sits slightly above them. Coding agents appear to be making a coding background less relevant to successful programming.

Work mode (share of sessions)	October 2025	April 2026
Fixing broken code	33%	19%
Operating software	14%	21%
Writing & data analysis	~10%	~20%

Success measure (code-producing sessions)	Software occupations	Other occupations
Verified success	34%	29%
At least partial success	89%	88%

Hitzig et al. (2026), Figures 4 and 6. First table: share of sessions in each work mode at the start and end of the window. Second table: success rates by whether the user's inferred occupation is software-related.

What it means for everyone else

Read together, the results sketch a tool that absorbs implementation-heavy work while rewarding people who understand the problem in front of them. Coding skill is becoming less of a gatekeeper, and domain expertise more of a multiplier. A working grasp of a field captures most of the benefit; deep mastery adds only a little more on top. So someone with command of any domain may now reach technical work that was previously out of range, while someone without it gets far less from the same agent.

The authors are careful about the limits. They never see real-world outcomes, such as whether code from a session was ever used, and the analysis excludes non-interactive and headless usage, which is a large slice of activity. Every label depends on a model reading a transcript. The interesting test is what happens next: if the returns to expertise start to shrink, that would be a sign the models are beginning to supply the judgment users currently have to bring themselves.

KEY CONTRIBUTION

Across about 400,000 Claude Code sessions, the work splits cleanly: people own the planning and Claude owns the execution. Success is driven by task-specific domain expertise rather than coding background, expert sessions reach verified success more than twice as often as novice ones, and the gains come mostly from competence rather than mastery.

Reference

Hitzig, Z., Massenkoff, M., Lyubich, E., Heller, R., & McCrory, P. (2026). Agentic coding and persistent returns to expertise. Anthropic. https://www.anthropic.com/research/claude-code-expertise