Research Explainer · Demirer, Musolff & Yang (2026)

AI coding agents triple the code developers write, but shipped software barely budges

A study of more than 100,000 GitHub developers finds that each generation of AI coding tool delivers bigger task-level gains, yet those gains shrink dramatically as they travel down the production chain toward actual releases and end users.

Published May 2026

180%: cumulative increase in weekly commits after adopting all three generations of AI coding tools

30%: what that 180% gain shrinks to by the time it reaches actual software releases

0.25: estimated elasticity of substitution between AI output and human effort, signalling strong complementarity

0: increase in total app usage across four major marketplaces, despite a surge in new app releases

Two questions everyone skips

Most studies of AI and productivity measure one tool, at one moment, on one task. This paper does something harder. Using public GitHub data combined with internal Microsoft telemetry on more than 100,000 developers, it tracks three successive generations of AI coding tools from 2022 to 2026: autocomplete (Copilot suggesting code as you type), sync agents (Claude Code or Copilot agent mode editing files while you watch), and async agents (Codex or the GitHub agent working autonomously on an assigned task).

Software development has a convenient property for economists: production happens in a visible hierarchy. Lines of code combine into files, files into commits, commits into pull requests, pull requests into projects, and projects into releases. Each layer leaves a trace on GitHub. That makes it possible to ask not just whether AI makes developers faster at writing code, but whether faster code-writing actually produces more shipped software.

The answer to the first question is an emphatic yes. The answer to the second is considerably more sobering.

Productivity gains evaporate as they climb the production hierarchy

Recreated from Demirer, Musolff & Yang (2026), Table 5 and Figure 1. Matched event-study estimates for weeks 21 to 30 after adoption, on a logarithmic scale. Autocomplete refers to AI code completion (GitHub Copilot at launch); sync agents work alongside the developer in real time (e.g. Claude Code); async agents work autonomously on assigned tasks (e.g. OpenAI Codex). Async release effects are not reported because async agents cannot ship releases directly.

A design built for paranoid skeptics

Observational data on tool adoption invites an obvious objection: developers adopt new tools when they are about to do a lot of work anyway, so any post-adoption spike could just be activity bias. The authors take this seriously. Each adopter is matched to a control developer with similar activity, drawn from exactly one year earlier. The one-year shift matters because contemporaneous non-adopters are contaminated: plenty of developers use AI tools privately without leaving public traces, so comparing against them would understate the effect.
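The matching step can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the `DevWeek` record, the 52-week shift, matching with replacement, and nearest-neighbour matching on a single pre-period activity measure are all assumptions made for the example. The text only establishes that each control has similar activity and is drawn from exactly one year earlier.

```python
from dataclasses import dataclass

WEEKS_PER_YEAR = 52  # assumed length of the one-year shift, in weeks


@dataclass(frozen=True)
class DevWeek:
    """Hypothetical record: one developer observed at one reference week."""
    dev_id: str
    week: int            # week index on a common calendar
    pre_commits: float   # average weekly commits before the reference week


def match_controls(adopters, candidates):
    """Pair each adopter with the candidate observed exactly one year
    earlier whose pre-period activity is closest (matching with replacement)."""
    matches = {}
    for adopter in adopters:
        target_week = adopter.week - WEEKS_PER_YEAR
        pool = [c for c in candidates if c.week == target_week]
        if not pool:
            continue  # no control available one year earlier
        best = min(pool, key=lambda c: abs(c.pre_commits - adopter.pre_commits))
        matches[adopter.dev_id] = best.dev_id
    return matches


adopters = [DevWeek("adopter-1", week=100, pre_commits=10.0)]
candidates = [
    DevWeek("control-a", week=48, pre_commits=9.0),    # one year earlier, similar
    DevWeek("control-b", week=48, pre_commits=30.0),   # one year earlier, too active
    DevWeek("control-c", week=100, pre_commits=10.0),  # contemporaneous, excluded
]
print(match_controls(adopters, candidates))
```

The contemporaneous candidate is never considered, even though its activity matches perfectly. That mirrors the contamination argument: a developer observed in the same week may already be using AI tools privately.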

The placebo tests are the most persuasive part. Adopting GitHub Pro, a paid subscription with zero AI features, produces the same short-run activity spike as the AI tools, then collapses to nothing within about five weeks. Adopting Docker, a non-AI tool detected the same way as Claude Code, stabilises at a 23% effect. AI coding agents stabilise at 109% and stay there for 30 weeks. Whatever activity bias exists, it cannot explain the gap.

Two further checks support the estimates. The autocomplete estimates line up with prior randomised field experiments on the same tool in the same period, and treatment effects for Claude Code jump precisely when Anthropic ships a new frontier model. The early-adopter cohort analysis shows the rising long-run effects track model releases, not developer learning. The tools are genuinely getting better.

Each generation adds to the last

On the headline task-level measure, weekly commits, the progression across generations is steep. Autocomplete lifts commits by roughly 40%. Adding sync agents takes the cumulative effect to about 140%. Async agents push it to 180% through agent-authored commits. Effects are largest for the least active developers (an 85% gain for the bottom quartile under autocomplete, 217% under sync agents) but remain substantial even for the most prolific.

The per-tool breakdown shows the gains are not an artifact of one product. Every tool delivers a large, statistically significant, persistent effect, though magnitudes differ: Claude Code's 199% long-run sync effect partly reflects its command-line interface attracting more sophisticated adopters, a selection effect the design cannot fully separate from tool quality.

Tool	Weeks 1–10	Weeks 11–20	Weeks 21–30
Autocomplete	61.3	40.3	35.9
Sync effect (pooled)	115.3	89.7	109.1
· Claude Code	186.2	139.5	199.2
· GitHub Sync	106.9	51.0	42.7
· OpenAI Codex	106.5	74.3	94.3
Async effect (pooled)	85.2	43.6	33.6
· GitHub Agent	52.8	36.8	41.9
· OpenAI Codex	134.5	51.4	31.0

Demirer, Musolff & Yang (2026), Table 4. Percentage change in weekly commits by tool and time horizon. The authors' preferred estimates are the long-run weeks 21 to 30 figures; early weeks may include experimentation effects.
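One way to relate the table to the 180% headline is to add up the pooled long-run effects generation by generation. The sketch below does only that. Treating the cumulative figure as a simple sum of percentage effects is an assumption made here for illustration; the explainer does not say how the authors build it.

```python
# Pooled long-run (weeks 21 to 30) effects on weekly commits, in percent,
# copied from the table above.
LONG_RUN_COMMIT_EFFECTS = {
    "autocomplete": 35.9,
    "sync_agents": 109.1,
    "async_agents": 33.6,
}


def cumulative_effects(effects):
    """Running total of per-generation effects, in adoption order.

    Assumes effects add in percentage points (an illustrative assumption).
    """
    running = 0.0
    totals = {}
    for generation, pct in effects.items():
        running += pct
        totals[generation] = round(running, 1)
    return totals


totals = cumulative_effects(LONG_RUN_COMMIT_EFFECTS)
for generation, pct in totals.items():
    print(f"after {generation}: {pct}%")
```

Under that assumption the running totals come to about 145% after sync agents and about 179% after async agents, close to the rounded figures quoted in the text.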

The weak link eats the gains

Here is the paper's central finding. Sync agents raise lines of code by 741% and pull requests by 65%, yet releases rise by only 20%. Autocomplete follows the same gradient: 228% on lines of code, 36% on commits, 10% on releases. The further an outcome sits from raw code-writing, the less of the AI gain survives the journey.
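The gradient is easier to see as a pass-through share: the downstream percentage gain divided by the gain in lines of code. The sketch below uses only the figures quoted above. The ratio of percentage gains is a crude summary chosen for illustration, not a statistic the paper is known to report.

```python
# Long-run effects quoted in the paragraph above, in percent.
EFFECTS = {
    "autocomplete": {"lines_of_code": 228, "commits": 36, "releases": 10},
    "sync_agents": {"lines_of_code": 741, "pull_requests": 65, "releases": 20},
}


def pass_through(effects, upstream="lines_of_code"):
    """Share of the upstream percentage gain that appears at each later stage."""
    base = effects[upstream]
    return {
        stage: pct / base
        for stage, pct in effects.items()
        if stage != upstream
    }


for tool, tool_effects in EFFECTS.items():
    shares = pass_through(tool_effects)
    formatted = {stage: f"{share:.1%}" for stage, share in shares.items()}
    print(tool, formatted)
```

On this measure roughly 4% of the autocomplete gain in lines of code, and under 3% of the sync-agent gain, survives to releases.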

The authors formalise this with a hierarchical production model in which each layer combines upstream output with human effort through a CES technology. The key parameter is the elasticity of substitution: if abundant AI-generated code could substitute for scarce human review, gains would pass through largely intact. Calibrating the model to the autocomplete attenuation pattern yields an elasticity of 0.25, deep in complements territory. With complements, even infinite automation of an upstream layer yields bounded gains. At an elasticity of 0.25, fully automating an upstream stage raises output by at most 26% if humans still gate the downstream stages. This is the O-ring and weak-links logic of Kremer and Jones, applied vertically inside a single production process and, for once, tested against data.
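The bound can be reproduced with a single CES layer. The sketch below is an illustration under stated assumptions, not the authors' calibration: it normalises baseline upstream output and human effort to 1 and puts equal weight (`share=0.5`) on the two inputs. The explainer does not report the share weights; equal weights are used here because they happen to reproduce the 26% figure.

```python
def ces(upstream, human, sigma=0.25, share=0.5):
    """CES aggregate of upstream output and human effort.

    sigma is the elasticity of substitution (0.25 in the text);
    share is the weight on upstream output (an assumption here).
    """
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    return (share * upstream**rho + (1.0 - share) * human**rho) ** (1.0 / rho)


def max_gain(sigma=0.25, share=0.5):
    """Limit of the output gain as upstream output grows without bound,
    holding human effort at its baseline of 1. Finite only for complements."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("the bound is finite only when 0 < sigma < 1")
    rho = (sigma - 1.0) / sigma
    # With rho < 0 the upstream term vanishes as upstream output grows,
    # leaving (1 - share) ** (1 / rho) times human effort.
    return (1.0 - share) ** (1.0 / rho) - 1.0


baseline = ces(upstream=1.0, human=1.0)
flooded = ces(upstream=1e6, human=1.0)  # upstream output scaled a millionfold
print(f"gain with abundant upstream output: {flooded / baseline - 1.0:.1%}")
print(f"analytical bound:                   {max_gain():.1%}")
```

With these assumptions both lines print a gain of about 26%: the cube root of 2, minus 1. Raising `sigma` toward 1 loosens the bound, which is why the estimated elasticity matters so much for how far upstream gains can travel.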

The model also explains why agents beat autocomplete on final output. A tool's impact depends not just on its strength but on where it enters the hierarchy. Autocomplete enters at layer one and must survive five layers of human bottlenecks; async agents create entire pull requests, entering at layer four with only two layers left to traverse. The empirical async effect on pull requests (72%) sharply exceeds what pass-through from the code layer would predict, confirming the tools are intervening directly at higher layers. The bottleneck is migrating up the chain as capabilities improve. It just has not reached the top yet.
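The role of the entry point can be illustrated by chaining identical CES layers and letting the same boost enter with five or with two layers still to traverse. This is a stylised sketch, not the paper's calibrated model: identical layers, equal share weights, human effort fixed at 1 in every layer, and a boost that simply doubles output at the entry layer are all assumptions made for the example.

```python
SIGMA = 0.25  # elasticity of substitution reported in the text
SHARE = 0.5   # weight on upstream output: an assumption, not from the paper


def ces_layer(upstream, human=1.0, sigma=SIGMA, share=SHARE):
    """One layer: combine upstream output with human effort via CES."""
    rho = (sigma - 1.0) / sigma
    return (share * upstream**rho + (1.0 - share) * human**rho) ** (1.0 / rho)


def final_gain(boost, layers_remaining):
    """Gain in final output when output at the entry layer rises by `boost`
    (1.0 means +100%) and must pass through `layers_remaining` human-gated
    layers. Baseline output is normalised to 1 at every layer."""
    output = 1.0 + boost
    for _ in range(layers_remaining):
        output = ces_layer(output)
    return output - 1.0


early = final_gain(boost=1.0, layers_remaining=5)  # enters at layer one
late = final_gain(boost=1.0, layers_remaining=2)   # enters at layer four
print(f"enter at layer one:  {early:.1%}")
print(f"enter at layer four: {late:.1%}")
```

Under these assumptions a doubling that enters with two layers left survives as a gain of roughly 9% in final output, against roughly 1% when five layers remain. The exact numbers depend entirely on the assumed weights; the ordering is the point.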

More apps, nobody downloading them

The final act extends the question past GitHub to actual consumers. Across the Apple App Store, Google Play, the Chrome Web Store, and SourceForge, the authors track every new application and its early usage. The supply side responds: new iOS apps roughly double from around 50,000 to 100,000 per month between early 2025 and April 2026, Chrome extensions double too, and Google Play breaks out of a multi-year decline. SourceForge, whose developer base uses AI less, shows nothing. The timing aligns with the agentic-coding era beginning February 2025.

Demand does not respond. Total usage by new app cohorts in their first three months is flat or declining on every platform. That rules out both the blockbuster story, in which a few new hits lift total usage, and the long-tail story, in which many niche apps each find a small audience. Worse for optimists, the share of new apps failing to reach even a modest audience rose through 2025, from 79% to 86% on iOS and from 18% to 31% on Chrome, which also rules out the better-matching story. The marginal AI-era app is, on the whole, invisible to users.

The authors are careful about interpretation. Either the marginal apps are low quality, or discovery and adoption form yet another bottleneck that takes time to clear, and barely a year of post-launch data cannot distinguish the two. Either way, the pattern is the aggregate echo of the developer-level result: publishing is no longer the binding constraint, but reaching users still is.

THE BOTTOM LINE

AI coding tools deliver enormous and growing task-level productivity gains, but software production is a chain of complementary stages, and a chain moves at the pace of its slowest link. With an estimated elasticity of substitution of 0.25 between AI output and human effort, gains at the code-writing layer compress sharply before reaching releases, and compress again before reaching users. The binding constraint in software is shifting from writing code to reviewing, integrating, and distributing it. Until AI eases those downstream stages too, expect impressive activity statistics and far more modest changes in what actually ships.

Reference

Demirer, M., Musolff, L., & Yang, L. (2026). Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools. NBER Working Paper, No. 35275. http://www.nber.org/papers/w35275