Research Explainer · Mertens et al. (2026)
Across 17,000 worker evaluations of more than 3,000 real labor-market tasks, frontier models improve broadly across task lengths, not in sudden bursts. By 2029 most text-based tasks could hit 80–95% success rates.
Published March 2026
3.8 mo doubling time for the task length frontier models can complete at a 50% success rate
60% average rate at which model outputs are accepted by domain-expert evaluators without edits
80–95% projected success rate on most text-based labor tasks by 2029 at a minimally sufficient quality level
Two ways AI could eat the labor market
Most public discussion of AI automation assumes a crashing wave: capabilities loiter just out of reach for some class of work, then surge over it almost overnight. Workers in that class go from safe to obsolete in a quarter. Recent benchmark-based research from METR has been read this way, with steep logistic curves linking AI success to task duration.
Mertens and colleagues at MIT FutureTech propose an alternative: a rising tide. Same logistic shape, but flatter. Each new model lifts performance on short tasks and long tasks by similar amounts. No domain wakes up to find itself suddenly automated, but everything is gradually improving at once.
The distinction matters because it changes what workers can see coming. A crashing wave hides automation risk until it's too late to retrain. A rising tide gives more warning, but because every domain erodes at once, it may be more disruptive in aggregate.
Mertens et al. (2026), Figure 4. Logistic curves of AI response acceptance against human task duration, fit on roughly 17,000 worker evaluations. Sufficient = usable with no edits; Average = average quality with no edits; Superior = above-average quality with no edits.
Mertens et al. (2026), Figure 6 Panel (a). Predicted success rate (score ≥ 7) for frontier models by quarter, by task length. The lines move up together — the slope barely changes.
Mertens et al. (2026), Table 1. Logistic slope of acceptance on log task duration, by O*NET job family. More negative means longer tasks are harder for AI. Asterisks omitted; selected families shown for clarity.
Build the dataset workers actually face
The authors took the U.S. Department of Labor's O*NET database of 18,786 job tasks and used GPT-4 to keep the 11,768 with at least 10% LLM time-savings potential. GPT-5 then generated six concrete scenarios per task — a five-minute restaurant check split, a one-week leadership program design, a four-hour project status deck. Each scenario was filtered to be self-contained, text-completable, and a meaningful slice of the underlying job.
Then they did the part everyone else skips. They hired workers with at least six months of on-the-job experience in that exact occupation, paid via Prolific, and asked them to rate five different LLM responses per task on a 1–9 scale. Score 7 means "useful as is, minimally sufficient". Score 9 means "superior quality, no edits". After heavy quality control (34.6% of responses thrown out), 17,205 evaluations remained, spanning 41 models released between June 2023 and August 2025.
Crucially, this is non-deterministic real work — drafting, advising, planning, splitting checks — not the deterministic coding puzzles in METR's benchmark.
What the curves actually show
At the sufficient-quality threshold, the logistic slope on log task duration is −0.31. A tenfold increase in task length drops accepted-without-edits rates by roughly 7.6 percentage points. That is shallow. METR's comparable slope on coding benchmarks is around −1.0.
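To see what a slope of −0.31 means in practice, here's a minimal sketch of the logistic relationship. The intercept is an illustrative placeholder, not the paper's fitted value:

```python
import math

def acceptance(log10_minutes, slope=-0.31, intercept=0.5):
    # Logistic acceptance curve on log10 task duration.
    # slope=-0.31 is the paper's sufficient-quality estimate;
    # the intercept is an illustrative placeholder.
    z = intercept + slope * log10_minutes
    return 1 / (1 + math.exp(-z))

# A tenfold increase in duration shifts the logit by the slope (-0.31).
p_short = acceptance(log10_minutes=1.0)   # ~10-minute task
p_long  = acceptance(log10_minutes=2.0)   # ~100-minute task
print(f"drop: {(p_short - p_long) * 100:.1f} pp")
```

Near the curve's midpoint, a logit change of 0.31 moves the probability by roughly 0.25 × 0.31 ≈ 7.7 percentage points, in line with the ~7.6-point drop quoted above.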
In Q2 2024, frontier models already cleared a 50% acceptance rate on tasks taking humans about three hours. By Q3 2025 that crept up to roughly one week. At the 80% threshold the feasible duration is much shorter — five minutes or less — but the doubling time is the same: 3.8 months, with tight confidence bands.
The failure rate (1 minus the acceptance rate) halves every 2.4 to 3.2 years, which converts to 8–11 percentage points of additional acceptance per year across the five-minute to 24-hour range.
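The conversion from a halving time to percentage points per year depends on where the failure rate sits today. A quick sketch, with a 45% failure baseline assumed purely for illustration:

```python
def yearly_gain_pp(failure_now, halving_years):
    """Acceptance gained over the next year (percentage points),
    assuming the failure rate halves every `halving_years` years."""
    failure_next = failure_now * 0.5 ** (1 / halving_years)
    return (failure_now - failure_next) * 100

# Illustrative baseline: a task class at ~55% acceptance today
# (45% failure). The 2.4- and 3.2-year halving times bracket the
# paper's estimate; actual baselines vary across the
# five-minute to 24-hour range.
for T in (2.4, 3.2):
    print(f"halving time {T} yr -> +{yearly_gain_pp(0.45, T):.1f} pp/yr")
```

Different starting failure rates shift the per-year gain, which is why the paper reports a band (8–11 points) rather than a single number.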
Size and vintage do different jobs
Splitting the 41 models reveals a clean asymmetry. Larger models (>100B parameters) rotate the success curve outward — they help most on short tasks, less on long ones. Newer models (released in 2025) shift the curve up in parallel — they help short and long tasks by similar amounts.
The authors' reading: scale buys you headroom on individual hard sub-steps, but stretching the chain of coupled steps needed for a week-long task probably requires deliberate post-training on long-horizon work. Reinforcement learning over sequenced actions, not just more parameters.
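The rotation-versus-parallel-shift asymmetry is easy to see with two toy logistic curves. All parameters here are made up for illustration, not the paper's fits:

```python
import math

def acceptance(log10_min, intercept, slope):
    # Logistic acceptance as a function of log10 task duration (minutes).
    return 1 / (1 + math.exp(-(intercept + slope * log10_min)))

# Hypothetical curves: "bigger" rotates the curve (steeper slope,
# higher intercept); "newer" shifts it up in parallel (same slope).
base   = lambda x: acceptance(x, intercept=0.5, slope=-0.31)
bigger = lambda x: acceptance(x, intercept=0.9, slope=-0.45)
newer  = lambda x: acceptance(x, intercept=0.9, slope=-0.31)

short, long_ = 0.7, 3.2   # ~5-minute vs ~1-day task on the log10 scale
gain_bigger_short = bigger(short) - base(short)
gain_bigger_long  = bigger(long_) - base(long_)
gain_newer_short  = newer(short) - base(short)
gain_newer_long   = newer(long_) - base(long_)
print(f"rotation: {gain_bigger_short:+.3f} short, {gain_bigger_long:+.3f} long")
print(f"parallel: {gain_newer_short:+.3f} short, {gain_newer_long:+.3f} long")
```

The rotated curve gains a lot on short tasks and essentially nothing on long ones; the parallel shift gains about the same everywhere, which is the signature the authors attribute to newer models.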
Not every job family fits the rising-tide story. Personal care and service tasks show a steep slope of −0.93. Architecture and engineering and creative-arts tasks are also meaningfully steeper than average. About a quarter of job families look more wave-like; the rest are genuinely flat.
What 2029 looks like, with caveats
Extrapolating the linear-in-logit trend to 2029 puts most text-based labor-market tasks at 80–95% acceptance rates at the minimally sufficient threshold. Reaching consistently near-perfect performance on tasks that currently sit at low success rates takes several more years beyond that.
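The extrapolation is mechanical once a linear-in-logit trend is assumed. A sketch with illustrative starting acceptance rates, and logit growth rates tied (loosely) to the 2.4–3.2-year failure halving times — none of these inputs are the paper's fitted values:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def project(p_now, years, logit_gain_per_year):
    # Extrapolate acceptance assuming the linear-in-logit trend holds.
    return inv_logit(logit(p_now) + logit_gain_per_year * years)

# Illustrative inputs: task classes at 60-75% acceptance in late 2025,
# with yearly logit gains of ln(2) / halving-time (a rough conversion
# that is exact only at high acceptance levels).
for p_now in (0.60, 0.75):
    for halving in (2.4, 3.2):
        p_2029 = project(p_now, years=4,
                         logit_gain_per_year=math.log(2) / halving)
        print(f"start {p_now:.0%}, halving {halving} yr -> {p_2029:.0%} by 2029")
```

Under these assumptions, most starting points land near or inside the 80–95% band by 2029, which is the shape of the paper's projection.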
The authors are unusually direct about what this doesn't mean. It is not a forecast of job displacement. Their sample over-represents easy-to-survey occupations, every task is delivered self-contained when in reality information must be gathered from messy systems, and "last-mile" integration costs are excluded. Task automation also doesn't map cleanly to occupational automation — losing a task can raise or lower wages depending on how it fits the wider bundle.
They also flag that their projections assume current AI scaling continues. Compute is already several orders of magnitude past where frontier training sat a few years ago, algorithmic progress may be slowing, and hardware is hitting physical limits. Treat the 2029 number as an upper bound, not a base case.
The bottom line
Across thousands of real labor-market tasks, AI capability is rising like a tide, not crashing like a wave: steady, broad gains across task lengths, with failure rates halving every two to three years. That makes automation more visible in advance and more pervasive in aggregate. But the authors stop well short of forecasting job displacement, and their 2029 projection assumes current scaling trends hold.
References
Mertens, M., Kuzee, A., Harris, B. S., Lyu, H., Li, W., Rosenfeld, J., Anto, M., Fleming, M., & Thompson, N. (2026). Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks. arXiv preprint arXiv:2604.01363. https://arxiv.org/abs/2604.01363
Kwa, T., West, B., Becker, J., et al. (2025). Measuring AI ability to complete long tasks. arXiv preprint arXiv:2503.14499. https://arxiv.org/abs/2503.14499
O*NET. (2024). O*NET 29.2 Database. https://www.onetcenter.org/dictionary/29.2/excel/