Research Explainer · Chen et al. (2026)
Across six frontier models on SWE-bench Verified, agent-written tests are widespread but provide almost no leverage on task outcomes. GPT-5.2 writes virtually zero tests yet resolves issues at nearly the same rate as test-heavy models. Suppressing tests slashes costs by up to 49% while losing only 1.8–2.6% of successes.
The GPT-5.2 anomaly: almost no tests, comparable results
Figure: tasks with agent-written tests (%) and issue resolution rate (%), per model on SWE-bench Verified (500 tasks). GPT-5.2 is a clear outlier at 0.6% test-writing yet sits only 2.6pp below the top test-writing model in resolution rate. Based on Chen et al. (2026), Table 1; test-writing frequency is the % of all 500 tasks where the agent creates at least one test artifact.
Pushing the lever hard barely moves outcomes
Figure: prompt intervention results (RQ3). Based on Chen et al. (2026), Figure 3. Whether tests are encouraged or suppressed via prompt changes, ~83% of tasks end up with the same pass/fail result either way.
Where the real impact lands: cost, not correctness
Figure: resolution-rate change and input-token change (%), by intervention direction. Resolution shifts minimally in either direction, while suppressing tests sharply cuts compute cost. Based on Chen et al. (2026), Table 7: encouraging tests adds cost without improving outcomes; discouraging tests saves 32–49% of input tokens at the cost of only 1.8–2.6% fewer successes.
What this study actually did — in plain English
Chen and colleagues wanted to know whether the tests that AI coding agents write for themselves — on the fly, while trying to fix real GitHub issues — actually help them produce correct patches. This matters because test-writing is one of the most common behaviours on the SWE-bench leaderboard. Nearly every top-ranked agent does it. But does it work, or is it just a habit?
They studied six state-of-the-art models running inside mini-SWE-agent, a lightweight scaffold that leaves all testing decisions to the model. Across 500 real-world tasks from SWE-bench Verified, five of the six models wrote at least one test file in 62–99% of tasks. The exception was GPT-5.2, which almost never wrote tests — yet still resolved 71.8% of issues, only 2.6 percentage points below the top model. That anomaly set up the rest of the investigation.
Tests as observation, not verification
When the researchers dug into what agent-written tests actually contain, they found that most feedback comes from value-revealing print statements, not from formal assertions. Across all models, prints consistently outnumber assertions. And the assertions that do appear are dominated by simple property checks and exact-value comparisons — the kind of thing you'd write to see what a function returns, not to rigorously verify correctness. In other words, agents use tests less like a QA engineer and more like a developer running print-debugging in a scratch file.
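The distinction between observation and verification can be made mechanical. As an illustrative sketch (the paper does not specify the authors' actual classification pipeline), one can parse a test file and count value-revealing `print` calls against `assert` statements; the function name `feedback_profile` and the sample test are hypothetical:

```python
import ast

def feedback_profile(test_source: str) -> dict:
    """Count print calls vs assert statements in a piece of test code.

    Illustrative only: a rough proxy for whether a test observes values
    (prints) or verifies them (assertions).
    """
    tree = ast.parse(test_source)
    prints = sum(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
        for node in ast.walk(tree)
    )
    asserts = sum(isinstance(node, ast.Assert) for node in ast.walk(tree))
    return {"prints": prints, "asserts": asserts}

# A hypothetical agent-written "test" in the style the paper describes:
# mostly observation, with one weak property check.
agent_test = """
result = parse_config("sample.ini")
print(result)
print(type(result))
assert result is not None
"""
print(feedback_profile(agent_test))  # prints outnumber asserts
```

A profile like `{"prints": 2, "asserts": 1}` is exactly the print-heavy, assertion-light pattern the study reports across all six models.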
The intervention experiment
The most striking part of the study is the controlled experiment. For models that rarely wrote tests (GPT-5.2, Gemini-3-Pro-Preview), the researchers added a prompt instruction encouraging test creation. For heavy test-writers (Kimi-K2-Thinking, DeepSeek-v3.2-Reasoner), they removed test-encouraging language and added an instruction to avoid writing new test files.
The prompts worked: test-writing behaviour changed dramatically. But outcomes barely moved. Across all four models, an average of 83.2% of tasks ended with the same pass/fail result. Where real impact did appear was in efficiency. Discouraging tests in Kimi-K2-Thinking cut input tokens by 49% and API calls by 35.4%, while resolving only 2.6% fewer tasks. The cost of test-writing, it turns out, is concrete and measurable; the benefit is not.
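The efficiency argument is easy to check with back-of-envelope arithmetic. Using the reported Kimi-K2-Thinking figures (49% fewer input tokens, 2.6% fewer resolved tasks) and a hypothetical 60% baseline resolution rate, the cost per resolved task roughly halves:

```python
def cost_per_success(tokens: float, resolved_rate: float) -> float:
    """Input tokens spent per successfully resolved task."""
    return tokens / resolved_rate

# Normalise the baseline to 100 token-units; the 60% resolution rate
# is a hypothetical stand-in, not a number from the paper.
base = cost_per_success(100.0, 0.60)

# Test suppression: 49% fewer input tokens, 2.6% fewer resolved tasks
# (Kimi-K2-Thinking, Table 7).
suppressed = cost_per_success(100.0 * (1 - 0.49), 0.60 * (1 - 0.026))

print(round(suppressed / base, 2))  # -> 0.52: ~48% cheaper per success
```

The ratio is insensitive to the assumed baseline rate, since it cancels out; only the two reported percentages drive the result.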
The implication for agent design
Agent-written tests are best understood as a learned process style — a behaviour models have picked up from training data — rather than a reliable driver of success. The paper doesn't argue that testing is inherently useless. Instead, it suggests that current LLMs don't know when a test is worth writing, what to test, or how to interpret the results. Until agents develop genuine test-design skill, the default should probably be less testing, not more — and the budget saved should be redirected toward reasoning, code reading, and more careful patch construction.
Reference
Chen, Z., Sun, Z., Shi, Y., Peng, C., Gu, X., Lo, D., & Jiang, L. (2026). Rethinking the value of agent-generated tests for LLM-based software engineering agents (arXiv:2602.07900). arXiv. https://arxiv.org/abs/2602.07900