Research Explainer · Chen et al. (2026)
Across six frontier models on SWE-bench Verified, agent-written tests are widespread but provide almost no leverage on task outcomes. GPT-5.2 writes virtually zero tests yet resolves issues at nearly the same rate as test-heavy models. Suppressing tests slashes costs by up to 49% while losing only 1.8–2.6% of successes.
The GPT-5.2 anomaly: almost no tests, comparable results
Figure: tasks with agent-written tests (%) and issue resolution rate (%), per model on SWE-bench Verified (500 tasks). GPT-5.2 is a clear outlier at 0.6% test-writing yet sits only 2.6pp below the top test-writing model in resolution rate. Based on Chen et al. (2026), Table 1; test-writing frequency is the % of all 500 tasks where the agent creates at least one test artifact.
Pushing the lever hard barely moves outcomes
Figure: prompt intervention results (RQ3). Based on Chen et al. (2026), Figure 3. Whether tests are encouraged or suppressed via prompt changes, ~83% of tasks end up with the same pass/fail result either way.
Where the real impact lands: cost, not correctness
Figure: resolution-rate change and input-token change (%), by intervention direction. Resolution shifts minimally in either direction, while suppressing tests sharply cuts compute cost. Based on Chen et al. (2026), Table 7: encouraging tests adds cost without improving outcomes; discouraging tests saves 32–49% of input tokens at the cost of only 1.8–2.6% fewer successes.
What this study actually did — in plain English
Chen and colleagues wanted to know whether the tests that AI coding agents write for themselves — on the fly, while trying to fix real GitHub issues — actually help them produce correct patches. This matters because test-writing is one of the most common behaviours on the SWE-bench leaderboard. Nearly every top-ranked agent does it. But does it work, or is it just a habit?
They studied six state-of-the-art models running inside mini-SWE-agent, a lightweight scaffold that leaves all testing decisions to the model. Across 500 real-world tasks from SWE-bench Verified, five of the six models wrote at least one test file in 62–99% of tasks. The exception was GPT-5.2, which almost never wrote tests — yet still resolved 71.8% of issues, only 2.6 percentage points below the top model. That anomaly set up the rest of the investigation.
Tests as observation, not verification
When the researchers dug into what agent-written tests actually contain, they found that most feedback comes from value-revealing print statements, not from formal assertions. Across all models, prints consistently outnumber assertions. And the assertions that do appear are dominated by simple property checks and exact-value comparisons — the kind of thing you'd write to see what a function returns, not to rigorously verify correctness. In other words, agents use tests less like a QA engineer and more like a developer running print-debugging in a scratch file.
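The distinction between observation and verification can be made mechanical. As an illustrative sketch (the paper does not specify the authors' actual classification pipeline), one can parse a test file and count value-revealing `print` calls against `assert` statements; the function name `feedback_profile` and the sample test are hypothetical:

```python
import ast

def feedback_profile(test_source: str) -> dict:
    """Count print calls vs assert statements in a piece of test code.

    Illustrative only: a rough proxy for whether a test observes values
    (prints) or verifies them (assertions).
    """
    tree = ast.parse(test_source)
    prints = sum(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
        for node in ast.walk(tree)
    )
    asserts = sum(isinstance(node, ast.Assert) for node in ast.walk(tree))
    return {"prints": prints, "asserts": asserts}

# A hypothetical agent-written "test" in the style the paper describes:
# mostly observation, with one weak property check.
agent_test = """
result = parse_config("sample.ini")
print(result)
print(type(result))
assert result is not None
"""
print(feedback_profile(agent_test))  # prints outnumber asserts
```

A profile like `{"prints": 2, "asserts": 1}` is exactly the print-heavy, assertion-light pattern the study reports across all six models.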
The intervention experiment
The most striking part of the study is the controlled experiment. For models that rarely wrote tests (GPT-5.2, Gemini-3-Pro-Preview), the researchers added a prompt instruction encouraging test creation. For heavy test-writers (Kimi-K2-Thinking, DeepSeek-v3.2-Reasoner), they removed test-encouraging language and added an instruction to avoid writing new test files.
The prompts worked: test-writing behaviour changed dramatically. But outcomes barely moved. Across all four models, an average of 83.2% of tasks ended with the same pass/fail result. Where real impact did appear was in efficiency. Discouraging tests in Kimi-K2-Thinking cut input tokens by 49% and API calls by 35.4%, while resolving only 2.6% fewer tasks. The cost of test-writing, it turns out, is concrete and measurable; the benefit is not.
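The efficiency argument is easy to check with back-of-envelope arithmetic. Using the reported Kimi-K2-Thinking figures (49% fewer input tokens, 2.6% fewer resolved tasks) and a hypothetical 60% baseline resolution rate, the cost per resolved task roughly halves:

```python
def cost_per_success(tokens: float, resolved_rate: float) -> float:
    """Input tokens spent per successfully resolved task."""
    return tokens / resolved_rate

# Normalise the baseline to 100 token-units; the 60% resolution rate
# is a hypothetical stand-in, not a number from the paper.
base = cost_per_success(100.0, 0.60)

# Test suppression: 49% fewer input tokens, 2.6% fewer resolved tasks
# (Kimi-K2-Thinking, Table 7).
suppressed = cost_per_success(100.0 * (1 - 0.49), 0.60 * (1 - 0.026))

print(round(suppressed / base, 2))  # -> 0.52: ~48% cheaper per success
```

The ratio is insensitive to the assumed baseline rate, since it cancels out; only the two reported percentages drive the result.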
The implication for agent design
Agent-written tests are best understood as a learned process style — a behaviour models have picked up from training data — rather than a reliable driver of success. The paper doesn't argue that testing is inherently useless. Instead, it suggests that current LLMs don't know when a test is worth writing, what to test, or how to interpret the results. Until agents develop genuine test-design skill, the default should probably be less testing, not more — and the budget saved should be redirected toward reasoning, code reading, and more careful patch construction.
Reference
Chen, Z., Sun, Z., Shi, Y., Peng, C., Gu, X., Lo, D., & Jiang, L. (2026). Rethinking the value of agent-generated tests for LLM-based software engineering agents (arXiv:2602.07900). arXiv. https://arxiv.org/abs/2602.07900