Research Explainer · Lin et al. (2026)
Agentic Harness Engineering turns harness tuning into an automated loop. Ten iterations lift pass@1 on Terminal-Bench 2 from a bash-only 69.7% to 77.0%, past the human-built Codex harness and every self-evolving baseline.
Published May 2026
77.0% pass@1 on Terminal-Bench 2 after ten automated iterations, up from a 69.7% bash-only seed and above the human-designed Codex harness at 71.9%
+10.1 pp largest cross-family gain when the frozen harness is dropped onto deepseek-v4-flash, with no re-evolution at all
12% fewer tokens than the seed on SWE-bench-verified, while still posting the highest aggregate success rate
11.1% regression recall: the loop can say which tasks an edit will fix, but is nearly blind to which it is about to break
The harness, not the model
A coding agent is more than its language model. Around the model sits a harness: the system prompt that shapes its work style, the tools that expose the shell and file system, the middleware that manages context and recovery, and the long-term memory it carries between tasks. On long-horizon software tasks this scaffolding moves the score as much as the model does, even when the model is held fixed.
The catch is that harness engineering is hand work. Developers read trajectories, spot recurring failure patterns, and hand-craft edits across prompts, tools, and middleware. Worse, the best harness is model-specific, so every new base model needs a fresh round of manual tuning. As models ship faster, that manual loop falls behind.
The authors argue the bottleneck is not the cleverness of an automating agent but observability: give an evolution agent a clear action space and structured evidence, and it can converge on better designs on its own. Agentic Harness Engineering (AHE) is their attempt to prove it.
Recreated from Lin et al. (2026), Figure 1. Per-iteration pass@1 and the best-so-far envelope on Terminal-Bench 2 (89 tasks), with the human-designed Codex baseline for reference. All three role agents share one base model, so the gain is attributable to harness edits.
Recreated from Lin et al. (2026), Figure 4. Cross-iteration mean precision and recall of the Evolve Agent's self-predictions over 9 rounds, against a random-prediction baseline. Fix predictions land ~5x above chance; regression predictions barely clear it.
Three pillars that make every edit falsifiable
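AHE is a closed loop in which one agent rewrites another agent's harness while the base model stays frozen. It rests on three matched forms of observability, each turning a messy part of the problem into something an agent can actually read and act on: harness components exposed as files, rollouts distilled into a layered evidence corpus, and every edit bound to a falsifiable prediction about the next round.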
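To make the idea of a falsifiable edit concrete, here is a minimal sketch of how an edit's predictions could be checked against the next evaluation round. It is an illustration, not the authors' implementation: `HarnessEdit`, `observed_flips`, `check_edit`, and the task names are all hypothetical, and per-task results are assumed to be simple pass/fail booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessEdit:
    """Hypothetical record of one harness edit and the claims attached to it."""
    component: str                                   # e.g. "prompt", "tools", "middleware", "memory"
    description: str
    predicted_fixes: frozenset = frozenset()         # tasks expected to flip fail -> pass
    predicted_regressions: frozenset = frozenset()   # tasks expected to flip pass -> fail


def observed_flips(before: dict, after: dict):
    """Return (fixed, regressed) task sets between two evaluation rounds."""
    fixed = {t for t in before if not before[t] and after.get(t, False)}
    regressed = {t for t in before if before[t] and not after.get(t, False)}
    return fixed, regressed


def check_edit(edit: HarnessEdit, before: dict, after: dict) -> dict:
    """Compare an edit's predictions with what the next round actually showed."""
    fixed, regressed = observed_flips(before, after)
    return {
        "confirmed_fixes": edit.predicted_fixes & fixed,
        "refuted_fixes": edit.predicted_fixes - fixed,
        "confirmed_regressions": edit.predicted_regressions & regressed,
        "unforeseen_regressions": regressed - edit.predicted_regressions,
    }


# Toy data (invented for illustration): pass/fail per task before and after the edit.
before = {"task_a": False, "task_b": True, "task_c": True}
after = {"task_a": True, "task_b": False, "task_c": True}

edit = HarnessEdit(
    component="middleware",
    description="retry once when a tool call times out",
    predicted_fixes=frozenset({"task_a"}),
)
report = check_edit(edit, before, after)
print(report)
```

In this toy round the predicted fix is confirmed, but `task_b` breaks without having been predicted. That is the kind of unforeseen regression a loop can only notice if each edit carries explicit predictions.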
Beating the humans and the prompt-tuners
A single ten-iteration campaign from a deliberately minimal bash-only seed (NexAU0) finished in about 32 hours and topped every baseline on aggregate pass@1: three human-designed harnesses and two self-evolving loops, ACE and Training-Free GRPO (TF-GRPO), both of which start from the same seed.
The gap to the prompt-only methods is a layer mismatch. ACE distills natural-language playbooks and TF-GRPO reinforces tool sequences, but neither opens the surrounding scaffolding to edits. AHE jointly evolves prompt, tools, middleware, and memory, and the gain concentrates in exactly the layers the others leave untouched. The one soft spot is the Hard tier, where AHE (53.3%) trails Codex (56.7%) and TF-GRPO (55.6%) because its own components interfere on the longest tasks.
| Method | All (89) | Easy (4) | Medium (55) | Hard (30) |
|---|---|---|---|---|
| OpenCode (human) | 47.2% | 75.0% | 52.7% | 33.3% |
| Terminus-2 (human) | 62.9% | 75.0% | 74.5% | 40.0% |
| Codex (human) | 71.9% | 75.0% | 80.0% | 56.7% |
| NexAU0 seed | 69.7% | 87.5% | 78.2% | 51.7% |
| ACE | 68.9% | 91.7% | 78.2% | 48.9% |
| TF-GRPO | 72.3% | 100.0% | 79.4% | 55.6% |
| AHE | 77.0% | 100.0% | 88.2% | 53.3% |
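As a consistency check on the table above, the aggregate column can be recomputed from the tier columns. The sketch assumes the "All (89)" figure is the task-count-weighted mean of the Easy, Medium, and Hard pass rates. The explainer does not state that, but it reproduces every row to within rounding. The names `TIER_SIZES`, `ROWS`, and `weighted_aggregate` are illustrative.

```python
# Task counts per difficulty tier, taken from the table header.
TIER_SIZES = {"easy": 4, "medium": 55, "hard": 30}

# (reported aggregate, per-tier pass@1) for each method, copied from the table.
ROWS = {
    "OpenCode":   (47.2, {"easy": 75.0,  "medium": 52.7, "hard": 33.3}),
    "Terminus-2": (62.9, {"easy": 75.0,  "medium": 74.5, "hard": 40.0}),
    "Codex":      (71.9, {"easy": 75.0,  "medium": 80.0, "hard": 56.7}),
    "NexAU0":     (69.7, {"easy": 87.5,  "medium": 78.2, "hard": 51.7}),
    "ACE":        (68.9, {"easy": 91.7,  "medium": 78.2, "hard": 48.9}),
    "TF-GRPO":    (72.3, {"easy": 100.0, "medium": 79.4, "hard": 55.6}),
    "AHE":        (77.0, {"easy": 100.0, "medium": 88.2, "hard": 53.3}),
}


def weighted_aggregate(tier_rates: dict) -> float:
    """Task-count-weighted mean of per-tier pass rates, in percent."""
    total_tasks = sum(TIER_SIZES.values())
    return sum(TIER_SIZES[tier] * rate for tier, rate in tier_rates.items()) / total_tasks


recomputed = {name: weighted_aggregate(tiers) for name, (_, tiers) in ROWS.items()}

for name, (reported, _) in ROWS.items():
    print(f"{name:12s} reported={reported:5.1f}  recomputed={recomputed[name]:6.2f}")
```

Under that weighting, one Hard task is worth about 3.3 pp of the Hard-tier rate, so AHE's 3.4 pp deficit to Codex on that tier amounts to roughly one task's worth of pass rate.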
The frozen harness travels
If the harness only encoded Terminal-Bench tricks, it would not survive a move. It does. Dropped unchanged onto SWE-bench-verified, AHE posts the highest aggregate success while spending 12% fewer tokens than the seed, 21% fewer than TF-GRPO, and 32% fewer than ACE. The two prompt-only baselines actually regress below the seed here, because the text they inject rides every model call and adds cost without reshaping behavior.
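The three token-saving figures above also imply how the baselines compare with the seed. The sketch below reads "X% fewer tokens than Y" as AHE using (1 - X) times Y's tokens and normalizes the seed to 1.0. That reading is an assumption, and the variable names are illustrative.

```python
# AHE's token savings relative to each comparison harness, from the text above.
ahe_savings = {"seed": 0.12, "TF-GRPO": 0.21, "ACE": 0.32}

seed_tokens = 1.0                                   # normalize the seed's token use
ahe_tokens = seed_tokens * (1 - ahe_savings["seed"])

# If AHE = (1 - s) * other, then other = AHE / (1 - s).
relative_to_seed = {name: ahe_tokens / (1 - s) for name, s in ahe_savings.items()}

print(f"AHE      {ahe_tokens:.3f}x seed")
for name, ratio in relative_to_seed.items():
    print(f"{name:8s} {ratio:.3f}x seed")
```

On this reading TF-GRPO spends about 11% more tokens than the seed and ACE about 29% more, which fits the point that injected text adds cost on every model call.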
Cross-model transfer tells the same story. Re-evaluated on five alternate base models with no further evolution, the harness lifts pass@1 everywhere, from +2.3 pp on GPT-5.4 to +10.1 pp on deepseek-v4-flash. The cross-family gains are the largest, which the authors read as weaker bases leaning more heavily on coordination patterns AHE has fixed inside tools, middleware, and memory.
A component ablation localizes the value. Swap in memory, tools, or middleware alone and each beats the seed on its own. Swap in only the system prompt and it regresses by 2.3 pp. Factual harness structure transfers; prose-level strategy does not. The three positive single-component gains sum to +11.1 pp against full AHE's +7.3 pp, so the components interact non-additively rather than stacking cleanly.
| Variant | All (89) | Easy (4) | Medium (55) | Hard (30) |
|---|---|---|---|---|
| NexAU0 seed | 69.7% | 87.5% | 78.2% | 51.7% |
| + memory only | 75.3% | 50.0% | 83.6% | 63.3% |
| + tool only | 73.0% | 75.0% | 87.3% | 46.7% |
| + middleware only | 71.9% | 100.0% | 81.8% | 50.0% |
| + system prompt only | 67.4% | 75.0% | 78.2% | 46.7% |
| AHE full | 77.0% | 100.0% | 88.2% | 53.3% |
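The non-additivity claim can be checked directly from the "All (89)" column of the ablation table. This is a small arithmetic sketch using only the reported aggregates. The variable names are illustrative.

```python
SEED = 69.7   # NexAU0 seed, aggregate pass@1 (%)
FULL = 77.0   # full AHE harness, aggregate pass@1 (%)

# Aggregate pass@1 when a single evolved component is swapped into the seed.
SINGLE_COMPONENT = {
    "memory": 75.3,
    "tool": 73.0,
    "middleware": 71.9,
    "system prompt": 67.4,
}

# Gain over the seed in percentage points, rounded to the table's precision.
gains = {name: round(score - SEED, 1) for name, score in SINGLE_COMPONENT.items()}
positive_sum = round(sum(g for g in gains.values() if g > 0), 1)
full_gain = round(FULL - SEED, 1)
shortfall = round(positive_sum - full_gain, 1)

for name, gain in gains.items():
    print(f"{name:14s} {gain:+.1f} pp")
print(f"sum of positive gains {positive_sum:+.1f} pp, full AHE {full_gain:+.1f} pp, "
      f"shortfall {shortfall:.1f} pp")
```

The 3.8 pp shortfall between the summed single-component gains and the full harness is the interaction effect the ablation points to.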
What it still can't see
The loop's self-attribution is sharp in one direction and dull in the other. Its fix predictions land roughly five times above chance, so each edit targets a real, anticipated failure rather than an arbitrary task. But its regression predictions barely clear the random baseline. Across nine rounds the agent issued 43 regression predictions and only 5 landed, while 40 regressions it never foresaw actually happened.
That blindness is exactly what produces the non-monotone dips in the evolution curve: the agent can justify why an edit should help, but cannot reliably name what the same edit is about to break. The authors flag regression foresight as the clearest direction for future self-evolution loops, and are candid that AHE is a controlled research prototype, with operating-point coupling and incomplete guardrails still unresolved.
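The regression counts quoted above translate into precision and recall as follows. The sketch treats a prediction that "landed" as a true positive and a regression the agent "never foresaw" as a false negative. That mapping is an assumption, though it reproduces the 11.1% recall figure.

```python
# Regression-prediction counts across the nine rounds, from the text above.
predictions_issued = 43     # regressions the agent predicted
predictions_landed = 5      # predicted regressions that actually occurred
unforeseen = 40             # regressions that occurred without being predicted

# Precision: of the regressions predicted, how many were real.
precision = predictions_landed / predictions_issued

# Recall: of the regressions that happened, how many were predicted.
actual_regressions = predictions_landed + unforeseen
recall = predictions_landed / actual_regressions

print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
```

Both values sit near 11%: most of the regressions the agent called did not happen, and most of those that happened were not called.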
KEY CONTRIBUTION
AHE reframes harness tuning as an observability problem rather than an agent-capability one. By exposing components as files, distilling rollouts into a layered evidence corpus, and binding every edit to a falsifiable next-round prediction, it lets a coding agent improve its own scaffolding autonomously, and the resulting harness transfers across benchmarks and model families without re-evolution.
Reference
Lin, J., Liu, S., Pan, C., Lin, L., Dou, S., Xi, Z., Huang, X., Yan, H., Han, Z., Gui, T., & Jiang, Y.-G. (2026). Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses. arXiv preprint arXiv:2604.25850. https://arxiv.org/abs/2604.25850