Research Explainer · Lin et al. (2026)

An agent rewrites its own coding harness, and beats the engineers who used to do it by hand

Agentic Harness Engineering turns harness tuning into an automated loop. Ten iterations lift pass@1 on Terminal-Bench 2 from a bash-only 69.7% to 77.0%, past the human-built Codex harness and every self-evolving baseline.

Published May 2026

77.0% pass@1 on Terminal-Bench 2 after ten automated iterations, up from a 69.7% bash-only seed and above the human-designed Codex harness at 71.9%

+10.1 pp: the largest cross-family gain, seen when the frozen harness is dropped onto deepseek-v4-flash with no re-evolution at all

12% fewer tokens than the seed on SWE-bench-verified, while still posting the highest aggregate success rate

11.1% regression recall: the loop can say which tasks an edit will fix, but is nearly blind to which it is about to break

The harness, not the model

A coding agent is more than its language model. Around the model sits a harness: the system prompt that shapes its work style, the tools that expose the shell and file system, the middleware that manages context and recovery, and the long-term memory it carries between tasks. On long-horizon software tasks this scaffolding moves the score as much as the model does, even when the model is held fixed.

The catch is that harness engineering is hand work. Developers read trajectories, spot recurring failure patterns, and hand-craft edits across prompts, tools, and middleware. Worse, the best harness is model-specific, so every new base model needs a fresh round of manual tuning. As models ship faster, that manual loop falls behind.

The authors argue the bottleneck is not the cleverness of an automating agent but observability: give an evolution agent a clear action space and structured evidence, and it can converge on better designs on its own. Agentic Harness Engineering (AHE) is their attempt to prove it.

Ten iterations of an agent editing its own harness

Recreated from Lin et al. (2026), Figure 1. Per-iteration pass@1 and the best-so-far envelope on Terminal-Bench 2 (89 tasks), with the human-designed Codex baseline for reference. All three role agents share one base model, so the gain is attributable to harness edits.

Schematic of the AHE closed loop. The NexAU harness components feed a coding agent that produces a raw trace; a debugger distills the trace into evidence; and an evolve agent uses that evidence to modify the components. Across three observability layers, the harness components, rollout traces, and edit decisions all become observable artifacts that the evolving agent reads and improves each round.

The loop knows its fixes, not its regressions

Recreated from Lin et al. (2026), Figure 4. Cross-iteration mean precision and recall of the Evolve Agent's self-predictions over 9 rounds, against a random-prediction baseline. Fix predictions land ~5x above chance; regression predictions barely clear it.

Three pillars that make every edit falsifiable

AHE is a closed loop in which one agent rewrites another agent's harness while the base model stays frozen. It rests on three matched forms of observability, each turning a messy part of the problem into something an agent can actually read and act on.

Component observability: The harness is decoupled into seven editable component types exposed as files at fixed mount points (system prompt, tool description, tool implementation, middleware, skill, sub-agent, long-term memory). Each failure pattern maps to one component class, and each logical edit is one git commit, so changes are localized and revertible.
Experience observability: An Agent Debugger distills roughly 10 million raw trajectory tokens into a layered, drill-down evidence corpus of about 10 thousand tokens. The evolver reads structured root causes and a benchmark-level overview rather than raw logs, with original traces available on demand.
Decision observability: Every edit ships with a self-declared prediction in a change manifest: the failure evidence, root cause, targeted fix, expected fixes, and at-risk regressions. The next round intersects those predictions with observed task-level deltas, so each edit becomes a falsifiable contract that is rolled back at file granularity if it fails.
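The decision-observability contract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `ChangeManifest` fields, the `check_manifest` helper, the task ids, the file path, and above all the rollback rule (revert the edit's files when none of its predicted fixes landed) are assumptions made for the example. The source only says that predictions are intersected with observed task-level deltas and that failed edits are rolled back at file granularity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ChangeManifest:
    """Hypothetical shape of one edit's self-declared prediction."""
    commit: str                 # git commit holding this logical edit
    files: List[str]            # harness files the edit touched
    root_cause: str             # diagnosed failure pattern
    expected_fixes: Set[str] = field(default_factory=set)  # predicted fail -> pass
    at_risk: Set[str] = field(default_factory=set)         # predicted pass -> fail


def check_manifest(manifest: ChangeManifest,
                   before: Dict[str, bool],
                   after: Dict[str, bool]) -> dict:
    """Intersect an edit's predictions with the observed task-level deltas."""
    fixed = {t for t, ok in before.items() if not ok and after.get(t, False)}
    regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
    confirmed = manifest.expected_fixes & fixed
    # Assumed rule (not from the paper): the contract fails when no predicted
    # fix landed, and then every file the edit touched is reverted.
    rollback = [] if confirmed else list(manifest.files)
    return {
        "confirmed_fixes": confirmed,
        "foreseen_regressions": manifest.at_risk & regressed,
        "unforeseen_regressions": regressed - manifest.at_risk,
        "rollback_files": rollback,
    }


# Toy round: task results before and after the edit (True = pass).
before = {"t1": False, "t2": True, "t3": False, "t4": True}
after = {"t1": True, "t2": False, "t3": False, "t4": True}
manifest = ChangeManifest(
    commit="abc123",                      # placeholder commit id
    files=["middleware/retry.py"],        # placeholder path
    root_cause="agent gives up after the first transient shell error",
    expected_fixes={"t1", "t3"},
    at_risk={"t4"},
)
result = check_manifest(manifest, before, after)
print(result)
```

In this toy round one predicted fix lands, so the edit is kept, but the only regression (`t2`) was not in the at-risk set. That is the same asymmetry the paper reports at scale.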

Beating the humans and the prompt-tuners

A single ten-iteration campaign from a deliberately minimal bash-only seed (NexAU0) finished in about 32 hours and topped every baseline on the panel: three human-designed harnesses and two self-evolving loops, ACE and Training-Free GRPO, all starting from the same seed.

The gap to the prompt-only methods is a layer mismatch. ACE distills natural-language playbooks and TF-GRPO reinforces tool sequences, but neither opens the surrounding scaffolding to edits. AHE jointly evolves prompt, tools, middleware, and memory, and the gain concentrates in exactly the layers the others leave untouched. The one soft spot is the Hard tier, where AHE's 53.3% trails Codex's 56.7% (and TF-GRPO's 55.6%) because its own components interfere on the longest tasks.

Method	All (89)	Easy (4)	Medium (55)	Hard (30)
OpenCode (human)	47.2%	75.0%	52.7%	33.3%
Terminus-2 (human)	62.9%	75.0%	74.5%	40.0%
Codex (human)	71.9%	75.0%	80.0%	56.7%
NexAU0 seed	69.7%	87.5%	78.2%	51.7%
ACE	68.9%	91.7%	78.2%	48.9%
TF-GRPO	72.3%	100.0%	79.4%	55.6%
AHE	77.0%	100.0%	88.2%	53.3%

Lin et al. (2026), Table 1. Pass@1 on Terminal-Bench 2 by official difficulty. NexAU0 is the shared seed; ACE, TF-GRPO, and AHE are self-evolution loops layered on it.

The frozen harness travels

If the harness only encoded Terminal-Bench tricks, it would not survive a move. It does. Dropped unchanged onto SWE-bench-verified, AHE posts the highest aggregate success while spending 12% fewer tokens than the seed, 21% fewer than TF-GRPO, and 32% fewer than ACE. The two prompt-only baselines actually regress below the seed here, because the text they inject rides every model call and adds cost without reshaping behavior.
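The three token figures can be put on a common scale with a little arithmetic. The sketch below assumes each percentage is measured against the named method's own token count (so "21% fewer than TF-GRPO" means AHE uses 0.79 times TF-GRPO's tokens). The source does not spell out that convention, so the derived ratios are illustrative rather than reported results.

```python
# Fraction of tokens AHE saves relative to each method, as quoted in the text.
savings_vs = {"seed": 0.12, "TF-GRPO": 0.21, "ACE": 0.32}

# AHE's token use expressed in units of the seed's token use.
ahe_vs_seed = 1.0 - savings_vs["seed"]


def cost_vs_seed(name: str) -> float:
    """Token use of `name` in units of the seed's token use.

    From AHE = (1 - s) * method, it follows that method = AHE / (1 - s).
    """
    return ahe_vs_seed / (1.0 - savings_vs[name])


relative_cost = {name: round(cost_vs_seed(name), 2) for name in savings_vs}
print(relative_cost)  # {'seed': 1.0, 'TF-GRPO': 1.11, 'ACE': 1.29}
```

Under that assumption, TF-GRPO spends about 11% more tokens than the seed and ACE about 29% more, which is consistent with the point that injected text adds cost on every model call.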

Cross-model transfer tells the same story. Re-evaluated on five alternate base models with no further evolution, the harness lifts pass@1 everywhere, from +2.3 pp on GPT-5.4 to +10.1 pp on deepseek-v4-flash. The cross-family gains are the largest, which the authors read as weaker bases leaning more heavily on coordination patterns AHE has fixed inside tools, middleware, and memory.

A component ablation localizes the value. Swap in memory, tools, or middleware alone and each beats the seed on its own. Swap in only the system prompt and it regresses by 2.3 pp. Factual harness structure transfers; prose-level strategy does not. The three positive single-component gains sum to +11.1 pp against full AHE's +7.3 pp, so the components interact non-additively rather than stacking cleanly.

Variant	All (89)	Easy (4)	Medium (55)	Hard (30)
NexAU0 seed	69.7%	87.5%	78.2%	51.7%
+ memory only	75.3%	50.0%	83.6%	63.3%
+ tool only	73.0%	75.0%	87.3%	46.7%
+ middleware only	71.9%	100.0%	81.8%	50.0%
+ system prompt only	67.4%	75.0%	78.2%	46.7%
AHE full	77.0%	100.0%	88.2%	53.3%

Lin et al. (2026), Table 3. Component-level ablations on Terminal-Bench 2. Each row swaps a single AHE component into the seed, holding the other three at their defaults.
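The non-additivity claim can be checked directly from the "All (89)" column of the table. The short script below only redoes that arithmetic; the variable names are illustrative.

```python
# Pass@1 on all 89 tasks, taken from the ablation table (percent).
seed = 69.7
single_component = {
    "memory": 75.3,
    "tool": 73.0,
    "middleware": 71.9,
    "system prompt": 67.4,
}
full_ahe = 77.0

# Gain of each single-component swap over the seed, in percentage points.
gains = {name: round(score - seed, 1) for name, score in single_component.items()}

# Sum of the positive gains versus the gain of the full harness.
positive_sum = round(sum(g for g in gains.values() if g > 0), 1)
full_gain = round(full_ahe - seed, 1)
shortfall = round(positive_sum - full_gain, 1)

print(gains)         # {'memory': 5.6, 'tool': 3.3, 'middleware': 2.2, 'system prompt': -2.3}
print(positive_sum)  # 11.1
print(full_gain)     # 7.3
print(shortfall)     # 3.8
```

The 3.8 pp shortfall is the size of the interaction effect: combining the components recovers less than the sum of what each delivers alone.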

What it still can't see

The loop's self-attribution is sharp in one direction and dull in the other. Its fix predictions land roughly five times above chance, so each edit targets a real, anticipated failure rather than an arbitrary task. But its regression predictions barely clear the random baseline. Across nine rounds the agent issued 43 regression predictions and only 5 landed, while 40 regressions it never foresaw actually happened.
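The counts in that paragraph are enough to reproduce the regression precision and recall. The snippet below uses the standard definitions (precision is hits over predictions, recall is hits over actual regressions). The recall matches the 11.1% headline figure, while the precision value is derived here from the quoted counts rather than quoted from the paper.

```python
# Counts quoted in the text, across nine rounds.
predicted_regressions = 43   # regression predictions the agent issued
hits = 5                     # predictions that actually regressed
unforeseen = 40              # regressions that occurred without a prediction

actual_regressions = hits + unforeseen   # every regression that happened

precision = hits / predicted_regressions
recall = hits / actual_regressions

print(f"precision = {precision:.1%}")  # precision = 11.6%
print(f"recall    = {recall:.1%}")     # recall    = 11.1%
```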

That blindness is exactly what produces the non-monotone dips in the evolution curve: the agent can justify why an edit should help, but cannot reliably name what the same edit is about to break. The authors flag regression foresight as the clearest direction for future self-evolution loops, and are candid that AHE is a controlled research prototype, with operating-point coupling and incomplete guardrails still on the table.

KEY CONTRIBUTION

AHE reframes harness tuning as an observability problem rather than an agent-capability one. By exposing components as files, distilling rollouts into a layered evidence corpus, and binding every edit to a falsifiable next-round prediction, it lets a coding agent improve its own scaffolding autonomously, and the resulting harness transfers across benchmarks and model families without re-evolution.

Reference

Lin, J., Liu, S., Pan, C., Lin, L., Dou, S., Xi, Z., Huang, X., Yan, H., Han, Z., Gui, T., & Jiang, Y.-G. (2026). Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses. arXiv preprint arXiv:2604.25850. https://arxiv.org/abs/2604.25850