Lin et al. (2026)
An agent rewrites its own coding harness, and beats the engineers who used to do it by hand
Agentic Harness Engineering turns harness tuning into an automated loop. Ten iterations lift pass@1 on Terminal-Bench 2 from a bash-only 69.7% to 77.0%, past the human-built Codex harness and every self-evolving baseline.
- 77.0%
- pass@1 on Terminal-Bench 2 after ten automated iterations, up from a 69.7% bash-only seed and above the human-designed Codex harness at 71.9%
- +10.1 pp
- largest cross-family gain when the frozen harness is dropped onto deepseek-v4-flash, with no re-evolution at all