Research Explainer · Benkovich & Valkov (2026)

Give AI agents a team structure and they outperform solo agents — even with weaker models

Agyn organises four specialised AI agents into a software engineering team with a manager, researcher, engineer, and code reviewer — each with its own sandbox, tools, and model. On SWE-bench 500, this production-first system resolves 72.2% of issues, beating single-agent baselines on comparable models by 7.2 percentage points. No benchmark tuning required.

72.2% · SWE-bench 500 resolution rate
4 · specialised agent roles on the team
+7.2pp · over single-agent baseline (same model family)

The four-agent team structure

💼 Manager (GPT-5, medium reasoning): Coordinates the workflow. Decides when to research, specify, implement, or request review. Does not write code.

🔎 Researcher (GPT-5, medium reasoning): Explores the repo and issue, identifies root causes, and writes a structured task specification for the engineer.

💻 Engineer (GPT-5-Codex, medium): Writes and tests code in an isolated sandbox. Uses a smaller, cheaper, code-specialised model for fast iteration.

📝 Reviewer (GPT-5-Codex, medium): Opens a PR, inspects the diff, and leaves inline code-review comments. Approves or requests changes.

GitHub-native development workflow

1. Analyse: Manager reads the issue
2. Research: Researcher explores the repo
3. Specify: Task spec written
4. Implement: Engineer codes and tests
5. PR Review: Reviewer inspects the diff
6. Iterate: Until the reviewer approves

Steps are not fixed — the manager decides dynamically when to loop back to research, request re-implementation, or proceed to review. The number of iterations emerges from the task, not from a predefined pipeline.
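The dynamic control loop described above can be sketched roughly as follows. Everything here is an illustrative assumption: the `Step` enum, the shared-state dictionary, and the `run_issue` function are invented for this sketch and are not Agyn's published internals.

```python
from enum import Enum, auto

class Step(Enum):
    RESEARCH = auto()
    SPECIFY = auto()
    IMPLEMENT = auto()
    REVIEW = auto()
    DONE = auto()

def run_issue(issue, manager, agents, max_rounds=12):
    """Loop until the manager declares DONE; iteration count emerges from the task."""
    state = {"issue": issue, "spec": None, "patch": None, "review": None}
    for _ in range(max_rounds):
        step = manager.next_step(state)      # dynamic choice, not a fixed pipeline
        if step is Step.DONE:
            return state["patch"]
        state = agents[step].run(state)      # the chosen agent updates shared state
    return None                              # safety valve: stop after max_rounds
```

The point of the sketch is the absence of a hard-coded sequence: the manager may return `RESEARCH` twice in a row, or bounce between `IMPLEMENT` and `REVIEW` until the patch passes.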

SWE-bench 500 results (GPT-5-family models)

[Bar chart: SWE-bench 500 resolution rate by architecture type, comparing multi-agent (Agyn) against single-agent baselines]

Based on Benkovich & Valkov (2026), Table 1. All systems use GPT-5-family models. Agyn was not tuned for SWE-bench — it runs the same configuration used in production.

What this system does differently — in plain English

Most AI coding agents today are solo operators: a single model gets handed a GitHub issue, a shell, and a code editor, and tries to do everything — read the repo, understand the bug, write a fix, test it, and submit a patch. Agyn takes a different approach, one borrowed from how human engineering teams actually work. It breaks the problem into roles: a manager who coordinates but doesn't code, a researcher who investigates the repo and writes a spec, an engineer who implements the fix in a sandboxed environment, and a reviewer who opens a real pull request and leaves inline code-review comments.

Each agent has its own context window, its own tools, and — critically — can use a different model. The engineer and reviewer use GPT-5-Codex, a smaller code-specialised model that's faster and cheaper for iterative work. The manager and researcher use GPT-5, a larger general-purpose model better suited to understanding issues, navigating repositories, and making coordination decisions. This mirrors how real teams allocate skill: you don't need a staff engineer to run test suites, and you don't want a junior deciding what to build next.
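The per-role model allocation can be written down as a small configuration table. The `Role` dataclass and its field names are invented for illustration; only the role names and model assignments come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    model: str            # each role can run a different model
    specialised: bool     # True if the role uses the code-specialised model

TEAM = (
    Role("manager",    "gpt-5 (medium reasoning)", specialised=False),
    Role("researcher", "gpt-5 (medium reasoning)", specialised=False),
    Role("engineer",   "gpt-5-codex (medium)",     specialised=True),
    Role("reviewer",   "gpt-5-codex (medium)",     specialised=True),
)
```

Treating the team composition as data rather than code makes the "org design" explicit: swapping a role's model is a one-line config change, not an architectural one.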

Key design decisions that mattered

Three design choices stand out. First, isolated execution environments: each agent gets its own sandbox where it can freely experiment, run tests, and discard failed attempts without polluting the shared repo. This means the engineer can try three different approaches and throw away two of them, exactly as a human developer would on a feature branch.
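A minimal sketch of per-agent sandboxes, assuming plain directory copies stand in for whatever isolation mechanism Agyn actually uses; the function names are hypothetical.

```python
import os
import shutil
import tempfile

def make_sandbox(repo_path: str) -> str:
    """Copy the shared repo into a throwaway directory the agent may mutate freely."""
    workdir = tempfile.mkdtemp(prefix="agent-sandbox-")
    sandbox = os.path.join(workdir, "repo")
    shutil.copytree(repo_path, sandbox)
    return sandbox

def discard_sandbox(sandbox: str) -> None:
    """Throw away a failed attempt; the shared repo is never touched."""
    shutil.rmtree(os.path.dirname(sandbox))
```

Because every experiment happens in a disposable copy, abandoning an approach is as cheap as deleting a directory, which is the same affordance a human gets from a feature branch.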

Second, manager-mediated communication: agents don't talk to each other directly. All coordination flows through the manager, who decides which agent to invoke next based on the current state. This prevents the confused back-and-forth that plagues naive multi-agent setups and creates a clean audit trail of who did what and why.
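Manager-mediated communication is essentially hub-and-spoke routing with an audit log. This toy `Manager` class is an illustration of the pattern, not Agyn's implementation.

```python
class Manager:
    """All inter-agent traffic flows through this hub; agents never talk peer-to-peer."""

    def __init__(self, agents):
        self.agents = agents      # role name -> callable taking a task string
        self.audit_log = []       # clean trail of who did what, with which input

    def dispatch(self, role: str, task: str) -> str:
        result = self.agents[role](task)
        self.audit_log.append((role, task, result))
        return result             # the output returns to the manager, not to a peer
```

Because every exchange passes through `dispatch`, the audit log is complete by construction; no side-channel between agents can escape it.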

Third, automation-first prompting: because there's no human in the loop, the system can't afford agents that pause and ask for permission or produce partial outputs waiting for feedback. Agent prompts are explicitly designed to discourage conversational habits: requesting approval, hedging, or producing status updates instead of artefacts. Completion is defined objectively: a task is done when the reviewer approves the pull request, not when an agent says it's done.
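The objective completion rule can be sketched as a predicate over review history. The `PullRequest` shape and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    reviews: list = field(default_factory=list)   # e.g. "request_changes", "approve"

def is_done(pr: PullRequest, agent_claims_done: bool) -> bool:
    """Completion is objective: the latest review must be an approval.

    The agent's own claim of being finished is deliberately ignored,
    which is why agent_claims_done never appears in the return value.
    """
    return bool(pr.reviews) and pr.reviews[-1] == "approve"
```

Anchoring "done" to an external, checkable event closes the loop without trusting any single agent's self-report.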

The bigger idea: org design for agents

The central claim is that how you organise agents matters as much as which model you use. A thoughtfully structured team of weaker models can match or beat a more powerful solo agent. This reframes the frontier of autonomous software engineering: instead of only chasing better models, invest in better coordination, role specialisation, and team structure. The authors argue this mirrors a well-established lesson from human organisations — that a team of specialists with clear communication protocols routinely outperforms a brilliant generalist working alone.

Reference

Benkovich, N., & Valkov, V. (2026). Agyn: A multi-agent system for team-based autonomous software engineering (arXiv:2602.01465). arXiv. https://arxiv.org/abs/2602.01465