Research Explainer · Benkovich & Valkov (2026)

Give AI agents a team structure and they outperform solo agents — even with weaker models

Agyn organises four specialised AI agents into a software engineering team with a manager, researcher, engineer, and code reviewer — each with its own sandbox, tools, and model. On SWE-bench 500, this production-first system resolves 72.2% of issues, beating single-agent baselines on comparable models by 7.2 percentage points. No benchmark tuning required.

72.2% · SWE-bench 500 resolution rate
4 · specialised agent roles on the team
+7.2pp · over single-agent baseline (same model family)

The four-agent team structure

💼 Manager (GPT-5, medium reasoning): Coordinates the workflow. Decides when to research, specify, implement, or request review. Does not write code.

🔎 Researcher (GPT-5, medium reasoning): Explores the repo and issue, identifies root causes, and writes a structured task specification for the engineer.

💻 Engineer (GPT-5-Codex, medium): Writes and tests code in an isolated sandbox. Uses a smaller, cheaper, code-specialised model for fast iteration.

📝 Reviewer (GPT-5-Codex, medium): Opens a PR, inspects the diff, and leaves inline code-review comments. Approves or requests changes.

GitHub-native development workflow

1. Analyse: Manager reads the issue
2. Research: Researcher explores the repo
3. Specify: Task spec written
4. Implement: Engineer codes and tests
5. PR Review: Reviewer inspects the diff
6. Iterate: Until the reviewer approves

Steps are not fixed — the manager decides dynamically when to loop back to research, request re-implementation, or proceed to review. The number of iterations emerges from the task, not from a predefined pipeline.
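The dynamic control loop described above can be sketched roughly as follows. Everything here is an illustrative assumption: the `Step` enum, the shared-state dictionary, and the `run_issue` function are invented for this sketch and are not Agyn's published internals.

```python
from enum import Enum, auto

class Step(Enum):
    RESEARCH = auto()
    SPECIFY = auto()
    IMPLEMENT = auto()
    REVIEW = auto()
    DONE = auto()

def run_issue(issue, manager, agents, max_rounds=12):
    """Loop until the manager declares DONE; iteration count emerges from the task."""
    state = {"issue": issue, "spec": None, "patch": None, "review": None}
    for _ in range(max_rounds):
        step = manager.next_step(state)      # dynamic choice, not a fixed pipeline
        if step is Step.DONE:
            return state["patch"]
        state = agents[step].run(state)      # the chosen agent updates shared state
    return None                              # safety valve: stop after max_rounds
```

The point of the sketch is the absence of a hard-coded sequence: the manager may return `RESEARCH` twice in a row, or bounce between `IMPLEMENT` and `REVIEW` until the patch passes.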

SWE-bench 500 results (GPT-5-family models)

[Bar chart: SWE-bench 500 resolution rate by architecture type, comparing multi-agent (Agyn) against single-agent baselines]

Based on Benkovich & Valkov (2026), Table 1. All systems use GPT-5-family models. Agyn was not tuned for SWE-bench — it runs the same configuration used in production.

What this system does differently — in plain English

Most AI coding agents today are solo operators: a single model gets handed a GitHub issue, a shell, and a code editor, and tries to do everything — read the repo, understand the bug, write a fix, test it, and submit a patch. Agyn takes a different approach, one borrowed from how human engineering teams actually work. It breaks the problem into roles: a manager who coordinates but doesn't code, a researcher who investigates the repo and writes a spec, an engineer who implements the fix in a sandboxed environment, and a reviewer who opens a real pull request and leaves inline code-review comments.

Each agent has its own context window, its own tools, and — critically — can use a different model. The engineer and reviewer use GPT-5-Codex, a smaller code-specialised model that's faster and cheaper for iterative work. The manager and researcher use GPT-5, a larger general-purpose model better suited to understanding issues, navigating repositories, and making coordination decisions. This mirrors how real teams allocate skill: you don't need a staff engineer to run test suites, and you don't want a junior deciding what to build next.
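The per-role model allocation can be written down as a small configuration table. The `Role` dataclass and its field names are invented for illustration; only the role names and model assignments come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    model: str            # each role can run a different model
    specialised: bool     # True if the role uses the code-specialised model

TEAM = (
    Role("manager",    "gpt-5 (medium reasoning)", specialised=False),
    Role("researcher", "gpt-5 (medium reasoning)", specialised=False),
    Role("engineer",   "gpt-5-codex (medium)",     specialised=True),
    Role("reviewer",   "gpt-5-codex (medium)",     specialised=True),
)
```

Treating the team composition as data rather than code makes the "org design" explicit: swapping a role's model is a one-line config change, not an architectural one.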

Key design decisions that mattered

Three design choices stand out. First, isolated execution environments: each agent gets its own sandbox where it can freely experiment, run tests, and discard failed attempts without polluting the shared repo. This means the engineer can try three different approaches and throw away two of them, exactly as a human developer would on a feature branch.
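A minimal sketch of per-agent sandboxes, assuming plain directory copies stand in for whatever isolation mechanism Agyn actually uses; the function names are hypothetical.

```python
import os
import shutil
import tempfile

def make_sandbox(repo_path: str) -> str:
    """Copy the shared repo into a throwaway directory the agent may mutate freely."""
    workdir = tempfile.mkdtemp(prefix="agent-sandbox-")
    sandbox = os.path.join(workdir, "repo")
    shutil.copytree(repo_path, sandbox)
    return sandbox

def discard_sandbox(sandbox: str) -> None:
    """Throw away a failed attempt; the shared repo is never touched."""
    shutil.rmtree(os.path.dirname(sandbox))
```

Because every experiment happens in a disposable copy, abandoning an approach is as cheap as deleting a directory, which is the same affordance a human gets from a feature branch.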

Second, manager-mediated communication: agents don't talk to each other directly. All coordination flows through the manager, who decides which agent to invoke next based on the current state. This prevents the confused back-and-forth that plagues naive multi-agent setups and creates a clean audit trail of who did what and why.
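Manager-mediated communication is essentially hub-and-spoke routing with an audit log. This toy `Manager` class is an illustration of the pattern, not Agyn's implementation.

```python
class Manager:
    """All inter-agent traffic flows through this hub; agents never talk peer-to-peer."""

    def __init__(self, agents):
        self.agents = agents      # role name -> callable taking a task string
        self.audit_log = []       # clean trail of who did what, with which input

    def dispatch(self, role: str, task: str) -> str:
        result = self.agents[role](task)
        self.audit_log.append((role, task, result))
        return result             # the output returns to the manager, not to a peer
```

Because every exchange passes through `dispatch`, the audit log is complete by construction; no side-channel between agents can escape it.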

Third, automation-first prompting: because there's no human in the loop, the system can't afford agents that pause and ask for permission or produce partial outputs waiting for feedback. Agent prompts are explicitly designed to discourage conversational habits: requesting approval, hedging, or producing status updates instead of artefacts. Completion is defined objectively: a task is done when the reviewer approves the pull request, not when an agent says it's done.
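The objective completion rule can be sketched as a predicate over review history. The `PullRequest` shape and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    reviews: list = field(default_factory=list)   # e.g. "request_changes", "approve"

def is_done(pr: PullRequest, agent_claims_done: bool) -> bool:
    """Completion is objective: the latest review must be an approval.

    The agent's own claim of being finished is deliberately ignored,
    which is why agent_claims_done never appears in the return value.
    """
    return bool(pr.reviews) and pr.reviews[-1] == "approve"
```

Anchoring "done" to an external, checkable event closes the loop without trusting any single agent's self-report.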

The bigger idea: org design for agents

The central claim is that how you organise agents matters as much as which model you use. A thoughtfully structured team of weaker models can match or beat a more powerful solo agent. This reframes the frontier of autonomous software engineering: instead of only chasing better models, invest in better coordination, role specialisation, and team structure. The authors argue this mirrors a well-established lesson from human organisations — that a team of specialists with clear communication protocols routinely outperforms a brilliant generalist working alone.

Reference

Benkovich, N., & Valkov, V. (2026). Agyn: A multi-agent system for team-based autonomous software engineering (arXiv:2602.01465). arXiv. https://arxiv.org/abs/2602.01465