Research · Academic & arXiv
Research sweep · deep · 2025 – 2026
Agentic Engineering And Enterprise Architecture Discipline
Agentic engineering after Andrej Karpathy's vibe coding meme, April 2025 – April 2026: how AI coding agents are changing enterprise software engineering across security, testability, reliability, maintainability, availability, resilience, observability, operability, cost, recovery, and engineering governance.
- frontier
- academic
- vc
- blogs
- tech
- financial
Synthesised 2026-04-30
Narrative
The academic and benchmark story is that agentic engineering is not being defined by one glamorous coding demo, but by a stack of increasingly hard evaluations that measure real modification work, long-horizon autonomy, and the gap between benchmark success and production acceptance. SWE-bench established the core problem: editing real repositories to resolve issues, where early top models resolved only the easiest cases. METR then pushed the field toward autonomy metrics with HCAST and RE-Bench, shifting attention from single-task pass rates to human-calibrated time horizons and sustained task completion. Recent METR reports on o3, DeepSeek/Qwen, and GPT-5.1-Codex-Max show that capability is rising, but the meaningful question is how long agents can operate without human intervention across multi-step software tasks.
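The time-horizon framing is concrete enough to sketch: each task carries a human-baseline completion time, and a model's 50% time horizon is the task length at which a logistic fit of agent success against log human time crosses 0.5. Below is a minimal sketch of that estimate using hypothetical per-task results and scikit-learn; METR's actual pipeline differs in details such as task weighting and confidence intervals, so this is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results for one agent: how long a human baseliner
# needed for each task (minutes) and whether the agent completed it.
human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Fit success probability against log2(human time), mirroring the logistic
# fit behind the time-horizon metric.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is the human task length at which the fitted success
# probability crosses 0.5, i.e. where the logit equals zero.
b0 = clf.intercept_[0]
b1 = clf.coef_[0][0]
horizon_minutes = 2 ** (-b0 / b1)
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} human-minutes")
```

Read this way, a rising time horizon means the agent keeps succeeding on tasks that take humans progressively longer, which is closer to what enterprise teams care about than a single pass rate.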
A second clear theme is that enterprise-quality software properties, not just code synthesis, are now the bottleneck. The literature on deprecated APIs, hallucinated code, and SWE-bench-passing PRs that maintainers would not merge, together with security-focused studies of exploitation and vulnerability repair, points in the same direction: agents can generate plausible artifacts, but maintainability, security, reviewability, and operational fit remain fragile. The most useful current research is therefore benchmark-heavy, empirical, and failure-oriented, with METR especially valuable because it measures both capability growth and the ways benchmark traces can mislead teams about real-world utility.
Sources
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | ICLR 2024 (Oral) | 2024 | Foundational benchmark for real-world code modification: 2,294 GitHub issues across 12 repositories, with early results showing even strong models resolve only the easiest tasks. |
| a2 | AgentBench: Evaluating LLMs as Agents | arXiv | 2023 | Early broad benchmark for LLM agents in interactive environments, useful as a conceptual precursor to later software-engineering-specific agent evaluations. |
| a3 | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | arXiv | 2024 | Important for agentic engineering because it shows current agents struggle with desktop-computer workflows, grounding, and repetitive action loops that resemble enterprise tool use. |
| a4 | BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval | arXiv | 2024 | Relevant to agentic coding because production coding agents depend on retrieval over docs, code, and logs; BRIGHT shows standard retrieval remains brittle on reasoning-heavy queries. |
| a5 | How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study | arXiv | 2024 | Directly relevant to maintainability and enterprise drift: measures how often models select deprecated APIs and why, with concrete evidence on library evolution failure modes. |
| a6 | CodeMirage: Hallucinations in Code Generated by Large Language Models | arXiv | 2024 | Useful empirical work on code hallucination and invalid outputs, supporting the case that agentic systems need stronger verification than fluent generation. |
| a7 | A Vision on Open Science for the Evolution of Software Engineering Research and Practice | arXiv / FSE Companion 2024 | 2024 | Not an agent paper, but relevant as a governance and reproducibility foundation for evaluating code-generation and agentic software practices rigorously. |
| a8 | Can Language Models Solve Olympiad Programming? | arXiv | 2024 | Benchmark work on hard coding tasks with unit tests and reference solutions; useful as a bridge between pure code generation and robust algorithmic evaluation. |
| a9 | SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering | arXiv | 2024 | Core systems paper for agentic software engineering, showing that environment/tool interfaces materially change how capable coding agents are in real repositories. |
| a10 | DevBench: A Comprehensive Benchmark for Software Development | arXiv | 2024 | Broad software-development benchmark that helps situate coding agents beyond patching tasks toward the full workflow of development. |
| a11 | SWE-bench Verified | SWE-bench project / benchmark release | 2024 | High-signal benchmark subset used widely in agent evaluations; important because it reduces some noise from original SWE-bench and is closer to real engineering work. |
| a12 | HCAST: Human-Calibrated Autonomy Software Tasks | METR | 2025 | Key autonomy benchmark for software and related tasks; central to measuring time horizons rather than just pass rates, which is crucial for agentic engineering. |
| a13 | Evaluating frontier AI R&D capabilities of language model agents against human experts | METR / RE-Bench | 2024 | Introduces RE-Bench, a benchmark for day-long research-engineering tasks; important because it tests sustained agentic work, not just short code patches. |
| a14 | How Does Time Horizon Vary Across Domains? | METR | 2025-07 | Synthesizes HCAST, RE-Bench, SWAA, and SWE-bench to compare autonomous capability growth across domains; its time-horizon framing is especially useful for enterprise engineering discussions, where autonomy duration matters more than isolated task success. |
| a15 | Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini | METR | 2025-04 | Frontier-model capability study showing updated HCAST and RE-Bench results for o3/o4-mini; useful for current frontier estimates around autonomous software work. |
| a16 | Details about METR’s preliminary evaluation of DeepSeek and Qwen models | METR | 2025-06 | Shows how mid-2025 open models compare on autonomy task suites, giving a concrete empirical baseline for the state of agentic coding capability outside frontier labs. |
| a17 | MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity | METR | 2025-10 | Directly relevant to governance and benchmark integrity: documents reward hacking and sandbagging behaviors in realistic agentic software/research task traces. |
| a18 | Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max | METR | 2025-11 | Current frontier software-capability report tying HCAST, RE-Bench, and SWAA together for a code-focused OpenAI model, relevant to how far coding agents have progressed. |
| a19 | Many SWE-bench-Passing PRs Would Not Be Merged into Main | METR | 2026-03 | Important corrective evidence: benchmark-passing patches often fail maintainer acceptance, exposing the gap between synthetic task success and production-grade engineering value. |
| a20 | Teams of LLM Agents can Exploit Zero-Day Vulnerabilities | arXiv | 2024 | Security-relevant evidence that agentic systems can discover and exploit vulnerabilities, underscoring the need for stronger controls, review, and threat modeling. |
| a21 | A Case Study of LLM for Automated Vulnerability Repair | arXiv | 2024 | Shows how LLMs behave on vulnerability repair tasks, useful for understanding security, correctness, and patch quality in agent-assisted remediation workflows. |
| a23 | METR resources for measuring autonomous AI capabilities | METR | 2026 | Useful index page for HCAST, RE-Bench, and related methodology; helps anchor the benchmark family and its evolving task-suite framing. |