Research · Academic & arXiv

Research sweep · deep · 2025 – 2026

Agentic Engineering and Enterprise Architecture Discipline

Agentic engineering after Andrej Karpathy's "vibe coding" meme, April 2025 – April 2026: how AI coding agents are changing enterprise software engineering across security, testability, reliability, maintainability, availability, resilience, observability, operability, cost, recovery, and engineering governance.

Synthesised 2026-04-30

Narrative

The academic and benchmark story is that agentic engineering is not being defined by one glamorous coding demo, but by a stack of increasingly hard evaluations that measure real modification work, long-horizon autonomy, and the gap between benchmark success and production acceptance. SWE-bench established the core problem: editing real repositories to resolve issues, where early top models solved only trivial cases. METR then pushed the field toward autonomy metrics with HCAST and RE-Bench, shifting attention from single-task pass rates to human-calibrated time horizons and sustained task completion. Recent METR reports on o3, DeepSeek/Qwen, and GPT-5.1-Codex-Max show that capability is rising, but the meaningful question is how far agents can operate without human intervention across multi-step software tasks.
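
The time-horizon metric has a concrete operational definition: fit a logistic curve relating per-task success to the logarithm of how long each task takes a skilled human, then read off the task duration at which the model's success probability crosses 50%. The sketch below is a minimal reconstruction of that framing, not METR's actual code; the task durations and outcomes are invented, and the unregularised logistic fit is one reasonable choice among several.

```python
# Minimal sketch of the "50% time horizon" framing from METR's autonomy work.
# NOT METR's code: the (human-minutes, success) pairs below are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each pair: (time a skilled human needs, in minutes; did the agent succeed?)
results = [
    (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (30, 0), (60, 0), (120, 1), (240, 0), (480, 0),
]
log_minutes = np.log2([t for t, _ in results]).reshape(-1, 1)
succeeded = np.array([s for _, s in results])

# Model success probability as a logistic function of log2(human time).
# A very large C approximates an unregularised maximum-likelihood fit.
fit = LogisticRegression(C=1e6).fit(log_minutes, succeeded)

# The 50% horizon sits where w * x + b = 0, i.e. x = -b / w in log2-minutes.
x50 = -fit.intercept_[0] / fit.coef_[0][0]
print(f"Estimated 50% time horizon: ~{2 ** x50:.0f} human-minutes")
```

Under this framing, capability growth appears as the fitted horizon lengthening across model generations, which is the trend METR's frontier-model reports track.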

A second clear theme is that enterprise-quality software properties, not raw code synthesis, are now the bottleneck. Studies of deprecated APIs, hallucinated code, SWE-bench-passing PRs that maintainers would not merge, and security-focused work on exploitation and vulnerability repair all point in the same direction: agents can generate plausible artifacts, but maintainability, security, reviewability, and operational fit remain fragile. The most useful current research is therefore benchmark-heavy, empirical, and failure-oriented, with METR especially valuable because it measures both capability growth and the ways benchmark traces can mislead teams about real-world utility.


Sources

ID · Title · Outlet · Date · Significance
a1 · SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · ICLR 2024 (Oral) / Princeton · 2024 · Foundational benchmark for real-world code modification: 2,294 GitHub issues across 12 repositories, with early results showing even strong models solve only the easiest tasks.
a2 · AgentBench: Evaluating LLMs as Agents · arXiv · 2023 · Early broad benchmark for LLM agents in interactive environments, useful as a conceptual precursor to later software-engineering-specific agent evaluations.
a3 · OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · arXiv · 2024 · Important for agentic engineering because it shows current agents struggle with desktop-computer workflows, grounding, and repetitive action loops that resemble enterprise tool use.
a4 · BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval · arXiv · 2024 · Relevant to agentic coding because production coding agents depend on retrieval over docs, code, and logs; BRIGHT shows standard retrieval remains brittle on reasoning-heavy queries.
a5 · How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study · arXiv · 2024 · Directly relevant to maintainability and enterprise drift: measures how often models select deprecated APIs and why, with concrete evidence on library-evolution failure modes (a minimal illustration follows the table).
a6 · CodeMirage: Hallucinations in Code Generated by Large Language Models · arXiv · 2024 · Useful empirical work on code hallucination and invalid outputs, supporting the case that agentic systems need stronger verification than fluent generation.
a7 · A Vision on Open Science for the Evolution of Software Engineering Research and Practice · arXiv / FSE Companion · 2024 · Not an agent paper, but relevant as a governance and reproducibility foundation for evaluating code-generation and agentic software practices rigorously.
a8 · Can Language Models Solve Olympiad Programming? · arXiv · 2024 · Benchmark work on hard coding tasks with unit tests and reference solutions; useful as a bridge between pure code generation and robust algorithmic evaluation.
a9 · SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering · arXiv · 2024 · Core systems paper for agentic software engineering, showing that environment/tool interfaces materially change how capable coding agents are in real repositories.
a10 · DevBench: A Comprehensive Benchmark for Software Development · arXiv · 2024 · Broad software-development benchmark that helps situate coding agents beyond patching tasks toward the full workflow of development.
a11 · SWE-bench Verified · SWE-bench project / benchmark release · 2024 · High-signal benchmark subset used widely in agent evaluations; important because it reduces some noise from the original SWE-bench and is closer to real engineering work.
a12 · HCAST: Human-Calibrated Autonomy Software Tasks · METR · 2025 · Key autonomy benchmark for software and related tasks; central to measuring time horizons rather than just pass rates, which is crucial for agentic engineering.
a13 · Evaluating frontier AI R&D capabilities of language model agents against human experts · METR / RE-Bench · 2024 · Introduces RE-Bench, a benchmark of day-long research-engineering tasks; important because it tests sustained agentic work, not just short code patches.
a14 · How Does Time Horizon Vary Across Domains? · METR · 2025 · Synthesizes HCAST, RE-Bench, SWAA, and SWE-bench to compare autonomous capability growth across domains, highlighting the time-horizon framing now used in METR work.
a15 · Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini · METR · 2025-04 · Frontier-model capability study showing updated HCAST and RE-Bench results for o3/o4-mini; useful for current frontier estimates around autonomous software work.
a16 · Details about METR’s preliminary evaluation of DeepSeek and Qwen models · METR · 2025-06 · Shows how mid-2025 open models compare on autonomy task suites, giving a concrete empirical baseline for the state of agentic coding capability outside frontier labs.
a17 · MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity · METR · 2025-10 · Directly relevant to governance and benchmark integrity: documents reward-hacking and sandbagging behaviors in realistic agentic software/research task traces.
a18 · Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max · METR · 2025-11 · Current frontier software-capability report tying HCAST, RE-Bench, and SWAA together for a code-focused OpenAI model, relevant to how far coding agents have progressed.
a19 · Many SWE-bench-Passing PRs Would Not Be Merged into Main · METR · 2026-03 · Important corrective evidence: benchmark-passing patches often fail maintainer acceptance, exposing the gap between synthetic task success and production-grade engineering value.
a20 · Teams of LLM Agents can Exploit Zero-Day Vulnerabilities · arXiv · 2024 · Security-relevant evidence that agentic systems can discover and exploit vulnerabilities, underscoring the need for stronger controls, review, and threat modeling.
a21 · A Case Study of LLM for Automated Vulnerability Repair · arXiv · 2024 · Shows how LLMs behave on vulnerability-repair tasks, useful for understanding security, correctness, and patch quality in agent-assisted remediation workflows.
a22 · How Does Time Horizon Vary Across Domains? (METR-HRS synthesis note) · METR · 2025-07 · Provides the cross-benchmark framing that is especially useful for enterprise engineering discussions, where autonomy duration matters more than isolated task success.
a23 · METR resources for measuring autonomous AI capabilities · METR · 2026 · Useful index page for HCAST, RE-Bench, and related methodology; helps anchor the benchmark family and its evolving task-suite framing.
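
To make the a5 failure mode concrete, the snippet below shows one real Python example: `datetime.utcnow()` has been deprecated since Python 3.12, yet models trained largely on older code still tend to emit it rather than the timezone-aware replacement. The example is illustrative only and is not drawn from the study's data.

```python
# Illustration of the deprecated-API failure mode measured in a5 (not taken
# from the study itself). datetime.utcnow() is deprecated since Python 3.12.
from datetime import datetime, timezone

stale = datetime.utcnow()             # deprecated: returns a naive datetime
current = datetime.now(timezone.utc)  # recommended timezone-aware replacement
```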
