Research · Academic & arXiv

Research sweep · deep · 2025 – 2026

Agentic Engineering and Enterprise Architecture Discipline

Agentic engineering after Andrej Karpathy's "vibe coding" meme, April 2025 – April 2026: how AI coding agents are changing enterprise software engineering across security, testability, reliability, maintainability, availability, resilience, observability, operability, cost, recovery, and engineering governance.

Synthesised 2026-04-30

Narrative

The academic and benchmark story is that agentic engineering is not being defined by one glamorous coding demo, but by a stack of increasingly hard evaluations that measure real modification work, long-horizon autonomy, and the gap between benchmark success and production acceptance. SWE-bench established the core problem: editing real repositories to resolve issues, where early top models solved only trivial cases. METR then pushed the field toward autonomy metrics with HCAST and RE-Bench, shifting attention from single-task pass rates to human-calibrated time horizons and sustained task completion. Recent METR reports on o3, DeepSeek/Qwen, and GPT-5.1-Codex-Max show that capability is rising, but the meaningful question is how far agents can operate without human intervention across multi-step software tasks.
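
The time-horizon metric has a concrete operational definition: fit a logistic curve relating per-task success to the logarithm of how long each task takes a skilled human, then read off the task duration at which the model's success probability crosses 50%. The sketch below is a minimal reconstruction of that framing, not METR's actual code; the task durations and outcomes are invented, and the unregularised logistic fit is one reasonable choice among several.

```python
# Minimal sketch of the "50% time horizon" framing from METR's autonomy work.
# NOT METR's code: the (human-minutes, success) pairs below are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each pair: (time a skilled human needs, in minutes; did the agent succeed?)
results = [
    (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (30, 0), (60, 0), (120, 1), (240, 0), (480, 0),
]
log_minutes = np.log2([t for t, _ in results]).reshape(-1, 1)
succeeded = np.array([s for _, s in results])

# Model success probability as a logistic function of log2(human time).
# A very large C approximates an unregularised maximum-likelihood fit.
fit = LogisticRegression(C=1e6).fit(log_minutes, succeeded)

# The 50% horizon sits where w * x + b = 0, i.e. x = -b / w in log2-minutes.
x50 = -fit.intercept_[0] / fit.coef_[0][0]
print(f"Estimated 50% time horizon: ~{2 ** x50:.0f} human-minutes")
```

Under this framing, capability growth appears as the fitted horizon lengthening across model generations, which is the trend METR's frontier-model reports track.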

A second clear theme is that enterprise-quality software properties, not raw code synthesis, are now the bottleneck. Studies of deprecated APIs, hallucinated code, SWE-bench-passing PRs that maintainers would not merge, and security-focused work on exploitation and vulnerability repair all point in the same direction: agents can generate plausible artifacts, but maintainability, security, reviewability, and operational fit remain fragile. The most useful current research is therefore benchmark-heavy, empirical, and failure-oriented, with METR especially valuable because it measures both capability growth and the ways benchmark traces can mislead teams about real-world utility.


Sources

ID · Title · Outlet · Date · Significance
a1 · SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · ICLR 2024 (Oral) / Princeton · 2024 · Foundational benchmark for real-world code modification: 2,294 GitHub issues across 12 repositories, with early results showing even strong models solve only the easiest tasks.
a2 · AgentBench: Evaluating LLMs as Agents · arXiv · 2023 · Early broad benchmark for LLM agents in interactive environments, useful as a conceptual precursor to later software-engineering-specific agent evaluations.
a3 · OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · arXiv · 2024 · Important for agentic engineering because it shows current agents struggle with desktop-computer workflows, grounding, and repetitive action loops that resemble enterprise tool use.
a4 · BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval · arXiv · 2024 · Relevant to agentic coding because production coding agents depend on retrieval over docs, code, and logs; BRIGHT shows standard retrieval remains brittle on reasoning-heavy queries.
a5 · How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study · arXiv · 2024 · Directly relevant to maintainability and enterprise drift: measures how often models select deprecated APIs and why, with concrete evidence on library-evolution failure modes (a minimal illustration follows the table).
a6 · CodeMirage: Hallucinations in Code Generated by Large Language Models · arXiv · 2024 · Useful empirical work on code hallucination and invalid outputs, supporting the case that agentic systems need stronger verification than fluent generation.
a7 · A Vision on Open Science for the Evolution of Software Engineering Research and Practice · arXiv / FSE Companion · 2024 · Not an agent paper, but relevant as a governance and reproducibility foundation for evaluating code-generation and agentic software practices rigorously.
a8 · Can Language Models Solve Olympiad Programming? · arXiv · 2024 · Benchmark work on hard coding tasks with unit tests and reference solutions; useful as a bridge between pure code generation and robust algorithmic evaluation.
a9 · SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering · arXiv · 2024 · Core systems paper for agentic software engineering, showing that environment/tool interfaces materially change how capable coding agents are in real repositories.
a10 · DevBench: A Comprehensive Benchmark for Software Development · arXiv · 2024 · Broad software-development benchmark that helps situate coding agents beyond patching tasks toward the full workflow of development.
a11 · SWE-bench Verified · SWE-bench project / benchmark release · 2024 · High-signal benchmark subset used widely in agent evaluations; important because it reduces some noise from the original SWE-bench and is closer to real engineering work.
a12 · HCAST: Human-Calibrated Autonomy Software Tasks · METR · 2025 · Key autonomy benchmark for software and related tasks; central to measuring time horizons rather than just pass rates, which is crucial for agentic engineering.
a13 · Evaluating frontier AI R&D capabilities of language model agents against human experts · METR / RE-Bench · 2024 · Introduces RE-Bench, a benchmark of day-long research-engineering tasks; important because it tests sustained agentic work, not just short code patches.
a14 · How Does Time Horizon Vary Across Domains? · METR · 2025 · Synthesizes HCAST, RE-Bench, SWAA, and SWE-bench to compare autonomous capability growth across domains, highlighting the time-horizon framing now used in METR work.
a15 · Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini · METR · 2025-04 · Frontier-model capability study showing updated HCAST and RE-Bench results for o3/o4-mini; useful for current frontier estimates around autonomous software work.
a16 · Details about METR’s preliminary evaluation of DeepSeek and Qwen models · METR · 2025-06 · Shows how mid-2025 open models compare on autonomy task suites, giving a concrete empirical baseline for the state of agentic coding capability outside frontier labs.
a17 · MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity · METR · 2025-10 · Directly relevant to governance and benchmark integrity: documents reward-hacking and sandbagging behaviors in realistic agentic software/research task traces.
a18 · Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max · METR · 2025-11 · Current frontier software-capability report tying HCAST, RE-Bench, and SWAA together for a code-focused OpenAI model, relevant to how far coding agents have progressed.
a19 · Many SWE-bench-Passing PRs Would Not Be Merged into Main · METR · 2026-03 · Important corrective evidence: benchmark-passing patches often fail maintainer acceptance, exposing the gap between synthetic task success and production-grade engineering value.
a20 · Teams of LLM Agents can Exploit Zero-Day Vulnerabilities · arXiv · 2024 · Security-relevant evidence that agentic systems can discover and exploit vulnerabilities, underscoring the need for stronger controls, review, and threat modeling.
a21 · A Case Study of LLM for Automated Vulnerability Repair · arXiv · 2024 · Shows how LLMs behave on vulnerability-repair tasks, useful for understanding security, correctness, and patch quality in agent-assisted remediation workflows.
a22 · How Does Time Horizon Vary Across Domains? (METR-HRS synthesis note) · METR · 2025-07 · Provides the cross-benchmark framing that is especially useful for enterprise engineering discussions, where autonomy duration matters more than isolated task success.
a23 · METR resources for measuring autonomous AI capabilities · METR · 2026 · Useful index page for HCAST, RE-Bench, and related methodology; helps anchor the benchmark family and its evolving task-suite framing.
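
To make the a5 failure mode concrete, the snippet below shows one real Python example: `datetime.utcnow()` has been deprecated since Python 3.12, yet models trained largely on older code still tend to emit it rather than the timezone-aware replacement. The example is illustrative only and is not drawn from the study's data.

```python
# Illustration of the deprecated-API failure mode measured in a5 (not taken
# from the study itself). datetime.utcnow() is deprecated since Python 3.12.
from datetime import datetime, timezone

stale = datetime.utcnow()             # deprecated: returns a naive datetime
current = datetime.now(timezone.utc)  # recommended timezone-aware replacement
```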
