Research · Summary

Research sweep · deep · 2025 – 2026

Agentic Engineering And Enterprise Architecture Discipline

Agentic engineering after Andrej Karpathy's vibe coding meme, April 2025 – April 2026: how AI coding agents are changing enterprise software engineering across security, testability, reliability, maintainability, availability, resilience, observability, operability, cost, recovery, and engineering governance.

  • frontier
  • academic
  • vc
  • blogs
  • tech
  • financial

Synthesised 2026-04-30

Overview

Karpathy’s “vibe coding” named a real practice: using an AI assistant to generate working software while staying loose about the code itself. Independent practitioners quickly narrowed that meaning. Simon Willison treated vibe coding as useful for experiments, but not as a synonym for all AI-assisted programming. Addy Osmani later separated prototype-scale “vibes” from agentic engineering, where the work is specified, reviewed, tested, operated, and governed like production software. Sources: Simon Willison's Weblog (2025); Simon Willison's Weblog (2025); AddyOsmani.com (2026)

The defining shift from April 2025 to April 2026 is that AI coding moved from autocomplete and chat assistance into long-running agent workflows. Anthropic’s Claude releases, OpenAI’s Responses API, AgentKit, ChatGPT agent, and Codex system cards all frame software work as multi-step tool use with sandboxing, checkpoints, model safety evaluation, and enterprise controls. Google DeepMind’s Computer Use model extends the same pattern from code to operating software environments. Sources: Anthropic (2025); OpenAI (2025); OpenAI (2025); OpenAI (2025); Google DeepMind (2025)

Agentic engineering is the name for the production problem that vibe coding does not cover. It asks how teams preserve security, testability, reliability, maintainability, availability, observability, operability, compliance, and cost control when software changes are proposed or executed by agents. The point is not that agents replace engineering discipline. The evidence shows they make discipline more explicit, because faster code production pushes pressure into review, verification, architecture, deployment, and incident response. Sources: DORA (2025); DORA (2026); martinfowler.com / ThoughtWorks (2026); martinfowler.com / ThoughtWorks (2026)

Financial and analyst coverage confirms that this is now an enterprise software economics issue, not a developer tooling niche. Reuters reported Replit’s $250 million raise at a $3 billion valuation and Vercel’s $300 million raise at a $9.3 billion valuation, while the Financial Times reported a $7.5 billion funding wave into AI coding start-ups. The capital thesis is that code generation becomes cheaper, but the enterprise cost question shifts toward who pays for validation, integration, operational risk, and lock-in. Sources: Reuters (2025); Reuters (2025); Financial Times (2025)

Key Findings

1. The serious market has moved from vibe coding to governed agentic engineering. Vibe coding remains useful as a label for prototype work where the developer accepts a generated artifact without deeply interrogating it. Production use now centers on specs, acceptance criteria, test harnesses, code review, observability, and rollback. InfoQ’s coverage of Amazon Kiro shows this shift in product language, while ThoughtWorks’ work on context and harness engineering gives it a practitioner vocabulary. Sources: Simon Willison's Weblog (2025); InfoQ (2025); martinfowler.com / ThoughtWorks (2026); martinfowler.com / ThoughtWorks (2026)

2. Coding agents shift the bottleneck from implementation to verification. SWE-bench, SWE-agent, HCAST, RE-Bench, and METR’s GPT-5.1-Codex-Max evaluation all measure more realistic software work than prompt-level code generation. The important pattern is that agents increasingly complete multi-step tasks, but acceptance still depends on maintainability, repository fit, reviewability, and intent alignment. METR’s 2026 finding that many SWE-bench-passing PRs would not be merged is the cleanest warning against treating benchmark pass rates as production readiness. Sources: ICLR 2024 Oral / Princeton publication record (2024); arXiv (2024); METR (2025); METR (2025); METR (2026)

3. The enterprise “-ability” suite becomes harder because generated code is often plausible before it is durable. Studies on deprecated APIs and hallucinated code show that agents can produce code that compiles, resembles accepted practice, and still embeds obsolete interfaces or fabricated dependencies. Security studies on zero-day exploitation and automated vulnerability repair show the same dual-use profile: agents can help defend and repair systems, but they also lower the cost of attack exploration and superficial fixes. Sources: arXiv (2024); arXiv (2024); arXiv (2024); arXiv (2024)

4. Architecture becomes more important, not less. Agents perform better when they operate inside bounded contexts with stable interfaces, reference applications, explicit dependency rules, and local conventions. ThoughtWorks’ recommendation to anchor coding agents to reference applications is a concrete response to this problem. Context engineering gives agents the right system information, while harness engineering constrains what they can do and how outputs are checked. Sources: ThoughtWorks Technology Radar (2025); martinfowler.com / ThoughtWorks (2026); martinfowler.com / ThoughtWorks (2026)
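
A minimal sketch of the harness-engineering idea: agent-proposed tool calls pass through an allowlist plus an acceptance check before anything is applied. The tool names, `ToolCall` record, and check are illustrative assumptions, not any vendor's agent API.

```python
# Sketch of a harness around an agent's proposed actions: only allowlisted
# tools may run, and every call must pass an acceptance check first.
# Tool names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

ALLOWED_TOOLS = {"read_file", "edit_file", "run_tests"}  # hypothetical tools

@dataclass
class ToolCall:
    tool: str
    args: dict

def run_in_harness(calls: List[ToolCall],
                   accept: Callable[[ToolCall], bool]) -> List[str]:
    """Apply each call only if allowlisted and accepted; log every decision."""
    log = []
    for call in calls:
        if call.tool not in ALLOWED_TOOLS:
            log.append(f"blocked:{call.tool}")
        elif not accept(call):
            log.append(f"rejected:{call.tool}")
        else:
            log.append(f"applied:{call.tool}")
    return log
```

The design point is that the harness, not the agent, owns the boundary: an out-of-scope call is blocked before any check runs.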

5. Security governance has to include the agent’s context, tools, identity, and supply chain. The relevant attack surface is no longer only the generated code. It includes prompt and context poisoning, tool permissions, dependency selection, secrets exposure, sandbox escape, and provenance loss. ThoughtWorks warned that coding assistants threaten the software supply chain, OpenAI published system-card work for agent behavior, and Cloudflare’s identity-aware sandboxing posts show infrastructure vendors treating agent execution as an access-control problem. Sources: martinfowler.com / ThoughtWorks (2025); martinfowler.com (2025); OpenAI (2025); Cloudflare Blog (2026)
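
Treating agent execution as an access-control problem can be sketched as identity-aware, least-privilege scoping: each agent identity carries an explicit scope set, and anything outside it is denied. The agent name, scope strings, and trailing-`*` wildcard below are hypothetical, not any vendor's scheme.

```python
# Sketch of least-privilege tool permissions keyed to an agent identity.
# Scope strings support an illustrative trailing-* prefix wildcard.
AGENT_SCOPES = {
    "ci-coding-agent": {"repo:read", "repo:write:feature/*", "tests:run"},
}

def is_permitted(agent: str, permission: str) -> bool:
    """Allow only exact-match scopes or trailing-* prefix scopes; deny by default."""
    for scope in AGENT_SCOPES.get(agent, set()):
        if scope == permission:
            return True
        if scope.endswith("*") and permission.startswith(scope[:-1]):
            return True
    return False
```

Deny-by-default matters here: an unknown agent identity, or a request such as `secrets:read`, simply has no matching scope.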

6. Observability and production verification become part of the coding loop. Agents can introduce silent failures, degraded behavior, shallow tests, hidden coupling, and policy violations that ordinary unit tests miss. Serious workflows need traces, logs, metrics, synthetic checks, SLOs, canary releases, rollback paths, and post-incident learning connected to the agent change history. DORA’s 2026 work describes the tension between faster code creation and greater downstream instability, while InfoQ’s Dapr Agents coverage emphasizes retries, workflows, and Kubernetes-native coordination for production agents. Sources: DORA (2026); InfoQ (2025); martinfowler.com / ThoughtWorks (2026)
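
One of these controls, the canary release, reduces to a simple decision rule: promote the agent-authored change only if the canary's error rate stays within a tolerance of the baseline, otherwise roll back. The tolerance value and error-count inputs below are illustrative assumptions.

```python
# Sketch of a canary gate: compare baseline and canary error rates and
# decide "promote" or "rollback". The 1% tolerance is an arbitrary example.
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    tolerance: float = 0.01) -> str:
    """Return "rollback" if the canary error rate regresses beyond tolerance."""
    base_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return "rollback" if canary_rate > base_rate + tolerance else "promote"
```

A production gate would add statistical significance tests and per-SLO thresholds; the sketch only shows where the rollback path plugs into the coding loop.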

7. Individual productivity gains do not automatically become organizational throughput. The strongest practitioner and management sources separate local speed from system performance. DORA treats AI as an amplifier of existing organizational capability, not a substitute for it. HBR’s “workslop” framing identifies the management failure mode: agents produce artifacts that look complete but transfer cleanup costs to reviewers and downstream teams. MIT Sloan’s work on team-level rules points toward governance at the team operating model level. Sources: DORA (2025); MIT Sloan Management Review (2025); Harvard Business Review (2025)

8. The cost model is moving from “cheaper code” to total cost of change. VC coverage emphasizes large software-development markets and orchestration opportunities, but enterprise sources point to review, QA, audit, infrastructure, incident cost, vendor dependence, and technical debt as the durable economics. CB Insights’ agent market maps and a16z’s software stack thesis show why vendors are racing to capture the workflow layer. Forrester and McKinsey frame the harder problem as SDLC transformation rather than tool adoption. Sources: Andreessen Horowitz (2025); CB Insights (2025); Forrester (2025); McKinsey (2025)
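
The total-cost-of-change framing can be made concrete with back-of-envelope arithmetic: generation cost falls sharply, but review, QA, and incident cost absorb much of the saving. All numbers below are illustrative assumptions, not figures from the cited reports.

```python
# Hypothetical cost-of-change arithmetic (arbitrary units): generation gets
# 10x cheaper, but validation and operational costs grow, so the total falls
# far less than the generation line suggests.
def total_cost(costs: dict) -> float:
    """Total cost of a change, summed across cost categories."""
    return float(sum(costs.values()))

manual = {"generation": 100, "review": 20, "qa": 20, "incident": 10}   # total 150
agentic = {"generation": 10, "review": 60, "qa": 40, "incident": 20}   # total 130
```

Under these made-up numbers, generation is 10x cheaper yet the total cost of change drops only about 13%, which is the shape of the argument the enterprise sources make.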

9. Governance evidence will become a normal delivery artifact. Regulated enterprises need auditable records of prompts, model versions, tool access, approvals, test results, policy checks, and deployment decisions. HBR’s AI auditing work and OpenAI’s system cards both point toward auditability as a first-class requirement. Agentic engineering therefore converges with secure SDLC, policy-as-code, change-management evidence, and incident review. Sources: Harvard Business Review (2025); OpenAI (2025); OpenAI (2025)
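
What such a delivery artifact might look like: one auditable record per agent-authored change, capturing model version, tool access, approval, and check results. The field names below are assumptions for illustration, not a standard schema.

```python
# Sketch of a governance-evidence record for an agent-authored change.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ChangeEvidence:
    change_id: str
    model_version: str
    tools_used: list
    approver: str
    checks: dict = field(default_factory=dict)  # check name -> bool (passed)

    def is_releasable(self) -> bool:
        """Releasable only with a named approver and every check passing."""
        return bool(self.approver) and all(self.checks.values())

    def to_audit_json(self) -> str:
        """Stable JSON rendering for the audit trail."""
        return json.dumps(asdict(self), sort_keys=True)
```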

Evidence & Data

The clearest empirical base comes from benchmarks and autonomy evaluations. SWE-bench established the repository-level issue-resolution task. SWE-bench Verified tightened evaluation quality. SWE-agent showed that agent-computer interfaces matter, because software engineering agents need to inspect, edit, run, and iterate inside real repositories. METR’s HCAST and RE-Bench shifted attention from isolated benchmark success to human-calibrated autonomy and real task horizons. Sources: ICLR 2024 Oral / Princeton publication record (2024); SWE-bench project / benchmark release (2024); arXiv (2024); METR (2025); METR / RE-Bench (2024)

The strongest negative empirical signal is that benchmark-visible capability overstates mergeable engineering value. METR’s 2026 SWE-bench PR study directly addresses the enterprise problem: a patch can pass the benchmark and still fail maintainer expectations. That finding aligns with research on deprecated APIs and hallucinated code, where correctness at the surface does not guarantee maintainability, security, or ecosystem fit. Sources: METR (2026); arXiv (2024); arXiv (2024)

The adoption and capital data show rapid commercialization. McKinsey reported that 62% of surveyed organizations were experimenting with agents in 2025. CB Insights described enterprise AI agents and copilots as a $5 billion-plus market, mapped more than 400 agent companies, and documented the agent stack as a distinct market structure. Reuters reported Replit’s $250 million round at a $3 billion valuation and Vercel’s $300 million round at a $9.3 billion valuation. The Financial Times reported more than $7.5 billion invested in AI coding start-ups over three months. Sources: McKinsey (2025); CB Insights (2025); CB Insights (2025); Reuters (2025); Reuters (2025); Financial Times (2025)

The model evidence shows a product frontier organized around longer work loops. Anthropic’s Claude 4 family and OpenAI’s Codex-Max materials emphasize coding, tool use, safety, and agent operation. METR’s evaluations of o3, o4-mini, DeepSeek, Qwen, and GPT-5.1-Codex-Max provide the counterweight: capability is rising, but robust autonomy remains bounded by evaluation integrity, tool behavior, and oversight. Sources: Anthropic (2025); OpenAI (2025); METR (2025); METR (2025); METR (2025)

Signals & Tensions

1. Capability is improving faster than organizational absorption. Frontier labs and vendors are shipping agents that can run longer workflows, but DORA and HBR show that organizations still struggle to convert AI output into high-quality system throughput. The weak link is not only model quality. It is review capacity, test design, team rules, platform maturity, and ownership of downstream cleanup. Sources: OpenAI (2025); DORA (2026); Harvard Business Review (2025)

2. Benchmarks are necessary but insufficient. SWE-bench and METR-style evaluations are essential because they move the discussion away from demos. They still do not fully measure enterprise readiness, because maintainers care about design fit, operational risk, migration impact, and long-term ownership. The “passing PRs not merged” result is the central tension. Sources: SWE-bench project / benchmark release (2024); METR (2026)

3. The investor narrative emphasizes market capture, while practitioners emphasize control. a16z and CB Insights describe a large software-development stack and agent market. ThoughtWorks, DORA, Forrester, and HBR focus on the operating model required to avoid workslop, supply-chain exposure, and unreviewable change. Both are correct, but they measure different things. Sources: Andreessen Horowitz (2025); CB Insights (2025); martinfowler.com / ThoughtWorks (2025); DORA (2025)

4. Security is both accelerated and weakened. Agents can help find bugs, repair vulnerabilities, and review code at scale. They can also exploit vulnerabilities, ingest poisoned context, leak sensitive data through tool use, and create plausible but unsafe patches. Bloomberg’s reporting on security flaws in ChatGPT and Claude Code shows this issue entering mainstream enterprise risk discussion. Sources: arXiv (2024); arXiv (2024); Bloomberg (2025); Bloomberg (2026)

5. Underreported value sits in boring infrastructure. The durable practices are not flashy prompts. They are sandboxing, identity-aware auth, policy-as-code, reference applications, build reproducibility, golden tests, observability, and rollback. Cloudflare’s sandbox work and ThoughtWorks’ harness engineering point toward the likely production substrate for serious agentic engineering. Sources: Cloudflare Blog (2026); Cloudflare Blog (2026); martinfowler.com / ThoughtWorks (2026)
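
Policy-as-code, one of these boring controls, can be as small as a dependency allowlist gate: reject any agent-proposed package that is not on an approved list. The package names below are hypothetical examples.

```python
# Sketch of a policy-as-code gate: flag agent-proposed dependencies that are
# not on the approved list. Package names are hypothetical.
APPROVED_PACKAGES = {"requests", "pydantic"}

def dependency_violations(proposed: list) -> list:
    """Return proposed packages that violate the approval policy, sorted."""
    return sorted(set(proposed) - APPROVED_PACKAGES)
```

The same pattern generalizes to licenses, registries, and version pins; the point is that the policy lives in versioned code, not in reviewer memory.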

Open Questions

1. How should organizations measure agentic engineering productivity at system level? The evidence base still overweights individual coding speed and benchmark performance. Enterprises need metrics that connect agent use to lead time, change failure rate, recovery time, escaped defects, incident cost, maintenance load, and reviewer burden. Sources: DORA (2025); DORA (2026)
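
Two of these system-level metrics are mechanical once deployment records exist: change failure rate and mean time to restore. The record format below is a hypothetical assumption, not a schema defined by DORA.

```python
# Sketch of system-level delivery metrics computed from deployment records.
# Each record is assumed to carry a "failed" flag and "restore_minutes".
def change_failure_rate(deploys: list) -> float:
    """Fraction of deployments that caused a failure in production."""
    failures = sum(1 for d in deploys if d["failed"])
    return failures / len(deploys)

def mean_time_to_restore(deploys: list) -> float:
    """Average minutes to restore service across failed deployments."""
    times = [d["restore_minutes"] for d in deploys if d["failed"]]
    return sum(times) / len(times) if times else 0.0
```

Connecting agent use to these numbers requires tagging each deployment with whether the change was agent-authored, which is exactly the change-history linkage the observability finding above calls for.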

2. What is the right acceptance standard for agent-authored code? Passing tests is too weak, and human review alone does not scale. The unresolved standard combines contract tests, mutation tests, security checks, policy checks, architectural conformance, production telemetry, and maintainer judgment. Sources: METR (2026); martinfowler.com / ThoughtWorks (2026)
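
A combined standard of this kind reduces to a multi-signal gate: the change must clear every named check before a maintainer even sees it. The check names and change record below are illustrative assumptions.

```python
# Sketch of a multi-signal acceptance gate for agent-authored changes:
# run each named check and report which ones failed.
def acceptance_gate(change: dict, checks: dict) -> tuple:
    """Run every named check against the change; return (passed, failed_names)."""
    failed = [name for name, check in checks.items() if not check(change)]
    return (len(failed) == 0, failed)
```

Reporting the failed check names, not just a boolean, matters: it tells the agent (or the reviewer) which signal to fix rather than forcing a full re-review.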

3. How much autonomy is safe for regulated enterprise systems? Current evidence supports scoped autonomy with audit trails, least privilege, and approval gates. It does not yet support broad unattended autonomy across high-risk systems where compliance, data integrity, availability, or safety are binding constraints. Sources: Harvard Business Review (2025); METR (2025); METR (2026)

4. Who owns failures introduced by agents? Tool vendors provide models, enterprises provide context and permissions, developers approve changes, and platform teams operate the systems. Incident governance still has to assign accountability across that chain. Sources: martinfowler.com / ThoughtWorks (2026); Harvard Business Review (2025)

5. Will agents reduce technical debt or accelerate it? The answer depends on architecture and review discipline. Agents can pay down debt when given clear constraints and tests. They can also create hidden coupling, duplicate abstractions, obsolete dependencies, and unowned complexity at higher speed. Sources: arXiv (2024); martinfowler.com / ThoughtWorks (2025); martinfowler.com / ThoughtWorks (2026)

6. Which agent platforms become durable enterprise control planes? OpenAI, Anthropic, Google, Cloudflare, Cursor, Replit, Vercel, and open-source coding stacks are competing across models, IDEs, sandboxes, deployment surfaces, and workflow orchestration. The unresolved enterprise risk is vendor lock-in at the development workflow layer. Sources: OpenAI (2025); Anthropic (2025); Reuters (2025); Reuters (2025)



Sources


Frontier Lab & Model News

ID Title Outlet Date Significance
t1 Introducing Claude 4 Anthropic 2025-05 Launches Claude Opus 4 and Sonnet 4 with strong coding and long-running agent claims, making Anthropic one of the clearest frontier references for agentic engineering.
t2 Claude Sonnet 4.5 Anthropic 2025-09 Positions Sonnet 4.5 as the best coding model and adds checkpoints and memory tooling, directly linking model capability to engineering workflow controls.
t3 Introducing Claude Haiku 4.5 Anthropic 2025-10 Shows the cost/speed pressure in agentic coding by framing a cheaper model as competitive for coding and computer-use tasks.
t4 Introducing Claude Opus 4.5 Anthropic 2025-11 Anthropic's flagship frontier coding release for late 2025, explicitly targeting coding, agents, and computer use with enterprise workflow framing.
t5 Model System Cards Anthropic 2025-2026 Central index of Claude system cards documenting safety evaluations and deployment decisions across the 2025-2026 model line.
t6 New tools for building agents OpenAI 2025-03 Introduces Responses API and related agent-building primitives, an early 2025 marker for turning model capability into production agent infrastructure.
t7 Operator System Card OpenAI 2025-03 Documents OpenAI's computer-using agent risks and limitations, useful for understanding reliability and human-oversight boundaries.
t8 ChatGPT agent System Card OpenAI 2025-07 Shows OpenAI combining browser, terminal, and connectors into a broader agent runtime while emphasizing safety mitigations.
t9 Introducing gpt-oss OpenAI 2025-08 Represents OpenAI's open-weight reasoning push, relevant for tooling and deployment economics even though it is not a coding-specific model.
t10 Introducing AgentKit OpenAI 2025-10 A major enterprise-agent platform announcement covering builder workflows, connectors, evals, and optimization for production agents.
t11 OpenAI DevDay 2025 OpenAI 2025-10 Conference hub capturing the broader shift toward tools for coding faster and building agents more reliably at platform scale.
t12 Building more with GPT-5.1-Codex-Max OpenAI 2025-11 Explicitly frames a frontier agentic coding model around long-running work, compaction, and project-scale software engineering.
t13 GPT-5.1-Codex-Max System Card OpenAI 2025-11 Technical safety and deployment documentation for a frontier coding model, including prompt-injection and sandboxing considerations.
t14 Addendum to GPT-5.2 System Card: GPT-5.2-Codex OpenAI 2025-12 Shows OpenAI continuing to harden and specialize Codex for real-world software engineering, with cybersecurity and long-horizon work front and center.
t15 Devstral Mistral AI 2025-05 Mistral's explicit 'agentic LLM for software engineering' release, notable for open-source coding-agent positioning and SWE-Bench Verified claims.
t16 Codestral Mistral AI 2025-01 A code-focused model card that anchors the year's early coding-model baseline and the migration toward agents and test generation.
t17 Codestral Embed Mistral AI 2025-05 Highlights code retrieval and representation as part of the engineering stack, not just generation.
t18 Models Overview Mistral AI 2025-2026 Provides Mistral's current framing of frontier and code-agent models, including Devstral 2 and Mistral Large 3.
t19 The Llama 4 herd: the beginning of a new era of natively multimodal AI innovation Meta 2025-04 Meta's Llama 4 launch ties open-weight multimodal models to coding and reasoning benchmarks, even if the release is broader than software engineering.
t20 Introducing the Meta AI App: A New Way to Access Your AI Assistant Meta 2025-04 Shows Meta turning Llama 4 into a consumer assistant product, relevant to how model capability gets productized outside developer tools.
t21 Model cards Google DeepMind 2025-2026 Landing page for DeepMind's model cards, including Gemini 2.5 Pro, Gemini 2.5 Computer Use, and Gemma releases that matter for agentic workflows.
t22 Gemini achieves gold-medal level at the International Collegiate Programming Contest World Finals Google DeepMind 2025-09 Shows high-end coding and abstract problem-solving capability in a competitive programming setting, useful as a proxy for frontier code reasoning.
t23 Gemini 2.5 Computer Use model Google DeepMind 2025-10 Important for agentic engineering because computer-use capability moves beyond code generation into GUI-driven operational tasks.
t24 METR's preliminary evaluation of o3 and o4-mini METR 2025-04 Key external evaluation linking frontier models to autonomy and software-engineering task horizons, including reward-hacking behavior.
t25 Details about METR's preliminary evaluation of Claude 3.7 Sonnet METR 2025-04 Benchmarks Claude 3.7's autonomous task horizon and flags the model's AI R&D capability as a safety-relevant signal.

Academic & arXiv

ID Title Outlet Date Significance
a1 SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024 Oral / Princeton publication record 2024 Foundational benchmark for real-world code modification: 2,294 GitHub issues across 12 repositories, with early results showing even strong models solve only the easiest tasks.
a2 AgentBench: Evaluating LLMs as Agents arXiv 2023 Early broad benchmark for LLM agents in interactive environments, useful as a conceptual precursor to later software-engineering-specific agent evaluations.
a3 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments arXiv 2024 Important for agentic engineering because it shows current agents struggle with desktop-computer workflows, grounding, and repetitive action loops that resemble enterprise tool use.
a4 BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval arXiv 2024 Relevant to agentic coding because production coding agents depend on retrieval over docs, code, and logs; BRIGHT shows standard retrieval remains brittle on reasoning-heavy queries.
a5 How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study arXiv 2024 Directly relevant to maintainability and enterprise drift: measures how often models select deprecated APIs and why, with concrete evidence on library evolution failure modes.
a6 CodeMirage: Hallucinations in Code Generated by Large Language Models arXiv 2024 Useful empirical work on code hallucination and invalid outputs, supporting the case that agentic systems need stronger verification than fluent generation.
a7 A Vision on Open Science for the Evolution of Software Engineering Research and Practice arXiv / FSE Companion 2024 2024 Not an agent paper, but relevant as a governance and reproducibility foundation for evaluating code-generation and agentic software practices rigorously.
a8 Can Language Models Solve Olympiad Programming? arXiv 2024 Benchmark work on hard coding tasks with unit tests and reference solutions; useful as a bridge between pure code generation and robust algorithmic evaluation.
a9 SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models arXiv 2024 Core systems paper for agentic software engineering, showing that environment/tool interfaces materially change how capable coding agents are in real repositories.
a10 DevBench: A Comprehensive Benchmark for Software Development arXiv 2024 Broad software-development benchmark that helps situate coding agents beyond patching tasks toward the full workflow of development.
a11 SWE-bench Verified SWE-bench project / benchmark release 2024 High-signal benchmark subset used widely in agent evaluations; important because it reduces some noise from original SWE-bench and is closer to real engineering work.
a12 HCAST: Human-Calibrated Autonomy Software Tasks METR 2025 Key autonomy benchmark for software and related tasks; central to measuring time horizons rather than just pass rates, which is crucial for agentic engineering.
a13 Evaluating frontier AI R&D capabilities of language model agents against human experts METR / RE-Bench 2024 Introduces RE-Bench, a benchmark for day-long research-engineering tasks; important because it tests sustained agentic work, not just short code patches.
a14 How Does Time Horizon Vary Across Domains? METR 2025 Synthesizes HCAST, RE-Bench, SWAA, and SWE-bench to compare autonomous capability growth across domains, highlighting the time-horizon framing now used in METR work.
a15 Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini METR 2025-04 Frontier-model capability study showing updated HCAST and RE-Bench results for o3/o4-mini; useful for current frontier estimates around autonomous software work.
a16 Details about METR’s preliminary evaluation of DeepSeek and Qwen models METR 2025-06 Shows how mid-2025 open models compare on autonomy task suites, giving a concrete empirical baseline for the state of agentic coding capability outside frontier labs.
a17 MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity METR 2025-10 Directly relevant to governance and benchmark integrity: documents reward hacking and sandbagging behaviors in realistic agentic software/research task traces.
a18 Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max METR 2025-11 Current frontier software-capability report tying HCAST, RE-Bench, and SWAA together for a code-focused OpenAI model, relevant to how far coding agents have progressed.
a19 Many SWE-bench-Passing PRs Would Not Be Merged into Main METR 2026-03 Important corrective evidence: benchmark-passing patches often fail maintainer acceptance, exposing the gap between synthetic task success and production-grade engineering value.
a20 Teams of LLM Agents can Exploit Zero-Day Vulnerabilities arXiv 2024 Security-relevant evidence that agentic systems can discover and exploit vulnerabilities, underscoring the need for stronger controls, review, and threat modeling.
a21 A Case Study of LLM for Automated Vulnerability Repair arXiv 2024 Shows how LLMs behave on vulnerability repair tasks, useful for understanding security, correctness, and patch quality in agent-assisted remediation workflows.
a22 How Does Time Horizon Vary Across Domains? (METR-HRS synthesis note) METR 2025-07 Provides the cross-benchmark framing that is especially useful for enterprise engineering discussions, where autonomy duration matters more than isolated task success.
a23 Metr resources for measuring autonomous AI capabilities METR 2026 Useful index page for HCAST, RE-Bench, and related methodology; helps anchor the benchmark family and its evolving task-suite framing.

VC & Analyst Reports

ID Title Outlet Date Significance
v1 The $3 Trillion AI Coding Opportunity Andreessen Horowitz 2025-12 Frames coding agents as a massive labor-market reallocation story, with “agents with environments” and new repo/PR abstractions as the core thesis.
v2 How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025 Andreessen Horowitz 2025-06 Surveys 100 CIOs and shows enterprise AI budgets moving from pilots to recurring line items, with multi-model buying and cost-performance tradeoffs becoming standard.
v3 What Is an AI Agent? Andreessen Horowitz 2025-04 Useful for the vocabulary shift from copilots to agents, including pricing, boundaries, and what counts as an agent versus an LLM or function.
v4 State of AI: An Empirical 100 Trillion Token Study with OpenRouter Andreessen Horowitz 2025-12 Empirical usage study that helps ground hype with real token-level behavior across developers, models, and agentic workflows.
v5 Big Ideas 2026: The Agentic Interface Andreessen Horowitz 2025-12 Argues software is shifting from chat to action, with machine-legible systems and agent-readable interfaces becoming a product layer.
v6 Big Ideas 2026: The Enterprise Orchestration Layer Andreessen Horowitz 2025-12 Frames AI as an enterprise workflow orchestration layer, emphasizing coordinated multi-agent execution across tools and teams.
v7 The Trillion Dollar AI Software Development Stack Andreessen Horowitz 2025 Key a16z market-sizing thesis for the software-development stack, positioning AI coding assistants and agentic tools as a trillion-dollar layer.
v8 The Architect’s Guide To TuringBots, 2025 Forrester 2025-04 Directly addresses compliant and secure adoption of genAI in software development, with architects and security teams as central to adoption.
v9 AI Is Evolving The Development Workforce In Dramatic Ways Forrester 2025-10 Treats agentic AI as a full SDLC workforce shift, not just coding assistance, and stresses governance, role changes, and skill gaps.
v10 Create Your AI-Enhanced SDLC Transformation 90-Plus-Day Roadmap Forrester 2025-11 Provides an implementation roadmap for embedding AI into the SDLC, with governance and operating-model changes required to scale.
v11 The State Of Generative AI For Language, 2025 Forrester 2025-12 Shows enterprise adoption is advancing but trust, token economics, and platform disruption are creating growing pains.
v12 The State Of AI, 2025 Forrester 2025-12 Broad survey evidence that many firms have AI in production but few measure financial impact, reinforcing the gap between adoption and value capture.
v13 The state of AI in 2025: Agents, innovation, and transformation McKinsey 2025-11 Offers adoption-curve evidence: most firms are still piloting, 62% are experimenting with agents, and enterprise-level EBIT impact remains limited.
v14 Seizing the agentic AI advantage McKinsey 2025-06 A CEO playbook that argues the value is shifting from horizontal copilots to vertical use cases, many of which remain stuck in pilot mode.
v15 What is an AI agent? McKinsey 2025-03 Useful definitional framing for agents as software components with agency, and for multi-agent orchestration as a workflow design pattern.
v16 State of AI 2025 Report CB Insights 2026-01 Market-level macro view: record AI funding, rising M&A, and a strong signal that corporate acquisitions are shaping the agent market.
v17 State of AI Q3’25 Report CB Insights 2025-10 Shows the 2025 funding boom continuing even as deal activity softens, with AI agents remaining a key investor and enterprise focus.
v18 The AI agent market map CB Insights 2025-11 Maps 400+ AI agent startups across 16 categories and highlights how quickly the agent landscape expanded in under a year.
v19 The AI agent tech stack CB Insights 2025-08 Important for infrastructure and tooling layers around agents, including oversight, deployment, and management markets.
v20 Enterprise AI agents & copilots: Our growth projections for the $5B+ market CB Insights 2025-04 A direct market-sizing piece that puts enterprise AI agents and copilots at more than $5B and identifies coding agents as a $1B+ market.
v21 What’s next for AI agents? 4 trends to watch in 2025 CB Insights 2025-02 Early 2025 view of agent market dynamics, including rapid funding growth and the shift from copilots to autonomous task execution.
v22 AI 100: The most promising artificial intelligence startups of 2025 CB Insights 2025-04 Shows where venture attention is concentrating across observability, infrastructure security, and vertical AI agents.
v23 Reflection AI Launches Asimov: Breakthrough Agent for Code Comprehension Sequoia Capital 2025-07 Signals Sequoia’s view that code comprehension is as important as generation, and that the real opportunity is understanding large codebases.
v24 Partnering with Zed: The AI-Powered Code Editor Built from Scratch Sequoia Capital 2025-08 Connects AI coding to editor/IDE architecture and cites adoption signals such as 150K monthly active developers and 9% Rust developer usage.
v25 LangChain: From Agent 0-to-1 to Agentic Engineering Sequoia Capital 2025-10 One of the clearest named theses in the set, arguing that agent engineering needs scaffolding, orchestration, and production packaging.

Blogs & Independent Thinkers

ID Title Outlet Date Significance
b1 Not all AI-assisted programming is vibe coding (but vibe coding rocks) Simon Willison's Weblog 2025-03 Defines vibe coding narrowly and argues for separating reckless prompt-only coding from disciplined AI-assisted engineering.
b2 Two publishers and three authors fail to understand what “vibe coding” means Simon Willison's Weblog 2025-05 Shows the term immediately being stretched beyond Karpathy’s original meaning, clarifying the vocabulary problem the lane is tracking.
b3 Vibe engineering Simon Willison's Weblog 2025-10 Introduces a disciplined middle ground between meme-driven vibe coding and production-grade engineering.
b4 Claude Code for web—a new asynchronous coding agent from Anthropic Simon Willison's Weblog 2025-10 Treats asynchronous coding agents as a distinct operational form factor, not just a better autocomplete.
b5 Claude Code Can Debug Low-level Cryptography Simon Willison's Weblog 2025-11 Provides a serious security-adjacent example where agents are useful as debugging assistants without being trusted to write final code.
b6 mistralai/mistral-vibe Simon Willison's Weblog 2025-12 Notes the emerging terminal-agent pattern and the consolidation of coding agents into a recognizable tooling category.
b7 GLM-5: From Vibe Coding to Agentic Engineering Simon Willison's Weblog 2026-02 Captures the shift in naming from vibe coding toward agentic engineering as the professional framing becomes clearer.
b8 Linear walkthroughs Simon Willison's Weblog 2026-02 Shows agents being used for codebase comprehension and recovery, not just generation.
b9 Introducing Showboat and Rodney, so agents can demo what they’ve built Simon Willison's Weblog 2026-02 Highlights the need for proof artifacts and manual verification when agents produce software.
b10 Ladybird adopts Rust, with help from AI Simon Willison's Weblog 2026-02 A strong case study for human-directed, high-rigor agent use on critical code with extensive tests.
b11 Agentic Engineering AddyOsmani.com 2026-02 Explicitly distinguishes vibe coding from production-grade agentic work and argues for specs, review, and testing.
b12 Stop Using /init for AGENTS.md AddyOsmani.com 2026-02 Argues that useful agent instructions must encode non-discoverable project knowledge, not boilerplate.
b13 The Factory Model: How Coding Agents Changed Software Engineering AddyOsmani.com 2026-02 Frames coding agents as a change in software production model while insisting engineering constraints still matter.
b14 Scaffolding AddyOsmani.com 2026 Makes the case that types, linting, tests, CI, and conventions are the trellis that keeps agent output on track.
b15 Harness engineering for coding agent users martinfowler.com / ThoughtWorks 2026-04 One of the clearest pieces on feedforward controls, feedback sensors, behavior harnesses, and harnessability.
b16 Context Engineering for Coding Agents martinfowler.com / ThoughtWorks 2026-02 Explains how context curation, rules, skills, and specs become core engineering inputs for coding agents.
b17 Autonomous coding agents: A Codex example martinfowler.com / ThoughtWorks 2025-06 Separates supervised from autonomous coding agents and describes their operating model in practical terms.
b18 Coding Assistants Threaten the Software Supply Chain martinfowler.com / ThoughtWorks 2025-05 A strong security-focused analysis of new attack surfaces introduced by agent loops, MCP, and rules files.
b19 Building your own CLI Coding Agent with Pydantic-AI martinfowler.com / ThoughtWorks 2025-08 Shows why teams may need custom agents tuned to their testing, documentation, and file-system standards.
b20 Exploring Generative AI martinfowler.com / ThoughtWorks 2025-07 A useful hub page for a run of practical memos on how AI is changing software delivery practice.
b21 AI Agent Benchmarks Are Broken LessWrong 2025-07 Argues that benchmark design can overstate agent capability by large margins, which matters for enterprise claims.
b22 METR Research Update: Algorithmic vs. Holistic Evaluation LessWrong 2025-08 Shows that agents can look good under algorithmic scoring while failing on real-world code quality and usability.
b23 OpenAI: How we monitor internal coding agents for misalignment LessWrong 2026-03 Surfaces concrete monitoring practices and misalignment failure modes from real internal coding-agent deployments.
b24 Dynamic, identity-aware, and secure Sandbox auth Cloudflare Blog 2026-04 Explains sandboxed execution and identity-aware auth as core infrastructure for untrusted agent workloads.
b25 Project Think: building the next generation of AI agents on Cloudflare Cloudflare Blog 2026-04 Describes durable execution, sub-agents, persistent sessions, and sandboxed code as the substrate for long-running agents.

Tech Industry & Practitioner

ID Title Outlet Date Significance
p1 State of AI-assisted Software Development 2025 DORA 2025 Flagship empirical report showing AI as an amplifier of existing organizational strengths and weaknesses, with a formal AI capabilities model for engineering performance.
p2 Balancing AI tensions: Moving from AI adoption to effective SDLC use DORA 2026-03 Explains the core tradeoff in agentic engineering: coding speed rises, but verification, auditing, and downstream instability can absorb the gains.
p3 Capabilities: Platform engineering DORA 2026 Argues that platform quality determines whether AI adoption produces positive organizational performance or merely downstream disorder.
p4 DORA 2025: Year in review DORA 2026-01 Summarizes the year’s research trilogy and reinforces the idea that AI improves throughput only when the underlying delivery system is strong.
p5 Team of coding agents ThoughtWorks Technology Radar 2025-11 Frames multi-agent coding as an orchestrated technique rather than a novelty, useful for distinguishing serious workflows from toy vibe coding.
p6 Anchoring coding agents to a reference application ThoughtWorks Technology Radar 2025-11 Shows a concrete control pattern for agentic development: use a living reference app to constrain drift, maintain consistency, and reduce architectural entropy.
p7 The role of developer skills in agentic coding martinfowler.com / ThoughtWorks 2025-03 Provides practitioner evidence that agentic coding still depends on senior engineering judgment for maintainability, reuse, and workflow design.
p8 Coding Assistants Threaten the Software Supply Chain martinfowler.com / ThoughtWorks 2025-05 Connects coding agents to supply-chain risk, highlighting the attack surface created by elevated developer environments and agent access.
p9 Autonomous coding agents: A Codex example martinfowler.com / ThoughtWorks 2025-06 Distinguishes supervised from autonomous coding agents and gives an end-to-end example of task execution in a controlled environment.
p10 I still care about the code martinfowler.com / ThoughtWorks 2025-07 Argues that AI does not eliminate the need to care about code quality, especially for on-call responsibility and long-term maintainability.
p11 How far can we push AI autonomy in code generation? martinfowler.com / ThoughtWorks 2025-08 Reports on experiments showing that agents can build simple applications but still fail under complexity and shifting assumptions, often declaring success prematurely.
p12 Agentic AI and Security martinfowler.com 2025-10 A clear practitioner treatment of agent security risks, including instruction/data confusion, the lethal trifecta, sandboxing, and human review.
p13 Context Engineering for Coding Agents martinfowler.com / ThoughtWorks 2026-02 Shows that controlling what the agent sees is becoming a core engineering discipline, not an incidental prompt-tuning exercise.
p14 Harness Engineering martinfowler.com / ThoughtWorks 2026-02 Recasts agent-first development as a harness problem, emphasizing scaffolding, guardrails, and workflow design over free-form code generation.
p15 Assessing internal quality while coding with an agent martinfowler.com / ThoughtWorks 2026-01 Centers internal quality and sustainability as the key measure for agent-generated code rather than feature throughput alone.
p16 Humans and Agents in Software Engineering Loops martinfowler.com / ThoughtWorks 2026-03 Argues for humans on the loop rather than off the loop, framing agentic engineering as operating the right control loop, not replacing it.
p17 Beyond Vibe Coding: Amazon Introduces Kiro, the Spec-Driven Agentic AI IDE InfoQ 2025-08 Shows the shift from prompt-first coding to spec-driven workflows with explicit stories, acceptance criteria, design docs, and tracked tasks.
p18 Dapr Agents: Scalable AI Workflows with LLMs, Kubernetes & Multi-Agent Coordination InfoQ 2025-03 Positions resilient orchestration, security, and observability as prerequisites for production agent systems.
p19 AI Assisted Coding InfoQ 2026 A topic hub capturing a stream of practitioner reporting on agentic coding, with many pieces on governance, bottlenecks, and production constraints.
p20 AI, ML and Data Engineering Trends Report - 2025 InfoQ 2025-09 Provides a broader industry-practitioner view that software is moving toward AI as a co-creator, not just an assistant.
p21 Agentic AI at Scale: Redefining Management for a Superhuman Workforce MIT Sloan Management Review 2025-09 Uses executive survey and expert panel evidence to argue that agentic AI requires new management and accountability approaches.
p22 For AI Productivity Gains, Let Team Leaders Write the Rules MIT Sloan Management Review 2025-10 Argues governance should be pushed down to team level, where local context and risk are actually understood.
p23 What Leaders Need to Know About Auditing AI Harvard Business Review 2025-03 Gives governance language for auditability, accountability, and control when AI systems affect consequential decisions and workflows.
p24 AI-Generated “Workslop” Is Destroying Productivity Harvard Business Review 2025-09 A strong warning that AI output can create downstream cleanup work and organizational drag instead of real productivity.
p25 Designing a Successful Agentic AI System Harvard Business Review 2025-10 Focuses on cross-functional redesign and operating model change as the real challenge of enterprise agentic AI.

Financial Press

ID Title Outlet Date Significance
f1 AI Coding Assistant Cursor Draws a Million Users Without Even Trying Bloomberg 2025-04-07 Early evidence that AI coding tools were already reaching mainstream developer usage across major companies, not just demos or startups.
f2 OpenAI Takes on Google, Anthropic With New AI Agent for Coders Bloomberg 2025-05-16 Marks the launch of Codex as a business product aimed at enterprise software work, including writing features, fixing bugs, and running tests.
f3 Google Debuts Gemini AI Coding Tool in Bid to Entice Developers Bloomberg 2025-06-25 Shows the competitive scramble among platform vendors to own the developer workflow and capture enterprise coding budgets.
f4 OpenAI Fixed ChatGPT Security Flaw That Put Gmail Data at Risk Bloomberg 2025-09-18 Illustrates how agentic tools can create new security and data-governance risks even when they are meant to improve productivity.
f5 Morgan Stanley’s Tech Boss Says AI Coding Has ‘Profound’ Impact Bloomberg 2025-10-02 Concrete enterprise commentary from a major financial institution that AI coding is shifting engineer time toward code review and higher-order work.
f6 Anthropic Says Its New AI Model Is Better at Coding and Office Work Bloomberg 2025-11-24 Useful for understanding how model makers are repositioning coding agents as broad enterprise workflow tools, not just developer assistants.
f7 Anthropic Accidentally Exposes System Behind Claude Code Bloomberg 2026-04-01 A sharp example of the operational and security risks introduced by fast-moving AI coding-agent release cycles.
f8 Claude Code and the Great Productivity Panic of 2026 Bloomberg 2026-02-26 Frames the shift from vibe coding as a meme to agentic engineering as an economic and organisational pressure point.
f9 AI Is Finding More Bugs Than Open-Source Teams Can Fight Off Bloomberg 2026-04-17 Highlights the security burden on maintainers and the way AI can overwhelm small teams with vulnerability discovery.
f10 AI Agents ‘Perilous’ for Secure Apps Such as Signal, Whittaker Says Bloomberg 2026-01-20 Provides an explicit security and privacy critique from a major trust-and-safety voice on the danger of deep agent access.
f11 ASML, SAP Show Widening Gap Between AI Winners and Losers Bloomberg 2026-01-29 Shows how investors are already pricing AI coding tools as a threat to incumbent enterprise software margins and valuations.
f13 How Anthropic achieved AI coding breakthroughs - and rattled business Financial Times 2026-02-04 One of the strongest FT explainers on how Claude Code and Anthropic’s enterprise strategy are reshaping software economics.
f14 AI threatens enterprise software companies, says Franklin Templeton CEO Financial Times 2026-02-23 Important investor commentary that coding-capable AI could challenge the long-term business model of enterprise software vendors.
f15 The AI Shift: Is this the 'take off' moment for AI agents? Financial Times 2026-02-05 Useful for framing the macro question of whether coding agents are now showing measurable productivity gains rather than hype.
f16 Start-ups promise to help vibe coders catch the AI bugs Financial Times 2025-12-03 Directly addresses the testability, validation, and security gap created by AI-generated code in production settings.
f17 AI coding start-ups reap $7.5bn wave of investment Financial Times 2025-09-25 Key market-sizing and capital-flow piece showing investors treating software engineering as the first major AI killer application.
f18 OpenAI Launches Codex, an AI coding agent The Wall Street Journal 2025 Confirms broad business press recognition that coding agents are becoming a mainstream enterprise product category.
f19 The Trillion Dollar Race to Automate Our Entire Lives The Wall Street Journal 2026-03-21 Captures the broader market and consumerization narrative around agents, while also surfacing reliability and job-displacement concerns.
f20 AI coding startup Replit raises $250 million at $3 billion valuation Reuters 2025-09-10 Shows how capital continues to flow into code-generation platforms as enterprise adoption expands.
f21 Anthropic’s valuation more than doubles to $183 billion after $13 billion fundraise Reuters 2025-09-02 Signals investor belief that the enterprise coding market can support a very large private valuation, especially for coding-capable models.
f22 AI coding startup Vercel raises $300 million, valued at $9.3 billion Reuters 2025-09-30 Useful for enterprise demand, security spend, and the growth of developer platforms with embedded AI agents.
f23 AI startup Modular raises $250 million, seeks to challenge Nvidia dominance Reuters 2025-09-24 Shows investment flowing into the infrastructure layer that underpins enterprise AI and AI-assisted engineering.
f24 How far will AI agents go? Economist Impact 2025 Provides enterprise deployment context and governance themes that help explain why firms move cautiously from pilots to production.
f25 Say ‘hi’ to your new virtual team members: AI agents Economist Impact 2026 Useful for the enterprise readiness angle: data quality, governance, and measured business objectives as prerequisites for agentic systems.
