Research · Tech Industry & Practitioner


Research sweep · deep · 2025 – 2026

AI on Deterministic Rails

  • Claude Opus 4.8
  • financial
  • frontier
  • academic
  • vc
  • blogs
  • tech

Synthesised 2026-06-07

Narrative

The 2025 DORA report, drawing on nearly 5,000 practitioners, delivered the most empirically grounded verdict to date on AI in software engineering: AI amplifies existing conditions rather than creating new capability. Teams with mature DevOps practices, well-defined workflows, and strong platform infrastructure converted AI-driven productivity gains into measurable delivery improvements; teams with fragmented tooling experienced accelerated technical debt and instability. The report's DORA AI Capabilities Model maps seven systemic capabilities (platform quality, data ecosystems, user-centricity, governance, training, communities of practice, and a clear AI stance) that predict whether adoption yields net benefit. Faros AI's telemetry across 22,000 developers adds a complicating signal: individual throughput is up substantially, but median PR review time has risen 441% and 31% of PRs now merge with no review at all, a pattern Faros AI calls 'Acceleration Whiplash'.

The Thoughtworks Technology Radar Vol. 33 (November 2025) documented a practitioner-level shift away from experimental AI theatre toward structured engineering. CTO Rachel Laycock noted that vibe coding had 'practically disappeared' and that the industry had moved to serious work on context, infrastructure, and security. The Radar's four themes (infrastructure orchestration for AI, the rise of agents elevated by MCP, AI coding workflows, and emerging AI antipatterns) all point in the same direction: AI is being absorbed into deterministic software infrastructure rather than displacing it. The Model Context Protocol (MCP), open-sourced by Anthropic in late 2024, reached near-ubiquitous vendor adoption within a year, providing a standardised integration layer between agents and deterministic systems.

The agent harness has emerged as the decisive architectural variable for agentic performance. An architectural analysis of the Claude Code codebase reported that 98.4% of the system is deterministic infrastructure (permission gates, context management, tool routing, and recovery logic), while the AI reasoning loop itself is a simple while-loop. Anthropic's own engineering documentation shows that frontier models fail to build production applications without structured harness scaffolding, requiring initialiser agents, progress artefacts, and explicit testing gates. Academic work published in March 2026 found that harness scaffold differences can dominate outcomes under fixed base models, and reframed prompt engineering as the broader practice of 'context engineering': deciding what state should be available at each step of a long run.
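The harness shape described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Claude Code's implementation: the tool names, the allow-list, the step and context budgets, and the `scripted_model` stand-in (which replays a fixed plan instead of calling a model) are all hypothetical. The point is the proportion: the loop is trivial, and nearly everything else is deterministic gating around it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

MAX_STEPS = 10          # hard step budget enforced by the harness, not the model
MAX_CONTEXT_ITEMS = 6   # context budget: the task plus the five latest results


@dataclass(frozen=True)
class Action:
    tool: str           # tool name, or "done" to end the run
    arg: str = ""


def read_file(path: str) -> str:
    return f"<contents of {path}>"


def run_tests(_: str) -> str:
    raise RuntimeError("test runner crashed")   # simulated tool failure


# Tool routing table and permission allow-list (both hypothetical).
TOOLS: dict[str, Callable[[str], str]] = {"read_file": read_file, "run_tests": run_tests}
ALLOWED = {"read_file", "run_tests"}


def scripted_model(plan: Iterable[Action]) -> Callable[[list[str]], Action]:
    """Stand-in for a model call: replays a fixed plan so the sketch runs offline."""
    steps = iter(plan)
    return lambda context: next(steps, Action("done"))


def run_harness(model: Callable[[list[str]], Action], task: str) -> list[str]:
    context = [f"task: {task}"]
    log: list[str] = []
    steps = 0
    while steps < MAX_STEPS:                      # the reasoning loop is just this
        steps += 1
        action = model(context)
        if action.tool == "done":
            log.append("done")
            break
        if action.tool not in ALLOWED:            # permission gate
            result = "denied"
        else:
            try:
                result = TOOLS[action.tool](action.arg)   # tool routing
            except Exception as exc:              # recovery: report, do not crash
                result = f"error: {exc}"
        log.append(f"{action.tool}: {result}")
        # Context management: keep the task plus the most recent results.
        context = [context[0], *(context[1:] + [log[-1]])[-(MAX_CONTEXT_ITEMS - 1):]]
    return log


if __name__ == "__main__":
    plan = [Action("read_file", "app.py"), Action("delete_repo", "."), Action("run_tests")]
    for line in run_harness(scripted_model(plan), "fix the failing test"):
        print(line)
```

In this sketch the disallowed `delete_repo` call is refused by the gate, the crashing tool is turned into an error message the model can see, and a model that never says "done" is stopped by the step budget. None of those behaviours depend on the model behaving well.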

The token-cost crisis moved from boardroom anxiety to operational emergency between late 2025 and June 2026. Stanford Digital Economy Lab research found that agentic coding tasks consume 1,000x more tokens than code reasoning tasks, with costs varying up to 30x on identical tasks. Enterprises reported hitting annual AI budgets in three months; per-developer token consumption reportedly rose 18.6x in nine months. Uber, Microsoft, and others became named examples of agentic overspend. In response, the Linux Foundation announced the Tokenomics Foundation as a FinOps-equivalent standards body, EY published a Total Cost of Agents framework, and Deloitte issued a CFO guide to AI token economics, all in spring 2026. The structural lever most consistently cited for cost control is intelligent routing: sending routine classification, extraction, and summarisation to open-weight models (DeepSeek, Qwen, Llama) while reserving frontier models for tasks where the quality premium justifies the cost. One practitioner deployment report cited a 60–70% cost reduction with no measurable quality degradation using this approach.
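The routing lever can be illustrated with a short Python sketch. Everything numeric here is an assumption made for the example: the per-million-token prices are placeholders rather than vendor quotes, the tier names are hypothetical, and the workload mix is invented. The sketch shows only how the arithmetic works, so the saving it computes says nothing about any real deployment and depends entirely on the assumed prices and mix.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens (assumptions, not vendor quotes).
PRICE_PER_MTOK = {"open_weight": 0.50, "frontier": 10.00}

# Task types the text describes as routine enough for open-weight models.
ROUTINE_TASKS = {"classification", "extraction", "summarisation"}


@dataclass(frozen=True)
class Request:
    task_type: str
    tokens: int


def route(req: Request) -> str:
    """Deterministic rule: routine work goes to the cheaper tier."""
    return "open_weight" if req.task_type in ROUTINE_TASKS else "frontier"


def cost(req: Request, tier: str) -> float:
    return req.tokens / 1_000_000 * PRICE_PER_MTOK[tier]


def compare(requests: list[Request]) -> tuple[float, float, float]:
    """Return (all-frontier cost, routed cost, fractional saving)."""
    baseline = sum(cost(r, "frontier") for r in requests)
    routed = sum(cost(r, route(r)) for r in requests)
    return baseline, routed, 1 - routed / baseline


# Invented workload: 70% of tokens are routine, 30% need the frontier tier.
WORKLOAD = [
    Request("classification", 400_000),
    Request("extraction", 200_000),
    Request("summarisation", 100_000),
    Request("code_generation", 300_000),
]

if __name__ == "__main__":
    baseline, routed, saving = compare(WORKLOAD)
    print(f"all-frontier: ${baseline:.2f}  routed: ${routed:.2f}  saving: {saving:.1%}")
```

The rule is deliberately a lookup rather than a model call, so the routing decision itself costs no tokens. A production router would also need a quality check and an escalation path to the frontier tier, which this sketch omits.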


Sources

ID | Title | Outlet | Date | Significance
p1 | 2025 DORA State of AI-Assisted Software Development Report | Google Cloud / DORA | 2025-09 | Primary empirical survey of nearly 5,000 practitioners establishing that AI amplifies existing engineering conditions rather than creating new capability, and introducing the DORA AI Capabilities Model as a framework for contextualising adoption.
p2 | AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report | InfoQ | 2026-03 | Authoritative InfoQ synthesis of the 2025 DORA report findings, documenting that AI adoption continues to have a negative relationship with software delivery stability absent strong automated testing and feedback loops.
p3 | Thoughtworks Technology Radar Vol. 33 — November 2025 | Thoughtworks | 2025-11 | Canonical biannual practitioner signal report documenting the shift from prompt engineering and RAG toward context engineering, MCP-driven agent orchestration, and emerging AI antipatterns such as AI-accelerated shadow IT.
p4 | Thoughtworks Technology Radar Highlights The Rapid Evolution of AI Assistance in 2025 | Thoughtworks | 2025-11 | CTO Rachel Laycock's statement that vibe coding has 'practically disappeared' and the industry has moved to serious work on context, infrastructure, and security, a practitioner-level signal about the end of the PoC-as-theatre phase.
p5 | Macro Trends in the Tech Industry — November 2025 | Thoughtworks | 2025-11 | Expanded commentary on Radar Vol. 33 themes, documenting MCP's rapid proliferation to thousands of servers in under a year and the structural challenge of GPU cost management in AI inference at scale.
p6 | 2025 Stack Overflow Developer Survey — AI Section | Stack Overflow | 2025-07 | Large-scale survey of 49,000-plus developers across 177 countries showing 84% AI tool adoption alongside declining trust (only 29% trust AI outputs), with 66% citing near-miss outputs as the top frustration, grounding the trust-gap problem empirically.
p7 | Stack Overflow's 2025 Developer Survey Reveals Trust in AI at an All Time Low | Stack Overflow | 2025-07 | Official press release providing the headline finding that positive developer sentiment towards AI tools has fallen from above 70% in 2023–24 to 60% in 2025, with 45% reporting that debugging AI-generated code is more time-consuming than writing it.
p8 | Developers Remain Willing but Reluctant to Use AI: The 2025 Developer Survey Results Are Here | Stack Overflow Blog | 2025-12 | Contextual analysis of 2025 survey data noting that 72% of developers say vibe coding is not part of their professional work, reinforcing that AI is being adopted as a tool layer rather than a replacement for deterministic engineering practice.
p9 | Why 88 to 95 Percent of Enterprise AI Pilots Never Reach Production | SoftwareSeni | 2026-03 | Consolidates data points from IDC/Lenovo, MIT NANDA, McKinsey, S&P Global, PwC, and Gartner into a comparative analysis of different PoC failure measurements, providing the most cited statistical overview of the enterprise adoption gap.
p10 | MIT Report: 95% of Generative AI Pilots at Companies Are Failing | Fortune | 2025-08 | Fortune coverage of the MIT NANDA GenAI Divide report documenting the misalignment between where AI budgets are spent (sales and marketing) versus where ROI has been documented (back-office automation), providing context for PoC stall rates.
p11 | AI PoCs to Production: A Balanced Perspective | Omdia | 2025-11 | Independent analyst survey showing 40% of enterprises run 6–20 simultaneous PoCs, offering a more nuanced counter-reading to alarmist PoC failure statistics and noting the path to production is iterative rather than linear.
p12 | Agentic AI Enterprise Token Cost | EY | 2026-06 | First edition of EY's Total Cost of Agents series, framing token costs as only the visible surface of agentic spend and recommending 'Agent FinOps' as a discipline with hard kill switches, per-task benchmarks, and centralised cost ownership.
p13 | AI Token Economics for CFOs | Deloitte | 2026-04 | Based on a survey of 550 US enterprise leaders, documents that many companies already generate above 10 billion tokens per month and that agentic capabilities are shifting pricing from per-seat to usage-based models with material forecasting implications for CFOs.
p14 | The Token Bill Comes Due: Inside the Industry Scramble to Manage AI's Runaway Costs | TechCrunch | 2026-06 | Documents enterprises hitting annual AI budgets in three months, per-developer token consumption rising 18.6x in nine months, and the Linux Foundation's creation of the Tokenomics Foundation as a FinOps-style standards body for AI spend governance.
p15 | Uber, Microsoft, and Others Burning Through AI Budgets. Now What? | SmarterX | 2026-06 | Named-company analysis of Uber's CTO burning the entire 2026 Claude Code budget in four months and Microsoft cancelling most Claude Code licences over cost, with Google I/O data showing a 7x token volume jump at Google in a single year.
p16 | The Real Cost of Agentic AI | InfoWorld | 2026-06 | Practitioner cost modelling showing all-in operating costs for agentic systems are two to five times raw token costs, and that a deterministic workflow with a single model call can handle classification, extraction, and summarisation at a fraction of the cost and risk.
p17 | How Are AI Agents Spending Your Tokens? | Stanford Digital Economy Lab | 2026-05 | Based on a paper co-authored by Erik Brynjolfsson finding that agentic coding tasks consume 1,000x more tokens than code reasoning tasks and that agents cannot predict their own costs (costs vary up to 30x on the same task), establishing the fundamental unpredictability problem.
p18 | Microsoft Reports Are Exposing AI's Real Cost Problem | Fortune | 2026-05 | Fortune reporting on cases where AI compute costs exceed the cost of the human labour being replaced, citing Goldman Sachs forecasts of a 24-fold token consumption increase by 2030 alongside a Gartner finding that cheaper tokens will not translate to cheaper enterprise AI.
p19 | Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems | arXiv (VILA-Lab) | 2026-04 | Systematic architectural analysis of the Claude Code harness documenting that 98.4% of the system is deterministic infrastructure (permission gates, context management, tool routing, recovery), with the AI loop itself a simple while-loop.
p20 | Effective Harnesses for Long-Running Agents | Anthropic Engineering | 2025 | First-party Anthropic documentation demonstrating that even frontier models fail to build production applications without structured harness design, requiring initialiser agents, progress files, and explicit browser-automation testing to bridge context-window gaps.
p21 | A Harness for Every Task: Dynamic Workflows in Claude Code | Anthropic / Claude Blog | 2026-06 | Anthropic's own practitioner guide to dynamic workflows, documenting agentic laziness, self-preferential bias, and goal drift as structural failure modes that deterministic workflow orchestration mitigates through isolated subagent context windows.
p22 | Natural-Language Agent Harnesses | arXiv | 2026-03 | Academic paper establishing that harness scaffold differences can dominate outcomes even under fixed base models, reframing 'prompt engineering' as the broader practice of 'context engineering': deciding what state should be available at each step of a long run.
p23 | The Architectural Shift: AI Agents Become Execution Engines While Backends Retreat to Governance | InfoQ | 2025-10 | InfoQ analysis of enterprise agentic architecture patterns, citing Gartner's prediction that 40% of enterprise applications will include task-specific agents by 2026, and outlining a three-tier governance framework where trust must precede autonomy.
p24 | Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap | Digital Applied | 2026-05 | Technical retrospective documenting that inference cost on leading open-weight stacks dropped by roughly an order of magnitude in H1 2026 versus H2 2025, with sovereign-cloud deployment consolidating around on-prem vLLM, managed Llama 4, and air-gapped quantised DeepSeek/Qwen.
p25 | A Comprehensive Review of Qwen and DeepSeek LLMs | Preprints.org | 2025 | Drawing on 32 benchmarks and 18 peer-reviewed studies, documents that open-source models now achieve 89–92% of proprietary capabilities at 5–15% of operational cost, with MoE architectures delivering up to 4.3x faster inference at equivalent parameter count.
