Research sweep · deep · 2025 – 2026

AI on Deterministic Rails

Claude Opus 4.8
Lanes: financial, frontier, academic, vc, blogs, tech

Synthesised 2026-06-07

Overview

The story of enterprise AI between January 2025 and June 2026 is not about models getting smarter. It is about the rest of the stack catching up to make probabilistic models useful. The defining shift of the past 18 months is the recognition that a stochastic language model embedded inside deterministic software (sandboxed execution, type checkers, CI/CD hooks, permission gates) acquires properties it cannot provide alone: verifiability, repeatability, auditability. The frontier moved from raw capability to harness design, orchestration, and cost control.

Two pressures forced this maturation. The first is the so-called PoC-opalypse: near-universal adoption coupled with thin value capture. McKinsey's November 2025 State of AI survey found 88 percent of organisations using AI in at least one function, but only 39 percent reporting EBIT impact and roughly 5.5 percent qualifying as high performers. Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); McKinsey & Company (2025) (↗)

The second pressure is cost. Per-token prices fell roughly 98 percent from late-2022 levels, yet enterprise AI bills tripled, because agentic loops consume orders of magnitude more tokens than chatbot queries. Uber exhausting its full-year 2026 AI coding budget within four months became the emblematic case of usage divorced from outcome. Sources: The Next Web (2026) (↗); EE News Europe (2026) (↗); TechCrunch (2026) (↗)

The synthesis across six lanes is consistent and slightly uncomfortable for the hype cycle: the organisations that benefit are those that already had clean data, well-defined workflows, and mature engineering discipline. AI amplifies existing conditions rather than creating capability. The deterministic infrastructure that most observers expected AI to replace turns out to be the precondition for AI working at all.

Timeline

Key milestones, Jan 2025 - Jun 2026

Q1 2025

Claude Code research preview with Claude 3.7
DeepSeek V3/R1 reset open-weight cost expectations
METR publishes task-horizon doubling

Q2 2025

Claude 4 and Claude Code GA (ASL-3)
OpenAI Codex launches as sandboxed agent
Meta ships Llama 4 herd
Gartner predicts 40% agentic cancellations by 2027

Q3 2025

MIT 95% pilot-failure figure goes viral
METR RCT finds devs 19% slower
DORA names AI an amplifier not creator

Q4 2025

Claude Opus 4.5 and GPT-5.2 make coding agents production-reliable
Thoughtworks declares vibe coding dead
Sequoia calls agentic coding the first discontinuous shift

Q1 2026

Claude Code codebase leak reveals 98% deterministic infrastructure
Harness engineering named successor to prompt engineering
SAP shifts to consumption pricing

Q2 2026

Token cost shock becomes operational emergency (Uber, Microsoft)
Linux Foundation Tokenomics body, EY/Deloitte CFO frameworks
DeepSeek V4 cuts cost-to-serve 10x
Goldman projects 24x token growth by 2030

Key Findings

The harness, not the model, is the binding constraint. The most striking empirical claim of the period is that agentic performance is dominated by scaffolding. Analysis of the leaked Claude Code codebase found that 98.4 percent of the system is deterministic infrastructure (permission gates, context management, tool routing, recovery logic), with the AI reasoning loop itself a simple while-loop. A May 2026 arXiv survey on scaling the harness found that evolved harness components transfer across model families with a 12 percent token reduction on SWE-bench-verified, which suggests the harness encodes general engineering knowledge rather than benchmark-specific tuning. Sources: arXiv (VILA-Lab) (2026) (↗); Agentic AI (Substack — Ken Huang) (2026) (↗); arXiv (2026) (↗)
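
The shape described above, a plain loop around the model with deterministic checks on every side, can be sketched in a few lines. This is an illustrative sketch only, not the leaked code: `fake_model`, `TOOLS`, `ALLOWED` and `run_agent` are hypothetical names, and the model is replaced by a scripted stand-in so the example is reproducible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical tool registry: deterministic functions the model may request.
TOOLS: Dict[str, Callable[[str], str]] = {
    "read_file": lambda arg: f"<contents of {arg}>",
    "run_tests": lambda arg: "2 passed, 0 failed",
}

# Hypothetical permission gate: an allow-list consulted before any tool runs.
ALLOWED = {"read_file", "run_tests"}


@dataclass
class Step:
    tool: Optional[str]   # tool requested by the model, or None when finished
    arg: str = ""
    answer: str = ""


def fake_model(turn: int) -> Step:
    """Scripted stand-in for an LLM call (assumption: a real model goes here)."""
    script = [
        Step(tool="read_file", arg="app.py"),
        Step(tool="delete_repo", arg="."),   # not on the allow-list
        Step(tool="run_tests"),
        Step(tool=None, answer="done"),
    ]
    return script[min(turn, len(script) - 1)]


def run_agent(model: Callable[[int], Step], max_steps: int = 10) -> List[str]:
    log: List[str] = []
    turn = 0
    while turn < max_steps:                  # the "simple while-loop"
        step = model(turn)
        turn += 1
        if step.tool is None:                # model signals completion
            log.append(f"final: {step.answer}")
            return log
        if step.tool not in ALLOWED or step.tool not in TOOLS:
            log.append(f"denied: {step.tool}")   # permission gate
            continue
        try:                                 # recovery logic: errors become feedback
            result = TOOLS[step.tool](step.arg)
        except Exception as exc:
            result = f"error: {exc}"
        log.append(f"{step.tool}: {result}")     # tool routing result
    log.append("stopped: step budget exhausted")
    return log


for line in run_agent(fake_model):
    print(line)
```

Almost every line here is ordinary control flow; the model contributes one call per iteration, which is the proportion the codebase analysis points at.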

This is corroborated independently. Ben Dickson at TechTalks, citing a UC Berkeley paper, argued system scaling has replaced model scaling as the bottleneck, while the SWE Atlas paper quantified that native scaffolds make substantially more tool calls than minimal scaffolds on identical underlying models, making scaffold choice a primary determinant of benchmark score. Sources: TechTalks (Substack — Ben Dickson) (2026) (↗); arXiv (2026) (↗)

Software is uniquely favourable because it is machine-verifiable. Tests pass or fail, type systems accept or reject, linters score against rules. This creates a tight feedback loop that lets agentic systems self-correct without human review at every step, a property that prose and image generation lack. A February 2026 arXiv paper on agentic architecture documents production systems increasingly enforcing symbolic constraints on tool execution while using LLMs only for high-level decomposition: the AI-plus-deterministic-software thesis instantiated directly. Sources: arXiv (2026) (↗); arXiv (2026) (↗)
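
A minimal sketch of that feedback loop, assuming a hypothetical `propose` function in place of the model: each candidate is compiled and tested by a deterministic checker, and the error message, not a human reviewer, drives the next attempt. The function names and the scripted candidates are illustrative assumptions.

```python
from typing import Callable, Optional, Tuple


def check(source: str) -> Optional[str]:
    """Deterministic verifier: None on success, otherwise an error message."""
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should equal 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def scripted_propose(attempt: int, feedback: Optional[str]) -> str:
    """Stand-in for a model call; a real model would condition on `feedback`."""
    candidates = [
        "def add(a, b) return a + b",          # syntax error
        "def add(a, b):\n    return a - b",    # compiles, fails the test
        "def add(a, b):\n    return a + b",    # passes
    ]
    return candidates[min(attempt, len(candidates) - 1)]


def solve(propose: Callable[[int, Optional[str]], str],
          max_attempts: int = 5) -> Tuple[str, int]:
    feedback: Optional[str] = None
    for attempt in range(max_attempts):
        candidate = propose(attempt, feedback)
        feedback = check(candidate)            # machine verdict, no human in the loop
        if feedback is None:
            return candidate, attempt + 1
    raise RuntimeError("no candidate passed verification")


code, attempts = solve(scripted_propose)
print(f"accepted after {attempts} attempts")
```

The loop only works because `check` returns an unambiguous verdict; for prose or images there is no equivalent function to call.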

DORA's amplifier finding reframes the whole adoption debate. The 2025 DORA report, drawing on nearly 5,000 practitioners, found AI amplifies existing conditions rather than creating capability. Mature teams converted gains into delivery improvements; fragmented teams accelerated technical debt. Faros AI telemetry across 22,000 developers added a sharp complicating signal: individual throughput rose substantially, but median PR review time climbed 441 percent and 31 percent of PRs now merge with no review at all, a pattern they call Acceleration Whiplash. Sources: Google Cloud / DORA (2025) (↗); InfoQ (2026) (↗)

Token consumption is a poor proxy for value, and everyone now knows it. The Uber case (5,000 engineers on Claude Code, with the operations chief admitting no clear link between token spend and consumer features) crystallised the problem. Bain found the top 5 percent of users consume more tokens than the other 95 percent combined, locating cost in the highest-value, hardest-to-throttle engineers. Stanford's Digital Economy Lab found agentic coding consumes 1,000x more tokens than single-turn reasoning, with costs varying up to 30x on identical tasks. Sources: EE News Europe (2026) (↗); Bain & Company (2026) (↗); Stanford Digital Economy Lab (2026) (↗)

Open-weight models created a structural cost floor, not merely a discount. DeepSeek V3, released December 2024 at a claimed $6 million training cost, offered inference roughly 12.5x cheaper than Claude 3.5 Sonnet by IISS analysis. By H1 2026, DeepSeek V4 cut cost-to-serve over 10x versus V3.2 through hybrid sparse attention, and MindStudio found open-weight models matching closed frontier on coding, classification, and extraction at 5 to 20x lower cost. An arXiv cost-benefit study found private inference on consumer Blackwell GPUs reaches parity with commercial APIs within one to four months at 30 million tokens per day, then runs at 40 to 200x lower marginal cost. Sources: International Institute for Strategic Studies (IISS) (2025) (↗); The Register (2026) (↗); MindStudio (2026) (↗); arXiv (2026) (↗)
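
The break-even claim reduces to simple arithmetic: up-front hardware cost divided by the monthly saving against the API. The sketch below shows the calculation only. Every price and the hardware figure are hypothetical placeholders, not numbers from the cited study; only the 30 million tokens per day volume comes from the text above.

```python
def break_even_months(hardware_cost_usd: float,
                      api_usd_per_mtok: float,
                      local_usd_per_mtok: float,
                      tokens_per_day: float,
                      days_per_month: int = 30) -> float:
    """Months until self-hosted inference has repaid its hardware cost.

    Ignores staffing, quantisation work and fleet management, which would
    lengthen the payback period.
    """
    saving_per_mtok = api_usd_per_mtok - local_usd_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")                    # the API is never beaten
    mtok_per_month = tokens_per_day * days_per_month / 1_000_000
    return hardware_cost_usd / (saving_per_mtok * mtok_per_month)


# Hypothetical inputs: $5,000 of hardware, $3.00 vs $0.05 per million tokens.
high_volume = break_even_months(5_000, 3.00, 0.05, tokens_per_day=30_000_000)
low_volume = break_even_months(5_000, 3.00, 0.05, tokens_per_day=1_000_000)
print(f"30M tokens/day: {high_volume:.1f} months")
print(f" 1M tokens/day: {low_volume:.1f} months")
```

With these placeholder inputs the payback is under two months at 30 million tokens per day and several years at 1 million, which is why the result is so sensitive to sustained volume.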

Routing is the cost lever that actually moves the needle. The most consistently cited corrective is intelligent routing: sending routine classification, extraction and summarisation to open-weight models while reserving frontier APIs for complex reasoning. Particula Tech reported a 60 to 70 percent cost reduction routing 80 percent of requests to DeepSeek V4 or Qwen 3 variants. An April 2026 arXiv survey on dynamic model routing and cascading formalised this as an emerging production discipline rather than a one-off optimisation. Sources: Particula Tech Blog (2026) (↗); arXiv (2026) (↗); arXiv (2025) (↗)
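
A rough sketch of both halves of the idea: a rule-based router and the blended-cost arithmetic behind the saving. The task labels, tier names and function names are illustrative assumptions, and the calculation assumes every request costs the same on the frontier model, which the sources do not state.

```python
ROUTINE_TASKS = {"classification", "extraction", "summarisation"}


def route(task_type: str) -> str:
    """Send routine work to the cheaper tier, everything else to the frontier API."""
    return "open_weight" if task_type in ROUTINE_TASKS else "frontier"


def cost_reduction(routed_share: float, times_cheaper: float) -> float:
    """Fractional saving versus sending all traffic to the frontier model.

    routed_share:  fraction of requests sent to the open-weight model
    times_cheaper: how many times cheaper that model is (5 means 5x)
    """
    blended = (1 - routed_share) + routed_share / times_cheaper
    return 1 - blended


print(route("extraction"), route("multi_step_planning"))
for k in (5, 20):
    print(f"80% routed at {k}x cheaper: {cost_reduction(0.8, k):.0%} saving")
```

Under these assumptions, routing 80 percent of traffic at the 5 to 20x price gap quoted earlier for open-weight models gives a saving of roughly 64 to 76 percent, the same order as the reported 60 to 70 percent. The unrouted 20 percent sets the floor: even a free cheap tier cannot save more than 80 percent.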

Deterministic scaffolding is also a financial control. Veso AI and Praetorian quantified that deterministic scaffolding layers (typed wrappers, schema validation, output truncation) reduce per-query token use by 60 to 98 percent. This collapses the harness debate and the cost debate into one: the same engineering that makes agents reliable also makes them affordable. Sources: Veso AI Blog (2026) (↗); Praetorian Blog (2026) (↗)
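
To make the three layers concrete, here is a small sketch of a tool wrapper that validates the model's arguments against a typed schema, projects the backend record down to the requested fields, and truncates whatever goes back into the context. The tool, its fields and the 200-character budget are hypothetical; this is not the Veso AI or Praetorian implementation.

```python
import json
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_OUTPUT_CHARS = 200                      # hypothetical context budget
ALLOWED_FIELDS = {"name", "plan", "status"}


@dataclass(frozen=True)
class LookupRequest:                        # typed wrapper around raw model output
    customer_id: int
    fields: Tuple[str, ...]


def parse_request(raw: str) -> LookupRequest:
    """Schema validation: reject anything that is not exactly the expected shape."""
    data = json.loads(raw)                  # JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    cid = data.get("customer_id")
    if isinstance(cid, bool) or not isinstance(cid, int):
        raise ValueError("customer_id must be an integer")
    fields = data.get("fields")
    if (not isinstance(fields, list) or not fields
            or not all(isinstance(f, str) for f in fields)
            or not set(fields) <= ALLOWED_FIELDS):
        raise ValueError(f"fields must be a non-empty subset of {sorted(ALLOWED_FIELDS)}")
    return LookupRequest(cid, tuple(fields))


def truncate(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Output truncation: cap what is fed back into the model's context."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"... [{len(text) - limit} chars truncated]"


def lookup_tool(raw_args: str, backend: Callable[[int], dict]) -> str:
    try:
        req = parse_request(raw_args)
    except ValueError as exc:
        return f"invalid request: {exc}"    # short, deterministic error for the model
    record = backend(req.customer_id)
    payload = {f: record.get(f) for f in req.fields}   # drop unrequested fields
    return truncate(json.dumps(payload))


def fake_backend(cid: int) -> dict:
    """Hypothetical backend returning a record with a very large free-text field."""
    return {"name": "Ada", "plan": "pro", "status": "active", "notes": "x" * 10_000}


print(lookup_tool('{"customer_id": 7, "fields": ["name", "plan"]}', fake_backend))
print(lookup_tool('{"customer_id": "7", "fields": ["name"]}', fake_backend))
```

The raw record is over 10,000 characters; the string returned to the model is a few dozen. The saving comes from the projection and the cap, both of which are ordinary deterministic code.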

MCP standardised the symbiosis. Model Context Protocol, open-sourced by Anthropic in late 2024, reached near-ubiquitous vendor adoption within a year, formalising deterministic connectors (databases, APIs, workflow engines) as first-class citizens in the model's tool layer. Thoughtworks' Technology Radar Vol. 33 documented this alongside a declaration that vibe coding had "practically disappeared", a practitioner-level shift from theatre to structured engineering. Sources: Thoughtworks (2025) (↗); InfoQ (2025) (↗)

The perceived-versus-measured productivity gap is structural. METR's randomised controlled trial found experienced open-source developers using early-2025 AI tools took 19 percent longer to complete tasks. A May 2026 survey reported a median 1.4 to 2x self-reported gain, but METR warns that respondents select into tasks where AI helps most. By February 2026, METR had to redesign its experiment because developers refused AI-free conditions: behavioural lock-in independent of demonstrated productivity. Sources: arXiv / METR (2025) (↗); METR (2026) (↗); METR (2026) (↗)

Evidence & Data

The adoption-versus-value figures cluster tightly. McKinsey: 88 percent adoption, 39 percent EBIT impact, 5.5 percent high performers. Gartner forecast 30 percent GenAI PoC abandonment by end-2025, upgraded toward 50 percent, plus a prediction that over 40 percent of agentic projects will be cancelled by end-2027. S&P Global data showed the share of enterprises scrapping most of their AI initiatives jumped from 17 percent in 2024 to 42 percent in 2025. The MIT NANDA-derived 95 percent pilot-failure figure travelled furthest but rests on the weakest methodology. Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); Gartner (2024) (↗); Gartner (2025) (↗); Fortune (2025) (↗)

On cost mechanics, Gartner's March 2026 analysis found agentic models require 5 to 30x more tokens per task because each step resends the full context window. The average annual enterprise AI budget grew from $1.2 million in 2024 to $7 million in 2026. The FinOps Foundation reported companies 3x over their full-year token budgets by April. Goldman Sachs projected a 24-fold increase in token consumption by 2030, reaching 120 quadrillion tokens per month, while forecasting 60 to 70 percent annual declines in per-token cost. Sources: Gartner (2026) (↗); Goldman Sachs Research (2026) (↗); Tom's Hardware (2026) (↗)
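
The resend mechanism is easy to see in numbers. In the sketch below, each step bills the whole accumulated context as input and then appends its own output, so total input tokens grow roughly with the square of the step count. The token counts are hypothetical, and prompt caching, which would reduce the billed amount, is deliberately ignored.

```python
def agentic_input_tokens(base_context: int, added_per_step: int, steps: int) -> int:
    """Total input tokens when every step resends the full context so far."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context              # the whole context is sent again
        context += added_per_step     # tool output and model reply are appended
    return total


def closed_form(base_context: int, added_per_step: int, steps: int) -> int:
    """Same quantity as an arithmetic series: n*b + a*n*(n-1)/2."""
    return steps * base_context + added_per_step * steps * (steps - 1) // 2


# Hypothetical task: 2,000-token starting context, 500 tokens added per step.
single_turn = 2_000
ten_steps = agentic_input_tokens(2_000, 500, steps=10)
print(ten_steps, ten_steps / single_turn)   # 42500 21.25
```

Ten steps on these placeholder numbers already cost about 21 times the single-turn input, inside the 5 to 30x band Gartner reports; doubling the step count roughly quadruples the step-dependent term.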

On capability measurement, METR's task-horizon work found the length of software tasks completable at 50 percent reliability doubled roughly every seven months. The critical caveat: at 80 percent reliability the horizon collapses to tens of seconds, far below the 99-plus percent threshold unattended production needs. SWE-bench Pro, contamination-resistant, placed even frontier models at 25 to 45 percent Pass@1 on long-horizon tasks. Sources: METR (2025) (↗); METR (2026) (↗); arXiv (2025) (↗)
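
A seven-month doubling time is an exponential, horizon(t) = h0 * 2^(t / 7), and the sketch below just evaluates it. The 60-minute starting horizon and the 8-hour target are hypothetical illustrations, not METR measurements; only the seven-month doubling period comes from the text.

```python
import math


def horizon_after(months: float, start_minutes: float,
                  doubling_months: float = 7.0) -> float:
    """Task horizon after `months`, given a fixed doubling period."""
    return start_minutes * 2 ** (months / doubling_months)


def months_to_reach(target_minutes: float, start_minutes: float,
                    doubling_months: float = 7.0) -> float:
    """Invert the curve: months needed to grow from start to target."""
    return doubling_months * math.log2(target_minutes / start_minutes)


# Hypothetical starting point: a 60-minute horizon at 50 percent reliability.
print(horizon_after(14, start_minutes=60))      # two doublings
print(months_to_reach(480, start_minutes=60))   # three doublings to an 8-hour task
```

The same curve applies at any reliability level only if the doubling rate is the same there, which is an extra assumption; the caveat in the text is about the much lower starting point at higher reliability, not about the growth rate.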

On market scale, Bloomberg Intelligence raised its generative AI forecast to $2.3 trillion by 2032, with coding agents alone on track from $1 billion in 2024 to near $100 billion by 2032. CB Insights documented $66.6 billion in AI funding in Q1 2025 alone, nearly two-thirds of all 2024 investment, with agentic orchestration recording the highest Mosaic health scores. Sources: Bloomberg Intelligence (2026) (↗); CB Insights (2025) (↗)

On pricing-model disruption: SAP shifted to consumption pricing; GitHub moved Copilot to usage-based AI Credits from June 2026; Bloomberg estimated subscription pricing could fall from 60 to 30 percent of software models over a decade. TechTimes reported the token tax locking agentic gross margins 30 points below SaaS baseline. Sources: ERP Today (2026) (↗); TechTimes (2026) (↗)

Signals & Tensions

The November 2025 inflection is real but practitioner-defined. Simon Willison identified Claude Opus 4.5 and GPT-5.2 as the point coding agents became reliable for daily production use, and Sequoia called agentic coding the first discontinuous shift. This is convergent practitioner judgement, not controlled measurement, and should be held as a strong signal rather than established fact. Sources: Simon Willison's Newsletter (Substack) (2026) (↗); Sequoia Capital (2025) (↗)

Falling unit cost, exploding total spend is a Jevons dynamic, not an anomaly. The 98 percent price drop and tripled bills are not contradictory. They are the expected pattern when cheaper inference expands consumption faster than efficiency gains absorb it. The open question is whether outcome value scales with the consumption or lags it. Sources: The Next Web (2026) (↗); Axios (2026) (↗)

Self-hosting TCO is routinely understated. Vendor comparisons focus on per-token API prices and ignore GPU fleet management, quantisation engineering, and version-lifecycle discipline. The arXiv break-even analyses are credible but assume high, steady token volumes; below those thresholds metered APIs remain cheaper. Sources: arXiv (2025) (↗); arXiv (2026) (↗)

Vendor ROI claims dominate the evidence base. The Klarna 700-agent figure is a workload-equivalence calculation from a press release, not an audited headcount. ServiceNow's 90 percent deflection and most agentic ROI numbers share this provenance. Independent, controlled measurement of agentic output quality and TCO across a representative enterprise sample does not yet exist at scale. Sources: Perspective AI (2026) (↗); Computer Weekly (2025) (↗)

Lock-in is the underreported counter-thesis. a16z's CIO surveys tracked a transition from model interchangeability to harness lock-in: prompts are tuned for one provider, and switching risks breaking downstream agent dependencies. The orchestration layer that abstracts model variance also creates a new vendor moat. Haverin's "EXTRACTION" framing argues this is the point. Sources: Andreessen Horowitz (a16z) (2026) (↗); Haverin (Substack) (2026) (↗)

The token-inflation finding is small but corrosive. An arXiv paper showed tokenisation ambiguity can allow providers to over-report token counts below audit detection thresholds, and Willison flagged Anthropic's Opus 4.7 tokeniser changes as a 40 percent invisible price rise. If buyers cannot independently verify the meter, FinOps for tokens has a trust problem at its foundation. Sources: arXiv (2026) (↗); arXiv (2026) (↗)

Open Questions

The failure-rate figures are not comparable. Gartner, MIT NANDA, McKinsey and S&P each measure different things (abandonment, zero P&L, failure to reach production, scrapped PoCs), producing a 30 to 95 percent range that conflates distinct phenomena. The tractable question is why pilot conditions (clean data, narrow scope, relaxed reliability) do not transfer to production. Sources: SoftwareSeni (2026) (↗); Omdia (2025) (↗)

How much of the harness advantage is durable engineering knowledge versus model-specific tuning that decays at the next release? The 12 percent cross-family transfer is suggestive but thin. Sources: arXiv (2026) (↗)

Where does verifiability break down? The symbiotic thesis rests on machine-checkable outputs, but most enterprise agentic use cases (knowledge work, decision support) lack a compiler-grade quality signal. Whether the software pattern generalises beyond software is unresolved. Sources: arXiv (2025) (↗); arXiv (2026) (↗)

Does cost-per-outcome converge with token spend, or stay decoupled? No source offers an audited unit-economics framing that links agentic spend to delivered value. EY's Total Cost of Agents and Deloitte's CFO guide are frameworks, not measurements. Sources: Deloitte (2026) (↗); EY (2026) (↗)

Is there genuine AI-displaces-deterministic-software counter-evidence? The InfoQ "execution engines while backends retreat to governance" thesis and the COBOL-to-Python modernisation work hint at displacement at the edges, but both still wrap the model in deterministic control. A clean displacement case is absent from this sweep. Sources: InfoQ (2025) (↗); arXiv (2026) (↗)

Will Meta's pause on open-weight Llama, while pivoting to a closed line, shrink the regulated-market default and hand leverage back to metered APIs and Chinese-origin models that carry procurement risk? Sources: Meta AI (2025) (↗); Digital Applied (2026) (↗)

The strongest case for "new layer, not replacement" is that compilers, test runners and type checkers generate the most reliable RL signal of any AI domain and remain indispensable to producing it. The strongest case against is that nobody has yet shown the deterministic rails surviving contact with a model good enough to write them too.

Sources

Financial Press

ID	Title	Outlet	Date	Significance
f1	Agentic AI 2026 Outlook — Bloomberg Intelligence	Bloomberg Intelligence	2026-05	Bloomberg Intelligence's primary research report on how agentic AI is restructuring enterprise software pricing from seat-based to usage- and outcome-based models, directly relevant to the token-economics and SaaS disruption angles.
f2	Generative AI Market Poised to Reach $2.3 Trillion by 2032 as Agentic Systems Proliferate — Bloomberg Intelligence	Bloomberg Intelligence	2026-06	Bloomberg Intelligence's June 2026 market-size forecast, including the shift of AI revenue from training to inference, and the projection that coding agents grow from $1 billion in 2024 to near $100 billion by 2032.
f3	Bloomberg Intelligence on AI Agents in the Enterprise (video)	Bloomberg	2026-05	Bloomberg Intelligence senior software analyst discusses how AI agents are disrupting the enterprise software stack, offering independent Wall Street analytical perspective on the agentic transition.
f4	The AI Trainers Charging $25,000 a Day to Push Wall Street's Agentic Shift	Bloomberg	2026-05	Bloomberg feature on the practical and financial challenge of enterprise agentic adoption in financial services, documenting the gap between capital commitment and realised workflow automation.
f5	How DeepSeek and Open-Source AI Models Are Disrupting Big Tech	Bloomberg	2025-08	Bloomberg's primary news analysis of the open-weight model disruption, covering how DeepSeek and Chinese labs pushed competitive inference costs down and forced OpenAI to release its first open model in six years.
f6	OpenAI Releases Open-Weight Models After DeepSeek's Success	Bloomberg	2025-08	Bloomberg's reporting on OpenAI's strategic response to open-weight competition, marking a structural shift in how frontier lab models are distributed and priced.
f7	AI Sticker Shock Hits Corporate America	Axios	2026-05	Axios original reporting naming specific enterprise AI cost-management crises, including direct executive quotes on misallocated use cases and uncontrolled token spend, with commentary from a former Microsoft chief AI officer.
f8	AI Agents Forecast to Boost Tech Cash Flow as Usage Soars — Goldman Sachs Research	Goldman Sachs Research	2026-05	Goldman Sachs primary research publication ('Decoding the Agentic Economy') forecasting a 24-fold increase in token consumption by 2030 and a coming margin inflection for hyperscalers, the key investment-bank framework for understanding agentic cost dynamics.
f9	AI Token Costs Force Rethink at Uber and Microsoft	EE News Europe	2026-05	Synthesises the Goldman Sachs token-demand forecast against documented enterprise pullbacks at Uber and Microsoft, providing the sharpest single-article framing of the cost paradox: cheaper tokens, higher total bills.
f10	Token Shock Hits Silicon Valley's Biggest Spenders	PYMNTS	2026-05	Documents Microsoft's Claude Code licence cancellations and Uber's full 2026 AI budget exhausted in four months, with specific financial figures including Uber's $3.4 billion R&D spend and the structural mismatch between usage-based billing and enterprise finance cycles.
f11	AI Agent Economics: Token Tax Locks Gross Margins 30 Points Below SaaS Baseline	TechTimes	2026-06	Provides the clearest unit-economics framing of the token tax problem, citing ICONIQ Capital data showing AI-native product gross margins at 52 percent in 2026 versus 75–85 percent for mature SaaS, and documents the structural cost asymmetry between model-maker and API buyer.
f12	Agentic AI Enterprise Token Cost	EY	2026-06	EY consulting framework coining 'Agent FinOps' as a necessary enterprise discipline, and documenting how a single customer-service interaction can inflate from $0.04 to $1.20 under agentic orchestration — practical TCO evidence missing from vendor claims.
f13	AI Costs Begin to Bite as Agents May Increase Token Demand by 24 Times — Uber and Microsoft Among Companies Feeling the Bite	Tom's Hardware	2026-05	Documents Uber's admission that 80 percent of engineers used agentic AI and over 60 percent of code was AI-generated with no clear correlation to consumer product value, making it the most cited enterprise cost-ROI mismatch case study of 2026.
f14	The State of AI in 2025: Agents, Innovation, and Transformation	McKinsey & Company	2025-11	McKinsey's annual survey of 1,993 organisations across 105 countries, finding that while 88 percent use AI in at least one function, only 23 percent are scaling agentic systems and just 39 percent report EBIT impact — the primary independent benchmark for enterprise AI adoption status.
f15	From PoC to Production: Why Enterprise AI Struggles for Trust	Tech Journal UK	2025-12	Named practitioner testimony from NatWest Group's global AI architecture lead on the governance gap that kills pilots in production — a key primary-voice source on the PoC-to-production structural barrier.
f16	MIT Report: 95% of Generative AI Pilots at Companies Are Failing	Fortune	2025-08	Fortune's coverage of MIT data showing that PoC stall rates are structural rather than accidental, with the MIT finding that back-office automation delivers the highest ROI but receives less than half of budget allocation.
f17	DeepSeek's Release of an Open-Weight Frontier AI Model	International Institute for Strategic Studies (IISS) Strategic Comments	2025-04	IISS authoritative policy analysis establishing that DeepSeek V3 inference is 12.5 times cheaper than Claude 3.5 Sonnet and over 15 times cheaper than GPT-4o, providing the foundational cost-differential evidence for the open-weight right-sizing thesis.
f18	Open-Weight AI Models Are Catching Up: What It Means for Enterprise Automation	MindStudio	2026-05	Practical enterprise analysis showing where open-weight models (DeepSeek V3, Qwen 3, Llama 4 Maverick) now match closed frontier models on coding and structured tasks while remaining 5–20 times cheaper, and identifying where the performance gap persists in long agentic workflows.
f19	SAP Shifts to AI Consumption Pricing as Agents Threaten SaaS Revenue Model	ERP Today	2026-04	Documents CEO Christian Klein's March 2026 Bloomberg interview announcing SAP's shift from per-user to consumption-based pricing, the most concrete large-enterprise example of the seat-to-usage pricing transition and its predictability risks for buyers.
f20	Enterprise SaaS in the Agentic AI Era: Salesforce, ServiceNow, Workday	VaaSBlock	2026-06	Independent financial analysis comparing how Salesforce, ServiceNow, and Workday are positioned against the agentic pricing shift, with market performance data showing compressed Salesforce valuation multiples as Agentforce revenue conversion lags seat revenue erosion.
f21	Klarna AI Customer Service: Replacing 700 Agents — A 2026 Case Study	Perspective AI	2026-05	Forensic case study distinguishing Klarna's verified operational results from vendor-narrative inflation, documenting the May 2025 Bloomberg/Reuters-reported reversal as a scope correction rather than full walkback, and providing the architectural detail about authenticated context access that makes the deployment non-replicable generically.
f22	Artificial Intelligence Helps Klarna Double Revenues with Half the Staff	Computer Weekly	2025-11	Documents Klarna's Q3 2025 financial results showing revenue of $903 million against a workforce reduced from 5,500 to below 3,000, providing the primary financial performance data point for the AI-enabled workforce contraction thesis.
f23	AI Prices Are Going Up, Up, Up — And What This Means for Enterprise AI	Josh Bersin	2026-05	Names Uber and failed projects at Pizza Hut and Starbucks as cases where token burn preceded project failure, and documents Big 4 hyperscaler 2025 capex at $370–410 billion with Reuters-cited Bridgewater estimates, framing the macro investment context.
f24	Agentic AI 2024–2025 Retrospective: Shipped vs Walked Back	AgentModeAI	2026-05	Provides a four-class evidence taxonomy distinguishing vendor-controlled wins, audited pilots, public walk-backs, and structural failure modes — the most rigorous independent sceptical framework for evaluating agentic ROI claims available in the date range.
f25	Products Over Models: Why the AI Harness Matters More Than Benchmarks in 2026	MindStudio	2026-05	Practitioner analysis establishing that harness design — orchestration logic, context management, tool integrations, output handling — accounts for more production performance variance than model selection, directly supporting the orchestration-over-model-capability thesis.

Frontier Lab & Model News

ID	Title	Outlet	Date	Significance
t1	Claude 3.7 Sonnet and Claude Code — Anthropic	Anthropic	2025-02	Announces Claude 3.7 Sonnet as the first hybrid-reasoning production model and introduces Claude Code as an agentic coding CLI, marking Anthropic's direct entry into the harness-plus-model stack for software engineering.
t2	Introducing Claude Opus 4.5 — Anthropic	Anthropic	2025-11	Documents Opus 4.5's multi-agent coding capabilities, pricing reduction to $5/$25 per million tokens, and parallel subagent session support in Claude Code — directly relevant to agentic orchestration economics.
t3	System Card: Claude Opus 4 & Claude Sonnet 4	Anthropic	2025-05	Primary safety documentation for the Claude 4 generation, covering ASL-3 classification, dangerous capability evaluations, and agentic risk assessment — the canonical reference for Claude 4's safety posture.
t4	System Card: Claude Opus 4.5	Anthropic	2025-11	Formal safety evaluation claiming Opus 4.5 is the best-aligned frontier model to date, with preliminary alignment audit results and ASL-3 deployment rationale — important for understanding the safety-capability tradeoff at the frontier.
t5	System Card Addendum: Claude Opus 4.1	Anthropic	2025-08	Addendum covering reward-hacking evaluations and regression findings for Opus 4.1 — evidence of the iterative, incremental safety testing process as agentic capabilities increase.
t6	Claude Opus 4.6 System Card	Anthropic	2026-02	ASL-3 deployment documentation covering agent teams and new office-software capabilities, illustrating how Anthropic manages safety as Claude moves deeper into enterprise workflow automation.
t7	System Card: Claude Haiku 4.5	Anthropic	2025-10	Safety documentation for Anthropic's cost-efficient tier, confirming ASL-2 deployment and documenting how a smaller model is evaluated against the same dangerous-capability thresholds — key evidence for the right-sizing thesis.
t8	Claude API Platform Release Notes	Anthropic	2026-06	Running changelog of API feature additions including Agent Skills, Code Execution Tool v2, Usage and Cost API, and Secure MCP — the primary record of how Anthropic's platform shifted from raw model access to orchestration infrastructure.
t9	Anthropic's Code with Claude showed off coding's future — whether you like it or not	MIT Technology Review	2026-05	Independent journalistic account of Anthropic's May 2026 developer conference, including the Dreaming memory-consolidation feature and evidence that most Anthropic code is now written by Claude Code — a rare non-vendor data point on real production adoption.
t10	Measuring AI Ability to Complete Long Tasks — METR	METR	2025-03	Foundational METR paper establishing the 7-month task-horizon doubling time for frontier agents across 170 software engineering and reasoning tasks — the most rigorous independent measurement of agentic capability trajectory.
t11	Time Horizon 1.1 — METR	METR	2026-01	Updated METR task suite expanding long-task coverage from 14 to 31 tasks of 8+ hours, tightening confidence intervals on the time-horizon trend and acknowledging remaining methodological limitations.
t12	Frontier Risk Report (February to March 2026) — METR	METR	2026-05	First-ever METR pilot assessing rogue-deployment risks from AI agents used inside frontier labs, with participation from Anthropic, Google, Meta, and OpenAI — extends evaluation scope beyond pre-deployment model assessment.
t13	AI models can be dangerous before public deployment — METR	METR	2025-01	METR's policy argument for broadening evaluation scope beyond post-deployment harms, contextualising the organisation's role in the pre-deployment evaluation ecosystem alongside UK AISI and Apollo Research.
t14	Common Elements of Frontier AI Safety Policies (December 2025 Update) — METR	METR	2025-12	Comparative analysis of twelve frontier safety policies, documenting the convergence on capability thresholds, pre-/during-/post-deployment evaluation timing, and third-party accountability mechanisms across the major labs.
t15	Introducing Codex — OpenAI	OpenAI	2025-05	Launch announcement for OpenAI's cloud-based parallel agentic coding agent, powered by codex-1 (o3-derived), running tasks in isolated sandboxes and providing terminal-log citations for verifiable task tracing.
t16	Introducing GPT-5.2-Codex — OpenAI	OpenAI	2025-12	Documents context compaction for long-horizon coding, state-of-the-art SWE-Bench Pro performance, and the moment practitioners describe as when autonomous coding agents began to feel reliable — a credibility inflection point.
t17	Introducing GPT-5.3-Codex — OpenAI	OpenAI	2026-03	Announces GPT-5.3-Codex achieving new SWE-Bench Pro and Terminal-Bench highs while using fewer tokens than any prior model — key evidence for token efficiency improving alongside capability in the agentic coding vertical.
t18	Run long horizon tasks with Codex — OpenAI Developers	OpenAI	2026-02	Practitioner writeup of a 25-hour, 13M-token Codex run building a design tool from scratch, demonstrating the role of durable project memory and deterministic markdown specs as scaffolding for long-horizon agentic work.
t19	OpenAI for Developers in 2025	OpenAI	2025-12	Year-end synthesis of OpenAI's platform shifts: agent-native APIs, Responses API, Agents SDK, open-weight gpt-oss models, and distillation tooling — documents the full infrastructure stack alongside model releases.
t20	Introducing GPT-5.5 — OpenAI	OpenAI	2026-04	Announces GPT-5.5 with explicit token-efficiency framing — claiming higher quality outputs with fewer tokens than GPT-5.4 — directly addressing enterprise cost shock and the unit-economics critique of agentic spend.
t21	OpenAI API Changelog	OpenAI	2026-06	Running record of OpenAI API changes including the Agents SDK launch, Secure MCP Tunnel for enterprise, and the shift from full-session to per-minute container billing — the canonical source for platform-level orchestration primitives.
t22	Gemini 2.5: Our newest Gemini model with thinking — Google DeepMind	Google DeepMind	2025-03	Announces Gemini 2.5 Pro with integrated thinking, controllable reasoning budget, and #1 LMArena ranking — Google's entry into the hybrid-reasoning model class that Claude 3.7 and o1 opened.
t23	Google I/O 2025: Updates to Gemini 2.5 — Google DeepMind	Google DeepMind	2025-05	Documents Gemini 2.5 Pro Deep Think, thinking-budget controls extended to Pro, native MCP SDK support, and computer-use integration — a comprehensive record of Google's agentic tooling expansion at I/O 2025.
t24	Gemini API Release Notes	Google DeepMind	2026-06	Official changelog documenting the progression through Gemini 3 Pro Preview, Gemini 3.5 Flash GA, Managed Agents API launch, and Antigravity general-purpose agent — the most complete record of Google's agentic platform build-out.
t25	The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation — Meta AI	Meta AI	2025-04	Official announcement of Llama 4 Scout (10M-token context), Maverick (128 experts, LMArena Elo 1417), and Behemoth (in training), the open-weight release that most directly challenged the pricing power of closed frontier models.

Academic & arXiv

ID	Title	Outlet	Date	Significance
a1	HCAST: Human-Calibrated Autonomy Software Tasks	METR	2025	Foundational METR benchmark providing human-time-calibrated software task completions against which frontier model agentic performance is measured, directly informing claims about what AI can and cannot automate reliably.
a2	RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts	arXiv / METR	2024-11	Establishes empirical baseline for AI R&D task performance relative to human experts, showing AI agents outpace humans at short time budgets but are surpassed at longer horizons, grounding claims about the current limits of autonomous agentic work.
a3	Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity	arXiv / METR	2025-07	METR's RCT finding that AI tools caused a 19% slowdown among experienced developers in early 2025 directly challenges the assumption that token consumption and tool adoption translate into productivity gains.
a4	Research Update: Algorithmic vs. Holistic Evaluation	METR	2025-08	METR's follow-up analysis reconciling benchmark success rates with developer productivity RCT findings, highlighting the gap between algorithmic scoring and production-readiness of agent outputs.
a5	We are Changing our Developer Productivity Experiment Design	METR	2026-02	METR's February 2026 update documenting that developers were increasingly refusing to work without AI tools, complicating RCT design and indicating rapid behavioural change in developer-AI dependence by early 2026.
a6	Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity	METR	2026-05	Survey of 349 technical workers finding median 1.4-2x self-reported change in work value from AI tools, with critical methodological warnings about overestimation, providing the most current data point on perceived versus measured productivity.
a7	Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response	arXiv	2025-11	Empirical demonstration that multi-agent orchestration produces zero quality variance across trials, enabling production SLA commitments impossible with single-agent outputs, directly supporting the argument that orchestration architecture matters more than raw capability.
a8	Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization	arXiv	2026-05	Comparative study separating execution control from generative reasoning in legacy code modernisation, arguing that structured workflows benefit from deterministic orchestration and that reliability and economic sustainability require this separation.
a9	Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation	arXiv	2026-04	Proposes compiling LLM intent into deterministic executable code rather than invoking models at runtime, addressing token waste and non-determinism in enterprise workflow automation; cites 79% of multi-agent failures stemming from specification issues.
a10	Towards a Science of AI Agent Reliability	arXiv	2026-02	Introduces twelve reliability metrics across four dimensions for agentic systems, finding that recent capability gains have yielded only small improvements in reliability, providing the strongest empirical counter-argument to claims of production-ready autonomous agents.
a11	From Prompt-Response to Goal-Directed Systems: The Evolution of Agentic AI Software Architecture	arXiv	2026-02	Architecture survey showing production systems increasingly adopt hybrid patterns using LLMs for high-level decomposition while enforcing symbolic constraints on tool execution, providing formal framing for the AI-plus-deterministic-software stack.
a12	The Six Sigma Agent: Achieving Enterprise-Grade Reliability in LLM Systems Through Consensus-Driven Decomposed Execution	arXiv	2026-01	Frames the fundamental tension between probabilistic LLM outputs and enterprise determinism requirements, citing MIT GenAI Divide data that 95% of enterprise deployments fail, and proposing consensus-based decomposition as a reliability engineering solution.
a13	From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents	arXiv	2026-03	Comprehensive survey mapping the transition from static prompt templates to dynamic workflow graphs, covering the full landscape of orchestration paradigms relevant to the harness-versus-model capability debate.
a14	From Model Scaling to System Scaling: Scaling the Harness in Agentic AI	arXiv	2026-05	Directly addresses the harness-as-unit-of-scale thesis, documenting that Claude Code and Codex-style harness engineering packages agent primitives into programmable runtimes and arguing that the harness, rather than the backbone model, is now the primary scalable variable.
a15	Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses	arXiv	2026-04	Demonstrates that evolved harness components transfer cross-model-family with 12% token reduction on SWE-bench-verified, providing evidence that harness design encodes general engineering experience independent of specific models.
a16	SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution	arXiv	2026-05	Documents that native scaffolds (Claude Code, Codex CLI) outperform minimal scaffolds on identical models and traces Claude Opus capability improvements from August 2025 to April 2026, quantifying the scaffold contribution to benchmark scores.
a17	SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?	arXiv	2025-09	Establishes a harder, contamination-resistant SWE-bench variant where even frontier models remain below 25-45% Pass@1, grounding realistic expectations about the gap between benchmark performance and production software engineering capability.
a18	Difficulty-Aware Agent Orchestration in LLM-Powered Workflows	arXiv	2025-09	Proposes difficulty-aware routing across heterogeneous model ensembles, demonstrating that right-sizing model selection by task hardness improves cost-performance ratios over homogeneous frontier-model deployments.
a19	Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference	arXiv	2025-09	Presents MoMA, a routing framework integrating both LLM and agent-level routing based on intent recognition, formalising the production practice of dynamically directing queries to cost-optimal execution units.
a20	Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey	arXiv	2026-04	Comprehensive taxonomy of six routing paradigms across independently trained LLMs, synthesising the 2024-2026 literature on cost-performance optimisation through intelligent query routing in production inference stacks.
a21	Token Economics for LLM Agents: A Dual-View Study from Computing and Economics	arXiv	2026-05	Formal dual-view framework treating token consumption as both a computational and an economic variable, showing that agentic iterative workflows make cost unpredictable and framing inference acceleration as an economic imperative rather than an engineering optimisation.
a22	Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage	arXiv	2026-05	Identifies a structural billing integrity problem in per-token pricing, finding tokenisation ambiguity alone allows over-reporting below detection thresholds, directly relevant to enterprise cost-shock and the gap between vendor-quoted and actual inference cost.
a23	The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference	arXiv	2025-11	Empirical analysis of token price trends from April 2024 to October 2025 across proprietary and open-weight providers, isolating algorithmic efficiency gains from hardware and competitive effects to inform enterprise build-versus-buy decisions.
a24	A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services	arXiv	2025-09	Quantified break-even analysis showing medium-scale open-weight models (Llama-3.3-70B class) running on two A100s achieve less than 10% accuracy loss versus frontier models at substantially lower total ownership cost, grounding the open-weight economics case.
a25	Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs	arXiv	2026-01	Benchmarks four open-weight models across 79 configurations and finds self-hosted inference reaches cost parity with commercial APIs within 1-4 months at moderate usage, then operates at 40-200x lower cost, providing the strongest empirical backing for private inference economics.

VC & Analyst Reports

ID	Title	Outlet	Date	Significance
v1	How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025	Andreessen Horowitz (a16z)	2026-02	Third annual a16z CIO survey documents the shift from model interchangeability to agentic workflow lock-in, with 81% of enterprises now using three or more model families and innovation-budget share collapsing from 25% to 7% of LLM spend.
v2	Leaders, Gainers and Unexpected Winners in the Enterprise AI Arms Race	Andreessen Horowitz (a16z)	2026-02	Quantifies Anthropic's 25-percentage-point enterprise penetration gain since May 2025, names Claude Code and software development as the primary vector, and notes that 65% of enterprises prefer incumbent solutions for trust and procurement simplicity.
v3	Where Enterprises Are Actually Adopting AI	Andreessen Horowitz (a16z)	2026-04	Frames the adoption thesis around verifiability and defined standard operating procedures, explaining why customer support and software development lead adoption while industries lacking verifiable outputs lag.
v4	The State of AI in 2025: Agents, Innovation, and Transformation	McKinsey Global Institute / QuantumBlack	2025-11	Canonical 2025 enterprise survey (1,993 respondents, 105 countries) finding that 88% use AI but only 39% report EBIT impact; 23% are scaling agentic systems; high performers are 3x more likely to have fundamentally redesigned workflows.
v5	Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025	Gartner	2024-07	Foundational PoC-abandonment forecast, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value as the four driving causes; later updated to 50% abandonment by Gartner's own revised analysis.
v6	Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027	Gartner	2025-06	Extends the PoC-failure thesis to the agentic era, warns of widespread 'agent washing' by vendors, and estimates only around 130 of the thousands of self-described agentic AI vendors possess genuine agentic capabilities.
v7	Why Half of GenAI Projects Fail: Avoid These 5 Common Mistakes	Gartner	2026-04	Updated Gartner analysis reporting that at least 50% of GenAI projects were abandoned after PoC by end of 2025, with poor use-case selection and lack of business-value metrics consistently topping the failure list.
v8	State of the Art of Agentic AI Transformation — Technology Report 2025	Bain & Company	2025-09	Introduces a four-level agentic maturity framework (information retrieval to multi-agent constellations), documents that tech-forward enterprises achieved 10–25% EBITDA gains in 2023–2024 by scaling Level 1–2 tools, and warns that cross-system orchestration (Level 3–4) will require fit-for-purpose builds and human-in-the-loop governance.
v9	The Future of Opex in the Agent Economy	Bain & Company	2026-05	Bain's empirical framing of token-cost concentration — top 5% of users consuming more tokens than the other 95% combined — and the forecast that token costs could displace 20–30% of headcount opex, without a clear migration glide-path.
v10	AI in 2026: A Tale of Two AIs	Sequoia Capital	2025-12	Sequoia's annual AI outlook naming adoption fatigue on DIY implementations as a tailwind for packaged AI startups, flagging data-centre delays as a supply constraint, and noting that only coding and ChatGPT have established themselves as undisputed killer applications.
v11	Sequoia AI Ascent 2026: The Future of AI	Sequoia Capital (via AI Opportunities newsletter)	2026-05	Sequoia partner Sonya Huang's 2026 AI Ascent framing: 2022–2024 was chat, 2024–2025 was reasoning models, 2026 is agents; characterises Claude Code and long-horizon agentic coding as the first genuinely discontinuous capability shift, not just an incremental improvement.
v12	Services Are the New Software — Sequoia's Julien Bek on AI-Native Services	Fortune / Sequoia Capital	2026-04	Sequoia partner Julien Bek articulates the margin-compression thesis for AI-service businesses (gross margins around 70% vs. 90% for pure SaaS) and documents real-time enterprise token-rationing behaviour after Anthropic capped Claude Code at peak hours.
v13	State of AI 2025 Report	CB Insights	2026-02	Full-year 2025 overview recording $200B-plus in AI venture funding, AI agent acquisitions comprising 10% of all AI M&A by value, and Salesforce as the most acquisitive buyer with 10 AI deals, reflecting incumbent capture of the agentic stack.
v14	State of AI Q1 2025 Report	CB Insights	2025-09	Documents AI funding surging 51% to $66.6B in Q1 2025, with the three largest acquisitions going to agentic AI companies; Mosaic scores for multi-agent systems and orchestration platforms averaging 705+ — among the highest health scores across all industries.
v15	The AI Agent Market Map	CB Insights	2026-03	Maps 400+ private agentic AI companies, noting the landscape grew from roughly 300 companies to thousands since March 2025; identifies software development as the most revenue-active agent category and flags that reasoning-model inference costs are already driving pricing pressure in that segment.
v16	AI Token Economics for CFOs	Deloitte	2026-04	Practitioner-level CFO guidance documenting the structural shift from per-seat to consumption pricing, citing AT&T's 8-billion-tokens-per-day workload and subsequent 90% cost reduction through multi-agent architecture as a real-world cost-control case.
v17	Agentic AI Enterprise Token Cost	EY	2026-06	EY's Total Cost of Agents framework argues that token costs are only the visible component of a broader operating stack including infrastructure, governance, and engineering overhead — and that current API pricing may be structurally subsidised.
v18	Token Prices Fell 98%. Enterprise AI Bills Tripled. Now the Industry Wants a Standards Body to Explain Why.	The Next Web	2026-06	Reports that GPT-4-equivalent token costs fell 98% while enterprise AI bills rose an estimated 320%, with average enterprise AI budgets growing from $1.2M in 2024 to $7M in 2026; documents Uber exhausting its full 2026 AI coding budget by April and the Linux Foundation's Tokenomics Foundation launch.
v19	Uber, Microsoft, and Others Burning Through AI Budgets. Now What?	SmarterX	2026-06	Synthesises WSJ and Axios reporting on enterprise token-budget crises; cites a Goldman Sachs projection that global token consumption will multiply 24x by 2030 and documents Uber's full-year AI coding budget exhausted in four months as the defining enterprise anecdote.
v20	DeepSeek's Release of an Open-Weight Frontier AI Model	International Institute for Strategic Studies (IISS)	2025-04	Authoritative independent analysis of DeepSeek V3/R1's economic significance: V3 reportedly trained for approximately $6M, with inference costs 12.5x cheaper than Claude 3.5 Sonnet and 15x cheaper than GPT-4o, reframing enterprise model economics.
v21	Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap	Digital Applied	2026-05	Technical retrospective documenting that open-weight inference costs dropped roughly an order of magnitude versus H2 2025; details DeepSeek V4's architectural reset achieving 27% of V3.2's single-token inference FLOPs and sovereign-cloud deployment patterns consolidating around vLLM and air-gapped clusters.
v22	DeepSeek's New Models Offer Big Inference Cost Savings	The Register	2026-04	Technical coverage of DeepSeek V4's hybrid attention mechanism (CSA/HCA), reporting cost-to-serve drops over 10x versus V3.2 with roughly 10x less memory — the most significant open-weight serving efficiency improvement of 2026.
v23	Battery Ventures State of Enterprise Tech Spending: Q4 2025	Battery Ventures	2025-12	Survey of 100 CXOs representing $35B+ in annual tech spend, finding 33% already run agentic AI in production, 42% scaling across functions, and enterprises identifying an average of 88 Gen AI use cases with production deployments growing nearly 4x year over year.
v24	CB Insights: AI Agents Are Transforming Enterprise Operations and Driving Infrastructure Demand	CB Insights (via Crowdfund Insider)	2026-02	CB Insights survey of 59 executives finding that 80% consider AI agent adoption a strategic priority but 40% cannot track or are unaware of actual ROI — the clearest quantification of the measurement gap between agentic enthusiasm and documented value.
v25	Agentic AI Market Funding Trends 2026	New Market Pitch	2026-05	Tracks every disclosed agentic AI equity round from January 2024 to May 2026; full-year 2025 funding nearly doubled to $2.9B across 50 deals versus $1.5B in 2024, with January-May 2026 already at $1.1B — 2x the comparable 2025 period.

Blogs & Independent Thinkers

ID	Title	Outlet	Date	Significance
b1	The problem with agentic AI in 2025	Platforms, AI, and the Economics of BigTech (Substack)	2025-10	Advances the thesis that most enterprise agentic deployments are being governed by RPA-era thinking, producing only incremental efficiency rather than the coordination gains that agentic architecture makes possible.
b2	Microsoft Is The Canary In The AI-Adoption Coal Mine	ProductMind (Substack)	2026-06	Documents Microsoft's internal Claude Code rollout in December 2025 and licence cancellations by May 2026 due to budget overruns, synthesising METR productivity data and Microsoft Research findings on AI delegation quality degradation.
b3	Why Agentic AI Is Stalling Inside Most Enterprises	The AI Economy (Substack)	2026-05	Synthesises Harvard Business Review/Hyland survey data showing only 27 percent of enterprises have connected data needed for agentic AI, grounding the PoC-opalypse in data infrastructure failure rather than model capability.
b4	The Agentic Stack Wars: Part Three — EXTRACTION	Haverin (Substack)	2026-06	Provides detailed cost modelling of multi-model orchestration versus single-model RAG, citing McKinsey and IBM ROI data to argue value capture is structurally uneven and the pricing shift from flat to consumption billing is arriving before the ROI case is settled.
b5	The real, embarrassing state of enterprise AI adoption	Next Word (Substack)	2026-06	Names Uber's budget burnout directly and argues AI adoption outcomes follow a skill distribution, with talent density and incentive alignment separating genuine production use from expensive experimentation.
b6	75% of Enterprise AI Fails. The Fix Isn't a Better Model.	Product Impact Pod (Substack)	2026-03	Argues governance and structured knowledge (ontology, knowledge graphs) are the production differentiators, citing WEF data showing organisations with strong AI governance see 20 percentage points higher positive outcomes.
b7	The Claude Code Leak: 10 Agentic AI Harness Patterns That Change Everything	Agentic AI (Substack — Ken Huang)	2026-04	Analyses the leaked Claude Code codebase to extract ten harness engineering patterns, making the case that harness design, not model intelligence, is the production differentiator in agentic systems.
b8	CLAUDE CODE ORCHESTRATION	Agentic AI (Substack — Ken Huang)	2026-06	Documents Claude Code's May 2026 Dynamic Workflows feature — JavaScript-generated orchestration scripts fanning work across up to 1,000 parallel subagents — as evidence that deterministic control flow is now the primary performance lever, not model capability.
b9	Scaling the harness: The next major bottleneck in agentic AI	TechTalks (Substack — Ben Dickson)	2026-06	Reports a UC Berkeley paper arguing system scaling — harness design — has replaced model scaling as the binding constraint on agentic AI performance, and details Claude Code's five-tier context compaction system as a concrete implementation.
b10	Decode the Buzzword: Why Harness Engineering Matters Now	Next Signal Prediction (Substack)	2026-03	Maps a three-era evolution of AI engineering methodology — prompt engineering, context engineering, harness engineering — and identifies November 2025 as the inflection point where using models well overtook making models better.
b11	Claude Code's Secrets Revealed	AI Changes Everything (Substack — Patrick McGuinness)	2026-04	Documents the leaked Claude Code codebase architecture: 29,000-line tool system with three-tier permissions, 46,000-line query engine for LLM orchestration, and coordinator-worker multi-agent patterns for large-scale codebase operations.
b12	Agentic Harness — OpenClaw, Claude Code, and More	SolomonChrist AI (Substack)	2026-03	Provides comparative analysis of eight agentic harnesses — Claude Code, OpenCode, OpenClaw, Manus, Codex and others — identifying MCP standardisation and the local-versus-cloud deployment split as the two structural forces shaping the harness market.
b13	The Agentic Harness: Why the Orchestration Layer Is the Product	Veso AI Blog	2026-04	Argues that deterministic layers reduce per-query LLM token use by 60–80 percent versus naive full-model approaches, and that data-layer constraints — unlike prompt-based guardrails — cannot fail under adversarial or edge-case inputs.
b14	Agent & Harness & Micro-Orchestrator, Oh My!	Scaling DataOps (Substack)	2026-05	Practitioner account of outgrowing Claude Code and building custom orchestrators, documenting the shift from agentic harnesses for single-agent UX to orchestrators for repeated multi-step workflows at scale.
b15	How an Agent Harness Made My Claude Code Setup 10x More Reliable	AI Maker (Substack)	2026-05	Practitioner case study showing how memory, hooks, and review-agent loops around Claude Code produce repeatable quality that raw model access does not, illustrating the harness-as-governance pattern in a personal workflow context.
b16	Agentic Engineering Patterns — Simon Willison's Newsletter	Simon Willison's Newsletter (Substack)	2026-02	Willison, who coined 'prompt injection' and 'agentic engineering,' synthesises the November 2025 inflection point and documents practical patterns for deterministic control within agentic loops, including the layered orchestration architecture in OpenClaw.
b17	LLM predictions for 2026, shared with Oxide and Friends	Simon Willison's Newsletter (Substack)	2026-01	Willison argues November 2025 was the decisive inflection where coding agents became reliable daily drivers, and frames the Jevons paradox as the central unresolved question about whether lower code production costs expand or destroy engineering demand.
b18	I think 'agent' may finally have a widely enough agreed upon definition to be useful jargon now	Simon Willison's Newsletter (Substack)	2025-09	Willison formalises 'an LLM agent runs tools in a loop to achieve a goal' as a working definition, distinguishing deterministic from non-deterministic agents and framing LLMs as a non-deterministic layer added atop existing deterministic functions.
b19	Enterprise AI Inference: Open Models, Local AI, and the New Economics of Control	AI Realized Now (Substack)	2026-05	Applies Stanford HAI 2025 AI Index data — performance gap between open-weight and closed models narrowed from 8 percent to 1.7 percent by February 2025 — to argue enterprise teams should now treat open-weight routing as a first-class cost-control strategy.
b20	Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap	Digital Applied	2026-05	Provides the most detailed public retrospective on H1 2026 open-weight model releases across DeepSeek V4, Qwen 3.x, and Llama 4 families, documenting an order-of-magnitude inference cost drop and the three distinct vendor release strategies that emerged.
b21	DeepSeek V4 and Qwen 3.5: Open-Source AI Is Rewriting the Rules in 2026	Particula Tech Blog	2026-03	Reports practitioner-measured 60–70 percent infrastructure cost reductions from routing 80 percent of requests to open-weight models, with DeepSeek V4 at roughly $0.14 per million input tokens versus GPT-5 at $2.50, making the unit economics concrete.
b22	The Big LLM Architecture Comparison	Ahead of AI (Substack — Sebastian Raschka)	2026-04	Raschka, a researcher and ML educator, provides the most comprehensive architectural comparison of 2025–2026 open-weight models, documenting the rise of MoE architectures and the efficiency gains that underpin open-weight cost advantages.
b23	Tech Philosophy and AI Opportunity	Stratechery (Ben Thompson)	2025-11	Thompson argues Anthropic's task is to build not just state-of-the-art models but all the deterministic computing scaffolding around them, explicitly naming the harness and orchestration layer as the enterprise product, not the model alone.
b24	Microsoft and Software Survival	Stratechery (Ben Thompson)	2026-02	Analyses how per-seat licensing — built around human identity via Active Directory — becomes structurally problematic as agent adoption shrinks human headcount, framing the token-pricing transition as an existential business model shift for enterprise software.
b25	Deterministic AI Orchestration: A Platform Architecture for Autonomous Development	Praetorian Blog	2026-02	Shows that replacing raw MCP connections with on-demand typed wrappers and Zod schema validation reduces token consumption by up to 98 percent per multi-tool operation, providing the most precise quantification of deterministic scaffolding's economic value.

Tech Industry & Practitioner

ID	Title	Outlet	Date	Significance
p1	2025 DORA State of AI-Assisted Software Development Report	Google Cloud / DORA	2025-09	Primary empirical survey of nearly 5,000 practitioners establishing that AI amplifies existing engineering conditions rather than creating new capability, and introducing the DORA AI Capabilities Model as a framework for contextualising adoption.
p2	AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report	InfoQ	2026-03	Authoritative InfoQ synthesis of the 2025 DORA report findings, documenting that AI adoption continues to have a negative relationship with software delivery stability absent strong automated testing and feedback loops.
p3	Thoughtworks Technology Radar Vol. 33 — November 2025	Thoughtworks	2025-11	Canonical biannual practitioner signal report documenting the shift from prompt engineering and RAG toward context engineering, MCP-driven agent orchestration, and emerging AI antipatterns such as AI-accelerated shadow IT.
p4	Thoughtworks Technology Radar Highlights The Rapid Evolution of AI Assistance in 2025	Thoughtworks	2025-11	CTO Rachel Laycock's statement that vibe coding has 'practically disappeared' and the industry has moved to serious work on context, infrastructure, and security — a practitioner-level signal about the end of the PoC-as-theatre phase.
p5	Macro Trends in the Tech Industry — November 2025	Thoughtworks	2025-11	Expanded commentary on Radar Vol. 33 themes, documenting MCP's rapid proliferation to thousands of servers in under a year and the structural challenge of GPU cost management in AI inference at scale.
p6	2025 Stack Overflow Developer Survey — AI Section	Stack Overflow	2025-07	Large-scale survey of 49,000-plus developers across 177 countries showing 84% AI tool adoption alongside declining trust (only 29% trust AI outputs), with 66% citing near-miss outputs as the top frustration — grounding the trust-gap problem empirically.
p7	Stack Overflow's 2025 Developer Survey Reveals Trust in AI at an All Time Low	Stack Overflow	2025-07	Official press release providing the headline finding that positive developer sentiment towards AI tools has fallen from above 70% in 2023–24 to 60% in 2025, with 45% reporting that debugging AI-generated code is more time-consuming than writing it.
p8	Developers Remain Willing but Reluctant to Use AI: The 2025 Developer Survey Results Are Here	Stack Overflow Blog	2025-12	Contextual analysis of 2025 survey data noting that 72% of developers say vibe coding is not part of their professional work, reinforcing that AI is being adopted as a tool layer rather than a replacement for deterministic engineering practice.
p9	Why 88 to 95 Percent of Enterprise AI Pilots Never Reach Production	SoftwareSeni	2026-03	Consolidates data points from IDC/Lenovo, MIT NANDA, McKinsey, S&P Global, PwC, and Gartner into a comparative analysis of different PoC failure measurements, providing the most cited statistical overview of the enterprise adoption gap.
p10	MIT Report: 95% of Generative AI Pilots at Companies Are Failing	Fortune	2025-08	Fortune coverage of the MIT NANDA GenAI Divide report documenting the misalignment between where AI budgets are spent (sales and marketing) and where ROI has been documented (back-office automation), providing context for PoC stall rates.
p11	AI PoCs to Production: A Balanced Perspective	Omdia	2025-11	Independent analyst survey showing 40% of enterprises run 6–20 simultaneous PoCs, offering a more nuanced counter-reading to alarmist PoC failure statistics and noting the path to production is iterative rather than linear.
p12	Agentic AI Enterprise Token Cost	EY	2026-06	First edition of EY's Total Cost of Agents series, framing token costs as only the visible surface of agentic spend and recommending 'Agent FinOps' as a discipline with hard kill switches, per-task benchmarks, and centralised cost ownership.
p13	AI Token Economics for CFOs	Deloitte	2026-04	Based on a survey of 550 US enterprise leaders, documents that many companies already generate above 10 billion tokens per month and that agentic capabilities are shifting pricing from per-seat to usage-based models with material forecasting implications for CFOs.
p14	The Token Bill Comes Due: Inside the Industry Scramble to Manage AI's Runaway Costs	TechCrunch	2026-06	Documents enterprises hitting annual AI budgets in three months, per-developer token consumption rising 18.6x in nine months, and the Linux Foundation's creation of the Tokenomics Foundation as a FinOps-style standards body for AI spend governance.
p15	Uber, Microsoft, and Others Burning Through AI Budgets. Now What?	SmarterX	2026-06	Named-company analysis of Uber's CTO burning the entire 2026 Claude Code budget in four months and Microsoft cancelling most Claude Code licences over cost, with Google I/O data showing a 7x token volume jump at Google in a single year.
p16	The Real Cost of Agentic AI	InfoWorld	2026-06	Practitioner cost modelling showing all-in operating costs for agentic systems are two to five times raw token costs, and that a deterministic workflow with a single model call can handle classification, extraction, and summarisation at a fraction of the cost and risk.
p17	How Are AI Agents Spending Your Tokens?	Stanford Digital Economy Lab	2026-05	Based on a paper co-authored by Erik Brynjolfsson finding that agentic coding tasks consume 1000x more tokens than code reasoning tasks and that agents cannot predict their own costs — costs vary up to 30x on the same task — establishing the fundamental unpredictability problem.
p18	Microsoft Reports Are Exposing AI's Real Cost Problem	Fortune	2026-05	Fortune reporting on cases where AI compute costs exceed the cost of the human labour being replaced, citing Goldman Sachs forecasts of a 24-fold token consumption increase by 2030 alongside a Gartner finding that cheaper tokens will not translate to cheaper enterprise AI.
p19	Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems	arXiv (VILA-Lab)	2026-04	Systematic architectural analysis of the Claude Code harness documenting that 98.4% of the system is deterministic infrastructure (permission gates, context management, tool routing, recovery), with the AI loop itself a simple while-loop.
p20	Effective Harnesses for Long-Running Agents	Anthropic Engineering	2025	First-party Anthropic documentation demonstrating that even frontier models fail to build production applications without structured harness design, requiring initialiser agents, progress files, and explicit browser-automation testing to bridge context-window gaps.
p21	A Harness for Every Task: Dynamic Workflows in Claude Code	Anthropic / Claude Blog	2026-06	Anthropic's own practitioner guide to dynamic workflows, documenting agentic laziness, self-preferential bias, and goal drift as structural failure modes that deterministic workflow orchestration mitigates through isolated subagent context windows.
p22	Natural-Language Agent Harnesses	arXiv	2026-03	Academic paper establishing that harness scaffold differences can dominate outcomes even under fixed base models, reframing 'prompt engineering' as the broader practice of 'context engineering': deciding what state should be available at each step of a long run.
p23	The Architectural Shift: AI Agents Become Execution Engines While Backends Retreat to Governance	InfoQ	2025-10	Analysis of enterprise agentic architecture patterns, citing Gartner's prediction that 40% of enterprise applications will include task-specific agents by 2026, and outlining a three-tier governance framework in which trust must precede autonomy.
p24	Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap	Digital Applied	2026-05	Technical retrospective documenting that inference cost on leading open-weight stacks dropped by roughly an order of magnitude in H1 2026 versus H2 2025, with sovereign-cloud deployment consolidating around on-prem vLLM, managed Llama 4, and air-gapped quantised DeepSeek/Qwen.
p25	A Comprehensive Review of Qwen and DeepSeek LLMs	Preprints.org	2025	Drawing on 32 benchmarks and 18 peer-reviewed studies, documents that open-source models now achieve 89–92% of proprietary capabilities at 5–15% of operational cost, with MoE architectures delivering up to 4.3x faster inference at equivalent parameter count.