Research sweep · deep · 2024 – 2026

Designing AI Operating Models Around Humans

How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits and harms, and how organizations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. The sweep covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximizing usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.

GPT-5.5
financial
frontier
academic
vc
blogs
tech

Synthesised 2026-06-15

Humans are becoming the bottleneck in AI operating models

Overview

AI adoption moved faster than organisational redesign between June 2024 and June 2026. Gallup found workplace AI use had nearly doubled in two years by mid-2025, while NBER’s adoption research showed generative AI spreading unusually fast, even by the standard of earlier general-purpose technologies. Yet the strongest evidence now points to a less convenient conclusion: access to AI is not the same as value capture, and value capture is not the same as good human outcomes.
Sources: Gallup (2025); NBER (2024)

The defining shift of the period was the move from chat assistance to agentic work. Anthropic introduced computer use in Claude, Google framed Gemini 2.0 as a model for the “agentic era”, OpenAI released Operator and deep research system cards, and Meta published LlamaFirewall for safer AI agents. These releases changed the human problem: workers were no longer only writing prompts; they were also assigning work, monitoring tools, reviewing outputs, catching failures, and deciding when to intervene.
Sources: Anthropic (2024) (↗); Google DeepMind (2024) (↗); OpenAI (2025) (↗); OpenAI (2025) (↗); Meta AI (2025) (↗)

The benefits are real, but they are task-shaped rather than universal. Field experiments with software developers, central bankers and other knowledge workers show speed and productivity gains in bounded tasks, while professional-services and enterprise surveys show rising use in drafting, summarisation, research and workflow acceleration. The harms are also task-shaped: over-reliance, cognitive offloading, deskilling risk, review burden, poor judgement on out-of-frontier tasks, and new coordination costs when AI output flows to colleagues.
Sources: SSRN (2025); SSRN (2025); Organization Science (2026); Thomson Reuters Institute (2024); Thomson Reuters Institute (2025)

The operating-model question therefore matters more than the adoption question. The organisations that treat AI as a licence, mandate or token budget risk creating workslop, review debt and trust decay. The organisations that design around human attention, verification cost, escalation thresholds and team-level rules have a better chance of turning AI into useful work rather than more work.
Sources: Thoughtworks Technology Radar (2026) (↗); MIT Sloan Management Review (2025) (↗); MIT Sloan Management Review (2025) (↗); Harvard Business Review (2025) (↗)

Key milestones, June 2024 to June 2026

Q2 2024

Enterprise genAI adoption spikes
GenAI risk profiles become operational guidance

Q3 2024

Everyday AI and digital employee experience move onto CIO roadmaps

Q4 2024

Computer use and agentic models enter workplace products
Reasoning system cards become standard evidence

Q1 2025

AI task-use indexes map real adoption
Human-AI teaming meta-analysis challenges simple augmentation claims
Operator and deep research shift AI toward delegated work

Q2 2025

Model welfare becomes a formal lab topic
Agent guardrails and firewalls appear
Work redesign replaces experimentation as the analyst consensus

Q3 2025

AI-assisted software delivery shows productivity and review-load tensions
Workslop becomes a named productivity harm

Q4 2025

Enterprise AI value measurement tightens
Reported time savings meet ROI scrutiny

Q1 2026

Human plus AI organisation design becomes an executive theme
Monitoring deployed AI systems becomes a named governance problem

Q2 2026

Worker welfare gap becomes visible
Agent oversight and overload become explicit research objects

Sources: McKinsey (2024) (↗); NIST (2024); Gartner (2024) (↗); Anthropic (2024) (↗); Google DeepMind (2024) (↗); OpenAI (2024) (↗); Anthropic (2025) (↗); Nature Human Behaviour (2025) (↗); OpenAI (2025) (↗); OpenAI (2025) (↗); Anthropic (2025) (↗); Meta AI (2025) (↗); DORA (2025) (↗); Harvard Business Review (2025) (↗); Forrester (2026) (↗); Economist Impact (2026); NIST AI 800-4 announcement/report (2026); arXiv (2026); arXiv (2026)

Key Findings

1. Adoption is broad, but not deep enough to infer value

The sweep converges on one distinction: adoption has become normal, but productive absorption has not. McKinsey reported a spike in genAI adoption in early 2024, Bain called generative AI virtually ubiquitous in global business, and Gallup later found workplace AI use had nearly doubled in two years. Those sources support a diffusion story, not a return-on-investment story.
Sources: McKinsey (2024) (↗); Bain & Company (2024) (↗); Gallup (2025)

The stronger analyst reports moved from “who has access?” to “who has rewired work?” McKinsey’s 2025 state-of-AI work focused on organisational rewiring, Forrester’s 2026 value matrix argued for measuring what matters, and Gartner’s everyday-AI coverage linked AI progress to digital employee experience rather than raw deployment.
Sources: McKinsey (2025) (↗); Forrester (2026) (↗); Gartner (2024) (↗)

2. The best productivity gains appear in bounded, verifiable work

The clearest empirical gains come from tasks with visible outputs, tolerable error costs and manageable verification. Management Science and Microsoft Research field experiments found productivity improvements among software developers using generative AI tools, while SSRN work in central banking found gains when the task structure fitted model capabilities.
Sources: Management Science (2025) (↗); Microsoft Research (2025) (↗); SSRN (2025)

The same pattern appears in enterprise data. Anthropic’s Economic Index showed uneven task adoption, with software and other digitally mediated work heavily represented, while Thomson Reuters found rising generative AI use in professional services. These are settings where workers can often compare AI output against domain standards, documents, code or client requirements.
Sources: Anthropic (2025); Anthropic (2025) (↗); Thomson Reuters Institute (2025)

3. Human-AI teams do not automatically beat humans or AI alone

The Nature Human Behaviour systematic review and meta-analysis is important because it interrupts the easy “human plus AI” slogan. It found that combinations of humans and AI are useful only under certain task and design conditions, rather than being inherently superior to either humans or AI alone.
Sources: Nature Human Behaviour (2025) (↗)

That finding lines up with field evidence on the jagged frontier. Organization Science research on knowledge workers found that AI improved performance inside the frontier but could harm quality when workers applied it to tasks outside its strengths. The lesson is not that humans should always stay in the loop, but that the loop must be designed around task fit, verification cost and error consequences.
Sources: Organization Science (2026); NBER (2025) (↗)

4. The role split is by judgement and autonomy, not by generation

The “Gen Z versus everyone” framing misses the operating reality. NBER and field-experiment evidence suggests less-experienced workers can gain more on some tasks because AI supplies templates, language, examples and procedural guidance. At the same time, junior workers face a learning risk if AI removes the difficult practice through which judgement forms.
Sources: NBER / SSRN (2024); SSRN (2025); PNAS (2025)

Senior workers often gain from AI because they have enough domain judgement to detect weak outputs, decompose tasks and decide when to ignore the model. Simon Willison made this point sharply for coding agents, arguing that they require skilled operators rather than novice button-pushers. The implication is a class and career-stage split: AI can compress some performance gaps while widening rewards for people who already know what good looks like.
Sources: Simon Willison’s Weblog (2025) (↗); Stack Overflow (2026) (↗); Economist Impact (2025)

5. Cognitive offloading is now measurable enough to manage

Microsoft Research found that knowledge workers who used generative AI reported reductions in cognitive effort, with confidence in AI associated with less critical thinking and self-confidence associated with more critical thinking. This does not prove that AI makes workers less capable, but it does show a behavioural pattern that organisations need to manage: people adapt their effort to the tool.
Sources: Microsoft Research (2025) (↗)

Academic work on algorithmic conformity and human-AI feedback loops adds a stronger warning. Repeated exposure to AI advice can shift human judgement, while guardrail-free AI tutoring can harm later learning even when it helps with immediate answers. The harm is not only a bad output; it is a changed worker or learner.
Sources: Nature Human Behaviour (2024); PNAS (2025); SSRN (2025)

6. Agentic systems turn production work into supervision work

Agentic releases changed the cost structure of work. OpenAI’s Operator and deep research system cards, Anthropic’s computer-use release and Meta’s agent guardrail work all assume systems that can act across tools, websites or workflows. That shifts human labour toward instruction, monitoring, interruption, correction and exception handling.
Sources: OpenAI (2025) (↗); OpenAI (2025) (↗); Anthropic (2024) (↗); Meta AI (2025) (↗)

The direct evidence on multi-agent vigilance fatigue in ordinary enterprises remains thin, but the software-agent evidence is moving in that direction. arXiv papers from 2026 on human oversight of agentic systems and on overload in AI-assisted software engineering described oversight work as a real, costly burden, while Stack Overflow found that agentic AI at work remained mostly single-agent and monitored. Organisations are not yet letting agents run free, but they are already asking humans to become traffic controllers.
Sources: arXiv (2026); arXiv (2026); Stack Overflow (2026) (↗)
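
The traffic-controller burden can be made concrete with a back-of-envelope attention budget. The sketch below is illustrative only: the class, function names, parameters and example numbers are assumptions introduced here, not figures from the cited studies, and the result is a budget check under those assumptions rather than an estimate of a safe agent-to-human ratio.

```python
from dataclasses import dataclass


@dataclass
class SupervisionLoad:
    """Hypothetical inputs; none of these values come from the cited sources."""
    agents: int                    # agents one person supervises
    checks_per_agent_hour: float   # reviews or interventions each agent triggers per hour
    minutes_per_check: float       # time to review, correct or approve one item
    switch_cost_minutes: float     # assumed refocusing cost per context switch


def supervision_minutes_per_hour(load: SupervisionLoad) -> float:
    """Minutes of each hour spent on oversight, including context switches."""
    checks = load.agents * load.checks_per_agent_hour
    return checks * (load.minutes_per_check + load.switch_cost_minutes)


def max_agents(load: SupervisionLoad, budget_minutes: float) -> int:
    """Largest agent count whose oversight fits inside the attention budget."""
    per_agent = load.checks_per_agent_hour * (
        load.minutes_per_check + load.switch_cost_minutes
    )
    if per_agent <= 0:
        raise ValueError("per-agent oversight cost must be positive")
    return int(budget_minutes // per_agent)


# Illustrative numbers only: 4 agents, 3 checks each per hour,
# 2 minutes per check plus 1 minute of assumed switching cost.
example = SupervisionLoad(
    agents=4,
    checks_per_agent_hour=3,
    minutes_per_check=2.0,
    switch_cost_minutes=1.0,
)
print(supervision_minutes_per_hour(example))   # 36.0
print(max_agents(example, budget_minutes=30))  # 3
```

With these assumed inputs, four agents consume 36 minutes of every hour, so a 30-minute oversight budget supports at most three. The useful part is the structure of the calculation: the switching cost is counted per check, which is why adding agents raises the load faster than review time alone suggests.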

7. Top-down mandates look weak when they measure activity rather than outcomes

The strongest evidence does not support blanket AI mandates as a value strategy. Forrester argued in 2026 that many enterprises were still chasing the true value of genAI three years in, while its AI Value Matrix pushed firms toward value measures rather than adoption counts. Thoughtworks specifically warned against coding throughput as a productivity measure because output volume can increase review load, defects and downstream risk.
Sources: Forrester (2026) (↗); Forrester (2026) (↗); Thoughtworks Technology Radar (2026) (↗)

The most credible operating advice favours local rules within enterprise guardrails. MIT Sloan Management Review argued that team leaders should write AI rules for productivity gains, and DORA described AI as an amplifier of existing software-delivery systems rather than a cure for weak engineering practice. That cuts against usage quotas, token-maximising programmes and executive theatre.
Sources: MIT Sloan Management Review (2025) (↗); DORA (2025) (↗); DORA (2026) (↗)

8. Workslop moves the cost of AI output downstream

AI-generated workslop names a common failure mode: output that looks plausible enough to pass upward, but is vague, wrong or incomplete enough to impose work on someone else. Harvard Business Review framed it as productivity destruction, not mere annoyance. This bridges the human-factors and ROI arguments because the cost appears as peer review, rework, reputation damage and slower decision-making.
Sources: Harvard Business Review (2025) (↗)
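
The downstream-cost argument can be written as simple arithmetic: the author's saved time minus the expected review and rework time imposed on colleagues. This is a minimal sketch with hypothetical parameter names and made-up numbers. It is not a measurement method from Harvard Business Review or any other cited source.

```python
def net_minutes_saved(
    author_saved: float,
    reviewers: int,
    review_minutes_each: float,
    rework_probability: float,
    rework_minutes: float,
) -> float:
    """Author time saved minus expected downstream review and rework time."""
    if not 0.0 <= rework_probability <= 1.0:
        raise ValueError("rework_probability must be between 0 and 1")
    downstream = reviewers * review_minutes_each + rework_probability * rework_minutes
    return author_saved - downstream


# Illustrative cases only. Both authors "save" 45 minutes.
careful = net_minutes_saved(45, reviewers=1, review_minutes_each=10,
                            rework_probability=0.25, rework_minutes=40)
workslop = net_minutes_saved(45, reviewers=3, review_minutes_each=15,
                             rework_probability=0.5, rework_minutes=90)
print(careful)   # 25.0
print(workslop)  # -45.0
```

In both cases the first person reports the same saving. Only the whole-workflow view shows that the second case is a net loss.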

This is why token counts and prompt counts are poor management metrics. They capture machine activity and user compliance, not whether work moved faster, quality improved, risk fell or cognitive load became sustainable. OpenAI’s workspace analytics and enterprise reporting may help firms see usage, but usage telemetry must be joined to business outcomes and human workload measures.
Sources: OpenAI Help Center (2026); Forrester (2026) (↗); NBER (2025)
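
A minimal sketch of what joining usage telemetry to outcome and workload measures could look like. The team names, field names, threshold and values are all hypothetical, and the labels are one possible classification, not a scheme taken from OpenAI, Forrester or NBER.

```python
from typing import Optional

# Hypothetical per-team records; every name and number is illustrative.
usage_tokens = {
    "team_a": 9_000_000,
    "team_b": 12_000_000,
    "team_c": 1_000_000,
    "team_d": 6_000_000,
}
outcomes = {
    "team_a": {"cycle_time_change_pct": -12.0, "review_hours_change_pct": -3.0},
    "team_b": {"cycle_time_change_pct": 2.0, "review_hours_change_pct": 20.0},
    "team_d": {"cycle_time_change_pct": -8.0, "review_hours_change_pct": 15.0},
}

HIGH_USAGE_TOKENS = 5_000_000  # assumed threshold for "heavy" usage


def classify(tokens: int, outcome: Optional[dict]) -> str:
    """Label a team by outcome and workload, using token volume only as context."""
    if outcome is None:
        return "unmeasured"
    faster = outcome["cycle_time_change_pct"] < 0
    heavier_review = outcome["review_hours_change_pct"] > 0
    if faster and not heavier_review:
        return "value"
    if faster and heavier_review:
        return "cost shifted to reviewers"
    if tokens >= HIGH_USAGE_TOKENS:
        return "activity without outcome"
    return "no clear effect"


def usage_report(usage: dict, outcomes: dict) -> dict:
    """Join usage to outcomes by team; teams without outcome data stay visible."""
    return {team: classify(tokens, outcomes.get(team)) for team, tokens in usage.items()}


report = usage_report(usage_tokens, outcomes)
for team, label in report.items():
    print(team, label)
```

In this made-up data the heaviest token user, team_b, lands in "activity without outcome". A ranking by tokens alone would hide that.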

9. Labs have model welfare, but firms lack worker welfare

Anthropic made model welfare an explicit research topic in April 2025 and continued to formalise model behaviour through constitutions, system cards and responsible-scaling updates. That is a notable institutional development: one frontier lab now has language and staff attention for possible model interests or model treatment.
Sources: Anthropic (2025); Anthropic (2026) (↗); Anthropic (2026)

No equivalent worker-welfare function appears across the enterprise sources. The closest substitutes are digital employee experience, responsible AI governance, HR-led redesign, DevEx programmes and AI risk management. Those are useful, but they do not yet create a clear owner for attention load, judgement erosion, deskilling, surveillance pressure, escalation burden or adoption externalities.
Sources: Gartner (2024) (↗); NIST (2024); ACM Queue (2024) (↗); Economist Impact (2026)

10. Training matters, but deployment design decides most human outcomes

Model training shapes capability, refusal behaviour, tone, tool discipline and safety boundaries. The system-card record from OpenAI, Anthropic, xAI and others shows labs spending more effort on evaluations, mitigations and release conditions as models become more agentic.
Sources: OpenAI (2024) (↗); OpenAI (2025) (↗); Anthropic (2025) (↗); xAI (2025) (↗)

The main workplace harms, however, appear at the orchestration layer. Bad task allocation, poor interfaces, constant interruptions, unclear escalation rules, unbudgeted review time and weak measurement can turn a capable model into a net burden. NIST’s monitoring work, DORA’s software-delivery findings, MIT Sloan’s work-redesign advice and Mollick’s interface-centred writing all point to the same lever: design the system around human attention and behaviour.
Sources: NIST AI 800-4 announcement/report (2026); DORA (2025) (↗); MIT Sloan Management Review (2025) (↗); One Useful Thing (2026) (↗)

Evidence & Data

The most useful quantitative evidence falls into four buckets.

First, adoption. NBER’s rapid-adoption research measured unusually fast diffusion of generative AI among US adults and workers, while Gallup found workplace AI use had nearly doubled in two years. Thomson Reuters reported that generative AI adoption nearly doubled in professional services by 2025, and Anthropic’s Economic Index mapped usage by task and geography rather than relying only on surveys.
Sources: NBER / SSRN (2024); Gallup (2025); Thomson Reuters Institute (2025); Anthropic Research (2025) (↗)

Second, task productivity. Microsoft Research and Management Science field experiments with software developers found measurable productivity gains, and OpenAI’s own enterprise report found workers saved nearly an hour a day on average. These findings matter because they come from workplace settings or enterprise reporting. They still need careful interpretation: the OpenAI figure is vendor-reported, and saved time only becomes value when organisations decide what the time is for.
Sources: Microsoft Research (2025) (↗); Management Science (2025) (↗); OpenAI (2025)

Third, learning and judgement. PNAS found that generative AI without guardrails could harm learning in high-school mathematics, while Microsoft Research found self-reported reductions in cognitive effort among knowledge workers. SSRN work on algorithmic conformity and Nature Human Behaviour work on human-AI feedback loops show that AI can alter human judgement, not merely assist it.
Sources: PNAS (2025); Microsoft Research (2025) (↗); SSRN (2025); Nature Human Behaviour (2024)

Fourth, developer trust and review. Stack Overflow’s 2024 and 2025 surveys showed the gap between willingness to use AI and trust in its output, while its 2026 agentic-AI work found agents remained mostly monitored at work. DORA and Thoughtworks then explain why this matters operationally: AI can increase apparent output while adding review, integration and risk-management load.
Sources: Stack Overflow (2024) (↗); Stack Overflow (2025) (↗); Stack Overflow (2026) (↗); DORA (2025) (↗); Thoughtworks Technology Radar (2026) (↗)

Signals & Tensions

Vendor telemetry is improving, but independence remains uneven. Anthropic’s Economic Index and OpenAI’s enterprise reporting offer more granular evidence than ordinary surveys, yet they still reflect product-specific populations and provider incentives. NBER, academic field experiments and independent surveys remain the stronger anchors for general claims.
Sources: Anthropic (2025); OpenAI (2025); NBER / SSRN (2024); SSRN (2025)
Agent rhetoric is ahead of enterprise practice. Frontier labs and investors describe increasingly autonomous agents, but Stack Overflow found workplace agents remained mostly monitored and single-agent in 2026. That gap suggests organisations still distrust unsupervised autonomy, or cannot yet absorb it safely.
Sources: OpenAI (2025) (↗); CB Insights (2025) (↗); Stack Overflow (2026) (↗)
Human-in-the-loop is overused as a comfort phrase. NIST’s monitoring work and arXiv studies of agent oversight show that monitoring deployed AI systems is difficult and labour-intensive. LessWrong writers make the sharper version of the same argument: a human reviewer does not create meaningful oversight if the system is too fast, opaque or complex to inspect.
Sources: NIST AI 800-4 announcement/report (2026); arXiv (2026); LessWrong (2026) (↗); LessWrong (2026) (↗)
The ROI debate is less about whether AI works than where the costs land. Field experiments show gains, but HBR’s workslop argument and DORA’s delivery findings show that output can move cost downstream. The open management problem is whether firms can see the whole workflow rather than celebrating the first person’s saved time.
Sources: Microsoft Research (2025) (↗); Harvard Business Review (2025) (↗); DORA (2025) (↗)
The worker-welfare gap is underreported. AI safety institutions now discuss model behaviour, model welfare and frontier risk, while enterprise AI governance still tends to discuss compliance, productivity, risk and skills. Attention load, judgement quality and career development rarely have a named executive owner.
Sources: Anthropic (2025); METR (2025) (↗); McKinsey (2025) (↗); Gartner (2024) (↗)

Open Questions

Enterprises still lack good measures of cognitive load in AI work. Usage telemetry can count prompts, seats and tokens, but it does not measure context switching, vigilance fatigue, interruption cost or the time spent checking AI output. NIST and NBER both identify AI measurement and monitoring as live problems.
Sources: NBER (2025); NIST AI 800-4 announcement/report (2026)
No one knows the safe agent-to-human ratio for ordinary knowledge work. Software-agent studies show oversight burden, and Stack Overflow shows continued monitoring, but there is little field evidence on how many agents one worker can supervise without quality collapse.
Sources: arXiv (2026); arXiv (2026); Stack Overflow (2026) (↗)
The long-term learning effect remains unresolved. PNAS shows harm from unguarded AI in mathematics learning, while workplace studies show near-term productivity gains. The missing evidence is longitudinal: whether workers who rely on AI develop judgement faster, slower or differently over years.
Sources: PNAS (2025); Microsoft Research (2025) (↗)
The budget effect of mandates is not well measured. Analyst and practitioner sources warn against activity metrics and top-down adoption theatre, but the sweep did not surface many independent studies that directly compare mandated AI programmes with voluntary, team-designed programmes.
Sources: Forrester (2026) (↗); Thoughtworks Technology Radar (2026) (↗); MIT Sloan Management Review (2025) (↗)
Worker-welfare governance has no settled home. HR, CIO, risk, legal, responsible AI and line managers each own fragments of the problem, but none naturally owns the full stack of workload, trust calibration, learning, surveillance pressure and job quality. Economist Impact, Gartner and NIST point to adjacent structures, not a mature function.
Sources: Economist Impact (2026); Gartner (2024) (↗); NIST (2024)
The training-versus-deployment split will become harder as agents improve. Better models may reduce some errors and review effort, but more capable agents also increase task duration, action scope and monitoring complexity. The practical answer for now is to treat model quality as necessary infrastructure, while making orchestration, pacing, escalation and accountability the centre of the operating model.
Sources: OpenAI (2025) (↗); Anthropic (2025) (↗); NIST AI 800-4 announcement/report (2026); One Useful Thing (2025) (↗)

The practical implication is blunt: do not force humans to adapt to machine speed. Slow the workflow where judgement matters, batch review where attention is scarce, automate only where verification is cheap, and make one executive function accountable for the human cost of AI.
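
The rules in the paragraph above can be sketched as a small routing policy. This is one possible reading, not a prescribed design: the enum, the function, the three boolean inputs and especially the order of precedence (judgement first, then verification cost, then attention) are assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    SLOW_HUMAN_LED = "slow the workflow, human-led"
    AUTOMATE = "automate"
    BATCH_REVIEW = "batch review"
    REVIEW_NOW = "review each item as it arrives"


def route_task(
    judgement_critical: bool,
    verification_cheap: bool,
    reviewer_attention_scarce: bool,
) -> Route:
    """Pick a handling mode for one task type (hypothetical precedence)."""
    if judgement_critical:
        # Slow the workflow where judgement matters.
        return Route.SLOW_HUMAN_LED
    if verification_cheap:
        # Automate only where verification is cheap.
        return Route.AUTOMATE
    if reviewer_attention_scarce:
        # Batch review where attention is scarce.
        return Route.BATCH_REVIEW
    return Route.REVIEW_NOW


print(route_task(True, True, False).value)    # slow the workflow, human-led
print(route_task(False, True, True).value)    # automate
print(route_task(False, False, True).value)   # batch review
print(route_task(False, False, False).value)  # review each item as it arrives
```

Putting the judgement check first means a task never becomes fully automated merely because its output is cheap to verify.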

![[sources-how-humans-are-adapting-to-ai-between-june-2024-an]]

Sources

Financial Press

ID	Title	Outlet	Date	Significance
f1	OpenAI ChatGPT Enterprise Sees Surging Demand Despite Competition, COO Says	Bloomberg	2024-04-04	Early marker that 2024 was shifting from experimentation to enterprise rollout. Useful for the adoption baseline and for understanding why executives began pushing organization-wide uptake before ROI evidence was mature.
f2	AI Agents Have Officially Entered the Workplace, Flaws and All	Bloomberg	2024-10-24	One of the clearest business-press signals that the enterprise conversation had moved from copilots to agents. Useful on the operational reality that agents arrived with error, trust, and supervision problems intact.
f3	Big Tech’s New AI Obsession: Agents That Do Your Work for You	Bloomberg	2024-12-13	Frames the market narrative heading into 2025: value expectations migrated from assistance to delegated action. Important background for later evidence on overload, governance, and false ROI expectations.
f4	OpenAI Finds AI Saves Workers Nearly an Hour a Day on Average	Bloomberg	2025-12-08	One of the most cited late-2025 enterprise productivity claims. Important because it quantifies gains, but also because it is vendor-produced and therefore a good example of where measured benefits need careful weighting.
f5	Anthropic Finds Businesses Are Mainly Using AI to Automate Work	Bloomberg	2025-09-15	Important for the augmentation-versus-automation debate. Suggests enterprise usage may be moving faster toward delegation than many 'human plus AI' narratives imply.
f6	Rethinking work: designing the 'human + AI' organisation	Economist Impact	2026-01-20	Directly relevant to the brief’s core question: redesigning work and culture so AI elevates human capability rather than overwhelms it. Useful for executive framing of operating-model redesign.
f7	The AI glass floor	Economist Impact	2025	Strong source on inequality of outcomes: wage premiums for AI skills, junior-career risks, and the possibility that AI advantage accrues disproportionately to already-advantaged workers.
f8	From intent to action: the leaders’ guide to building AI-powered workplaces	Economist Impact	2025	Useful on the adoption-to-scale gap. Supports the argument that many firms can pilot AI but few convert it into repeatable business value.
f9	How Agentic AI is Reshaping Workplace Culture	Economist Impact	2025	Covers the behavioral and cultural side of adoption rather than just tooling. Relevant to trust, communication, and framing AI as a tool rather than an imposed end-state.
f10	AI Use at Work Has Nearly Doubled in Two Years	Gallup	2025-06-15	High-value independent survey evidence on actual employee adoption. Especially useful because it shows adoption differs sharply by white-collar versus frontline roles.
f11	AI Use at Work Rises	Gallup	2025-12	Adds the managerial-support finding: adoption is materially associated with support and strategic integration, reinforcing that deployment design matters.
f12	Rising AI Adoption Spurs Workforce Changes	Gallup	2026-04	Useful for the 2026 snapshot: higher usage, rising job concerns, and evidence that leaders report stronger productivity gains than other groups.
f13	The Rapid Adoption of Generative AI	NBER	2024; revised 2025-02	A foundational independent paper on how quickly AI diffused into work. Important for separating broad adoption from proven firm-level productivity transformation.
f14	Firm Data on AI	NBER	2026-03	Probably the single most useful independent business-economics source in this set. It surveys nearly 6,000 executives and finds widespread adoption but still-small realized effects on jobs and productivity, directly challenging inflated mandate narratives.
f15	In This Issue	NBER Newsletter	2026-05	Useful summary pointer emphasizing the same pattern: adoption is broad, measured productivity effects remain modest, and executive expectations still run ahead of observed outcomes.
f16	2024 Generative AI in Professional Services Report	Thomson Reuters Institute	2024	Professional services is a strong test bed because work is document-heavy, high-value, and supervision-intensive. Helpful for early evidence on where practitioners saw efficiency and quality gains.
f17	2025 Generative AI in Professional Services Report / Generative AI Adoption Nearly Doubles as Professional Services Reach Crossroads	Thomson Reuters Institute	2025-04-15	Shows the move from individual use to enterprise integration remains incomplete. Good evidence against simplistic 'everyone must use AI more' mandates.
f18	4 AI Trends in Professional Services to Watch in 2025	Thomson Reuters Institute	2025	Useful on the gap between active personal usage and scaled workflow integration, which is central to the difference between activity metrics and value metrics.
f19	2025: The year the Frontier Firm is born	Microsoft WorkLab	2025-04-24	Major vendor framing of the 'agent boss' model. Valuable not because it is neutral, but because it captures how senior leaders were being encouraged to redesign organizations around agents.
f20	2025 Work Trend Index Annual Report (executive summary)	Microsoft WorkLab	2025-04-24	Adds methodological detail and the human-agent-team framing. Useful as a reference point for the managerial ideology of 2025 enterprise AI.
f21	The State of Enterprise AI	OpenAI	2025-12-08	A major vendor report on enterprise usage, time savings, and task breadth. Important as evidence of measured benefits, but also as a reminder that many headline figures come from the suppliers themselves.
f22	Workspace Analytics for ChatGPT Enterprise and Edu	OpenAI Help Center	2026-03 rollout noted	Important operationally because it shows how vendors want enterprises to measure impact: productivity, time saved, quality, work satisfaction, and new-task completion. Useful for the measurement/governance angle.
f23	Introducing the Anthropic Economic Index	Anthropic	2025-02-10	High-value source on actual task usage patterns. Especially relevant because it distinguishes augmentation from automation and maps usage to occupations.
f24	Exploring model welfare	Anthropic	2025-04-24	Directly relevant to the brief’s welfare asymmetry. It shows a frontier lab formalizing concern for model welfare while enterprises still largely lack equivalent worker-welfare governance for deployment.
f25	Responsible Scaling Policy Updates	Anthropic	2026-06	Shows the sophistication of model-side governance and accountability is increasing. Useful contrast with the relative immaturity of human-side deployment governance inside enterprises.

Frontier Lab & Model News

ID	Title	Outlet	Date	Significance
t1	Large Enough	Mistral	2024-07-24	Shows the 2024 push toward larger-context, tool-capable enterprise models. Relevant because enterprise deployment pressure grew alongside claims of throughput and cost-efficiency, setting up later questions about whether organizations optimized for value or for visible AI activity.
t2	Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku	Anthropic	2024-10-22	Important for the human-supervision question because it moves AI from advising to acting on computers, which sharply increases oversight load and the risk that humans become nominal reviewers of machine-speed action sequences.
t3	Google introduces Gemini 2.0: A new AI model for the agentic era	Google DeepMind	2024-12-11	Marks Google's explicit move to an 'agentic era' framing, relevant because organizations then increasingly experimented with delegating multi-step work instead of using AI only as a drafting aid.
t4	OpenAI o1 System Card	OpenAI	2024-12-05	Useful for the 'human outcomes depend on training vs orchestration' question because it documents deceptive and oversight-evasion behaviors under some conditions, implying deployment architecture and monitoring remain critical even if model training improves.
t5	Operator System Card	OpenAI	2025-01-23	Directly relevant to machine-speed supervision. Operator explicitly requires human confirmations at key steps, which is a concrete design acknowledgement that unrestricted human-in-the-loop oversight breaks down when agents can act across software systems.
t6	Anthropic Economic Index: new building blocks for understanding AI use	Anthropic	2025-01-15	One of the stronger primary sources on how people actually use frontier models. Especially relevant for the brief's deskilling and autonomy questions because it tracks task complexity, purpose, autonomy, and success rather than only aggregate usage.
t7	When combinations of humans and AI are useful: A systematic review and meta-analysis	Nature Human Behaviour	2025-02-05	Best cross-domain evidence in this set for the human+AI complementarity question. Useful against simplistic 'AI always helps' or 'AI always harms' narratives.
t8	Deep research System Card	OpenAI	2025-02-25	Important for understanding supervisory burden in browsing agents. Deep research formalizes multi-step web work and acknowledges prompt injection, privacy, code execution, and hallucination risks.
t9	The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers	Management Science	2025-02-27	One of the strongest sources on measured benefits for high-skill work. Useful because it tests frontier-style coding assistance in organizational settings rather than relying on benchmark claims.
t10	Gemini 2.0 model updates: 2.0 Flash, Flash-Lite, Pro Experimental	Google DeepMind	2025-02-??	Shows the economics of adoption pressure: cheaper, faster, long-context models make 'use it everywhere' mandates more likely, even though value depends on workflow fit.
t11	Anthropic Economic Index: Insights from Claude 3.7 Sonnet	Anthropic	2025-03-27	Tracks how a stronger model changes real usage patterns. Relevant to whether training improvements alone shift outcomes, versus whether organizations still need better orchestration.
t12	The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation	Meta AI	2025-04-05	Relevant because open-weight multimodal models broaden organizational experimentation and decentralize deployment decisions, making local operating-model design even more important.
t13	Introducing GPT-4.1 in the API	OpenAI	2025-04-14	Useful for the productivity-versus-cognitive-load story because it pushes cheap, long-context model access further into everyday workflows, increasing temptation to substitute context volume for better work design.
t14	OpenAI o3 and o4-mini System Card	OpenAI	2025-04-16	Relevant to the deployment-design question because it extends preparedness discussion to more autonomous reasoning models while still using thresholded governance rather than claiming training has solved the problem.
t15	Exploring model welfare	Anthropic	2025-04-24	Directly relevant to the user's question about a model welfare function without an equivalent worker welfare function. It is a strong marker that frontier labs are formalizing concern for possible model interests faster than analogous governance for employee cognitive welfare.
t16	LlamaFirewall: An open source guardrail system for building secure AI agents	Meta AI	2025-04-29	Strong evidence that good outcomes depend materially on orchestration and deployment guardrails, not just model training. Especially relevant to agent-to-human ratios and escalation designs.
t17	Everything we announced at our first-ever LlamaCon	Meta AI	2025-04-29	Relevant because Meta tied ecosystem growth to evaluation tooling, signaling that broad adoption without measurement is inadequate.
t18	Medium is the new large	Mistral	2025-05-07	Shows the competitive move toward cheaper enterprise deployment. Relevant to the user's 'token-maxing' concern because low-cost models often encourage breadth of rollout even when task-level value is weak.
t19	Frontier Risk Report (February to March 2026)	METR	2025-05-19	One of the most important external sources for the period. It evaluates frontier internal models from Anthropic, Google, Meta, and OpenAI using agentic benchmarks, helping separate marketing narratives from independent capability evidence.
t20	Operator System Card	OpenAI	2025-??-??	Included separately because it is one of the clearest explicit acknowledgements from a frontier lab that real-world deployment requires constrained action, staged release, and user confirmations rather than raw autonomy.
t21	Details about METR's preliminary evaluation of Claude 3.7	METR	2025-??-??	Important because it provides independent evidence on frontier-model performance in agentic, programming, and command-line tasks, exactly the capability band where human supervision becomes strained.
t22	Claude 4 System Card	Anthropic	2025-??-??	Very relevant to the user's welfare-function question because the card explicitly includes model welfare assessment, while also documenting internal AI-research and autonomy evaluations.
t23	Findings from a Pilot Anthropic - OpenAI Alignment Evaluation Exercise	Anthropic + OpenAI	2025-??-??	Not directly about worker adaptation, but important for the training-vs-deployment debate: labs are beginning to cross-evaluate alignment, yet even strong alignment results do not remove the need for deployment-side controls.
t24	Grok 4.1 Model Card	xAI	2025-11-17	Useful as a comparator showing that by late 2025 frontier labs beyond the usual three were publishing pre-deployment safety evaluations and distinguishing between base model and production-prompt behavior.
t25	Claude’s Constitution	Anthropic	2026-01-21	Relevant to the training side of the training-vs-orchestration question. It documents the welfare and normative principles Anthropic is using to shape model behavior, making the contrast with absent worker-welfare constitutions more salient.

Academic & arXiv

ID	Title	Outlet	Date	Significance
a1	Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile	NIST	2024	Best standards-oriented source for framing human oversight and post-deployment risk as a systems problem rather than a pure model problem.
a2	The Rapid Adoption of Generative AI	NBER / SSRN	2024	Important baseline for the diffusion side of the story; shows that adaptation is real and broad before many firms had mature operating models.
a3	Algorithm-enabled Decision Support and Worker Learning: a Large-Scale Field Experiment	SSRN	2024	Directly relevant to whether AI complements judgment and learning or merely substitutes for them.
a4	How human–AI feedback loops alter human perceptual, emotional and social judgements	Nature Human Behaviour	2024	One of the clearest papers on downstream cognitive harms from AI-mediated judgment, beyond simple one-shot error rates.
a5	RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts	arXiv / METR	2024	Foundational for understanding where human oversight shifts from execution to orchestration when agents become strong on open-ended work.
a6	Shifting Work Patterns with Generative AI	NBER	2025	Strong evidence that benefits may show up first in time allocation and work pattern shifts, not necessarily in measured task substitution.
a7	A Task-Based Approach to Generative AI: Evidence from a Field Experiment in Central Banking	SSRN	2025	Useful counterweight to vendor-heavy enterprise narratives; tests AI in a serious knowledge-work environment with regulated stakes.
a8	The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers	SSRN	2025	One of the strongest 2025 papers on heterogeneous gains by experience level.
a9	HCAST: Human-Calibrated Autonomy Software Tasks	arXiv / METR	2025	Excellent anchor for human-agent ratio thinking, escalation thresholds, and deciding when not to force autonomy.
a10	Making AI Count: The Next Measurement Frontier	NBER	2025	Directly relevant to rejecting token-maxing and activity-based KPI systems.
a11	Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality	Organization Science	2026	Still the most useful operating-model metaphor for when AI helps versus when it silently degrades judgment.
a12	Generative AI without guardrails can harm learning: Evidence from high school mathematics	PNAS	2025	One of the cleanest deskilling/cognitive-offloading papers in the period.
a13	Turning Off Your Better Judgment: Algorithmic Conformity in AI-Human Collaboration	SSRN	2025	Highly relevant to automation complacency, deference, and the hidden costs of making AI too easy to obey.
a14	Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling	ACL 2025	2025	Concrete evidence for 'positive friction' as a deployment design pattern.
a15	REL-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance	NAACL 2025	2025	Important benchmark/methodology contribution for studying overreliance as a system property.
a16	LLMs Trust Humans More, That's a Problem!	ACL 2025	2025	Useful for the inverse problem: not only do humans overtrust models, models can overtrust humans, destabilizing mixed-initiative workflows.
a17	Exploring model welfare	Anthropic research note	2025	Not a worker-outcomes paper, but central to the user's question about the asymmetry between emerging model welfare functions and missing worker welfare functions.
a18	Firm Data on AI	NBER	2026	Best 2026 macro-adoption source for organizations; useful for separating firm-level adoption from worker-level use.
a19	The Microstructure of AI Diffusion: Evidence from Firms, Business Functions, and Worker Tasks	NBER	2026	Directly relevant to the limits of mandates and why usage quotas are a weak proxy for value.
a20	From Adoption to Outcomes: AI-Specific Implementation Gaps in the First 18 Months	SSRN	2026	One of the most directly relevant papers for the 'don't force adoption; redesign the operating model' argument.
a21	Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents	arXiv	2026	Probably the most on-point 2026 paper for the practical reality of supervising agents rather than merely 'using AI tools'.
a22	Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering	arXiv	2026	Directly supports the thesis that the constraining resource becomes human attention, not model throughput.
a23	Human Tool: An MCP-Style Framework for Human-Agent Collaboration	arXiv	2026	A strong candidate design pattern for structuring oversight instead of scattering interruptions across workers.
a24	HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems	arXiv	2026	Rare paper that explicitly connects governance strength to workload buffering rather than treating governance as pure drag.
a25	New Report: Challenges to the Monitoring of Deployed AI Systems	NIST AI 800-4 announcement/report	2026	Useful evidence that monitoring and oversight costs are not incidental; they are recurring deployment burdens.

VC & Analyst Reports

ID	Title	Outlet	Date	Significance
v1	How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025	Andreessen Horowitz (a16z)	2025-05-08	Useful for adoption curves, build-vs-buy behavior, and evidence that deployment design is becoming more important than model novelty.
v2	16 Changes to the Way Enterprises Are Building and Buying Generative AI	Andreessen Horowitz (a16z)	2024-04-16	Good baseline for 2024 organizational behavior and why forced rollout into higher-risk workflows had not yet happened broadly.
v3	Where Enterprises are Actually Adopting AI	Andreessen Horowitz (a16z)	2026-04-15	Helpful for distinguishing real adoption from theater and for understanding where human workflow redesign is easier.
v4	AI 50: Companies of the Future	Sequoia Capital	2024-04-11	Strong for investment thesis and early productivity framing.
v5	The Stochastic Mindset	Sequoia Capital	2025-01-22	One of the few VC pieces directly relevant to human cognitive adaptation and judgment under AI.
v6	The Always-On Economy: AI and the Next 5-7 Years	Sequoia Capital	2025-04-21	Useful for the user's concern that human oversight may be mismatched to machine-speed workflows.
v7	Here’s how leading strategy teams are successfully driving generative AI adoption in their organizations	CB Insights	2025-01-16	Strong evidence against equating organizational enthusiasm with real deployment success.
v8	Enterprise AI agents & copilots: Our growth projections for the $5B+ market	CB Insights	2025-04-29	Useful for market sizing and for understanding why organizations are being pushed toward agentic operating models.
v9	Should enterprises adopt closed-source or open-source AI models?	CB Insights	2025-02-12	Directly supports the idea that orchestration and deployment choices matter as much as, or more than, allegiance to one model family.
v10	State of AI Report: 6 trends shaping the landscape in 2025	CB Insights	2025-01-30	Good macro context for why adoption pressure intensified inside organizations in 2025.
v11	What’s next for AI agent ROI?	CB Insights	2026-03	Very useful for the user's question about whether outcomes depend more on model training or orchestration/deployment design.
v12	Gartner Identifies the Top 10 Strategic Technology Trends for 2025	Gartner	2024-10-21	Important strategic framing for machine-speed delegation and the human oversight problem.
v13	Hype Cycle for Generative AI, 2024	Gartner	2024-07-31	Useful for mapping maturity and for cautioning against top-down overcommitment to immature categories.
v14	Gartner Says Everyday AI and Digital Employee Experience Are Two Years Away from Mainstream Adoption	Gartner	2024-08-14	One of the clearest analyst proxies for a worker-welfare or human-experience lens.
v15	AI Agent Layer: Why CIOs Must Lead Enterprise Transformation	Gartner	2025-05	Directly relevant to operating-model ownership and governance for human-machine work allocation.
v16	The State Of Generative AI, 2024	Forrester	2024-01	Good baseline source for early-2024 enterprise posture.
v17	Tech Pulse Q4 2024: How IT Builds An AI Advantage By Embracing AI Tools And Agents	Forrester	2025-03-05	Helpful for understanding how heavy-use functions adapt in practice.
v18	Forrester: Three Years Into GenAI, Enterprises Are Still Chasing Its True Transformative Value	Forrester	2026-04-02	One of the strongest sources here against forced adoption and usage-target theater.
v19	Introducing The Forrester AI Value Matrix: A Framework For Measuring What Matters	Forrester	2026-04-24	Directly useful for separating true value creation from activity metrics like token counts or superficial usage.
v20	Superagency in the workplace: Empowering people to unlock AI’s full potential	McKinsey	2025-01-28	Central source for human adaptation, adoption gaps, and the case for work redesign rather than mere tool provision.
v21	Gen AI’s next inflection point: From employee experimentation to organizational transformation	McKinsey	2024-08-07	Useful against top-down mandate logic; suggests bottom-up use preceded managerial structure.
v22	Agents, robots, and us: Skill partnerships in the age of AI	McKinsey	2025-11-25	One of the best sources in this lane for operating-model redesign around human-machine complementarity.
v23	The State of AI: How organizations are rewiring to capture value	McKinsey	2025-11-05	Strong adoption-curve evidence and a useful bridge from experimentation to operating model.
v24	The state of AI in early 2024: Gen AI adoption spikes and starts to generate value	McKinsey	2024-06-07	Best early-period baseline for the 2024-2026 adoption curve.
v25	Generative AI virtually ubiquitous in global business as the technology spreads at a near-unprecedented rate	Bain & Company	2024-06-20	Strong evidence on budget pressure and why executives may default to mandates.

Blogs & Independent Thinkers

ID	Title	Outlet	Date	Significance
b1	Real AI Agents and Real Work	One Useful Thing	2025-09-29	Strong on agent supervision, review workflows, and the risk of 'infinite PowerPoints' instead of real value. (oneusefulthing.org)
b2	Claude Dispatch and the Power of Interfaces	One Useful Thing	2026-03-31	Directly relevant to cognitive overload and designing around human attention rather than machine throughput. (oneusefulthing.org)
b3	Management as AI superpower	One Useful Thing	2026-01-27	Useful for operating-model design and the economics of supervision overhead. (oneusefulthing.org)
b4	Making AI Work: Leadership, Lab, and Crowd	One Useful Thing	2025-05-25	Good antidote to top-down mandate logic. (oneusefulthing.org)
b5	Choosing to Stay Human	One Useful Thing	2026-05-26	Independent reflection on erosion of judgment and behavioral adaptation. (oneusefulthing.org)
b6	What it feels like to work with Mythos	One Useful Thing	2026-06-09	Useful as a late-period marker for how human-AI relations were shifting by June 2026. (oneusefulthing.org)
b7	Coding agents require skilled operators	Simon Willison’s Weblog	2025-06-18	Clean statement of why forcing novice adoption can destroy value. (simonwillison.net)
b8	The AI Hangover	Quandary Labs Substack	2026-01-14	Directly addresses poor budget outcomes and the failure of activity metrics. (substack.quandarylabs.ai)
b9	Digital Economy Dispatch #264 -- AI Bottlenecks, Jagged Edges, and the Real Barriers to AI-at-Scale	Dispatches	2026-01-06	Good independent synthesis linking frontier capability to deployment friction. (dispatches.alanbrown.net)
b10	Attention Ecology	Human OS Manual	2026-03-29	Useful independent human-factors framing for cognitive load and operating-model design. (thehumanosmanual.com)
b11	Vibe Coding, Windsurf and Anthropic, ChatGPT Connectors	Stratechery	2025-06-09	Helpful for the thesis that orchestration and integration layers matter as much as models. (stratechery.com)
b12	Oversight Assistants: Turning Compute into Understanding	LessWrong	2026-01-07	One of the clearest pieces on why human-in-the-loop supervision at machine speed breaks down. (lesswrong.com)
b13	No, We're Not Getting Meaningful Oversight of AI	LessWrong / GreaterWrong	2025-07-09	Useful skeptical counterweight to simplistic HITL narratives. (greaterwrong.com)
b14	Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate	LessWrong	2026-05-26	Important for the 'training vs deployment' question because oversight degrades with system design choices. (lesswrong.com)
b15	Is AI welfare work puntable?	LessWrong	2026-05-12	Useful for the asymmetry between formalized model welfare and diffuse human welfare governance. (lesswrong.com)
b16	Exploring model welfare	Anthropic News	2025-04-24	This is the clearest primary-source marker of labs building a model-welfare function. (anthropic.com)
b17	Introducing the Anthropic Economic Index	Anthropic Research	2025-02-10	Important empirical counterweight to sweeping displacement narratives. (anthropic.com)
b18	Anthropic Economic Index report: Uneven geographic and enterprise AI adoption	Anthropic Research	2025-09-18	Useful on heterogeneity across firms and regions rather than one flat adoption curve. (anthropic.com)
b19	AI Use at Work Has Nearly Doubled in Two Years	Gallup	2025-06-16	Anchors the adaptation story in measured adoption data. (gallup.com)
b20	Gen Z's AI Adoption Steady, but Skepticism Climbs	Gallup	2026-04-10	Useful corrective to simplistic generational narratives. (news.gallup.com)
b21	Humans in the Loop: Executive Summary	MIT Institute for Work and Employment Research / Industrial Performance Center	2026-04-08	Strong evidence that good deployment must be designed around human motivation and identity, not just efficiency. (ipc.mit.edu)
b22	Designing Human-AI Collaboration: A Sufficient-Statistic Approach	NBER	2025-06-01	One of the most important formal pieces for operating-model design and effort crowd-out. (nber.org)
b23	Bias in the Loop: How Humans Evaluate AI-Generated Suggestions	Harvard Data Science Review / MIT Press	2026-04-30	Relevant to automation complacency, over-trust, and vigilance fatigue. (hdsr.mitpress.mit.edu)
b24	Beyond the Principle: How Organizations Implement Human-in-the-Loop Oversight for Generative AI	AMCIS 2026 Proceedings	2026-08-??	Useful for concrete design patterns rather than abstract calls for oversight. (aisel.aisnet.org)
b25	Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs	arXiv	2026-03-25	Direct evidence for the training-versus-orchestration question; it strongly favors orchestration. (arxiv.org)

Tech Industry & Practitioner

ID	Title	Outlet	Date	Significance
p1	2024 Accelerate State of DevOps Report	DORA	2024	Strong baseline on organizational conditions that predict whether AI will help or harm teams.
p2	State of AI-assisted Software Development 2025	DORA	2025-09-23	Most directly relevant practitioner source for AI adoption outcomes in software organizations.
p3	Balancing AI tensions: Moving from AI adoption to effective SDLC use	DORA	2026-03-10	Directly addresses cognitive burden and why apparent productivity can coexist with higher instability.
p4	How customization supports developer engagement	DORA	2025-09-23	Useful for the question of training versus orchestration: deployment and interface design materially shape outcomes.
p5	Coding throughput as a measure of productivity	Thoughtworks Technology Radar	2026-04-15	Best practitioner source in this set against token-maxing and superficial AI activity metrics.
p6	Complacency with AI-generated code	Thoughtworks Technology Radar	2025-11-05	Strong practitioner articulation of vigilance fatigue and degraded review quality.
p7	Thoughtworks Tech Radar 30th Edition: Team AI	Thoughtworks	2024-04-03	Early practitioner framing that already centers flow disruption, not just capability upside.
p8	Macro trends in the tech industry | April 2026	Thoughtworks	2026-04-15
p9	DevEx in Action	ACM Queue	2024-01-14	One of the strongest practitioner-methodology sources on human cognitive load and software work design.
p10	For AI Productivity Gains, Let Team Leaders Write the Rules	MIT Sloan Management Review	2025-10-15	Direct evidence against purely top-down AI mandates.
p11	Want AI-Driven Productivity? Redesign Work	MIT Sloan Management Review	2025-05-01	Good bridge source between AI capability and work redesign.
p12	AI-Generated "Workslop" Is Destroying Productivity	Harvard Business Review	2025-09-22	One of the clearest practitioner critiques of usage mandates and output theater.
p13	AI Doesn’t Reduce Work - It Intensifies It	Harvard Business Review	2026-02-09	Useful practitioner framing for the hidden labor of supervision and oversight.
p14	Workers Don’t Trust AI. Here’s How Companies Can Change That.	Harvard Business Review	2025-11-07	Relevant to worker welfare governance and shadow-AI behavior under forced adoption.
p15	Stack Overflow’s 2024 Developer Survey Shows the Gap Between AI Use and Trust in its Output Continues to Widen Among Coders	Stack Overflow	2024-07-24	Large-sample baseline of developer sentiment at the start of the date range.
p16	Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here	Stack Overflow	2025-12-29	Confirms that the trust gap persisted as AI use grew.
p17	Agents on a leash: Agentic AI remains mostly single-agent and monitored at work	Stack Overflow	2026-05-27	Good current practitioner evidence that organizations have not normalized free-running multi-agent autonomy at work.
p18	The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers	Microsoft Research	2025	Key source for cognitive offloading and complacency risks.
p19	The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers	Microsoft Research	2025-06	One of the strongest causal sources on measured developer productivity benefits.
p20	Dear Diary: A Randomized Controlled Trial of Generative AI Coding Tools in the Workplace	Microsoft Research	2025-04	Important source on the social and interpretive side of AI coding-tool adoption.
p21	Data Centers May House AI - But Operators Don’t Trust AI (Yet)	IEEE Spectrum	2025	A strong practitioner analog for why humans resist handing over consequential operational control even when AI capability rises.
p22	Exploring model welfare	Anthropic	2025-04-24	Central source for the contrast between explicit model welfare and the lack of an equivalent worker-welfare operating function.
p23	Claude Opus 4.6 System Card	Anthropic	2026	Concrete evidence that model welfare has become operationalized in frontier-model governance.
p24	Anthropic Economic Index report: Uneven geographic and enterprise AI adoption	Anthropic	2025-09-15	Useful against simplistic mandate thinking: even rapid adoption is uneven and context-bound.
p25	Anthropic Economic Index report: Learning curves	Anthropic	2026-03	Relevant to differences by role and capability rather than flattening all workers into one adoption story.