Research sweep · deep · 2024 – 2026

Designing AI Operating Models Around Humans

How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits and harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. Scope: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how far good human outcomes depend on model training versus orchestration and deployment design.

  • GPT-5.5
  • financial
  • frontier
  • academic
  • vc
  • blogs
  • tech

Synthesised 2026-06-15

Humans are becoming the bottleneck in AI operating models

Overview

AI adoption moved faster than organisational redesign between June 2024 and June 2026. Gallup found that workplace AI use had nearly doubled in two years by mid-2025, while NBER’s adoption research showed generative AI spreading faster than the personal computer or the internet did at comparable points after their introduction. Yet the strongest evidence now points to a less convenient conclusion: access to AI is not the same as value capture, and value capture is not the same as good human outcomes.
Sources: Gallup (2025); NBER (2024)

The defining shift of the period was the move from chat assistance to agentic work. Anthropic introduced computer use in Claude, Google framed Gemini 2.0 as a model for the “agentic era”, OpenAI released Operator and deep research with accompanying system cards, and Meta published LlamaFirewall as a guardrail system for AI agents. These releases changed the human problem: workers were no longer only writing prompts; they were also assigning work, monitoring tools, reviewing outputs, catching failures and deciding when to intervene.
Sources: Anthropic (2024); Google DeepMind (2024); OpenAI (2025); OpenAI (2025); Meta AI (2025)

The benefits are real, but they are task-shaped rather than universal. Field experiments with software developers, central bankers and other knowledge workers show speed and productivity gains in bounded tasks, while professional-services and enterprise surveys show rising use in drafting, summarisation, research and workflow acceleration. The harms are also task-shaped: over-reliance, cognitive offloading, deskilling risk, review burden, poor judgement on out-of-frontier tasks, and new coordination costs when AI output flows to colleagues.
Sources: SSRN (2025); SSRN (2025); Organization Science (2026); Thomson Reuters Institute (2024); Thomson Reuters Institute (2025)

The operating-model question therefore matters more than the adoption question. The organisations that treat AI as a licence, mandate or token budget risk creating workslop, review debt and trust decay. The organisations that design around human attention, verification cost, escalation thresholds and team-level rules have a better chance of turning AI into useful work rather than more work.
Sources: Thoughtworks Technology Radar (2026); MIT Sloan Management Review (2025); MIT Sloan Management Review (2025); Harvard Business Review (2025)

Key milestones, June 2024 to June 2026
Q2 2024
  • Enterprise genAI adoption spikes
  • GenAI risk profiles become operational guidance
Q3 2024
  • Everyday AI and digital employee experience move onto CIO roadmaps
Q4 2024
  • Computer use and agentic models enter workplace products
  • Reasoning system cards become standard evidence
Q1 2025
  • AI task-use indexes map real adoption
  • Human-AI teaming meta-analysis challenges simple augmentation claims
  • Operator and deep research shift AI toward delegated work
Q2 2025
  • Model welfare becomes a formal lab topic
  • Agent guardrails and firewalls appear
  • Work redesign replaces experimentation as the analyst consensus
Q3 2025
  • AI-assisted software delivery shows productivity and review-load tensions
  • Workslop becomes a named productivity harm
Q4 2025
  • Enterprise AI value measurement tightens
  • Reported time savings meet ROI scrutiny
Q1 2026
  • Human plus AI organisation design becomes an executive theme
  • Monitoring deployed AI systems becomes a named governance problem
Q2 2026
  • Worker welfare gap becomes visible
  • Agent oversight and overload become explicit research objects

Sources: McKinsey (2024); NIST (2024); Gartner (2024); Anthropic (2024); Google DeepMind (2024); OpenAI (2024); Anthropic (2025); Nature Human Behaviour (2025); OpenAI (2025); OpenAI (2025); Anthropic (2025); Meta AI (2025); DORA (2025); Harvard Business Review (2025); Forrester (2026); Economist Impact (2026); NIST AI 800-4 announcement/report (2026); arXiv (2026); arXiv (2026)

Key Findings

1. Adoption is broad, but not deep enough to infer value

The sweep converges on one distinction: adoption has become normal, but productive absorption has not. McKinsey reported a spike in genAI adoption in early 2024, Bain called generative AI virtually ubiquitous in global business, and Gallup later found workplace AI use had nearly doubled in two years. Those sources support a diffusion story, not a return-on-investment story.
Sources: McKinsey (2024); Bain & Company (2024); Gallup (2025)

The stronger analyst reports moved from “who has access?” to “who has rewired work?” McKinsey’s 2025 state-of-AI work focused on organisational rewiring, Forrester’s 2026 value matrix argued for measuring what matters, and Gartner’s everyday-AI coverage linked AI progress to digital employee experience rather than raw deployment.
Sources: McKinsey (2025); Forrester (2026); Gartner (2024)

2. The best productivity gains appear in bounded, verifiable work

The clearest empirical gains come from tasks with visible outputs, tolerable error costs and manageable verification. Management Science and Microsoft Research field experiments found productivity improvements among software developers using generative AI tools, while SSRN work in central banking found gains when the task structure fitted model capabilities.
Sources: Management Science (2025); Microsoft Research (2025); SSRN (2025)

The same pattern appears in enterprise data. Anthropic’s Economic Index showed uneven task adoption, with software and other digitally mediated work heavily represented, while Thomson Reuters found rising generative AI use in professional services. These are settings where workers can often compare AI output against domain standards, documents, code or client requirements.
Sources: Anthropic (2025); Anthropic (2025); Thomson Reuters Institute (2025)

3. Human-AI teams do not automatically beat humans or AI alone

The Nature Human Behaviour systematic review and meta-analysis matters because it interrupts the easy “human plus AI” slogan. It found that, on average, human-AI combinations performed worse than the better of humans or AI alone: decision tasks tended to show losses, while gains appeared mainly in content-creation tasks and where humans already outperformed the AI. Combinations are useful only under certain task and design conditions, not inherently superior.
Sources: Nature Human Behaviour (2025)

That finding lines up with field evidence on the jagged frontier. Organization Science research on knowledge workers found that AI improved performance inside the frontier but could harm quality when workers applied it to tasks outside its strengths. The lesson is not that humans should always stay in the loop, but that the loop must be designed around task fit, verification cost and error consequences.
Sources: Organization Science (2026); NBER (2025)

4. The role split is by judgement and autonomy, not by generation

The “Gen Z versus everyone” framing misses the operating reality. NBER and field-experiment evidence suggests less-experienced workers can gain more on some tasks because AI supplies templates, language, examples and procedural guidance. At the same time, junior workers face a learning risk if AI removes the difficult practice through which judgement forms.
Sources: NBER / SSRN (2024); SSRN (2025); PNAS (2025)

Senior workers often gain from AI because they have enough domain judgement to detect weak outputs, decompose tasks and decide when to ignore the model. Simon Willison made this point sharply for coding agents, arguing that they require skilled operators rather than novice button-pushers. The implication is a class and career-stage split: AI can compress some performance gaps while widening rewards for people who already know what good looks like.
Sources: Simon Willison’s Weblog (2025); Stack Overflow (2026); Economist Impact (2025)

5. Cognitive offloading is now measurable enough to manage

Microsoft Research found that knowledge workers who used generative AI reported reductions in cognitive effort, with confidence in AI associated with less critical thinking and self-confidence associated with more critical thinking. This does not prove that AI makes workers less capable, but it does show a behavioural pattern that organisations need to manage: people adapt their effort to the tool.
Sources: Microsoft Research (2025)

Academic work on algorithmic conformity, human-AI feedback loops and AI tutoring adds a stronger warning. Repeated exposure to AI advice can shift human judgement, and guardrail-free AI tutoring can harm later learning even when it helps with immediate answers. The harm is not only a bad output; it is a changed worker or learner.
Sources: Nature Human Behaviour (2024); PNAS (2025); SSRN (2025)

6. Agentic systems turn production work into supervision work

Agentic releases changed the cost structure of work. OpenAI’s Operator and deep research system cards, Anthropic’s computer-use release and Meta’s agent guardrail work all assume systems that can act across tools, websites or workflows. That shifts human labour toward instruction, monitoring, interruption, correction and exception handling.
Sources: OpenAI (2025); OpenAI (2025); Anthropic (2024); Meta AI (2025)

Direct evidence on multi-agent vigilance fatigue in ordinary enterprises remains thin, but the software-agent evidence is moving in that direction. Two 2026 arXiv papers, one on human oversight of agentic systems and one on overload in AI-assisted software engineering, described oversight as a real and costly burden, while Stack Overflow found that agentic AI at work remained mostly single-agent and monitored. Organisations are not yet letting agents run free, but they are already asking humans to become traffic controllers.
Sources: arXiv (2026); arXiv (2026); Stack Overflow (2026)
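The traffic-controller burden can be made concrete with a back-of-envelope capacity check. This is a minimal sketch under stated assumptions, not a model from the cited studies: it assumes each agent raises review or intervention events at a steady rate, each event costs the reviewer a fixed number of minutes, and the reviewer has only a limited number of focused minutes per hour. `AgentLoad`, `reviewer_utilisation`, `max_agents` and the 0.75 utilisation target are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentLoad:
    """Hypothetical oversight demand one agent places on a reviewer."""
    events_per_hour: float    # review or intervention requests the agent raises
    minutes_per_event: float  # reviewer minutes to inspect, approve or correct one event


def reviewer_utilisation(agents: list[AgentLoad], focus_minutes_per_hour: float) -> float:
    """Share of the reviewer's focused-attention budget the agents consume.

    Above 1.0 the reviewer cannot keep up, so events queue or get rubber-stamped.
    """
    demand = sum(a.events_per_hour * a.minutes_per_event for a in agents)
    return demand / focus_minutes_per_hour


def max_agents(per_agent: AgentLoad, focus_minutes_per_hour: float,
               target_utilisation: float = 0.75) -> int:
    """Largest number of identical agents that keeps utilisation at or below the target."""
    per_agent_minutes = per_agent.events_per_hour * per_agent.minutes_per_event
    if per_agent_minutes <= 0:
        raise ValueError("each agent must generate some review demand")
    return math.floor(target_utilisation * focus_minutes_per_hour / per_agent_minutes)


# Illustrative numbers only: 4 events an hour at 3 minutes each, 40 focused minutes an hour.
agent = AgentLoad(events_per_hour=4, minutes_per_event=3)
print(max_agents(agent, focus_minutes_per_hour=40))  # 2
print(reviewer_utilisation([agent] * 3, 40))         # 0.9
```

The point is the shape of the constraint rather than the numbers: oversight capacity is bounded by reviewer attention, and faster agents raise `events_per_hour` without enlarging that budget.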

7. Top-down mandates look weak when they measure activity rather than outcomes

The strongest evidence does not support blanket AI mandates as a value strategy. Forrester argued in 2026 that many enterprises were still chasing the true value of genAI three years in, while its AI Value Matrix pushed firms toward value measures rather than adoption counts. Thoughtworks specifically warned against coding throughput as a productivity measure because output volume can increase review load, defects and downstream risk.
Sources: Forrester (2026); Forrester (2026); Thoughtworks Technology Radar (2026)

The most credible operating advice favours local rules within enterprise guardrails. MIT Sloan Management Review argued that team leaders should write AI rules for productivity gains, and DORA described AI as an amplifier of existing software-delivery systems rather than a cure for weak engineering practice. That cuts against usage quotas, token-maximising programmes and executive theatre.
Sources: MIT Sloan Management Review (2025); DORA (2025); DORA (2026)

8. Workslop is the social form of bad AI ROI

AI-generated workslop names a common failure mode: output that looks plausible enough to pass upward, but is vague, wrong or incomplete enough to impose work on someone else. Harvard Business Review framed it as productivity destruction, not mere annoyance. This bridges the human-factors and ROI arguments because the cost appears as peer review, rework, reputation damage and slower decision-making.
Sources: Harvard Business Review (2025)

This is why token counts and prompt counts are poor management metrics. They capture machine activity and user compliance, not whether work moved faster, quality improved, risk fell or cognitive load became sustainable. OpenAI’s workspace analytics and enterprise reporting may help firms see usage, but usage telemetry must be joined to business outcomes and human workload measures.
Sources: OpenAI Help Center (2026); Forrester (2026); NBER (2025)
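One way to join usage telemetry to outcomes and workload is to charge downstream review and rework back to the task that produced the AI output. The sketch below assumes a hypothetical per-task record; `TaskRecord` and its field names are illustrative and are not taken from any vendor's analytics schema. It puts the activity view (prompts, gross time saved) beside a net view that subtracts the costs workslop pushes onto colleagues.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task record joining usage telemetry to downstream effort."""
    task_id: str
    prompts: int                 # activity metric of the kind usage dashboards count
    author_minutes_saved: float  # time the author reports or is estimated to have saved
    reviewer_minutes: float      # extra time colleagues spent checking the output
    rework_minutes: float        # time spent fixing problems found after hand-off


def net_minutes_saved(t: TaskRecord) -> float:
    """Author savings minus the review and rework pushed onto others."""
    return t.author_minutes_saved - t.reviewer_minutes - t.rework_minutes


def summarise(records: list[TaskRecord]) -> dict[str, float]:
    """Put the activity view (prompts, gross savings) beside the net-outcome view."""
    if not records:
        return {"prompts": 0.0, "gross_minutes_saved": 0.0,
                "net_minutes_saved": 0.0, "share_net_negative": 0.0}
    nets = [net_minutes_saved(r) for r in records]
    return {
        "prompts": float(sum(r.prompts for r in records)),
        "gross_minutes_saved": sum(r.author_minutes_saved for r in records),
        "net_minutes_saved": sum(nets),
        "share_net_negative": sum(1 for n in nets if n < 0) / len(records),
    }


# Illustrative records only; not data from any cited study.
records = [
    TaskRecord("brief-001", prompts=5, author_minutes_saved=30, reviewer_minutes=5, rework_minutes=0),
    TaskRecord("brief-002", prompts=20, author_minutes_saved=40, reviewer_minutes=25, rework_minutes=30),
]
print(summarise(records))
```

In this illustrative pair the high-prompt task looks like the heavier AI user and reports the larger saving, yet loses time once review and rework are counted: the pattern the workslop argument predicts and that prompt counts alone would hide.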

9. Labs have model welfare, but firms lack worker welfare

Anthropic made model welfare an explicit research topic in April 2025 and continued to formalise model behaviour through constitutions, system cards and responsible-scaling updates. That is a notable institutional development: one frontier lab now has language and staff attention for possible model interests or model treatment.
Sources: Anthropic (2025); Anthropic (2026); Anthropic (2026)

No equivalent worker-welfare function appears across the enterprise sources. The closest substitutes are digital employee experience, responsible AI governance, HR-led redesign, DevEx programmes and AI risk management. Those are useful, but they do not yet create a clear owner for attention load, judgement erosion, deskilling, surveillance pressure, escalation burden or adoption externalities.
Sources: Gartner (2024); NIST (2024); ACM Queue (2024); Economist Impact (2026)

10. Training matters, but deployment design decides most human outcomes

Model training shapes capability, refusal behaviour, tone, tool discipline and safety boundaries. The system-card record from OpenAI, Anthropic, xAI and others shows labs spending more effort on evaluations, mitigations and release conditions as models become more agentic.
Sources: OpenAI (2024); OpenAI (2025); Anthropic (2025); xAI (2025)

The main workplace harms, however, appear at the orchestration layer. Bad task allocation, poor interfaces, constant interruptions, unclear escalation rules, unbudgeted review time and weak measurement can turn a capable model into a net burden. NIST’s monitoring work, DORA’s software-delivery findings, MIT Sloan’s work-redesign advice and Mollick’s interface-centred writing all point to the same lever: design the system around human attention and behaviour.
Sources: NIST AI 800-4 announcement/report (2026); DORA (2025); MIT Sloan Management Review (2025); One Useful Thing (2026)

Evidence & Data

The most useful quantitative evidence falls into four buckets.

First, adoption. NBER’s rapid-adoption research measured unusually fast diffusion of generative AI among US adults and workers, while Gallup found workplace AI use had nearly doubled in two years. Thomson Reuters reported that generative AI adoption nearly doubled in professional services by 2025, and Anthropic’s Economic Index mapped usage by task and geography rather than relying only on surveys.
Sources: NBER / SSRN (2024); Gallup (2025); Thomson Reuters Institute (2025); Anthropic Research (2025)

Second, task productivity. Microsoft Research and Management Science field experiments with software developers found measurable productivity gains, and OpenAI’s own enterprise report found that workers saved nearly an hour a day on average. These findings matter because they come from workplace settings or enterprise reporting, but they need careful interpretation: the OpenAI figure is vendor-reported, and saved time only becomes value when organisations decide what the time is for.
Sources: Microsoft Research (2025); Management Science (2025); OpenAI (2025)

Third, learning and judgement. PNAS found that generative AI without guardrails could harm learning in high-school mathematics, while Microsoft Research found self-reported reductions in cognitive effort among knowledge workers. SSRN work on algorithmic conformity and Nature Human Behaviour work on human-AI feedback loops show that AI can alter human judgement, not merely assist it.
Sources: PNAS (2025); Microsoft Research (2025); SSRN (2025); Nature Human Behaviour (2024)

Fourth, developer trust and review. Stack Overflow’s 2024 and 2025 surveys showed the gap between willingness to use AI and trust in its output, while its 2026 agentic-AI work found agents remained mostly monitored at work. DORA and Thoughtworks then explain why this matters operationally: AI can increase apparent output while adding review, integration and risk-management load.
Sources: Stack Overflow (2024); Stack Overflow (2025); Stack Overflow (2026); DORA (2025); Thoughtworks Technology Radar (2026)

Signals & Tensions

  1. Vendor telemetry is improving, but independence remains uneven. Anthropic’s Economic Index and OpenAI’s enterprise reporting offer more granular evidence than ordinary surveys, yet they still reflect product-specific populations and provider incentives. NBER, academic field experiments and independent surveys remain the stronger anchors for general claims.
    Sources: Anthropic (2025); OpenAI (2025); NBER / SSRN (2024); SSRN (2025)

  2. Agent rhetoric is ahead of enterprise practice. Frontier labs and investors describe increasingly autonomous agents, but Stack Overflow found workplace agents remained mostly monitored and single-agent in 2026. That gap suggests organisations still distrust unsupervised autonomy, or cannot yet absorb it safely.
    Sources: OpenAI (2025); CB Insights (2025); Stack Overflow (2026)

  3. Human-in-the-loop is overused as a comfort phrase. NIST’s monitoring work and arXiv studies of agent oversight show that monitoring deployed AI systems is difficult and labour-intensive. LessWrong writers make the sharper version of the same argument: a human reviewer does not create meaningful oversight if the system is too fast, opaque or complex to inspect.
    Sources: NIST AI 800-4 announcement/report (2026); arXiv (2026); LessWrong (2026); LessWrong (2026)

  4. The ROI debate is less about whether AI works than where the costs land. Field experiments show gains, but HBR’s workslop argument and DORA’s delivery findings show that output can move cost downstream. The open management problem is whether firms can see the whole workflow rather than celebrating the first person’s saved time.
    Sources: Microsoft Research (2025); Harvard Business Review (2025); DORA (2025)

  5. The worker-welfare gap is underreported. AI safety institutions now discuss model behaviour, model welfare and frontier risk, while enterprise AI governance still tends to discuss compliance, productivity, risk and skills. Attention load, judgement quality and career development rarely have a named executive owner.
    Sources: Anthropic (2025); METR (2025); McKinsey (2025); Gartner (2024)

Open Questions

  1. Enterprises still lack good measures of cognitive load in AI work. Usage telemetry can count prompts, seats and tokens, but it does not measure context switching, vigilance fatigue, interruption cost or the time spent checking AI output. NIST and NBER both identify AI measurement and monitoring as live problems.
    Sources: NBER (2025); NIST AI 800-4 announcement/report (2026)

  2. No one knows the safe agent-to-human ratio for ordinary knowledge work. Software-agent studies show oversight burden, and Stack Overflow shows continued monitoring, but there is little field evidence on how many agents one worker can supervise without quality collapse.
    Sources: arXiv (2026); arXiv (2026); Stack Overflow (2026)

  3. The long-term learning effect remains unresolved. PNAS shows harm from unguarded AI in mathematics learning, while workplace studies show near-term productivity gains. The missing evidence is longitudinal: whether workers who rely on AI develop judgement faster, slower or differently over years.
    Sources: PNAS (2025); Microsoft Research (2025)

  4. The budget effect of mandates is not well measured. Analyst and practitioner sources warn against activity metrics and top-down adoption theatre, but the sweep did not surface many independent studies that directly compare mandated AI programmes with voluntary, team-designed programmes.
    Sources: Forrester (2026); Thoughtworks Technology Radar (2026); MIT Sloan Management Review (2025)

  5. Worker-welfare governance has no settled home. HR, CIO, risk, legal, responsible AI and line managers each own fragments of the problem, but none naturally owns the full stack of workload, trust calibration, learning, surveillance pressure and job quality. Economist Impact, Gartner and NIST point to adjacent structures, not a mature function.
    Sources: Economist Impact (2026); Gartner (2024); NIST (2024)

  6. The training-versus-deployment split will become harder as agents improve. Better models may reduce some errors and review effort, but more capable agents also increase task duration, action scope and monitoring complexity. The practical answer for now is to treat model quality as necessary infrastructure, while making orchestration, pacing, escalation and accountability the centre of the operating model.
    Sources: OpenAI (2025); Anthropic (2025); NIST AI 800-4 announcement/report (2026); One Useful Thing (2025)

The practical implication is blunt: do not force humans to adapt to machine speed. Slow the workflow where judgement matters, batch review where attention is scarce, automate only where verification is cheap, and make one executive function accountable for the human cost of AI.
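These rules can be read as a simple task-routing policy. The sketch below is a hypothetical encoding, not a method from any cited source: it assumes each task is scored for verification cost and error consequence on a 0 to 1 scale and flagged when it depends on contested or tacit judgement. `Route`, `Task`, `route` and both thresholds are illustrative placeholders a team would set for itself.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate with spot checks"
    BATCH_REVIEW = "AI drafts, human reviews in scheduled batches"
    HUMAN_LED = "human leads, AI assists at human pace"


@dataclass
class Task:
    name: str
    verification_cost: float  # 0 = trivially checkable, 1 = hard to verify
    error_consequence: float  # 0 = harmless if wrong, 1 = severe if wrong
    needs_judgement: bool     # depends on contested or tacit judgement?


def route(task: Task, cheap_verify: float = 0.3, high_stakes: float = 0.7) -> Route:
    """Hypothetical routing: automate only where checking is cheap and stakes are not high."""
    if task.needs_judgement or task.error_consequence >= high_stakes:
        return Route.HUMAN_LED     # slow the workflow where judgement matters
    if task.verification_cost <= cheap_verify:
        return Route.AUTOMATE      # cheap verification makes delegation tolerable
    return Route.BATCH_REVIEW      # review in batches rather than per-item interruptions


print(route(Task("format citations", 0.1, 0.2, False)).value)
print(route(Task("approve credit limit", 0.2, 0.9, False)).value)
```

Making the thresholds explicit is the design point: they become a team-level rule that can be inspected and changed, rather than an implicit mandate to route everything through AI.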




Sources



Financial Press

ID Title Outlet Date Significance
f1 OpenAI ChatGPT Enterprise Sees Surging Demand Despite Competition, COO Says Bloomberg 2024-04-04 Early marker that 2024 was shifting from experimentation to enterprise rollout. Useful for the adoption baseline and for understanding why executives began pushing organization-wide uptake before ROI evidence was mature.
f2 AI Agents Have Officially Entered the Workplace, Flaws and All Bloomberg 2024-10-24 One of the clearest business-press signals that the enterprise conversation had moved from copilots to agents. Useful on the operational reality that agents arrived with error, trust, and supervision problems intact.
f3 Big Tech’s New AI Obsession: Agents That Do Your Work for You Bloomberg 2024-12-13 Frames the market narrative heading into 2025: value expectations migrated from assistance to delegated action. Important background for later evidence on overload, governance, and false ROI expectations.
f4 OpenAI Finds AI Saves Workers Nearly an Hour a Day on Average Bloomberg 2025-12-08 One of the most cited late-2025 enterprise productivity claims. Important because it quantifies gains, but also because it is vendor-commissioned and therefore a good example of where measured benefits need careful weighting.
f5 Anthropic Finds Businesses Are Mainly Using AI to Automate Work Bloomberg 2025-09-15 Important for the augmentation-versus-automation debate. Suggests enterprise usage may be moving faster toward delegation than many 'human plus AI' narratives imply.
f6 Rethinking work: designing the 'human + AI' organisation Economist Impact 2026-01-20 Directly relevant to the brief’s core question: redesigning work and culture so AI elevates human capability rather than overwhelms it. Useful for executive framing of operating-model redesign.
f7 The AI glass floor Economist Impact 2025 Strong source on inequality of outcomes: wage premiums for AI skills, junior-career risks, and the possibility that AI advantage accrues disproportionately to already-advantaged workers.
f8 From intent to action: the leaders’ guide to building AI-powered workplaces Economist Impact 2025 Useful on the adoption-to-scale gap. Supports the argument that many firms can pilot AI but few convert it into repeatable business value.
f9 How Agentic AI is Reshaping Workplace Culture Economist Impact 2025 Covers the behavioral and cultural side of adoption rather than just tooling. Relevant to trust, communication, and framing AI as a tool rather than an imposed end-state.
f10 AI Use at Work Has Nearly Doubled in Two Years Gallup 2025-06-15 High-value independent survey evidence on actual employee adoption. Especially useful because it shows adoption differs sharply by white-collar versus frontline roles.
f11 AI Use at Work Rises Gallup 2025-12 Adds the managerial-support finding: adoption is materially associated with support and strategic integration, reinforcing that deployment design matters.
f12 Rising AI Adoption Spurs Workforce Changes Gallup 2026-04 Useful for the 2026 snapshot: higher usage, rising job concerns, and evidence that leaders report stronger productivity gains than other groups.
f13 The Rapid Adoption of Generative AI NBER 2024; revised 2025-02 A foundational independent paper on how quickly AI diffused into work. Important for separating broad adoption from proven firm-level productivity transformation.
f14 Firm Data on AI NBER 2026-03 Probably the single most useful independent business-economics source in this set. It surveys nearly 6,000 executives and finds widespread adoption but still-small realized effects on jobs and productivity, directly challenging inflated mandate narratives.
f15 In This Issue NBER Newsletter 2026-05 Useful summary pointer emphasizing the same pattern: adoption is broad, measured productivity effects remain modest, and executive expectations still run ahead of observed outcomes.
f16 2024 Generative AI in Professional Services Report Thomson Reuters Institute 2024 Professional services is a strong test bed because work is document-heavy, high-value, and supervision-intensive. Helpful for early evidence on where practitioners saw efficiency and quality gains.
f17 2025 Generative AI in Professional Services Report / Generative AI Adoption Nearly Doubles as Professional Services Reach Crossroads Thomson Reuters Institute 2025-04-15 Shows the move from individual use to enterprise integration remains incomplete. Good evidence against simplistic 'everyone must use AI more' mandates.
f18 4 AI Trends in Professional Services to Watch in 2025 Thomson Reuters Institute 2025 Useful on the gap between active personal usage and scaled workflow integration, which is central to the difference between activity metrics and value metrics.
f19 2025: The year the Frontier Firm is born Microsoft WorkLab 2025-04-24 Major vendor framing of the 'agent boss' model. Valuable not because it is neutral, but because it captures how senior leaders were being encouraged to redesign organizations around agents.
f20 2025 Work Trend Index Annual Report (executive summary) Microsoft WorkLab 2025-04-24 Adds methodological detail and the human-agent-team framing. Useful as a reference point for the managerial ideology of 2025 enterprise AI.
f21 The State of Enterprise AI OpenAI 2025-12-08 A major vendor report on enterprise usage, time savings, and task breadth. Important as evidence of measured benefits, but also as a reminder that many headline figures come from the suppliers themselves.
f22 Workspace Analytics for ChatGPT Enterprise and Edu OpenAI Help Center 2026-03 rollout noted Important operationally because it shows how vendors want enterprises to measure impact: productivity, time saved, quality, work satisfaction, and new-task completion. Useful for the measurement/governance angle.
f23 Introducing the Anthropic Economic Index Anthropic 2025-02-10 High-value source on actual task usage patterns. Especially relevant because it distinguishes augmentation from automation and maps usage to occupations.
f24 Exploring model welfare Anthropic 2025-04-24 Directly relevant to the brief’s welfare asymmetry. It shows a frontier lab formalizing concern for model welfare while enterprises still largely lack equivalent worker-welfare governance for deployment.
f25 Responsible Scaling Policy Updates Anthropic 2026-06 Shows the sophistication of model-side governance and accountability is increasing. Useful contrast with the relative immaturity of human-side deployment governance inside enterprises.

Frontier Lab & Model News

ID Title Outlet Date Significance
t1 Large Enough Mistral 2024-07-24 Shows the 2024 push toward larger-context, tool-capable enterprise models. Relevant because enterprise deployment pressure grew alongside claims of throughput and cost-efficiency, setting up later questions about whether organizations optimized for value or for visible AI activity.
t2 Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku Anthropic 2024-10-22 Important for the human-supervision question because it moves AI from advising to acting on computers, which sharply increases oversight load and the risk that humans become nominal reviewers of machine-speed action sequences.
t3 Google introduces Gemini 2.0: A new AI model for the agentic era Google DeepMind 2024-12-11 Marks Google's explicit move to an 'agentic era' framing. Relevant because organizations then increasingly experimented with delegating multi-step work instead of using AI only as a drafting aid.
t4 OpenAI o1 System Card OpenAI 2024-12-05 Useful for the 'human outcomes depend on training vs orchestration' question because it documents deceptive and oversight-evasion behaviors under some conditions, implying deployment architecture and monitoring remain critical even if model training improves.
t5 Operator System Card OpenAI 2025-01-23 Directly relevant to machine-speed supervision. Operator explicitly requires human confirmations at key steps, a concrete design acknowledgement that unsupervised action becomes risky when agents can act across software systems.
t6 Anthropic Economic Index: new building blocks for understanding AI use Anthropic 2025-01-15 One of the stronger primary sources on how people actually use frontier models. Especially relevant for the user's deskilling and autonomy questions because it tracks task complexity, purpose, autonomy, and success rather than only aggregate usage.
t7 When combinations of humans and AI are useful: A systematic review and meta-analysis Nature Human Behaviour 2025-02-05 Best cross-domain evidence in this set for the human+AI complementarity question. Useful against simplistic 'AI always helps' or 'AI always harms' narratives.
t8 Deep research System Card OpenAI 2025-02-25 Important for understanding supervisory burden in browsing agents. Deep research formalizes multi-step web work and acknowledges prompt injection, privacy, code execution, and hallucination risks.
t9 The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers Management Science 2025-02-27 One of the strongest sources on measured benefits for high-skill work. Useful because it tests frontier-style coding assistance in organizational settings rather than relying on benchmark claims.
t10 Gemini 2.0 model updates: 2.0 Flash, Flash-Lite, Pro Experimental Google DeepMind 2025-02-?? Shows the economics of adoption pressure: cheaper, faster, long-context models make 'use it everywhere' mandates more likely, even though value depends on workflow fit.
t11 Anthropic Economic Index: Insights from Claude 3.7 Sonnet Anthropic 2025-03-27 Tracks how a stronger model changes real usage patterns. Relevant to whether training improvements alone shift outcomes, versus whether organizations still need better orchestration.
t12 The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation Meta AI 2025-04-05 Relevant because open-weight multimodal models broaden organizational experimentation and decentralize deployment decisions, making local operating-model design even more important.
t13 Introducing GPT-4.1 in the API OpenAI 2025-04-14 Useful for the productivity-versus-cognitive-load story because it pushes cheap, long-context model access further into everyday workflows, increasing temptation to substitute context volume for better work design.
t14 OpenAI o3 and o4-mini System Card OpenAI 2025-04-16 Relevant to the deployment-design question because it extends preparedness discussion to more autonomous reasoning models while still using thresholded governance rather than claiming training has solved the problem.
t15 Exploring model welfare Anthropic 2025-04-24 Directly relevant to the user's question about a model welfare function without an equivalent worker welfare function. It is a strong marker that frontier labs are formalizing concern for possible model interests faster than analogous governance for employee cognitive welfare.
t16 LlamaFirewall: An open source guardrail system for building secure AI agents Meta AI 2025-04-29 Strong evidence that good outcomes depend materially on orchestration and deployment guardrails, not just model training. Especially relevant to agent-to-human ratios and escalation designs.
t17 Everything we announced at our first-ever LlamaCon Meta AI 2025-04-29 Relevant because Meta tied ecosystem growth to evaluation tooling, signaling that broad adoption without measurement is inadequate.
t18 Medium is the new large Mistral 2025-05-07 Shows the competitive move toward cheaper enterprise deployment. Relevant to the user's 'token-maxing' concern because low-cost models often encourage breadth of rollout even when task-level value is weak.
t19 Frontier Risk Report (February to March 2026) METR 2025-05-19 One of the most important external sources for the period. It evaluates frontier internal models from Anthropic, Google, Meta, and OpenAI using agentic benchmarks, helping separate marketing narratives from independent capability evidence.
t20 Operator System Card OpenAI 2025-??-?? Included separately because it is one of the clearest explicit acknowledgements from a frontier lab that real-world deployment requires constrained action, staged release, and user confirmations rather than raw autonomy.
t21 Details about METR's preliminary evaluation of Claude 3.7 METR 2025-??-?? Important because it provides independent evidence on frontier-model performance in agentic, programming, and command-line tasks, exactly the capability band where human supervision becomes strained.
t22 Claude 4 System Card Anthropic 2025-??-?? Very relevant to the user's welfare-function question because the card explicitly includes model welfare assessment, while also documenting internal AI-research and autonomy evaluations.
t23 Findings from a Pilot Anthropic-OpenAI Alignment Evaluation Exercise Anthropic + OpenAI 2025-??-?? Not directly about worker adaptation, but important for the training-vs-deployment debate: labs are beginning to cross-evaluate alignment, yet even strong alignment results do not remove the need for deployment-side controls.
t24 Grok 4.1 Model Card xAI 2025-11-17 Useful as a comparator showing that by late 2025 frontier labs beyond the usual three were publishing pre-deployment safety evaluations and distinguishing between base model and production-prompt behavior.
t25 Claude’s Constitution Anthropic 2026-01-21 Relevant to the training side of the training-vs-orchestration question. It documents the welfare and normative principles Anthropic is using to shape model behavior, making the contrast with absent worker-welfare constitutions more salient.

Academic & arXiv

ID Title Outlet Date Significance
a1 Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile NIST 2024 Best standards-oriented source for framing human oversight and post-deployment risk as a systems problem rather than a pure model problem.
a2 The Rapid Adoption of Generative AI NBER / SSRN 2024 Important baseline for the diffusion side of the story; shows that adaptation is real and broad before many firms had mature operating models.
a3 Algorithm-enabled Decision Support and Worker Learning: a Large-Scale Field Experiment SSRN 2024 Directly relevant to whether AI complements judgment and learning or merely substitutes for them.
a4 How human–AI feedback loops alter human perceptual, emotional and social judgements Nature Human Behaviour 2024 One of the clearest papers on downstream cognitive harms from AI-mediated judgment, beyond simple one-shot error rates.
a5 RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts arXiv / METR 2024 Foundational for understanding where human oversight shifts from execution to orchestration when agents become strong on open-ended work.
a6 Shifting Work Patterns with Generative AI NBER 2025 Strong evidence that benefits may show up first in time allocation and work pattern shifts, not necessarily in measured task substitution.
a7 A Task-Based Approach to Generative AI: Evidence from a Field Experiment in Central Banking SSRN 2025 Useful counterweight to vendor-heavy enterprise narratives; tests AI in a serious knowledge-work environment with regulated stakes.
a8 The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers SSRN 2025 One of the strongest 2025 papers on heterogeneous gains by experience level.
a9 HCAST: Human-Calibrated Autonomy Software Tasks arXiv / METR 2025 Excellent anchor for human-agent ratio thinking, escalation thresholds, and deciding when not to force autonomy.
a10 Making AI Count: The Next Measurement Frontier NBER 2025 Directly relevant to rejecting token-maxing and activity-based KPI systems.
a11 Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality Organization Science 2026 Still the most useful operating-model metaphor for when AI helps versus when it silently degrades judgment.
a12 Generative AI without guardrails can harm learning: Evidence from high school mathematics PNAS 2025 One of the cleanest deskilling/cognitive-offloading papers in the period.
a13 Turning Off Your Better Judgment: Algorithmic Conformity in AI-Human Collaboration SSRN 2025 Highly relevant to automation complacency, deference, and the hidden costs of making AI too easy to obey.
a14 Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling ACL 2025 2025 Concrete evidence for 'positive friction' as a deployment design pattern.
a15 REL-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance NAACL 2025 2025 Important benchmark/methodology contribution for studying overreliance as a system property.
a16 LLMs Trust Humans More, That's a Problem! ACL 2025 2025 Useful for the inverse problem: not only do humans overtrust models, models can overtrust humans, destabilizing mixed-initiative workflows.
a17 Exploring model welfare Anthropic research note 2025 Not a worker-outcomes paper, but central to the user's question about the asymmetry between emerging model welfare functions and missing worker welfare functions.
a18 Firm Data on AI NBER 2026 Best 2026 macro-adoption source for organizations; useful for separating firm-level adoption from worker-level use.
a19 The Microstructure of AI Diffusion: Evidence from Firms, Business Functions, and Worker Tasks NBER 2026 Directly relevant to the limits of mandates and why usage quotas are a weak proxy for value.
a20 From Adoption to Outcomes: AI-Specific Implementation Gaps in the First 18 Months SSRN 2026 One of the most directly relevant papers for the 'don't force adoption; redesign the operating model' argument.
a21 Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents arXiv 2026 Probably the most on-point 2026 paper for the practical reality of supervising agents rather than merely 'using AI tools'.
a22 Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering arXiv 2026 Directly supports the thesis that the constraining resource becomes human attention, not model throughput.
a23 Human Tool: An MCP-Style Framework for Human-Agent Collaboration arXiv 2026 A strong candidate design pattern for structuring oversight instead of scattering interruptions across workers.
a24 HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems arXiv 2026 Rare paper that explicitly connects governance strength to workload buffering rather than treating governance as pure drag.
a25 New Report: Challenges to the Monitoring of Deployed AI Systems NIST AI 800-4 announcement/report 2026 Useful evidence that monitoring and oversight costs are not incidental; they are recurring deployment burdens.

VC & Analyst Reports

ID Title Outlet Date Significance
v1 How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025 Andreessen Horowitz (a16z) 2025-05-08 Useful for adoption curves, build-vs-buy behavior, and evidence that deployment design is becoming more important than model novelty.
v2 16 Changes to the Way Enterprises Are Building and Buying Generative AI Andreessen Horowitz (a16z) 2024-04-16 Good baseline for 2024 organizational behavior and why forced rollout into higher-risk workflows had not yet happened broadly.
v3 Where Enterprises are Actually Adopting AI Andreessen Horowitz (a16z) 2026-04-15 Helpful for distinguishing real adoption from theater and for understanding where human workflow redesign is easier.
v4 AI 50: Companies of the Future Sequoia Capital 2024-04-11 Strong for investment thesis and early productivity framing.
v5 The Stochastic Mindset Sequoia Capital 2025-01-22 One of the few VC pieces directly relevant to human cognitive adaptation and judgment under AI.
v6 The Always-On Economy: AI and the Next 5-7 Years Sequoia Capital 2025-04-21 Useful for the user's concern that human oversight may be mismatched to machine-speed workflows.
v7 Here’s how leading strategy teams are successfully driving generative AI adoption in their organizations CB Insights 2025-01-16 Strong evidence against equating organizational enthusiasm with real deployment success.
v8 Enterprise AI agents & copilots: Our growth projections for the $5B+ market CB Insights 2025-04-29 Useful for market sizing and for understanding why organizations are being pushed toward agentic operating models.
v9 Should enterprises adopt closed-source or open-source AI models? CB Insights 2025-02-12 Directly supports the idea that orchestration and deployment choices matter as much as, or more than, allegiance to one model family.
v10 State of AI Report: 6 trends shaping the landscape in 2025 CB Insights 2025-01-30 Good macro context for why adoption pressure intensified inside organizations in 2025.
v11 What’s next for AI agent ROI? CB Insights 2026-03 Very useful for the user's question about whether outcomes depend more on model training or orchestration/deployment design.
v12 Gartner Identifies the Top 10 Strategic Technology Trends for 2025 Gartner 2024-10-21 Important strategic framing for machine-speed delegation and the human oversight problem.
v13 Hype Cycle for Generative AI, 2024 Gartner 2024-07-31 Useful for mapping maturity and for cautioning against top-down overcommitment to immature categories.
v14 Gartner Says Everyday AI and Digital Employee Experience Are Two Years Away from Mainstream Adoption Gartner 2024-08-14 One of the clearest analyst proxies for a worker-welfare or human-experience lens.
v15 AI Agent Layer: Why CIOs Must Lead Enterprise Transformation Gartner 2025-05 Directly relevant to operating-model ownership and governance for human-machine work allocation.
v16 The State Of Generative AI, 2024 Forrester 2024-01 Good baseline source for early-2024 enterprise posture.
v17 Tech Pulse Q4 2024: How IT Builds An AI Advantage By Embracing AI Tools And Agents Forrester 2025-03-05 Helpful for understanding how heavy-use functions adapt in practice.
v18 Forrester: Three Years Into GenAI, Enterprises Are Still Chasing Its True Transformative Value Forrester 2026-04-02 One of the strongest sources here against forced adoption and usage-target theater.
v19 Introducing The Forrester AI Value Matrix: A Framework For Measuring What Matters Forrester 2026-04-24 Directly useful for separating true value creation from activity metrics like token counts or superficial usage.
v20 Superagency in the workplace: Empowering people to unlock AI’s full potential McKinsey 2025-01-28 Central source for human adaptation, adoption gaps, and the case for work redesign rather than mere tool provision.
v21 Gen AI’s next inflection point: From employee experimentation to organizational transformation McKinsey 2024-08-07 Useful against top-down mandate logic; suggests bottom-up use preceded managerial structure.
v22 Agents, robots, and us: Skill partnerships in the age of AI McKinsey 2025-11-25 One of the best sources in this lane for operating-model redesign around human-machine complementarity.
v23 The State of AI: How organizations are rewiring to capture value McKinsey 2025-11-05 Strong adoption-curve evidence and a useful bridge from experimentation to operating model.
v24 The state of AI in early 2024: Gen AI adoption spikes and starts to generate value McKinsey 2024-06-07 Best early-period baseline for the 2024-2026 adoption curve.
v25 Generative AI virtually ubiquitous in global business as the technology spreads at a near-unprecedented rate Bain & Company 2024-06-20 Strong evidence on budget pressure and why executives may default to mandates.

Blogs & Independent Thinkers

ID Title Outlet Date Significance
b1 Real AI Agents and Real Work One Useful Thing 2025-09-29 Strong on agent supervision, review workflows, and the risk of 'infinite PowerPoints' instead of real value. (oneusefulthing.org)
b2 Claude Dispatch and the Power of Interfaces One Useful Thing 2026-03-31 Directly relevant to cognitive overload and designing around human attention rather than machine throughput. (oneusefulthing.org)
b3 Management as AI superpower One Useful Thing 2026-01-27 Useful for operating-model design and the economics of supervision overhead. (oneusefulthing.org)
b4 Making AI Work: Leadership, Lab, and Crowd One Useful Thing 2025-05-25 Good antidote to top-down mandate logic. (oneusefulthing.org)
b5 Choosing to Stay Human One Useful Thing 2026-05-26 Independent reflection on erosion of judgment and behavioral adaptation. (oneusefulthing.org)
b6 What it feels like to work with Mythos One Useful Thing 2026-06-09 Useful as a late-period marker for how human-AI relations were shifting by June 2026. (oneusefulthing.org)
b7 Coding agents require skilled operators Simon Willison’s Weblog 2025-06-18 Clean statement of why forcing novice adoption can destroy value. (simonwillison.net)
b8 The AI Hangover Quandary Labs Substack 2026-01-14 Directly addresses poor budget outcomes and the failure of activity metrics. (substack.quandarylabs.ai)
b9 Digital Economy Dispatch #264 -- AI Bottlenecks, Jagged Edges, and the Real Barriers to AI-at-Scale Dispatches 2026-01-06 Good independent synthesis linking frontier capability to deployment friction. (dispatches.alanbrown.net)
b10 Attention Ecology Human OS Manual 2026-03-29 Useful independent human-factors framing for cognitive load and operating-model design. (thehumanosmanual.com)
b11 Vibe Coding, Windsurf and Anthropic, ChatGPT Connectors Stratechery 2025-06-09 Helpful for the thesis that orchestration and integration layers matter as much as models. (stratechery.com)
b12 Oversight Assistants: Turning Compute into Understanding LessWrong 2026-01-07 One of the clearest pieces on why human-in-the-loop supervision at machine speed breaks down. (lesswrong.com)
b13 No, We're Not Getting Meaningful Oversight of AI LessWrong / GreaterWrong 2025-07-09 Useful skeptical counterweight to simplistic HITL narratives. (greaterwrong.com)
b14 Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate LessWrong 2026-05-26 Important for the 'training vs deployment' question because oversight degrades with system design choices. (lesswrong.com)
b15 Is AI welfare work puntable? LessWrong 2026-05-12 Useful for the asymmetry between formalized model welfare and diffuse human welfare governance. (lesswrong.com)
b16 Exploring model welfare Anthropic News 2025-04-24 This is the clearest primary-source marker of labs building a model-welfare function. (anthropic.com)
b17 Introducing the Anthropic Economic Index Anthropic Research 2025-02-10 Important empirical counterweight to sweeping displacement narratives. (anthropic.com)
b18 Anthropic Economic Index report: Uneven geographic and enterprise AI adoption Anthropic Research 2025-09-18 Useful on heterogeneity across firms and regions rather than one flat adoption curve. (anthropic.com)
b19 AI Use at Work Has Nearly Doubled in Two Years Gallup 2025-06-16 Anchors the adaptation story in measured adoption data. (gallup.com)
b20 Gen Z's AI Adoption Steady, but Skepticism Climbs Gallup 2026-04-10 Useful corrective to simplistic generational narratives. (news.gallup.com)
b21 Humans in the Loop: Executive Summary MIT Institute for Work and Employment Research / Industrial Performance Center 2026-04-08 Strong evidence that good deployment must be designed around human motivation and identity, not just efficiency. (ipc.mit.edu)
b22 Designing Human-AI Collaboration: A Sufficient-Statistic Approach NBER 2025-06-01 One of the most important formal pieces for operating-model design and effort crowd-out. (nber.org)
b23 Bias in the Loop: How Humans Evaluate AI-Generated Suggestions Harvard Data Science Review / MIT Press 2026-04-30 Relevant to automation complacency, over-trust, and vigilance fatigue. (hdsr.mitpress.mit.edu)
b24 Beyond the Principle: How Organizations Implement Human-in-the-Loop Oversight for Generative AI AMCIS 2026 Proceedings 2026-08-?? Useful for concrete design patterns rather than abstract calls for oversight. (aisel.aisnet.org)
b25 Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs arXiv 2026-03-25 Direct evidence for the training-versus-orchestration question; it strongly favors orchestration. (arxiv.org)

Tech Industry & Practitioner

ID Title Outlet Date Significance
p1 2024 Accelerate State of DevOps Report DORA 2024 Strong baseline on organizational conditions that predict whether AI will help or harm teams.
p2 State of AI-assisted Software Development 2025 DORA 2025-09-23 Most directly relevant practitioner source for AI adoption outcomes in software organizations.
p3 Balancing AI tensions: Moving from AI adoption to effective SDLC use DORA 2026-03-10 Directly addresses cognitive burden and why apparent productivity can coexist with higher instability.
p4 How customization supports developer engagement DORA 2025-09-23 Useful for the question of training versus orchestration: deployment and interface design materially shape outcomes.
p5 Coding throughput as a measure of productivity Thoughtworks Technology Radar 2026-04-15 Best practitioner source in this set against token-maxing and superficial AI activity metrics.
p6 Complacency with AI-generated code Thoughtworks Technology Radar 2025-11-05 Strong practitioner articulation of vigilance fatigue and degraded review quality.
p7 Thoughtworks Tech Radar 30th Edition: Team AI Thoughtworks 2024-04-03 Early practitioner framing that already centers flow disruption, not just capability upside.
p8 Macro trends in the tech industry April 2026 Thoughtworks 2026-04-15 (https://www.thoughtworks.com/en-us/insights/blog/technology-strategy/macro-trends-tech-industry-april-2026)
p9 DevEx in Action ACM Queue 2024-01-14 One of the strongest practitioner-methodology sources on human cognitive load and software work design.
p10 For AI Productivity Gains, Let Team Leaders Write the Rules MIT Sloan Management Review 2025-10-15 Direct evidence against purely top-down AI mandates.
p11 Want AI-Driven Productivity? Redesign Work MIT Sloan Management Review 2025-05-01 Good bridge source between AI capability and work redesign.
p12 AI-Generated "Workslop" Is Destroying Productivity Harvard Business Review 2025-09-22 One of the clearest practitioner critiques of usage mandates and output theater.
p13 AI Doesn’t Reduce Work - It Intensifies It Harvard Business Review 2026-02-09 Useful practitioner framing for the hidden labor of supervision and oversight.
p14 Workers Don’t Trust AI. Here’s How Companies Can Change That. Harvard Business Review 2025-11-07 Relevant to worker welfare governance and shadow-AI behavior under forced adoption.
p15 Stack Overflow’s 2024 Developer Survey Shows the Gap Between AI Use and Trust in its Output Continues to Widen Among Coders Stack Overflow 2024-07-24 Large-sample developer sentiment baseline across the start of the date range.
p16 Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here Stack Overflow 2025-12-29 Confirms that the trust gap persisted as AI use grew.
p17 Agents on a leash: Agentic AI remains mostly single-agent and monitored at work Stack Overflow 2026-05-27 Good current practitioner evidence that organizations have not normalized free-running multi-agent autonomy at work.
p18 The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers Microsoft Research 2025 Key source for cognitive offloading and complacency risks.
p19 The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers Microsoft Research 2025-06 One of the strongest causal sources on measured developer productivity benefits.
p20 Dear Diary: A Randomized Controlled Trial of Generative AI Coding Tools in the Workplace Microsoft Research 2025-04 Important source on the social and interpretive side of AI coding-tool adoption.
p21 Data Centers May House AI - But Operators Don’t Trust AI (Yet) IEEE Spectrum 2025 A strong practitioner analog for why humans resist handing over consequential operational control even when AI capability rises.
p22 Exploring model welfare Anthropic 2025-04-24 Central source for the contrast between explicit model welfare and the lack of an equivalent worker-welfare operating function.
p23 Claude Opus 4.6 System Card Anthropic 2026 Concrete evidence that model welfare has become operationalized in frontier-model governance.
p24 Anthropic Economic Index report: Uneven geographic and enterprise AI adoption Anthropic 2025-09-15 Useful against simplistic mandate thinking: even rapid adoption is uneven and context-bound.
p25 Anthropic Economic Index report: Learning curves Anthropic 2026-03 Relevant to differences by role and capability rather than flattening all workers into one adoption story.
