Research · Academic & arXiv

Research sweep · deep · 2024 – 2026

Designing AI Operating Models Around Humans

How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits and harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. The sweep covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.

  • GPT-5.5
  • financial
  • frontier
  • academic
  • vc
  • blogs
  • tech

Synthesised 2026-06-15

Narrative

The strongest recent academic work has moved away from the simple question of whether AI helps and toward a harder one: under what conditions do humans and models produce better outcomes together? METR's RE-Bench and HCAST, published on arXiv in 2024 and 2025 respectively, are central here. RE-Bench shows frontier agents can outperform human experts on short research-engineering tasks, yet humans improve more with longer time budgets. HCAST turns this into practical thresholds by tying agent success rates to tasks that would take humans one minute, one hour, or more than four hours.

The newer human-factors literature is blunt that oversight is not free. Garousi's 2026 paper on software engineering frames review effort and suggestion overload as direct costs, while Zhu, Wang, and Zhang's 2026 experiment on AI-assisted social science shows that architecture matters more than raw model capability in many settings: an unconstrained multi-agent baseline failed in 72% of runs, but a workflow with deterministic execution and three human gates cut failures to 16%. Work on appropriate reliance points in the same direction. Studies by He et al., Kim et al., and Ashktorab et al. find that explanations can raise reliance on wrong answers, while sources, visible inconsistencies, and structured multi-step workflows improve calibration.
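The gated design and the size of the reported improvement can be made concrete with a small sketch. Only the 72% and 16% failure rates come from the study as summarised above; the gate names, the `run_with_gates` helper, and the `approve` callback are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Sequence

# Hypothetical gate names for illustration; the paper's own gates may differ.
GATES = ("plan_review", "analysis_review", "report_review")


@dataclass
class GatedRun:
    """Records which human gates a run passed and where it stopped, if anywhere."""
    passed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None


def run_with_gates(steps: Sequence[Callable[[Any], Any]],
                   approve: Callable[[str, Any], bool]) -> GatedRun:
    """Run each step in order, pausing for a human decision after every one.

    Steps are assumed to be deterministic functions (same input, same output).
    """
    run = GatedRun()
    artefact = None
    for gate, step in zip(GATES, steps):
        artefact = step(artefact)
        if not approve(gate, artefact):  # the reviewer can stop the run here
            run.halted_at = gate
            return run
        run.passed.append(gate)
    return run


def relative_reduction(before: float, after: float) -> float:
    """Share of the baseline failure rate that the new design removes."""
    return (before - after) / before


# Failure rates reported in the study: unconstrained baseline vs gated workflow.
baseline, gated = 0.72, 0.16
print(f"absolute reduction: {(baseline - gated) * 100:.0f} percentage points")
print(f"relative reduction: {relative_reduction(baseline, gated):.0%}")
```

On the reported figures this works out to a drop of 56 percentage points, or roughly 78% of baseline failures removed. In the sketch, a rejected gate halts the run, so later steps never execute on unreviewed output.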

The benefits side is real but uneven. Henseke's 2026 cross-European analysis finds only 12% average workplace adoption in 2024, with large country variation and no detectable task restructuring yet, and shows that employee say in organisational decisions steepens the gradient from exposure to adoption. In software and knowledge work, surveys by Brachman et al., Giray et al., and Gurgul et al. report broad use and perceived cycle-time gains, but also weak objective measurement and patchy governance. Shen and Tamkin's 2026 experiment is a useful counterweight: AI assistance did not deliver significant average efficiency gains in learning a new programming library, but it did impair conceptual understanding, code reading, and debugging.

Across the harms literature, the pattern is overreliance, narrowed cognition, and misplaced responsibility rather than a single dramatic failure mode. Rathi, Jurafsky, and Zhou show overreliance on overconfident models across five languages. Design studies by Wadinambiarachchi et al. and Fu et al. find that AI support can increase fixation or improve the apparent creativity of outputs without reliably strengthening creative thinking itself. Papers on metacognitive prompts, productive friction, and critical-thinking scales suggest a common conclusion: good human outcomes depend less on pushing more AI usage and more on designing review cadence, prompting structures, incentives, and escalation points that preserve human judgement under load.


Sources

| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts | arXiv | 2024-11 | METR's RE-Bench gives one of the clearest recent human versus agent comparisons, showing AI agents can move faster than experts on short research-engineering tasks but fall behind as time budgets lengthen. |
| a2 | HCAST: Human-Calibrated Autonomy Software Tasks | arXiv | 2025-03 | METR's HCAST ties agent performance to human task-time baselines, which is directly useful for judging when oversight remains realistic and when organisations are asking humans to supervise beyond their effective range. |
| a3 | (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable | arXiv | 2026-06 | This preprint isolates workflow design from model quality and finds that human gates plus deterministic execution cut failure rates from 72% to 16% in AI-assisted research. |
| a4 | Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering | arXiv | 2026-06 | Garousi names oversight labour and suggestion overload as hidden costs of coding assistants, making the burden itself part of the productivity calculation rather than an afterthought. |
| a5 | Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance | arXiv | 2026-05 | This paper offers a formal account of why more AI help can lower net productivity once skill development, unreliable outputs, and heterogeneous AI literacy are included. |
| a6 | How AI Impacts Skill Formation | arXiv | 2026-01 | Shen and Tamkin provide experimental evidence that delegation to AI can improve throughput for some users while impairing conceptual understanding, debugging ability, and later independent performance. |
| a7 | Generative AI at Work: From Exposure to Adoption across 35 European Countries | arXiv | 2026-04 | Using a 36,600-worker survey across 35 countries, Henseke shows that adoption depends not just on exposure but on skills, organisational voice, and training, with no detectable task restructuring yet. |
| a8 | Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition | arXiv | 2025-01 | This study shows that transparent multi-step workflows can improve reliance calibration on composite fact-checking tasks, especially when AI advice is misleading. |
| a9 | Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies | arXiv | 2025-02 | Kim, Vaughan, Liao, Lombrozo, and Russakovsky show that explanations can raise reliance on both right and wrong answers, while sources and visible inconsistencies help users discount bad outputs. |
| a10 | Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions | arXiv | 2024-09 | This paper links hallucination handling and cognitive forcing functions to observable reliance patterns in text-generation work, rather than treating verification as a generic best practice. |
| a11 | Human Misperception of Generative-AI Alignment: A Laboratory Experiment | arXiv | 2025-02 | He, Shorrer, and Xia find that people systematically overestimate how closely GenAI choices match human preferences, which matters for welfare claims and delegated decision-making. |
| a12 | Toward Human-AI Complementarity Across Diverse Tasks | arXiv | 2026-04 | This paper finds only modest complementarity gains across realistic tasks and argues that the real bottleneck is routing hard cases to humans in time for them to matter. |
| a13 | Humans overrely on overconfident language models, across languages | arXiv | 2025-07 | Rathi, Jurafsky, and Zhou show that overconfidence and overreliance persist across five languages, suggesting that calibration failures are not a narrow English-only artefact. |
| a14 | When Thinking Pays Off: Incentive Alignment for Human-AI Collaboration | arXiv | 2025-11 | This behavioural experiment shows that overreliance is partly an incentive design problem, and that collaboration quality changes when organisations reward correct dissent rather than passive acceptance. |
| a15 | De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design | arXiv | 2025-03 | This design-focused study connects practitioner concerns about de-skilling and cognitive offloading to the older automation literature on function allocation and responsibility drift. |
| a16 | The Effects of Generative AI on Design Fixation and Divergent Thinking | arXiv | 2024-03 | This experiment finds that image-generation support can increase fixation and reduce originality and variety, giving concrete evidence that convenience can narrow thought rather than broaden it. |
| a17 | Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers' Creative Thinking | arXiv | 2024-10 | This study finds more creative-seeming outputs with AI support but uneven cognitive effects across users, which complicates simple claims that AI either helps or harms creativity. |
| a18 | Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains | arXiv | 2025-06 | Moss, Watkins, Persaud, Karunaratne, and Nafus show that in high-precision domains the key issue is not just accuracy but preserving enough context control for human vigilance and review. |
| a19 | Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study | arXiv | 2025-10 | This representative Swiss panel finds declining public acceptance after the generative AI boom and rising demand for human-only decision-making, a direct warning against mandate-led deployment. |
| a20 | Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale | arXiv | 2025-12 | This scale paper gives the field a way to measure verification, motivation, and reflection in AI use, which is necessary if organisations want to manage human outcomes rather than token volume. |
| a21 | Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts | arXiv | 2025-05 | This user study shows that metacognitive prompts can increase follow-up inquiry and perspective-taking during AI search, pointing to a concrete intervention for reducing passive acceptance. |
| a22 | Promoting Critical Thinking With Domain-Specific Generative AI Provocations | arXiv | 2026-03 | von Davier, Lee, Forlizzi, and Das argue that productive friction and domain-specific provocations can support critical thinking better than frictionless assistant behaviour. |
| a23 | Current and Future Use of Large Language Models for Knowledge Work | arXiv | 2025-03 | Surveys of knowledge workers show that adoption is already broad, but desired future use centres on workflow integration, which shifts the design question from access to operating model. |
| a24 | An Empirical Study of Generative AI Adoption in Software Engineering | arXiv | 2025-12 | This empirical study reports widespread use and perceived gains in software engineering, while also finding thin objective measurement and weak institutional emphasis on training and governance. |
| a25 | The State of Generative AI in Software Development: Insights from Literature and a Developer Survey | arXiv | 2026-03 | This review-plus-survey argues that value is strongest in routine coding and documentation, while planning and requirements work remain harder, shifting attention toward oversight and specification quality. |