Designing AI Operating Models Around Humans

How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits against harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. The brief covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.

GPT-5.5

Synthesised 2026-06-16

Narrative

The strongest recent academic work has moved away from the simple question of whether AI helps, and toward a harder one: under what conditions do humans and models produce better outcomes together? METR's RE-Bench and HCAST, published on arXiv in 2024 and 2025 respectively, are central here. RE-Bench shows frontier agents can outperform human experts on short research-engineering tasks, yet humans improve more with longer time budgets. HCAST turns this into a practical language of thresholds by tying agent success rates to tasks that would take humans one minute, one hour, or more than four hours.
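As a rough illustration of what reporting success against human task time can look like, the sketch below groups agent runs by how long a human would need and computes a success rate per bucket. It is not METR's code or data: the bucket edges simply echo the thresholds named above, and `bucket_for`, `success_by_bucket`, and the toy records are hypothetical.

```python
from collections import defaultdict

# Hypothetical bucket edges in minutes, echoing the thresholds named in the text.
BUCKETS = [(1, "<= 1 min"), (60, "<= 1 hour"), (240, "<= 4 hours")]
OVERFLOW = "> 4 hours"


def bucket_for(human_minutes: float) -> str:
    """Return the label of the first bucket whose upper edge covers the task."""
    for upper, label in BUCKETS:
        if human_minutes <= upper:
            return label
    return OVERFLOW


def success_by_bucket(runs):
    """runs: iterable of (human_minutes, agent_succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # label -> [successes, attempts]
    for human_minutes, succeeded in runs:
        entry = totals[bucket_for(human_minutes)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return {label: s / n for label, (s, n) in totals.items()}


# Made-up toy records for illustration only; not results from any paper.
toy_runs = [(0.5, True), (1, True), (30, True), (45, False), (200, False), (300, False)]
rates = success_by_bucket(toy_runs)
```

An organisation could use a table like `rates` to decide which task lengths are realistic to delegate and which still need a human in the lead, provided the human-time baselines are measured rather than guessed.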

The newer human-factors literature is blunt that oversight is not free. Garousi's 2026 paper on software engineering frames review effort and suggestion overload as direct costs, while Zhu, Wang, and Zhang's 2026 experiment on AI-assisted social science shows that architecture can matter more than raw model capability: an unconstrained multi-agent baseline failed in 72% of runs, but a workflow with deterministic execution and three human gates cut the failure rate to 16%, a drop of 56 percentage points. Work on appropriate reliance points in the same direction. Studies by He et al., Kim et al., and Ashktorab et al. find that explanations can raise reliance on wrong answers, while sources, visible inconsistencies, and structured multi-step workflows improve calibration.
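A minimal sketch of the general pattern, deterministic steps separated by human approval gates, might look like the following. It is not the architecture from the Zhu, Wang, and Zhang paper: the `GatedWorkflow` class, the stage names, and the stand-in gate functions are all hypothetical, and in practice each gate would be a real reviewer's decision rather than a function call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    output: str
    approved: bool


@dataclass
class GatedWorkflow:
    # Each stage is (name, deterministic step, human gate). All hypothetical.
    stages: List[Tuple[str, Callable[[str], str], Callable[[str], bool]]]
    log: List[StepResult] = field(default_factory=list)

    def run(self, payload: str) -> Tuple[bool, str]:
        """Run stages in order; stop at the first gate that rejects."""
        for name, step, gate in self.stages:
            payload = step(payload)       # same input always gives same output
            approved = gate(payload)      # stand-in for a human decision
            self.log.append(StepResult(name, payload, approved))
            if not approved:
                return False, payload     # halt and escalate instead of continuing
        return True, payload


def approve_all(_output: str) -> bool:
    return True


def reject_empty(output: str) -> bool:
    return bool(output.strip())


# Three gates, mirroring the count in the text; the steps are toy string functions.
workflow = GatedWorkflow(stages=[
    ("plan", str.strip, approve_all),
    ("analysis", str.upper, reject_empty),
    ("report", lambda s: f"REPORT: {s}", approve_all),
])
ok, result = workflow.run("  effect of x on y ")
```

The design choice worth noting is that a rejected gate stops the run and leaves an audit trail in `log`, so a reviewer sees one decision at a time rather than a finished artefact with errors buried upstream.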

The benefits side is real but uneven. Henseke's 2026 cross-European analysis finds only 12% average workplace adoption in 2024, with large country variation and no detectable task restructuring yet, and shows that employee voice in organisational decisions steepens the gradient from exposure to adoption. In software and knowledge work, surveys by Brachman et al., Giray et al., and Gurgul et al. report broad use and perceived cycle-time gains, but also weak objective measurement and patchy governance. Shen and Tamkin's 2026 experiment is a useful counterweight: AI assistance did not deliver significant average efficiency gains in learning a new programming library, but it did impair conceptual understanding, code reading, and debugging.

Across the harms literature, the pattern is overreliance, narrowed cognition, and misplaced responsibility rather than a single dramatic failure mode. Rathi, Jurafsky, and Zhou show overreliance on overconfident models across five languages. Design studies by Wadinambiarachchi et al. and Fu et al. find that AI support can increase fixation or improve the apparent creativity of outputs without reliably strengthening creative thinking itself. Papers on metacognitive prompts, productive friction, and critical-thinking scales suggest a common conclusion: good human outcomes depend less on pushing more AI usage and more on designing review cadence, prompting structures, incentives, and escalation points that preserve human judgement under load.

Sources

ID	Title	Outlet	Date	Significance
a1	RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts	arXiv	2024-11	METR's RE-Bench gives one of the clearest recent human versus agent comparisons, showing AI agents can move faster than experts on short research-engineering tasks but lose ground as task duration and supervisory demands increase.
a2	HCAST: Human-Calibrated Autonomy Software Tasks	arXiv	2025-03	METR's HCAST ties agent performance to human task-time baselines, which is directly useful for judging when oversight remains realistic and when organisations are asking humans to supervise beyond their effective range.
a3	(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable	arXiv	2026-06	This preprint isolates workflow design from model quality and finds that human gates plus deterministic execution cut failure rates from 72% to 16% in AI-assisted research.
a4	Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering	arXiv	2026-06	Garousi names oversight labour and suggestion overload as hidden costs of coding assistants, making the burden itself part of the productivity calculation rather than an afterthought.
a5	Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance	arXiv	2026-05	This paper offers a formal account of why more AI help can lower net productivity once skill development, unreliable outputs, and heterogeneous AI literacy are included.
a6	How AI Impacts Skill Formation	arXiv	2026-01	Shen and Tamkin provide experimental evidence that delegation to AI can improve throughput for some users while impairing conceptual understanding, debugging ability, and later independent performance.
a7	Generative AI at Work: From Exposure to Adoption across 35 European Countries	arXiv	2026-04	Using a 36,600-worker survey across 35 countries, Henseke shows that adoption depends not just on exposure but on skills, organisational voice, and training, with no detectable task restructuring yet.
a8	Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition	arXiv	2025-01	This study shows that transparent multi-step workflows can improve reliance calibration on composite fact-checking tasks, especially when AI advice is misleading.
a9	Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies	arXiv	2025-02	Kim, Vaughan, Liao, Lombrozo, and Russakovsky show that explanations can raise reliance on both right and wrong answers, while sources and visible inconsistencies help users discount bad outputs.
a10	Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions	arXiv	2024-09	This paper links hallucination handling and cognitive forcing functions to observable reliance patterns in text-generation work, rather than treating verification as a generic best practice.
a11	Human Misperception of Generative-AI Alignment: A Laboratory Experiment	arXiv	2025-02	He, Shorrer, and Xia find that people systematically overestimate how closely GenAI choices match human preferences, which matters for welfare claims and delegated decision-making.
a12	Toward Human-AI Complementarity Across Diverse Tasks	arXiv	2026-04	This paper finds only modest complementarity gains across realistic tasks and argues that the real bottleneck is routing hard cases to humans in time for them to matter.
a13	Humans overrely on overconfident language models, across languages	arXiv	2025-07	Rathi, Jurafsky, and Zhou show that overconfidence and overreliance persist across five languages, suggesting that calibration failures are not a narrow English-only artefact.
a14	When Thinking Pays Off: Incentive Alignment for Human-AI Collaboration	arXiv	2025-11	This behavioural experiment shows that overreliance is partly an incentive design problem, and that collaboration quality changes when organisations reward correct dissent rather than passive acceptance.
a15	De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design	arXiv	2025-03	This design-focused study connects practitioner concerns about de-skilling and cognitive offloading to the older automation literature on function allocation and responsibility drift.
a16	The Effects of Generative AI on Design Fixation and Divergent Thinking	arXiv	2024-03	This experiment finds that image-generation support can increase fixation and reduce originality and variety, giving concrete evidence that convenience can narrow thought rather than broaden it.
a17	Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers' Creative Thinking	arXiv	2024-10	This study finds more creative-seeming outputs with AI support but uneven cognitive effects across users, which complicates simple claims that AI either helps or harms creativity.
a18	Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains	arXiv	2025-06	Moss, Watkins, Persaud, Karunaratne, and Nafus show that in high-precision domains the key issue is not just accuracy but preserving enough context control for human vigilance and review.
a19	Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study	arXiv	2025-10	This representative Swiss panel finds declining public acceptance following the generative AI boom and rising demand for human-only decision-making, a direct warning against mandate-led deployment.
a20	Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale	arXiv	2025-12	This scale paper gives the field a way to measure verification, motivation, and reflection in AI use, which is necessary if organisations want to manage human outcomes rather than token volume.
a21	Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts	arXiv	2025-05	This user study shows that metacognitive prompts can increase follow-up inquiry and perspective-taking during AI search, pointing to a concrete intervention for reducing passive acceptance.
a22	Promoting Critical Thinking With Domain-Specific Generative AI Provocations	arXiv	2026-03	von Davier, Lee, Forlizzi, and Das argue that productive friction and domain-specific provocations can support critical thinking better than frictionless assistant behaviour.
a23	Current and Future Use of Large Language Models for Knowledge Work	arXiv	2025-03	These surveys of knowledge workers show that adoption is already broad, but desired future use centres on workflow integration, which shifts the design question from access to operating model.
a24	An Empirical Study of Generative AI Adoption in Software Engineering	arXiv	2025-12	This empirical study reports widespread use and perceived gains in software engineering, while also finding thin objective measurement and weak institutional emphasis on training and governance.
a25	The State of Generative AI in Software Development: Insights from Literature and a Developer Survey	arXiv	2026-03	This review-plus-survey argues that value is strongest in routine coding and documentation, while planning and requirements work remain harder, shifting attention toward oversight and specification quality.