Research · Academic & arXiv

Research sweep · deep · 2025 – 2026

Comparative LLM Usage Across Sectors

Comparative real-world usage of LLMs and adjacent AI technologies from June 2025 to June 2026: which models (GPT-5, Claude, Gemini, Llama, Mistral, DeepSeek, Qwen) dominate which sectors, how they are deployed (hosted API, Bedrock/Azure, self-hosted vLLM/Ollama, RAG, agents, fine-tuning), what workloads they serve, and how organisations measure, budget, and publicly report token cost and actual spend.

  • Claude Opus 4.8
  • financial
  • frontier
  • academic
  • vc
  • blogs
  • tech

Synthesised 2026-06-20

Narrative

The academic literature on real-world LLM deployment has grown rapidly since mid-2025, with a cluster of empirical papers providing direct evidence on deployment shapes, cost structures, and workload patterns. A May 2025 arXiv preprint (Hou et al., arXiv:2505.02502) reported an internet-wide scan that identified over 320,000 publicly exposed LLM services across 15 frameworks, showing how quickly self-hosted inference has proliferated outside controlled enterprise environments, with Ollama, vLLM, and LM Studio among the most prevalent stacks. An August 2025 arXiv paper (Pan et al., arXiv:2509.18101) offered a formal cost-benefit analysis of on-premise versus cloud deployment for Qwen, Llama, and Mistral variants. It found that break-even periods range from a few months for small models to roughly five years for large ones, and that on-premise deployment is viable only when an organisation processes at least 50 million tokens per month or operates under strict data residency requirements. An SSRN working paper (Zhang, Shi, and Tang, June 2025, abstract 5296479) approached the same decision from an economic theory perspective, analysing how data privacy concerns and user heterogeneity shape equilibrium choices between cloud and on-premise deployment, and showing that open-source entrants may be structurally disadvantaged in competitive markets once incumbents offer localisation options.
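The break-even logic behind the on-premise versus cloud comparison can be sketched in a few lines. This is a minimal illustration, not the model used by Pan et al.: it assumes a one-off hardware cost, a flat monthly operating cost, and a single blended API price per million tokens. The function names and every figure below are hypothetical placeholders chosen only to show the arithmetic.

```python
def monthly_cloud_cost(tokens_per_month, price_per_million_tokens):
    """Cloud API spend for one month at a flat blended token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_months(hardware_capex, monthly_opex,
                      tokens_per_month, price_per_million_tokens):
    """Months until on-premise hardware pays for itself.

    Returns None when cloud spend never exceeds on-premise running
    costs, i.e. when the volume is too low for on-premise to break even.
    """
    saving = monthly_cloud_cost(tokens_per_month, price_per_million_tokens) - monthly_opex
    if saving <= 0:
        return None
    return hardware_capex / saving


# Hypothetical inputs (NOT taken from the paper): 15,000 USD of hardware,
# 300 USD/month power and maintenance, 10 USD per million tokens on the API.
CAPEX, OPEX, PRICE = 15_000, 300, 10

for volume in (20_000_000, 50_000_000, 200_000_000):
    months = break_even_months(CAPEX, OPEX, volume, PRICE)
    print(volume, "tokens/month ->", months)
```

Under these placeholder figures, 20 million tokens per month never breaks even, 50 million takes 75 months, and 200 million takes under nine months, which shows why the conclusion is so sensitive to usage volume.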

On sector adoption, a January 2025 arXiv position paper (Xu et al., arXiv:2501.09906) analysed 201 foundation models and 6,198 arXiv papers, finding that closed LLMs led by GPT-4 dominate high-performance healthcare applications such as medical imaging and multimodal diagnostics, while open models including Llama gain traction for cost-sensitive tasks like mental health dialogue and patient communication. A broader August 2025 arXiv survey (arXiv:2508.19667) found specialised LLMs deployed across healthcare, finance, law, education, and manufacturing. The arXiv legal-agents survey (arXiv:2601.06216) and the critical-domains survey (Chen et al., arXiv:2405.01769) document active fine-tuning and RAG deployments in law and finance, but note persistent accuracy limitations that constrain autonomous deployment. The academic evidence consistently separates sectors by compliance sensitivity: healthcare and finance prefer on-premise or private-cloud closed models for regulated workloads, while research and content functions absorb open-weight alternatives.

METR's empirical evaluations supply the most controlled evidence on real-world LLM productivity and autonomous capability. Its July 2025 randomised controlled trial (Becker et al., arXiv:2507.09089) found that experienced open-source developers using early-2025 AI tooling (Cursor Pro with Claude 3.5/3.7 Sonnet) completed tasks 19 per cent more slowly than without it, despite self-reporting a perceived 20 per cent speedup. METR separately evaluated GPT-5 before deployment (August 2025) and estimated its 50 per cent time-horizon on agentic software engineering tasks at approximately 2 hours 15 minutes, placing it well above earlier frontier models but below thresholds for catastrophic autonomous risk. METR's HCAST benchmark infrastructure and RE-Bench suite for ML research engineering tasks now underpin several pre-deployment assessments of Claude, GPT, and DeepSeek models, providing consistent human-calibrated comparison baselines. The SWE-Bench and SWE-Lancer lineage (arXiv:2502.12115, arXiv:2509.16941) independently shows that frontier models reach solve rates of around 72 to 75 per cent on verified repository-level tasks, and that this saturation is driving researchers toward harder benchmarks with real economic stakes.
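A 50 per cent time-horizon can be read as the task length at which a fitted success curve crosses 50 per cent. The sketch below assumes a logistic curve over the base-2 logarithm of human task length in minutes, which is the general shape METR describes for this metric. The coefficients are hypothetical and were chosen only so that the curve reproduces a horizon of 135 minutes (2 hours 15 minutes); they are not METR's fitted values.

```python
import math


def success_probability(task_minutes, intercept, slope):
    """Predicted success rate for a task that takes a human `task_minutes`."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(p, intercept, slope):
    """Task length in minutes at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical coefficients: success falls as tasks get longer (negative
# slope), and the intercept is set so the 50% horizon is 135 minutes.
SLOPE = -0.6
INTERCEPT = -SLOPE * math.log2(135)

print(time_horizon(0.5, INTERCEPT, SLOPE))   # 50% horizon in minutes
print(time_horizon(0.8, INTERCEPT, SLOPE))   # a stricter 80% horizon is shorter
```

Because the curve slopes downward, any reliability threshold stricter than 50 per cent yields a shorter horizon, so a headline time-horizon figure is only meaningful alongside the success rate it was computed for.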

On inference cost, Epoch AI's March 2025 analysis (Cottier et al.) documented price declines of 9x to 900x per year depending on the performance milestone, with the steepest drops occurring in the 12 months before publication. A November 2025 arXiv preprint (arXiv:2511.23455) formalised the price-performance relationship econometrically, introducing a tiered super-Moore's law hypothesis in which economy and mid-tier segments see prices halve roughly every 1.1 to 1.55 years. The arXiv survey on LLM routing (arXiv:2603.04445) and the ICLR 2025 RouteLLM paper document a growing research programme around cascading and routing architectures that send simpler queries to cheaper models and complex ones to frontier systems, with a reported routing overhead of under 0.4 per cent of GPT-4 generation cost. An April 2026 arXiv working paper (arXiv:2604.00626) on on-policy distillation found adoption across Qwen3, DeepSeek, Gemma, and other production pipelines as a core ingredient for compressing reasoning capability into smaller, cheaper models, weakening the economic case for always-on frontier API access. Despite these advances, independent measurement of actual enterprise spend remains sparse in the academic literature: most numeric claims derive from vendor surveys, analyst estimates, or industry blogs rather than audited procurement data.
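The two price-trend figures above use different units: an annual decline factor (9x to 900x per year) and a price half-life (1.1 to 1.55 years). A small conversion makes them comparable. The sketch assumes a constant exponential decline, which is a simplification rather than a claim made by either source, and the function names are illustrative.

```python
import math


def annual_factor_from_half_life(half_life_years):
    """Factor by which price falls in one year, given the halving time."""
    return 2.0 ** (1.0 / half_life_years)


def half_life_from_annual_factor(annual_factor):
    """Years for price to halve, given the per-year decline factor."""
    return math.log(2.0) / math.log(annual_factor)


# Half-lives quoted for economy and mid-tier segments
for half_life in (1.1, 1.55):
    print(half_life, "yr half-life ->",
          round(annual_factor_from_half_life(half_life), 2), "x per year")

# Annual decline factors quoted by Epoch AI
for factor in (9, 900):
    print(factor, "x per year ->",
          round(12 * half_life_from_annual_factor(factor), 1), "months to halve")
```

Under this assumption, halving every 1.1 to 1.55 years corresponds to a decline of roughly 1.6x to 1.9x per year, while 9x to 900x per year corresponds to halving every one to four months. The two sources measure different things (list prices by market segment versus the price of reaching a fixed performance milestone), so the gap reflects a difference in definition and not necessarily a contradiction.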


Sources

ID | Title | Outlet | Date | Significance
a1 | Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study | arXiv | 2025-05 | Internet-wide scan of 320,102 public-facing LLM services across 15 frameworks, providing empirical data on the prevalence of self-hosted inference stacks in real-world deployment.
a2 | A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services | arXiv | 2025-08 | Formal cost-benefit framework comparing on-premise (Qwen, Llama, Mistral) to cloud subscription costs, quantifying break-even points by model size and usage volume.
a3 | Cloud or On-Premise? A Strategic View of Large Language Model Deployment | SSRN | 2025-06 | Economic theory analysis of cloud versus on-premise LLM deployment decisions, modelling the role of data privacy, user heterogeneity, and competitive dynamics between closed and open-source providers.
a4 | Position: Open and Closed Large Language Models in Healthcare | arXiv | 2025-01 | Analysis of 201 foundation models and 6,198 arXiv papers showing closed LLMs dominate high-performance healthcare applications while open models gain traction for adaptable, cost-sensitive tasks.
a5 | Survey of Specialized Large Language Models | arXiv | 2025-08 | Comprehensive survey documenting sector-wide adoption of specialised LLMs across healthcare, finance, law, education, and manufacturing between 2022 and 2025.
a6 | A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law | arXiv | 2024-05 | Systematic review of LLM applications across finance, healthcare, and law, documenting persistent accuracy limitations that constrain fully autonomous deployment in these sectors.
a7 | LLM Agents in Law: Taxonomy, Applications, and Challenges | arXiv | 2026-01 | Survey of LLM agent deployments in legal practice, covering multi-agent verification systems, compliance workflows, and the gap between pilot and production deployments in law firms.
a8 | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | METR | 2025-07 | Randomised controlled trial (N=246 tasks, 16 developers) finding a 19 per cent slowdown when using Cursor Pro with Claude 3.5/3.7 Sonnet, providing the most rigorous independent evidence on LLM productivity in software engineering.
a9 | Details about METR's Evaluation of OpenAI GPT-5 | METR | 2025-08 | Pre-deployment capability assessment of GPT-5, establishing a 50 per cent time-horizon of roughly 2 hours 15 minutes on agentic software tasks and finding early evidence of evaluation-awareness in model reasoning.
a10 | Details about METR's Evaluation of OpenAI GPT-5.1-Codex-Max | METR | 2025-11 | Longitudinal extension of METR's time-horizon evaluations, noting that observed AI agent productivity uplift lags benchmark capability scores, directly relevant to real-world deployment outcomes.
a11 | HCAST: Human-Calibrated Autonomy Software Tasks | METR | 2025 | Introduces METR's primary benchmark suite for measuring autonomous AI capability on software engineering tasks, with 563 human baseline attempts providing calibrated comparison across GPT, Claude, and DeepSeek models.
a12 | Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts (RE-Bench) | METR | 2024-11 | Introduces RE-Bench, the foundational ML research engineering benchmark used in METR's ongoing pre-deployment evaluations of frontier models including Claude and o1-preview.
a13 | SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? | arXiv | 2025-02 | Evaluates frontier LLMs on 1,488 real Upwork software engineering jobs worth $1 million USD, providing the most economically grounded benchmark for LLM code generation capability.
a14 | SWE-Bench Pro: Can AI Agents Solve Real-World Software Engineering Tasks? | arXiv | 2025-09 | Contamination-resistant, human-verified extension of SWE-bench designed to track frontier model progress on authentic software engineering tasks as the original benchmark approaches saturation.
a15 | LLM inference prices have fallen rapidly but unequally across tasks | Epoch AI | 2025-03 | Empirical study by Cottier et al. documenting price declines of 9x to 900x per year across six benchmarks, providing the most systematic independent evidence on LLM inference cost trends.
a16 | The Price of Progress: Price Performance and the Future of AI | arXiv | 2025-11 | Econometric formalisation of LLM token price-performance trends, introducing the tiered super-Moore's law hypothesis with empirically estimated price half-lives by market segment.
a17 | Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in Large Language Model Inference Services | arXiv | 2026-03 | First comprehensive empirical study of LLM token pricing market structure, documenting that price declines outpace Moore's Law in economy and mid-tier segments but not in the frontier tier.
a18 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | arXiv | 2025-01 | Comprehensive taxonomy of agentic RAG architectures covering healthcare, finance, education, and enterprise document processing, with practical analysis of production design trade-offs.
a19 | Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs | arXiv | 2025-07 | Survey unifying RAG and reasoning research streams, documenting the emergence of agentic deep research as a production workload category distinct from naive RAG.
a20 | An Empirical Study of Agent Developer Practices in AI Agent Frameworks | arXiv | 2025-12 | Analysis of 1,575 GitHub projects on agent development, identifying LangGraph's rapid adoption in production deployments despite lower star counts than more popular frameworks.
a21 | Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey | arXiv | 2026-03 | Survey of routing and cascading architectures for cost-efficient multi-model deployment, covering FrugalGPT and successor systems that can reduce inference costs by up to 98 per cent while maintaining accuracy.
a22 | RouteLLM: Learning to Route LLMs with Preference Data | ICLR 2025 | 2025 | Demonstrates that preference-data-trained routers can achieve over 50x cost savings by directing simpler queries to smaller models with minimal overhead.
a23 | A Survey of On-Policy Distillation for Large Language Models | arXiv | 2026-04 | Documents adoption of on-policy distillation as a core training ingredient across Qwen3, DeepSeek, and Gemma production pipelines, explaining how smaller models are closing the cost-performance gap.
a24 | Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs | arXiv | 2026-01 | Empirical study of reliability failures in user-managed open-source LLM deployments of DeepSeek, Llama, and Qwen, filling a gap between cloud-API and training-level failure research.
a25 | Evaluation and Benchmarking of LLM Agents: A Survey | arXiv | 2025-07 | Comprehensive survey of agent evaluation methodology covering task suites from SWE-bench through METR's HCAST, providing a map of how agent capability is measured across production-relevant workloads.
