Explainer Collection

AI Capabilities

Litowitz (2026)

The token economy has a finite energy budget, but the real bottleneck is knowing which questions to ask

Litowitz, Polson and Sokolov treat the AI token as a physical quantity with measurable thermodynamic cost, then build a MacKay-style balance sheet showing that projected 2028 US infrastructure could supply 225,000 tokens per person per day, over 1,000× current usage. The binding constraint, they argue, is not compute but the human capacity to formulate questions worth answering.

5 × 10¹⁹
efficiency gap between actual energy cost per token (1.8 J) and the Landauer thermodynamic floor (3.4 × 10⁻²⁰ J)
225,000
tokens per person per day that projected 2028 US AI energy (326 TWh) could support, roughly a novel's worth of text
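A quick back-of-the-envelope check, using only the figures quoted above plus a standard words-per-token heuristic (the 0.75 ratio is an assumption, not a number from the paper), reproduces the headline efficiency gap and the "novel's worth of text" comparison:

```python
# Sanity checks on the card's figures (inputs taken from the summary above).

energy_per_token_j = 1.8      # reported energy cost per token
landauer_floor_j = 3.4e-20    # Landauer thermodynamic floor per token

efficiency_gap = energy_per_token_j / landauer_floor_j
print(f"efficiency gap ~ {efficiency_gap:.1e}")   # ~ 5.3e+19, i.e. ~5 x 10^19

tokens_per_person_day = 225_000
words_per_token = 0.75        # rough English-text rule of thumb (assumption)
words_per_day = tokens_per_person_day * words_per_token
print(f"~ {words_per_day:,.0f} words per person per day")  # ~ 169,000, long-novel territory
```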

Zandieh (2025)

A random rotation turns vector quantization into a solved problem, within a factor of 2.7 of perfection

TurboQuant achieves near-optimal distortion for both MSE and inner product metrics across all bit-widths, requires zero preprocessing, and matches full-precision LLM quality at 3.5 bits per channel.

≈2.7×
factor from information-theoretic optimality for MSE distortion, shrinking to 1.45× at 1-bit width
3.5 bits
per channel matches the quality of a full-precision KV cache on LongBench
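A minimal sketch of the core idea, not TurboQuant itself: apply a random orthogonal rotation so that coordinates look roughly Gaussian and evenly scaled, then quantize each channel with a plain uniform scalar quantizer. The function names, bit-width, and test vector below are illustrative.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))           # sign fix so Q is uniformly distributed

def quantize(x, bits=4):
    """Uniform scalar quantization of each coordinate after rotation."""
    levels = 2 ** bits
    scale = np.abs(x).max() / (levels / 2)    # symmetric range, one shared scale
    codes = np.clip(np.round(x / scale), -levels // 2, levels // 2 - 1)
    return codes.astype(np.int8), scale

d = 128
Q = random_rotation(d)
v = np.random.default_rng(1).standard_normal(d) * np.linspace(0.1, 5.0, d)  # badly scaled input

codes, scale = quantize(Q @ v, bits=4)        # rotate, then quantize
v_hat = Q.T @ (codes * scale)                 # dequantize, then rotate back
print("relative MSE:", np.mean((v - v_hat) ** 2) / np.mean(v ** 2))
```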

Wu, Sun, Li, Welleck & Yang (2025)

Smaller models with smarter inference beat bigger models, and it's not even close

A 7-billion-parameter model paired with a novel tree search algorithm consistently outperforms a model five times its size on maths benchmarks, using half the compute. The trick is spending your budget on thinking harder, not on being bigger.

2× fewer FLOPs
needed for Llemma-7B with REBASE to match Llemma-34B accuracy on MATH500 and GSM8K benchmarks
7× less compute
used by REBASE with a 7B model to surpass sampling-based weighted voting with 256 samples on MATH500
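One way to see how a compute-matched comparison works, using the standard approximation of about 2 × parameters FLOPs per generated token; the solution length below is illustrative and not taken from the paper's accounting.

```python
# Rough inference-cost model: FLOPs ~ 2 * parameters per generated token
# (standard approximation; the paper's own accounting may differ).

def inference_flops(params, tokens_generated):
    return 2 * params * tokens_generated

small, large = 7e9, 34e9          # Llemma-7B vs Llemma-34B parameter counts
tokens_per_solution = 512         # illustrative average solution length (assumption)

# At equal compute, the smaller model can afford ~4.9x more candidate solutions,
# which is the budget a tree search like REBASE gets to spend on "thinking harder".
budget = inference_flops(large, tokens_per_solution)          # one pass of the 34B model
print(budget / inference_flops(small, tokens_per_solution))   # ~ 4.86
```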

Wang et al. (2025)

Letting AI models "think longer" hits a wall, but the math tells you exactly where that wall is

A unified probabilistic model shows that both parallel sampling and sequential rethinking strategies for large reasoning models converge to the same saturation formula, letting you calculate the exact point where more compute stops helping.

+23.3 pp
accuracy gain on AIME 2024 for a 1.5B model using parallel scaling with 32 samples, rising from 30.0% to 53.3%
r = 0.803
Pearson correlation between predicted and observed saturation points for the 7B model with parallel scaling on MATH-500

Song, Han & Goodman (2026)

LLMs ace reasoning benchmarks, but they keep failing in ways that should be embarrassingly easy

The first comprehensive survey of LLM reasoning failures catalogues every known way these models break down, from reversing simple facts to misjudging whether a house fits inside a light bulb, and maps the root causes to a two-axis taxonomy of reasoning type versus failure type.

Polo, Somerstep, Choshen, Sun & Yurochkin (2025)

LLM benchmarks are correlated for a reason, and exploiting that lets you predict performance without training the model

Sloth introduces skill-based scaling laws that treat benchmark scores as reflections of three latent abilities, predicting multi-benchmark performance across model families from a single small model per family.

12
benchmarks across Open LLM Leaderboard v1 and v2 used to fit and validate the scaling law, covering reasoning, knowledge, and instruction following
2.9 pp
average mean-absolute-error for Sloth (d=2) on Leaderboard v2, the lowest of any method tested in the leave-one-out evaluation
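A generic illustration of the latent-skills idea rather than Sloth's actual estimator: factor a models × benchmarks score matrix into a small number of latent abilities with a truncated SVD, then read each benchmark as a mixture of those abilities. The toy scores below are made up.

```python
import numpy as np

# Toy score matrix: rows = models in one family, columns = benchmarks (values invented).
scores = np.array([
    [0.42, 0.38, 0.55, 0.30],
    [0.51, 0.47, 0.62, 0.41],
    [0.63, 0.60, 0.70, 0.55],
    [0.71, 0.69, 0.74, 0.66],
])

d = 2  # number of latent abilities, matching the d=2 setting in the headline figure
U, S, Vt = np.linalg.svd(scores - scores.mean(axis=0), full_matrices=False)

model_skills = U[:, :d] * S[:d]      # each model's position in ability space
benchmark_loadings = Vt[:d]          # how strongly each benchmark reflects each ability

# Rank-d reconstruction: predicted scores from the latent abilities alone.
predicted = model_skills @ benchmark_loadings + scores.mean(axis=0)
print("mean absolute error:", np.abs(predicted - scores).mean())
```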

Prucs, Csutora, Antal & Marosi (2025)

Reasoning models hit a compute ceiling, but sparse architectures keep climbing

A benchmark study of 19 open-source LLMs finds that Mixture-of-Experts models consistently dominate the accuracy-per-FLOP frontier, while all architectures eventually reach a task-dependent saturation point where more thinking time stops helping.

97%
of evaluated models burn more compute on wrong answers than right ones, revealing a costly failure mode where models spiral into longer traces when stuck
19 models
tested across five benchmarks spanning grade-school maths (GSM8K) to graduate-level science (GPQA-Diamond), each with full chain-of-thought decoding

Mohsin et al. (2026)

Scaling LLMs hits five hard ceilings, and more parameters won't break through any of them

A 67-page theoretical synthesis proves that hallucination, context compression, reasoning collapse, retrieval fragility, and multimodal misalignment are mathematical inevitabilities, not engineering problems awaiting bigger budgets.

#1
Hallucination is mathematically inevitable: Theorem 1 proves it via Cantor-style diagonalization
#2
Effective context ≪ nominal window: <5% of SlimPajama pairs span the upper half of 2048 tokens

Chi et al. (2024)

LLMs look like causal reasoners, but they're mostly just remembering

When tested on fresh news articles they couldn't have seen during training, four leading language models showed dramatic accuracy drops on cause-and-effect questions, revealing that their apparent causal reasoning is largely a retrieval trick.

<70%
exact-match accuracy for Claude 3 Opus on CausalProbe-H, the hard version of the fresh benchmark with deliberately misleading answer choices
3,461
unique causal Q&A items in the CausalProbe-2024 benchmark, all built from BBC and Guardian articles published after January 2024

Zheng et al. (2024)

Treat LLM calls like a program; then the cache starts doing real work

SGLang pairs a small Python DSL with a runtime that understands prompt structure, shared prefixes, and batching. That combination makes agent, reasoning, long-document, and vision workloads measurably faster, while also cutting the amount of glue code needed to build them.

30%
Agent workload gain
80%
First-token latency cut
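A small sketch of what "LLM calls as a program" looks like in the SGLang frontend, based on the DSL as presented in the paper (exact imports and endpoint details may differ by version). The shared system prompt and earlier turns form a reusable prefix that the runtime's cache can serve across calls.

```python
import sglang as sgl

@sgl.function
def multi_turn(s, question_1, question_2):
    # The decorated function is interpreted by SGLang; shared prefixes across calls
    # (e.g. the system prompt) can be reused by the runtime's prefix cache.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))

# Point the frontend at a running SGLang server (the address is illustrative).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn.run(
    question_1="What is the capital of the United Kingdom?",
    question_2="List two landmarks there.",
)
print(state["answer_2"])
```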

Xia, Lu, Zhu et al. (2025)

Most LLM agent evaluation stops at launch, so the same failures keep recurring in production

A multivocal review of 161 sources reveals that academic evaluation overwhelmingly focuses on pre-deployment benchmarks. The authors propose EDDOps, a process model and reference architecture that make evaluation a continuous, governing function across the entire agent lifecycle.

93.3%
Of academic sources focus evaluation only on pre-deployment (125/134 papers reviewed)
88.1%
Of academic sources rely on AI-only evaluators with no documented human review

Pan, Chodnekar, Roy & Wang (2025)

Running your own LLM can pay for itself in months, but only if you pick the right model size

A cost-benefit analysis of 54 deployment scenarios finds that small open-source models break even against commercial APIs in under three months on a $2,000 GPU, while large models can take years to justify their quarter-million-dollar hardware.

0.3 mo
Fastest break-even: small model (EXAONE 32B) vs Claude-4 Opus on a single RTX 5090
54
Deployment scenarios analysed across 9 open-source models and 5 commercial APIs
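The break-even logic is simple division. This sketch uses the hardware figure quoted above plus a hypothetical monthly API bill and self-hosting cost to show how a months-to-break-even number like 0.3 falls out; the monthly figures are illustrative, not the paper's.

```python
# Break-even time = upfront hardware cost / net monthly savings from leaving the API.

def months_to_break_even(hardware_cost, monthly_api_cost, monthly_self_host_cost):
    return hardware_cost / (monthly_api_cost - monthly_self_host_cost)

hardware = 2_000     # single consumer GPU, as in the small-model scenario above
api_bill = 7_000     # hypothetical monthly spend on a commercial API (assumption)
self_host = 300      # hypothetical monthly power and upkeep for the local box (assumption)

print(f"{months_to_break_even(hardware, api_bill, self_host):.1f} months")  # ~ 0.3
```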

Koc, Verre, Blank & Morgan (2025)

Your IDE should watch your AI's metrics, not just your code's syntax

A conceptual framework for wiring real-time LLM telemetry (traces, evaluations, prompt versions) directly into the code editor through the Model Context Protocol, turning prompt engineering from guesswork into a data-driven feedback loop.
