Research Explainer · Mohsin et al. (2026)

Scaling LLMs hits five hard ceilings, and more parameters won't break through any of them

A 67-page theoretical synthesis proves that hallucination, context compression, reasoning collapse, retrieval fragility, and multimodal misalignment are mathematical inevitabilities, not engineering problems awaiting bigger budgets.

Key Contribution

This paper provides the first unified, proof-based framework connecting five persistent LLM failure modes to formal impossibility results from computability theory, information theory, and statistical learning. The core argument: diagonalization guarantees every model must fail on some inputs, finite capacity forces compression errors on rare facts, and softmax crowding means effective context scales sub-linearly with nominal window size. These are not bugs to be fixed. They are ceilings to be managed.

Hallucination

Diagonalization proves every enumerable model class must produce incorrect outputs on some inputs. Undecidable problems (halting-style queries) force infinite failure sets. No amount of training data fixes this.

Context Compression

Positional undertraining, sinusoidal/RoPE attenuation, and softmax crowding jointly compress effective context far below nominal window size. A 128K-token model may effectively use only 64K.

Reasoning Degradation

Likelihood training rewards pattern completion over logical entailment. Chain-of-thought traces are often "disposable mediators" with near-zero causal effect on final answers.

Retrieval Fragility

Token budgets force a trade-off between relevance and coverage; the two cannot be maximised simultaneously. Positional bias, semantic drift, and adversarial poisoning (five documents can achieve ~90% attack success) compound the problem.

Multimodal Misalignment

Language channels dominate gradients (157× more attention to text than visual tokens in VideoLLaMA-7B). Visual representations inherit caption statistics, not perceptual structure.

The paper's most striking contribution is Theorem 1, which applies Cantor-style diagonalization to the set of all computably enumerable LLMs. The construction is elegant: for any list of models, you can always build a computable ground-truth function that disagrees with each model on at least one input. Theorem 2 extends this to show each model fails on infinitely many inputs. This holds regardless of architecture (transformers, RNNs, state-space models), training procedure, or prompt engineering. Even using a second LLM to detect and correct hallucinations cannot eliminate them, because the correcting model is itself subject to the same theorem.
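A minimal sketch of the diagonal step, assuming models can be treated as callables over a shared query set (the names and enumeration below are illustrative, not the paper's formal construction):

    # Sketch of the Theorem 1 diagonal argument (illustrative, not the paper's formalism).
    # Given any enumeration of models m_0, m_1, ... and inputs x_0, x_1, ..., define a
    # ground truth that disagrees with model i on x_i, so no listed model computes it.

    def diagonal_truth(models, inputs):
        truth = {}
        for model, x in zip(models, inputs):
            predicted = model(x)
            # Choose any answer other than the model's own prediction.
            truth[x] = predicted + " [wrong]" if predicted else "nonempty"
        return truth

    # Toy enumeration: two "models" that each return a fixed answer.
    models = [lambda x: "yes", lambda x: "no"]
    inputs = ["query_0", "query_1"]
    f = diagonal_truth(models, inputs)
    assert all(m(x) != f[x] for m, x in zip(models, inputs))  # every model errs somewhere

The same move applies to any correcting model appended to the list: it, too, gets a diagonal input on which it disagrees with the constructed ground truth.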

Beyond this computability argument, Theorem 3 tackles undecidability head-on. Any LLM attempting to approximate the Halting Problem must produce infinitely many wrong answers. The proof is a clean contradiction: if the failure set were finite, you could build a Turing machine that decides the Halting Problem, which is impossible. The practical relevance is higher than it sounds. Users routinely ask LLMs questions that reduce to undecidable problems: "Will this loop terminate?" or "Does this axiom set contain a contradiction?" On these, hallucination is not a risk. It is a certainty.
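A hedged sketch of the contradiction, with approx_halts and the finite error set as placeholder assumptions rather than anything the paper defines:

    # Sketch of the Theorem 3 contradiction (illustrative). Suppose a model guessed
    # halting behaviour and erred on only finitely many programs: that finite error
    # set could be hard-coded, yielding an exact halting decider, which cannot exist.

    def patched_halting_decider(approx_halts, finite_error_set):
        """approx_halts: any computable guesser; finite_error_set: the programs it gets wrong.
        If such a finite set existed, the returned function would decide the Halting Problem."""
        def decides(program):
            guess = approx_halts(program)
            # On the finitely many known errors, flip the guess; elsewhere trust it.
            return (not guess) if program in finite_error_set else guess
        return decides

    # No computable approx_halts with a finite error set can exist, because the
    # patched decider would contradict Turing's undecidability result.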

The statistical layer adds a third ceiling. Theorem 4 shows that learning arbitrary, structureless facts (birthdates, numerical constants, rare entity attributes) requires sample complexity that scales linearly with the number of facts. For millions of rare entities, this exceeds any feasible training corpus. Web scrapes contain 2-3% demonstrably false claims, and the next-token prediction objective treats all training text equally, optimising for likelihood rather than veracity. Frequently repeated misinformation dominates the learned distribution.
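A back-of-envelope illustration of why linear sample complexity bites; every number below is an assumed placeholder, not a figure from the paper:

    # Illustration of Theorem 4's linear scaling (numbers are assumptions, not from
    # the paper). Structureless facts do not generalise across entities, so the cost
    # of learning them adds up fact by fact instead of being amortised.

    rare_entities = 10_000_000          # assumed number of long-tail entities
    mentions_needed_per_fact = 100      # assumed exposures for reliable recall
    tokens_per_mention = 50             # assumed tokens per informative mention

    required_tokens = rare_entities * mentions_needed_per_fact * tokens_per_mention
    print(f"Tokens needed just for the long tail: {required_tokens:,}")  # 50,000,000,000
    # A corpus can contain this many tokens, but not 100 on-topic mentions of every
    # one of millions of rare entities; coverage, not raw size, is the binding constraint.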

The paper identifies three mechanisms that compress effective context well below nominal capacity. First, training data is heavily left-skewed: in the SlimPajama corpus, fewer than 20% of training pairs involve distances in the upper half of a 2048-token window, and fewer than 5% involve the extreme end. Lemma 2 formalises this as positional undertraining, proving that attention weights for distant positions remain near their random initialisation because gradient updates scale with position frequency. The model simply never learns to use the far end of its window.
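A small sketch of the mechanism, using an assumed toy mix of document lengths rather than SlimPajama itself:

    # Sketch of positional undertraining (illustrative assumptions). When most
    # documents are much shorter than the window, token pairs at large distances are
    # rare, so attention at those offsets receives few gradient updates.

    from collections import Counter

    def distance_histogram(doc_lengths, window=2048):
        """Count query-key distances available within each document (capped at the window)."""
        hist = Counter()
        for n in doc_lengths:
            n = min(n, window)
            for d in range(1, n):
                hist[d] += n - d        # number of token pairs at distance d
        return hist

    # Assumed toy length mix: many short documents, few window-filling ones.
    doc_lengths = [256] * 900 + [1024] * 90 + [2048] * 10
    hist = distance_histogram(doc_lengths)
    total = sum(hist.values())
    far = sum(c for d, c in hist.items() if d >= 1024)
    print(f"Share of pairs with distance >= 1024: {far / total:.1%}")  # ~5% in this toy mix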

Second, positional encodings saturate. Lemma 3 shows that for sinusoidal encodings, the normalised dot product between two position vectors decays as 2/(Ω·Δ), where Δ is the token separation and Ω is the frequency bandwidth. At large separations, positions become nearly orthogonal. RoPE faces the same issue through phase misalignment. Extending context length without rescaling the base frequency produces steep perplexity increases.
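A quick numerical check of the saturation effect, assuming the standard sinusoidal encoding (the dimensions and base below are illustrative defaults, not the paper's settings):

    # Sketch of Lemma 3's saturation effect: normalised dot products between standard
    # sinusoidal position encodings shrink as token separation grows.

    import numpy as np

    def sinusoidal_pe(pos, d_model=128, base=10000.0):
        i = np.arange(d_model // 2)
        angles = pos / base ** (2 * i / d_model)
        return np.concatenate([np.sin(angles), np.cos(angles)])

    def normalised_similarity(p, q):
        a, b = sinusoidal_pe(p), sinusoidal_pe(q)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    for delta in (1, 16, 256, 4096, 65536):
        print(delta, round(normalised_similarity(0, delta), 3))
    # The similarity trends toward zero as separation grows (with some oscillation):
    # distant positions look nearly orthogonal, so attention struggles to reach them.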

Third, softmax creates a crowding problem. Lemma 4 proves that maintaining constant attention on one relevant token among N candidates requires the relevance score to grow as ln(N). A 128K context window needs dramatically sharper scoring than a 4K window to achieve the same retrieval precision. Without extremely fine-grained query-key alignment, attention diffuses across distractors. Llama 3.1, trained at 128K tokens, effectively leverages only about 64K.
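A minimal sketch of the crowding arithmetic, assuming one relevant token at score s against N-1 distractors at score 0 (a simplification for illustration):

    # Sketch of Lemma 4's crowding argument: the relevant token's attention weight is
    # e^s / (e^s + N - 1); holding that weight fixed forces s to grow like ln(N).

    import math

    def required_score(n_tokens, target_weight=0.5):
        """Score gap the relevant token needs for softmax to still give it target_weight."""
        # Solve e^s / (e^s + n - 1) = t  =>  s = ln(t * (n - 1) / (1 - t))
        return math.log(target_weight * (n_tokens - 1) / (1 - target_weight))

    for n in (4_096, 32_768, 131_072):
        print(f"{n:>7} tokens -> required score gap {required_score(n):.2f}")
    # The gap grows logarithmically, but it must come from query-key dot products that
    # are noisy at scale; without sharper alignment, attention diffuses over distractors.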

The paper frames LLM reasoning failures through a causal lens. Standard training maximises P(Y|X) by marginalising over chain-of-thought traces Z, which means a fluent but non-causal trace can persist as long as the final answer likelihood stays high. Empirical results cited in the paper show that the indirect effect of Z on Y is approximately zero on many tasks: the chain-of-thought is a "disposable mediator" that the model generates for stylistic reasons, not because it causally contributes to the answer.
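A hedged sketch of how such a mediation probe might look; the toy scorer stands in for a model's answer log-probability and is not the paper's protocol:

    # Mediation-style probe for "disposable" chains of thought (illustrative). The toy
    # scorer below ignores the trace entirely, mimicking a model whose final answer
    # does not depend on its stated reasoning, so the measured indirect effect is zero.

    def indirect_effect(answer_logprob, question, trace, null_trace):
        """Difference in answer log-probability with the real trace vs. a scrambled one.
        A value near zero suggests the trace is not causally driving the answer."""
        return answer_logprob(question, trace) - answer_logprob(question, null_trace)

    def toy_scorer(question, trace):
        # Stand-in for log P(answer | question, trace); the trace is unused on purpose.
        return -0.2 if "2+2" in question else -1.5

    print(indirect_effect(toy_scorer, "What is 2+2?", "First add 2 and 2...", "lorem ipsum"))
    # -> 0.0: for this toy scorer, the chain-of-thought is a disposable mediator.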

This creates a concrete problem for reasoning models like OpenAI's o1 or DeepSeek-R1. The paper introduces a reasoning efficiency metric, η = E[Q/C], measuring quality per unit of compute. Recent reasoning models can "overthink," producing long chains with redundant steps that raise compute cost without commensurate gains in quality. One study cited found that prompting models to "think step-by-step" slows inference by 35-600% while yielding little or no accuracy benefit for stronger models.
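A toy illustration of the metric, with assumed quality scores and token costs (not figures from the paper):

    # Toy illustration of the reasoning-efficiency metric eta = E[Q / C].

    def reasoning_efficiency(samples):
        """samples: list of (quality, compute_cost) pairs; mean quality per unit compute."""
        return sum(q / c for q, c in samples) / len(samples)

    concise = [(0.82, 300), (0.78, 250), (0.85, 320)]            # short traces, cost in tokens
    overthinking = [(0.84, 2400), (0.80, 2100), (0.86, 2600)]    # long traces, marginal gains
    print("concise      eta:", round(reasoning_efficiency(concise), 5))
    print("overthinking eta:", round(reasoning_efficiency(overthinking), 5))
    # Longer traces barely raise quality but inflate cost, so eta collapses.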

The proposed unified objective augments likelihood with verification and cost regularisation, ensuring each intermediate step is causally meaningful. Practical instantiations include solver-based methods that translate problems to symbolic logic, prompt-based methods like Tree-of-Thoughts, and fine-tuning on logic-augmented corpora. The paper frames consistency enforcement as analogous to parity checks in error-correcting codes, iteratively flipping beliefs until all logical constraints are satisfied.
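A minimal sketch of the parity-check analogy, using an assumed toy representation of beliefs and constraints rather than the paper's objective:

    # Sketch of consistency enforcement as parity checks (illustrative): treat beliefs
    # as bits with confidences and logical constraints as checks; repeatedly flip the
    # least-confident belief in a violated constraint until every constraint holds.

    def enforce_consistency(beliefs, confidences, constraints, max_iters=100):
        """beliefs: dict name -> bool; constraints: list of (names, check_fn) pairs."""
        for _ in range(max_iters):
            violated = [c for c in constraints if not c[1](beliefs)]
            if not violated:
                return beliefs
            names, _check = violated[0]
            weakest = min(names, key=lambda n: confidences[n])
            beliefs[weakest] = not beliefs[weakest]
        return beliefs

    # Toy example: "it rained" implies "the ground is wet".
    beliefs = {"rained": True, "ground_wet": False}
    confidences = {"rained": 0.9, "ground_wet": 0.4}
    constraints = [(["rained", "ground_wet"], lambda b: (not b["rained"]) or b["ground_wet"])]
    print(enforce_consistency(beliefs, confidences, constraints))
    # -> flips the weaker belief: {'rained': True, 'ground_wet': True}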

Retrieval-augmented generation was supposed to fix hallucination by grounding models in external knowledge. The paper shows it inherits all five limitations instead. Token budgets force a relevance-coverage dilemma: precision-oriented retrievers omit multi-hop evidence, while recall-oriented retrievers dilute signal-to-noise. The "lost-in-the-middle" effect means tokens near the start and end of the prompt receive disproportionate attention. Relocating gold passages to mid-context measurably reduces answer recall. And adversarial poisoning is strikingly efficient: PoisonedRAG achieves approximately 90% attack success rates by inserting just five poisoned documents per target query into million-scale knowledge bases.
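A small sketch of the relevance-coverage dilemma under a fixed budget; the passages, scores, and topics are assumed placeholders:

    # Sketch of the relevance-coverage dilemma under a token budget (illustrative).
    # A precision-oriented retriever picks the highest-scoring passages and misses a
    # hop; a coverage-oriented one spends budget on weaker but complementary evidence.

    def select_passages(passages, budget, coverage_weight=0.0):
        """Greedy pick by relevance plus a bonus for topics not yet covered."""
        chosen, covered, used = [], set(), 0
        remaining = list(passages)
        while remaining:
            best = max(remaining,
                       key=lambda p: p["relevance"] + coverage_weight * (p["topic"] not in covered))
            remaining.remove(best)
            if used + best["tokens"] <= budget:
                chosen.append(best["id"]); covered.add(best["topic"]); used += best["tokens"]
        return chosen, covered

    passages = [
        {"id": "A", "topic": "hop1", "relevance": 0.95, "tokens": 700},
        {"id": "B", "topic": "hop1", "relevance": 0.90, "tokens": 700},
        {"id": "C", "topic": "hop2", "relevance": 0.55, "tokens": 700},
    ]
    print(select_passages(passages, budget=1400, coverage_weight=0.0))  # precision: misses hop2
    print(select_passages(passages, budget=1400, coverage_weight=0.5))  # coverage: dilutes relevance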

Multimodal models fare no better. The paper documents architectural colonization, where linguistic representations systematically dominate non-linguistic modalities. In VideoLLaMA-7B, output tokens attend to text tokens 157 times more than to visual tokens on a per-token basis. CLIP-based vision encoders learn representations grounded in caption co-occurrence statistics rather than perceptual properties: the embedding for "dog" approximates the linguistic descriptions most frequently associated with dogs in internet text, not the visual entity itself. Proposition 5 proves that what such an encoder can capture about any perceptual property is upper-bounded by the mutual information between captions and that property; properties captions never describe remain unrepresented in the model.

Multimodal scaling laws are "fractured." The interaction term between modalities can produce anti-scaling: adding data in one modality can paradoxically worsen overall performance when it deepens modality imbalance. Alignment noise compounds non-linearly with dataset size, causing signal-to-noise ratios to decay roughly as log(N)⁻¹ at billion-example scale. Adding modalities expands the space of failure modes while preserving the theoretical ceilings.

The Bottom Line

The question is not how to make LLMs infallible. It is how to make their fallibility quantifiable, predictable, and aligned with task goals. The paper proposes a paradigm shift: calibrated abstention over confident fabrication, task-aware decoding that modulates entropy when factual accuracy is critical, confidence-aware benchmarks that reward honest uncertainty over guessing, and retrieval used as bounded oracle access under finite token budgets. The future of reliable AI lies not in unbounded scaling, but in designing systems that fail gracefully, predictably, and transparently.

Reference

Mohsin, M. A., Umer, M., Bilal, A., Memon, Z., Qadir, M. I., Bhattacharya, S., Rizwan, H., Gorle, A. R., Kazmi, M. Z., Amir, N., Subhan, A., Rafique, M. U., He, Z., Mehta, P., Han, J., Jamshed, M. A., Hougen, D., & Cioffi, J. M. (2026). On the fundamental limits of LLMs at scale. Under review as submission to TMLR. arXiv:2511.12869v2