Research Explainer · Song, Han & Goodman (2026)

LLMs ace reasoning benchmarks, but they keep failing at tasks that should be embarrassingly easy

The first comprehensive survey of LLM reasoning failures catalogues the known ways these models break down, from reversing simple facts to misjudging whether a house fits inside a light bulb, and maps the root causes onto a two-axis taxonomy of reasoning type crossed with failure type.

Key Contribution

This paper introduces a structured two-axis framework for understanding LLM reasoning failures: one axis classifies the type of reasoning (informal/intuitive, formal/logical, and embodied/physical), while the other classifies the type of failure (fundamental architectural limits, application-specific limitations, and robustness issues from minor input variations). By unifying hundreds of fragmented studies under this single taxonomy, the survey reveals that many seemingly unrelated failures share the same root causes, and that current mitigations remain shallow and domain-specific.

LLMs have racked up record scores on mathematical olympiads, coding competitions, and scientific reasoning benchmarks. The headline numbers look commanding. The trouble starts when you ask these same models to do something a five-year-old manages without thinking: count letters in a word, reverse a known fact, or predict that a person who clearly sees popcorn in a transparent bag will not believe it contains chocolate.

Song, Han, and Goodman surveyed the entire landscape of these reasoning failures, pulling together research that had been published piecemeal across cognitive science, formal logic, mathematics, coding, and robotics. Their goal was not to settle whether LLMs "truly reason" (that debate rages on) but to build a map of where reasoning breaks, why it breaks, and what has been tried to fix it. The result is a two-axis taxonomy. The first axis sorts reasoning into informal (intuitive, social), formal (logic, maths, code), and embodied (physical-world interaction). The second axis sorts failures into fundamental (baked into the architecture), application-specific (showing up in particular domains), and robustness (performance collapses under trivial rephrasing).

The reasoning taxonomy: three failure domains

Informal Reasoning

Failures in cognitive skills (working memory, inhibitory control), cognitive biases (confirmation, anchoring, framing), Theory of Mind, moral reasoning, and multi-agent coordination. GPT-4 still struggles with false-belief tasks trivial for human children.

Formal Reasoning

The reversal curse (trained on "A is B," fails on "B is A"), compositional reasoning collapse (handles sub-problems individually, fails when combined), counting and basic arithmetic errors, and benchmark fragility under semantics-preserving perturbations.

Embodied Reasoning

Models lack physical grounding: they misjudge object sizes, spatial relationships, simple physics laws, and tool-use affordances. Even with visual input, VLMs fail at anomaly detection and 2D spatial tasks that humans solve instantly.

The survey's sharpest insight is that the three failure types cut across all reasoning domains. Fundamental failures trace back to intrinsic architectural constraints. The reversal curse, for instance, stems from the unidirectional training objective of autoregressive Transformers: the model's weights encode "A → B" but create no symmetric path from B back to A. Scaling alone cannot fix this, because the asymmetry is structural, not statistical. Similarly, LLMs suffer from proactive interference in working memory (earlier context disrupts retrieval of newer information) far more severely than humans do, a limitation attributed to the self-attention mechanism's dispersal of focus under complex tasks.
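
To see how such an asymmetry is measured in practice, here is a minimal sketch of a reversal-curse probe. The query_model helper is a hypothetical stand-in for whatever LLM client you use, and the fact pair is the widely cited Tom Cruise / Mary Lee Pfeiffer example from the reversal-curse literature; the survey itself does not prescribe this code.

```python
# Minimal reversal-curse probe (illustrative sketch, not from the survey).
# query_model is a hypothetical stand-in for a real LLM API client.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM call")

PROBES = [
    {
        "forward_q": "Who is Tom Cruise's mother?",
        "forward_a": "Mary Lee Pfeiffer",
        "reverse_q": "Who is Mary Lee Pfeiffer's son?",
        "reverse_a": "Tom Cruise",
    },
    # ... more (subject, relation, object) facts, asked in both directions
]

def direction_accuracy(probes) -> tuple[float, float]:
    """Accuracy on forward ("A is B") vs. reversed ("B is A") questions."""
    fwd = sum(p["forward_a"].lower() in query_model(p["forward_q"]).lower() for p in probes)
    rev = sum(p["reverse_a"].lower() in query_model(p["reverse_q"]).lower() for p in probes)
    n = len(probes)
    return fwd / n, rev / n  # the reversal curse predicts fwd >> rev
```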

Application-specific limitations cluster in predictable places. Theory of Mind remains brittle: GPT-4 can pass many standard ToM tests, yet minor rephrasing causes drastic performance drops. In embodied settings, models produce physically impossible action plans because they lack grounded representations of affordances and spatial dynamics. In mathematics, LLMs solve individual sub-problems correctly but fail the composed version, revealing that what looks like mathematical competence is often shallow pattern-matching across isolated steps.

Robustness issues are the most pervasive and, in practice, the most dangerous. Reordering multiple-choice options, renaming variables in code, swapping known and unknown quantities in a word problem, or adding irrelevant sentences to a prompt can all cause large, unpredictable shifts in model output. The survey finds that perturbation-based stress-testing has proved transferable across domains, making it a promising unified methodology for detecting hidden vulnerabilities.
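
As a rough illustration of what perturbation-based stress-testing involves, the sketch below applies two semantics-preserving transformations, option reordering and irrelevant-context injection, and checks whether the model still picks the correct answer across variants. The ask callable is a hypothetical stand-in for a model call; the survey describes the methodology, not this code.

```python
import random

LABELS = "ABCD"

def format_mcq(question: str, options: list[str]) -> str:
    """Render a multiple-choice item; only the option order varies across variants."""
    return "\n".join([question] + [f"{l}. {o}" for l, o in zip(LABELS, options)])

def option_reordering(question, options, answer, rng):
    """Shuffle the options; the correct answer text is unchanged."""
    perm = options[:]
    rng.shuffle(perm)
    return format_mcq(question, perm), perm.index(answer)

def context_injection(question, options, answer, distractor):
    """Prepend a fluent but irrelevant sentence; the task is unchanged."""
    return format_mcq(distractor + " " + question, options), options.index(answer)

def consistency(ask, question, options, answer, n_variants=10, seed=0):
    """Fraction of reordered variants the model answers correctly.
    ask(prompt) -> int is a hypothetical helper returning the chosen option index;
    context_injection can be mixed in as a second perturbation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_variants):
        prompt, gold = option_reordering(question, options, answer, rng)
        hits += ask(prompt) == gold
    return hits / n_variants
```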

Three failure types, one architecture

Fundamental

Intrinsic to the Transformer architecture and next-token prediction objective. The reversal curse, limited working memory, and heuristic-driven arithmetic all fall here. Scaling does not resolve them.

Application-Specific

Failures tied to particular domains: brittle Theory of Mind in social tasks, affordance errors in robotics, inability to generalise across novel maths problem structures. Domain-specific mitigations are required.

Robustness

Performance collapses under semantics-preserving changes: option reordering, variable renaming, irrelevant context injection, paraphrasing. The most pervasive and the hardest to defend against systematically.

The root causes converge on three recurring themes. First, training data patterns: biases in human language are absorbed wholesale, including cognitive biases like confirmation bias and negativity bias. Second, architectural features: causal masking introduces order-based biases independent of the data, and the autoregressive generation process has no built-in mechanism for detecting and correcting earlier mistakes in a reasoning chain. Third, alignment procedures: RLHF amplifies biases by aligning model behaviour with human raters who are themselves biased, producing surface-level ethical compliance that crumbles under adversarial prompts.
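
The "causal masking" mentioned above is easy to picture in code. The snippet below builds a standard lower-triangular attention mask (a generic illustration, not code from the survey): each position can attend only to itself and earlier positions, which is the structural reason information flows in one direction only.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Standard causal attention mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```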

Current mitigations fall into three categories, none of which solves the problem. Data-centric approaches curate training corpora to reduce biased content or augment them with reversed or permuted facts. In-processing techniques intervene during training or inference: adversarial training hardens models against perturbed inputs, while Chain-of-Thought prompting elicits explicit intermediate reasoning steps. Post-processing methods use prompt engineering or output filtering. All share the same weakness: they are task-specific patches that rarely transfer. Chain-of-Thought helps with compositional reasoning but does nothing for the reversal curse. Physics engines paired with LLMs improve embodied reasoning but cannot fix social norm inconsistencies. The survey argues that connecting behavioural errors to specific internal mechanisms (faulty attention heads, insufficient intermediate representation alignment) is the missing step toward general-purpose fixes.
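
As an example of the data-centric category, reversed-fact augmentation can be as simple as emitting both directions of each relational statement before training or fine-tuning. This is a generic sketch of the idea; the relation templates are made up for illustration.

```python
def augment_with_reversal(subject: str, relation: str, obj: str, inverse_relation: str) -> list[str]:
    """Emit a relational fact in both directions for training-data augmentation."""
    return [
        f"{subject} {relation} {obj}.",
        f"{obj} {inverse_relation} {subject}.",
    ]

# Example (hypothetical templates):
augment_with_reversal("Mary Lee Pfeiffer", "is the mother of", "Tom Cruise", "is the son of")
# -> ["Mary Lee Pfeiffer is the mother of Tom Cruise.",
#     "Tom Cruise is the son of Mary Lee Pfeiffer."]
```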

The survey identifies four concrete gaps. First, root cause analyses remain incomplete for compositional reasoning breakdowns, higher-order Theory of Mind failures, and multi-agent coordination collapse. Second, the field lacks unified, persistent failure benchmarks that span all failure types and track whether new models truly resolve old problems or just overfit to updated test sets. Third, "failure-injection" principles (adding adversarial sections, multi-level difficulty, cross-domain compositions) should be integrated into general reasoning benchmarks, not confined to dedicated robustness suites. Fourth, dynamic and event-driven benchmarks (partially private test sets, annually refreshed competition problems) would combat the increasingly obvious problem of benchmark contamination.
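
One way to picture the third and fourth recommendations is as a benchmark-item schema that carries adversarial variants, difficulty tiers, and refresh metadata alongside the question itself. The fields below are illustrative, not a format the survey defines.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """Hypothetical schema for a failure-injected, refreshable benchmark item."""
    question: str
    answer: str
    difficulty: int                    # e.g. 1 = base item, 3 = multi-step composition
    domains: list[str]                 # cross-domain compositions list more than one domain
    adversarial_variants: list[str] = field(default_factory=list)  # perturbed phrasings
    private: bool = False              # held out from public release to limit contamination
    refresh_batch: str = "annual"      # e.g. the competition year the item was drawn from
```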

The authors also note a structural blind spot in the literature: multi-turn, interactive contexts remain underrepresented, despite being far closer to real-world deployment conditions than the single-shot prompts that dominate current evaluation. The persistent coordination breakdowns observed in multi-agent simulations suggest that interactive failures may be qualitatively different from, and more consequential than, the static failures the field currently measures.

Why It Matters

Understanding failure is a prerequisite for building resilient systems. This survey reframes LLM reasoning failures not as scattered anecdotes but as a structured, interconnected problem space. The two-axis taxonomy reveals that fundamental architectural limits, domain-specific blind spots, and robustness vulnerabilities share root causes and demand coordinated solutions. As reasoning-specialised models become more prevalent, the authors argue that sustained attention to failure modes will be essential to ensure future LLMs do not just perform better at reasoning tasks, but "fail better": gracefully, transparently, and recoverably.

Reference

Song, P., Han, P., & Goodman, N. (2026). Large language model reasoning failures. Transactions on Machine Learning Research. https://arxiv.org/abs/2602.06176