Research Explainer · Prucs, Csutora, Antal & Marosi (2025)

Reasoning models hit a compute ceiling, but sparse architectures keep climbing

A benchmark study of 19 open-source LLMs finds that Mixture-of-Experts models consistently dominate the accuracy-per-FLOP frontier, while all architectures eventually reach a task-dependent saturation point where more thinking time stops helping.

97% of evaluated models burn more compute on wrong answers than right ones, revealing a costly failure mode where models spiral into longer traces when stuck

19 models tested across five benchmarks spanning grade-school maths (GSM8K) to graduate-level science (GPQA-Diamond), each with full chain-of-thought decoding

<10¹⁶ FLOPs is the saturation threshold on easier benchmarks like GSM8K and MATH500, beyond which additional inference compute yields negligible accuracy gains

Accuracy vs. Inference Compute: Dense Models vs. Mixture-of-Experts

Source: Prucs et al. (2025), Figure 1. Average accuracy across five benchmarks plotted against average KV-aware FLOPs per query (log scale). MoE = Mixture of Experts. Bubble size represents approximate total parameter count.

The researchers evaluated 19 open-source LLMs, from the 1-billion-parameter Gemma-3-1B up to 30-billion-parameter Qwen3 variants, on five reasoning benchmarks of escalating difficulty. GSM8K tests grade-school arithmetic. MATH500 and two competition-level sets (AIME 2025, HMMT February 2025) push into Olympiad-style territory. GPQA-Diamond asks graduate-level science questions written to resist simple web searches. Every model ran with full chain-of-thought decoding, a single pass per problem, and no majority-vote tricks.

The twist is in how they scored each model. Rather than reporting accuracy alone, the team estimated the floating-point operations (FLOPs) consumed by every reasoning trace, using a component-aware formula that accounts for grouped-query attention, gated feed-forward layers, and MoE routing. This let them plot each model's accuracy against its actual computational cost and identify the Pareto frontier, the set of models where you cannot improve accuracy without spending more compute.
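
To make that accounting concrete, here is a minimal Python sketch of what a component-aware per-token FLOP estimate can look like. The function name, configuration fields, and hyperparameter values are illustrative assumptions, not the paper's exact formula; the paper's KV-aware accounting follows the same idea of pricing each generated token at its current context length.

```python
from dataclasses import dataclass

@dataclass
class ModelCfg:
    n_layers: int        # transformer blocks
    d_model: int         # hidden size
    n_q_heads: int       # query heads
    n_kv_heads: int      # key/value heads (grouped-query attention)
    d_ff: int            # feed-forward inner width
    gated_ffn: bool      # gated (SwiGLU-style) FFNs use three projections
    active_experts: int = 1  # experts routed per token; 1 for a dense model

def flops_per_token(cfg: ModelCfg, context_len: int) -> float:
    """Rough forward-pass FLOPs for one generated token, counting each
    multiply-accumulate as two FLOPs."""
    d_head = cfg.d_model // cfg.n_q_heads
    # Attention projections: full-width queries and output, narrower K/V under GQA.
    proj = 2 * cfg.d_model * d_head * (cfg.n_q_heads + 2 * cfg.n_kv_heads) \
           + 2 * cfg.d_model * cfg.d_model
    # Scores and value mixing against the KV cache: scales with context length.
    attn = 2 * 2 * cfg.n_q_heads * d_head * context_len
    # Feed-forward: two projections (three if gated), times the routed experts.
    ffn = 2 * (3 if cfg.gated_ffn else 2) * cfg.d_model * cfg.d_ff * cfg.active_experts
    return cfg.n_layers * (proj + attn + ffn)

# Example with made-up hyperparameters, not any particular model's config:
cfg = ModelCfg(n_layers=32, d_model=4096, n_q_heads=32, n_kv_heads=8,
               d_ff=14336, gated_ffn=True)
print(f"{flops_per_token(cfg, context_len=2048):.2e} FLOPs per token")
```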

The headline finding is architectural: Mixture-of-Experts models sit on the Pareto frontier far more often than dense models do. An MoE like Qwen3-30B-A3B activates only 3 billion of its 30 billion parameters per token, which lets it generate much longer reasoning chains within the same FLOP budget. On GPQA-Diamond, AIME 2025, and HMMT February 2025, MoE models consistently dominated the frontier, achieving accuracy that dense models could match only by burning orders of magnitude more compute.

This matters because chain-of-thought reasoning is token-hungry. Models that "think" by generating thousands of intermediate tokens pay a fixed cost per token through the feed-forward layers, plus an attention cost that grows with the accumulated context, so the total cost of a trace scales roughly quadratically with its length. By activating a sparse subset of experts, MoE architectures keep the per-token cost low, which effectively decouples the model's total parameter knowledge from its per-token inference price. Sparsity, in other words, is the lever that makes extended reasoning affordable.
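
A toy cost model makes the trade-off visible. The numbers below are illustrative, not taken from the paper: each generated token pays a fixed cost through the active weights, plus an attention cost proportional to the context accumulated so far.

```python
def trace_flops(active_params: float, attn_per_ctx_token: float, n_tokens: int) -> float:
    """Total FLOPs to generate a reasoning trace of n_tokens tokens."""
    linear = 2 * active_params * n_tokens                           # weight FLOPs: linear in trace length
    quadratic = attn_per_ctx_token * n_tokens * (n_tokens - 1) / 2  # growing KV cache: quadratic
    return linear + quadratic

# Crude stand-in for the score/value FLOPs against one cached token, summed
# over layers; the absolute values only matter relative to each other here.
ATTN = 4 * 48 * 4096
for label, active in [("dense, 30B active", 30e9), ("MoE, 3B active", 3e9)]:
    cost = trace_flops(active, ATTN, n_tokens=8_000)
    print(f"{label}: {cost:.2e} FLOPs for an 8k-token trace")
# With the same budget, the sparse model's trace can be several times longer,
# until the quadratic attention term eventually takes over.
```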

Every benchmark shows the same shape: a steep initial climb where more compute buys meaningful accuracy gains, followed by a flattening "knee" after which additional FLOPs return almost nothing. Where that knee sits depends on how hard the task is. On GSM8K, models plateau below 10¹⁶ FLOPs because even small models can largely solve grade-school maths. On AIME 2025 and GPQA-Diamond, the knee shifts to much higher compute regimes, yet even there the curve eventually flattens. No amount of thinking time can overcome a model's fundamental "deductive horizon."
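
One simple way to operationalise the knee, purely as an illustration rather than the paper's procedure, is to scan the accuracy-compute curve and flag the point where the marginal gain per extra decade of FLOPs drops below a threshold:

```python
import math

def find_knee(points, min_gain_per_decade=1.0):
    """points: (flops, accuracy%) pairs. Returns the first compute level after
    which accuracy improves by less than min_gain_per_decade points per 10x
    increase in FLOPs; illustrative only, not the paper's procedure."""
    pts = sorted(points)
    for (f0, a0), (f1, a1) in zip(pts, pts[1:]):
        decades = math.log10(f1 / f0)
        if decades > 0 and (a1 - a0) / decades < min_gain_per_decade:
            return f0
    return pts[-1][0]  # no flattening within the sampled range

# Illustrative numbers only: steep early gains, then a plateau near 1e16 FLOPs.
curve = [(1e14, 40.0), (1e15, 75.0), (1e16, 90.0), (1e17, 90.5)]
print(f"knee at ~{find_knee(curve):.0e} FLOPs")
```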

Smaller dense models (1B to 8B parameters) exploit this dynamic in an interesting way. By generating substantially longer reasoning traces, they trade parameter capacity for inference-time compute and occasionally reach accuracy parity with 30B+ baselines. The catch is that this strategy has hard limits. Once a model's knowledge runs out, thinking longer produces only longer failures.

Perhaps the most practically useful finding: 97% of evaluated models spend more compute on incorrect answers than on correct ones. When a model lacks the capacity to solve a problem, it does not stop early. It loops, backtracks, hallucinates alternative paths, and frequently exhausts its context window in a futile attempt to converge. The paper calls this "trace length asymmetry," and it means that the hardest-to-solve problems are also the most expensive to fail at.
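
Given per-problem logs of compute spent and correctness, the asymmetry is straightforward to measure. The record format and numbers below are hypothetical:

```python
from statistics import mean

def flop_asymmetry(records):
    """records: (flops_spent, is_correct) pairs, one per problem, for one model.
    Returns mean FLOPs on incorrect answers divided by mean FLOPs on correct
    ones; a ratio above 1 means failures cost more than successes."""
    wrong = [f for f, ok in records if not ok]
    right = [f for f, ok in records if ok]
    if not wrong or not right:
        return float("nan")
    return mean(wrong) / mean(right)

# Hypothetical per-problem log for one model: failed attempts tend to run long.
log = [(2.1e13, True), (1.8e13, True), (9.4e13, False), (7.7e13, False)]
print(f"incorrect/correct compute ratio: {flop_asymmetry(log):.1f}x")
```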

This creates a vicious efficiency trap for deployment. Problems outside a model's capability boundary are precisely the ones that consume the most resources, with no accuracy payoff. The authors frame this as an open algorithmic challenge: building early-stopping mechanisms that can distinguish between productive reasoning (which genuinely needs time) and unproductive spiralling (which should be terminated) remains unsolved.

Key Takeaway

When choosing a reasoning model for production, accuracy alone is the wrong metric. MoE architectures offer the best compute-to-accuracy conversion, smaller models can punch above their weight by thinking longer, and every architecture hits a task-dependent ceiling where more compute buys nothing. The most expensive inference runs are the ones that fail, not the ones that succeed.

Reference

Prucs, Á., Csutora, M., Antal, M., & Marosi, M. (2025). Compute-accuracy Pareto frontiers for open-source reasoning large language models. arXiv preprint arXiv:2512.24776. https://arxiv.org/abs/2512.24776