Letting AI models "think longer" hits a wall, but the math tells you exactly where that wall is
A unified probabilistic model shows that both parallel sampling and sequential rethinking strategies for large reasoning models converge to the same saturation formula, letting you calculate the exact point where more compute stops helping.
+23.3 pp accuracy gain on AIME 2024 for a 1.5B model using parallel scaling with 32 samples, rising from 30.0% to 53.3%
r = 0.803 Pearson correlation between predicted and observed saturation points for the 7B model with parallel scaling on MATH-500
83.6% accuracy achieved by the tiny 1.5B model with 32-sample scaling on MATH-500, beating the unscaled 7B model's 82.0%
Parallel vs. Sequential Scaling: Accuracy on AIME 2024
Source: Figure 5(a) and Table 1, Wang et al. (2025). Accuracy (%) by number of generations per problem (N) for DeepSeek-R1-Distill-Qwen 1.5B and 7B. Vanilla baselines shown as dashed reference lines.
Modern reasoning models like DeepSeek-R1 and OpenAI's o1 already "think harder" internally by generating long chain-of-thought traces before producing an answer. The natural next question: what happens if you stack more compute on top of that? You can either sample many answers in parallel and pick the best one (parallel scaling), or ask the model to rethink its answer round after round (sequential scaling). The authors call this scaling over scaling.
Wang and colleagues set out to formalise what practitioners already suspected: these strategies work, but they stop working. There is a ceiling, and spending tokens past it is pure waste. The paper introduces TTSPM (the Test-Time Scaling Performance Model), a probabilistic framework that characterises precisely where that ceiling sits.
The core insight is elegant. For parallel scaling, each of N independent samples has some probability p of being correct, so the chance that at least one is right is 1 − (1 − p)ᴺ; scaled by the task's achievable ceiling F_max, expected performance follows F(N) = F_max · (1 − (1 − p)ᴺ). For sequential rethinking, modelled as an absorbing Markov chain, the algebra yields the identical formula. Two fundamentally different strategies, one mathematical structure.
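To see the equivalence concretely, here is a minimal Python sketch (not from the paper) that compares the closed-form parallel curve against a Monte Carlo simulation of the absorbing Markov chain, under the same independence assumption the authors make; the single-attempt probability p = 0.3 is an illustrative placeholder, not a measured value.

```python
import random

def parallel_success(p: float, n: int) -> float:
    """Closed-form probability that at least one of n independent
    samples (each correct with probability p) is correct."""
    return 1.0 - (1.0 - p) ** n

def sequential_success_mc(p: float, n: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of success within n rethinking rounds,
    modelled as an absorbing Markov chain: each round independently
    reaches the correct (absorbing) state with probability p."""
    hits = 0
    for _ in range(trials):
        for _ in range(n):
            if random.random() < p:
                hits += 1
                break
    return hits / trials

p = 0.3  # hypothetical single-attempt success probability
for n in (1, 4, 8, 16, 32):
    # The two columns converge to the same curve as trials grow.
    print(n, round(parallel_success(p, n), 4),
          round(sequential_success_mc(p, n), 4))
```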
From that shared structure, the authors derive the marginal gain of adding one more sample or round, ΔF(N) = F_max · p · (1 − p)ᴺ, which decays exponentially. They define the scaling plateau as the point N* where that marginal gain drops below a threshold ε. Solving gives a compact saturation formula: N* = ⌈ln(ε / (F_max · p)) / ln(1 − p)⌉. It depends only on the model's single-attempt success probability p and the task's achievable ceiling F_max, both estimable from a small validation run.
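A back-of-the-envelope calculator follows directly from the formula. The values for p, F_max, and ε below are hypothetical; in practice you would estimate p and F_max from a small validation run, as the authors suggest.

```python
import math

def scaling_plateau(p: float, f_max: float = 1.0, eps: float = 0.01) -> int:
    """Smallest N at which the marginal gain F_max * p * (1 - p)**N
    falls below eps, i.e. N* = ceil(ln(eps / (F_max * p)) / ln(1 - p))."""
    return math.ceil(math.log(eps / (f_max * p)) / math.log(1.0 - p))

# Illustrative numbers only: a 30% single-attempt success rate, a 90%
# achievable ceiling, and a 0.5 pp marginal-gain cutoff.
print(scaling_plateau(p=0.3, f_max=0.9, eps=0.005))  # -> 12
```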
Empirically, this bound tracks reality well. On MATH-500, the predicted saturation points correlate with observed ones at r = 0.803 for the 7B model under parallel scaling and r = 0.768 for the 1.5B model. Sequential scaling correlations are somewhat weaker (r = 0.575 for the 7B case), which the authors attribute to the stronger inter-step dependencies that their independence assumption cannot fully capture.
Across every benchmark and model size, parallel scaling outperformed sequential scaling. On AIME 2024 with the 1.5B model and N = 32, parallel scaling reached 53.3% accuracy while sequential scaling plateaued at 30.0%, identical to the no-scaling baseline. The likely culprit is error propagation: sequential rethinking can get stuck in a wrong groove, whereas parallel sampling explores independent solution paths.
Perhaps more strikingly, the tiny 1.5B model with 32 parallel samples beat the 7B model running a single attempt on MATH-500 (83.6% vs. 82.0%). Test-time compute can substitute for model scale, at least up to a point. On GPQA, the 1.5B model's Hit@32 under parallel scaling reached 96.5%, compared to the 7B vanilla baseline of 41.9%. The gap between "ever finding a correct answer among 32 tries" and "reliably picking the right one" remains large, though: accuracy with majority voting only reached 39.4%.
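That Hit@N versus voted-accuracy gap is easy to state in code. The sketch below is illustrative only: the answer strings and the Counter-based majority vote (with its arbitrary tie-breaking) are assumptions for the example, not the paper's exact selection procedure.

```python
from collections import Counter

def hit_at_n(sampled_answers: list[str], gold: str) -> bool:
    """Hit@N: did any of the N sampled answers match the gold answer?"""
    return gold in sampled_answers

def majority_vote(sampled_answers: list[str]) -> str:
    """Pick the most frequent answer among the N samples."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical samples for one GPQA-style question whose gold answer is "C":
samples = ["B", "C", "B", "D", "B", "C", "A", "B"]
print(hit_at_n(samples, "C"))          # True: a correct answer exists among the samples
print(majority_vote(samples) == "C")   # False: the vote settles on the wrong "B"
```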
Inference cost is becoming the dominant line item in production AI deployments. Every API call to a reasoning model already generates thousands of internal thinking tokens. Stacking additional parallel samples or rethinking rounds on top multiplies that cost. The TTSPM framework gives practitioners a formula, not just a heuristic, for deciding when to stop spending.
The model has clear limitations. It assumes independence between scaling units, which holds well for parallel sampling but less well for sequential rethinking, where each round conditions on the last. The parameters (success probability p, ceiling F_max) must be estimated per model-task combination. And the framework focuses on verifier-free strategies; adding a strong verifier would change the dynamics entirely. Still, the unified mathematical structure underlying two very different scaling approaches suggests a genuine law of diminishing returns, not an artefact of any particular implementation.
Throwing more test-time compute at a reasoning model works, but the returns decay exponentially. Both parallel sampling and sequential rethinking obey the same saturation formula. You can calculate, rather than guess, the point where extra tokens become waste. For most practical budgets, parallel scaling with a modest N (often 8 to 16) captures nearly all available gains, and a small model scaled at inference can match or beat a larger model running cold.
Wang, J., Zhu, B., Leong, C. T., Li, Y., & Li, W. (2025). Scaling over scaling: Exploring test-time scaling plateau in large reasoning models. arXiv preprint arXiv:2505.20522v2. https://arxiv.org/abs/2505.20522