Sloth introduces skill-based scaling laws that treat benchmark scores as reflections of three latent abilities, predicting multi-benchmark performance across model families from a single small model per family.
12 benchmarks across Open LLM Leaderboard v1 and v2 used to fit and validate the scaling law, covering reasoning, knowledge, and instruction following
2.9 pp average mean-absolute-error for Sloth (d=2) on Leaderboard v2, the lowest of any method tested in the leave-one-out evaluation
30 model families in the dataset (53 counting base and instruct variants separately), the most comprehensive benchmark-scaling dataset assembled to date
Source: Figure 1, Polo et al. (2025). MAE in percentage points, averaged across all test families. Lower is better. Sloth (d=2) denotes the full model with trainable link function and two latent skills.
Scaling laws are supposed to tell you what happens when you make a model bigger. The trouble is, they usually hold only within a single model family. Train LLaMA models from 1B to 70B and you can plot a clean curve; try to reuse that curve for Qwen or Yi and the predictions fall apart, because each family has its own training data, architecture quirks, and post-training recipe. The alternative, fitting a separate law per family, demands training models at several sizes for every new family you care about. That gets expensive fast.
Existing approaches sit at two extremes. Owen (2024) fits a single law that ignores family identity entirely, gaining generality but losing accuracy. Ruan et al. (2024) fit family-specific slopes and intercepts, which works nicely if you have already trained the large model (defeating the purpose of predicting it). Sloth threads the needle: it uses family-specific efficiency parameters while sharing everything else, so a single small model from a new family is enough to calibrate predictions.
The core idea is borrowed from psychometrics and economics. Instead of modelling each benchmark score independently, Sloth assumes that all twelve benchmarks are noisy reflections of a small number of latent skills, typically d=2 or d=3 of them, connected to the observed scores through factor-analysis-style loadings (a matrix Λ). Each skill is then modelled as a translog production function of log(parameters) and log(training tokens), with a family-specific intercept that captures how efficiently a given family converts compute into capability.
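In code, the generative structure is compact. The sketch below is a minimal reconstruction under stated assumptions, not the authors' implementation: the exact translog terms, the sigmoid link (the paper's full model learns a trainable link instead), and every name in it are illustrative.

```python
import numpy as np

def translog_features(n_params, n_tokens):
    """Translog basis in log(parameters) and log(tokens):
    linear terms, quadratic terms, and their interaction."""
    p, t = np.log(n_params), np.log(n_tokens)
    return np.array([p, t, 0.5 * p ** 2, 0.5 * t ** 2, p * t])

def predict_benchmarks(n_params, n_tokens, intercept, slopes, loadings, bias):
    """Forward model: compute -> latent skills -> 12 benchmark scores.

    intercept: (d,)    family-specific efficiency, the only per-family part
    slopes:    (d, 5)  translog slopes, shared across all families
    loadings:  (12, d) factor loadings (the matrix Lambda)
    bias:      (12,)   per-benchmark offset
    """
    skills = intercept + slopes @ translog_features(n_params, n_tokens)
    logits = loadings @ skills + bias
    # Sigmoid as a stand-in for the paper's trainable link function.
    return 1.0 / (1.0 + np.exp(-logits))

# Illustrative call: an 8B-parameter model trained on 15T tokens.
scores = predict_benchmarks(8e9, 15e12,
                            intercept=np.zeros(3),
                            slopes=0.01 * np.ones((3, 5)),
                            loadings=0.5 * np.ones((12, 3)),
                            bias=np.zeros(12))
```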
Because the skill slopes are shared across families while only the intercept varies, Sloth needs far fewer parameters than alternatives. With d=3 and 12 benchmarks, it uses 69 + 3f parameters (where f is the number of families), compared to 36 + 12f for a standard FLOPs baseline; since 69 + 3f < 36 + 12f whenever f ≥ 4, Sloth is already more parameter-efficient with four or more families. Fitting takes seconds on a laptop using Adam with Huber loss.
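For concreteness, here is a hedged PyTorch sketch of such a fit. The data are random stand-ins shaped like the real dataset (models × benchmarks, with a family index per model); the initialisation, learning rate, and step count are guesses, not the authors' settings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_models, n_bench, d, n_fam = 53, 12, 3, 30

X = torch.randn(n_models, 5)                 # translog features per model
fam = torch.randint(0, n_fam, (n_models,))   # family index per model
Y = torch.rand(n_models, n_bench)            # observed scores in [0, 1]

# Shared parameters plus one intercept vector per family; the
# intercepts are the only family-specific piece of the model.
slopes = (0.1 * torch.randn(d, 5)).requires_grad_()
loadings = (0.1 * torch.randn(n_bench, d)).requires_grad_()
bias = torch.zeros(n_bench, requires_grad=True)
intercepts = torch.zeros(n_fam, d, requires_grad=True)

opt = torch.optim.Adam([slopes, loadings, bias, intercepts], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    skills = intercepts[fam] + X @ slopes.T          # (n_models, d)
    pred = torch.sigmoid(skills @ loadings.T + bias)
    loss = F.huber_loss(pred, Y)                     # robust to outlier scores
    loss.backward()
    opt.step()
```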
When the authors rotate the factor loadings for interpretability (using Geomin oblique rotation, a standard psychometric technique), three clean dimensions emerge. Reasoning loads heavily on GSM8K, MATH, GPQA, MMLU-PRO, and BBH. It is driven primarily by model size, with training tokens playing a secondary role. Knowledge loads on ARC, HellaSwag, and Winogrande, and responds strongly to both parameters and tokens, suggesting that common-sense knowledge is genuinely data-hungry. Instruction Following is almost entirely captured by IFEval and is the skill most dramatically improved by instruction tuning across every family tested.
Instruction tuning itself produces an interesting pattern: it reliably increases instruction following, has a moderate negative effect on reasoning, and shows mixed effects on knowledge. The instruction-following dimension does not emerge clearly when d=2, which is one practical reason to prefer the three-skill model for interpretation even if prediction accuracy peaks at d=2 on some benchmarks.
Reasoning: Heaviest loadings on GSM8K (3.43) and MATH (3.42). Scales mainly with model size. Instruction tuning tends to pull it down modestly, though the size of the effect varies across families.
Knowledge: Loads on ARC (0.44), HellaSwag (0.55), Winogrande (0.56). Responds strongly to both parameters and tokens. Of the three skills, the most sensitive to compute and the least dependent on family identity.
Instruction Following: Dominated by IFEval (0.63) and TruthfulQA (0.33). Instruction tuning produces large, consistent gains across all tested families. Depends on both model size and training tokens.
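The rotation behind these summaries is routine in practice. Here is a hedged sketch using the factor_analyzer package's Rotator, assuming its 'geomin_obl' method corresponds to the Geomin oblique rotation the paper uses; the unrotated loadings are random stand-ins for the fitted Λ.

```python
import numpy as np
from factor_analyzer import Rotator

# Stand-in for the fitted 12 x 3 loadings matrix Lambda.
loadings = np.random.default_rng(0).normal(size=(12, 3))

rotated = Rotator(method="geomin_obl").fit_transform(loadings)

# After rotation each benchmark should load mainly on one factor,
# which is what licenses the Reasoning / Knowledge / Instruction
# Following reading of the three dimensions.
print(np.round(rotated, 2))
```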
The most compelling applications go beyond leaderboard prediction. By estimating skills for a hypothetical LLM (say, a 70B model from a family where only an 8B has been evaluated), Sloth can predict performance on downstream tasks outside the twelve benchmarks it was fit on. The authors demonstrate this for HumanEval code completion and EQ-Bench emotional intelligence, predicting LLaMA-3-70B performance from skills estimated without ever seeing that model's scores. Reasoning turns out to be by far the most important skill for coding, while emotional intelligence draws on a mixture of reasoning and knowledge.
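Mechanically, the downstream step reduces to a small regression from skill space to the new task. A hedged scikit-learn sketch follows; the skill vectors, HumanEval scores, and the hypothetical 70B skill estimate are all illustrative stand-ins.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Skill estimates (d=3) for models whose HumanEval scores are known,
# plus those scores: stand-ins for the real evaluation data.
skills_known = rng.normal(size=(40, 3))
humaneval = 1 / (1 + np.exp(-(1.5 * skills_known[:, 0]       # reasoning-heavy,
                              + 0.2 * skills_known[:, 1])))  # per the paper

reg = LinearRegression().fit(skills_known, humaneval)

# Skill vector for a hypothetical 70B model, extrapolated by the
# scaling law from a family calibrated on its 8B sibling alone.
skills_70b = np.array([[1.8, 0.9, 0.4]])
print(reg.predict(skills_70b))   # forecast before the model exists
print(reg.coef_)                 # per-skill importance for coding
```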
The test-time compute application is equally striking. By combining Sloth's skill estimates with item response theory (fitting a logistic model per MATH question), the authors predict pass@k curves for models that were held out of training. The predicted curves closely track the ground truth across four orders of magnitude of repeated sampling (k from 1 to 10,000). Unlike conventional scaling laws, Sloth can do this for hypothetical models before any resources are committed to training them.
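The pass@k prediction is a short derivation once the pieces are in place. A minimal sketch, assuming a 2-parameter logistic IRT model per question and independent samples, so pass@k for one question is 1 - (1 - p)^k; the discrimination and difficulty values below are illustrative, not fitted.

```python
import numpy as np

def pass_at_k(theta, a, b, k):
    """Predicted pass@k under a 2PL IRT model.

    theta: reasoning skill of the (possibly hypothetical) model
    a, b:  per-question discrimination and difficulty
    k:     number of independent samples per question
    """
    p = 1.0 / (1.0 + np.exp(-(a * theta - b)))   # per-question solve rate
    return np.mean(1.0 - (1.0 - p) ** k)         # P(any of k attempts succeeds)

rng = np.random.default_rng(0)
a = np.abs(rng.normal(1.0, 0.3, size=500))   # illustrative discriminations
b = rng.normal(2.0, 1.0, size=500)           # illustrative difficulties
for k in (1, 10, 100, 1_000, 10_000):
    print(k, round(pass_at_k(theta=1.5, a=a, b=b, k=k), 3))
```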
The compute-optimal analysis adds a practical planning tool. For each skill, the authors derive the optimal allocation of parameters vs. training tokens given a fixed FLOPs budget. Reasoning favours parameters heavily (72B parameters with only 0.45T tokens at 1.9×10²² FLOPs), while knowledge and instruction following prefer a more balanced split. These prescriptions are family-independent, since the optimal point depends only on the shared skill slopes.
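The optimisation behind those prescriptions is easy to reproduce numerically: sweep allocations along a fixed-FLOPs frontier and keep the one that maximises the skill's translog value. The sketch below assumes the common C ≈ 6NT approximation and made-up slope values; neither is taken from the paper.

```python
import numpy as np

def skill_value(log_n, log_t, w):
    """Shared translog curve: linear, quadratic, and interaction terms."""
    return (w[0] * log_n + w[1] * log_t
            + 0.5 * w[2] * log_n ** 2 + 0.5 * w[3] * log_t ** 2
            + w[4] * log_n * log_t)

def optimal_split(flops, w, grid=100_000):
    """Best (params, tokens) on the frontier C = 6 * N * T."""
    log_n = np.linspace(np.log(1e8), np.log(1e12), grid)
    log_t = np.log(flops / 6.0) - log_n        # tokens implied by the budget
    best = np.argmax(skill_value(log_n, log_t, w))
    return np.exp(log_n[best]), np.exp(log_t[best])

w_reasoning = np.array([0.545, 0.5, -0.02, -0.02, 0.01])  # illustrative slopes
n, t = optimal_split(1.9e22, w_reasoning)
print(f"params ~ {n:.2g}, tokens ~ {t:.2g}")   # parameter-heavy optimum
```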
Sloth shows that LLM benchmark performance has an exploitable low-dimensional structure. Three latent skills explain most of the cross-benchmark variation, and those skills scale predictably with compute in ways that transfer across model families. If you have benchmark scores from one small model, you can forecast how a much larger sibling will perform, what it will be good at, and how to allocate your training budget to maximise a specific capability. The practical cost of this foresight: seconds of compute on a laptop.
Polo, F. M., Somerstep, S., Choshen, L., Sun, Y., & Yurochkin, M. (2025). Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families. Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2412.06540