
Scaling Laws: What More Compute Actually Buys

PM: Skim, 15 min

In 2020, researchers at OpenAI published a paper that changed how the entire industry thinks about model development. They showed that language model quality doesn't improve in unpredictable leaps; it follows smooth, predictable mathematical relationships with compute, data, and model size. That finding rewired how labs allocate resources. And two years later, a follow-up paper from DeepMind overturned one of its central conclusions.

Understanding both is useful. Scaling laws are the closest thing the field has to a theory of how LLMs get better.

The Kaplan et al. Finding (2020)

The 2020 paper (often called "the Kaplan scaling laws" or "GPT-3 scaling laws") studied how model performance changes as you vary three things independently: model size (parameters), dataset size (tokens), and compute budget (FLOPs).

The key finding: loss, a measure of how well the model predicts the next token, follows a power-law relationship with each of these variables. Double the compute, and you get a predictable, quantifiable improvement in loss. The relationship holds across many orders of magnitude.
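Written out, a power law of this kind has the shape below. The symbols are notation assumed here for illustration, not values taken from the paper: L is loss, C is compute, C_0 is a fitted scale constant, and α is a fitted positive exponent.

```latex
L(C) = \left(\frac{C_0}{C}\right)^{\alpha}
\quad\Longrightarrow\quad
\frac{L(2C)}{L(C)} = 2^{-\alpha}
```

Each doubling of compute multiplies loss by the same factor, 2^{-α}, no matter where on the curve you start. That constant ratio is why the relationship shows up as a straight line on log-log axes.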

This was practically significant because it meant you could run small experiments and extrapolate. Before spending months and millions training a large model, you could run a sweep of smaller experiments, fit the curve, and predict what the big run would produce. Surprise failures in large training runs became rarer because you could see them coming.
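The "fit the curve and extrapolate" workflow can be sketched in a few lines. This is a minimal illustration, not the procedure from the paper: the data points are synthetic, the constants `TRUE_A` and `TRUE_ALPHA` are hypothetical, and the fit assumes a pure power law with no irreducible loss floor.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha), done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    variance = sum((x - mean_x) ** 2 for x in lx)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict(a, alpha, x):
    """Evaluate the fitted power law at compute budget x."""
    return a * x ** (-alpha)

# Synthetic "small run" results generated from a known curve (hypothetical constants).
TRUE_A, TRUE_ALPHA = 20.0, 0.05
small_compute = [1e17, 1e18, 1e19, 1e20]  # FLOPs spent on each small experiment
small_loss = [TRUE_A * c ** (-TRUE_ALPHA) for c in small_compute]

# Fit on the small runs, then extrapolate three orders of magnitude further.
a, alpha = fit_power_law(small_compute, small_loss)
big_run_loss = predict(a, alpha, 1e23)
print(f"fitted exponent: {alpha:.3f}")
print(f"predicted loss at 1e23 FLOPs: {big_run_loss:.3f}")
```

With real measurements the points scatter around the line, so the quality of the extrapolation depends on how many small runs you have and how far beyond them you project.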

The paper also suggested that, given a fixed compute budget, you should prioritize model size over data size. Train a bigger model on less data. This turned out to be the wrong takeaway.

The Chinchilla Correction (2022)

Two years later, a DeepMind team led by Jordan Hoffmann ran a more careful set of experiments. They trained over 400 models at different combinations of size and data, all at matched compute budgets, and measured which combinations produced the best final performance.

The result, published in what's now called the "Chinchilla paper," directly contradicted the Kaplan-era intuition: for a fixed compute budget, you should train a smaller model on more data, not a bigger model on less data.

The specific finding: optimal training requires roughly 20 tokens of training data for every parameter in the model. By this measure, GPT-3 (175 billion parameters, trained on roughly 300 billion tokens) was massively undertrained. An optimally trained model at the same compute budget would be smaller but trained on far more data, and it would perform better.
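The arithmetic behind that claim can be worked through directly. The sketch below assumes the common approximation that training cost is about 6 FLOPs per parameter per token, which is not stated in the text above, and treats the 20:1 ratio as an exact rule rather than the rough guide it is.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: training FLOPs ~ 6 * N * D

def training_flops(params, tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal(flops, ratio=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = ratio * N for N and D."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * ratio))
    return params, ratio * params

gpt3_params, gpt3_tokens = 175e9, 300e9
gpt3_ratio = gpt3_tokens / gpt3_params
budget = training_flops(gpt3_params, gpt3_tokens)
opt_params, opt_tokens = compute_optimal(budget)

print(f"GPT-3 tokens per parameter: {gpt3_ratio:.1f}")
print(f"Same budget at 20:1: {opt_params / 1e9:.0f}B parameters, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions GPT-3 saw about 1.7 tokens per parameter, and the same budget spent at 20:1 would go to a model of roughly 51B parameters trained on about 1T tokens.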

The team demonstrated this by training Chinchilla: 70 billion parameters, trained on 1.4 trillion tokens, using roughly the same compute budget as Gopher (280 billion parameters, 300 billion tokens). Chinchilla outperformed Gopher on nearly every benchmark.
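A quick check of "roughly the same compute budget", again assuming the 6 FLOPs per parameter per token approximation (an assumption of this sketch, not a figure from the text):

```python
def training_flops(params, tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

chinchilla = training_flops(70e9, 1.4e12)
gopher = training_flops(280e9, 300e9)

print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.0f} tokens per parameter")
print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")
print(f"Compute ratio (Chinchilla / Gopher): {chinchilla / gopher:.2f}")
```

The two budgets land within about 20% of each other, while the tokens-per-parameter ratios differ by roughly a factor of 20. The comparison is essentially about how the budget was split, not how large it was.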

Why This Matters for Inference

The Chinchilla insight has a compounding benefit that wasn't the point of the paper but became its most practical consequence: a well-trained smaller model is cheaper to run than a poorly-trained larger model with equivalent quality.

If you can hit a target quality level with a 70B model trained on 1.4T tokens instead of a 175B model trained on 300B tokens, you get that quality at a much lower per-inference cost: fewer parameters means less compute per forward pass.
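To put a rough number on "much lower", the sketch below assumes a dense model whose forward pass costs about 2 FLOPs per parameter per generated token. That approximation is an assumption of this example; it ignores attention over long contexts, memory bandwidth, and batching effects.

```python
def forward_flops_per_token(params):
    """Assumed approximation: one multiply and one add per parameter per token."""
    return 2 * params

small = forward_flops_per_token(70e9)
large = forward_flops_per_token(175e9)

print(f"70B model:  {small:.2e} FLOPs per token")
print(f"175B model: {large:.2e} FLOPs per token")
print(f"The larger model costs {large / small:.1f}x more compute per token")
```

Under this approximation the per-token cost scales linearly with parameter count, so the 175B model is about 2.5 times more expensive to run on every request, for the lifetime of the deployment.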

This is why Llama 3's 8B model, trained on 15 trillion tokens, frequently outperforms GPT-3-class 175B models trained on far less data. The parameter count is roughly 20 times smaller; the training data is roughly 50 times larger than GPT-3's 300 billion tokens. Chinchilla explains the outcome.

The Question of Emergent Capabilities

One effect associated with scaling is the appearance of qualitatively new capabilities at certain scales. Below some threshold, a model can't do multi-step arithmetic reliably. Above it, it can. The capability appears to "emerge" rather than gradually improve.

This phenomenon generated significant excitement and anxiety: the idea that scaling alone could unlock unexpected new abilities, without any architectural changes.

The picture has become less clear since the original reports. Some researchers argue that apparent emergence is an artifact of the metrics used: switch from a coarse metric to a fine-grained one, and the capability was improving all along, just below the measurement threshold. Whether true emergence (a genuine phase transition) exists in LLMs, or whether it's a measurement artifact, remains an active debate.
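The metric-artifact argument is easy to reproduce with a toy model. The sketch below is purely illustrative: the accuracy values and `ANSWER_LENGTH` are hypothetical, and it assumes token errors are independent, which real models do not guarantee.

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that every token in the answer is right, assuming independent errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical: the task needs 10 correct tokens in a row

# Fine-grained metric (per-token accuracy) improving steadily with scale.
per_token_steps = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_steps:
    print(f"per-token accuracy {p:.2f} -> exact-match rate {exact_match(p, ANSWER_LENGTH):.3f}")
```

The per-token column rises steadily, while the exact-match column stays near zero for the first several steps and then climbs steeply. The same underlying improvement looks gradual under one metric and like a sudden jump under the other.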

For practical purposes: the safest assumption is that capabilities improve smoothly with scale, and that hard capability thresholds are more likely to be measurement artifacts than genuine discontinuities.

What "Compute Efficient" Actually Means

Scaling laws made the phrase "compute-efficient training" meaningful. A training run is compute-efficient relative to a scaling law if, at a given compute budget, it reaches a loss close to the lowest the law predicts for that budget, rather than the higher loss predicted for an arbitrarily chosen model size. Models like Chinchilla, Llama, and their successors were explicitly designed to sit on or near that compute-efficient frontier.

This also means that raw parameter count, without knowing training data volume, tells you very little about expected quality. A 7B model trained on 2T tokens and a 7B model trained on 100B tokens are different models. The first is likely much better.
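One way to put rough numbers on the two 7B models is a parametric loss curve of the form L(N, D) = E + A / N^alpha + B / D^beta. The constants below are the fitted values as I recall them from Hoffmann et al. (2022); treat them as illustrative assumptions and check the paper before relying on them. They describe that paper's setup, not any particular 7B model.

```python
# Assumed constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
# Recalled from the parametric fit in Hoffmann et al. (2022); illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def estimated_loss(params, tokens):
    """Estimated training loss for N parameters trained on D tokens."""
    return E + A / params ** ALPHA + B / tokens ** BETA

well_fed = estimated_loss(7e9, 2e12)   # 7B parameters, 2T tokens
starved = estimated_loss(7e9, 100e9)   # 7B parameters, 100B tokens

print(f"7B on 2T tokens:   estimated loss {well_fed:.3f}")
print(f"7B on 100B tokens: estimated loss {starved:.3f}")
```

The parameter term is identical for both models, so the entire gap comes from the data term. The constant E acts as a floor that neither more parameters nor more tokens can get below.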

PM Roadmap Tip

Model size is not the primary indicator of quality; training data volume and the ratio of data to model parameters matter just as much. Llama 3 8B, trained on 15T tokens, often outperforms much larger models trained on less data. When evaluating a model, ask both "how many parameters?" and "how many tokens was it trained on?" Neither number alone tells the full story.

Further Reading

Kaplan, J. et al., arXiv (2020)
Read: Abstract, Section 1, Figure 1. Skip: Sections 2–5 (derivations, training details).
Model performance improves predictably with scale (parameters, data, compute), enabling the field to reason about how much bigger to go.
Hoffmann, J. et al., NeurIPS 2022
Read: Abstract only. Skip: Everything else.
"Chinchilla" revised GPT-era assumptions: the optimal training data budget is ~20 tokens per parameter, and many large models were undertrained on too little data.