Scaling Laws in LLM Training

Introduction

In this lesson, we will study the relationships between language model performance and factors such as model scale, model shape, and compute budget. The lesson summarizes extracts from the papers “Scaling Laws for Neural Language Models” and “Training Compute-Optimal Large Language Models.”

A study on language modeling performance

The paper Scaling Laws for Neural Language Models (2020) contains a study of empirical scaling laws for language model performance on the cross-entropy loss, focusing on the Transformer architecture.

The experiments show that the test loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. This means that simple equations govern the relationships between these variables, and these equations can be used to derive an optimally efficient configuration for training a very large language model. Moreover, other architectural details, such as network width or depth, have minimal effects within a wide range.

As deduced from the experiments and the derived equations, larger models are significantly more sample-efficient: optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Experiments

To study language model scaling, a variety of models were trained while varying several factors, including:

  • Model size (N): ranging in size from 768 to 1.5 billion non-embedding parameters.
  • Dataset size (D): ranging from 22 million to 23 billion tokens.
  • Model shape: including depth, width, attention heads, and feed-forward dimension.
  • Context length: 1024 for most runs, with some experiments with shorter contexts.
  • Batch size: 2^19 tokens for most runs, with some variations to measure the critical batch size. Training at the critical batch size provides a roughly optimal compromise between training time and compute efficiency (a rough fit is sketched below).
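The critical batch size itself follows a power law in the loss. The sketch below illustrates that relationship in Python using roughly the fit reported in the paper; the constants are rounded and should be treated as illustrative rather than exact.

```python
def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """Approximate critical batch size (in tokens) as a function of loss.

    Rough fit from the paper: B_crit(L) ~ B* / L**(1 / alpha_B), with
    B* ~ 2e8 tokens and alpha_B ~ 0.21 (approximate, rounded values).
    """
    return b_star / loss ** (1 / alpha_b)

# Lower loss (i.e., a better model) implies a larger critical batch size.
for loss in (4.0, 3.0, 2.0):
    print(f"L = {loss:.1f} -> B_crit ~ {critical_batch_size(loss):.2e} tokens")
```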

Let’s define the following training variables as well:

  • Let L be the test cross-entropy loss.
  • Let C be the amount of compute used to train a model.
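With these variables in place, the paper fits simple power laws for the loss when a single factor is the bottleneck and the other two are effectively unlimited. The following is a minimal sketch using approximately the constants reported in the paper; treat the numbers as illustrative.

```python
# Approximate power-law fits from "Scaling Laws for Neural Language Models".
# Each law applies when the corresponding factor is the bottleneck; the
# constants below are rounded, illustrative values.

ALPHA_N, N_C = 0.076, 8.8e13   # N in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_min in PF-days

def loss_from_params(n):
    """L(N) = (N_c / N) ** alpha_N, in the infinite-data limit."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d):
    """L(D) = (D_c / D) ** alpha_D, for a large model with early stopping."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):
    """L(C_min) = (C_c / C_min) ** alpha_C, with compute allocated optimally."""
    return (C_C / c_min) ** ALPHA_C

# Example: doubling model size shaves only a few percent off the loss,
# since 2 ** -0.076 is roughly 0.95.
print(loss_from_params(1e9), loss_from_params(2e9))
```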

Key findings

Taking inspiration from section 1.1 of the paper, we summarize the results of the experiments.

  • Performance depends strongly on model scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters, such as depth vs. width.
  • Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, and C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude.
Language modeling performance improves smoothly as we increase the amount of compute, dataset size, and model size used for training. For optimal performance, all three factors must be scaled up in tandem. Image from the paper.

The paper differentiates between embedding and non-embedding parameters because their size correlates differently with model performance. When including embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. When excluding embedding parameters, the performance of models with different depths converges to a single trend.

Left: When including embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When excluding embedding parameters, the performance of models with different depths converges to a single trend. Image from the paper.
  • The universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but we see diminishing returns if either N or D is held fixed while the other increases (see the sketch after this list).
The early-stopped test loss depends predictably on the dataset size D and model size N. Left: For large D, performance is a straight power law in N. For a smaller, fixed D, performance stops improving as N increases and the model begins to overfit. Right: The extent of overfitting depends predominantly on the ratio N^0.74/D. Image from the paper.
  • The universality of training: Training curves follow predictable power laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if the model were trained for much longer.
  • Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and data points.
A series of language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings). Image from the paper.
Left: The early-stopped test loss L(N, D) varies predictably with the dataset size D and model size N. Right: After an initial transient period, learning curves for all model sizes N can be fit with an equation parameterized in terms of the number of steps (S_min) when training at large batch size. Image from the paper.
  • Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.
As more compute becomes available, we can choose how much to allocate towards training larger models, using larger batches, and training for more steps. The image illustrates this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required. Image from the paper.
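As referenced in the overfitting bullet above, the paper combines the model-size and dataset-size trends into a single equation for the early-stopped test loss. Here is a minimal sketch, reusing the approximate constants from the earlier snippet:

```python
ALPHA_N, N_C = 0.076, 8.8e13   # approximate, rounded fits as before
ALPHA_D, D_C = 0.095, 5.4e13

def early_stopped_loss(n, d):
    """L(N, D) = ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D.

    Recovers L(N) as D grows without bound and L(D) as N grows without
    bound, and captures the overfitting penalty when D is too small for
    a given N.
    """
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# Example: for a fixed 1B-parameter model, extra data helps only up to
# the point where the model itself becomes the bottleneck.
for d in (1e9, 1e10, 1e11, 1e12):
    print(f"N=1e9, D={d:.0e} -> loss ~ {early_stopped_loss(1e9, d):.3f}")
```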

These results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Larger language models are expected to perform better and be more sample-efficient than current models.

Considerations

When training large language models, it’s possible to use the relations between N, D, and L to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements.
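For example, the paper estimates that to keep the overfitting penalty small, the dataset should grow roughly as D ≳ 5 × 10^3 · N^0.74, i.e., sublinearly in model size. Below is a minimal sketch of that rule; the constant and exponent are approximate values taken from the paper's fit.

```python
def min_tokens_to_avoid_overfitting(n_params):
    """Approximate data requirement from the paper: D >~ 5e3 * N ** 0.74."""
    return 5e3 * n_params ** 0.74

# Example: a ~1.5B-parameter model needs on the order of tens of billions
# of tokens before overfitting becomes a serious concern.
print(f"{min_tokens_to_avoid_overfitting(1.5e9):.2e} tokens")
```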

The derived scaling relations can be used as a predictive framework. One might interpret these relations as analogs of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.

It would be interesting to investigate whether these scaling relations hold in other generative modeling tasks with a maximum likelihood loss and perhaps in other settings and domains (such as images, audio, and video models) as well.

Chinchilla Scaling Laws for Compute-Optimal Training of LLMs

In 2022, DeepMind published the paper “Training Compute-Optimal Large Language Models,” which further explored the scaling laws of LLMs. The researchers conducted extensive experiments to understand the relationship between model size, the number of training tokens, and the compute budget.

The key finding of this study was that current LLMs, such as GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B), are significantly undertrained. While these models have grown in parameter count, the amount of training data has remained roughly constant at around 300 billion tokens.

The authors proposed that the number of training tokens and model size must be scaled equally for compute-optimal training. They trained approximately 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens. This extensive experimentation led to the creation of a new LLM, Chinchilla, which outperformed its larger counterparts.

Current LLMs. We show five of the current largest dense transformer models, their size, and the number of training tokens. Other than LaMDA, most models are trained for approximately 300 billion tokens. We introduce Chinchilla, a substantially smaller model, trained for much longer than 300B tokens. Image from the paper.

With 70B parameters and four times more training data, Chinchilla was trained using the same compute budget as the 280B Gopher. The results showed that smaller models can deliver better performance if trained on more data. These smaller models are easier to fine-tune and have lower latency at inference. Moreover, they do not need to be trained to their lowest possible loss to be compute-optimal.
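As a rough sanity check on the "same compute budget" claim, we can use the common rule-of-thumb approximation C ≈ 6 · N · D for training FLOPs (a standard heuristic, not a figure taken from the paper):

```python
def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb estimate of training compute: C ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

gopher = approx_train_flops(280e9, 300e9)      # Gopher: 280B params, ~300B tokens
chinchilla = approx_train_flops(70e9, 1.4e12)  # Chinchilla: 70B params, ~1.4T tokens

# Both land in the same ~5-6e23 FLOP range: a 4x smaller model trained on
# roughly 4x more tokens consumes a comparable compute budget.
print(f"Gopher ~ {gopher:.1e} FLOPs, Chinchilla ~ {chinchilla:.1e} FLOPs")
```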

The researchers explored three different approaches to answer the question: "Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?" They assumed a power-law relationship between the compute budget and the optimal model size (and, likewise, the optimal number of training tokens).

  1. The first approach involved fixing model sizes and varying the number of training tokens.
  2. The second approach, called IsoFLOP profiles, varied the model size for a fixed set of different training FLOP counts.
  3. The third approach fit the final losses from the first two approaches as a parametric function of the number of model parameters and the number of training tokens (sketched below).

All three approaches suggested that as the compute budget increases, the model size and the amount of training data should be scaled up in approximately equal proportions. The first and second approaches yielded similar predictions for optimal model sizes, while the third suggested that somewhat smaller models would be optimal at large compute budgets.
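The sketch below illustrates the third approach's parametric loss and the resulting roughly equal-scaling rule, using approximately the constants reported in the paper (rounded, illustrative values); the 20-tokens-per-parameter figure is the commonly cited approximation of the paper's optimum.

```python
# Approximate parametric fit from "Training Compute-Optimal Large Language
# Models" (approach 3): L(N, D) = E + A / N**alpha + B / D**beta.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted final loss for N parameters trained on D tokens."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def compute_optimal_allocation(flops, tokens_per_param=20):
    """Roughly equal scaling: N_opt and D_opt both grow as ~ C ** 0.5.

    Combines the rule-of-thumb C ~ 6 * N * D with an (approximate)
    optimum of ~20 tokens per parameter.
    """
    n_opt = (flops / (6 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Example: at a Gopher-scale compute budget of ~6e23 FLOPs, the optimum
# comes out near 70B parameters and ~1.4T tokens, i.e., Chinchilla's setup.
n_opt, d_opt = compute_optimal_allocation(6e23)
print(f"N_opt ~ {n_opt:.1e} params, D_opt ~ {d_opt:.1e} tokens")
print(f"Predicted loss ~ {chinchilla_loss(n_opt, d_opt):.2f}")
```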

Conclusion

This lesson has explored the relationship between language model performance and parameters such as model size, dataset size, and compute budget.

We've learned that performance scales as a power law with these variables and that larger models tend to be more sample-efficient. We also explored the Chinchilla scaling laws, which suggest that the number of training tokens and the model size should be scaled equally for compute-optimal training. This has led to the creation of smaller models, like Chinchilla, that outperform larger counterparts when trained on more data.

These findings provide a predictive framework for training large language models and may have implications for other generative modeling tasks and domains.