Emergent Abilities in LLMs

An ability is considered emergent when it appears in larger models but is absent in smaller ones, a key factor contributing to the success of Large Language Models.

Introduction

In this lesson, we’ll dive deeper into the concept of emergent abilities: the empirical phenomenon of language models gaining new abilities once their scale passes certain thresholds.

Emergent abilities become apparent as we scale up the models and are influenced by factors such as training compute and model parameters.

We'll also explore various instances of these emergent abilities, focusing on scenarios such as few-shot prompting and augmented prompting strategies, and examine why these abilities emerge and whether further scaling could reveal more of them.

What Are Emergent Abilities

Emergent abilities in LLMs are defined as significant improvements in task performance that become apparent as model size or scale increases. These abilities are not present or noticeable in smaller or less complex models but become evident in larger ones. This suggests that the model is learning and generalizing from its pre-training in ways that were not explicitly programmed or expected.

When visualized on a scaling curve, emergent abilities show a pattern where performance is almost random until a certain scale threshold, after which performance increases significantly. This is known as a phase transition, a dramatic change in behavior that could not have been predicted by examining smaller-scale systems.
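
To make this phase transition concrete, here is a minimal, purely illustrative Python sketch of such a scaling curve. The baseline, threshold, and curve shape are synthetic assumptions chosen for visualization, not measurements from any real model.

```python
# Purely illustrative: synthetic numbers shaped like an emergent scaling
# curve; nothing here is measured from a real model.
import numpy as np
import matplotlib.pyplot as plt

# Model scale (e.g., training FLOPs) spanning several orders of magnitude.
scale = np.logspace(18, 24, 200)

# Accuracy stays near the random-guess baseline (25% on a 4-way
# multiple-choice task) until an assumed threshold, then rises sharply.
random_baseline = 0.25
threshold = 1e22  # assumed transition point, for illustration only
accuracy = random_baseline + (0.95 - random_baseline) / (
    1.0 + (threshold / scale) ** 4
)

plt.semilogx(scale, accuracy, label="model accuracy")
plt.axhline(random_baseline, linestyle="--", label="random baseline")
plt.xlabel("Training compute (FLOPs, log scale)")
plt.ylabel("Accuracy")
plt.title("Illustrative emergent-ability scaling curve")
plt.legend()
plt.show()
```

The near-flat region followed by a sharp rise is the signature pattern: extrapolating from the flat region alone would not predict the jump.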

In the following image, taken from the paper “Emergent Abilities of Large Language Models,” we see several charts showing how abilities emerge in LLMs, with performance on the y-axis plotted against model scale on the x-axis.

Language models have been scaled primarily along three dimensions: training compute, number of model parameters, and training dataset size. Abilities may emerge at less training compute or with fewer parameters when models are trained on higher-quality data, so emergence depends on the quantity and quality of the data as well as on the parameter count.

Emergent abilities in LLMs appear as the models scale up and cannot be predicted by simply extrapolating from smaller models.

Evaluation Benchmarks for Emergent Abilities

Several benchmarks are used to evaluate the emergent abilities of language models. These include the BIG-Bench suite, TruthfulQA, the Massive Multi-task Language Understanding (MMLU) benchmark, and the Word in Context (WiC) benchmark.

  1. The first of these is the BIG-Bench suite, a comprehensive set of over 200 benchmarks that test a model's capabilities across a variety of tasks. These include arithmetic, where the model must perform the four basic operations (example: “Q: What is 132 plus 762? A: 894”); transliteration from the International Phonetic Alphabet (IPA), which measures whether the model can manipulate and use rare words (example: “English: The 1931 Malay census was an alarm bell. IPA: ðə 1931 ˈmeɪleɪ ˈsɛnsəs wɑz ən əˈlɑrm bɛl.”); and word unscrambling, which tests the model's ability to manipulate letters. Many more benchmarks can be found in the GitHub repository, where you can delve into their specific details. The performance of models like GPT-3 and LaMDA on these tasks starts near zero but jumps significantly above random at a certain scale, demonstrating emergent abilities (a sketch of how such a few-shot task might be scored follows this list).
  2. Another benchmark is TruthfulQA, which measures a model's capacity to provide truthful answers to questions. The evaluation consists of two tasks: 1) Generation, where the model is asked to answer a question in one or two sentences, and 2) Multiple-choice, where the model must choose the correct answer from four options or from True/False statements. When the Gopher model is scaled up to its largest size, its performance jumps to more than 20% above random, indicating the emergence of this ability.
  3. The Massive Multi-task Language Understanding (MMLU) benchmark is another key evaluation. Its primary objective is to assess a model's breadth of world knowledge and problem-solving skills. The test encompasses 57 tasks, spanning areas such as elementary mathematics, US history, computer science, law, and more. GPT-3, Gopher, and Chinchilla models below a certain scale perform no better than guessing on average across all topics, but scaling up to a larger size enables performance to surpass random, indicating the emergence of this ability.
  4. Finally, the Word in Context (WiC) benchmark targets semantic understanding. WiC is a binary classification task for context-sensitive word embeddings: given a target word (a verb or a noun) and two contexts in which it appears, the task is to determine whether it carries the same meaning in both. Chinchilla fails to achieve one-shot performance better than random even when scaled to its largest model size. Above-random performance eventually emerged when PaLM was scaled to a much larger size, suggesting this ability emerges only at a larger scale.
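
To make the evaluation setup concrete, here is a minimal Python sketch of how a BIG-Bench-style few-shot arithmetic task might be scored. The `generate` function is a hypothetical placeholder for whatever completion API or local model you use; only the “Q: … A:” prompt format comes from the arithmetic example above.

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical placeholder: swap in a real LLM call (API or local model)."""
    raise NotImplementedError("plug in your model here")

def build_prompt(exemplars, question):
    # Few-shot exemplars followed by the held-out question, in the
    # "Q: ... A: ..." format used by the BIG-Bench arithmetic example.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {question}\nA:"

def evaluate_addition(n_trials: int = 100, n_shots: int = 3) -> float:
    correct = 0
    for _ in range(n_trials):
        # Sample n_shots exemplars plus one held-out test question.
        pairs = [
            (f"What is {a} plus {b}?", str(a + b))
            for a, b in (
                (random.randint(100, 999), random.randint(100, 999))
                for _ in range(n_shots + 1)
            )
        ]
        exemplars, (test_q, test_answer) = pairs[:-1], pairs[-1]
        prediction = generate(build_prompt(exemplars, test_q)).strip()
        correct += prediction == test_answer
    return correct / n_trials
```

On small models, accuracy from a loop like this stays near zero; past the scale threshold it jumps well above random, which is exactly the emergent pattern described above.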

Other Factors That Could Give Rise To Emergent Abilities

  • Multi-step reasoning is a strategy in which a model is guided to produce a sequence of intermediate steps before giving the final answer. This strategy, known as chain-of-thought prompting, only surpasses standard prompting when applied to a sufficiently large model (a side-by-side prompt example follows this list).
  • Instruction following is another strategy that involves fine-tuning a model on a mixture of tasks phrased as instructions. This strategy improves performance only once the model reaches a sufficient scale.
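
To illustrate the difference, here is a sketch contrasting a standard few-shot prompt with a chain-of-thought prompt for the same word problem. The exemplar is the well-known tennis-ball problem from the chain-of-thought prompting literature; only the exemplar answer changes between the two prompts.

```python
# Standard few-shot prompting: the exemplar shows only the final answer.
standard_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

# Chain-of-thought prompting: the exemplar spells out intermediate
# reasoning steps before the final answer, nudging the model to do the same.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

# Only sufficiently large models benefit: smaller models tend to produce
# fluent but incorrect reasoning chains, so the gains from chain-of-thought
# prompting are themselves emergent.
```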

Risks With Emergent Abilities

As we scale up language models, we also need to be aware of the emergent risks that come with them. These risks include societal issues related to truthfulness, bias, and toxicity. Such risks can be mitigated by strategies such as prompting models to be “helpful, harmless, and honest.”
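
As a minimal sketch of this mitigation, the snippet below prepends such a preamble to every request. The exact wording of the preamble is an illustrative assumption, not a fixed recipe.

```python
# A "helpful, harmless, and honest" (HHH) preamble prepended to every
# request; the wording here is illustrative, not a canonical prompt.
HHH_PREAMBLE = (
    "You are a helpful, harmless, and honest assistant. "
    "If you are unsure of an answer, say so rather than guessing."
)

def with_hhh(user_message: str) -> str:
    # Condition the completion on the preamble before the user's message.
    return f"{HHH_PREAMBLE}\n\nUser: {user_message}\nAssistant:"
```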

The WinoGender benchmark, which measures gender bias in occupations, has shown that scaling can improve performance but also increase bias in ambiguous contexts. Larger models were found to be more likely to memorize training data, although deduplication methods can reduce this risk.

Emergent risks also include phenomena that might only exist in future language models or that have not yet been characterized in current models. These could include backdoor vulnerabilities or harmful content synthesis.

A Shift Towards General-Purpose Models

The emergence of abilities has led to sociological changes in how the community views and uses these models. Historically, NLP focused on task-specific models. Scaling models has led to an explosion in research on "general purpose" models that aim to perform a range of tasks not explicitly encoded in the training data.

This shift towards general-purpose models is evident when scaling enables a few-shot prompted general-purpose model to outperform the prior state of the art held by fine-tuned task-specific models. For example, GPT-3 achieved a new state of the art on the TriviaQA and PiQA question-answering benchmarks; PaLM achieved a new state of the art on three arithmetic reasoning benchmarks; and the multimodal Flamingo model achieved a new state of the art on six visual question answering benchmarks.
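
As a concrete picture of what “few-shot prompted” means here, the sketch below builds an open-domain QA prompt from a handful of exemplars; the questions are illustrative stand-ins, not actual TriviaQA items.

```python
# A few exemplars turn a general-purpose model into a question answerer
# with no fine-tuning; the trailing "A:" invites the model to continue
# the pattern. Questions are illustrative, not actual benchmark items.
few_shot_qa_prompt = """\
Q: What is the capital of France?
A: Paris

Q: Who wrote the novel "1984"?
A: George Orwell

Q: Which planet is known as the Red Planet?
A:"""

# completion = generate(few_shot_qa_prompt)  # expected continuation: "Mars"
```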

The ability of general-purpose models to perform unseen tasks given only a few examples has also led to many new applications of language models outside the NLP research community. For instance, prompted language models have been used to translate natural language instructions into actions executable by robots, to interact with users, and to facilitate multi-modal reasoning.

Conclusion

Emergent abilities in LLMs are capabilities that appear as the models scale up and are a key factor in their success. These abilities, unpredictable from smaller models, become evident after a certain scale threshold is reached. They have been observed in various contexts, such as few-shot prompting and augmented prompting strategies. Scaling up LLMs also introduces emergent risks like increased bias and toxicity, which can be mitigated with appropriate strategies. The emergence of these abilities has led to a shift towards general-purpose models and opened up new applications outside the traditional NLP research community.

In the next lesson, we’ll dive into today's most popular proprietary LLMs and describe the tradeoffs between proprietary and open-source LLMs.