A Timeline of Large Language Models

Introduction

In this lesson, we'll trace the transformative shift from language models to large language models (LLMs) and examine the key features that set LLMs apart from their predecessors. We'll see how scaling laws, emergent abilities, and innovative architectures have enabled LLMs to tackle complex tasks and shape the current landscape of popular models.

From Language Models to Large Language Models

Language modeling has undergone a transformative shift, moving from pre-trained language models (LMs) to large language models (LLMs).

LMs, like ELMo and BERT, initially captured context-aware word representations through pre-training and fine-tuning for specific tasks. However, the introduction of LLMs, exemplified by GPT-3 and PaLM, demonstrated that scaling model size and data can unlock emergent abilities, exceeding the capabilities of their smaller counterparts. These LLMs can tackle more complex tasks through in-context learning.

The following image shows the cumulative number of arXiv papers containing the key phrases “language model” and “large language model,” emphasizing the growing interest in them in recent years.

Key Characterizing Features of LLMs

Here are the main characteristics that differentiate LLMs from previous models:

  1. Scaling Laws for Enhanced Capacity: Scaling laws play a crucial role in LLM development, describing how model performance relates to model size, dataset size, and training compute. The KM scaling laws express cross-entropy loss as separate power laws of each of these three factors, while the Chinchilla scaling laws take an alternative approach, optimizing the allocation of a fixed compute budget between model size and data size (a sketch of these formulas appears after this list).
  2. Emergent Abilities: LLMs possess emergent abilities, defined as capabilities that manifest in large models but are absent in smaller counterparts. One prominent emergent ability is in-context learning (ICL), showcased by models like GPT-3. ICL allows LLMs to generate the expected output for a task directly from natural language instructions and a few demonstrations, eliminating the need for additional training (see the prompt sketch after this list).
  3. Instruction Following: LLMs can be fine-tuned on collections of tasks described via natural language instructions (instruction tuning), which further enhances generalization to new, unseen tasks.
  4. Step-by-Step Reasoning: LLMs can perform step-by-step reasoning using the chain-of-thought (CoT) prompting strategy. This mechanism enables them to solve complex tasks by breaking them into intermediate reasoning steps, which is particularly beneficial for multi-step tasks such as mathematical word problems (the prompt sketch after this list includes a CoT-style example).
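
As a rough sketch of the two families of scaling laws mentioned above: the functional forms below follow the KM (Kaplan et al., 2020) and Chinchilla (Hoffmann et al., 2022) papers, while the constants (N_c, D_c, C_c, E, A, B) and exponents are empirically fitted in those papers and are not reproduced here, except for the approximate compute-optimal exponents.

```latex
% KM scaling laws: cross-entropy loss modeled as a power law of model size (N),
% dataset size (D), or training compute (C), with the other factors not bottlenecked:
\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]

% Chinchilla scaling law: a single joint fit over model size and data size,
\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

% which, under a fixed compute budget C \approx 6ND, yields the compute-optimal allocation
\[
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b},
\qquad a \approx b \approx 0.5
\]
% i.e., model size and training tokens should be scaled up in roughly equal proportion.
```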
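
To make ICL and CoT prompting concrete, here is a minimal Python sketch of the two prompt styles. The tasks and wording are illustrative (loosely modeled on examples popularized by the GPT-3 and chain-of-thought papers), and no model is called; the point is only the structure of the prompts.

```python
# Few-shot in-context learning (ICL): the task is demonstrated entirely inside the
# prompt, and the model is expected to continue the pattern; its weights are never updated.
icl_prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""

# Chain-of-thought (CoT) prompting: the demonstration includes the intermediate
# reasoning steps, encouraging the model to reason step by step before answering.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 more balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have now?
A:"""

print(icl_prompt)
print(cot_prompt)
```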

A Timeline of the Most Popular LLMs

Here’s an overview of the timeline of the most popular LLMs of recent years.

Here is a brief description of some of them.

  • [2018] GPT-1
  • GPT-1 (Generative Pre-Training 1) was introduced by OpenAI in 2018. It laid the foundation for the GPT-series models with a generative, decoder-only Transformer architecture: the model is first pre-trained in an unsupervised way to predict the next word in natural language text and then fine-tuned with supervision for specific downstream tasks.

  • [2019] GPT-2
  • Building upon the architecture of GPT-1, GPT-2 was released in 2019 with an increased parameter scale of 1.5 billion. This model demonstrated potential for solving a variety of tasks using language text as a unified format for input, output, and task information.

  • [2020] GPT-3
  • Released in 2020, GPT-3 marked a significant capacity leap by scaling the model to 175 billion parameters. It introduced the concept of in-context learning (ICL), enabling LLMs to understand tasks through few-shot or zero-shot learning. GPT-3 showcased excellent performance in numerous NLP tasks, including reasoning and domain adaptation, highlighting the potential of scaling up model size.

  • [2021] Codex
  • Codex was introduced by OpenAI in July 2021 as a fine-tuned version of GPT-3 specifically trained on a large corpus of GitHub code. It demonstrated enhanced ability in solving programming and mathematical problems, showcasing the potential of training LLMs on specialized data.

  • [2021] LaMDA
  • LaMDA (Language Model for Dialogue Applications) was introduced by researchers at Google. LaMDA focuses on dialogue applications and dialogue generation tasks. Its largest model has 137 billion parameters, making it slightly smaller than GPT-3.

  • [2021] Gopher
  • In 2021, DeepMind introduced Gopher, a language model with an impressive parameter scale of 280 billion. Notably, Gopher demonstrated a remarkable capability to approach human expert performance on the Massive Multitask Language Understanding (MMLU) benchmark. However, like its predecessors, Gopher exhibited certain limitations, including tendencies for repetition, biases, and propagation of incorrect information.

  • [2022] InstructGPT
  • In 2022, InstructGPT was proposed as an enhancement to GPT-3 for human alignment. It utilized reinforcement learning from human feedback (RLHF) to improve the model's instruction-following capacity and mitigate issues like generating harmful content. This approach proved valuable for training LLMs to align with human preferences.

  • [2022] Chinchilla
  • Chinchilla, introduced in 2022 by DeepMind, is a family of large language models built around compute-optimal scaling laws. With a focus on efficient use of compute resources, Chinchilla has 70 billion parameters and achieves a remarkable 67.5% accuracy on the MMLU benchmark, a more than 7% improvement over Gopher.

  • [2022] PaLM
  • Pathways Language Model (PaLM) was introduced by Google Research in 2022, showcasing a leap in model scale with 540 billion parameters. Leveraging the proprietary Pathways system for distributed computation, PaLM exhibited strong few-shot performance across a wide array of language understanding, reasoning, and code-related tasks.

  • [2022] ChatGPT
  • In November 2022, OpenAI released ChatGPT, a conversation model initially based on GPT-3.5 (and later also on GPT-4). Specially optimized for dialogue, ChatGPT exhibited strong abilities in communicating with humans, reasoning, and aligning with human values.

  • [2023] LLaMA
  • LLaMA (Large Language Model Meta AI) emerged in February 2023 from Meta AI. It introduced a family of large language models available in varying sizes from 7 billion to 65 billion parameters. LLaMA's release marked a departure from the limited access trend, as its model weights were made available to the research community under a noncommercial license. Subsequent developments, including Llama 2 and other chat models, further emphasized accessibility, this time with a license for commercial use.

  • [2023] GPT-4
  • In March 2023, GPT-4 was released, extending input from text only to multimodal signals (accepting both images and text). With stronger capabilities than GPT-3.5, GPT-4 demonstrated significant performance improvements on a wide range of tasks.

If you want to dive deeper into these models, I suggest reading the paper “A Survey of Large Language Models.” Here’s a table summarizing the architectural and training details of all the mentioned models (and others).

Moreover, here’s an image showing the evolution of the LLaMA models into other fine-tuned models made by online communities, highlighting the great interest it sparked.

Conclusion

In this lesson, we learned more about the transition from pre-trained language models (LMs) to the emergence of large language models (LLMs). We explored the key differentiating features of LLMs, including the influence of scaling laws and the manifestation of emergent abilities like in-context learning, step-by-step reasoning strategies, and instruction following.

We also saw a brief timeline of the most popular LLMs: from the foundational GPT-1 and GPT-2, to GPT-3 and the code-specialized Codex, through LaMDA, Gopher, InstructGPT, Chinchilla, and PaLM, up to ChatGPT, LLaMA, and GPT-4.