The Evolution of Language Modeling up to LLMs

Introduction

In this lesson, we’ll survey the most popular models used for language modeling, starting from statistical ones and working up to the first Large Language Models (LLMs). This lesson is meant to be more of a narrative on the evolution of these models than a technical explanation, so don’t worry if you can’t understand every model in detail.

The Evolution of Language Modeling

The evolution of NLP models has been a remarkable journey marked by continuous innovation and improvement. It began with the Bag of Words model in 1954, which simply counted word occurrences in documents. This was followed by TF-IDF in 1972, which adjusted these scores based on the rarity or commonality of words.

The advent of Word2Vec in 2013 marked a significant leap forward, introducing the concept of word embeddings that captured semantic relationships between words.

This was then enhanced by Recurrent Neural Networks (RNNs), which could learn sequence patterns and handle documents of arbitrary length.

The introduction of the Transformer architecture in 2017 revolutionized the field, with its attention mechanism allowing the model to focus on the most relevant parts of the input when generating output. This was the foundation for BERT in 2018, which used bidirectional Transformers to achieve impressive results in traditional NLP tasks.

The subsequent years saw a flurry of advancements, with models like RoBERTa, XLM, ALBERT, and ELECTRA each introducing their own improvements and optimizations.

Timeline of Models

  • [1954] Bag of Words (BOW)
  • BOW is a simple model that counts word occurrences in documents, using these counts as features. It was a basic yet effective way to analyze text. However, it did not account for word order or context.

  • [1972] TF-IDF
  • TF-IDF enhanced BOW by giving more weight to rare words and less to common ones. This improved the model's ability to discern document relevance. However, it still did not account for word context.

  • [2013] Word2Vec
  • Word2Vec introduced word embeddings, high-dimensional vectors that capture semantic relationships between words. These embeddings were learned by a neural network trained on a large corpus of text, marking a significant advancement in capturing semantic meaning. (BOW, TF-IDF, and Word2Vec are illustrated in a short code sketch after this timeline.)

  • [2014] RNNs in Encoder-Decoder architectures
  • RNNs (Recurrent Neural Networks) compute document embeddings, leveraging the context of words in a sentence, which was not possible with word embeddings alone. They later evolved into LSTMs [1997] to capture long-term dependencies, and into Bidirectional RNNs [1997] to capture both left-to-right and right-to-left dependencies. Eventually, Encoder-Decoder RNNs [2014] emerged, where one RNN creates a document embedding (the encoder) and another RNN decodes it back into text (the decoder).

  • [2017] Transformer
  • The Transformer is an encoder-decoder model that leverages attention mechanisms to compute better embeddings and to align output better to input. This model marked a significant advancement in NLP tasks.

  • [2018] BERT
  • BERT is a bidirectional Transformer pre-trained using a combination of Masked Language Modeling and Next Sentence Prediction objectives. It uses global attention.

  • [2018] GPT
  • GPT is the first autoregressive model based on the Transformer architecture. It later evolved into GPT-2 [2019], a bigger and optimized version of GPT pre-trained on WebText, and GPT-3 [2020], an even bigger and further optimized version of GPT-2 pre-trained on Common Crawl.

  • [2019] CTRL
  • CTRL, similar to GPT, introduced control codes for conditional text generation. This allowed for more control over the generated text.

  • [2019] Transformer-XL
  • Transformer-XL reused previously computed hidden states to attend to a longer context. This allowed the model to handle longer sequences of text.

  • [2019] ALBERT
  • ALBERT is a lighter version of BERT where (1) Next Sentence Prediction is replaced by Sentence Order Prediction, and (2) parameter-reduction techniques are used for lower memory consumption and faster training.

  • [2019] RoBERTa
  • RoBERTa is an improved version of BERT, where (1) masking for the Masked Language Modeling objective is applied dynamically, (2) the Next Sentence Prediction objective is dropped, (3) a BPE tokenizer is employed, and (4) better hyperparameters are used.

  • [2019] XLM
  • XLM, a multilingual Transformer, was pre-trained using objectives like Causal Language Modeling, Masked Language Modeling, and Translation Language Modeling.

  • [2019] XLNet
  • XLNet is a Transformer-XL trained with a generalized autoregressive pre-training method that enables learning bidirectional dependencies.

  • [2019] PEGASUS
  • PEGASUS, a bidirectional encoder and left-to-right decoder, was pre-trained with Masked Language Modeling and Gap Sentence Generation objectives.

  • [2019] DistilBERT
  • DistilBERT is a smaller and faster version of BERT that preserves over 95% of BERT’s performance. It was trained by distilling the pre-trained BERT model.

  • [2019] XLM-RoBERTa
  • XLM-RoBERTa is a multilingual version of RoBERTa, trained on a multilanguage corpus with the Masked Language Modeling objective.

  • [2019] BART
  • BART, a bidirectional encoder and left-to-right decoder, was trained by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text.

  • [2019] ConvBERT
  • ConvBERT replaced self-attention blocks with new ones that leveraged convolutions to better model global and local contexts.

  • [2020] Funnel Transformer
  • The Funnel Transformer gradually compresses the sequence of hidden states into a shorter one, reducing the computation cost.

  • [2020] Reformer
  • Reformer is a more efficient Transformer thanks to locality-sensitive hashing (LSH) attention, axial position encodings, and other optimizations.

  • [2020] T5
  • T5, a bidirectional encoder and left-to-right decoder, was pre-trained on a mix of unsupervised and supervised tasks.

  • [2020] Longformer
  • Longformer replaced the full attention matrices with sparse attention patterns for higher training efficiency. This made the model faster and more memory-efficient on long documents.

  • [2020] ProphetNet
  • ProphetNet was trained with the Future N-gram Prediction objective and with a novel self-attention mechanism.

  • [2020] ELECTRA
  • Lighter and better than BERT, ELECTRA was trained with the Replaced Token Detection objective. This made the model more efficient and improved its performance on NLP tasks.

  • [2021] Switch Transformers
  • Switch Transformers introduced a sparsely-activated expert Transformer model, aiming to simplify and improve over Mixture of Experts. This allowed models to scale to far more parameters without a proportional increase in the computation required per token.
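
Before moving on, a short code sketch may help ground the earliest entries in this timeline. The snippet below is a minimal illustration, assuming scikit-learn and gensim are installed and using a made-up three-document corpus; the exact numbers it prints are not important.

```python
# A minimal sketch of the three earliest ideas in the timeline: Bag of Words,
# TF-IDF, and Word2Vec. The corpus is made up, and scikit-learn / gensim are
# assumed to be available; the exact numbers are only illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog are pets",
]

# Bag of Words [1954]: each document becomes a vector of raw word counts.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())   # one row per document, one column per word
print(bow.get_feature_names_out())           # which word each column refers to

# TF-IDF [1972]: the same counts, reweighted so rare words matter more than common ones.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))

# Word2Vec [2013]: a small neural network learns one dense vector per word, so that
# words appearing in similar contexts end up with similar vectors.
w2v = Word2Vec([doc.split() for doc in corpus], vector_size=16, min_count=1, epochs=100, seed=0)
print(w2v.wv.most_similar("cat", topn=3))
```

On a real corpus, the Word2Vec vectors are what reveal semantic relationships (for example, “king” ending up close to “queen”); on a toy corpus like this one, the similarities are mostly noise.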

Large Language Models truly arose in 2020 and 2021. Up to 2020, most language models could generate fluent-looking text; after that, the best language models could also follow instructions and solve a variety of tasks beyond simple text generation.

The Transformer

The most crucial model in the timeline above is, without doubt, the Transformer, introduced in the very popular paper “Attention Is All You Need.” The Transformer is a type of neural network used today by all of the best Large Language Models, like GPT-4, Claude, and LLaMA.

Central to Transformers is the encoder-decoder structure, which excels at modeling long-range dependencies and capturing contextual information.

The Encoder processes the input text, identifying key elements and creating word embeddings based on their relevance to other words in the sentence. In the original Transformer architecture, designed for text translation, the attention mechanism was employed in two distinct ways: encoding the source language and decoding the encoded embedding back into the target language.

On the other hand, the Decoder takes the encoder's output, an embedding, and transforms it back into text. Some models may opt to use only the decoder, bypassing the encoder entirely. The decoder's attention mechanism differs slightly from the encoder's, functioning more like a conventional language model by focusing on previous words during text processing. This approach is particularly useful for tasks like language generation, which is why models like GPT, primarily designed for text generation in response to an input text sequence, utilize the decoder part of the Transformer.
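
To make the attention mechanism a little less abstract, here is a minimal sketch of scaled dot-product self-attention written with plain NumPy. The token count, embedding size, and random values are placeholders, and a real Transformer adds learned query/key/value projections, multiple attention heads, and positional information on top of this core operation.

```python
# A minimal sketch of scaled dot-product self-attention (the core of the
# Transformer), using NumPy only. Shapes and values are illustrative placeholders.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q attends over the rows of K and returns a mix of the rows of V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                     # weighted sum of the values

# Toy example: 3 tokens with embedding size 4, standing in for learned projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)                # self-attention: Q = K = V = x
print(out.shape)                                           # (3, 4): one context-aware vector per token
```

In the decoder, a causal mask sets the scores of future positions to a very large negative number before the softmax, so each token can only attend to the tokens that precede it; that restriction is what makes autoregressive text generation possible.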

Later in the course, we’ll learn more about the Transformer architecture.

Image from the paper Attention is All You Need.

Scaling Transformers: What Led to Large Language Models

The effectiveness of Transformer models was further improved by scaling, i.e., increasing the number of parameters and training on more data. This scaling led to models with more than 100B parameters that could perform tasks using few-shot or zero-shot approaches, eliminating the need for fine-tuning on specific tasks.
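
To make that distinction concrete, here is a sketch of what few-shot and zero-shot usage look like in practice; the translation task and the example pairs are made up, and the strings would simply be sent as-is to a large model’s completion API rather than used to fine-tune it.

```python
# A hypothetical illustration of few-shot vs. zero-shot prompting: the task is
# specified entirely inside the prompt text, with no gradient updates to the model.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

zero_shot_prompt = "Translate 'thank you' from English to French."

# Either string would be sent unchanged to a large model's text-completion endpoint.
print(few_shot_prompt)
print(zero_shot_prompt)
```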

The increase in the size of these models and the datasets used for training them (and thus the associated costs) led to the large language models that we see today, like Cohere Command, GPT-4, and LLaMA.

Conclusion

In this lesson, we navigated through the rich history of Natural Language Processing, tracing the path from the rudimentary Bag of Words model to the advanced Transformer family. This timeline underscored the continuous innovation in NLP, spotlighting the progression of models in sophistication and proficiency.

In the next lesson, we’ll continue the timeline of popular models from 2020 (with GPT-3) up to today.