Focus on the GPT Architecture

Introduction

The Generative Pre-trained Transformer (GPT) is a type of transformer-based language model developed by OpenAI. The 'transformer' part of its name refers to the transformer architecture it builds on, which was introduced in the research paper "Attention Is All You Need" by Vaswani et al.

You should have a good understanding of the fundamental elements comprising the transformer architecture. In this session, we will cover the decoder-only networks that play an essential role in developing large language models. We will explore their unique attributes and the reasons behind their effectiveness.

In contrast to conventional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, the transformer architecture abandons recurrence in favor of self-attention mechanisms, resulting in substantial gains in speed and scalability. Harnessing the parallelism within the network (running multiple attention heads simultaneously) along with the abundant small cores available in a GPU unleashed an immensely powerful architecture.

The GPT Architecture

The GPT family comprises decoder-only models, wherein each block in the stack consists of a self-attention mechanism and a position-wise fully connected feed-forward network.

The self-attention mechanism, also known as scaled dot-product attention, allows the model to weigh the importance of each word in the input when generating the next word in the sequence. It computes a weighted sum of all the words in the sequence, where the weights are determined by the attention scores.

The critical aspect to focus on is the addition of “masking” to self-attention, which prevents the model from attending to certain positions, namely the future tokens in the sequence.

Illustrating which tokens are attended to by masked self-attention at a particular timestep. (Image taken from NLPiation)

As you can see in the figure, we pass the whole sequence to the model, but at timestep 5 the model predicts the next token by looking only at the previously generated tokens, masking the future ones. This prevents the model from “cheating” by leveraging future tokens when making its prediction.

The following code implements a simplified version of the “masked self-attention” mechanism in NumPy.

import numpy as np

def self_attention(query, key, value, mask=None):
    # Compute raw attention scores and scale them by sqrt(d_k),
    # as in scaled dot-product attention
    d_k = query.shape[-1]
    scores = np.dot(query, key.T) / np.sqrt(d_k)
    
    if mask is not None:
        # Apply the mask by pushing masked positions to a large negative value
        # (entries equal to 1 mark positions that must be ignored)
        scores = scores + mask * -1e9
    
    # Apply a numerically stable softmax to obtain attention weights
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    
    # Compute the weighted sum of the value vectors
    output = np.dot(attention_weights, value)
    
    return output

The first step is to compute a Query, Key, and Value vector for each word in the input sequence using separate learned linear transformations of the input vector. Each of these is a simple feed-forward linear layer that the model learns during training.

Then, we calculate the attention scores by taking the dot product of each word's Query vector with the Key vector of every other word. Masking is applied at this point by setting the scores at the masked positions to a large negative number, which effectively tells the model that those words are unimportant and should be disregarded during attention. To get the attention weights, we apply the softmax function to the attention scores, converting them into probabilities; the large negative scores become weights of effectively zero. Lastly, we multiply each Value vector by its corresponding weight and sum them up. This produces the output of the masked self-attention mechanism for the word.
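
To make these steps concrete, here is a toy usage of the self_attention function above. The random projection matrices merely stand in for the learned linear layers, and the upper-triangular mask marks the future positions each token must ignore.

# Toy usage of the self_attention function above. The projection matrices
# here are random purely for illustration; in a real model they are learned.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))          # 4 token embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
query, key, value = x @ W_q, x @ W_k, x @ W_v

# Causal mask: 1 above the diagonal marks the future positions to ignore
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

output = self_attention(query, key, value, mask=causal_mask)
print(output.shape)  # (4, 8): one contextualized vector per token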

The provided code snippet illustrates a single self-attention head, but in reality each layer contains multiple heads, ranging from roughly a dozen in smaller GPT-2 variants to 96 in GPT-3, depending on the architecture. These heads operate simultaneously to enhance the model's performance.
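
As a rough sketch of the idea (real implementations batch all heads into a single matrix multiplication rather than looping in Python), multiple heads can be expressed by splitting the model dimension into slices and running the single-head function above on each slice; the projection matrices here again stand in for learned parameters.

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, mask=None):
    # Minimal multi-head sketch built on the self_attention function above.
    # W_q, W_k, W_v, W_o stand in for the learned projection matrices.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    query, key, value = x @ W_q, x @ W_k, x @ W_v

    # Run one attention head per slice of the model dimension
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(self_attention(query[:, s], key[:, s], value[:, s], mask))

    # Concatenate the per-head outputs and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o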

Causal Language Modeling

LLMs utilize a self-supervised learning process for pre-training. This process eliminates the need to provide explicit labels to the model for learning, making it capable of acquiring knowledge autonomously. For instance, when training a summarization model using supervised learning, it is necessary to provide articles and their corresponding summaries as reference points during the training process. However, LLMs employ the causal language modeling objective to acquire knowledge from any textual data without the explicit need for human-provided labels. Why is it called “causal”? Because the prediction at each step depends only on earlier steps in the sequence and not on future steps.

This process involves feeding a segment of the document to the model and asking it to predict the next word.

Subsequently, the predicted word is concatenated to the original input and fed back to the model to predict a new token. This iterative loop continues, consistently feeding the newly generated token back into the network. During the pre-training process, the network acquires substantial knowledge about language and grammar. We can then fine-tune the pre-trained model using a supervised approach for different tasks or a specific domain.
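
The loop below is a schematic sketch of this iterative process. The model and tokenizer objects are hypothetical placeholders rather than a specific library's API: any causal language model that returns next-token scores for a sequence, together with a matching tokenizer, would fit.

def generate(model, tokenizer, prompt, max_new_tokens=20):
    # Schematic greedy decoding loop; `model` and `tokenizer` are
    # hypothetical placeholders for any causal LM and its tokenizer.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # The model scores every vocabulary entry for the next position
        next_token_scores = model(tokens)[-1]
        # Pick the most likely token and append it to the running sequence
        next_token = int(np.argmax(next_token_scores))
        tokens.append(next_token)
    return tokenizer.decode(tokens)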

Compared to other well-known objectives, the advantage of this approach is that it models how humans naturally write or speak. In contrast to objectives like masked language modeling, where mask tokens are inserted into the input, causal language modeling constructs sentences one word at a time. This key difference ensures that the model's performance does not suffer when it is applied to real-world text, which contains no mask tokens.

Moreover, we can utilize extensive, high-quality, human-generated content spanning centuries. This content can be derived from books, Wikipedia, news websites, and more. Familiar platforms, such as Activeloop and Hugging Face, provide convenient access to some well-known datasets. We will cover this topic in more detail in later lessons.
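
As a quick illustration, assuming the Hugging Face datasets library is installed, a small public corpus such as WikiText can be loaded in a couple of lines; any other raw-text corpus can be used the same way.

from datasets import load_dataset

# Load a small public text corpus; "wikitext-2-raw-v1" is one common choice
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(len(dataset))               # number of raw text records
print(dataset[10]["text"][:200])  # a snippet of raw text, ready for tokenization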

MinGPT

Numerous implementations of the GPT architecture exist, each designed for specific purposes. In upcoming lessons, we will thoroughly explore alternative libraries that are better suited for production environments. However, we are introducing a lightweight repository implemented by Andrej Karpathy, named minGPT. This represents a minimal implementation of OpenAI's GPT-2 model.

In his own words, this serves as an educational implementation that strives to remove all complexities, achieving a length of just 300 lines of code and using the PyTorch library. This valuable resource provides an excellent opportunity to read and enhance your understanding of what's happening under the hood. Abundant comments in the code describe the processes and act as a helpful guide.

Three main files can be found within the repository. First, model.py handles the definition of the architecture details. Second, bpe.py is responsible for the tokenization process using the BPE algorithm. Lastly, trainer.py implements a generic training loop that can be used for any neural network, not just the GPT architecture. Furthermore, the demo.ipynb file contains a notebook that demonstrates the complete utilization of the code, including the inference process. The code can be executed on a MacBook Air, making it easy to run on your local PC. Alternatively, you can fork the repository and use services like Colab.
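
As a quick taste, the snippet below follows the usage pattern shown in the minGPT README to instantiate a GPT-2-sized model; the configuration fields are taken from that README, so check the repository in case the API has changed.

from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = 'gpt2'    # GPT-2 (124M) layer/head/embedding sizes
model_config.vocab_size = 50257     # size of the GPT-2 BPE vocabulary
model_config.block_size = 1024      # maximum context length in tokens
model = GPT(model_config)           # builds the randomly initialized model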

Conclusion

The decoder-only architecture and GPT-family models have driven the recent advancements in large language models. It is essential to possess a strong grasp of the transformer architecture and comprehend the distinctive features that set the decoder-only models apart, making them well-suited for language modeling. We have explored the shared components and delved deeper into what makes their architecture unique. Subsequent lessons will cover various other aspects of language models.