Understanding Transformers

Introduction

In this lesson, we will dive deeper into Transformers and provide a comprehensive understanding of their various components. We will also cover the network's inner mechanisms.

We will look into the seminal paper “Attention Is All You Need” and examine a diagram of the components of a Transformer. Lastly, we will see how these components are used in practice in the popular Hugging Face transformers library.

Attention Is All You Need

The Transformer architecture was proposed in a paper called “Attention Is All You Need,” a collaborative effort between Google Brain and the University of Toronto. It presented an encoder-decoder network powered by attention mechanisms for automatic translation tasks, demonstrating superior performance over previous benchmarks (the WMT 2014 translation tasks) at a fraction of the cost. As the authors report: “On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.”

While Transformers have proven highly effective in various tasks such as classification, summarization, and, more recently, language generation, an equally significant contribution is that the architecture can be trained in a highly parallelized fashion.

The expansion of the architecture into three distinct categories allowed for greater flexibility and specialization in handling different tasks:

  • The encoder-only category focused on extracting meaningful representations from input data. An example model of this category is BERT.
  • The encoder-decoder category enabled sequence-to-sequence tasks such as translation and summarization or training multimodal models like caption generators. An example model of this category is BART.
  • The decoder-only category specializes in generating outputs based on given instructions, as we have in Large Language Models. An example model of this category is GPT.

The Architecture

Now, let's examine the crucial elements of the Transformer model in more detail.

The overview of Transformer architecture. The left component is called the encoder, which is connected to the decoder using a cross-attention mechanism. (Image taken from the “Attention is all you need” paper)

Input Embedding

The initial step involves translating the input tokens into embeddings. These embeddings are learned vectors that represent the input tokens, enabling the model to capture the semantic meaning of the words. The size of the embedding vector varies based on the model's scale and design choices. For instance, OpenAI's GPT-3 uses 12,288-dimensional embedding vectors, while smaller models like BERT base use vectors of size 768.
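
To make this concrete, here is a minimal sketch of an embedding lookup in PyTorch; the vocabulary size and embedding dimension below are arbitrary illustrative values rather than those of any specific model.

import torch
from torch import nn

# Hypothetical sizes for illustration: a 50,000-token vocabulary and 768-dimensional embeddings.
vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[12, 4051, 9, 287]])  # a batch with one 4-token sequence
vectors = embedding(token_ids)                  # one learned vector per token
print(vectors.size())                           # torch.Size([1, 4, 768])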

Positional Encoding

Given that the Transformer lacks the recurrence of RNNs, which feed the input one token at a time, it needs a method for taking the position of words within a sentence into account. This is accomplished by adding positional encodings to the input embeddings. These encodings are vectors that encode the position of each word in the sentence.
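
For reference, the original paper uses fixed sinusoidal encodings, while many later models (including the OPT model inspected below) learn these vectors during training. The following is a minimal sketch of the sinusoidal variant, assuming the formula from the paper:

import math
import torch

def sinusoidal_positional_encoding(seq_len: int, embed_dim: int) -> torch.Tensor:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_dim)
    )
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, embed_dim=768)
print(pe.size())  # torch.Size([10, 768]); added element-wise to the input embeddings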

Self-Attention Mechanism

At the core of the Transformer model lies the self-attention mechanism, which, for each word, calculates a weighted sum of the embeddings of all words in the sentence. The weights are determined by learned “attention” scores between words: terms with higher relevance to one another receive higher attention weights.

This is implemented by projecting each input into Query, Key, and Value vectors. Here is a brief description of each vector; a short code sketch of how they interact follows the list.

  • Query Vector: It represents the word or token for which the attention weights are being calculated. The Query vector determines which parts of the input sequence should receive more attention. Multiplying word embeddings with the Query vector is like asking, "What should I pay attention to?"
  • Key Vector: It represents the set of words or tokens in the input sequence that are compared with the Query. The Key vector helps identify the relevant or essential information in the input sequence. Multiplying word embeddings with the Key vector is like asking, "What is important to consider?"
  • Value Vector: It contains the input sequence's associated information or features for each word or token. The Value vector provides the actual data that will be weighted and combined based on the attention weights calculated between the Query and Key. The Value vector answers the question, "What information do we have?"
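
Putting the three vectors together, the sketch below shows single-head scaled dot-product attention in PyTorch. The projection layers are randomly initialized and the sizes are illustrative; a trained model would use its learned weights instead.

import math
import torch
from torch import nn

embed_dim, seq_len = 64, 10
x = torch.randn(1, seq_len, embed_dim)  # embeddings (plus positional encodings)

# Learned projections that produce the Query, Key, and Value vectors.
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)
Q, K, V = q_proj(x), k_proj(x), v_proj(x)

# Attention scores: how relevant each token is to every other token.
scores = Q @ K.transpose(-2, -1) / math.sqrt(embed_dim)  # (1, seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)

# Each output vector is a weighted sum of the Value vectors.
output = weights @ V                                     # (1, seq_len, embed_dim)
print(weights.size(), output.size())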

Before the advent of the Transformer architecture, the attention mechanism was mainly used to compare two portions of text. In summarization, for example, the model could focus on different parts of the input article while generating each part of the summary.

The self-attention mechanism enables models to highlight the parts of the content that are most important for the task. It is helpful in both encoder-only and decoder-only models for building a powerful representation of the input: in encoder-only scenarios, the text is transformed into rich embeddings, whereas decoder-only models use the representation to generate text.

The effectiveness of the attention mechanism significantly increases when applied in a multi-head setting. In this configuration, multiple attention components process the same information, with each head learning to focus on distinct aspects of the text, such as verbs, nouns, numbers, and more, throughout the training process.
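
As an illustration, PyTorch's built-in torch.nn.MultiheadAttention splits the embedding dimension across several heads and runs them in parallel; the sizes below are arbitrary examples rather than those of a particular model.

import torch
from torch import nn

embed_dim, num_heads, seq_len = 512, 8, 10

# Each of the 8 heads works on a 512 / 8 = 64-dimensional slice of the embeddings.
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)
# In self-attention, the same tensor serves as query, key, and value.
output, attn_weights = mha(x, x, x)
print(output.size())        # torch.Size([1, 10, 512])
print(attn_weights.size())  # torch.Size([1, 10, 10]), averaged over heads by default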

The Architecture In Action

This section demonstrates how the components above operate inside a pre-trained large language model, providing insight into their inner workings using the Hugging Face transformers library.

To begin, we load the model and tokenizer using AutoModelForCausalLM and AutoTokenizer, respectively. Then, we proceed to tokenize a sample phrase, which will serve as our input in the following steps.


from transformers import AutoModelForCausalLM, AutoTokenizer

OPT = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inp = "The quick brown fox jumps over the lazy dog"
inp_tokenized = tokenizer(inp, return_tensors="pt")
print(inp_tokenized['input_ids'].size())
print(inp_tokenized)
The sample code.
torch.Size([1, 10])
{'input_ids': tensor([[    2,   133,  2119,  6219, 23602, 13855,    81,     5, 22414,  2335]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
The output.

We load Facebook's Open Pre-trained Transformer model with 1.3B parameters (facebook/opt-1.3b) in the 8-bit format, a memory-saving approach to efficiently utilize GPU resources. The tokenizer object loads the required vocabulary to interact with the model and will be used to convert the sample input (inp variable) to the token IDs and attention mask.

Let’s look at the model’s architecture by accessing its .model attribute.

print(OPT.model)
The sample code.
OPTModel(
  (decoder): OPTDecoder(
    (embed_tokens): Embedding(50272, 2048, padding_idx=1)
    (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
    (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0-23): 24 x OPTDecoderLayer(
        (self_attn): OPTAttention(
          (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
        )
        (activation_fn): ReLU()
        (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
        (fc2): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)
The output.

The model is decoder-only, a common characteristic of transformer-based language models. Consequently, we access its inner components through the decoder key. Examining the layers key also reveals that the decoder is composed of 24 stacked layers sharing the same architecture. To begin, we look at the embedding layer.

embedded_input = OPT.model.decoder.embed_tokens(inp_tokenized['input_ids'])
print("Layer:\t", OPT.model.decoder.embed_tokens)
print("Size:\t", embedded_input.size())
print("Output:\t", embedded_input)
The sample code.
Layer:	 Embedding(50272, 2048, padding_idx=1)
Size:	   torch.Size([1, 10, 2048])
Output:	 tensor([[[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0371,  0.0220, -0.0096,  ...,  0.0265, -0.0166, -0.0030],
         [-0.0455, -0.0236, -0.0121,  ...,  0.0043, -0.0166,  0.0193],
         ...,
         [ 0.0007,  0.0267,  0.0257,  ...,  0.0622,  0.0421,  0.0279],
         [-0.0126,  0.0347, -0.0352,  ..., -0.0393, -0.0396, -0.0102],
         [-0.0115,  0.0319,  0.0274,  ..., -0.0472, -0.0059,  0.0341]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)
The output.

The embedding layer is accessible through the .embed_tokens attribute of the decoder component, and we pass our tokenized inputs to it. As you can see, the embedding layer transforms a list of IDs of size [1, 10] into a tensor of size [1, 10, 2048]. This representation is then passed through the decoder layers.

Next, the positional encoding component uses the attention mask to generate a vector that conveys a sense of position within the model. The following code uses the decoder’s .embed_positions layer to generate the positional embeddings. As shown, the layer generates a distinct vector for each position, which is added to the output of the embedding layer. This process introduces supplementary positional information into the model.

embed_pos_input = OPT.model.decoder.embed_positions(inp_tokenized['attention_mask'])
print("Layer:\t", OPT.model.decoder.embed_positions)
print("Size:\t", embed_pos_input.size())
print("Output:\t", embed_pos_input)
The sample code.
Layer:	 OPTLearnedPositionalEmbedding(2050, 2048)
Size:	   torch.Size([1, 10, 2048])
Output:	 tensor([[[-8.1406e-03, -2.6221e-01,  6.0768e-03,  ...,  1.7273e-02,
          -5.0621e-03, -1.6220e-02],
         [-8.0585e-05,  2.5000e-01, -1.6632e-02,  ..., -1.5419e-02,
          -1.7838e-02,  2.4948e-02],
         [-9.9411e-03, -1.4978e-01,  1.7557e-03,  ...,  3.7117e-03,
          -1.6434e-02, -9.9087e-04],
         ...,
         [ 3.6979e-04, -7.7454e-02,  1.2955e-02,  ...,  3.9330e-03,
          -1.1642e-02,  7.8506e-03],
         [-2.6779e-03, -2.2446e-02, -1.6754e-02,  ..., -1.3142e-03,
          -7.8583e-03,  2.0096e-02],
         [-8.6288e-03,  1.4233e-01, -1.9012e-02,  ..., -1.8463e-02,
          -9.8572e-03,  8.7662e-03]]], device='cuda:0', dtype=torch.float16,
       grad_fn=<EmbeddingBackward0>)
The output.

Lastly, the self-attention component! We access the first layer’s self-attention component by indexing into the layers and using its .self_attn attribute.

embed_position_input = embedded_input + embed_pos_input
hidden_states, _, _ = OPT.model.decoder.layers[0].self_attn(embed_position_input)
print("Layer:\t", OPT.model.decoder.layers[0].self_attn)
print("Size:\t", hidden_states.size())
print("Output:\t", hidden_states)
The sample code.
Layer:	 OPTAttention(
  (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
)
Size:	   torch.Size([1, 10, 2048])
Output:	 tensor([[[-0.0119, -0.0110,  0.0056,  ...,  0.0094,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         ...,
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)
The output.

The self-attention component comprises the aforementioned query, key, and value projections, culminating in a final output projection. It takes the sum of the embedded input and the positional encoding vectors as its input. In a real forward pass, the model also provides an attention mask to this component so that it can identify which portions of the input should be ignored (omitted from the sample code for simplicity).

The rest of each decoder layer applies a feed-forward network with a non-linearity (ReLU in this model) and layer normalization.
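
As a quick sanity check that these stacked components add up to a working language model, we can run the loaded OPT model end to end with its generate method and let it continue the sample phrase. This is only a sketch; the exact continuation depends on the model version and decoding settings (greedy decoding is used here by default).

# Move the tokenized inputs to the model's device and generate a short continuation.
generated = OPT.generate(**inp_tokenized.to(OPT.device), max_new_tokens=10)
print(tokenizer.decode(generated[0], skip_special_tokens=True))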

💡
If you are interested in learning about the Transformer architecture in more detail and implementing a GPT-like network from scratch, we recommend watching the following video by Andrej Karpathy: https://www.youtube.com/watch?v=kCc8FmEb1nY.

Conclusion

This lesson provided an overview of the transformer architecture and dove deeper into the model's structure by loading a pre-trained model and extracting its essential components. We also looked at what occurs inside an LLM under the surface, with the attention mechanism serving as the core component of the model.

In the next lesson, we will cover the diverse architectures of the transformer: encoder-decoder, decoder-only (like the GPTs), and encoder-only (like BERT).

In this Notebook, you can find the code for this lesson.