Transformer Architectures

Introduction

The transformer architecture has demonstrated its versatility in various applications. The original network was presented as an encoder-decoder architecture for translation tasks. It later evolved into encoder-only models, such as BERT, and decoder-only networks, introduced with the first iteration of the GPT models.

The differences extend beyond just network design and also encompass the learning objectives. These contrasting learning objectives play a crucial role in shaping the model's behavior and outcomes. Understanding these differences is essential for selecting the most suitable architecture for a given task and achieving optimal performance in various applications.

In this lesson, we will explore the distinctions between these architectures by loading pre-trained models. The goal is to dive deeper into each architecture.

The Encoder-Decoder Architecture

Image taken from the “Attention is all you need” paper

The encoder-decoder, also known as the full transformer architecture, comprises multiple stacked encoder components connected to several stacked decoder components through a cross-attention mechanism.

It is notably well-suited for sequence-to-sequence tasks (i.e., tasks where text is both the input and the output), such as translation or summarization, and also for multi-modal settings such as image captioning, where an image is the input and the corresponding caption is the expected output. Cross-attention helps the decoder focus on the most relevant parts of the encoded input during generation.

A notable example of this approach is the BART pre-trained model. The architecture incorporates a bi-directional encoder, responsible for creating a rich representation of the input, and an autoregressive decoder that generates the output one token at a time. During pre-training, the encoder receives a corrupted (e.g., randomly masked) version of the input, the decoder receives the original input shifted by one token, and the learning objective is to reconstruct the original text.
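To get a feel for this denoising objective, here is a minimal sketch (an illustrative addition, not part of the lesson's original code) that assumes the facebook/bart-large checkpoint, the BartForConditionalGeneration class, and a made-up sentence containing a <mask> span; the model is asked to reconstruct the missing words.

from transformers import BartForConditionalGeneration, BartTokenizer

# Load BART with its sequence-to-sequence head so we can generate text.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# A corrupted input: one span is replaced with the <mask> token.
text = "The encoder-decoder architecture is well suited for <mask> tasks such as translation."
inputs = tokenizer(text, return_tensors="pt")

# The decoder reconstructs the sentence one token at a time.
output_ids = model.generate(inputs["input_ids"], max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))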

The provided code below loads the BART model so we can examine its architecture.

from transformers import AutoModel, AutoTokenizer

# Load the pre-trained BART encoder-decoder model and print its structure.
BART = AutoModel.from_pretrained("facebook/bart-large")
print(BART)
The sample code.
BartModel(
  (shared): Embedding(50265, 1024, padding_idx=1)
  (encoder): BartEncoder(
    (embed_tokens): Embedding(50265, 1024, padding_idx=1)
    (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
    (layers): ModuleList(
      (0-11): 12 x BartEncoderLayer(
        (self_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): BartDecoder(
    (embed_tokens): Embedding(50265, 1024, padding_idx=1)
    (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
    (layers): ModuleList(
      (0-11): 12 x BartDecoderLayer(
        (self_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (activation_fn): GELUActivation()
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): BartAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
)
The output.

We are already familiar with most of the layers in the BART model. The model comprises encoder and decoder components, each consisting of 12 layers. The decoder, however, contains an extra encoder_attn layer, referred to as cross-attention, which conditions the decoder’s output on the encoder’s representations.
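As a quick check of how the two components interact, the minimal sketch below (an illustrative addition that assumes an AutoTokenizer for the same facebook/bart-large checkpoint and an arbitrary example sentence) runs a forward pass and prints the shapes of the encoder and decoder hidden states.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModel.from_pretrained("facebook/bart-large")

inputs = tokenizer("Cross-attention links the two stacks.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The encoder produces one 1024-dimensional vector per input token ...
print(outputs.encoder_last_hidden_state.shape)
# ... and the decoder attends to those vectors through cross-attention
# while building its own hidden states.
print(outputs.last_hidden_state.shape)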

We can use a fine-tuned version of this model for summarization through the Transformers library’s pipeline functionality.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = """Gaga was best known in the 2010s for pop hits like “Poker Face” and avant-garde experimentation on albums like “Artpop,” and Bennett, a singer who mostly stuck to standards, was in his 80s when the pair met. And yet Bennett and Gaga became fast friends and close collaborators, which they remained until Bennett’s death at 96 on Friday. They recorded two albums together, 2014’s “Cheek to Cheek” and 2021’s “Love for Sale,” which both won Grammys for best traditional pop vocal album."""

summary = summarizer(text, min_length=20, max_length=50)
print(summary[0]['summary_text'])
The sample code.
Bennett and Gaga became fast friends and close collaborators. They recorded two albums together, 2014's "Cheek to Cheek" and 2021's "Love for Sale"
The output.

The Encoder-Only Architecture

Image taken from the “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” paper

As implied by the name, encoder-only models are formed by stacking multiple encoder components. Since there is no decoder attached, the encoder’s output can be used directly as a text-to-vector method, for instance, to measure similarity. Alternatively, it can be combined with a classification head (a feedforward layer) on top to facilitate label prediction (in libraries such as Hugging Face Transformers, this is also known as a pooler layer).
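As a rough illustration of the text-to-vector use case, the sketch below (an illustrative addition that assumes the bert-base-uncased checkpoint, simple mean pooling, and two made-up sentences; dedicated sentence-embedding models would normally do this better) compares two sentences by the cosine similarity of their encoder outputs.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool the last hidden states into a single sentence vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1)

a = embed("The movie was fantastic.")
b = embed("I really enjoyed the film.")
print(torch.cosine_similarity(a, b).item())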

The primary distinction in the encoder-only architecture lies in the absence of the Masked Self-Attention layer. As a result, the encoder can handle the entire input simultaneously. This differs from decoders, where future tokens need to be masked during training to prevent “cheating” when generating new tokens. Due to this property, they are ideally suited for creating representations from a document while retaining complete information.

The BERT paper (along with improved variants such as RoBERTa) introduced a widely recognized pre-trained model that significantly improved state-of-the-art scores on numerous NLP tasks. The model is pre-trained with two learning objectives:

  1. Masked Language Modeling: random tokens in the input are masked, and the model attempts to predict them (a short example follows this list).
  2. Next Sentence Prediction: sentences are presented in pairs, and the model predicts whether the second sentence actually follows the first in the original text.
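The first objective is easy to try with the fill-mask pipeline; the sketch below is an illustrative addition that assumes the bert-base-uncased checkpoint, its [MASK] token, and a made-up example sentence.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT ranks candidate tokens for the [MASK] position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))

The code below loads the plain BERT model so we can examine its architecture.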
BERT = AutoModel.from_pretrained("bert-base-uncased")
print(BERT)
The sample code.
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
The output.

The BERT model adopts the conventional transformer input embeddings followed by 12 encoder blocks. The network’s output is then passed to a pooler layer, a feed-forward linear layer followed by a Tanh non-linearity applied to the hidden state of the first ([CLS]) token, which generates the final representation. This representation can subsequently be used for tasks such as classification or similarity assessment.
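To see what the pooler adds on top of the token-level outputs, here is a minimal sketch (an illustrative addition that loads bert-base-uncased again with an arbitrary input sentence) comparing the per-token hidden states with the pooled sentence-level vector.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are versatile.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # one 768-dimensional vector per token
print(outputs.pooler_output.shape)      # a single vector derived from the [CLS] token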

The following code uses the fine-tuned version of the BERT model for sentiment analysis.

classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
lbl = classifier("""This restaurant is awesome.""")

print(lbl)
The sample code.
[{'label': '5 stars', 'score': 0.8550480604171753}]
The output.

The Decoder-Only Architecture

Image taken from the “Improving language understanding with unsupervised learning” paper

Decoder-only networks continue to serve as the foundation for most large language models today, with slight variations in some instances. Because they use masked (causal) self-attention, their primary use case is next-token prediction, which sparked the concept of prompting.

Research demonstrated that scaling up the decoder-only models can significantly enhance the network's language understanding and generalization capabilities. As a result, they can excel at a diverse range of tasks simply by using different prompts. Large pre-trained models like GPT-4 and LLaMA 2 exhibit the ability to perform tasks such as classification, summarization, translation, etc., by leveraging the appropriate prompt.

Large language models such as those in the GPT family are pre-trained with the causal language modeling objective: the model learns to predict the next token, while the attention mechanism can only attend to tokens on the left. The model must therefore rely solely on the preceding context and cannot peek at future tokens, preventing any form of cheating.
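To make the objective concrete, the minimal sketch below (an illustrative addition that assumes the gpt2 checkpoint and the GPT2LMHeadModel class, which adds a language-modeling head on top of the base model, plus a made-up prompt) prints the model’s most likely next token.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Only the logits at the last position matter for predicting the next token;
# earlier positions never attend to the tokens on their right.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))

The code below loads the base GPT-2 model so we can inspect its architecture.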

gpt2 = AutoModel.from_pretrained("gpt2")
print(gpt2)
The sample code.
GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
The output.

When examining the architecture, you will notice the standard transformer decoder block with cross-attention removed. The GPT family also replaces the standard linear layers with Conv1D layers, which store the weight matrix transposed (not to be confused with PyTorch's convolutional layers). This design choice is specific to OpenAI's GPT models, while most other open-source large language models use the standard linear layer; the short sketch below illustrates the difference in weight layout.
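The following is an illustrative addition that loads both GPT-2 and BERT and inspects one feed-forward projection in each; the specific layers chosen here are just examples.

from transformers import AutoModel

gpt2 = AutoModel.from_pretrained("gpt2")
bert = AutoModel.from_pretrained("bert-base-uncased")

# GPT-2's Conv1D stores its weight as (in_features, out_features) ...
print(gpt2.h[0].mlp.c_fc.weight.shape)
# ... while BERT's standard nn.Linear stores (out_features, in_features).
print(bert.encoder.layer[0].intermediate.dense.weight.shape)

With the architecture clear, the code below uses the pipeline to run the GPT-2 model for text generation. It generates four different completions of the phrase "This movie was a very".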

generator = pipeline(model="gpt2")
output = generator(
    "This movie was a very",
    do_sample=True,
    top_p=0.95,
    num_return_sequences=4,
    max_new_tokens=50,
    return_full_text=False,
)

for item in output:
  print(">", item['generated_text'])
The sample code.
>  hard thing to make, but this movie is still one of the most amazing shows I've seen in years. You know, it's sort of fun for a couple of decades to watch, and all that stuff, but one thing's for sure —
>  special thing and that's what really really made this movie special," said Kiefer Sutherland, who co-wrote and directed the film's cinematography. "A lot of times things in our lives get passed on from one generation to another, whether
>  good, good effort and I have no doubt that if it has been released, I will be very pleased with it."

Read more at the Mirror.
>  enjoyable one for the many reasons that I would like to talk about here. First off, I'm not just talking about the original cast, I'm talking about the cast members that we've seen before and it would be fair to say that none of
The output.
💡
Please be aware that running the above code will yield different outputs due to the randomness involved in the generation process.

Conclusion

In this lesson, we explored the various types of transformer-based models and their areas of maximum effectiveness. While LLMs may appear to be the ultimate solution for every task, it's essential to note that there are instances where smaller, more focused models can produce equally good results while operating more efficiently. Using a small model like DistilBERT on your local server to measure similarity could be more suitable for specific applications while offering a cost-effective alternative to using proprietary models and APIs.

Moreover, while the transformer paper introduced a highly effective architecture, many variants have since been explored with relatively minor changes, such as different embedding sizes and hidden dimensions. Experiments have also shown that moving the layer normalization before the attention mechanism (pre-normalization) can improve training stability and the model's capabilities. Keep in mind that there may be slight architectural variations, especially in proprietary models like GPT-3 whose code has not been released.

In this Notebook, you can find the code for this lesson.