Overview of the Training Process

Introduction

In this lesson, we will provide an overview of the steps involved in training LLMs: preparing the training data, choosing the model architecture, and running the training process. By the end of this lesson, you will have a solid understanding of how large language models are trained.

The training process begins with the selection of one or a combination of suitable datasets, proceeds with the initialization of the neural network, and ultimately concludes with the execution of the training loop. We will also discuss the process of saving the weights for future utilization. Although this process may seem challenging, we will break it down into steps and approach each one separately to aid your understanding of the intricacies involved.
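The flow just described can be sketched as a minimal training loop. The toy linear model, manual gradients, and file name below are illustrative stand-ins chosen for brevity, not part of an actual LLM pipeline:

```python
import numpy as np

# Toy stand-in for a training run: the stages mirror the ones above,
# even though the model here is a linear regression with hand-derived
# gradients instead of a transformer with backpropagation.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))          # 1. select/prepare a dataset
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)                       # 2. initialize the network's weights
for step in range(200):               # 3. run the training loop
    pred = X @ w                      #    forward pass
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient of the MSE loss
    w -= 0.1 * grad                   #    parameter update

np.save("weights", w)                 # 4. save the weights for future use
```

A real run differs mainly in scale: the forward pass is a deep network, gradients come from automatic differentiation, and checkpoints are saved periodically rather than once at the end.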

The Dataset

Whether you are training a general LLM or a specialized one for a specific domain, curating a comprehensive dataset of relevant text is the most crucial step. With the transformer now firmly established as the go-to architecture in neural networks and Natural Language Processing, the size and quality of the dataset have become the main factors determining a model's performance.

There are several well-known datasets that you can use as a source of public knowledge. For instance, consider datasets like The Pile, Common Crawl, or Wikipedia, which contain extensive collections of web pages, articles, and books. Collectively, these datasets comprise hundreds of billions of tokens, providing diverse learning material for the model.

These datasets are mostly available publicly through different sources. We prepared a Deep Lake repository containing several datasets that we’ll use in this course; find it here.

The next category of datasets matters only if you are training a model for a specific use case based on data your organization has at hand or has curated. Note that the required dataset size varies depending on your application and on whether you opt for fine-tuning or training from scratch. The data can be obtained through web scraping of news websites, forums, or publicly accessible databases, in addition to leveraging your own private knowledge base. It is also possible to use a foundational LLM to generate a synthetic dataset for training a specialized domain LLM, which may be less expensive and faster to run than the large foundational model itself.

Splitting the dataset into training and validation sets is a standard process. The training set is utilized during the training process to optimize the model's parameters. On the other hand, the validation set is used to assess the model's performance and ensure it is not overfitting by evaluating its generalization ability.
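As a sketch, a random split can be done in a few lines of plain Python; the `train_val_split` helper and the 90/10 ratio below are arbitrary illustrative choices:

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle a list of examples and split it into train and validation sets."""
    rng = random.Random(seed)           # fixed seed makes the split reproducible
    shuffled = examples[:]              # copy so the original list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

docs = [f"document {i}" for i in range(100)]
train, val = train_val_split(docs, val_fraction=0.1)
print(len(train), len(val))  # 90 10
```

In practice, dataset libraries provide equivalent utilities, but the principle is the same: the two sets must not overlap, and the validation set should be representative of the training distribution.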

The Model

The transformer has been the dominant network architecture for natural language processing tasks in recent years. It is powered by the attention mechanism, which enables the models to accurately identify the relationships between words. This architecture has produced state-of-the-art scores on numerous NLP tasks over the years and powers well-known LLMs like the GPT family. Based on the literature, it is evident that increasing the number of parameters in transformer-based networks enhances language generation and comprehension ability.
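At the heart of the architecture is scaled dot-product attention. A minimal NumPy sketch of a single head, without the learned query/key/value projections a real transformer would add:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # pairwise token similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

# Self-attention over 3 tokens with 4-dimensional embeddings
x = np.random.default_rng(0).normal(size=(3, 4))
out, weights = attention(x, x, x)
print(out.shape)  # (3, 4): one context-aware vector per token
```

Each output row is a weighted mixture of all token vectors, which is exactly how the model captures relationships between words.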

With the widespread adoption of transformers, you have the option to utilize libraries such as TensorFlow, PyTorch, and Hugging Face Transformers to initialize the architecture. Alternatively, you can code it yourself by referring to the numerous tutorials available to gain a more in-depth understanding.

One of the benefits of utilizing the transformers library developed by Hugging Face is the availability of the Hugging Face Hub, which simplifies the process of loading open-source LLMs such as Bloom or OpenAssistant.

Training

The first generation of foundational models like BERT were trained with the Masked Language Modeling (MLM) objective. This is achieved by randomly masking words from the corpus and training the model to predict the masked word. With this objective, the model learns to consider the contextual information both preceding and following the masked word when making its prediction. However, this objective is not the most suitable choice for generative tasks, since ideally the model should not have access to future words while predicting the current word.
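A simplified sketch of MLM-style masking over a tokenized sentence. Note that `mask_tokens` is an illustrative helper, not a library function, and the real BERT recipe additionally replaces some selected tokens with random words or leaves them unchanged:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; labels record the originals the model must predict."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)    # scored position: predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # unmasked positions are not scored
    return inputs, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(sentence)
```

The model receives `inputs` and is penalized only on the positions where `labels` holds an original token, so it must use both left and right context to fill in the blanks.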

The GPT family of models uses the autoregressive learning objective: the model always attempts to predict the next word without access to the future content of the corpus. At generation time, the process is iterative, with each generated token fed back to the model to predict the following one. During training, masked attention ensures that, at each time step, the model is prevented from seeing future words.
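Masked attention is typically implemented with a lower-triangular (causal) mask. A small sketch, assuming the mask is applied to the attention scores before the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: position i sees positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# In practice, attention scores where the mask is False are set to -inf
# before the softmax, so each token's prediction ignores everything to
# its right and the model cannot cheat by looking at future words.
print(mask.astype(int))
```

Row i of the mask corresponds to the token at position i: it can attend to itself and to all earlier positions, but every later position is blocked.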

To train or fine-tune models, you have the option to either implement the training loop yourself using libraries such as PyTorch or utilize the Trainer class provided by Hugging Face. The latter option makes it easy to configure hyperparameters, log metrics, save checkpoints, and evaluate the model.

Conclusion

The training process often involves significant trial and error to achieve optimal results. Using libraries can significantly expedite the training process and save time by eliminating the need to implement various mechanisms manually. The model's capability is influenced by various factors, including its size, the size of the dataset, and the chosen hyperparameters, which collectively contribute to the complexity of the process.

In upcoming lessons, we will explore each training process step with more detailed explanations.