Datasets for Training LLMs

Introduction

In this lesson, we talk about the datasets that fuel LLM pretraining. We'll explore popular datasets like Falcon RefinedWeb, The Pile, RedPajama, and Stack Overflow Posts, understanding their composition, sources, and usage. We'll also discuss the emerging trend of prioritizing data quality over quantity in pretraining LLMs.

Popular Datasets for Training LLMs

In recent years, a variety of open-source datasets have been employed for pretraining Large Language Models.

Some of the notable datasets include Falcon RefinedWeb, The Pile, RedPajama, and Stack Overflow Posts, among others. Assembling such datasets typically involves collecting and cleaning vast volumes of text data.

Falcon RefinedWeb

The Falcon RefinedWeb dataset is a large-scale English web dataset developed by TII and released under the ODC-By 1.0 license. It was created through rigorous filtering and extensive deduplication of CommonCrawl; models trained solely on this web data have shown performance comparable or superior to models trained on curated corpora.

The dataset is designed to be "multimodal-friendly," as it includes links and alt text for images in the processed samples. Depending on the tokenizer used, the public extract of this dataset contains roughly 500 to 650 GT (billion tokens) and requires about 2.8TB of local storage when unpacked.

Falcon RefinedWeb has been primarily used for training Falcon LLM models, including the Falcon-7B/40B and Falcon-RW-1B/7B models. The dataset is primarily in English, and each data instance corresponds to a unique web page that has been crawled, processed, and deduplicated. It contains around 1 billion instances.
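To get a feel for the data, you can stream a few samples directly from the Hugging Face Hub. The sketch below assumes the public extract is hosted as "tiiuae/falcon-refinedweb" with the page text in a "content" field, and uses the Falcon-7B tokenizer to estimate token counts; check the dataset card for the exact schema before relying on it.

```python
# A minimal sketch: stream a handful of RefinedWeb pages and estimate tokens.
# Assumes the public extract is hosted as "tiiuae/falcon-refinedweb" and that
# each record stores the page text in a "content" field (check the dataset card).
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

total_tokens = 0
n_pages = 100  # inspect only the first few pages
for i, record in enumerate(dataset):
    total_tokens += len(tokenizer.encode(record["content"]))
    if i + 1 == n_pages:
        break

print(f"Average of ~{total_tokens / n_pages:.0f} tokens per page over {n_pages} pages")
```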

The dataset was constructed using the Macrodata Refinement Pipeline, which includes content extraction, filtering heuristics, and deduplication. The design philosophy of RefinedWeb prioritizes scale, strict deduplication, and neutral filtering. The dataset was iteratively refined by measuring the zero-shot performance of models trained on development versions of the dataset and manually auditing samples to identify potential filtering improvements.
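The Macrodata Refinement Pipeline itself is described in the RefinedWeb paper rather than shipped as a library, but the kind of document-level filtering heuristics such pipelines apply can be sketched as follows. The thresholds below are invented for illustration and are not the actual MDR rules.

```python
# Illustrative document-level filtering heuristics in the spirit of web-data
# refinement pipelines. The thresholds below are invented for this example and
# are NOT the actual rules used by the Macrodata Refinement Pipeline.
def keep_document(text: str) -> bool:
    words = text.split()
    if not (50 <= len(words) <= 100_000):             # drop very short/long pages
        return False
    if sum(len(w) for w in words) / len(words) > 12:  # unnaturally long "words"
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines and sum(ln.strip().endswith("...") for ln in lines) / len(lines) > 0.3:
        return False                                  # too many truncated lines
    if text.count("{") + text.count("}") > 0.01 * len(text):
        return False                                  # likely leftover markup/code
    return True

docs = ["A short page.", "This is a longer, well-formed paragraph of prose. " * 20]
print([keep_document(d) for d in docs])  # [False, True]
```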

The Pile

The Pile is a comprehensive, open-source dataset of English text designed specifically for training LLMs. Developed by EleutherAI in 2020, it's a massive 886.03GB dataset comprising 22 smaller datasets, 14 of which are new. Prior to the Pile's creation, most LLMs were trained using data from the Common Crawl. However, the Pile offers a more diverse range of data, enabling LLMs to handle a broader array of situations post-training.

The Pile is a carefully curated collection of data handpicked by EleutherAI's researchers to include information they deemed necessary for language models to learn. The Pile covers a wide range of topics and writing styles, including academic writing, a style that models trained on other datasets often struggle with.

All data used in the Pile was sourced from publicly accessible resources and filtered to remove duplicates and non-textual elements like HTML formatting and links. However, individual documents within the sub-datasets were not filtered to remove non-English, biased, or profane text, nor was consent considered in the data collection process.
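The Pile is distributed as sharded jsonl files in which each line carries a "text" field and a "meta" object with a "pile_set_name" identifying the sub-dataset. Assuming you have a decompressed shard on disk, a quick way to inspect its composition is a sketch like this:

```python
# A minimal sketch: tally The Pile's composition from a local shard. Assumes a
# decompressed jsonl shard (e.g. "00.jsonl") where each line is a JSON object
# with a "text" field and a "meta" object containing "pile_set_name", as in the
# public release; adjust the path and field names to your copy.
import json
from collections import Counter

doc_counts = Counter()
char_counts = Counter()

with open("00.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        subset = record["meta"]["pile_set_name"]  # e.g. "Pile-CC", "ArXiv"
        doc_counts[subset] += 1
        char_counts[subset] += len(record["text"])

for subset, n_docs in doc_counts.most_common():
    print(f"{subset:25s} {n_docs:>10,} docs  {char_counts[subset]:>15,} chars")
```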

Originally developed for EleutherAI's GPT-Neo models, the Pile has since been used to train a variety of other models.

RedPajama Dataset

The RedPajama dataset is a comprehensive, open-source dataset created to reproduce the training data described in the LLaMA paper. It comprises 2,084 jsonl files, which can be accessed via HuggingFace or downloaded directly. The dataset is primarily in English, though its Wikipedia section includes multiple languages.

Each record is structured into text and metadata, including the URL, timestamp, source, language, and more. It also specifies the subset of the RedPajama dataset it belongs to, such as Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, or StackExchange.
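A quick way to inspect this structure is to stream one subset from the Hugging Face Hub. The sketch below assumes the dataset is hosted as "togethercomputer/RedPajama-Data-1T" with per-source configurations such as "arxiv"; see the dataset card for the exact configuration names and schema.

```python
# A minimal sketch: stream one RedPajama subset and inspect a record's fields.
# Assumes the dataset is hosted as "togethercomputer/RedPajama-Data-1T" with
# per-source configurations such as "arxiv"; recent versions of the datasets
# library may also require trust_remote_code=True (or downloading the jsonl
# files directly). Check the dataset card for the exact names and schema.
from datasets import load_dataset

arxiv = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv", split="train", streaming=True
)

sample = next(iter(arxiv))
print(sample.keys())                          # expected: "text" plus metadata
print(str(sample.get("meta", sample))[:300])  # source, timestamp, language, ...
```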

The dataset is sourced from various platforms:

  • Commoncrawl data is processed through the official cc_net pipeline, deduplicated, and filtered for quality.
  • C4 data is obtained from HuggingFace and formatted to suit the dataset's structure.
  • GitHub data is sourced from Google BigQuery, deduplicated, and filtered for quality, with only MIT, BSD, or Apache-licensed projects included.
  • The Wikipedia data is sourced from HuggingFace and is based on a 2023 dump, with hyperlinks, comments, and other formatting removed.
  • Gutenberg and Books3 data are also downloaded from HuggingFace, with near duplicates removed using simhash (see the sketch after this list).
  • ArXiv data is sourced from Amazon S3, with only LaTeX source files included and preambles, comments, macros, and bibliographies removed.
  • Lastly, StackExchange data is sourced from the Internet Archive, with only the posts from the 28 largest sites included, HTML tags removed, and posts grouped into question-answer pairs.
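Since several of these steps rely on near-deduplication, here is a minimal, self-contained SimHash sketch in the spirit of the Gutenberg/Books3 step above. Production pipelines tune shingle sizes, use banded indexing for scale, and pick thresholds empirically; this version is illustrative only.

```python
# A minimal, self-contained SimHash sketch for near-duplicate detection.
# Real pipelines tune shingle sizes, use banded indexing for scale, and pick
# thresholds empirically; this version is illustrative only.
import hashlib

def simhash(text: str, shingle_size: int = 3, bits: int = 64) -> int:
    words = text.lower().split()
    shingles = [" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))]
    counts = [0] * bits
    for shingle in shingles:
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river banks"
doc_c = "completely unrelated text about pretraining large language models"

print(hamming(simhash(doc_a), simhash(doc_b)))  # small distance: near-duplicates
print(hamming(simhash(doc_a), simhash(doc_c)))  # much larger distance
```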

The RedPajama dataset encompasses 1.2 trillion tokens, making it a substantial resource for various language model training and research purposes.

Stack Overflow Posts

If you're more interested in a specific domain, such as coding, there are massive datasets available for that, too.

The Stack Overflow Posts dataset comprises approximately 60 million posts submitted to Stack Overflow prior to June 14, 2023. The dataset, sourced from the Internet Archive StackExchange Data Dump, is approximately 35GB in size and contains around 65 billion text characters. Each record in the dataset represents a post and includes fields such as Id, PostTypeId, Body, and ContentLicense, among others.
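A common preprocessing step for such a corpus is pairing questions with their accepted answers. The sketch below assumes the dump is available as a Hugging Face dataset (the community upload "mikex86/stackoverflow-posts" is one option) and follows the Stack Exchange schema, where PostTypeId 1 marks questions, 2 marks answers, and AcceptedAnswerId links the two; verify the field names against the dataset you use.

```python
# A minimal sketch: pair questions with their accepted answers from a sample of
# the dump. The dataset name "mikex86/stackoverflow-posts" and the fields
# "AcceptedAnswerId" / "Title" are assumptions; "Id", "PostTypeId", and "Body"
# follow the Stack Exchange schema (PostTypeId 1 = question, 2 = answer).
from datasets import load_dataset

posts = load_dataset("mikex86/stackoverflow-posts", split="train", streaming=True)

answers = {}    # answer Id -> Body
questions = []  # (Title, Body, AcceptedAnswerId)

for i, post in enumerate(posts):
    if post["PostTypeId"] == 2:      # answer
        answers[post["Id"]] = post["Body"]
    elif post["PostTypeId"] == 1:    # question
        accepted = post.get("AcceptedAnswerId")
        if accepted:
            questions.append((post.get("Title", ""), post["Body"], accepted))
    if i >= 50_000:                  # sample a slice instead of all ~60M posts
        break

pairs = [(title, q, answers[a]) for title, q, a in questions if a in answers]
print(f"Recovered {len(pairs)} question-answer pairs from the sample")
```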

Data Quality vs. Data Quantity in Pretraining

As we just saw, many of the most widely used pretraining datasets today are cleaned and more complete versions of past datasets. There has recently been a shift in focus from increasing dataset size alone to "increasing dataset size AND dataset quality."

The paper "Textbooks Are All You Need," published in June 2023, shows this trend. It introduces Phi-1, an LLM designed for code. Phi-1 is a Transformer-based model with 1.3 billion parameters, trained over a period of four days on eight A100s. Despite its relatively smaller scale, it exhibits remarkable accuracy on benchmarks like HumanEval and MBPP. How? It’s been trained on high-quality data (i.e., textbook-quality data; that’s why the paper name is “Textbooks are all you need”).

The training data for Phi-1 comprises 6 billion tokens of "textbook quality" data from the web and 1 billion tokens from synthetically generated textbooks using GPT-3.5. Although Phi-1's specialization in Python coding and lack of domain-specific knowledge somewhat limit its versatility, these limitations are not inherent and can be addressed to enhance its capabilities.
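The paper describes annotating a small seed set for educational value and then training a lightweight classifier to filter the full web corpus. The sketch below illustrates that second stage; the TF-IDF features, logistic regression model, and toy seed examples are stand-ins chosen for brevity, not the actual Phi-1 recipe, which trained a classifier on examples annotated by an LLM.

```python
# A minimal sketch of classifier-based quality filtering: annotate a small seed
# set (e.g. with an LLM or by hand), train a lightweight classifier, then score
# the full corpus. TF-IDF + logistic regression and the toy examples below are
# stand-ins for brevity, not the actual Phi-1 recipe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = [
    "def binary_search(arr, target): ...  # well-documented, step-by-step example",
    "Exercise: implement a stack using two queues, explaining each step.",
    "aslkdj qwpoei zxncv 1234 1234 1234",
    "click here to download free followers now!!!",
]
seed_labels = [1, 1, 0, 0]  # 1 = "textbook quality", 0 = low quality (toy labels)

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(seed_texts), seed_labels)

corpus = [
    "A tutorial that walks through implementing quicksort, explaining each step.",
    "free free free click now download followers",
]
scores = classifier.predict_proba(vectorizer.transform(corpus))[:, 1]
print(list(zip(scores.round(2), corpus)))
# In a real pipeline you would keep only documents above a tuned score threshold.
```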

Despite its smaller size, the model's success in coding benchmarks demonstrates the significant impact of high-quality and coherent data on the proficiency of language models, thereby shifting the focus from quantity to quality of data.

Creating Your Own Dataset

Creating your own dataset would warrant a whole lesson of its own, so we won't cover it in detail in this course. However, if you're interested in doing so, you can study the creation process of the datasets listed in the sections above, as it's often publicly documented.

Conclusion

This lesson provides a comprehensive overview of the datasets that fuel the pretraining of LLMs.

We delved into popular datasets such as Falcon RefinedWeb, The Pile, RedPajama, and Stack Overflow Posts, understanding their composition, sources, and usage. Often derived from larger, less refined datasets, these collections have been meticulously cleaned and curated to provide high-quality data for training LLMs.

We also discussed the emerging trend of prioritizing data quality over quantity in pretraining LLMs, as exemplified by the Phi-1 model. Despite its smaller scale, Phi-1's high performance on benchmarks underscores the significant impact of high-quality and coherent data on the proficiency of language models. This shift in focus from data quantity to quality is an exciting development in the field of LLMs, highlighting the importance of dataset refinement in achieving superior model performance.