Deep Lake and Data Loaders

Introduction

In this lesson, we focus on Deep Lake, a powerful AI data system that merges the capabilities of Data Lakes and Vector Databases. We’ll explore how Deep Lake can be leveraged for training and fine-tuning Large Language Models, particularly thanks to its efficient data streaming capabilities. We’ll also learn how to create a Deep Lake dataset, add data to it, and load data using both Deep Lake’s and PyTorch’s data loaders.

Deep Lake

In the following lessons about training and fine-tuning LLMs, we’ll need to store the training datasets somewhere, especially for pretraining, since their size is usually too large to fit on a single compute node. Ideally, we’d store the datasets elsewhere and efficiently download data in batches when needed. This is where Deep Lake is most useful.

Deep Lake is a multi-modal AI data system that merges the capabilities of Data Lakes and Vector Databases. Deep Lake is particularly beneficial for businesses looking to train or fine-tune LLMs on their own data. It efficiently streams data from remote storage to GPUs during model training, making it a powerful tool for deep learning applications.

Data loaders in Deep Lake are essential components that facilitate efficient data streaming and are very useful for training and fine-tuning LLMs. They are responsible for fetching, decompressing, and transforming data, and they can be optimized to improve performance in GPU-bottlenecked scenarios. Once we store our datasets in Deep Lake, we can easily create a PyTorch DataLoader or a TensorFlow Dataset from them.

Deep Lake offers two types of data loaders: the Open Source data loader and the Performant data loader. The Performant version, built on a C++ implementation, is faster and optimizes asynchronous data fetching and decompression. It's approximately 1.5 to 3 times faster than the OSS version, depending on the complexity of the transformation and the number of workers available for parallelization, and it supports distributed training.
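
As a rough sketch of how the two are typically created (assuming ds is an already-loaded Deep Lake dataset, and that the ds.pytorch and ds.dataloader entry points of Deep Lake’s Python API are available in your installation):

# Open Source data loader: pure Python, created directly from the dataset
oss_loader = ds.pytorch(batch_size=3, shuffle=True, num_workers=2)

# Performant data loader: C++-based, built through the dataloader() builder
fast_loader = ds.dataloader().batch(3).shuffle().pytorch()

We’ll use the Performant builder syntax later in this lesson; the ds.pytorch call above is included only for comparison.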

Creating a Deep Lake Dataset and Adding Data

Now, let's walk through an example of creating a Deep Lake dataset and fetching some data from it. Deep Lake supports a variety of data formats, and you can ingest them directly with a single line of code.

Deep Lake can be installed with pip as follows: pip install deeplake. Please note that the Performant version can be used for free with up to 200GB of data stored in the cloud, which is more than we’ll need for this course.

Then, create an account on the Activeloop website. Next, you’ll need an Activeloop API token, which allows your Python code to identify itself to your account. To get it, log in, click the “Create API token” button at the top of the page, and then create a token by clicking the “Create API token” button on the page that opens. Remember to check the token’s expiration date: once it expires, you’ll need to create a new one from the same page to keep using Deep Lake from Python with your account.

Once you have your Activeloop token, save it in the ACTIVELOOP_TOKEN environment variable. You can do so by adding it to your .env file, which is then loaded by executing the following Python code with the dotenv library.

from dotenv import load_dotenv

# load the variables defined in .env, including ACTIVELOOP_TOKEN
load_dotenv()
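
If you want to verify that the token is actually visible to your Python process, a quick optional check (a small sketch, not part of the setup itself) looks like this:

import os

# optional sanity check: the Deep Lake client reads this environment variable
assert os.getenv("ACTIVELOOP_TOKEN"), "ACTIVELOOP_TOKEN is not set"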

You are now ready to use Deep Lake! The following Python code shows how we can create a dataset using Deep Lake. Make sure to replace <YOUR_ACTIVELOOP_USERNAME> with your username on Activeloop. You can easily find it in the URL of your webpage, which should have the form https://app.activeloop.ai/<YOUR_ACTIVELOOP_USERNAME>/home.

import deeplake

# env variable ACTIVELOOP_TOKEN must be set with your API token

# create dataset on deeplake
username = "<YOUR_ACTIVELOOP_USERNAME>"
dataset_name = "test_dataset"
ds = deeplake.dataset(f"hub://{username}/{dataset_name}")

# create a tensor named "text" that will hold our text samples
ds.create_tensor("text", htype="text")

# add some texts to the dataset
texts = [f"text {i}" for i in range(1, 11)]
for text in texts:
    ds.append({"text": text})

In the previous code, we created a Deep Lake dataset named test_dataset, specified that it contains text, and then added 10 data samples to it, one by one. Visit the Deep Lake API docs to learn about the other available methods.

Once done, you should see printed text like the following.

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/genai360/test_dataset

By clicking on the URL contained in it, you’ll see your dataset directly from the Activeloop website.
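
If you prefer to inspect the dataset from code instead, a couple of quick checks (a minimal sketch using the ds object created above) look like this:

# number of samples in the dataset (should be 10 here)
print(len(ds))

# print an overview of the dataset's tensors and their shapes
ds.summary()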


Deep Lake dataset version control allows you to manage changes to datasets with commands very similar to Git. It provides critical insights into how your data is evolving, and it works with datasets of any size. Execute the following code to commit your changes to the dataset.

ds.commit("added texts")
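
Later, you can inspect the dataset’s history or return to a previous version using Git-like commands. As a small sketch relying on Deep Lake’s version-control methods:

# print the commit history of the dataset
ds.log()

# check out an earlier version by its commit id (placeholder id shown here)
# ds.checkout("<commit_id>")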

Retrieving Data From Deep Lake

Now, let’s get some data from our Deep Lake dataset.

There are two main syntaxes for getting data from Deep Lake datasets:

  1. The first one uses the Deep Lake data loader. It’s highly optimized and offers the fastest data streaming. However, it doesn’t support the custom sampling or fully random shuffling that is possible with PyTorch datasets and data loaders. If you’re interested in how to use the Deep Lake data loader in cases where data shuffling is important, read this guide.
  2. The second one uses plain PyTorch datasets and data loaders, enabling all the customizability that PyTorch supports. However, streaming from Deep Lake datasets with this approach is highly sub-optimal and may be 5X+ slower than using Deep Lake data loaders.

The Deep Lake Data Loader for PyTorch

Here’s a code example that creates a Deep Lake data loader for PyTorch, leveraging the Performant Deep Lake data loader. It’s the fastest and most optimized way of loading data in batches for model training.

# create PyTorch data loader
batch_size = 3
train_loader = ds.dataloader()\
    .batch(batch_size)\
    .shuffle()\
    .pytorch()

# loop over the elements
for i, batch in enumerate(train_loader):
    print(f"Batch {i}")
    samples = batch.get("text")
    for j, sample in enumerate(samples):
        print(f"Sample {j}: {sample}")
    print()

You should see the following printed output, showing the retrieved batches.

Please wait, filling up the shuffle buffer with samples.
Shuffle buffer filling is complete.

Batch 0
Sample 0: text 1
Sample 1: text 7
Sample 2: text 8

Batch 1
Sample 0: text 2
Sample 1: text 9
Sample 2: text 6

Batch 2
Sample 0: text 10
Sample 1: text 3
Sample 2: text 4

Batch 3
Sample 0: text 5

PyTorch Datasets and PyTorch Data Loaders using Deep Lake

This approach enables all the customizability supported by PyTorch, at the cost of significantly slower streaming compared to the Deep Lake data loaders. The slowdown comes from the fact that this approach does not take advantage of Deep Lake’s underlying data format, which Activeloop designed specifically for fast streaming.

First, we create a subclass of the PyTorch Dataset, which stores a reference to the Deep Lake dataset and implements the __len__ and __getitem__ methods.

from torch.utils.data import DataLoader, Dataset

class DeepLakePyTorchDataset(Dataset):
    def __init__(self, ds):
        self.ds = ds

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        texts = self.ds.text[idx].text().astype(str)
        return { "text": texts }

Inside the __getitem__ method, we retrieve the strings stored in the text tensor of the dataset at the position idx.

Then, we instantiate it using a reference to our Deep Lake dataset ds, transform it into a PyTorch DataLoader, and eventually loop over the elements just like we did with the Deep Lake dataloader example.

# create PyTorch dataset
ds_pt = DeepLakePyTorchDataset(ds)

# create PyTorch data loader from PyTorch dataset
dataloader_pytorch = DataLoader(ds_pt, batch_size=3, shuffle=True)

# loop over the elements
for i, batch in enumerate(dataloader_pytorch):
    print(f"Batch {i}")
    samples = batch.get("text")
    for j, sample in enumerate(samples):
        print(f"Sample {j}: {sample}")
    print()

You should see the following output, showing the retrieved batches.

Batch 0
Sample 0: text 8
Sample 1: text 3
Sample 2: text 1

Batch 1
Sample 0: text 4
Sample 1: text 5
Sample 2: text 9

Batch 2
Sample 0: text 7
Sample 1: text 2
Sample 2: text 6

Batch 3
Sample 0: text 10

Getting the Best High-Quality Data for your Models

Recent research, such as the “LIMA: Less Is More for Alignment” and “Textbooks Are All You Need” papers, suggests that data quality is very important for both training and fine-tuning LLMs. As a consequence, Deep Lake offers several additional features that help users investigate the quality of their datasets and, if needed, filter out samples.

Deep Lake provides the Tensor Query Language (TQL), an SQL-like language for querying datasets, available both in the Activeloop platform and through ds.query in the Python API. It allows data scientists to filter datasets and focus their work on the most relevant data.

The following code shows how we can filter our dataset using a TQL query and print all the samples in the resulting view.
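
A minimal sketch of that code (assuming the ds dataset created above, TQL’s contains function, and a Deep Lake data loader built on the resulting view) could look like this:

# filter the dataset with a TQL query: keep samples whose text contains "1"
ds_view = ds.query("select * where contains(text, '1')")

# create a data loader over the resulting view and print its batches
loader = ds_view.dataloader().batch(3).pytorch()
for i, batch in enumerate(loader):
    print(f"Batch {i}")
    for j, sample in enumerate(batch.get("text")):
        print(f"Sample {j}: {sample}")
    print()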



You should see output similar to the following, containing only the samples whose text includes the character 1.

Batch 0
Sample 0: text 1
Sample 1: text 10

Now, we can save our dataset view as follows.

ds_view.save_view(id="strings_with_1")

And we can read from it as follows.

ds = deeplake.dataset(f"hub://{username}/{dataset_name}/.queries/strings_with_1")

Another feature is samplers. Samplers assign a discrete distribution of weights to the dataset’s samples, which are then drawn according to that distribution. This can be useful for focusing training on higher-quality data.
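
As a rough, hypothetical sketch (assuming TQL’s sample by clause with the max_weight choice function, as described in Deep Lake’s documentation), up-weighting samples whose text contains “1” could look like this:

# hypothetical sketch: sample texts containing "1" five times more often
# (assumes TQL's "sample by" clause and the max_weight choice function)
sampled_view = ds.query(
    "select * sample by max_weight(contains(text, '1'): 5, True: 1)"
)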

Conclusion

In this lesson, we explored some of the capabilities of Deep Lake, a multi-modal AI data system that merges the functionalities of Data Lakes and Vector Databases.

We've learned how Deep Lake can efficiently stream data from remote storage to GPUs during model training, making it an ideal tool for training and fine-tuning Large Language Models. We've also covered the creation of a Deep Lake dataset, adding data to it, and retrieving data using both Deep Lake's data loaders and PyTorch's data loaders.

This will be useful as we continue exploring training and fine-tuning Large Language Models.