Fine-Tune LLMs with AWS and TorchTune

Introduction

In the rapidly evolving field of artificial intelligence, fine-tuning pre-trained models has become an essential technique for achieving high performance on specific tasks. The LLaMA (Large Language Model Meta AI) series has gained prominence due to its versatility and effectiveness across a wide range of applications. LLaMA 3, the latest iteration, brings an improved architecture and stronger capabilities, making it an attractive choice for developers and researchers.

Fine-tuning a model involves adapting a pre-trained model to new data, enabling it to perform better on a specific task. This process requires careful consideration of the dataset, the computational resources, and the fine-tuning framework. In this article, we will explore how to fine-tune the LLaMA 3 model using Torchtune, a PyTorch-native library for fine-tuning LLMs, and a dataset hosted on Amazon Web Services (AWS).

Fine-Tuning LLaMA 3 Using Torchtune and AWS

Step 1: Setting Up AWS

Create an S3 Bucket:

  • Log in to your AWS account.
  • Navigate to the S3 service and create a new bucket.
  • Upload your dataset to the S3 bucket (a short upload sketch follows this list).
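If you prefer to script these steps, the sketch below creates a bucket and uploads a local RAFT-style JSONL file with boto3. The bucket name, region, and file name are placeholders, and it assumes your AWS credentials are already configured locally.

import boto3

# A minimal sketch, assuming AWS credentials are already configured locally.
# "my-raft-bucket" and "raft_file.jsonl" are placeholder names.
s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket (regions other than us-east-1 also require a
# CreateBucketConfiguration={"LocationConstraint": <region>} argument)
s3.create_bucket(Bucket="my-raft-bucket")

# Upload the raw dataset file into the bucket
s3.upload_file("raft_file.jsonl", "my-raft-bucket", "raft_file.jsonl")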

Step 2: Connecting Data From Your Cloud Using Deep Lake Managed Credentials

Connecting data from your cloud and managing credentials in Deep Lake offers several key benefits:

  • Access to Performance Features: Utilize the Deep Lake Compute Engine for enhanced performance.
  • Integration with Deep Lake App: Access datasets stored in your cloud through the Deep Lake App.
  • Simplified Access via Python API: Easily access Deep Lake datasets stored in your cloud using the Python API.
  • Credential Management: Avoid repeatedly specifying cloud access keys in your Python code.

For Deep Lake to access datasets or linked tensors stored in your cloud, it must authenticate against the respective cloud resources. This can be done using access keys or through role-based access. You can also refer to the official Deep Lake guide on managed credentials for more detailed instructions.


Default storage allows you to map the Deep Lake path hub://org_id/dataset_name to a cloud path of your choosing. This means that all datasets created using the Deep Lake path will be stored at the location you specify and can be accessed using API tokens and managed credentials from Deep Lake. By default, the storage is set to Activeloop Storage, but you can change this through the UI in the Activeloop platform.
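As a quick illustration of what this enables, the hypothetical snippet below creates a dataset through the hub:// path; with default storage pointed at your own bucket, the data physically lands in your cloud while remaining addressable through Deep Lake. The org ID, dataset name, and token are placeholders.

import deeplake

# With default storage set to your S3 bucket in the Activeloop UI, this dataset
# is stored in your bucket but addressed via the Deep Lake path.
ds = deeplake.empty("hub://my_org_id/my_dataset", token="YOUR_ACTIVELOOP_TOKEN")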

Step 3: Preparing the Dataset

If we do not set the Default Storage to our own cloud, we can still connect datasets in our cloud to the Deep Lake App using the Python API below. Once connected to Deep Lake, the dataset is given a Deep Lake path in the format hub://org_id/dataset_name, and can be accessed using API tokens and managed credentials from Deep Lake, without the need to repeatedly specify cloud credentials.

  1. Connect the Dataset in the Python API:
import deeplake

# Step 1: Create/load the dataset directly in the cloud using your org_id and
# Managed Credentials (creds_key) for accessing the data (See Managed Credentials above)
ds = deeplake.load('s3://my_bucket/dataset_name',
                    creds={'creds_key': 'managed_creds_key'}, org_id='my_org_id')

# Step 2a: Connect the dataset to Deep Lake, inheriting the dataset_name above
ds.connect()
## ->>> This produces a Deep Lake path for accessing the dataset such as:
## ---- 'hub://my_org_id/dataset_name'

## OR

# Step 2b: Specify your own path and dataset name for future access to the dataset.
# You can also specify different managed credentials, if desired
ds.connect(dest_path = 'hub://org_id/dataset_name', creds_key = 'my_creds_key')
  2. Upload the Dataset to S3:

Use the following Python script to upload our dataset to an S3 bucket. This script uses Deep Lake's API to handle the upload process.

import deeplake
import pandas as pd
from tqdm import tqdm


def upload_raft_dataset_to_activeloop(data_to_upload: pd.DataFrame, ds: deeplake.Dataset):
    # create the dataset tensors (columns)
    with ds:
        ds.create_tensor("id", htype="text", exist_ok=True)
        ds.create_tensor("type", htype="text", exist_ok=True)
        ds.create_tensor("question", htype="text", exist_ok=True)
        ds.create_tensor("oracle_context", htype="text", exist_ok=True)
        ds.create_tensor("cot_answer", htype="text", exist_ok=True)
        ds.create_tensor("instruction", htype="text", exist_ok=True)

        for num_el in tqdm(range(len(data_to_upload))):

            ds.append(
                {
                    "id": data_to_upload["id"][num_el],
                    "type": data_to_upload["type"][num_el],
                    "question": data_to_upload["question"][num_el],
                    "oracle_context": data_to_upload["oracle_context"][num_el],
                    "cot_answer": data_to_upload["cot_answer"][num_el],
                    "instruction": data_to_upload["instruction"][num_el],
                }
            )

In this example, we upload a dataset while specifying which tensors (columns) it should have. The dataset being uploaded is in RAFT (Retrieval Augmented Fine-Tuning) format, used to train an LLM as described in the RAFT paper.
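For reference, a single RAFT-style record carries the same fields as the tensors created above; a purely illustrative example (the values are invented for this sketch) might look like this:

# A hypothetical RAFT-style record; field names mirror the tensors created above,
# while the values are illustrative only.
raft_record = {
    "id": "seed_task_0",
    "type": "general",
    "question": "Which storage backends can the dataset live in?",
    "oracle_context": "The dataset can be stored in S3 or other object stores...",
    "cot_answer": "Based on the context, the dataset can be stored in S3...",
    "instruction": "<DOCUMENT>...</DOCUMENT>\nWhich storage backends can the dataset live in?",
}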

bucketname = "YOUR_BUCKET_NAME"
dataset_name = "YOUR_DATASET_NAME"
s3_dataset_path = f's3://{bucketname}/{dataset_name}'


# use one of these three commands to create the dataset
ds = deeplake.empty(s3_dataset_path, creds = {"aws_access_key_id": ..., ...}) # Create dataset stored in your cloud using your own credentials.
ds = deeplake.empty(s3_dataset_path, creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id") # Create dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.empty(s3_dataset_path, overwrite=True) # Overwrite the dataset currently at the path

raft_raw_file = "raft_file.jsonl"

jsonObj = pd.read_json(path_or_buf=raft_raw_file, lines=True)

upload_raft_dataset_to_activeloop(jsonObj, ds)
  3. Load the Dataset:

The deeplake.load function is a versatile tool for loading datasets from various storage locations. By specifying appropriate parameters and credentials, we can access our data whether it’s managed by Deep Lake or stored in our own cloud infrastructure.

ds = deeplake.load("s3://mybucket/my_dataset", creds = {"aws_access_key_id": ..., ...}) # Load dataset stored in your cloud using your own credentials.
# OR
ds = deeplake.load("s3://mybucket/my_dataset", creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id") # Load dataset stored in your cloud using Deep Lake managed credentials

You can find detailed instructions in the official Deep Lake documentation.
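Once the dataset is loaded, a quick sanity check can confirm everything is in place. This is a sketch; the tensor names assume the RAFT schema created earlier, and it mirrors the .text() accessor used later in this guide.

# Basic checks on the loaded dataset
print(len(ds))                  # number of samples
print(list(ds.tensors.keys()))  # e.g. ['cot_answer', 'id', 'instruction', ...]
print(ds.question[0].text())    # inspect the first question as a string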

Step 4: Loading Data from Deep Lake with PyTorch

Before diving into fine-tuning, we need a PyTorch Dataset class that loads data from Activeloop's Deep Lake platform. This class facilitates efficient data loading, which is crucial for the fine-tuning process:

import logging

import deeplake
from torch.utils.data import Dataset

log = logging.getLogger(__name__)  # any logger works here


class DeepLakeDataloader(Dataset):
    """A PyTorch Dataset class for loading data from ActiveLoop's DeepLake platform.

    This class serves as a data loader for working with datasets stored in ActiveLoop's DeepLake platform.
    It takes a DeepLake dataset object as input and provides functionality to load data from it
    using PyTorch's DataLoader interface.

    Args:
        ds (deeplake.Dataset): The dataset object obtained from ActiveLoop's DeepLake platform.
    """

    def __init__(self, ds: deeplake.Dataset):
        self.ds = ds

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        column_map = self.ds.tensors.keys()

        values_dataset = {}
        for el in column_map:  # {"column_name" : value}
            values_dataset[el] = self.ds[el][idx].text().astype(str)

        return values_dataset

def load_deep_lake_dataset(
    deep_lake_dataset: str, **config_kwargs
) -> DeepLakeDataloader:
    """
    Load a dataset from ActiveLoop's DeepLake platform.

    Args:
        deep_lake_dataset (str): The name of the dataset to load from DeepLake.
        **config_kwargs: Additional keyword arguments passed to `deeplake.dataset`.

    Returns:
        DeepLakeDataloader: A data loader for the loaded dataset.
    """
    ds = deeplake.dataset(deep_lake_dataset, **config_kwargs)
    log.info(f"Dataset loaded from deeplake: {ds}")
    return DeepLakeDataloader(ds)

The DeepLakeDataloader class is specifically designed to interact with the DeepLake dataset object. It implements the __len__ and __getitem__ methods to comply with PyTorch's Dataset class, making it easy to integrate with PyTorch's DataLoader for batch processing during training. The load_deep_lake_dataset function simplifies the process of loading a dataset from DeepLake, ensuring that all necessary configurations and credentials are handled.
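As a sketch of how this plugs into training, the snippet below wraps the Deep Lake-backed dataset in a standard PyTorch DataLoader. The hub:// path and batch size are placeholders.

from torch.utils.data import DataLoader

# Hypothetical usage of the loader defined above; path and batch size are placeholders.
dataset = load_deep_lake_dataset("hub://my_org_id/dataset_name", read_only=True)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch in loader:
    # With the default collate function, each batch is a dict mapping tensor
    # names to lists of strings, e.g. batch["question"]
    print(batch["question"])
    break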

Step 5: Fine-Tuning the Model with TorchTune

To train the model with the RAFT technique and the Deep Lake data loader, we need to clone the repository and install its requirements:

!git clone -b feature/raft-fine-tuning https://github.com/efenocchi/torchtune.git
%cd torchtune
!pip install -e .

Now we can specify our dataset by replacing the one already defined in the torchtune/datasets/_raft.py file. If you want to fine-tune on a different dataset, change the dataset path in torchtune/datasets/_raft.py.

To continue, all we have to do is download the Llama 3 weights directly from Hugging Face. Please note that before you can access these files, you must accept Meta's license terms on the model's Hugging Face page.

import os

llama3_original_checkpoints_folder = "llama3"
os.makedirs(llama3_original_checkpoints_folder,exist_ok = True)

lora_finetune_output_checkpoints_folder = "lora_finetune_output"
os.makedirs(lora_finetune_output_checkpoints_folder,exist_ok = True)

!tune download meta-llama/Meta-Llama-3-8B --output-dir llama3 --hf-token <YOUR_HF_TOKEN>

In the torchtune/recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml file, replace /tmp/Meta-Llama-3-8B/original with llama3/original and /tmp/Meta-Llama-3-8B/ with lora_finetune_output, so that the config we execute during the training phase points to the correct folders.
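If you prefer to script this edit instead of doing it by hand, a small helper like the one below (a sketch that simply performs the same two text replacements) can rewrite the config in place when run from inside the torchtune folder:

from pathlib import Path

# Rewrite the checkpoint paths in the training config (run from inside torchtune/).
cfg = Path("recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml")
text = cfg.read_text()
text = text.replace("/tmp/Meta-Llama-3-8B/original", "llama3/original")  # checkpoint dir
text = text.replace("/tmp/Meta-Llama-3-8B/", "lora_finetune_output")     # output dir
cfg.write_text(text)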

Now let's make sure we have the necessary resources for the training phase (an A100 GPU was used in this guide) and proceed with the training command:

!tune run lora_finetune_single_device --config recipes/configs/llama3/8B_lora_single_device_deep_lake_raft.yaml

Since Torchtune works well for training but is less convenient for inference and evaluation, we decided to convert the PyTorch weights to the standard Hugging Face format and upload them to our Hugging Face space.

Make sure you are in the project root folder (not inside torchtune) and install the following packages:

%cd ..
# install transformers from source (used here for the Llama 3 conversion script)
!pip install git+https://github.com/huggingface/transformers
# clone the repo as well, so the conversion script can be run directly from source
!git clone https://github.com/huggingface/transformers
!pip install tiktoken blobfile
!pip install accelerate transformers

Copy the tokenizer, the fine-tuned Llama 3 model checkpoint, and the params.json file into a weights folder:

weights_folder = "weights"
os.makedirs(weights_folder,exist_ok = True)
!cp llama3/meta_model_0.pt weights/consolidated.00.pth
!cp llama3/original/params.json weights
!cp llama3/original/tokenizer.model weights

Now we can convert the weights to the standard format used by Hugging Face and upload them to our space:

!python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir torchtune/weights \
--model_size 8B \
--output_dir hf_weights \
--llama_version 3

Step 6: Saving and Deploying the Model

We upload the weights to our Hugging Face space, choosing a suitable repository name:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("hf_weights")
model = AutoModelForCausalLM.from_pretrained("hf_weights")

hf_repository_name = "llama3_RAFT"

tokenizer.push_to_hub(hf_repository_name)
model.push_to_hub(hf_repository_name)
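To verify the deployment, the fine-tuned model can be pulled back from the Hub and queried. In this sketch, "your-username/llama3_RAFT" is a placeholder for the repository created above, and the prompt is arbitrary.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/llama3_RAFT"  # placeholder for your Hub repository

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Question: What does the oracle_context field contain?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))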

Conclusion

This guide has shown how to adapt pre-trained models to meet our specific needs, enhancing their performance and utility in various applications. The process involves setting up AWS infrastructure, managing credentials with Deep Lake, uploading and connecting datasets, and using Torchtune to fine-tune the model with custom configurations.

By following these steps, we can achieve a tailored AI solution capable of performing specialized tasks with high efficiency. Fine-tuning allows us to utilize the full potential of the LLaMA 3 model, ensuring it is optimized for our unique requirements. This not only showcases the flexibility and power of modern AI frameworks but also highlights the importance of integrating cloud resources and advanced tuning techniques in AI development.