Deploying an LLM on a Cloud CPU

Introduction

Training a language model can be costly, and the expenses of deploying it can quickly accumulate over time. Applying optimization techniques that make the inference process more efficient is crucial for minimizing hosting costs. In this lesson, we will discuss using the Intel® Neural Compressor library to implement quantization techniques. This approach aims to make models more cost-effective and faster when running on CPU instances (the library also supports AMD CPUs, ARM CPUs, and NVIDIA GPUs through ONNX Runtime, but with limited testing).

Various techniques can be employed to optimize a network. Pruning trims the parameter count by removing less important weights, while knowledge distillation transfers knowledge from a larger model to a smaller one. Lastly, quantization reduces weight precision from 32 bits to 8 bits, which significantly decreases the memory needed to load the model and generate responses, with minimal accuracy loss.
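As a rough back-of-envelope illustration (an estimate based only on the parameter count, not an exact measurement), the memory savings from 8-bit quantization can be computed directly:

# Rough memory estimate for a 1.3B-parameter model (illustrative numbers only)
num_params = 1.3e9
fp32_gb = num_params * 4 / 1e9  # 4 bytes per weight at 32-bit precision -> ~5.2 GB
int8_gb = num_params * 1 / 1e9  # 1 byte per weight after 8-bit quantization -> ~1.3 GB
print(f"FP32: ~{fp32_gb:.1f} GB, INT8: ~{int8_gb:.1f} GB")

Actual savings vary depending on which layers are quantized and on runtime overhead.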

Credit: Deci.ai

The primary focus of this lesson is the quantization technique. We will apply it to an LLM and demonstrate how to perform inference using the quantized model. Ultimately, we will execute several experiments to assess the resulting acceleration.

We'll begin by setting up the necessary libraries. Install the optimum-intel package directly from its GitHub repository.

pip install git+https://github.com/huggingface/optimum-intel.git@v1.11.0
pip install onnx==1.14.1 neural_compressor==2.2.1
The sample code.

Simple Quantization (using CLI)

You can use the optimum-cli command in the terminal to perform dynamic quantization, which is the recommended approach for transformer-based neural networks. You can either specify the path to a custom model or select a model from the Hugging Face Hub, designated with the --model parameter. The --output parameter determines the name of the resulting model. We are conducting our tests on Facebook's OPT model with 1.3 billion parameters.

optimum-cli inc quantize --model facebook/opt-1.3b --output opt1.3b-quantized
The sample code.

The script above will automatically load the model and handle the quantization process. It is worth noting that if the script fails to recognize your model, you can use the --task parameter; for language models, you might use --task text-generation. Check the source code for a complete list of supported tasks.
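For example, a sketch of the earlier command with the task made explicit (using the --task flag described above; the output directory name is arbitrary):

optimum-cli inc quantize --model facebook/opt-1.3b --task text-generation --output opt1.3b-quantized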

The approach above can be applied by executing a single command. However, for more complex applications that need greater control over the process, it might not provide the desired flexibility. In the following section, we will use the same Intel Neural Compressor package to perform a more targeted quantization process.

Flexible Quantization (using Code)

As previously discussed, the library also includes a constrained (accuracy-aware) quantization mode that allows you to specify a precise quantization target. This approach requires writing more code, but in exchange it provides more control over the process. For example, you can supply an evaluation function and request that the quantized model lose no more than 1% in accuracy. To begin, let's install the necessary packages, in addition to the Intel Neural Compressor from the previous section, so we can load the model and carry out the quantization process.

pip install transformers==4.34.0 evaluate==0.4.0 datasets==2.14.5
The sample code.

The mentioned packages will help with loading the large language model (transformers), defining an evaluation metric to measure how close we are to the target (evaluate), and importing a dataset for the evaluation process (datasets). Now, we can load the model's weights and its accompanying tokenizer.

model_name "aman-mehra/opt-1.3b-finetune-squad-ep-0.4-lr-2e-05-wd-0.01"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./opt-1.3b")
model = AutoModelForQuestionAnswering.from_pretrained(model_name, cache_dir="./opt-1.3b")
The sample code.

We are initializing a model specifically tailored for the question-answering task. Keep in mind that the model you choose must be fine-tuned for question answering before performing quantization; here, we chose a fine-tuned version of the OPT-1.3B model.

Choosing a task is required to define an objective for our quantization target function and to control the process. The task and evaluation metric can vary widely, from text generation measured with perplexity and summarization with ROUGE to translation with BLEU or classification with simple accuracy. The next step is to define the evaluation metric that assesses the model's accuracy and the benchmark dataset that complements it.

import evaluate
from datasets import load_dataset

task_evaluator = evaluate.evaluator("question-answering")

eval_dataset = load_dataset("squad", split="validation", cache_dir="./squad-ds")
eval_dataset = eval_dataset.select(range(64))  # Use a subset of the dataset
The sample code.

The .evaluator() method will load the essential functions required for evaluating the question-answering task. (Further information on various options is available in the Hugging Face documentation.) You can then employ the load_dataset function from the Hugging Face library to bring a dataset into memory. This function allows you to specify parameters such as the dataset name, which splits to download (train, test, or validation), and the location for storing the dataset. Using the mentioned variables to create the evaluation function is now possible.

from transformers import pipeline

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

def eval_fn(model):
    # Swap in the (possibly quantized) model and measure its F1 score on the evaluation set
    qa_pipeline.model = model
    metrics = task_evaluator.compute(model_or_pipeline=qa_pipeline, data=eval_dataset, metric="squad")
    return metrics["f1"]
The sample code.

We need to create a pipeline that ties the model to the tokenizer so we can measure the model's performance by calling the task evaluator's .compute() method with that pipeline and the evaluation dataset. The eval_fn function computes the F1 score and returns it as a percentage. The quantization process requires several configurations for guidance.

from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion

# Set the accepted accuracy loss to 1%
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.01)

# Set the maximum number of trials to 10
tuning_criterion = TuningCriterion(max_trials=10)

quantization_config = PostTrainingQuantConfig(
    approach="dynamic", accuracy_criterion=accuracy_criterion, tuning_criterion=tuning_criterion
)
The sample code.

The PostTrainingQuantConfig variable sets the required parameters for the quantization process. We employ the dynamic quantization approach while accepting at most a 1% loss in accuracy, controlled by the AccuracyCriterion class. The TuningCriterion class sets the maximum number of trials before the quantization process stops. Lastly, we define a quantizer object using the INCQuantizer class, which accepts both the model and the evaluation function, and we initiate the quantization process by calling its .quantize() method.

from optimum.intel import INCQuantizer

quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)

quantizer.quantize(quantization_config=quantization_config, save_directory="opt1.3b-quantized")
The sample code.

Please note that the code in this section cannot be executed on a Google Colab instance due to memory constraints. However, you can replace the model ("facebook/opt-1.3b") with a smaller model such as "distilbert-base-cased-distilled-squad".
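For instance, a minimal sketch of that swap for a memory-constrained environment (the cache directory name is an arbitrary choice):

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# A smaller question-answering model that fits in a Colab instance
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./distilbert-qa")
model = AutoModelForQuestionAnswering.from_pretrained(model_name, cache_dir="./distilbert-qa")

The rest of the quantization code stays the same.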

Inference

Now, the model is ready for inference. In this section, we will look at how to load the quantized model and present the outcomes of our benchmark tests, highlighting the impact of quantization on generation speed. Before running inference, it is essential to load the pre-trained tokenizer using the AutoTokenizer class. Since quantization does not alter the model's vocabulary, we use the same tokenizer as the base model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
The sample code.

To load the model, we use the INCModelForCausalLM class provided by the Optimum Intel package. It also offers a range of loaders tailored for various tasks, including INCModelForSequenceClassification for classification and INCModelForQuestionAnswering for question answering. The .from_pretrained() method should be given the path to the quantized model from the previous section.

from optimum.intel import INCModelForCausalLM

model = INCModelForCausalLM.from_pretrained("./opt1.3b-quantized")
The sample code.
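If you instead quantized the question-answering model from the previous section, a similar sketch would use the matching task-specific loader mentioned above (the path assumes the save_directory used earlier):

from optimum.intel import INCModelForQuestionAnswering

qa_model = INCModelForQuestionAnswering.from_pretrained("./opt1.3b-quantized")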

Finally, we can use the familiar .generate() method from the Transformers library to feed the prompt to the model and get the response.

inputs = tokenizer("<PROMPT>", return_tensors="pt")

generation_output = model.generate(**inputs,
                                   return_dict_in_generate=True,
                                   output_scores=True,
                                   min_length=512,
                                   max_length=512,
                                   num_beams=1,
                                   do_sample=True,
                                   repetition_penalty=1.5)
The sample code.

The last step is to convert the token IDs generated by the model back into words. This is the same decoding process we saw in previous lessons.

print( tokenizer.decode(generation_output.sequences[0]) )
The sample code.
What does life mean? Describe in great details.\nI have no idea. I don't know how to describe it. I don't know what I'm supposed to do with my life. I don't know what I want to do with my life...
The output.

The chosen OPT model is not an instruction-tuned model, so it simply tries to complete the sequence fed to it. As evident, it enters a repetitive loop, reiterating the same words, since we have instructed it to produce exactly 512 tokens. Even if the model wants to stop, it is unable to do so!

As mentioned, we force the model to produce 512 tokens by explicitly setting the minimum and maximum length parameters. The rationale is to maintain a uniform token count between the standard model and the quantized version, enabling a valid comparison of their generation times. We also experimented with the beam search decoding strategy (a sketch of that call follows), with timings summarized in the table below.
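A minimal sketch of the beam search variant, assuming the same prompt and length settings as above and disabling sampling for deterministic beam search (the exact settings used in the benchmark are not specified):

generation_output = model.generate(**inputs,
                                   min_length=512,
                                   max_length=512,
                                   num_beams=4,       # beam search with K=4
                                   do_sample=False,   # assumption: deterministic beam search
                                   repetition_penalty=1.5)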

Decoding Method        Vanilla (seconds)    Quantized (seconds)
Greedy                 58.09                26.847
Beam Search (K=4)      144.77               40.73

The most significant improvement comes from beam search with a batch size of 1, which yields roughly a 3.5x speedup in inference. All the experiments above were conducted on a server instance equipped with a 4th Gen Intel® Xeon® Scalable processor with 8 vCPUs (4 cores) and 64GB of memory. This instance is the entry tier of the Intel CPUs, featuring the fewest cores in the Intel® Scalable Processor family, which highlights the feasibility of performing inference on CPU instances to reduce cost and latency.

On the other hand, the highest tier of 4th Gen Intel Xeon Scalable processors available on the Google Cloud Platform offers 176 virtual CPUs with 88 cores. It's worth highlighting that the most powerful processor within the SPR family can have up to 112 physical cores. In a recent report comparing the performance of the Intel 4th Gen Xeon to the latest Intel® Xeon® Max, Intel highlights an impressive speed of 114ms for processing 2K tokens using the latest Max processors with the 13B-parameter LLaMA 2 model. Intel also leads the advancement and fine-tuning of the CPU backend of torch.compile, a prominent feature in PyTorch 2.0, and additionally provides the Intel® Extension for PyTorch to deliver advanced optimizations designed specifically for Intel CPUs before their integration into the official PyTorch distribution.

Llama 2 7B and 13B inference (BFloat16) performance on Intel® Xeon® Scalable Processors. (source: Intel Blog)

Deployment Frameworks

Deploying large language models into production is the final stage in harnessing their capabilities for a diverse array of applications. Creating an API is the most efficient and flexible approach among the various methods available. APIs allow developers to seamlessly integrate these models into their code, enabling real-time interactions with web or mobile applications. There are several ways to create such APIs, each with its advantages and trade-offs.

There are specialized libraries, such as vLLM and TorchServe, designed for serving models. They can load models from various sources and create endpoints for convenient access, and in most cases they also offer optimizations that speed up inference, batch incoming requests, and manage memory efficiently. On the other hand, there are general-purpose backend frameworks such as FastAPI that facilitate the creation of arbitrary endpoints. While not specifically designed for serving AI models, they can be integrated effortlessly into your development process to create any APIs you need.
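As a minimal FastAPI sketch serving the quantized model from earlier (the endpoint name, request schema, and file layout are assumptions for illustration):

# server.py -- minimal FastAPI sketch around the quantized model
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = INCModelForCausalLM.from_pretrained("./opt1.3b-quantized")

app = FastAPI()

class GenerationRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(request: GenerationRequest):
    # Tokenize the prompt, generate a continuation, and decode it back to text
    inputs = tokenizer(request.prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=request.max_new_tokens)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}

The server could then be started with, for example, uvicorn server:app --host 0.0.0.0 --port 8000.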

Regardless of the chosen method, a well-designed API ensures that large language models can be deployed robustly, enabling organizations to leverage their capabilities in chatbots, content generation, language translation, and many other applications.

Deploying a model on CPU using Compute Engine with GCP

Follow these steps to deploy a language model on Intel® CPUs using Compute Engine with Google Cloud Platform (GCP):

  1. Google Cloud Setup: Sign in to your Google Cloud account. If you don't have one, create it and set up a new project.
  2. Enable Compute Engine API: Navigate to APIs & Services > Library. Search for "Compute Engine API" and enable it.
  3. Create a Compute Engine instance: Go to the Compute Engine dashboard and click on “Create Instance”. Choose a CPU-based machine type; several GCP machine types feature Intel CPUs (see the sketch after this list for a command-line alternative).
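Alternatively, the instance can be created from the gcloud CLI. A minimal sketch, assuming an Intel-based C3 machine type and an Ubuntu image; the instance name, zone, and disk size are arbitrary choices:

gcloud compute instances create llm-server \
    --machine-type=c3-standard-8 \
    --zone=us-central1-a \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB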

Once the instance is up and running:

  1. Deploy the model: SSH into your instance, install the necessary libraries and dependencies, and copy your server code (FastAPI, vLLM, etc.) to the machine.
  2. Run the model: Once the setup is complete, run your language model. If it's a web-based model, start your server.

Remember, Google Cloud charges based on the resources used, so make sure to stop your instance when not in use.

A similar process can be done for AWS too using EC2. You can find AWS machine types here.

Conclusion

In this lesson, we explored the potential of harnessing 4th Generation Intel Xeon Scalable Processors for the inference process and the array of optimization techniques available that make it a practical choice. Our focus was on the quantization approach aimed at enhancing the speed of text generation while conserving resources. It is fairly straightforward to perform the optimization process courtesy of a series of libraries from Intel such as Intel Extension for PyTorch and Intel® Extension for Transformers.

The results demonstrate the advantages of applying this technique across various configurations. It is worth noting that there are additional techniques available to optimize the models further. The upcoming chapter will discuss advanced topics within language models, including aspects like multi-modality and emerging challenges.

>> Notebook.

For more information on Intel® Accelerator Engines, visit this resource page. Learn more about Intel® Extension for Transformers, an Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere here.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries.