Model Quantization

Introduction

As AI models, including large language models, grow more advanced, their increasing number of parameters leads to significant memory usage. This, in turn, increases the costs of hosting and deploying these tools.

In this lesson, we will learn about quantization, a process that can be employed to diminish the memory requirements of these models. We will explore the various types of quantization, such as scalar and product quantization. We will also learn how fine-tuning techniques like QLoRA use quantization.

Finally, we will examine how to apply these techniques to AI models on a CPU, using methods implemented in the Intel® Neural Compressor library.

Overview of Quantization

In deep learning, quantization is a technique that reduces the numerical precision of model parameters, such as the weights and biases. This reduction helps decrease the model’s memory footprint and computational requirements, enabling easier deployment on resource-constrained devices such as mobile phones, smartwatches, and other embedded systems.

Everyday Example

To understand the concept of quantization, consider an everyday scenario. Imagine two friends, Jay and John. Jay asks John, "What’s the time?" John can reply with the exact time, 10:58 p.m., or he can say it's around 11 p.m. In the latter response, John simplifies the time, making it less precise but easier to communicate and understand. This is a basic example of quantization, which is analogous to the process in deep learning, where the precision of model parameters is reduced to make the model more efficient, albeit at the cost of some accuracy.

Quantization in Machine Learning

In Machine Learning, different floating point data types can be used for model parameters, a characteristic also called precision. The precision of the data types affects the amount of memory required by the model. Defining the parameters in higher precision types, like Float32 or Float64, provides greater accuracy but requires more memory, while lower precision types, like Float16 or BFloat16, use less memory but may result in a loss of accuracy.

The main floating point data types differ in size: Float64 uses 8 bytes per parameter, Float32 uses 4 bytes, and Float16 and BFloat16 each use 2 bytes (BFloat16 keeps Float32's exponent range but with a shorter fraction).

We can estimate the memory required by an AI model from its number of parameters. For example, consider the Llama 2 70B model, which uses Float16 precision for its parameters. Each parameter requires two bytes. To calculate the memory required in gigabytes (GB), where 1 GB = 1024^3 bytes, the calculation is as follows:

(70,000,000,000 * 2) / 1024^3 ≈ 130.385 GB
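
The same estimate is easy to script. Below is a minimal sketch (the helper name is ours, and the figure only covers the weights themselves, ignoring activations and other inference-time memory):

def estimate_weight_memory_gb(num_parameters, bytes_per_parameter):
    # Memory needed to store the weights, in GB (1 GB = 1024^3 bytes)
    return num_parameters * bytes_per_parameter / 1024**3

print(estimate_weight_memory_gb(70_000_000_000, 2))    # Float16: ~130.385 GB
print(estimate_weight_memory_gb(70_000_000_000, 0.5))  # 4-bit: ~32.6 GB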

Now, let's explore the different basic quantization techniques.

Scalar Quantization

In scalar quantization, each dimension of the dataset is treated independently. The maximum and minimum values are calculated for each dimension across the dataset. The range between the maximum and minimum values in each dimension is then divided into equal-sized bins. Each value in the dataset is mapped to one of these bins, effectively quantizing the data.

For example, consider a dataset of 2000 vectors with 256 dimensions sampled from a Gaussian distribution. The goal is to perform scalar quantization on this dataset.

import numpy as np

dataset = np.random.normal(size=(2000, 256))

# Calculate and store minimum and maximum across each dimension
ranges = np.vstack((np.min(dataset, axis=0), np.max(dataset, axis=0)))

Now, calculate each dimension's start value and step size. The start value is the minimum value, and the step size is determined by the number of discrete bins in the integer type being used. This example uses 8-bit unsigned integers (uint8), providing 256 bins.

starts = ranges[0,:]
steps = (ranges[1,:] - ranges[0,:]) / 255

The quantized dataset is then calculated as follows:

scalar_quantized_dataset = np.uint8((dataset - starts) / steps)

The overall scalar quantization process can be encapsulated in a function:

def scalar_quantisation(dataset):
    # Calculate and store minimum and maximum across each dimension
    ranges = np.vstack((np.min(dataset, axis=0), np.max(dataset, axis=0)))
    starts = ranges[0,:]
    steps = (ranges[1,:] - starts) / 255
    return np.uint8((dataset - starts) / steps)
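
To get a feel for the accuracy cost, we can map the 8-bit codes back to floating point and compare them with the original values. This is a small sketch reusing the starts and steps computed above; because the codes are obtained by truncation, the reconstruction error in each dimension is bounded by that dimension's step size.

# Quantize the dataset, then reconstruct approximate float values from the codes
quantized = scalar_quantisation(dataset)
reconstructed = quantized * steps + starts

# The worst-case absolute error stays below the largest step size
print(np.max(np.abs(dataset - reconstructed)))
print(np.max(steps))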

Product Quantization

In scalar quantization, the distribution of the data in each dimension should ideally be taken into account to avoid losing information. Product quantization can preserve more information by dividing each vector into sub-vectors and quantizing each sub-vector independently.

For example, consider the following array:

array = [[ 8.2, 10.3, 290.1, 278.1, 310.3, 299.9, 308.7, 289.7, 300.1],
         [ 0.1,  7.3,   8.9,   9.7,   6.9,  9.55,   8.1,   8.5,  8.99]]

Quantizing this array to 4-bit integers (16 bins over a single shared range) using scalar quantization results in significant information loss; the second vector collapses entirely to zeros:

quantized_array = [[ 0  0 14 13 15 14 14 14 14]
                   [ 0  0  0  0  0  0  0  0  0]]

In contrast, product quantization involves the following steps:

  1. Divide each vector in the dataset into m disjoint sub-vectors.
  2. For each sub-vector, cluster the data into k centroids (using k-means, for example).
  3. Replace each sub-vector with the index of the nearest centroid in the corresponding codebook.

Let's proceed with the product quantization of the given array using m = 3 (each 9-dimensional vector is split into sub-vectors of length 3, giving three sub-vectors per vector) and k = 2 (the number of centroids):

from sklearn.cluster import KMeans
import numpy as np

# Given array
array = np.array([
    [8.2, 10.3, 290.1, 278.1, 310.3, 299.9, 308.7, 289.7, 300.1],
    [0.1, 7.3, 8.9, 9.7, 6.9, 9.55, 8.1, 8.5, 8.99]
])

# Number of subvectors and centroids
m, k = 3, 2

# Split each vector into disjoint sub-vectors of length m (three per vector here)
subvectors = array.reshape(-1, m)

# Fit a single k-means over all sub-vectors to build a shared codebook of k centroids
# (a simplification; standard PQ learns a separate codebook per sub-vector position)
kmeans = KMeans(n_clusters=k, random_state=0).fit(subvectors)

# Replace each sub-vector with the index of the nearest centroid
labels = kmeans.labels_

# Reshape labels to match the shape of the original array
quantized_array = labels.reshape(array.shape[0], -1)

# Output the quantized array
quantized_array
# Result
array([[0, 1, 1],
       [0, 0, 0]], dtype=int32)

By quantizing the vectors and storing only the indices of the centroids, the memory footprint is significantly reduced.

This method can help preserve more information than scalar quantization, especially when the distributions of different dimensions are diverse.

Product quantization can significantly reduce the memory footprint and speed up nearest neighbor search, but at the cost of accuracy. The tradeoff is controlled by the number of centroids and the number of sub-vectors: using more centroids improves accuracy but shrinks the memory savings, and vice versa.
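
As a rough, made-up illustration of the memory tradeoff (the numbers below are our own example, not from the lesson's dataset): storing 1 million 128-dimensional float32 vectors directly takes about 488 MB, while product quantization with m = 8 sub-vectors and k = 256 centroids per codebook stores one byte per sub-vector plus small codebooks.

n, d = 1_000_000, 128      # number of vectors and their dimensionality
m, k = 8, 256              # sub-vectors per vector, centroids per codebook (fits in uint8)

original_bytes = n * d * 4                        # float32 storage
codes_bytes = n * m                               # one uint8 centroid index per sub-vector
codebooks_bytes = m * k * (d // m) * 4            # m codebooks of k float32 centroids each

print(original_bytes / 1024**2)                   # ~488 MB
print((codes_bytes + codebooks_bytes) / 1024**2)  # ~7.8 MB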

Quantizing Large Models

We learned about two relatively basic quantization techniques that can be used with deep learning models. While these simple techniques can work well enough with models with few parameters, they usually lead to a drop in accuracy for larger models with billions of parameters.

Large models contain a greater amount of information in their parameters. With more neurons and layers, large models can represent more complex functions. They can capture deeper and more intricate relationships in the data, which smaller models might not be able to handle.

Thus, the quantization process, which reduces the precision of these parameters, can discard a significant amount of this information, resulting in a substantial drop in model accuracy and performance.

Optimizing the quantization process for large models is also more difficult due to the larger parameter space. Finding the optimal quantization strategy that minimizes the loss of accuracy while reducing the model size is a more complex task for larger models.

Popular Post-Training Quantization Methods for LLMs

Fortunately, more sophisticated quantization techniques have been released to address these problems, aiming to maintain the accuracy of large models while effectively reducing their size.

LLM.int8()

This research paper observes that activation outliers (activation values significantly larger than the rest) break the quantization of larger models, and it proposes keeping the corresponding dimensions in higher precision. By doing so, the performance of the model is not negatively affected.
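
The decomposition can be sketched with plain NumPy: activation columns whose magnitude exceeds a threshold are routed through a higher-precision path, while the rest go through an int8 path. This is only an illustrative sketch of the idea described in the paper, not the actual bitsandbytes implementation (which also quantizes the weights of the int8 path).

import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    # Mark activation columns (features) with large magnitudes as outliers
    outlier_cols = np.max(np.abs(X), axis=0) > threshold
    regular_cols = ~outlier_cols

    # Quantize the regular activations to int8 with per-column absmax scaling
    scale = np.max(np.abs(X[:, regular_cols]), axis=0) / 127
    scale = np.where(scale == 0, 1.0, scale)
    X_int8 = np.round(X[:, regular_cols] / scale).astype(np.int8)

    # int8 path (dequantized here for simplicity) plus a higher-precision path for outliers
    regular_out = (X_int8 * scale) @ W[regular_cols, :]
    outlier_out = X[:, outlier_cols] @ W[outlier_cols, :]
    return regular_out + outlier_out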

GPTQ

This technique allows for faster text generation. The quantization is done layer by layer, minimizing the mean squared error (MSE) between the outputs of the quantized and full-precision weights for a given input.

The algorithm uses a mixed int4-fp16 quantization scheme in which weights are quantized to int4 while activations remain in float16. During inference, the weights are de-quantized on the fly, and the actual compute is performed in float16. GPTQ relies on a calibration dataset: the weights are calibrated by running inference on a small set of samples during quantization.
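
The storage and compute side of the mixed int4-fp16 scheme can be pictured as follows: 4-bit codes are stored together with a scale (and minimum) per group of weights and are expanded back to float16 just before the matrix multiplication. The sketch below uses simple group-wise min/max quantization with made-up helper names; it illustrates the storage format only, not GPTQ's error-minimizing weight selection.

import numpy as np

def pack_int4_groupwise(w, group_size=128):
    # Assumes the number of weights is divisible by group_size
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - w_min) / 15
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale.astype(np.float16), w_min.astype(np.float16)

def dequantize_on_the_fly(codes, scale, w_min, shape):
    # Expand the 4-bit codes back to float16 right before the float16 matmul
    return (codes * scale + w_min).astype(np.float16).reshape(shape)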

AWQ

This method is grounded in the observation that not all weights contribute equally to a Large Language Model's performance. It identifies a small fraction (0.1%-1%) of 'important' or 'salient' weights whose quantization, if skipped, can substantially mitigate quantization loss.

Unlike traditional approaches that focus on the weight distribution, the AWQ method selects these salient weights based on the magnitude of their activations. By keeping only the 0.1%-1% of weight channels that correspond to the largest activations in FP16 format, the method significantly boosts the performance of quantized models.

The authors note that retaining certain weights in FP16 format can cause hardware inefficiency due to using mixed-precision data types. To address this, they propose a method where all weights, including the salient ones, are quantized to avoid mixed-precision data types. However, before the quantization process, the weights are scaled. This scaling step is crucial as it helps protect the outlier weight channels during quantization, ensuring that the important information they hold is not lost or significantly altered during the quantization process. This method aims to strike a balance, allowing the model to benefit from the quantization efficiency while preserving the essential information in the salient weights.
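
The selection criterion can be sketched in a few lines: rank the weight's input channels by the average magnitude of the activations flowing through them and mark the top 0.1%-1% as salient. The function and calibration batch below are our own illustrative names; the full AWQ method additionally searches for per-channel scaling factors.

import numpy as np

def select_salient_channels(activations, fraction=0.01):
    # activations: (num_calibration_tokens, in_features) collected on a calibration set
    channel_importance = np.mean(np.abs(activations), axis=0)
    num_salient = max(1, int(fraction * activations.shape[1]))
    # Indices of the channels with the largest average activation magnitude
    return np.argsort(channel_importance)[-num_salient:]

# Hypothetical calibration batch: 512 tokens, 4096 input features
calibration_acts = np.random.normal(size=(512, 4096))
salient = select_salient_channels(calibration_acts, fraction=0.01)  # 40 channels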

Using Quantized Models

Many open-source LLMs are available for download in a quantized format. As we learned in this lesson, these models will have reduced memory requirements.

You can browse the Models section on Hugging Face to find and use a quantized model. This platform hosts a wide variety of models. For instance, you can try the Mistral-7B-Instruct model, which has been quantized using the GPTQ method.
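
As a sketch of how such a model is typically loaded (assuming transformers with a GPTQ backend such as optimum and auto-gptq installed; the repository name below is one commonly shared community upload, given only as an example, so always check the model card of the quantized model you choose):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # example community GPTQ upload

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))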

Quantizing Your Own LLM

You can use the Intel® Neural Compressor Library to quantize your own Large Language Model. This library offers various techniques for model quantization, some of which have been discussed in this module.

To get started, follow the step-by-step guide provided in the repository. This guide will walk you through quantizing a model, ensuring you have all the necessary components and knowledge to proceed.

Before beginning the quantization process, ensure you have installed the neural-compressor library and lm-evaluation-harness. Then, from the root of the cloned neural-compressor repository, navigate to the example directory and install the required packages by running the following commands:

cd examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_weight_only
pip install -r requirements.txt

As an example, to quantize the opt-125m model with the GPTQ algorithm, use the following command:

python examples/pytorch/nlp/huggingface_models/language-modeling/quantization/ptq_weight_only/run-gptq-llm.py \
    --model_name_or_path facebook/opt-125m \
    --weight_only_algo GPTQ \
    --dataset NeelNanda/pile-10k \
    --wbits 4 \
    --group_size 128 \
    --pad_max_length 2048 \
    --use_max_length \
    --seed 0 \
    --gpu

This command will quantize the opt-125m model using the specified parameters.

How Quantization Is Used in QLoRA

We saw in a previous lesson how fine-tuning can be done with fewer resources using QLoRA, a popular variant of LoRA that makes fine-tuning large language models even more accessible.

In the course, we saw that QLoRA involves backpropagating gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters. To accomplish this, QLoRA employs a novel data type, the 4-bit NormalFloat (NF4), which is theoretically optimal for normally distributed weights.

This optimality stems from quantile quantization, a technique particularly suited for normally distributed values. It ensures that each quantization bin holds an equal number of values from the input tensor, minimizing quantization error and providing a more uniform data representation.
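
The "equal number of values per bin" idea is easy to demonstrate with NumPy: place the bin edges at equal-probability quantiles of the weights. This is a simplified illustration only, not the actual NF4 implementation in bitsandbytes.

import numpy as np

weights = np.random.normal(loc=0.0, scale=0.02, size=100_000)  # toy zero-centered weights

num_bins = 16  # 4 bits
edges = np.quantile(weights, np.linspace(0, 1, num_bins + 1))
codes = np.clip(np.searchsorted(edges, weights, side="right") - 1, 0, num_bins - 1)

# Each of the 16 bins receives roughly the same number of weights
print(np.bincount(codes, minlength=num_bins))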

Since pre-trained neural network weights typically exhibit a zero-centered normal distribution with a standard deviation (σ), QLoRA transforms all weights into a unified fixed distribution. This transformation is achieved by scaling σ to ensure the distribution aligns perfectly within the range of the NF4 data type, further enhancing the efficiency and accuracy of the quantization process.

In the authors' experiments, this fine-tuning technique shows no accuracy degradation and matches the performance of full BFloat16 fine-tuning.
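
In practice, the NF4 data type is exposed through the bitsandbytes integration in the transformers library. A typical way to load a base model in 4-bit for QLoRA-style fine-tuning looks like the sketch below (the model name is only an example, and a GPU with bitsandbytes installed is assumed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                  # example model; QLoRA targets much larger LLMs
    quantization_config=bnb_config,
    device_map="auto",
)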

Conclusion

In this lesson, we explored the concept of quantization, a technique that can reduce the memory requirements of large models and, in some cases, enhance the text generation speed for language models. We delved into some state-of-the-art quantization techniques suitable for models with billions of parameters, examining the unique contributions of each method.

We also learned how to quantize our own models using the Intel® Neural Compressor Library, which supports many popular quantization methods.

Lastly, we revisited QLoRA, understanding how it leverages quantization to make the fine-tuning of models more accessible to a broader audience.

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries.

Special thanks to Sahibpreet Singh for contributing to this lesson!