Introduction
In this lesson, we study the challenges of deploying Large Language Models (LLMs), with a focus on latency and memory. We explore optimization techniques such as quantization and sparsity and how they can be applied using the Hugging Face Optimum and Intel® Neural Compressor libraries, and we discuss the role of Intel® optimization technologies in running LLMs efficiently. This lesson will provide a deeper understanding of how to optimize LLMs for better performance and user experience.
Importance of Latency and Memory
Latency is the delay before a transfer of data begins following an instruction. It is a crucial factor in LLM applications. High latency in real-time or near-real-time applications can lead to a poor user experience. For instance, in a conversational AI application, a delay in response can disrupt the natural flow of conversation, leading to user dissatisfaction. Therefore, reducing latency is a critical aspect of LLM deployment.
Consider an average human reading speed of ~250 words per minute, which translates to ~312 tokens per minute, or about 5 tokens per second. Keeping pace with a reader therefore requires a latency of about 200ms per token. Usually, acceptable latency for near-real-time LLM applications is between 100ms and 200ms per token.
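The arithmetic behind this latency budget can be checked with a short script. This is only a sketch: the tokens-per-word ratio of 1.25 is an assumption implied by the figures above (312 tokens for 250 words), and the function name is made up for illustration.

```python
WORDS_PER_MINUTE = 250   # average reading speed quoted above
TOKENS_PER_WORD = 1.25   # assumption: implied by ~312 tokens per 250 words


def reading_speed_budget(words_per_minute, tokens_per_word):
    """Return (tokens per second, milliseconds per token) for a reader."""
    tokens_per_minute = words_per_minute * tokens_per_word
    tokens_per_second = tokens_per_minute / 60
    ms_per_token = 1000 / tokens_per_second
    return tokens_per_second, ms_per_token


tokens_per_second, ms_per_token = reading_speed_budget(WORDS_PER_MINUTE, TOKENS_PER_WORD)
print(f"{tokens_per_second:.2f} tokens/s, {ms_per_token:.0f} ms per token")
```

The exact result is slightly under 200ms per token, which is why the text rounds to 5 tokens per second and 200ms.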
Transformers can be computationally intensive and memory-demanding, due to their complex architecture and large size. However, several optimization techniques can be employed to enhance their efficiency without significantly compromising their performance.
Quantization
Quantization is a technique used for compressing neural network models, including Transformers, by lowering the precision of model parameters and/or activations. This method can significantly reduce memory usage. By using low-bit-precision arithmetic, it decreases model size, latency, and energy consumption.
However, it's important to strike a balance between performance gains through reduced precision and maintaining model accuracy. Techniques such as mixed-precision quantization, which assigns higher bit precision to more sensitive layers, can mitigate accuracy degradation.
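To make the trade-off concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor INT8 quantization, written in plain Python. The weight values and function names are hypothetical, and real libraries use more elaborate schemes (per-channel scales, calibration data, and so on).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.4, -0.61]  # hypothetical FP32 weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The round-trip error is bounded by half a quantization step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q_weights)
print("max round-trip error:", max_error)
print("bytes as fp32:", 4 * len(weights), "bytes as int8:", len(weights))
```

Storage drops by 4x compared with FP32, at the cost of a small rounding error per weight. Whether that error matters for accuracy depends on the model and the layer, which is the motivation for mixed-precision approaches.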
We’ll learn different quantization methods later in the course.
Sparsity
Sparsity, usually achieved by pruning, is another technique for reducing the computational cost of LLMs by eliminating redundant or less important weights and activations. This method can significantly decrease off-chip memory consumption, the corresponding memory traffic, energy consumption, and latency.
Pruning can be broadly divided into two types: weight pruning and activation pruning.
- Weight pruning can be further categorized into unstructured pruning and structured pruning. Unstructured pruning allows any sparsity pattern, and structured pruning imposes an additional constraint on the sparsity pattern. While structured pruning can provide benefits in terms of memory, energy consumption, and latency without additional hardware support, it is known to achieve a lower compression rate than unstructured pruning.
- On the other hand, activation pruning prunes redundant activations during inference, which can be especially effective for Transformer models. However, this requires support to dynamically detect and zero out unimportant activations at run-time.
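The difference between unstructured and structured weight pruning can be sketched on a toy matrix. The example below assumes magnitude-based criteria (smallest absolute value for individual weights, smallest L1 norm for whole rows); these are common choices used here for illustration, not the only ones, and the weights and function names are hypothetical.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude weights anywhere in the matrix.

    Ties at the threshold are all pruned, so the result can be slightly
    sparser than requested when magnitudes repeat.
    """
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def prune_structured_rows(matrix, n_rows):
    """Zero the n_rows whole rows with the smallest L1 norm."""
    order = sorted(range(len(matrix)), key=lambda i: sum(abs(w) for w in matrix[i]))
    drop = set(order[:n_rows])
    return [[0.0] * len(row) if i in drop else row[:] for i, row in enumerate(matrix)]


def sparsity_of(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for w in row if w == 0.0)
    return zeros / total


def kept_magnitude(matrix):
    """Sum of absolute values of the surviving weights."""
    return sum(abs(w) for row in matrix for w in row)


weights = [  # hypothetical 4x4 weight matrix
    [0.9, -0.1, 0.3, 0.05],
    [0.02, 0.7, -0.04, 0.6],
    [-0.2, 0.03, 0.01, 0.15],
    [0.5, -0.8, 0.25, -0.45],
]

unstructured = prune_unstructured(weights, 0.5)
structured = prune_structured_rows(weights, 2)

print("unstructured:", sparsity_of(unstructured), kept_magnitude(unstructured))
print("structured:  ", sparsity_of(structured), kept_magnitude(structured))
```

Both results are 50% sparse, but in this toy case row pruning has to discard a few large weights along with the small ones, while unstructured pruning keeps all of the largest weights. The structured result, on the other hand, has whole rows that can simply be skipped, which is what makes it friendly to ordinary hardware.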
We’ll study different pruning methods later in the course.
Utilizing Optimum and Intel® Neural Compressor Libraries
The Hugging Face Optimum and the Intel® Neural Compressor libraries provide a suite of tools helpful in optimizing models for inference, especially for Intel® architectures.
- The Hugging Face Optimum library serves as an interface between the Hugging Face Transformers and Diffusers libraries and the various tools provided by Intel®.
- The Intel® Neural Compressor is an open-source library that facilitates the application of popular compression techniques such as quantization, pruning, and knowledge distillation. It supports automatic accuracy-driven tuning strategies, enabling users to generate quantized models easily. This library allows users to apply static quantization, dynamic quantization, and quantization-aware training while maintaining predefined accuracy criteria. It also supports different weight pruning techniques, allowing for the creation of pruned models that meet a predefined sparsity target.
These libraries put the quantization and sparsity techniques described above into practice, which makes them valuable for optimizing the deployment of LLMs.
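The idea of accuracy-driven tuning can be illustrated with a toy loop in plain Python. This is not the Intel® Neural Compressor API; every function name, the tiny linear "model", and the data are hypothetical, and the sketch only shows the principle of trying more aggressive settings first and accepting the first one that stays within a predefined accuracy criterion.

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform quantization to `bits` bits, then back to float."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def accuracy(weights, inputs, labels):
    """Fraction of inputs where the sign of the dot product matches the label."""
    correct = 0
    for x, y in zip(inputs, labels):
        score = sum(w * xi for w, xi in zip(weights, x))
        predicted = 1 if score >= 0 else -1
        if predicted == y:
            correct += 1
    return correct / len(inputs)


def tune(weights, inputs, labels, candidate_bits, max_relative_loss):
    """Return the smallest bit width whose accuracy loss meets the criterion."""
    baseline = accuracy(weights, inputs, labels)
    for bits in sorted(candidate_bits):
        acc = accuracy(quantize_dequantize(weights, bits), inputs, labels)
        if baseline - acc <= max_relative_loss * baseline:
            return bits, acc, baseline
    return None, baseline, baseline  # no candidate met the criterion


# Hypothetical linear classifier and evaluation set.
weights = [0.9, -0.35, 0.1]
inputs = [[1, 1, 0], [1, 3, 0], [-1, 0, 2], [0, -1, 1], [0, 1, 1], [1, 2, 1]]
labels = [1, -1, -1, 1, -1, 1]

chosen_bits, chosen_acc, baseline_acc = tune(
    weights, inputs, labels, candidate_bits=[2, 4, 8], max_relative_loss=0.01
)
print("baseline accuracy:", baseline_acc)
print("chosen bit width:", chosen_bits, "accuracy:", chosen_acc)
```

In this toy setup the 2-bit candidate misclassifies some inputs and is rejected, so the loop settles on the next width that preserves accuracy. A real tuning strategy searches a much larger space (per-operator precision, calibration method, and so on), but the accept-or-reject logic against an accuracy criterion is the same idea.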
Intel® Optimization Technologies for LLMs
Intel® optimization technologies play a significant role in running LLMs efficiently on CPUs. The 4th Gen Intel® Xeon® Scalable processors are equipped with AI-infused acceleration known as Intel® Advanced Matrix Extensions (Intel® AMX). These processors have built-in BF16 and INT8 GEMM (general matrix-matrix multiplication) accelerators in every core, which significantly accelerate deep learning training and inference workloads.
The Intel® Xeon® CPU Max Series offers up to 128GB of high-bandwidth memory, which is particularly beneficial for LLMs, as these models are often memory-bandwidth bound.
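A rough back-of-the-envelope estimate shows why memory bandwidth matters. The sketch below assumes that generating one token at batch size 1 requires reading every model weight from memory once, and it ignores compute time, caches, and the key-value cache. The 7B parameter count and the 300 GB/s bandwidth are hypothetical values chosen only for illustration, not measurements of any specific processor.

```python
def min_ms_per_token(n_params, bytes_per_param, bandwidth_gb_per_s):
    """Lower bound on per-token latency if every weight is read once per token."""
    model_bytes = n_params * bytes_per_param
    seconds = model_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1000


N_PARAMS = 7e9             # hypothetical 7B-parameter model
BANDWIDTH_GB_PER_S = 300   # hypothetical sustained memory bandwidth

for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("INT8", 1)]:
    weight_gb = N_PARAMS * bytes_per_param / 1e9
    ms = min_ms_per_token(N_PARAMS, bytes_per_param, BANDWIDTH_GB_PER_S)
    print(f"{name}: {weight_gb:.0f} GB of weights, at least {ms:.1f} ms per token")
```

Under these assumptions, halving the bytes per parameter halves the lower bound on latency, and raising the memory bandwidth lowers it proportionally. This is why both lower-precision formats and high-bandwidth memory help memory-bound LLM inference.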
By (1) applying model optimizations like quantization and pruning and (2) leveraging the Intel® hardware acceleration technologies, it’s possible to achieve good latency for LLMs on CPUs as well. Take a look at this page to see the performance improvements (better throughput, with a smaller memory footprint) of several optimized models.
Conclusion
In this lesson, we have explored the challenges of deploying Large Language Models, with a particular focus on latency and memory.
We also discussed optimization techniques like quantization and sparsity, which can significantly reduce LLMs' computational cost and memory usage. We introduced the Hugging Face Optimum and Intel® Neural Compressor libraries, which provide practical tools for applying these techniques. Furthermore, we have highlighted the role of Intel® optimization technologies, such as the 4th Gen Intel® Xeon® Scalable processors and the Intel® Xeon® CPU Max Series, in efficiently running neural networks.
By understanding and applying these concepts, we can optimize the deployment of LLMs, achieving better performance and user experience.
For more information on Intel® Accelerator Engines, visit this resource page. Learn more about Intel® Extension for Transformers, an Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere here.
Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries.