Model Pruning

Introduction

Deep learning has revolutionized various fields, from computer vision to natural language processing. However, one drawback of deep neural networks is their large size and computational demands. These resource-intensive models can significantly hinder deployment, especially in resource-constrained environments like mobile devices and embedded systems. This is where model pruning comes into play as a powerful technique for reducing the size of neural networks without compromising their performance. In this blog post, we will explore what model pruning is, why it's useful, and various methods to achieve it.

What is Model Pruning?

Model pruning reduces the size of a deep neural network by removing certain neurons, connections, or even entire layers. The goal is to create a smaller and more efficient model while preserving its accuracy to the greatest extent possible. This reduction in model size leads to benefits such as faster inference times, lower memory footprint, and improved energy efficiency, making it ideal for deployment in resource-limited scenarios.

Pruned models are smaller and require fewer computational resources during inference. This is crucial for applications like mobile apps, IoT devices, and edge computing, where computational resources are limited. Moreover, pruned models typically execute faster and are more energy-efficient, enabling real-time applications and improving user experience.

Different Types of Model Pruning

There are several techniques and methodologies for model pruning, each with its own advantages and trade-offs. Some of the commonly used methods include:

Magnitude-based Pruning (or Unstructured Pruning)

In this approach, model weights or activations with small magnitudes are pruned. The intuition is that small weights contribute less to the model's performance and can be safely removed.
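
To make this concrete, here is a minimal sketch of unstructured magnitude pruning using PyTorch's torch.nn.utils.prune utilities. The toy model and the 30% sparsity level are illustrative choices, not values taken from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small toy model; any network with Linear/Conv layers works the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 30% of weights with the smallest absolute value in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of the first layer: {sparsity:.0%}")
```

In practice, the pruned model is usually fine-tuned for a few epochs afterwards to recover any accuracy lost at higher sparsity levels.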

The paper titled "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures" presented this approach to optimize deep neural networks by pruning unimportant neurons. This technique, known as network trimming, is based on the observation that a significant number of neurons in a large network produce zero outputs, regardless of the inputs received. These zero activation neurons are considered redundant and are removed without impacting the overall accuracy of the network. The process involves iterative pruning and retraining of the network, with the weights before pruning used as initialization. The authors demonstrate through experiments on computer vision neural netowrks that this approach can achieve a high compression ratio of parameters without compromising, and sometimes even improving, the accuracy of the original network.


The paper "Learning Efficient Convolutional Networks through Network Slimming" presented variations of the pruning scheme for deep convolutional neural networks aimed at reducing the model size, decreasing the run-time memory footprint, and lowering the number of computing operations without compromising accuracy.

The paper “A Simple and Effective Pruning Approach for Large Language Models” introduces a pruning method called Wanda (Pruning by Weights and activations) for Large Language Models. Wanda scores each weight as the product of its magnitude and the norm of the corresponding input activations, and removes the lowest-scoring weights on a per-output basis. The method is motivated by the recent observation of emergent large-magnitude features in LLMs. The key advantage of Wanda is that it requires no retraining or weight updates, so the pruned LLM can be used directly.

Illustration of the proposed method Wanda (Pruning by Weights and activations), compared with the magnitude pruning approach. Given a weight matrix W and input feature activations X, Wanda computes the weight importance as the elementwise product between the weight magnitude and the norm of the input activations (|W| · ∥X∥₂). Weight importance scores are compared on a per-output basis (within each row of W), rather than globally across the entire matrix. Image from the Wanda paper.
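
A rough sketch of this scoring rule for a single linear layer might look as follows. The tensor shapes, the random calibration activations, and the 50% sparsity target are assumptions; the actual method applies this layer by layer across a real LLM using a small calibration set.

```python
import torch

def wanda_prune(weight: torch.Tensor, activations: torch.Tensor, sparsity: float = 0.5):
    """weight: (out_features, in_features); activations: (num_tokens, in_features)."""
    # Per-input-feature L2 norm across the calibration tokens.
    feature_norms = activations.norm(p=2, dim=0)            # (in_features,)
    importance = weight.abs() * feature_norms                # elementwise |W| * ||X||_2
    # Within each output row, zero out the weights with the lowest importance.
    num_prune = int(sparsity * weight.shape[1])
    prune_idx = importance.argsort(dim=1)[:, :num_prune]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask

pruned = wanda_prune(torch.randn(8, 16), torch.randn(32, 16), sparsity=0.5)
print(f"Achieved sparsity: {(pruned == 0).float().mean().item():.0%}")
```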

Structured Pruning

Structured pruning targets specific structures within the network, such as channels in convolutional layers or neurons in fully connected layers.
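
With PyTorch's built-in utilities, a minimal structured pruning sketch looks like this: remove half of a convolution's output filters, ranked by their L2 norm. The layer shape and the pruning ratio are illustrative choices.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# dim=0 prunes entire output channels (filters); n=2 ranks them by L2 norm.
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)
prune.remove(conv, "weight")   # bake the mask into the weight tensor

zero_filters = (conv.weight.flatten(1).abs().sum(dim=1) == 0).sum().item()
print(f"{zero_filters} of {conv.weight.shape[0]} filters pruned")
```

Note that the pruned filters are only zeroed out here; realizing the memory and speed benefits requires physically removing them (and the corresponding input channels of the next layer), which dedicated tooling handles automatically.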

The paper "Structured Pruning of Deep Convolutional Neural Networks" introduces a new method of network pruning that incorporates structured sparsity at different scales, including channel-wise, kernel-wise, and intra-kernel strided sparsity. This approach is beneficial for computational resource savings. The method uses a particle filtering approach to determine the significance of network connections and paths, assigning importance based on the misclassification rate associated with each connectivity pattern. After pruning, the network is re-trained to compensate for any losses.

The Lottery Ticket Hypothesis

The paper “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” presents an innovative perspective on neural network pruning, introducing the "Lottery Ticket Hypothesis". This hypothesis suggests that within dense, randomly-initialized, feed-forward networks, there exist smaller subnetworks ("winning tickets") that, when trained separately, can achieve test accuracy similar to the original network in a comparable number of iterations. These "winning tickets" are characterized by their initial weight configurations, which make them particularly effective for training.

The authors propose an algorithm to identify these "winning tickets" and present a series of experiments to support their hypothesis. They consistently find winning tickets that are 10-20% of the size of several fully-connected and convolutional feed-forward architectures trained on MNIST and CIFAR10. Interestingly, these subnetworks not only match the performance of the original network but often surpass it, learning faster and reaching higher test accuracy.
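
In simplified form, the iterative magnitude pruning loop used to search for winning tickets can be sketched as below. The model, the training routine, the number of rounds, and the per-round pruning fraction are placeholders; train_fn is assumed to keep masked weights at zero during training (for example, by re-applying the masks after each optimizer step).

```python
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train_fn, rounds: int = 5, prune_frac: float = 0.2):
    """Iteratively prune the smallest surviving weights, rewinding survivors to their init."""
    init_state = copy.deepcopy(model.state_dict())        # the original initialization
    masks = {n: torch.ones_like(p, dtype=torch.bool)
             for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)                            # train; masks keep pruned weights at zero
        with torch.no_grad():
            # Prune the prune_frac smallest-magnitude weights among each layer's survivors.
            for name, param in model.named_parameters():
                if name in masks:
                    surviving = param.abs()[masks[name]]
                    threshold = surviving.quantile(prune_frac)
                    masks[name] &= param.abs() > threshold
            # Rewind the surviving weights to their initial values: the "winning ticket".
            model.load_state_dict(init_state)
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])
    return model, masks

# Exercise the loop with a tiny model and a no-op "training" routine.
model = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))
_, masks = find_winning_ticket(model, train_fn=lambda m, masks: None, rounds=3)
print({name: f"{mask.float().mean().item():.0%} kept" for name, mask in masks.items()})
```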

Intel® Neural Compressor Library

The Intel® Neural Compressor Library provides ready-to-use implementations of many model pruning techniques. Read this page to learn more about the pruning methods it implements. Below are two pruning approaches designed specifically for LLMs.

The paper “A Fast Post-Training Pruning Framework for Transformers” presents a fast post-training pruning framework for Transformer models, designed to reduce the high inference cost associated with these models. Unlike previous pruning methods, it does not require retraining, which lowers both the training cost and the complexity of model deployment. Given a resource constraint and a sample dataset, the framework uses structured sparsity methods to automatically prune the Transformer model. To maintain high accuracy, the authors introduce three new techniques: a lightweight mask search algorithm, mask rearrangement, and mask tuning.

(a) Prior pruning frameworks require additional training on the entire training set and involve user intervention for hyperparameter tuning, which complicates the pruning process and takes a large amount of time (e.g., ∼30 hours). (b) The proposed framework does not require retraining: it outputs pruned Transformer models that satisfy the FLOPs/latency constraints in considerably less time (e.g., ∼3 minutes), without user intervention. Image from the paper.
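
As a deliberately simplified illustration of the general idea of choosing a structured mask that fits a compute budget, the toy snippet below greedily keeps the highest-scoring attention heads until an assumed FLOPs budget is exhausted. The scores, costs, and budget are placeholders; the paper's actual mask search, rearrangement, and tuning steps are considerably more sophisticated.

```python
import torch

head_scores = torch.tensor([0.9, 0.1, 0.4, 0.7, 0.05, 0.6])  # importance per head (assumed)
head_cost = 1.0                                                # FLOPs per head (normalized)
budget = 3.0                                                   # keep at most 3 heads' worth of FLOPs

mask = torch.zeros_like(head_scores, dtype=torch.bool)
spent = 0.0
for idx in head_scores.argsort(descending=True):
    if spent + head_cost <= budget:
        mask[idx] = True        # keep this head
        spent += head_cost

print("head mask:", mask.tolist())   # heads to keep; the rest are pruned
```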

The paper titled "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot" presents a pruning method, SparseGPT, that can reduce the size of large-scale generative pretrained transformer (GPT) models by at least 50% in a single step, without retraining and with minimal loss of accuracy. The authors demonstrate that SparseGPT can be applied to the very large models OPT-175B and BLOOM-176B, in less than 4.5 hours. The method can achieve 60% unstructured sparsity, meaning that over 100 billion weights can be disregarded during inference without a significant increase in perplexity.

Sparsity-vs-perplexity comparison of SparseGPT against magnitude pruning on OPT-175B, when pruning to different uniform per-layer sparsities. Image from the SparseGPT paper.

Conclusion

In conclusion, model pruning is a powerful technique for reducing the size of deep neural networks without significantly compromising their performance. It is a valuable tool for deploying models in resource-constrained environments, such as mobile devices and embedded systems. Various pruning methods exist, including magnitude-based pruning and structured pruning, each with its unique advantages and trade-offs. The Intel® Neural Compressor Library provides a practical implementation of these techniques, with specific methods designed for Large Language Models. By understanding and applying these pruning techniques, we can create smaller, faster, and more efficient models that maintain high accuracy, thereby improving the feasibility and user experience of deploying deep learning models in real-world applications.

For more information on Intel® Accelerator Engines, visit this resource page. Learn more here about Intel® Extension for Transformers, an innovative Transformer-based toolkit to accelerate GenAI/LLM everywhere.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries.