By Kaustav Tamuly, Published: 11/15/2018, Last Updated: 11/15/2018

Present-day neural networks tend to be deep, with millions of weights and activations. These large models are compute-intensive, which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems. In fact, whenever we interact with our phones and computers, we are sensitive to the latency of the interaction. We don’t like to wait for search results, or for applications or web pages to load, and we are especially sensitive in real-time interactions such as speech recognition. So, inference latency is often something we want to minimize.

This article is an introduction to compression and acceleration of deep neural networks for resource-efficient inference. Topics to be covered include:

- Pruning and sharing
- Low rank factorization
- Compact convolutional filters
- Knowledge distillation

For a detailed explanation of the methods described, please refer to the original papers listed in the references section at the end of this post.

Pruning is the act of removing superfluous or unwanted parts. In model compression, pruning removes parameters that are not crucial to model performance, taking advantage of the redundancy and sparsity of parameters that make little or no contribution. These techniques are further classified into the following types:

Quantization and Binarization: Quantization is the process of converting a continuous range of values into a finite range of discrete values. Let’s assume we have a grayscale image. N-level quantization gives us N grey levels of that image, while binarization gives us only two (grey or not grey).
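The grayscale analogy can be sketched in a few lines of NumPy. This is an illustrative helper (the name `quantize` and the [0, 1] pixel range are assumptions, not from the article): it snaps each value to the nearest of N evenly spaced levels, and binarization is simply the N = 2 case.

```python
import numpy as np

def quantize(values, n_levels):
    """Snap continuous values in [0, 1] to the nearest of n_levels discrete levels."""
    levels = np.linspace(0.0, 1.0, n_levels)
    idx = np.abs(values[..., None] - levels).argmin(axis=-1)
    return levels[idx]

pixels = np.array([0.05, 0.33, 0.48, 0.91])  # toy grayscale intensities
four_level = quantize(pixels, 4)  # 4-level quantization: four grey levels
binary = quantize(pixels, 2)      # binarization: only two levels (0 or 1)
```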

Network quantization compresses the original network by reducing the number of bits used to represent the weights. Quantization effectively constrains the number of distinct weight values we can use inside our kernels. With N bits, we can represent 2^N values, and our aim is to modify the weights inside each kernel to take only these 2^N values. Below about four bits, it becomes difficult to represent the weights without losing significant accuracy.

Figure 1. Weights and activations constrained to fixed precision during testing - source^{1}

Observe that quantization during training fortunately does not affect accuracy significantly, because the network can learn to compensate for the quantization error. Quantizing only at inference time, however, degrades accuracy sharply, since there is no room for learning.

Let’s assume we have already trained a 32-bit network and want to quantize its weights to four bits as a post-processing step to reduce its size. During the forward pass, all the kernel weights are quantized. But quantization is a step function, so it returns flat or zero gradients, which means the network isn’t learning. We can sidestep this issue during backpropagation by using the Straight-Through Estimator (STE), which passes the incoming gradients through unchanged and updates the underlying float values, while the forward pass continues to use the quantized values on top. The forward and backward passes are then repeated.
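A minimal PyTorch sketch of this idea: quantize in the forward pass, pass gradients straight through in the backward pass so the underlying float weights keep learning. The 4-bit min-max scheme and the class name `QuantizeSTE` are illustrative assumptions, not the exact recipe of any paper cited here.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Fake-quantize weights in forward; Straight-Through Estimator in backward."""

    @staticmethod
    def forward(ctx, w, n_bits=4):
        n_levels = 2 ** n_bits
        w_min, w_max = w.min(), w.max()
        scale = (w_max - w_min) / (n_levels - 1)
        # Round each weight to one of 2^N evenly spaced values.
        return torch.round((w - w_min) / scale) * scale + w_min

    @staticmethod
    def backward(ctx, grad_output):
        # STE: return the incoming gradient unchanged, so the float
        # weights underneath the quantized values still get updated.
        return grad_output, None

w = torch.randn(3, 3, requires_grad=True)
w_q = QuantizeSTE.apply(w, 4)  # forward pass sees only 2^4 = 16 distinct values
w_q.sum().backward()           # w.grad is nonzero thanks to the STE
```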

For excellent papers detailing 8-bit, 4-bit, 2-bit, and 1-bit (binary) networks, explore the references section at the end of this post to gain an in-depth understanding of quantization. For additional information, read Understanding Binary Neural Networks.

Figure 2. Binarization example - source^{2}

Pruning and Sharing: A well-known approach to network pruning and weight sharing starts by finding the standard deviation of each layer’s weights to understand the layer’s weight distribution. Once we know the standard deviation, we remove low-magnitude weights by thresholding. The applied threshold is obtained by multiplying the layer’s standard deviation by a pruning ratio. (Pruning ratios for different layers are derived from experiments that are beyond the scope of this post.) The pruned network is then retrained to compensate for the removed weights until an optimal accuracy is reached.
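The thresholding step can be sketched in NumPy. The helper name `prune_by_std` and the 0.5 pruning ratio below are illustrative choices, not values from the cited papers:

```python
import numpy as np

def prune_by_std(weights, pruning_ratio=0.5):
    """Zero out weights whose magnitude falls below ratio * std of the layer."""
    threshold = pruning_ratio * weights.std()
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
layer = rng.normal(0.0, 1.0, size=(64, 64))       # toy layer weights
pruned, mask = prune_by_std(layer, pruning_ratio=0.5)
sparsity = 1.0 - mask.mean()                      # fraction of weights removed
```

In a real pipeline the surviving weights would then be retrained with the mask held fixed, as the text describes.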

During the weight-sharing step, weights with small differences between them are replaced by representative values, also called centroids, which vary according to the layer’s weight distribution. These centroid values, obtained by clustering weights that lie close together, are then fine-tuned through retraining.
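A sketch of the sharing step, assuming a simple 1-D k-means-style clustering (the article only says the centroids come from a distance calculation, so the exact clustering method here is an assumption): every weight is replaced by the centroid of its cluster, so the layer ends up holding only `n_clusters` distinct values.

```python
import numpy as np

def share_weights(weights, n_clusters=8, n_iter=20):
    """Replace each weight with its nearest centroid (1-D k-means sketch)."""
    w = weights.ravel()
    # Initialize centroids evenly spaced over the layer's weight range.
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iter):
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):  # skip empty clusters
                centroids[k] = w[assign == k].mean()
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[assign].reshape(weights.shape), centroids

rng = np.random.default_rng(1)
layer = rng.normal(size=(32, 32))
shared, centroids = share_weights(layer, n_clusters=8)
# 'shared' now stores at most 8 distinct values, so only the cluster
# indices plus the 8 centroids need to be kept.
```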

Figure 3. Pruning

Figure 4. Weight Sharing - source^{1}

Low-Rank Matrix Factorization: The key idea behind low-rank matrix factorization (LRMF) is that latent structures exist in the data; by uncovering them, we can obtain a compressed representation of the data. LRMF factorizes the original matrix into lower rank matrices while preserving latent structures and addressing the issue of sparseness.

Convolution operations are typically the compute-intensive part of deep convolutional neural networks (CNNs). Thus, reducing the cost of convolution operations helps compress the network and increases overall speedup. LRMF works on the assumption that the 4-D convolution kernel tensors are highly redundant and can be decomposed to remove this redundancy. The fully connected layers can likewise be viewed as 2-D matrices and factorized.
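For the 2-D (fully connected) case, the factorization can be sketched with a truncated SVD, the example given in Figure 5. The helper name and the rank of 16 are illustrative; an m x n weight matrix becomes an (m x r) times (r x n) pair, shrinking both parameters and multiply-adds when r is small:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Factor an m x n weight matrix into (m x r)(r x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # m x r: left factors scaled by singular values
    B = Vt[:rank, :]             # r x n: right factors
    return A, B

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 128))
A, B = low_rank_factorize(W, rank=16)

params_before = W.size            # 256 * 128 = 32768
params_after = A.size + B.size    # 256 * 16 + 16 * 128 = 6144
```

The single dense layer `x @ W` is then replaced by two thinner layers `(x @ A) @ B`, at the cost of an approximation error that shrinks as the rank grows.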

LRMF methods have been around for quite some time, but they have serious limitations, since tensor decomposition is usually a computationally expensive task. Moreover, current state-of-the-art LRMF methods apply low-rank approximation layer by layer, so they cannot perform global parameter compression, which is important because different layers hold different information.

Figure 5. Low-rank matrix factorization (e.g., singular value decomposition)

Using compact convolutional filters can directly reduce the associated computational costs. The key purpose of this compression method is to replace over-parametric filters with compact filters to achieve overall speedup while maintaining comparable accuracy. Consider the following SqueezeNet example.

SqueezeNet achieved AlexNet-level performance with 50x fewer parameters and a model size of less than 0.5 MB. For perspective, that’s a 510x lower memory requirement than AlexNet. Its approach introduced “fire modules” as the building blocks of the CNN architecture. SqueezeNet’s design strategy can be broadly divided into three parts:

- Replace 3x3 filters with 1x1 filters.
- Decrease the number of input channels to 3x3 filters.
- Downsample late in the network so that convolution layers have large activation maps. Downsampling earlier in the network (achieved by stride > 1) would lose information, because the subsequent layers would have small activation maps. Delayed downsampling lets most network layers keep larger activation maps, which contribute to higher task accuracy, all else being equal.

Figure 6. Fire module diagram - source^{3}
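The fire module in Figure 6 can be sketched in PyTorch: a 1x1 "squeeze" convolution cuts the channel count before parallel 1x1 and 3x3 "expand" convolutions, whose outputs are concatenated. The channel counts below follow the first fire module of the SqueezeNet paper but are otherwise illustrative:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""

    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))  # few channels reach the 3x3 filters
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

fire = Fire(in_ch=96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
out = fire(torch.randn(1, 96, 55, 55))  # output has 64 + 64 = 128 channels
```

Because only 16 channels reach the 3x3 convolutions, the module uses far fewer parameters than a plain 96-in, 128-out 3x3 layer would.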

Here we are concerned only with how decomposing 3x3 filters and input channels into smaller convolutions leads to a more compact network that matches performance while being remarkably fast and memory-cheap.

Imagine condensing your ensemble of big Kaggle* or ImageNet models into much smaller models that are just as good at runtime. Recent work by Geoffrey Hinton and Jeff Dean of Google Research aims at solving this problem. A classroom analogy fits well: the teacher is the big, cumbersome model trained rigorously to beat benchmarks, while the student is the smaller network gaining knowledge from the teacher.

The paper Distilling the Knowledge in a Neural Network describes distillation, the transfer of knowledge from an ensemble of big networks into a much smaller network that learns directly from the cumbersome model’s outputs and is lighter to deploy. Why does this work well? One reason is that a student trained to make the same predictions as the cumbersome model has less to fear from overfitting, since the cumbersome model has already dealt with it. Another is that the student is trained on the “soft” probabilities of the cumbersome model rather than on one-hot-encoded “hard” targets. For example, consider an image of a car with a few buildings in the background.

The cumbersome model is trained using hard targets such as:

Car — 1; Human — 0; Buildings — 0; Trees — 0

Whereas the distillation model is trained on soft probabilities:

Car — approximately 0.96; Human — approximately 0.00001; Buildings — approximately 0.03; Trees — approximately 0.000002

The distillation model obviously has a lot more “information” to learn from than hard targets provide. But it might still fail to learn the spatial properties of the buildings, since a probability of only 0.03 doesn’t push the gradients very much.

The paper offers a solution to this problem: divide the logits entering the softmax (the layer outputting the probabilities) by a number termed the “temperature.” Suppose the logits going into the softmax were [2, -2, -14, -10]; after dividing them by a temperature of 5, the new values entering the softmax are [0.4, -0.4, -2.8, -2]. This softens the output probabilities of the cumbersome model so that it’s easier for the distillation model to learn that there is a building. After the softmax, the soft probabilities will look something like:

Car — approximately 0.67; Human — approximately 0.04; Buildings — approximately 0.24; Trees — approximately 0.05; (Assume)
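A quick NumPy sketch of temperature-scaled softmax on the logits from the text (the exact probabilities it produces will differ from the illustrative numbers above, but the softening effect is the same):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling: divide logits by T before exponentiating."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, -2.0, -14.0, -10.0]   # [car, human, buildings, trees] from the text
hard = softmax(logits)               # T = 1: nearly one-hot, buildings vanish
soft = softmax(logits, temperature=5.0)  # T = 5: buildings get visible probability
```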

However, these softened probabilities would also affect the distillation model at prediction time, so it’s important to restore the softmax inputs to their original scale during prediction by multiplying them by the temperature (equivalently, setting the temperature back to 1).

Note: The temperature hyperparameter can be tuned to the user’s needs. The original paper suggests a temperature of 5 for most cases. Decreasing the temperature pushes the softmax inputs back toward their original values (we won’t learn about the building), while increasing it adds more noise to our gradients, since the softmax outputs non-negligible probabilities for almost all the classes.
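In training, the soft targets are usually combined with the ordinary hard-label loss. A hedged PyTorch sketch of such a distillation loss (the blending weight `alpha` and the random logits are illustrative assumptions; the T^2 scaling keeps the soft-target gradients comparable in magnitude, as discussed in the Hinton et al. paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.7):
    """Blend soft-target KL loss (scaled by T^2) with the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 examples, 10 classes, with random stand-in logits.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()  # gradients flow only into the student
```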

To reduce the manual labor of implementing these algorithms, we can turn to the Neural Network Distiller from Intel® AI Lab. Distiller is a Python* package for neural network compression research. It provides a PyTorch* environment for fast prototyping and analysis of compression algorithms, such as sparsity-inducing methods and low-precision arithmetic, along with a framework for implementing state-of-the-art compression algorithms like pruning, quantization, and knowledge distillation, and a vast set of tools for evaluating compression and network performance. For more information, see the original work NervanaSystems distiller.

While there is no golden rule for selecting the best method, pruning and sharing usually give good compression rates without significantly impacting accuracy. Combined with low-rank factorization, these methods can make pre-trained models more compact. Methods such as channel pruning reduce the feature-map width and yield a thinner, shrunken model, while knowledge distillation can be made more efficient with more knowledge transfer between the teacher and student networks. Most papers on the topic favor either pruning and sharing or quantization, but not a combination of the two; while combining them might work in theory, it’s less known how it would perform in real-world practice. Ongoing research in this field will open up newer possibilities for transferring deep learning results to low-powered, latency-strict mobile devices.

To learn more about Intel® technologies for artificial intelligence, sign up for Intel® AI Developer Program and access essential learning materials, community, tools and technology to boost your AI development. Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developers.

Kaustav Tamuly is a student at the Birla Institute of Technology and Science (BITS), Goa and an Intel® Student Ambassador for AI. His primary interests include computer vision and reinforcement learning, with the goal of working on AGI and AI technologies for medical diagnosis.

- Han, Song, et al. *A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding* (2015)
- Lee, Dongsoo, and Byeongwook Kim. *Retraining-Based Iterative Weight Quantization for Deep Neural Networks.* CoRR abs/1805.11233 (2018)
- *SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size*
- *Distilling the Knowledge in a Neural Network*
- *Notes on Low-rank Matrix Factorization*
- *Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1*
- *Pruning Convolutional Neural Networks for Resource Efficient Inference*
- *Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations*
- *Universal Deep Neural Network Compression*
