Published: 11/13/2019 Last Updated: 11/13/2019

Recently, researchers have demonstrated that deep learning inference can be performed with lower numerical precision, using 8-bit multipliers for inference with minimal to no loss in accuracy. There are two main benefits of lower numerical precision. First, many operations are memory bandwidth bound, and reducing precision would allow for better usage of cache and reduction of bandwidth bottlenecks. Second, the hardware may enable higher operations per second (OPS) at lower numerical precision, as these multipliers require less silicon area and power.

In this article, we describe the INT8 data type acceleration using Intel® Deep Learning Boost (Intel® DL Boost), available in 2nd generation Intel® Xeon® Scalable processors, the only microprocessor with built-in AI inference acceleration. We describe how to quantize the model weights and activations and describe the lower numerical precision functions available in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). Finally, we describe how deep learning frameworks take advantage of these lower numerical precision functions, and reduce the conversion overhead between different numerical precisions.

The Intel Xeon Scalable processor includes the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set, which contains the 512-bit wide Fused Multiply Add (FMA) core instructions. These instructions enable lower numerical precision multiplications with higher precision accumulates. Multiplying two 8-bit values and accumulating the result to 32 bits requires three instructions, and requires one of the 8-bit vectors to be in **unsigned int8 (u8)** format, the other in **signed int8 (s8)** format, with the accumulation in **signed int32 (s32)** format. This allows for 4x more input values per instruction at the cost of 3x more instructions, that is, 4/3 of the *fp32* throughput, or 33.33% more compute capability (Figure 1).

**Figure 1**. The Intel Xeon Scalable processor enables 8-bit multiplies with 32-bit accumulates with three instructions: VPMADDUBSW *u8×s8→s16* multiplies, VPMADDWD broadcast1 *s16→s32*, and VPADDD *s32→s32* adds the result to the accumulator.

A potential issue is the undefined behavior on overflows that may occur when using the VPMADDUBSW instruction *u8×s8→s16* (see Figure 1). This is a problem when both *u8* and *s8* values are near their maximum values^{1}. This can be mitigated by reducing the precision of the inputs by 1 bit. Another technique, used at Facebook, is to break a matrix multiplication into two matrix multiplications: one with small values (to prevent overflow) using 8-bit multiplies and 16-bit accumulates, and another one with sparse large values at full precision.
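The overflow condition can be checked with plain integer arithmetic. The following is a minimal sketch, not Intel's implementation: the helper names are hypothetical, and it assumes that "reducing the precision by 1 bit" means restricting the *u8* operand to the 7-bit range [0, 127]. It computes only the exact sum that one *s16* lane would have to hold and does not model what the hardware does on overflow.

```python
S16_MIN, S16_MAX = -2**15, 2**15 - 1

def pair_sum_u8s8(u8_pair, s8_pair):
    """Exact value of u8*s8 + u8*s8 that a single s16 lane is asked to hold."""
    return sum(u * s for u, s in zip(u8_pair, s8_pair))

def fits_s16(value):
    """True when the value is representable as a signed 16-bit integer."""
    return S16_MIN <= value <= S16_MAX

# Worst case with full-range inputs: both products are at their maximum.
worst_full = pair_sum_u8s8((255, 255), (127, 127))

# Assumption: the u8 operand is limited to 7 bits, i.e. [0, 127].
worst_7bit_pos = pair_sum_u8s8((127, 127), (127, 127))
worst_7bit_neg = pair_sum_u8s8((127, 127), (-128, -128))

print(worst_full, fits_s16(worst_full))
print(worst_7bit_pos, fits_s16(worst_7bit_pos))
print(worst_7bit_neg, fits_s16(worst_7bit_neg))
```

With full-range inputs the pairwise sum exceeds the *s16* range, while the 7-bit variant stays inside it for both signs.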

The 2nd generation Intel Xeon Scalable processor includes Intel DL Boost, which contains the Vector Neural Network Instruction (VNNI) intrinsic **AVX-512_VNNI**. For more information, see **Table 1-1** in the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference.

It includes an FMA instruction for 8-bit multiplies with 32-bit accumulates *u8×s8→s32*, as shown in Figure 2. Accumulating to *s32* eliminates the risk of overflow. The theoretical peak compute gains are 4x *int8* OPS over *fp32* OPS. Practically, the gains may be lower due to memory bandwidth bottlenecks.

**Figure 2**. The Intel DL Boost AVX512_VNNI VPDPBUSD instruction enables 8-bit multiplies with 32-bit accumulates with one instruction *u8×s8→s32*. This allows for 4x more compute capability.
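To make the lane structure concrete, here is a small emulation of what one VPDPBUSD operation computes on a 512-bit register: 16 lanes, each adding four *u8×s8* products to an *s32* accumulator. This is an illustrative sketch in plain Python with hypothetical function names; it does not model 32-bit wraparound and is not how the instruction would be invoked in practice.

```python
def vpdpbusd_lane(acc_s32, a_u8, w_s8):
    """One 32-bit lane: accumulator plus the sum of four u8*s8 products."""
    assert len(a_u8) == len(w_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= w <= 127 for w in w_s8)
    return acc_s32 + sum(a * w for a, w in zip(a_u8, w_s8))

def vpdpbusd(acc, a_bytes, w_bytes):
    """16 lanes of a 512-bit register: 64 u8 values times 64 s8 values."""
    return [
        vpdpbusd_lane(acc[i], a_bytes[4 * i:4 * i + 4], w_bytes[4 * i:4 * i + 4])
        for i in range(16)
    ]

# Largest positive case: every lane accumulates 4 * 255 * 127,
# which is far below the s32 maximum of 2**31 - 1.
out = vpdpbusd([0] * 16, [255] * 64, [127] * 64)
print(out[0], out[0] < 2**31 - 1)
```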

The Intel MKL-DNN library contains popular deep learning functions or primitives used across various models, along with functions necessary to manipulate the layout of tensors or high dimensional arrays optimized for Intel® processors. Intel MKL-DNN implements the 8-bit convolution operations with the activation (or input) values in *u8* format, the weights in *s8* format, and the biases in *s32* format (biases can be kept in *fp32* as well, as they take a very small percentage of the overall compute; see Figure 3).

**Figure 3**. Process of inference operations with 8-bit multipliers accumulated to *s32*.

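Intel MKL-DNN quantizes the values for a given tensor or for each channel in a tensor (the choice is up to the framework developer). It assumes non-negative activations, that is, activations taken after the rectified linear unit (ReLU) activation function. The quantization is defined as follows:

$$R_x = \max(\lvert x \rvert), \qquad Q_a = \frac{255}{R_a}, \qquad Q_w = \frac{127}{R_w}$$

where $x$ is a tensor corresponding to either the weights *w*, or the activations or model inputs *a*; $Q_a$ is the quantization factor for activations with non-negative values; and $Q_w$ is the quantization factor for the weights. The quantized activation, weights, and bias are:

$$a_{u8} = \lVert Q_a a_{f32} \rVert \in [0, 255]$$

$$W_{s8} = \lVert Q_w W_{f32} \rVert \in [-127, 127]$$

$$b_{s32} = \lVert Q_a Q_w b_{f32} \rVert \in [-2^{31}, 2^{31}-1]$$

where the function $\lVert \cdot \rVert$ rounds to the nearest integer. Note that while the *s8* format supports *-128*, the smallest quantized *s8* weight value used is *-127*.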
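As a worked example of the quantization step, the sketch below assumes the quantization factors are 255 divided by the maximum absolute activation and 127 divided by the maximum absolute weight, which matches the *u8* range [0, 255] and the *s8* weight range [-127, 127] described above. The function name and the sample values are hypothetical, and Python's `round` (round half to even) stands in for "round to nearest".

```python
def quantize(values, q, lo, hi):
    """Scale by q, round to the nearest integer, and saturate to [lo, hi]."""
    return [max(lo, min(hi, round(q * v))) for v in values]

a_f32 = [0.0, 0.5, 1.25, 2.0]      # non-negative (post-ReLU) activations
w_f32 = [-0.5, 0.125, 0.0, 0.375]  # weights of either sign

q_a = 255 / max(abs(v) for v in a_f32)  # activation factor, here 127.5
q_w = 127 / max(abs(v) for v in w_f32)  # weight factor, here 254.0

a_u8 = quantize(a_f32, q_a, 0, 255)
w_s8 = quantize(w_f32, q_w, -127, 127)

print(a_u8)  # [0, 64, 159, 255]
print(w_s8)  # [-127, 32, 0, 95]
```

Zero maps exactly to zero in both formats, and the largest magnitude in each tensor maps to the end of its integer range.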

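The affine transformation using 8-bit multipliers and 32-bit accumulates results in:

$$x_{s32} = W_{s8} a_{u8} + b_{s32} \approx Q_a Q_w \left(W_{f32} a_{f32} + b_{f32}\right) = Q_a Q_w x_{f32}$$

where the approximation arises because the equation ignores the rounding operation, $Q_a$ and $Q_w$ are the activation and weight quantization factors,

$$x_{f32} = W_{f32} a_{f32} + b_{f32}$$

is the affine transformation in *f32* format, and

$$D = \frac{1}{Q_a Q_w}$$

is the dequantization factor.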
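The approximation can be checked numerically. This sketch assumes per-tensor factors of 255 over the maximum activation and 127 over the maximum absolute weight, a bias scaled by their product, and a dequantization factor equal to the reciprocal of that product. All names and values are illustrative and not taken from Intel MKL-DNN.

```python
def quantize(values, q, lo, hi):
    """Scale by q, round to the nearest integer, and saturate to [lo, hi]."""
    return [max(lo, min(hi, round(q * v))) for v in values]

def matvec(w, a):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * ai for wi, ai in zip(row, a)) for row in w]

w_f32 = [[0.5, -0.25, 0.125], [-0.375, 0.25, 0.5]]
a_f32 = [0.0, 1.0, 2.0]
b_f32 = [0.1, -0.2]

q_a = 255 / max(a_f32)
q_w = 127 / max(abs(v) for row in w_f32 for v in row)

a_u8 = quantize(a_f32, q_a, 0, 255)
w_s8 = [quantize(row, q_w, -127, 127) for row in w_f32]
b_s32 = quantize(b_f32, q_a * q_w, -2**31, 2**31 - 1)

# Integer affine transformation: 8-bit multiplies, 32-bit accumulation.
x_s32 = [acc + b for acc, b in zip(matvec(w_s8, a_u8), b_s32)]

# Dequantize and compare against the floating-point result.
d = 1 / (q_a * q_w)
x_deq = [d * x for x in x_s32]
x_f32 = [acc + b for acc, b in zip(matvec(w_f32, a_f32), b_f32)]

print(x_deq)
print(x_f32)
```

The two printed vectors agree to within the rounding error introduced by quantization.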

In quantizing to *u8* and *s8* formats, a zero value maps to a specific value without any rounding. Given that zero is one of the most common values, it is advantageous to have exact mappings to reduce quantization errors and improve statistical accuracy.

The quantization factors above can be in *fp32* format in the Intel Xeon Scalable processors.

In Figure 4, we demonstrate how to efficiently perform the 8-bit multiplies for *A×W*. Intel MKL-DNN uses an NHWC layout for the activation tensors, where *N* is the batch size, *H* is the height, *W* is the width, and *C* is the number of channels, and a blocked layout for the weight tensors, where *O* is the number of kernels or output channels, *C* is the number of input channels, *Κ* is the height, and *Τ* is the width. The first 32 bits (4 int8 values) of tensor *A* shown in gray are broadcast 16 times to fill a 512-bit register.

Intel MKL-DNN modifies the data layout of tensor *W* after quantizing the weights. Tensor *W* is rearranged as *W'* in groups of 16 columns, with each column holding 32 bits (4 int8 values) to be read contiguously in memory: the first four values in column one occupy the first 32 bits of the register (red), the next 4x1 block occupies the next 32 bits of the register (orange), and so forth (green). The second, third, and fourth blocks (yellow) below the first block are rearranged in the same pattern. The next set of blocks (blue) follows. In practice, tensor *W* is usually transposed before rearranging the memory layout in order to access 1x4 contiguous memory values rather than 4x1 scattered values. This layout change is usually done once and stored for reuse across all inference iterations.

**Figure 4**. The efficient use of int8 multiplies to compute the product *A×W* requires a data layout transformation of tensor *W* in order to read contiguous bits. Groups of 32 bits of *A* are broadcast 16 times to fill a 512-bit register, which is multiplied by groups of 512 bits from tensor *W’*.

The register with the first four int8 values (copied 16 times) of *A* is multiplied by the 64 int8 values (512 bits) of *W’* and accumulated. The next four values in *A* are broadcast 16 times to another register, which is multiplied by the next 64 int8 values of *W’*. This continues until the first row of *A* is read and the results are accumulated. The outputs (after all three instructions of the 8-bit FMA) are the first 16 output values (requiring 512 bits at s32). The first row of *A* is then multiplied by the next values of *W’*, resulting in the next 16 values of the output.
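The broadcast-and-multiply pattern can be emulated on small integer lists. The sketch below is a simplified model with hypothetical names: it handles a single block of 16 output columns, packs the weights so that each group of four rows contributes 64 contiguous values, and checks the result against a straightforward dot product. It illustrates the data movement only and makes no claim about the actual Intel MKL-DNN memory format.

```python
LANES, GROUP = 16, 4  # 16 s32 lanes per 512-bit register, 4 int8 values per lane

def rearrange_weights(w):
    """Pack a C x 16 matrix: per group of 4 rows, column 0's 4 values, then column 1's, ..."""
    blocks = []
    for g in range(0, len(w), GROUP):
        block = []
        for col in range(LANES):
            block.extend(w[g + k][col] for k in range(GROUP))
        blocks.append(block)  # 64 values, i.e. one 512-bit register of int8
    return blocks

def row_times_blocks(a_row, w_blocks):
    """Multiply one row of A by the packed weights, accumulating 16 outputs."""
    acc = [0] * LANES
    for g, block in enumerate(w_blocks):
        reg_a = a_row[GROUP * g:GROUP * g + GROUP] * LANES  # broadcast 4 values 16 times
        for lane in range(LANES):
            lo = GROUP * lane
            acc[lane] += sum(x * y for x, y in zip(reg_a[lo:lo + GROUP], block[lo:lo + GROUP]))
    return acc

C = 8  # number of input channels (must be a multiple of 4 in this sketch)
a_row = [(7 * i) % 256 for i in range(C)]                                      # u8 values
w = [[((3 * r + 5 * c) % 255) - 127 for c in range(LANES)] for r in range(C)]  # s8 values

reference = [sum(a_row[r] * w[r][c] for r in range(C)) for c in range(LANES)]
result = row_times_blocks(a_row, rearrange_weights(w))
print(result == reference)
```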

The Intel Xeon Scalable processors have up to 32 512-bit registers. When executing in the 512-bit register port scheme on processors with two FMA units^{2}, the Port 0 FMA has a latency of four cycles and the Port 5 FMA has a latency of six cycles. The instructions used for deep learning workloads at int8 support bypass and have a latency of five cycles for both ports 0 and 5 (see Section 15.17 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual). In practice, multiple rows of *W'* are loaded to multiple registers to hide these latencies.

Quantizing the weights is done before inference starts. Quantizing activations efficiently requires precomputing the quantization factors, usually by sampling the validation dataset to find the range as described above. Values in the test dataset outside this range are saturated to the range. For negative activation values, the range before saturation could be relaxed to

$$\left[-\frac{128}{127} R_{a'},\; R_{a'}\right]$$

in order to use the *s8=-128* value, where $R_{a'}$ is the maximum absolute value of these activations. These scalars are then written to a file.
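The calibration and saturation steps can be sketched as follows. The sketch assumes a quantization factor of 127 divided by the calibrated maximum absolute value, with saturation to [-128, 127] so that the *s8=-128* value remains usable. The function names and sample data are hypothetical and do not describe the sampling tool of any framework.

```python
def calibrate_range(batches):
    """Maximum absolute activation observed over the calibration samples."""
    return max(abs(v) for batch in batches for v in batch)

def quantize_s8(values, r):
    """Quantize with factor 127 / r, saturating values that fall outside the range."""
    q = 127 / r
    return [max(-128, min(127, round(q * v))) for v in values]

calibration_batches = [[-1.0, 0.5], [2.0, -0.25]]
r_a = calibrate_range(calibration_batches)

# Test-time activations; -3.0 and 2.5 lie outside the calibrated range.
test_values = [-3.0, -2.0, 0.5, 2.5]
quantized = quantize_s8(test_values, r_a)

print(r_a)        # 2.0
print(quantized)  # [-128, -127, 32, 127]
```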

Intel has enabled 8-bit inference in Intel® Optimization for Caffe*, Intel’s deep learning inference engine, Apache MXNet* and TensorFlow*. In the Intel Optimization for Caffe, the **model.prototxt** file is modified to include the precomputed scalars, as shown in the listing below. Currently, the Intel Optimization for Caffe can provide the quantization factor as either a power of two or a regular *fp32* value, and can use either one quantization factor per tensor or one per channel. Those quantization factors are computed using a sampling tool built into the Intel Optimization for Caffe.

Quantization factors are added to the **model.prototxt** file.

```
layer {
  name: "conv2"
  type: "Convolution"
  …
  quantization_param {
    precision: DYNAMIC_FIXED_POINT
    bw_layer_in: 8    # input bit-width
    bw_layer_out: 8   # output bit-width
    bw_params: 8      # weights bit-width
    fl_layer_in: 0    # input fraction length
    fl_layer_out: -2  # output fraction length
    fl_params: 8      # weights fraction length
  }
}
```

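Quantizing activations or input values with negative values can be implemented at the framework level as follows:

The *s8* quantized format is

$$a_{s8} = \lVert Q_{a'} a_{f32} \rVert \in [-128, 127], \qquad Q_{a'} = \frac{127}{R_{a'}}$$

where the function $\lVert \cdot \rVert$ rounds to the nearest integer and $R_{a'}$ is the maximum absolute value of the activations. However, the activation must be in *u8* format to take advantage of the AVX-512 **VPMADDUBSW** instruction or the **AVX512_VNNI VPDPBUSD** instruction. Therefore, all values in $a_{s8}$ are shifted by *K=128* to be non-negative:

$$a_{u8} = a_{s8} + K\mathbf{1} \in [0, 255]$$

where $\mathbf{1}$ is a vector of all 1s, and the bias $b_{f32}$ is modified as:

$$b'_{f32} = b_{f32} - \frac{K}{Q_{a'}} W_{f32} \mathbf{1}$$

The methodology to quantize the weights and modified bias is the same as before:

$$W_{s8} = \lVert Q_w W_{f32} \rVert \in [-127, 127], \qquad b'_{s32} = \lVert Q_{a'} Q_w b'_{f32} \rVert \in [-2^{31}, 2^{31}-1]$$

The affine transformation using 8-bit multipliers and 32-bit accumulates results in:

$$x_{s32} = W_{s8} a_{u8} + b'_{s32} \approx Q_{a'} Q_w \left(W_{f32} a_{f32} + b_{f32}\right) = Q_{a'} Q_w x_{f32}$$

where

$$b'_{s32} \approx Q_{a'} Q_w b_{f32} - K W_{s8} \mathbf{1}$$

and where

$$D = \frac{1}{Q_{a'} Q_w}$$

is the dequantization factor.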

When the input signal is already in *u8* format (for example, RGB images) but a preprocessing step is required to subtract the mean signal, the above equations can be used where *K* is the mean, $a_{u8}$ is the input signal (not pre-processed), and $Q_{a'} = 1$.

To recap, to use activations with negative values, the activations are quantized to *s8* format and then shifted by *K=128* to *u8* format. The only additional change is to modify the bias:

$$b'_{f32} = b_{f32} - \frac{K}{Q_{a'}} W_{f32} \mathbf{1}$$

For a convolution layer, the product $W_{f32}\mathbf{1}$ is generalized to equal the sum over all the values of $W_{f32}$ along all dimensions except the dimension shared with $b_{f32}$.

See Appendix A for details.
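The shift-and-compensate idea can be verified on a tiny fully connected layer. This sketch assumes an activation factor of 127 over the maximum absolute activation, a shift of K=128, and a bias reduced by K divided by the activation factor times the row sums of the floating-point weights before quantization. All names and values are illustrative.

```python
K = 128

def quantize(values, q, lo, hi):
    """Scale by q, round to the nearest integer, and saturate to [lo, hi]."""
    return [max(lo, min(hi, round(q * v))) for v in values]

def matvec(w, a):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * ai for wi, ai in zip(row, a)) for row in w]

w_f32 = [[0.5, -0.125], [0.125, 0.375]]
a_f32 = [-1.0, 0.25]   # activations with negative values
b_f32 = [0.25, -0.5]

q_a = 127 / max(abs(v) for v in a_f32)
q_w = 127 / max(abs(v) for row in w_f32 for v in row)

a_s8 = quantize(a_f32, q_a, -128, 127)
a_u8 = [v + K for v in a_s8]  # shift to the unsigned range [0, 255]
w_s8 = [quantize(row, q_w, -127, 127) for row in w_f32]

# Modified bias: subtract (K / q_a) times the sum of each weight row.
b_mod_f32 = [b - (K / q_a) * sum(row) for b, row in zip(b_f32, w_f32)]
b_mod_s32 = quantize(b_mod_f32, q_a * q_w, -2**31, 2**31 - 1)

x_s32 = [acc + b for acc, b in zip(matvec(w_s8, a_u8), b_mod_s32)]
x_deq = [x / (q_a * q_w) for x in x_s32]
x_f32 = [acc + b for acc, b in zip(matvec(w_f32, a_f32), b_f32)]

print(a_u8)
print(x_deq)
print(x_f32)
```

The dequantized output tracks the floating-point result even though the multiply itself only ever sees non-negative *u8* activations.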

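Fused quantization improves performance by combining dequantization and quantization as follows, so there is no need to convert to *fp32*. The activation at layer $l+1$ is:

$$a_{f32}^{(l+1)} = g\left(x_{f32}^{(l)}\right) = g\left(D^{(l)} x_{s32}^{(l)}\right)$$

where $g(\cdot)$ is a non-linear activation function. Assuming the ReLU activation function, the activation can be expressed in *u8* format as:

$$a_{u8}^{(l+1)} = \left\lVert Q_a^{(l+1)} a_{f32}^{(l+1)} \right\rVert = \left\lVert Q_a^{(l+1)} D^{(l)} \max\left(0, x_{s32}^{(l)}\right) \right\rVert$$

where the product $Q_a^{(l+1)} D^{(l)}$ enables computing the next layer’s quantized activation in *u8* format without computing the *fp32* representation.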
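A short sketch of the fused step follows, assuming the ReLU activation and a single combined scale equal to the next layer's quantization factor times the current layer's dequantization factor. The numeric values for both factors are made up for illustration, and the function name is hypothetical.

```python
def requantize_relu(x_s32, q_next, d_curr):
    """Apply ReLU on the s32 accumulators and rescale straight to u8."""
    scale = q_next * d_curr  # one precomputed multiplier per tensor
    return [min(255, round(scale * max(0, x))) for x in x_s32]

x_s32 = [-500, 0, 1000, 40000]
d_curr = 1 / 32385   # illustrative dequantization factor of the current layer
q_next = 255 / 1.0   # illustrative quantization factor of the next layer

fused = requantize_relu(x_s32, q_next, d_curr)

# Unfused reference: dequantize to floating point, apply ReLU, then quantize.
unfused = [min(255, round(q_next * max(0.0, d_curr * x))) for x in x_s32]

print(fused)    # [0, 0, 8, 255]
print(unfused)  # [0, 0, 8, 255]
```

The design choice here is that the combined scale is computed once, so the per-element work is a single multiply, a rounding, and a saturation.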

When $g(\cdot)$ is the ReLU function (as in the equations below) and $Q \ge 0$ (as is always the case for the quantization factors), the following property holds:

$$Q\, g(x) = g(Q x)$$

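This property is useful for models with skip connections such as ResNet*, where a skip connection branch may have dependencies on various activations. As an example, and using the nomenclature by the ResNet-50 author in Caffe’s deploy.prototxt (see Figure 5), the quantized input activation in layer **res2b_branch2a** (abbreviated as **2b2a** in the equations below) is:

$$a_{u8}^{(2b2a)} = \left\lVert Q_a^{(2b2a)}\, g\left(D^{(2a1)} x_{s32}^{(2a1)} + D^{(2a2c)} x_{s32}^{(2a2c)}\right) \right\rVert = \left\lVert g\left(Q_a^{(2b2a)} D^{(2a1)} x_{s32}^{(2a1)} + Q_a^{(2b2a)} D^{(2a2c)} x_{s32}^{(2a2c)}\right) \right\rVert$$

where $a_{u8}^{(2b2a)} \in [0, 127]$ (instead of *[0,255]*) because the scaled sum $Q_a^{(2b2a)} \left(D^{(2a1)} x_{s32}^{(2a1)} + D^{(2a2c)} x_{s32}^{(2a2c)}\right) \in [-128, 127]$ is in *s8* format because the product comes before the ReLU function and $Q_a^{(2b2a)} = 127 / R_a^{(2b2a)}$ is the quantization factor. Following this procedure, it is shown in Appendix B that the activation $a_{u8}^{(2c2a)}$ depends on $x_{s32}^{(2a1)}$, $x_{s32}^{(2a2c)}$, and $x_{s32}^{(2b2c)}$. Similarly, the activation $a_{u8}^{(3a2a)}$ depends on $x_{s32}^{(2a1)}$, $x_{s32}^{(2a2c)}$, $x_{s32}^{(2b2c)}$, and $x_{s32}^{(2c2c)}$.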

**Figure 5**. In ResNet-50*, for example, the layers marked with a blue arrow have dependencies on two or more activations. Image credit: Barukh Ziv, Etay Meiri, and Eden Segal.

A batch normalization (BN) inference layer is not needed as it can be absorbed by its preceding layer by scaling the weight values and modifying the bias. This technique only works for inference and is not unique to lower numerical precision. It can be implemented at the framework level instead of Intel MKL-DNN. BN is usually applied after the affine transformation

$$x = W a + b$$

and before the activation function (details in the original BN paper). BN normalizes *x* to have zero mean and unit variance, and then scales and shifts the normalized vector by *γ* and *β*, respectively, which are parameters also learned during training. During a training iteration, *x* is normalized using the mini-batch statistics. For inference, the mean *E* and variance *V* of *x* are precomputed using the statistics of the entire training dataset or a variant, such as a running average of these statistics computed during training. During inference, the BN output *y* is:

$$y = \gamma \frac{x - E\mathbf{1}}{\sqrt{V}} + \beta\mathbf{1} = \frac{\gamma}{\sqrt{V}} \left(W a + b - E\mathbf{1}\right) + \beta\mathbf{1} = W' a + b'$$

where

$$W' = \frac{\gamma}{\sqrt{V}} W, \qquad b' = \frac{\gamma}{\sqrt{V}} \left(b - E\mathbf{1}\right) + \beta\mathbf{1}$$

That is, during inference the BN layer can be replaced by adjusting weights and bias in the preceding convolutional or fully connected layer.
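BN folding can be checked on a small fully connected layer. The sketch below assumes per-output-channel statistics and scales each weight row by *γ* divided by the square root of the variance. It omits the small epsilon that frameworks typically add to the variance, and all names and values are illustrative.

```python
import math

def matvec(w, a):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * ai for wi, ai in zip(row, a)) for row in w]

def fold_bn(w, b, gamma, beta, mean, var):
    """Return weights and bias that absorb the BN layer (no epsilon, by assumption)."""
    scale = [g / math.sqrt(v) for g, v in zip(gamma, var)]
    w_folded = [[s * wij for wij in row] for s, row in zip(scale, w)]
    b_folded = [s * (bi - m) + be for s, bi, m, be in zip(scale, b, mean, beta)]
    return w_folded, b_folded

w = [[1.0, -2.0], [0.5, 0.25]]
b = [0.5, -1.0]
gamma, beta = [2.0, 0.5], [0.1, -0.3]
mean, var = [0.2, -0.4], [4.0, 0.25]
a = [3.0, -1.0]

# Reference: affine transformation followed by an explicit BN layer.
x = [acc + bi for acc, bi in zip(matvec(w, a), b)]
y_reference = [g * (xi - m) / math.sqrt(v) + be
               for g, xi, m, v, be in zip(gamma, x, mean, var, beta)]

# Folded: a single affine transformation with adjusted parameters.
w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)
y_folded = [acc + bi for acc, bi in zip(matvec(w_folded, a), b_folded)]

print(y_reference)
print(y_folded)
```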

Model optimizations can further improve inference performance. For example, in ResNet, the stride operation can be moved to an earlier layer without modifying the end result while reducing the number of operations, as shown in Figure 6. This modification applies to both 8 bits and 32 bits.

**Figure 6**. Example model optimization in ResNet. Illustration courtesy of Eden Segal and Etay Meiri.

Lower numerical precision inference can improve the computational performance with minimal or no reduction in statistical accuracy. Intel has enabled 8-bit precision for inference acceleration on the current generation of Intel Xeon Scalable processors. Intel is also enabling 16-bit precision for training on future microarchitectures in both hardware and software, enabling compilers, the Intel MKL-DNN library, and popular deep learning frameworks. For more information, read Next-generation Intel Xeon Scalable Processors to Deliver Breakthrough Platform Performance with up to 56 Processor Cores.

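**Appendix A**. To convince the reader that these same formulas (see the section 8-bit quantization of activations or inputs with negative values) generalize to convolutional layers, we use the indices of each tensor entry and work through the steps to show the convolutional output.

Let

$$W_{f32} \in \mathbb{R}^{O \times C \times Κ \times Τ}$$

be the weight tensor with *O* kernels or output channels, *C* input channels, *Κ* height, and *Τ* width. The modified bias can be represented as:

$$b'_{f32}[o_i] = b_{f32}[o_i] - \frac{K}{Q_{a'}} \sum_{c_i, \kappa_i, \tau_i} W_{f32}[o_i, c_i, \kappa_i, \tau_i]$$

where $o_i$, $c_i$, $\kappa_i$, and $\tau_i$ are the indices for the kernels or output channels, input channels, kernel height, and kernel width, respectively. The convolution output can be represented as follows. Note that we assume batch size one (to omit the batch index for simplicity), the activations have been already zero padded in *fp32* format (or equivalently padded with *K=128* in *u8* format), and the convolution stride is one.

$$x_{s32}[o_i, h_i, w_i] = b'_{s32}[o_i] + \sum_{c_i, \kappa_i, \tau_i} a_{u8}[c_i, h_i + \kappa_i, w_i + \tau_i]\, W_{s8}[o_i, c_i, \kappa_i, \tau_i]$$

$$\approx Q_{a'} Q_w b_{f32}[o_i] - K \sum_{c_i, \kappa_i, \tau_i} W_{s8}[o_i, c_i, \kappa_i, \tau_i] + \sum_{c_i, \kappa_i, \tau_i} \left(a_{s8}[c_i, h_i + \kappa_i, w_i + \tau_i] + K\right) W_{s8}[o_i, c_i, \kappa_i, \tau_i]$$

$$= Q_{a'} Q_w b_{f32}[o_i] + \sum_{c_i, \kappa_i, \tau_i} a_{s8}[c_i, h_i + \kappa_i, w_i + \tau_i]\, W_{s8}[o_i, c_i, \kappa_i, \tau_i] \approx Q_{a'} Q_w\, x_{f32}[o_i, h_i, w_i]$$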
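The cancellation of the shift can be checked exactly in the integer domain. The sketch below uses a stride-one convolution without padding on small hand-made tensors, and subtracts K times the per-output-channel sum of the *s8* weights from the shifted result. The function name and the tensor contents are hypothetical, and the check covers only the integer identity, not the rounding of the bias.

```python
K = 128

def conv2d(a, w):
    """Stride-1 convolution without padding. a: [C][H][W], w: [O][C][Kh][Kw]."""
    n_c, n_h, n_w = len(a), len(a[0]), len(a[0][0])
    n_o, k_h, k_w = len(w), len(w[0][0]), len(w[0][0][0])
    return [[[sum(a[c][h + k][x + t] * w[o][c][k][t]
                  for c in range(n_c) for k in range(k_h) for t in range(k_w))
              for x in range(n_w - k_w + 1)]
             for h in range(n_h - k_h + 1)]
            for o in range(n_o)]

# Signed activations in [-128, 127] and signed weights in [-127, 127].
a_s8 = [[[(5 * c + 3 * h - 7 * x) % 256 - 128 for x in range(4)]
         for h in range(4)] for c in range(2)]
w_s8 = [[[[(o + 2 * c - 3 * k + 4 * t) % 255 - 127 for t in range(3)]
          for k in range(3)] for c in range(2)] for o in range(2)]

# Shift the activations to u8 and build the per-output-channel compensation:
# the sum over every weight dimension except the output channel.
a_u8 = [[[v + K for v in row] for row in plane] for plane in a_s8]
bias_shift = [-K * sum(v for plane in w_s8[o] for row in plane for v in row)
              for o in range(len(w_s8))]

reference = conv2d(a_s8, w_s8)
shifted = [[[v + bias_shift[o] for v in row] for row in plane]
           for o, plane in enumerate(conv2d(a_u8, w_s8))]

print(shifted == reference)
```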

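**Appendix B**. The activation inputs to the layers marked by the blue arrow in Figure 5 are as follows, where layer **res2b_branch2a** is abbreviated as **2b2a** in the equations below, with similar abbreviations for the other layers.

$$a_{f32}^{(2b2a)} = g\left(D^{(2a1)} x_{s32}^{(2a1)} + D^{(2a2c)} x_{s32}^{(2a2c)}\right), \qquad a_{u8}^{(2b2a)} = \left\lVert Q_a^{(2b2a)} a_{f32}^{(2b2a)} \right\rVert$$

$$a_{f32}^{(2c2a)} = g\left(a_{f32}^{(2b2a)} + D^{(2b2c)} x_{s32}^{(2b2c)}\right), \qquad a_{u8}^{(2c2a)} = \left\lVert Q_a^{(2c2a)} a_{f32}^{(2c2a)} \right\rVert$$

$$a_{f32}^{(3a2a)} = g\left(a_{f32}^{(2c2a)} + D^{(2c2c)} x_{s32}^{(2c2c)}\right), \qquad a_{u8}^{(3a2a)} = \left\lVert Q_a^{(3a2a)} a_{f32}^{(3a2a)} \right\rVert$$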

^{1} In practice, these u8 values are usually closer to their minimum than their maximum if their activations are preceded by the ReLU activation function.

^{2} Two 512-bit FMA units computing in parallel per core are available in Intel® Xeon® Platinum processors and Intel® Xeon® Gold processors 6000 series and 5122. Other Intel® Xeon® Scalable processor SKUs have one FMA unit per core.
