Most deep learning applications today use 32-bit floating-point (FP32) precision for inference workloads. Recently, the 8-bit integer (INT8) data type has also been used successfully for deep learning inference. The article Lower Numerical Precision Deep Learning Inference provides an overview of INT8 acceleration using Intel® Deep Learning Boost (Intel® DL Boost), available in 2nd generation Intel® Xeon® Scalable processors, the only microprocessor with built-in AI inference acceleration.
Intel® DL Boost Vector Neural Network Instructions
Based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512), the Intel DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, making better use of the cache, and avoiding potential bandwidth bottlenecks.
Starting with the 1st generation Intel Xeon Scalable processor (shown above), the convolution operations predominant in neural network workloads were implemented in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). These operations use the FP32 data type via the vfmadd231ps instruction in the Intel AVX-512 instruction set. With up to two 512-bit FMA units computing in parallel per core, two vfmadd231ps instructions can execute in a given cycle.
INT8 uses 8 bits to represent integer data: 7 value bits and a sign bit. FP32 uses 32 bits to represent floating-point data: 23 bits of mantissa, 8 bits of exponent, and a sign bit. The reduction in the number of bits used for inference with INT8 delivers better memory and compute utilization, because less data is transferred and each instruction processes more values. On 1st generation Intel Xeon Scalable processors, Intel MKL-DNN implemented INT8 convolution operations using the Intel AVX-512 instructions vpmaddubsw, vpmaddwd, and vpaddd to take advantage of low-precision data. Although this gave some performance improvement over FP32 convolution, the need for three instructions per INT8 convolution step, combined with the microarchitecture limit of two 512-bit instructions per clock cycle, left room for further innovation.
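As a rough illustration of the three-instruction sequence, the following pure-Python sketch emulates a single 32-bit accumulator lane (four byte pairs). The function names are hypothetical helpers rather than a real API, and passing a vector of 16-bit ones as the second vpmaddwd operand is an assumption about how the sequence is usually arranged. The real instructions operate on 16 such lanes of a 512-bit register at once.

```python
def saturate_i16(x):
    """Clamp to the signed 16-bit range, as vpmaddubsw does for its pair sums."""
    return max(-32768, min(32767, x))


def wrap_i32(x):
    """Wrap an integer to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vpmaddubsw_lane(a_u8, b_s8):
    """Multiply 4 unsigned bytes by 4 signed bytes, then add adjacent
    products into two signed 16-bit values with saturation."""
    return [saturate_i16(a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1])
            for i in (0, 2)]


def vpmaddwd_lane(x_s16, y_s16):
    """Multiply two pairs of signed 16-bit values and add the two
    products into one signed 32-bit value."""
    return wrap_i32(x_s16[0] * y_s16[0] + x_s16[1] * y_s16[1])


def vpaddd_lane(acc_s32, x_s32):
    """Add two signed 32-bit values with wraparound."""
    return wrap_i32(acc_s32 + x_s32)


def int8_dot_three_instructions(acc_s32, a_u8, b_s8):
    """Hypothetical helper: accumulate a 4-element u8 x s8 dot product
    using the three-step sequence described in the text."""
    pair_sums = vpmaddubsw_lane(a_u8, b_s8)        # step 1: u8*s8 -> 2 x s16
    quad_sum = vpmaddwd_lane(pair_sums, [1, 1])    # step 2: 2 x s16 -> s32
    return vpaddd_lane(acc_s32, quad_sum)          # step 3: accumulate


print(int8_dot_three_instructions(10, [1, 2, 3, 4], [5, -6, 7, -8]))
print(int8_dot_three_instructions(0, [255, 255, 0, 0], [127, 127, 0, 0]))
```

The second call is chosen to hit the 16-bit saturation in the first step: the exact dot product is 64770, but the pair sum is clamped to 32767 before it reaches the 32-bit accumulator.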
In 2nd generation Intel® Xeon® Scalable processors, convolutions in Intel MKL-DNN are computed in INT8 precision via a single Intel AVX-512 instruction, vpdpbusd. Because the low-precision operation now needs only one instruction, two of these operations can be executed in a given cycle. Reduced precision and the use of a single instruction optimize utilization of the microarchitecture for each convolution operation in a neural network and bring significant performance benefits.
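To make the single-instruction form concrete, here is a minimal pure-Python sketch of what vpdpbusd computes: in each 32-bit lane, four unsigned bytes are multiplied by four signed bytes and the four products are added to the existing 32-bit accumulator. The helper names and the list-based register layout are assumptions made for illustration. The sketch models the non-saturating form of the instruction, so overflow wraps around.

```python
def wrap_i32(x):
    """Wrap an integer to a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vpdpbusd_lane(acc_s32, a_u8, b_s8):
    """One 32-bit lane: acc += sum of four (unsigned byte * signed byte)."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    products = [a * b for a, b in zip(a_u8, b_s8)]
    return wrap_i32(acc_s32 + sum(products))


def vpdpbusd(acc, a_bytes, b_bytes):
    """Hypothetical whole-register helper: acc holds N signed 32-bit lanes,
    a_bytes and b_bytes hold 4*N bytes each (N = 16 for a 512-bit register)."""
    return [
        vpdpbusd_lane(acc[i], a_bytes[4 * i:4 * i + 4], b_bytes[4 * i:4 * i + 4])
        for i in range(len(acc))
    ]


# One lane, including a case whose pair sum would not fit in 16 bits.
print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))
print(vpdpbusd_lane(0, [255, 255, 0, 0], [127, 127, 0, 0]))

# A full 512-bit register: 64 byte pairs reduced into 16 accumulators.
result = vpdpbusd([0] * 16, list(range(64)), [1] * 64)
print(result[0], result[15])
```

Because the products are summed directly into 32 bits, the second call returns the exact value 64770. There is no intermediate 16-bit step in which a pair sum could be clamped.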
Neural network inference requires weights from a trained model. These weights are often stored in FP32 precision during training to maintain accuracy and ensure convergence. To take advantage of low-precision inference, the FP32 weights from the trained model are converted to INT8 through a process called quantization. This conversion from a floating-point data type to an integer data type may result in some loss of accuracy, which can be mitigated via:
- Post-training quantization, where we collect statistics for the activations in order to find an appropriate quantization factor. Using the quantization factor, we quantize the trained model for 8-bit inference.
- Quantization-aware training, which employs “fake” quantization in the network during training so that the captured FP32 weights are quantized to INT8 at each iteration after the weight updates.
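Both approaches can be sketched in a few lines of pure Python. The sketch assumes symmetric, per-tensor quantization with a max-absolute-value calibration statistic. That is one common choice made here for illustration, not necessarily the scheme used by any particular toolkit or framework, and all function names are hypothetical.

```python
def calibrate_scale(activation_batches, qmax=127):
    """Collect a simple statistic (the largest magnitude seen) over
    calibration data and turn it into a quantization factor."""
    max_abs = max(abs(x) for batch in activation_batches for x in batch)
    if max_abs == 0.0:
        return 1.0  # degenerate case: all-zero calibration data
    return qmax / max_abs


def quantize(x, scale, qmax=127):
    """Map an FP32 value to a signed 8-bit integer, clamping outliers."""
    q = int(round(x * scale))
    return max(-qmax, min(qmax, q))


def dequantize(q, scale):
    """Map a quantized integer back to an approximate FP32 value."""
    return q / scale


def fake_quantize(x, scale):
    """Quantize then immediately dequantize, so the value stays in floating
    point but carries the rounding error that INT8 inference would see."""
    return dequantize(quantize(x, scale), scale)


# Post-training: derive the factor from sample activations, then quantize.
calibration = [[0.5, -2.0], [1.0, 0.25]]
scale = calibrate_scale(calibration)
print(scale)
print([quantize(x, scale) for x in (0.5, -2.0, 5.0)])

# Quantization-aware training would apply fake_quantize in the forward pass.
print(abs(fake_quantize(0.5, scale) - 0.5) <= 0.5 / scale)
```

Values outside the calibrated range, such as 5.0 here, are clamped to the largest representable integer. This is why the choice of calibration statistic affects accuracy.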
You can realize the performance benefits of VNNI on the 2nd generation Intel Xeon Scalable processor with the quantization techniques in the Intel® Distribution of OpenVINO™ toolkit or Intel®-optimized frameworks such as TensorFlow* and PyTorch*.
With VNNI, low-precision inference capabilities can be more easily integrated alongside other workloads on versatile, multipurpose 2nd generation Intel Xeon Scalable processors. Further, performance can improve significantly for both batch inference and real-time inference, because VNNI reduces the number of instructions needed for the convolution operations in AI inference, which in turn reduces the compute power and memory accesses these operations require.
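The per-cycle reasoning in this article can be summarized with an idealized peak model. It assumes a 512-bit vector width and at most two 512-bit instructions issued per cycle per core, as stated above, and it ignores memory bandwidth, quantization overhead, and clock frequency effects. The numbers are theoretical upper bounds derived from instruction counts, not measurements, and the function name is hypothetical.

```python
from fractions import Fraction

VECTOR_BITS = 512        # width of one Intel AVX-512 register
ISSUE_PER_CYCLE = 2      # 512-bit instructions per cycle per core (from the text)


def peak_macs_per_cycle(element_bits, instructions_per_step):
    """Multiply-accumulates per cycle if every issue slot is filled.

    One step consumes a full register of inputs (VECTOR_BITS / element_bits
    values) and costs `instructions_per_step` instructions.
    """
    macs_per_step = VECTOR_BITS // element_bits
    return Fraction(macs_per_step * ISSUE_PER_CYCLE, instructions_per_step)


fp32 = peak_macs_per_cycle(32, 1)         # vfmadd231ps
int8_avx512 = peak_macs_per_cycle(8, 3)   # vpmaddubsw + vpmaddwd + vpaddd
int8_vnni = peak_macs_per_cycle(8, 1)     # vpdpbusd

print("FP32:", fp32)
print("INT8, three instructions:", int8_avx512, "ratio to FP32:", int8_avx512 / fp32)
print("INT8, VNNI:", int8_vnni, "ratio to FP32:", int8_vnni / fp32)
```

Under these assumptions the three-instruction INT8 path peaks at 4/3 of the FP32 rate, while the single-instruction VNNI path peaks at 4 times the FP32 rate.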