Code Sample: Intel® AVX-512 Deep Learning Boost: Intrinsic Functions

By Alberto Villarreal Cueva

Published: 04/02/2019   Last Updated: 12/23/2019


License: 3-Clause BSD License

Optimized for...
Operating System: Linux*
Hardware: Second generation Intel® Xeon® Scalable processor
Software (Programming Language, tool, IDE, Framework): C++ Compiler version 19, Intel® Parallel Studio XE 2019
Prerequisites: Familiarity with C++

This code example shows how to take advantage of the new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with Intel® Deep Learning Boost (Intel® DL Boost) in 2nd generation Intel® Xeon® Scalable processors.

The example demonstrates how to exercise the new functionality using intrinsic functions.

Intel® AVX-512 and Intel® DL Boost

2nd generation Intel Xeon Scalable processors include a new Intel AVX-512 extension called Intel DL Boost, which contains the Vector Neural Network Instructions (VNNI). Designed to improve the throughput of integer linear algebra, these instructions can accelerate loops in some convolutional neural networks (CNNs) that multiply two 8-bit (or 16-bit) integers and accumulate the result in a 32-bit integer variable.
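To make that loop pattern concrete, here is a scalar sketch (written for illustration; the function name and operands are hypothetical, not part of Intel's sample) of the kind of inner loop VNNI accelerates: unsigned 8-bit values multiplied by signed 8-bit values, accumulated into a 32-bit integer.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of the loop pattern VNNI accelerates: multiply unsigned
// 8-bit inputs by signed 8-bit inputs and accumulate into a 32-bit integer.
// (Illustrative only; names are hypothetical, not from the sample.)
int32_t dot_u8s8(const uint8_t* a, const int8_t* b, std::size_t n,
                 int32_t acc) {
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    return acc;
}
```

Each VPDPBUSD lane performs four iterations of this scalar loop at once, and sixteen such lanes run in parallel across a 512-bit register.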

The VNNI feature includes a fused instruction to perform lower precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This instruction replaces a sequence of three instructions that are part of the Intel AVX-512 Fused-Multiply-Add (FMA) Extensions. Figure 1 shows how the new instruction in VNNI VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD.

[Figure 1: diagram of the VNNI instruction]
Figure 1. Intel® AVX-512 DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD to perform 8-bit multiplies with 32-bit accumulates. Image credit to Israel Hirsh and Bob Valentine.

Find a detailed description of both the Intel® AVX-512 DL Boost fused instruction and the FMA-based instructions, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.

Code Sample

This code sample uses Intel AVX-512 intrinsics to illustrate the use of both the VNNI fused instruction and the three equivalent FMA-based instructions.

Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:

#include <immintrin.h>

The Intel AVX-512 intrinsic functions use C data types as operands representing the 512-bit registers used in the operations. The __m512i data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values:

   uint8_t  op1_int8[64];
   int8_t   op2_int8[64];
   int32_t  op3_int[16];
   int16_t  op4_int16[32];
   int32_t  result[16];

   __m512i v1_int8;
   __m512i v2_int8;
   __m512i v3_int;
   __m512i v4_int16;
   __m512i vresult;
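Before loading, the operand arrays need values. The sketch below (hypothetical test data, repeating the declarations so it is self-contained; not part of the article's sample) highlights one requirement that matters for the FMA path shown later: op4_int16 must contain all ones so that vpmaddwd acts as a pure widening add.

```cpp
#include <cstdint>

// Operand arrays, as declared in the sample.
uint8_t  op1_int8[64];
int8_t   op2_int8[64];
int32_t  op3_int[16];
int16_t  op4_int16[32];

// Hypothetical initialization (values are illustrative, not from the
// article). op4_int16 is filled with ones so that, in the FMA sequence
// shown later, vpmaddwd performs a pure widening add.
void init_operands() {
    for (int i = 0; i < 64; ++i) {
        op1_int8[i] = static_cast<uint8_t>(i);      // unsigned 8-bit inputs
        op2_int8[i] = static_cast<int8_t>(i - 32);  // signed 8-bit inputs
    }
    for (int i = 0; i < 16; ++i) op3_int[i]   = 0;  // 32-bit accumulators
    for (int i = 0; i < 32; ++i) op4_int16[i] = 1;  // all ones for vpmaddwd
}
```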


Data from memory can be loaded into the registers using the _mm512_loadu_si512 function, which does not require the data to be aligned on any particular boundary. If the data is known to be aligned on a 64-byte boundary, the _mm512_load_si512 function can be used instead:

   v1_int8  = _mm512_loadu_si512(op1_int8);
   v2_int8  = _mm512_loadu_si512(op2_int8);
   v3_int   = _mm512_loadu_si512(op3_int);
   v4_int16 = _mm512_loadu_si512(op4_int16);

Once the data is loaded, perform the dot product operation using the fused instruction vpdpbusds, which is called via the intrinsic function _mm512_dpbusds_epi32. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in v1_int8 with the corresponding signed 8-bit integers in v2_int8, producing four intermediate signed 16-bit results. It then sums these four results with the corresponding 32-bit integer in v3_int using signed saturation, and returns the packed 32-bit results:

   vresult = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);

   _mm512_storeu_si512((void *) result, vresult);

   for (int j = 15; j >= 0; j--)
       cout << result[j] << " ";
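To pin down what each 32-bit lane computes, here is a scalar model of one vpdpbusds lane (written for this article as an illustration, not Intel code): four unsigned-by-signed 8-bit products summed into the accumulator with signed saturation.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of one 32-bit lane of vpdpbusds: four unsigned-by-signed
// 8-bit products are summed into the accumulator, and the final result is
// saturated to the signed 32-bit range. (Illustrative, not Intel code.)
int32_t dpbusds_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    int64_t sum = acc;  // widen so the intermediate sum cannot overflow
    for (int i = 0; i < 4; ++i)
        sum += static_cast<int64_t>(a[i]) * b[i];
    // saturate the result to the signed 32-bit range
    sum = std::min<int64_t>(std::max<int64_t>(sum, INT32_MIN), INT32_MAX);
    return static_cast<int32_t>(sum);
}
```

Without the trailing "s" (vpdpbusd, called via _mm512_dpbusd_epi32), the accumulation wraps instead of saturating.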

It is also possible to perform the same dot product operation using three separate FMA instructions vpmaddubsw, vpmaddwd, and vpaddd (which are called using the intrinsic functions _mm512_maddubs_epi16, _mm512_madd_epi16, and _mm512_add_epi32, respectively):


   // Vertically multiply unsigned and signed 8-bit integers, then
   // horizontally add adjacent pairs into 16-bit results

   __m512i vresult1 = _mm512_maddubs_epi16(v1_int8, v2_int8);

   // Upconvert to 32-bit and horizontally add neighbors. Here v4_int16
   // must contain all ones, so the multiply by 1 leaves the products intact.
   __m512i vresult2 = _mm512_madd_epi16(vresult1, v4_int16);

   // Add packed 32-bit integers
   vresult = _mm512_add_epi32(vresult2, v3_int);

   _mm512_storeu_si512((void *) result, vresult);

   for (int j = 15; j >= 0; j--)
       cout << result[j] << " ";
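As a scalar sketch of one 32-bit lane of this three-instruction sequence (illustrative code written for this article, not Intel's), note one behavioral difference from the fused path: vpmaddubsw saturates its pairwise sums to 16 bits, which the single vpdpbusds instruction avoids.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar sketch of one 32-bit lane of the vpmaddubsw/vpmaddwd/vpaddd
// sequence (illustrative, not Intel code). vpmaddubsw saturates its
// pairwise sums to 16 bits, which the fused vpdpbusds path avoids.
int32_t fma_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    auto sat16 = [](int32_t x) {
        return static_cast<int16_t>(std::min(std::max(x, -32768), 32767));
    };
    // vpmaddubsw: u8 x s8 products, adjacent pairs added with s16 saturation
    int16_t p0 = sat16(int32_t(a[0]) * b[0] + int32_t(a[1]) * b[1]);
    int16_t p1 = sat16(int32_t(a[2]) * b[2] + int32_t(a[3]) * b[3]);
    // vpmaddwd against a vector of ones: widen to 32 bits and add neighbors
    int32_t widened = int32_t(p0) * 1 + int32_t(p1) * 1;
    // vpaddd: add the 32-bit accumulator (wraps on overflow, no saturation)
    return acc + widened;
}
```

For inputs whose pairwise sums stay within the 16-bit range, both paths produce identical results; the fused instruction simply does the work in one operation.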

For More Information

Find detailed descriptions of Intel AVX-512 intrinsics in the Intel® Intrinsics Guide. A detailed description of the VNNI instruction, as well as how it is implemented in the Intel® MKL-DNN library, can be found in the following white paper: Accelerate Lower Numerical Precision Inference with Intel® Deep Learning Boost.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804