By Alberto Villarreal Cueva, Published: 04/02/2019, Last Updated: 12/23/2019

| File(s): | Download |
|---|---|
| License: | 3-Clause BSD License |

| Optimized for... | |
|---|---|
| Operating System: | Linux* |
| Hardware: | Second generation Intel® Xeon® Scalable processor |
| Software (Programming Language, tool, IDE, Framework): | C++ Compiler version 19, Intel® Parallel Studio XE 2019 |
| Prerequisites: | Familiarity with C++ |

This code example shows how to take advantage of the new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with Intel® Deep Learning Boost (Intel® DL Boost) in 2nd generation Intel® Xeon® Scalable processors.

The example demonstrates testing the new functionality using intrinsic functions.

2nd generation Intel Xeon Scalable processors include a new Intel AVX-512 extension called Intel DL Boost, which contains the Vector Neural Network Instructions (VNNI). Designed to improve the throughput of integer linear algebra, these instructions can accelerate loops in some convolutional neural networks (CNNs) that multiply two 8-bit (or 16-bit) integers and accumulate the result in a 32-bit integer variable.

The VNNI feature includes a fused instruction that performs lower-precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This single instruction replaces a sequence of three instructions from the Intel AVX-512 Fused Multiply-Add (FMA) extensions. Figure 1 shows how the new VNNI instruction **VPDPBUSD** replaces the three separate FMA instructions **VPMADDUBSW**, **VPMADDWD**, and **VPADDD**.

Figure 1. Intel® AVX-512 DL Boost instruction **VPDPBUSD** replaces the three separate FMA instructions **VPMADDUBSW**, **VPMADDWD**, and **VPADDD** to perform 8-bit multiplies with 32-bit accumulates. Image credit to Israel Hirsh and Bob Valentine.

Find a detailed description of both the Intel® AVX-512 DL Boost fused instruction and the FMA-based instruction sequence, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.

This code sample uses Intel AVX-512 intrinsics to illustrate use of both the VNNI fused instruction and the three equivalent FMA-based instructions.

Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:

`#include <immintrin.h>`

The Intel AVX-512 intrinsic functions use C data types as operands representing the 512-bit registers used in the operations. The **__m512i** data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values:

```
uint8_t op1_int8[64];
int8_t op2_int8[64];
int32_t op3_int[16];
int16_t op4_int16[32];
int32_t result[16];
__m512i v1_int8;
__m512i v2_int8;
__m512i v3_int;
__m512i v4_int16;
__m512i vresult;
```

Data from memory can be loaded into the registers using the **_mm512_loadu_si512** function, which does not require the data to be aligned on any particular boundary. If the data is aligned on a 64-byte boundary, the **_mm512_load_si512** function can be used instead:

```
v1_int8 = _mm512_loadu_si512(op1_int8);
v2_int8 = _mm512_loadu_si512(op2_int8);
v3_int = _mm512_loadu_si512(op3_int);
v4_int16 = _mm512_loadu_si512(op4_int16);
```

Once the data is loaded, perform the dot product operation using the fused instruction **vpdpbusds**, called via the intrinsic function **_mm512_dpbusds_epi32**. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in **v1_int8** with the corresponding signed 8-bit integers in **v2_int8**, producing four intermediate signed 16-bit results. It then sums these four results with the corresponding 32-bit integer in **v3_int** using signed saturation, and returns the packed 32-bit results:

```
// PERFORM THE DOT PRODUCT OPERATION USING FUSED INSTRUCTION
vresult = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);
_mm512_storeu_si512((void *) result, vresult);
printf("RESULTS USING FUSED INSTRUCTION: \n");
for (int j = 15; j >= 0; j--) {
    cout << result[j] << " ";
}
```

It is also possible to perform the same dot product operation using three separate FMA instructions **vpmaddubsw**, **vpmaddwd**, and **vpaddd** (which are called using the intrinsic functions **_mm512_maddubs_epi16**, **_mm512_madd_epi16**, and **_mm512_add_epi32**, respectively):

```
// PERFORM THE DOT PRODUCT OPERATION USING A SEQUENCE OF 3 INSTRUCTIONS
// Vertically multiply the unsigned 8-bit integers in v1_int8 by the signed
// 8-bit integers in v2_int8, then horizontally add adjacent pairs of the
// 16-bit products with signed saturation
__m512i vresult1 = _mm512_maddubs_epi16(v1_int8, v2_int8);
// Multiply the 16-bit results by 1 (v4_int16 holds ones) and horizontally
// add adjacent pairs, widening to 32-bit integers
__m512i vresult2 = _mm512_madd_epi16(vresult1, v4_int16);
// Add packed 32-bit integers
vresult = _mm512_add_epi32(vresult2, v3_int);
_mm512_storeu_si512((void *) result, vresult);
printf("RESULTS USING SEQUENCE OF 3 INSTRUCTIONS: \n");
for (int j = 15; j >= 0; j--) {
    cout << result[j] << " ";
}
```

Find detailed descriptions of Intel AVX-512 intrinsics in the Intel® Intrinsics Guide. A detailed description of the VNNI instruction, as well as how it is implemented in the Intel® MKL-DNN library, can be found in the following white paper: Accelerate Lower Numerical Precision Inference with Intel® Deep Learning Boost.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804