License: 3-Clause BSD License
Hardware: Second generation Intel® Xeon® Scalable processor
Software (Programming Language, tool, IDE, Framework): C++ Compiler version 19, Intel® Parallel Studio XE 2019
Prerequisites: Familiarity with C++
Basic Code Sample Using Intrinsic Functions
This is the first in a series of code samples developers can use to take advantage of the new Intel® AVX512-Deep Learning Boost (Intel® AVX512-DL Boost).
The code sample demonstrates how to exercise the new functionality using intrinsic functions.
Intel® AVX512-DL Boost
Second generation Intel® Xeon® Scalable processors now include Intel® AVX512-DL Boost, which can improve the throughput of integer linear algebra. These instructions can accelerate the inner loops of some convolutional neural networks (CNNs), because those loops multiply two 8-bit (or 16-bit) integers and accumulate the result into a 32-bit integer variable.
Intel® AVX512-DL Boost includes a fused instruction that performs lower precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This single instruction replaces a sequence of three instructions from the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Fused-Multiply-Add (FMA) extensions. Figure 1 shows how the new Intel® AVX512-DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD.
Figure 1. Intel® AVX512-DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD and VPADDD to perform 8-bit multiplies with 32-bit accumulates. Image credit to Israel Hirsh and Bob Valentine.
Find a detailed description of both the Intel® AVX512-DL Boost fused instruction and the FMA-based instructions, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.
This code sample uses Intel AVX-512 intrinsics to illustrate the use of both the Intel® AVX512-DL Boost fused instruction and the three equivalent FMA-based instructions.
Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:
The Intel AVX-512 intrinsic functions use C data types as operands representing the 512-bit registers used in the operations. The __m512i data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values. This code sample declares the following operands:
alignas(64) int8_t  op1_int8[64];   // 64 8-bit operands
alignas(64) int8_t  op2_int8[64];
alignas(64) int     op3_int[16];    // 16 32-bit accumulators
alignas(64) int16_t op4_int16[32];  // 32 16-bit values
__m512i v1_int8;
__m512i v2_int8;
__m512i v3_int;
__m512i v4_int16;
Data from memory can be loaded into the 512-bit registers using the _mm512_load_si512 function, which requires the source address to be 64-byte aligned.
(…)
v1_int8  = _mm512_load_si512(&op1_int8);
v2_int8  = _mm512_load_si512(&op2_int8);
v3_int   = _mm512_load_si512(&op3_int);
v4_int16 = _mm512_load_si512(&op4_int16);
Once the data is loaded, perform the dot product operation using the fused instruction vpdpbusds, which is called via the intrinsic function _mm512_dpbusds_epi32. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in v1_int8 with the corresponding signed 8-bit integers in v2_int8, producing four intermediate signed 16-bit results. It then adds these four results to the corresponding 32-bit integer in v3_int using signed saturation, and returns the packed 32-bit results:
// PERFORM THE DOT PRODUCT OPERATION USING FUSED INSTRUCTION
result = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);
presult = (int*) &result;
printf("RESULTS USING FUSED INSTRUCTION: \n ");
for (int j = 15; j >= 0; j--)
    cout << presult[j] << " ";
It is also possible to perform the same dot product operation using three separate FMA instructions vpmaddubsw, vpmaddwd and vpaddd (which are called using the intrinsic functions _mm512_maddubs_epi16, _mm512_madd_epi16 and _mm512_add_epi32, respectively):
// PERFORM THE DOT PRODUCT OPERATION USING A SEQUENCE OF 3 INSTRUCTIONS
// Vertically multiply two 8-bit integers,
// then horizontally add adjacent pairs of 16-bit integers
__m512i vresult1 = _mm512_maddubs_epi16(v1_int8, v2_int8);
// Upconvert to 32-bit and horizontally add neighbors. Multiply by 1.
__m512i vresult2 = _mm512_madd_epi16(vresult1, v4_int16);
// Add packed 32-bit integers
result = _mm512_add_epi32(vresult2, v3_int);
printf("RESULTS USING SEQUENCE OF 3 INSTRUCTIONS: \n ");
presult = (int*) &result;
Find detailed descriptions of Intel AVX-512 intrinsics in the Intel® Intrinsics Guide. A detailed description of the lower precision dot product operations, as well as their advantages in deep learning, can be found in this white paper: Lower Numerical Precision Deep Learning Inference and Training.