Code Sample: Intel® AVX512-Deep Learning Boost: Intrinsic Functions

File(s): Download
License: 3-Clause BSD License
Optimized for...
Operating System: Linux*
Hardware: Second generation Intel® Xeon® Scalable processor
Software (Programming Language, tool, IDE, Framework): C++ Compiler version 19, Intel® Parallel Studio XE 2019
Prerequisites: Familiarity with C++

Basic Code Sample Using Intrinsic Functions

This is the first in a series of code samples developers can use to take advantage of the new Intel® AVX512-Deep Learning Boost (Intel® AVX512-DL Boost) instructions.

This code sample demonstrates the new functionality using intrinsic functions.

Intel® AVX512-DL Boost

Second generation Intel® Xeon® Scalable processors now include Intel® AVX512-DL Boost, which can improve the throughput of integer linear algebra. These instructions can accelerate inner loops in some convolutional neural networks (CNNs), because those loops multiply pairs of 8-bit (or 16-bit) integers and accumulate the results in a 32-bit integer variable.

Intel® AVX512-DL Boost includes a fused instruction that performs lower precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This single instruction replaces a sequence of three instructions from the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Fused-Multiply-Add (FMA) extensions. Figure 1 shows how the new Intel® AVX512-DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD.

[Figure 1 image: diagram of the VNNI instruction]
Figure 1. The Intel® AVX512-DL Boost instruction VPDPBUSD replaces the three separate FMA instructions VPMADDUBSW, VPMADDWD, and VPADDD to perform 8-bit multiplies with 32-bit accumulates. Image credit: Israel Hirsh and Bob Valentine.

Find a detailed description of both the Intel® AVX512-DL Boost fused instruction and the FMA-based instructions, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.
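
For intuition, the arithmetic that VPDPBUSD fuses can be written out in scalar form. The sketch below (a hypothetical helper, not part of the sample) shows what one 32-bit lane computes: four unsigned 8-bit by signed 8-bit products summed into a 32-bit accumulator. The saturating variant used later in this sample, VPDPBUSDS, additionally clamps the final sum to the signed 32-bit range.

#include <cstdint>

// Scalar model of one 32-bit lane of VPDPBUSD (illustrative only):
// acc += u[0]*s[0] + u[1]*s[1] + u[2]*s[2] + u[3]*s[3]
int32_t dpbusd_lane(int32_t acc, const uint8_t u[4], const int8_t s[4])
{
    int32_t sum = acc;
    for (int k = 0; k < 4; k++)
        sum += (int32_t) u[k] * (int32_t) s[k];  // widen before multiply-add
    return sum;
}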

Code Sample

This code sample uses Intel AVX-512 intrinsics to illustrate the use of both the Intel® AVX512-DL Boost fused instruction and the three equivalent FMA-based instructions.

Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:

#include <immintrin.h>

The Intel AVX-512 intrinsic functions use C data types as operands, representing the 512-bit registers used in the operations. The __m512i data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values. This code sample uses the following operands:

// Operands in memory; 64-byte alignment is required by _mm512_load_si512
alignas(64) int8_t    op1_int8[64];   // 64 x 8-bit (treated as unsigned by the dot product)
alignas(64) int8_t    op2_int8[64];   // 64 x 8-bit (signed)
alignas(64) int       op3_int[16];    // 16 x 32-bit accumulators
alignas(64) int16_t   op4_int16[32];  // 32 x 16-bit

// 512-bit register operands
__m512i  v1_int8;
__m512i  v2_int8;
__m512i  v3_int;
__m512i  v4_int16;
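
The operands must be initialized before the loads. The values below are illustrative (introduced here, not part of the original sample), with one exception: op4_int16 must hold 1 in every element, because it serves as the multiply-by-one operand of the VPMADDWD step in the three-instruction sequence shown later.

// Illustrative initialization (example values)
for (int i = 0; i < 64; i++) {
    op1_int8[i] = i % 4;   // interpreted as unsigned 8-bit by the dot product
    op2_int8[i] = -1;      // signed 8-bit
}
for (int i = 0; i < 16; i++)
    op3_int[i] = 0;        // 32-bit accumulators start at zero
for (int i = 0; i < 32; i++)
    op4_int16[i] = 1;      // must be all ones: multiply-by-1 in the FMA sequence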

Data from memory can be loaded into the registers using the _mm512_load_si512 function, which expects a 64-byte-aligned address:

(…)
v1_int8  = _mm512_load_si512(&op1_int8);
v2_int8  = _mm512_load_si512(&op2_int8);
v3_int   = _mm512_load_si512(&op3_int);
v4_int16 = _mm512_load_si512(&op4_int16);
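
If the arrays cannot be guaranteed to be 64-byte aligned, the unaligned variant is the safe choice; a one-line sketch with the same operands:

v1_int8 = _mm512_loadu_si512(&op1_int8);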

Once the data is loaded, perform the dot product operation using the fused instruction vpdpbusds, which is called via the intrinsic function _mm512_dpbusds_epi32. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in v1_int8 by the corresponding signed 8-bit integers in v2_int8, producing four intermediate signed 16-bit results. It then sums these four results with the corresponding 32-bit integer in v3_int using signed saturation, and returns the packed 32-bit results:

// PERFORM THE DOT PRODUCT OPERATION USING THE FUSED INSTRUCTION
__m512i result = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);
int *presult = (int *) &result;
printf("RESULTS USING FUSED INSTRUCTION: \n ");
for (int j = 15; j >= 0; j--)
    printf("%d ", presult[j]);
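
With the illustrative initialization sketched above (op1_int8 cycling 0, 1, 2, 3; op2_int8 all -1; op3_int all zero), every lane computes 0*(-1) + 1*(-1) + 2*(-1) + 3*(-1) = -6, so the loop prints sixteen values of -6.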

It is also possible to perform the same dot product operation using three separate FMA instructions vpmaddubsw, vpmaddwd and vpaddd (which are called using the intrinsic functions _mm512_maddubs_epi16, _mm512_madd_epi16 and _mm512_add_epi32, respectively):

// PERFORM THE DOT PRODUCT OPERATION USING A SEQUENCE OF 3 INSTRUCTIONS

// Multiply unsigned 8-bit integers in v1_int8 by signed 8-bit integers
// in v2_int8, then horizontally add adjacent pairs into signed,
// saturated 16-bit intermediate results
__m512i vresult1 = _mm512_maddubs_epi16(v1_int8, v2_int8);

// Widen to 32-bit and horizontally add adjacent 16-bit pairs by
// multiplying with a vector of ones (v4_int16)
__m512i vresult2 = _mm512_madd_epi16(vresult1, v4_int16);

// Add packed 32-bit integers
result = _mm512_add_epi32(vresult2, v3_int);

printf("RESULTS USING SEQUENCE OF 3 INSTRUCTIONS: \n ");
presult = (int*) &result; 
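
Both paths should print the same values for inputs like these, and a quick element-wise comparison can confirm it. The sketch below assumes the fused result was kept in a separate variable, result_fused (a name introduced here for illustration). Note that the two paths are not bit-identical in every case: vpmaddubsw saturates its 16-bit intermediate sums and vpdpbusds saturates the final 32-bit accumulate, so extreme operand values can produce different results.

// Compare the fused result against the three-instruction sequence
int *pfused = (int *) &result_fused;   // result of _mm512_dpbusds_epi32
int *pseq   = (int *) &result;         // result of the 3-instruction path
bool match = true;
for (int j = 0; j < 16; j++)
    if (pfused[j] != pseq[j])
        match = false;
printf("\n%s\n", match ? "MATCH" : "MISMATCH");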

Find detailed descriptions of Intel AVX-512 intrinsics in the Intel® Intrinsics Guide. A detailed description of the lower precision dot product operations, as well as their advantages in deep learning, can be found in this white paper: Lower Numerical Precision Deep Learning Inference and Training.
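
To try the sample, compile for a VNNI-capable target. As reasonable starting points rather than the sample's original build line: GCC accepts -mavx512vnni (together with -mavx512f and -mavx512bw), and the Intel® C++ Compiler 19 accepts -xCASCADELAKE.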
