# Code Sample: Intel® AVX512-Deep Learning Boost: Intrinsic Functions

By Alberto V., published on April 2, 2019

| File(s): | Download |
|---|---|
| License: | 3-Clause BSD License |
| **Optimized for...** | |
| Operating System: | Linux* |
| Hardware: | Second generation Intel® Xeon® Scalable processor |
| Software (Programming Language, tool, IDE, Framework): | C++ Compiler version 19, Intel® Parallel Studio XE 2019 |
| Prerequisites: | Familiarity with C++ |

## Basic Code Sample Using Intrinsic Functions

This is the first in a series of code samples developers can use to take advantage of the new Intel® AVX512-Deep Learning Boost (Intel® AVX512-DL Boost).

The code sample demonstrates the new functionality using intrinsic functions.

## Intel® AVX512-DL Boost

Second generation Intel® Xeon® Scalable processors now include Intel® AVX512-DL Boost, which can improve the throughput of integer linear algebra. These instructions can accelerate inner loops in some convolutional neural networks (CNN), because those loops multiply two 8-bit (or 16-bit) integers and accumulate the result into a 32-bit integer variable.
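To make that pattern concrete, here is a minimal scalar sketch (not taken from the sample itself; the function name is illustrative) of the kind of inner loop these instructions target:

```cpp
#include <cstdint>

// Illustrative scalar version of the loop pattern Intel AVX512-DL Boost
// accelerates: 8-bit multiplies accumulated into a 32-bit integer.
int32_t dot_u8_s8(const uint8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int k = 0; k < n; ++k)
        acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
    return acc;
}
```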

Intel® AVX512-DL Boost includes a fused instruction to perform lower precision (8-bit and 16-bit) multiplies with 32-bit accumulates. This instruction replaces a sequence of three instructions that are part of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Fused-Multiply-Add (FMA) Extensions. Figure 1 shows how the new instruction in Intel® AVX512-DL Boost **VPDPBUSD** replaces the three separate FMA instructions **VPMADDUBSW**, **VPMADDWD** and **VPADDD**.

Figure 1. Intel® AVX512-DL Boost instruction **VPDPBUSD** replaces the three separate FMA instructions **VPMADDUBSW**, **VPMADDWD** and **VPADDD** to perform 8-bit multiplies with 32-bit accumulates. Image credit to Israel Hirsh and Bob Valentine.

Find a detailed description of both the Intel® AVX512-DL Boost fused instruction and the FMA-based instructions, as well as the theoretical peak compute gains, in this white paper: Lower Numerical Precision Deep Learning Inference and Training.

## Code Sample

This code sample uses Intel AVX-512 intrinsics to illustrate the use of both the Intel® AVX512-DL Boost fused instruction and the three equivalent FMA-based instructions.

Find the prototypes for Intel AVX-512 intrinsics in the immintrin.h header file:

```cpp
#include <immintrin.h>
```

The Intel AVX-512 intrinsic functions use C data types as operands representing the 512-bit registers used in the operations. The **__m512i** data type can hold 64 8-bit integer values, 32 16-bit values, or 16 32-bit values. This code sample uses all three element widths:

```cpp
// 64-byte alignment is required by _mm512_load_si512 below
alignas(64) int8_t  op1_int8[64];
alignas(64) int8_t  op2_int8[64];
alignas(64) int     op3_int[16];
alignas(64) int16_t op4_int16[32];

__m512i v1_int8;
__m512i v2_int8;
__m512i v3_int;
__m512i v4_int16;
```
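One possible initialization (illustrative, not from the original sample) fills **op4_int16** with 1s, so that the multiply-by-1 step in the three-instruction sequence shown later leaves the 16-bit products unchanged:

```cpp
// Illustrative initialization of the operands (values are arbitrary)
for (int i = 0; i < 64; i++) {
    op1_int8[i] = 2;   // treated as unsigned 8-bit by the dot product
    op2_int8[i] = 3;   // signed 8-bit operand
}
for (int i = 0; i < 32; i++) op4_int16[i] = 1;   // multiply-by-1 constants
for (int i = 0; i < 16; i++) op3_int[i]  = 0;    // 32-bit accumulators
```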

Data from memory can be loaded into the registers using the **_mm512_load_si512** function.

```cpp
// (…)
v1_int8  = _mm512_load_si512(&op1_int8);
v2_int8  = _mm512_load_si512(&op2_int8);
v3_int   = _mm512_load_si512(&op3_int);
v4_int16 = _mm512_load_si512(&op4_int16);
```
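Note that **_mm512_load_si512** assumes the source address is 64-byte aligned. If alignment cannot be guaranteed, the standard unaligned Intel AVX-512 variant can be used instead, at a possible performance cost:

```cpp
// Unaligned alternative when 64-byte alignment is not guaranteed
v1_int8 = _mm512_loadu_si512(&op1_int8);
```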

Once the data is loaded, perform the dot product operation using the fused instruction **vpdpbusds**, which is called via the intrinsic function **_mm512_dpbusds_epi32**. This instruction multiplies groups of four adjacent pairs of unsigned 8-bit integers in **v1_int8** with the corresponding signed 8-bit integers in **v2_int8**, producing four intermediate signed 16-bit results. It then adds these four results to the corresponding 32-bit integer in **v3_int** using signed saturation, and returns the packed 32-bit results:

```cpp
// PERFORM THE DOT PRODUCT OPERATION USING FUSED INSTRUCTION
result  = _mm512_dpbusds_epi32(v3_int, v1_int8, v2_int8);
presult = (int*) &result;

printf("RESULTS USING FUSED INSTRUCTION: \n ");
for (int j = 15; j >= 0; j--)
    cout << presult[j] << " ";
```
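For clarity, the following scalar sketch (not part of the original sample; the function name is illustrative) spells out what **_mm512_dpbusds_epi32** computes for a single 32-bit lane **i**:

```cpp
#include <cstdint>

// Scalar reference for one 32-bit lane of _mm512_dpbusds_epi32:
// four u8 x s8 products are summed into the 32-bit accumulator
// with signed saturation.
int32_t dpbusds_lane(int32_t acc, const int8_t* a, const int8_t* b, int i) {
    int64_t sum = acc;
    for (int k = 0; k < 4; ++k)
        sum += int64_t(uint8_t(a[4 * i + k])) * int64_t(b[4 * i + k]);
    if (sum > INT32_MAX) sum = INT32_MAX;   // signed saturation
    if (sum < INT32_MIN) sum = INT32_MIN;
    return int32_t(sum);
}
```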

It is also possible to perform the same dot product operation using three separate FMA instructions **vpmaddubsw**, **vpmaddwd** and **vpaddd** (which are called using the intrinsic functions **_mm512_maddubs_epi16**, **_mm512_madd_epi16** and **_mm512_add_epi32**, respectively):

```cpp
// PERFORM THE DOT PRODUCT OPERATION USING A SEQUENCE OF 3 INSTRUCTIONS

// Vertically multiply pairs of 8-bit integers,
// then horizontally add adjacent pairs of 16-bit integers
__m512i vresult1 = _mm512_maddubs_epi16(v1_int8, v2_int8);

// Upconvert to 32-bit and horizontally add neighbors. Multiply by 1.
__m512i vresult2 = _mm512_madd_epi16(vresult1, v4_int16);

// Add packed 32-bit integers
result = _mm512_add_epi32(vresult2, v3_int);

printf("RESULTS USING SEQUENCE OF 3 INSTRUCTIONS: \n ");
presult = (int*) &result;
```
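One point worth noting: the two paths are not bit-identical in every case. **_mm512_maddubs_epi16** saturates its intermediate 16-bit sums, while **_mm512_dpbusds_epi32** saturates only at the final 32-bit accumulation, so extreme 8-bit inputs can produce different results. For modest values the two agree, as the following self-contained sketch illustrates (an assumed setup, compiled for example with `g++ -O2 -mavx512bw -mavx512vnni` on a VNNI-capable machine):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    alignas(64) int8_t  op1_int8[64], op2_int8[64];
    alignas(64) int     op3_int[16];
    alignas(64) int16_t op4_int16[32];

    for (int i = 0; i < 64; i++) { op1_int8[i] = 2; op2_int8[i] = 3; }
    for (int i = 0; i < 32; i++) op4_int16[i] = 1;
    for (int i = 0; i < 16; i++) op3_int[i] = 0;

    __m512i v1 = _mm512_load_si512(op1_int8);
    __m512i v2 = _mm512_load_si512(op2_int8);
    __m512i v3 = _mm512_load_si512(op3_int);
    __m512i v4 = _mm512_load_si512(op4_int16);

    // Fused DL Boost instruction vs. the three-instruction FMA sequence
    __m512i fused = _mm512_dpbusds_epi32(v3, v1, v2);
    __m512i seq   = _mm512_add_epi32(
        _mm512_madd_epi16(_mm512_maddubs_epi16(v1, v2), v4), v3);

    const int* pf = (const int*) &fused;
    const int* ps = (const int*) &seq;
    for (int j = 0; j < 16; j++)
        printf("lane %2d: fused=%d seq=%d\n", j, pf[j], ps[j]);
    // Each lane should print 24 (= 4 * 2 * 3) for both paths.
    return 0;
}
```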

Find detailed descriptions of Intel AVX-512 intrinsics in the Intel® Intrinsics Guide. A detailed description of the lower precision dot product operations, as well as their advantages in deep learning, can be found in this white paper: Lower Numerical Precision Deep Learning Inference and Training.