Overview: Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions
Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions extend Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) by promoting most of the 256-bit SIMD instructions to 512-bit numeric processing capability.
The Intel® AVX-512 instructions follow the same programming model as the Intel® AVX2 instructions, providing enhanced functionality for broadcast, embedded masking to enable predication, embedded floating-point rounding control, embedded floating-point fault suppression, scatter instructions, high-speed math instructions, and compact representation of large displacement values. Unlike Intel® SSE and Intel® AVX instructions, which cannot be mixed without performance penalties, Intel® AVX and Intel® AVX-512 instructions can be mixed without penalty.
Intel® AVX-512 intrinsics are supported on IA-32 and Intel® 64 architectures. They map directly to the new Intel® AVX-512 instructions and other enhanced 128-bit and 256-bit SIMD instructions.
Intel® AVX-512 Registers
512-bit register state is managed by the operating system using XSAVE/XSAVEOPT instructions, introduced in 45nm Intel® 64 processors (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).
- Support for sixteen new 512-bit SIMD registers in 64-bit mode (for a total of 32 SIMD registers, representing 2K of register space, ZMM0 through ZMM31).
- Support for eight new opmask registers (k0 through k7) used for conditional execution and efficient merging of destination operands.
Intel® AVX registers YMM0 through YMM15 map into Intel® AVX-512 registers ZMM0 through ZMM15, very much like Intel® SSE registers map into Intel® AVX registers. In processors with Intel® AVX-512 support, Intel® AVX and Intel® AVX2 instructions operate on the lower 128 or 256 bits of the first sixteen ZMM registers.
Prefix Instruction Encoding Support for Intel® AVX-512
A new encoding prefix, referred to as EVEX, supports additional vector-length encoding of up to 512 bits. The EVEX prefix builds upon the foundations of the VEX prefix to provide compact, efficient encoding for the functionality available with VEX encoding, while enhancing vector capabilities.
Data Types for Intel® AVX-512 Intrinsics
The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) intrinsics are located in the zmmintrin.h header file. To use these intrinsics, include the immintrin.h file as follows:
#include <immintrin.h>
The Intel® AVX-512 intrinsic functions use three C data types as operands, representing the registers used as operands to the intrinsic functions: the __m512, __m512d, and __m512i data types. The __m512 data type represents the contents of the extended SSE register, the ZMM register, used by the Intel® AVX-512 intrinsics. The __m512 data type can hold sixteen 32-bit single precision floating-point values. The __m512d data type can hold eight 64-bit double precision floating-point values. The __m512i data type can hold sixty-four 8-bit, thirty-two 16-bit, sixteen 32-bit, or eight 64-bit integer values.
The compiler aligns __m512, __m512d, and __m512i local and global data to 64-byte boundaries on the stack. To align integer, float, or double arrays, you can use the __declspec(align) declaration.
Intel® AVX-512 intrinsics have vector variants that use the __m512, __m512d, and __m512i data types.
Naming and Usage Syntax
Most Intel® AVX-512 intrinsic names use the following notational convention:
_mm512_[<mask>_]<operation>_<suffix>
The following list explains each item in the syntax.
- _mm512: Prefix representing the size of the largest vector in the operation, considering any of the parameters or the result.
- <mask>: When present, indicates a write-masked (_mask) or zero-masked (_maskz) form of the intrinsic.
- <operation>: Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.
- <suffix>: Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type; for example, ps for packed single precision floating-point and epi32 for extended packed 32-bit integers.
Programs can pack eight double precision or sixteen single precision floating-point numbers within the 512-bit vectors, as well as eight 64-bit or sixteen 32-bit integers. This enables processing of twice the number of data elements that Intel® AVX or Intel® AVX2 can process with a single instruction, and four times the number that Intel® SSE can.
Write-masking allows an intrinsic to perform its operation on selected SIMD elements of a source operand, blending the remaining elements from an additional SIMD operand. Consider the declarations below, where the write-mask k has a 1 in the even-numbered bit positions 0, 2, 4, 6, 8, 10, 12, and 14, and a 0 in the odd-numbered bit positions.
__m512 res, src, a, b;
__mmask16 k = 0x5555;
Then, given an intrinsic invocation such as this:
res = _mm512_mask_add_ps(src, k, a, b);
every even-numbered float32 element of the result res is computed as the sum of the corresponding elements in a and b, while every odd-numbered element is passed through (i.e., blended) from the corresponding float32 element in src.
Typical write-masked intrinsics are declared with a parameter order such that the values to be blended (src in the example above) are in the first parameter, and the write mask k immediately follows this parameter. Some intrinsics take the blended values from a different SIMD parameter, for example _mm512_mask2_permutex2var_epi32. In this case too, the mask follows that parameter.
Zero-masking is a simplified form of write-masking in which there are no blended values; instead, result elements corresponding to zero bits in the write mask are simply set to zero. Given:
res = _mm512_maskz_add_ps(k, a, b);
the float32 elements of res corresponding to zeros in the write-mask k are set to zero. The elements corresponding to ones in k hold the expected sum of the corresponding elements in a and b.
Zero-masked intrinsics are typically declared with the write-mask as the first parameter, as there is no parameter for blended values.