Developer Guide and Reference

Contents

Overview: Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions

Functional Overview

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Instructions extend Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) by promoting most of the 256-bit SIMD instructions to 512-bit numeric processing capability.
The Intel® AVX-512 instructions follow the same programming model as the Intel® AVX2 instructions, providing enhanced functionality for broadcast, embedded masking to enable predication, embedded floating-point rounding control, embedded floating-point fault suppression, scatter instructions, high-speed math instructions, and compact representation of large displacement values. Unlike Intel® SSE and Intel® AVX instructions, which cannot be mixed without performance penalties, Intel® AVX and Intel® AVX-512 instructions can be mixed without penalty.
Intel® AVX-512 intrinsics are supported on IA-32 and Intel® 64 architectures built from 32nm process technology. They map directly to the new Intel® AVX-512 instructions and other enhanced 128-bit and 256-bit SIMD instructions.
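Several of these capabilities appear directly in the intrinsic interface. The following sketch is illustrative only (the function name is hypothetical) and assumes a compiler targeting Intel® AVX-512; it combines a write-masked addition with an addition that uses embedded rounding control.
#include <immintrin.h>

/* Illustrative sketch: embedded masking and embedded rounding control
   expressed through intrinsics. */
__m512 masked_and_rounded_add(__m512 src, __mmask16 k, __m512 a, __m512 b)
{
    /* Write-masking: elements whose mask bit is 0 are taken unchanged from src. */
    __m512 blended = _mm512_mask_add_ps(src, k, a, b);

    /* Embedded rounding: round to nearest with exceptions suppressed,
       overriding the MXCSR rounding mode for this one instruction. */
    __m512 rounded = _mm512_add_round_ps(a, b,
        _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

    return _mm512_add_ps(blended, rounded);
}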

Intel® AVX-512 Registers

512-bit register state is managed by the operating system using the XSAVE/XRSTOR/XSAVEOPT instructions, which were introduced in 45nm Intel® 64 processors (see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 2B and 3A).
  • Support for sixteen new 512-bit SIMD registers in 64-bit mode (for a total of 32 SIMD registers, ZMM0 through ZMM31, representing 2K of register space).
  • Support for eight new opmask registers (k0 through k7) used for conditional execution and efficient merging of destination operands.
The Intel® AVX registers YMM0 through YMM15 map into the Intel® AVX-512 registers ZMM0 through ZMM15, much as the Intel® SSE registers map into the Intel® AVX registers. In processors with Intel® AVX-512 support, Intel® AVX and Intel® AVX2 instructions operate on the lower 128 or 256 bits of the first sixteen ZMM registers.
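The opmask registers surface in source code as mask types such as __mmask16, produced by compare intrinsics and consumed by masked operations. The following sketch is illustrative only (the function name is hypothetical); the compiler allocates one of the opmask registers for the mask value.
#include <immintrin.h>

/* Illustrative sketch: produce a per-element mask and use it to merge
   elements of one vector into another. */
__m512 merge_where_less(__m512 dst, __m512 a, __m512 b)
{
    /* One mask bit per 32-bit element: bit i is set where a[i] < b[i]. */
    __mmask16 k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OS);

    /* Copy a[i] into the result where bit i is set; keep dst[i] elsewhere. */
    return _mm512_mask_mov_ps(dst, k, a);
}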

Prefix Instruction Encoding Support for Intel® AVX-512

A new encoding prefix, referred to as EVEX, supports vector length encoding of up to 512 bits. The EVEX prefix builds on the foundations of the VEX prefix to provide a compact, efficient encoding for the functionality available with VEX encoding, while extending vector processing capabilities.
The Intel® AVX-512 intrinsic functions use three C data types as operands, representing the new registers used as operands to the intrinsic functions: __m512, __m512d, and __m512i. The __m512 data type represents the contents of the extended SSE register, the ZMM register, used by the Intel® AVX-512 intrinsics. The __m512 data type can hold sixteen 32-bit single-precision floating-point values. The __m512d data type can hold eight 64-bit double-precision floating-point values. The __m512i data type can hold sixty-four 8-bit, thirty-two 16-bit, sixteen 32-bit, or eight 64-bit integer values.
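A minimal sketch of the three data types and their element counts is shown below; the variable names are arbitrary.
#include <immintrin.h>

/* Illustrative sketch: 512-bit data types and their element counts. */
void vector_type_examples(void)
{
    __m512  f32x16 = _mm512_set1_ps(1.0f);  /* sixteen 32-bit single-precision values */
    __m512d f64x8  = _mm512_set1_pd(1.0);   /* eight 64-bit double-precision values   */
    __m512i i32x16 = _mm512_set1_epi32(1);  /* sixteen 32-bit integers                */
    __m512i i64x8  = _mm512_set1_epi64(1);  /* eight 64-bit integers                  */

    (void)f32x16; (void)f64x8; (void)i32x16; (void)i64x8;
}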
The compiler aligns __m512, __m512d, and __m512i local and global data to 64-byte boundaries on the stack. To align integer, float, or double arrays, use the __declspec(align) statement.
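The sketch below is illustrative only (array and function names are arbitrary): it aligns two float arrays to 64-byte boundaries with __declspec(align) so that aligned 512-bit loads and stores can be used. Other compilers may spell the alignment attribute differently.
#include <immintrin.h>

/* Illustrative sketch: 64-byte alignment for aligned 512-bit memory access. */
__declspec(align(64)) float input[16];
__declspec(align(64)) float output[16];

void scale_by_two(void)
{
    __m512 v = _mm512_load_ps(input);           /* aligned 64-byte load  */
    v = _mm512_mul_ps(v, _mm512_set1_ps(2.0f));
    _mm512_store_ps(output, v);                 /* aligned 64-byte store */
}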

Data Types for Intel® AVX-512 Intrinsics

The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) intrinsics are located in the zmmintrin.h header file. To use these intrinsics, include the immintrin.h file as follows:
#include <immintrin.h>
Intel® AVX-512 intrinsics have vector variants that use the __m128, __m128i, __m128d, __m256, __m256i, __m256d, __m512, __m512i, and __m512d data types.
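For example, on processors that also support Intel® AVX-512VL, masked forms of the intrinsics are available at 128-bit and 256-bit vector widths. The sketch below is illustrative only (the function name is hypothetical).
#include <immintrin.h>

/* Illustrative sketch: a 256-bit vector variant of a masked intrinsic.
   Requires Intel® AVX-512VL in addition to Intel® AVX-512F. */
__m256 masked_add_256(__m256 src, __mmask8 k, __m256 a, __m256 b)
{
    /* Eight 32-bit elements, so an 8-bit mask selects the lanes to add;
       unselected lanes are taken from src. */
    return _mm256_mask_add_ps(src, k, a, b);
}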

Naming and Usage Syntax

Most Intel® AVX-512 intrinsic names use the following notational convention:
_mm512[_<maskprefix>]_<intrin_op>_<suffix>
The following list explains each item in the syntax.
  • _mm512: Prefix representing the size of the largest vector in the operation, considering any of the parameters or the result.
  • <maskprefix>: When present, indicates write-masked (_mask) or zero-masked (_maskz) predication.
  • <intrin_op>: Indicates the basic operation of the intrinsic; for example, add for addition and sub for subtraction.
  • <suffix>: Denotes the type of data the instruction operates on. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters and numbers denote the type, with notation as follows:
  • s: single-precision floating point
  • d: double-precision floating point
  • i512: signed 512-bit integer
  • i256: signed 256-bit integer
  • i128: signed 128-bit integer
  • i64: signed 64-bit integer
  • u64: unsigned 64-bit integer
  • i32: signed 32-bit integer
  • u32: unsigned 32-bit integer
  • i16: signed 16-bit integer
  • u16: unsigned 16-bit integer
  • i8: signed 8-bit integer
  • u8: unsigned 8-bit integer
Programs can pack eight double-precision and sixteen single-precision floating-point numbers within the 512-bit vectors, as well as eight 64-bit and sixteen 32-bit integers. This enables processing of twice the number of data elements that Intel® AVX or Intel® AVX2 can process with a single instruction, and four times the number that Intel® SSE can process.
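The naming convention can be read off directly from a few intrinsics. The sketch below is illustrative only (variable names are arbitrary); each call is annotated with its decomposition.
#include <immintrin.h>

/* Illustrative sketch: intrinsic names read against the
   _mm512[_<maskprefix>]_<intrin_op>_<suffix> convention. */
void naming_examples(__m512 a, __m512 b, __m512 src, __mmask16 k,
                     __m512i ia, __m512i ib)
{
    __m512  r1 = _mm512_add_ps(a, b);              /* add, packed single (ps): sixteen floats  */
    __m512  r2 = _mm512_mask_add_ps(src, k, a, b); /* _mask: write-masked, blends from src     */
    __m512  r3 = _mm512_maskz_add_ps(k, a, b);     /* _maskz: zero-masked, zeroes unselected   */
    __m512i r4 = _mm512_add_epi32(ia, ib);         /* add, extended packed 32-bit ints (epi32) */

    (void)r1; (void)r2; (void)r3; (void)r4;
}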

Example: Write-Masking

Write-masking allows an intrinsic to perform its operation on selected SIMD elements of a source operand, with blending of the other elements from an additional SIMD operand. Consider the declarations below, where the write-mask