AVX-512 instructions

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD support. Programs can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction and four times that of SSE.
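The element counts quoted above follow directly from dividing the register width by the element width. A minimal sketch of that arithmetic (the helper name `lanes` and the two tables are illustrative, not part of any Intel API):

```python
# Register widths in bits for each SIMD generation mentioned in the text.
REGISTER_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}
# Element widths in bits for the data types mentioned in the text.
ELEMENT_BITS = {"float64": 64, "float32": 32, "int64": 64, "int32": 32}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    return register_bits // element_bits


for isa, width in REGISTER_BITS.items():
    counts = {name: lanes(width, bits) for name, bits in ELEMENT_BITS.items()}
    print(f"{isa:8s} {width:3d} bits: {counts}")
```

For any element type, the 512-bit count is twice the 256-bit count and four times the 128-bit count, which is the "twice AVX/AVX2, four times SSE" relationship stated above.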

Intel AVX-512 instructions are important because they offer higher performance for the most demanding computational tasks. Intel AVX-512 instructions offer the highest degree of compiler support by including an unprecedented level of richness in the design of the instructions. Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating-point data or packed integer data, embedded rounding controls (which override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high-speed math instructions, compact representation of large displacement values, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2 KB (32 × 64 bytes) of register space!
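To give a feel for what the dedicated mask registers do, here is a small plain-Python model of a masked element-wise add. It is only a sketch of the semantics (one mask bit per element, with either the old destination value or zero kept where the bit is clear); the function name `masked_add` and its parameters are made up for illustration and say nothing about how hardware executes the operation.

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Model of an element-wise add under a per-element write mask.

    Bit i of `mask` controls element i. Where the bit is 0, the result keeps
    the old destination element (merge-masking) or becomes 0 (zero-masking).
    """
    out = []
    for i, (d, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        else:
            out.append(0 if zeroing else d)
    return out


dst = [100] * 8          # previous contents of the destination
a = list(range(8))       # 0, 1, ..., 7
b = [10] * 8

print(masked_add(dst, a, b, mask=0b00001111))
# [10, 11, 12, 13, 100, 100, 100, 100]
print(masked_add(dst, a, b, mask=0b00001111, zeroing=True))
# [10, 11, 12, 13, 0, 0, 0, 0]

# Register space: 32 registers of 512 bits each, expressed in bytes.
ZMM_REGISTER_BYTES = 32 * 512 // 8
print(ZMM_REGISTER_BYTES)  # 2048
```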

Intel AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike SSE and AVX, which cannot be mixed without performance penalties, AVX and Intel AVX-512 instructions can be mixed without penalty. AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15, very much like SSE registers map into AVX registers. Therefore, in processors with Intel AVX-512 support, AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.

The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.

Intel AVX-512 in Intel products

Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing. Intel AVX-512 brings the capabilities of 512-bit vector operations, first seen in the first Xeon Phi Coprocessors (previously code named Knights Corner), into the official Intel instruction set in a way that can be utilized in processors as well. Intel AVX-512 offers some improvements and refinement over the 512-bit SIMD found on Knights Corner that I've seen bring smiles to compiler writers and application developers alike. This is done in a way that offers source code compatibility for almost all applications with a simple recompile or relinking to libraries with Knights Landing support.

Intel AVX-512 Instruction encodings

Intel AVX instructions use the VEX prefix, while Intel AVX-512 instructions use the EVEX prefix, which is one byte longer. The EVEX prefix enables the additional functionality of Intel AVX-512. In general, if the extra capabilities of the EVEX prefix are not needed, then the AVX2 instructions can be used, coded using the VEX prefix, saving a byte in certain cases. Such optimizations can be done automatically in compiler code generators or assemblers.

Emulation for Testing, Prior to Product

In order to help with testing of support, before Knights Landing is available, the Intel® Software Development Emulator (Intel® SDE) has been extended for Intel AVX-512 and is available at http://www.intel.com/software/sde.

Innovation Beyond Intel AVX-512

Intel AVX-512 foundation instructions will be included in all implementations of Intel AVX-512. Products may also include capabilities that extend Intel AVX-512 and have distinct CPUID bits for detection. Knights Landing will support three sets of capabilities to augment the foundation instructions. These are documented in the programmer’s guide and are known as Intel AVX-512 Conflict Detection Instructions (CDI), Intel AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations, and new prefetch capabilities, respectively.
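As a rough illustration of what "distinct CPUID bits" means in practice, the sketch below decodes a raw EBX value from CPUID leaf 7 (sub-leaf 0) into feature flags. The bit positions are the ones published for that leaf and should be checked against the programming reference before use; the helper names and the sample EBX value are invented for the example, and Python cannot execute CPUID itself, so the raw value is assumed to come from elsewhere.

```python
# Bit positions in CPUID.(EAX=07H, ECX=0):EBX.
# Assumption: verify these against the programming reference before relying on them.
AVX512_FEATURE_BITS = {
    "AVX512F": 16,   # foundation instructions
    "AVX512PF": 26,  # prefetch instructions (PFI)
    "AVX512ER": 27,  # exponential and reciprocal instructions (ERI)
    "AVX512CD": 28,  # conflict detection instructions (CDI)
}


def decode_avx512_features(ebx: int) -> dict:
    """Map each feature name to True/False according to its bit in EBX."""
    return {name: bool((ebx >> bit) & 1) for name, bit in AVX512_FEATURE_BITS.items()}


def can_use(extension: str, features: dict) -> bool:
    """An extension is only usable when the foundation is also present."""
    return features["AVX512F"] and features[extension]


# Made-up register value: foundation + conflict detection only.
sample_ebx = (1 << 16) | (1 << 28)
features = decode_avx512_features(sample_ebx)
print(features)
print(can_use("AVX512CD", features), can_use("AVX512ER", features))
```

A real check also has to confirm that the operating system has enabled the wider register state; that step is left out here.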

Intel AVX-512 support

Release of detailed information on Intel AVX-512 helps enable support in tools and operating systems by the time products appear. We are working with both open source projects and tool vendors to help incorporate support. The Intel compilers, libraries, and analysis tools have been, or will be, updated to provide first-class Intel AVX-512 support.

Intel AVX-512 documentation

The Intel AVX-512 instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference (see the "Getting Started" tab at http://software.intel.com/en-us/intel-isa-extensions). Intel AVX-512 is covered in Chapters 2-7; Chapters 5 and 6 detail the Intel AVX-512 foundation instructions while Chapter 7 details the capabilities that extend Intel AVX-512.


Comments

c0d1f1ed

"Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing"

Will this future Xeon processor actually be a socketed MIC, or a CPU (more precisely Skylake)? Is it coming to consumer CPUs in a similar timeframe? Developers will want to know, to determine whether to adopt AVX2+ or heterogeneous computing.

iliyapolak

Since those 32 ZMM registers appear to be architectural registers, it would be interesting to know how many internal physical registers an AVX-512 implementation has.

Agner

AVX-512 - where do we go next?

The forthcoming AVX512 instruction set is one of the biggest steps in the
history of the x86 instruction set. It doubles the size of the vector registers
for the third time; it doubles the number of vector registers; it defines a new
way of doing masked vector operations; and it adds hundreds of new instructions.
This is a time for reflection and for planning - where are we going? What is the next step?
Will the vector size keep growing exponentially?

The first question is what about the Knights Corner instruction set? It is
very similar to AVX512. Both Knights Corner and AVX512 are backwards compatible,
but these two extensions are not compatible with each other. AVX512 instructions
and Knights Corner instructions differ by a single prefix bit, even for
otherwise identical instructions. I am sure that the market will not accept a
forking of the instruction set, so my guess is that the Knights Corner
instruction set is a dead end street. It doesn't even have an official name and
it doesn't have a CPUID bit. We just need official confirmation from Intel that
the Knights Corner instruction set is being phased out, and we can forget about it.
AVX512 is better than Knights Corner, so I am quite happy with this.

Next, I want to comment on the new mask registers k0 - k7. Why add the
complication of yet another register type - actually two new register types - we
also have the new bounds registers. Why not use existing registers for masks? If
we made eight of the general purpose registers available as mask registers we
would have all the functionality we could wish for free without adding new
instructions: logical operations, shift, rotate, bit scan, bit test, table
lookup. Or we could use some of the vector registers for masks, now that we have
32 of them. With a new type of registers we have to move data back and forth between the different
register types.

I would also like some more foresight here. When the 64-bit mmx registers
were replaced by 128-bit xmm registers, nobody thought about preparing for the
predictable next extension. The consequence of this lack of foresight is that we
now have the complication of two versions of all xmm instructions and three
states of the ymm register file. We have to issue a vzeroupper before every call
and return to an ABI-compliant function, or alternatively make two versions of all
library functions, with and without VEX.

We should learn from history and prepare well for future extensions. The EVEX
prefix has room for a future extension to 1024 bits, perhaps more, and all
manuals say that we will have zero-extension of the vector registers into future
larger registers. But we still have the very compelling problem of how to save
and restore a register when the size of the register may grow in the future.
There is currently a discussion going on about whether some of the new registers
zmm16 - zmm31 should have callee save status, but the problem is that the
register size is expected to grow further, see

http://gcc.gnu.org/ml/gcc/2013-07/threads.html#00332

How long will the vector size keep growing exponentially? In the link above
it is argued that this growth is subject to diminishing returns. Few
applications will be able to utilize the bigger vectors. Cache throughput will
be the bottleneck. And there will be more to save and restore on task switches.
But I suspect that somebody will keep expanding the registers beyond what is
technologically sound in an attempt to obey Moore's law.

We definitely need an instruction to save a full vector register, whatever
its size. I am imagining an instruction similar to movdqa or movdqu which is
guaranteed to save the whole register, even if the size should increase in the
future. This instruction should read, write or copy a vector register to the
maximum size supported by the CPU and enabled by the OS. Of course we would need
a CPUID function telling what the maximum size is. This would be useful for
interrupt handlers, device drivers, etc. that want to save and restore a few
registers without spending hundreds of clock cycles on saving everything. Such a
function could simply allocate the necessary space on the stack and save the
full register there. It would also be useful for memcpy functions etc. which
could use the largest available register size without updating the software.

This problem is even more pressing for the mask registers k0 - k7. The manuals
say that these registers are 64 bits, but there is no instruction to read or
write more than 16 bits of these mask registers. If it is decided that some of
the mask registers should have callee save status then we definitely need a way
to save all 64 bits because it can be predicted that these bits will be used
in the future. Likewise, a device handler would need a way to save and restore the full
mask register. This problem could easily be solved by allowing the W bit or L
bit in the KMOVW instruction to specify that the instruction should move the
maximum available number of bits up to 64. But this has to be done before AVX512
is implemented and before the ABI is written.

All EVEX-coded instructions are at least 6 bytes long, and often more. With a
maximum throughput of four instructions per clock, it is obvious that the
current instruction fetch rate of 16 bytes per clock in Intel processors is insufficient. We need to
increase the instruction fetch rate or reduce the average instruction length,
preferably both. The microop cache helps, but its capacity is limited. The offset
multiplier also helps reduce the size of EVEX instructions, but I suspect that
only a small fraction of instructions can be made shorter by this trick. In
the post below I discuss various possible ways of reducing the average instruction
length in the future, and whether it is possible to double the number of general purpose
registers in a future instruction set.
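The fetch-rate argument can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the comment (16 bytes fetched per clock, four instructions per clock, a 6-byte minimum for EVEX-coded instructions); the function names are illustrative, and the model ignores the micro-op cache and every other front-end detail.

```python
FETCH_BYTES_PER_CLOCK = 16       # figure quoted in the comment
INSTRUCTIONS_PER_CLOCK = 4       # figure quoted in the comment
MIN_EVEX_INSTRUCTION_BYTES = 6   # 4-byte prefix + opcode byte + ModR/M byte


def bytes_needed(instruction_bytes: int, per_clock: int = INSTRUCTIONS_PER_CLOCK) -> int:
    """Bytes that must be fetched per clock to sustain `per_clock` instructions."""
    return instruction_bytes * per_clock


def sustainable_rate(instruction_bytes: int, fetch: int = FETCH_BYTES_PER_CLOCK) -> float:
    """Instructions per clock that the fetch bandwidth alone can deliver."""
    return fetch / instruction_bytes


print(bytes_needed(MIN_EVEX_INSTRUCTION_BYTES))      # 24
print(sustainable_rate(MIN_EVEX_INSTRUCTION_BYTES))  # about 2.67
```

Even with the shortest possible EVEX instructions, four per clock would need 24 bytes, which is more than the 16 bytes available, so fetch alone caps throughput below three such instructions per clock.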

www.agner.org

Agner

Possible ways to make the x86 instruction set more efficient

There are now far more than a thousand logically different instructions in
the x86 instruction set, and even the extended opcode maps are being overloaded.
Just adding new instructions doesn't seem to be the way to go. Memory and cache
access is the bottleneck in most applications. For CPU-intensive applications,
the instruction fetch rate is often a bottleneck. The x86 instruction set still
has an advantage over RISC architectures because the code is more compact and
this improves the utilization of the instruction cache. Some of the most
frequently used instructions are only one byte long. But the instruction length
gets longer with the 4-byte EVEX prefix. There are still things that can be
done to reduce instruction lengths, and I will discuss them here.

Most instructions with an immediate constant operand have a short form with a
sign-extended 8-bit constant. But not the most common of them all, the MOV
instruction. To move a constant into a 32-bit or 64-bit register or memory
operand you need a 32-bit immediate operand. The reason why there is no 8-bit
short form is that it was not needed in the original 16-bit 8086 instruction
set. A short form mov instruction with an 8-bit sign-extended operand would
reduce code size significantly. There are no unused single-byte opcodes in
32-bit mode, and only a few unused 2-byte opcodes. So there is no point in
making such an instruction in 32-bit mode. But there are around 20 unused
single-byte opcodes in 64-bit mode, so it would make sense to make such an
instruction in 64-bit mode only. The instruction would have one opcode byte, a
mod/reg/rm byte, and one immediate data byte.

Another common instruction that could use a short form is the call
instruction. The call instruction always has a 32-bit signed displacement. A
short version of the call instruction with a 16-bit signed displacement would
also be useful for reducing code size. Again, this should use a single-byte
opcode in 64 bit mode, and not be available in 32-bit mode. A compiler could use
the short form where it is known that the target is within +/- 32 kbytes, e.g.
for calls within the same module and where whole-program optimization is used.
It would even be possible to define a tiny memory model to use if the code
segment is less than 32 kbytes and resolve the 16-bit addresses at link time.
The commonly used object file formats (PE/COFF, OMF, ELF, Mach-O) all support
16-bit self-relative fixups.
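The choice a compiler or assembler would face with such a short call can be sketched as a simple range check. The 3-byte short form is the hypothetical instruction proposed above (one opcode byte plus a 16-bit displacement), not an existing 64-bit encoding; the 5-byte form is the existing near call with a 32-bit displacement. Function names are illustrative.

```python
CALL_REL32_BYTES = 5  # existing form: 1 opcode byte + 32-bit displacement
CALL_REL16_BYTES = 3  # proposed (hypothetical) form: 1 opcode byte + 16-bit displacement


def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a signed integer of `bits` bits."""
    limit = 1 << (bits - 1)
    return -limit <= value < limit


def call_size(call_address: int, target_address: int) -> int:
    """Size in bytes of the shortest call encoding that reaches the target.

    The displacement is measured from the end of the call instruction,
    so the short form is tried first using its own length.
    """
    displacement = target_address - (call_address + CALL_REL16_BYTES)
    if fits_signed(displacement, 16):
        return CALL_REL16_BYTES
    return CALL_REL32_BYTES


print(call_size(0x1000, 0x2000))    # 3: target is about 4 KB away
print(call_size(0x1000, 0x200000))  # 5: target is about 2 MB away
```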

It would also be nice to have an instruction that puts a 32-bit immediate
constant into a vector register. The reg bits could indicate where to put the
constant, e.g. 0: first dword, 1: second dword, .., 4: all odd dwords, 5: all
even dwords, 7: all dwords. The rest of the register would be zero. This
instruction would be useful for inserting bit masks to manipulate sign bits and
to insert simple numbers. For example, putting 0x3FF40000 into the odd dwords
would broadcast the double precision constant 1.25. This instruction would
reduce the load on the data cache without increasing the use of the code cache.
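The bit pattern for such a constant can be worked out with the standard `struct` module. The sketch below computes the upper 32 bits of the IEEE 754 double 1.25 and then models the proposed "all odd dwords" placement on a 512-bit register, assuming little-endian layout so that each odd dword is the high half of a double. The instruction itself is the commenter's proposal and does not exist; the function names are illustrative.

```python
import struct


def high_dword_of(value: float) -> int:
    """Upper 32 bits of the IEEE 754 double-precision encoding of `value`."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return bits >> 32


def broadcast_odd_dwords(imm32: int, total_dwords: int = 16) -> list:
    """Model: put `imm32` in every odd dword, zero elsewhere, read back as doubles."""
    dwords = [imm32 if i % 2 else 0 for i in range(total_dwords)]
    raw = struct.pack(f"<{total_dwords}I", *dwords)
    return list(struct.unpack(f"<{total_dwords // 2}d", raw))


print(hex(high_dword_of(1.25)))           # 0x3ff40000
print(broadcast_odd_dwords(0x3FF40000))   # eight copies of 1.25
```

This works for 1.25 because its lower 32 mantissa bits are all zero; constants that need those bits could not be expressed with a single 32-bit immediate.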

AVX512 doubles the number of vector registers in 64-bit mode (but not in
32-bit mode). It is natural to ask if the number of general purpose registers
can also be doubled. I have explored what it would require to do so. The 4-byte
EVEX prefix could in principle be applied to any legacy instruction in order to
extend the number of registers. The mm and pp bits replace previous prefixes and
escape codes, and the 1-byte opcode page should obviously be coded as mm = 0.
The EVEX prefix contains the R' and V' bits to extend the R and V bits, but it
is missing B' and X' bits to extend the B and X bits of the REX prefix. So we
are missing two bits. The four V bits can be used for this purpose on
instructions with two operands. However, if we want to cover all cases,
including instructions with three operands and a SIB byte, we need two more
bits. We can use the bit that distinguishes the EVEX prefix from the MVEX prefix
used by the Knights Corner instruction set, since the latter is unlikely to have
a future, and the two instruction sets cannot run on the same CPU anyway. The
last missing bit can be obtained by supplementing the 62 (hexadecimal) EVEX byte
with the unused 60 byte and use bit 1 of the 60/62 pair as the last register
extension bit. In this way we will still have two unused mm bits for future
extensions. (The 61 byte can also be reserved for future extensions).

A four-byte EVEX prefix makes instructions rather long, so it would be good
to have a short form EVEX prefix for legacy instructions that don't need all the
AVX512 option bits. A short form EVEX prefix should ideally have the following
option bits: W, R, B, X, R', B', X', mm and pp. This is 11 bits. We want the
short EVEX prefix to be two bytes. This would require 8 different bytes for the
first byte to encode 3 bits (2^3 = 8) and the remaining 8 bits go in the second
byte. While there are enough unused bytes in 64-bit mode, there is no natural
octet. There are several quartets: (06,07,16,17) or (16,17,1E,1F) or
(27,2F,37,3F). Combining two of these quartets would be too clumsy. Thus, we are
one bit short of being able to fulfill all needs for legacy non-VEX instructions
with a 2-byte EVEX prefix. We have to make compromises. The following possible
compromises would save one bit:

  1. Omit pp[1]. This would cover many cases, but not the scalar floating point
    instructions, such as ADDSD.
  2. Omit mm[1]. This would cover all legacy instructions up to SSE3, but not
    SSSE3 and later.
  3. Omit B'. We would be unable to use register r16-r31 as base if there is a SIB
    byte, and we would have to use X' for extending the B bit in other cases.
  4. Omit X'. We would be unable to use register r16-r31 as index if there is a
    SIB byte (but if the scale factor is 1 and the base register is less than r16 we could
    circumvent the problem by swapping base and index).

I think that options 2 and 4 are the most acceptable. With one of these
solutions, we could use a short 2-byte EVEX prefix for extending the number of
general purpose and vector registers in most cases of legacy non-VEX
instructions. We could even use it for some AVX512 instructions. Most legacy VEX
instructions would need the 4-byte EVEX prefix for extending the number of
registers.
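The bit budget behind this "one bit short" conclusion can be tallied explicitly. The sketch below only restates the counting in the text: eleven wanted option bits, against a two-byte prefix whose first byte is chosen from a group of unused opcodes. The field table and function name are illustrative.

```python
# Option bits the short prefix should ideally carry, as listed in the text.
WANTED_FIELD_BITS = {
    "W": 1, "R": 1, "B": 1, "X": 1,
    "R'": 1, "B'": 1, "X'": 1,
    "mm": 2, "pp": 2,
}

# Each listed compromise drops exactly one bit.
COMPROMISES = {1: "pp[1]", 2: "mm[1]", 3: "B'", 4: "X'"}


def prefix_capacity(first_byte_values: int, prefix_bytes: int = 2) -> int:
    """Option bits a prefix can carry.

    Choosing among `first_byte_values` possible first bytes encodes
    floor(log2(first_byte_values)) bits; every further byte carries 8 bits.
    """
    return (first_byte_values.bit_length() - 1) + 8 * (prefix_bytes - 1)


wanted = sum(WANTED_FIELD_BITS.values())
print(wanted)               # 11
print(prefix_capacity(8))   # 11: an octet of first bytes would be enough
print(prefix_capacity(4))   # 10: a quartet leaves the prefix one bit short
for number, dropped in COMPROMISES.items():
    print(number, dropped, wanted - 1 <= prefix_capacity(4))
```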

The two-byte EVEX prefix could use the compressed displacement address mode,
thereby even making some legacy instructions shorter. The compressed
displacement is a complication and possible source of error in compilers and
assemblers because they have to find the right multiplier for each instruction.
But we can expect this problem to be solved anyway for the sake of AVX512, and
it can potentially reduce the size of some instructions by 3 bytes.
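The compressed displacement rule can be sketched as follows: a displacement is stored in one byte when it is an exact multiple of the instruction's multiplier N and the quotient fits in a signed byte, otherwise four bytes are needed. How N is chosen for each instruction is exactly the complication mentioned above and is not modelled; here it is simply a parameter, with N = 64 used as an example for a full 512-bit memory operand. The function name is illustrative, and the zero-displacement case is ignored.

```python
def displacement_size(disp: int, n: int = 1) -> int:
    """Bytes needed to encode `disp` when 8-bit displacements are scaled by `n`.

    n = 1 corresponds to the legacy rule (plain signed 8-bit displacement).
    """
    if disp % n == 0 and -128 <= disp // n <= 127:
        return 1
    return 4


# Offset of the ninth 64-byte vector in an array.
offset = 8 * 64
print(displacement_size(offset))        # 4: 512 does not fit in a signed byte
print(displacement_size(offset, n=64))  # 1: stored as 8, scaled by 64
print(displacement_size(520, n=64))     # 4: not a multiple of 64
```

The difference between the first two results is the 3-byte saving mentioned in the text.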

Another question is whether the different types of VEX and EVEX prefixes can
be used interchangeably. In most cases this would be no problem. There are just
a few cases where the same opcode byte has different meanings with different
types of prefixes. These cases are:

  1. vmaskmovps with EVEX prefix becomes vscalefp.  The former instruction
    can be considered obsolete, replaced by the more efficient move instructions
    with mask.
  2. The SHA instructions with EVEX prefix become mathematical instructions.
    Thus, we will be unable to use the extra registers with SHA instructions.
  3. Conditional set and conditional move instructions with a VEX prefix become
    mask instructions. The mask instructions have only registers k0 - k7 with no
    room for expansion.

So we can conclude that it is technically possible to double the number of
general purpose registers in 64-bit mode. But the solution is rather complex,
and it would require yet another set of patches to the already ugly patchwork of
the current x86 instruction set. Is it worth the trouble?

Having 32 general purpose registers would be 'nice', but would it matter in
terms of performance? Most CPU-intensive programs have one or a few hot spots or
inner loops that take more than 99% of the CPU time. A compiler doing
whole-program optimization would be quite likely to use all 32 registers, but
how many registers would be used within the limits of a hot spot? Probably only
a few. Whatever happens outside the hot spot is of no serious consequence. With
vector registers we may need multiple accumulators for instructions that have
long latency, but this is not the case with general purpose registers.

So the pros and cons of adding more general purpose registers with a 2-byte
EVEX prefix are:

pros:

  • fewer register spills
  • more register variables available for inlined functions and whole-program
    optimization
  • some instructions can be made smaller with compressed displacement

cons:

  • complicated instruction set full of patches
  • instructions with extended registers become one or two bytes longer
  • more registers to save during task switches
  • the advantage of fewer register spills is likely to be irrelevant because
    it happens outside the hot spots

Let me hear the opinion of other readers.

www.agner.org
