AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are lilkely to cause problems for later extensions.
We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call and return to ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.
Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:
- AVX512 does not provide for clean extensions of the mask registers
- The overloading of the register extension bits will mess up possible future expansions of the general purpose register space
First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can predict already now that 64 bits will be insufficient within a few years. There seems to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already now asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024 bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will be needing another clumsy patch at that time.
Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.
So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.
The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension. Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extenstions for their original purpose. We need two more bits (B' and X') to make a clean extention of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and use bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.
There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.
I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these ones.