The extension of registers to double size has happened several times in the
history of the x86 ISA. Every time registers are extended to a larger size we
have the problem of partial register access and false dependencies when
legacy instructions write to the lower part of the register.
The solutions to this problem seen hitherto are the following:
1. Make the new registers independent of the previous smaller registers. This
   is the solution that was used in the transition from 64-bit MMX to 128-bit
   XMM. The advantage is that there is no false dependency. The disadvantage is
   that there are more registers to save on every task switch and that we need
   new instructions for moving data between the new and the old register set.
   The now-obsolete MMX registers and all instructions relating to them are
   still supported for the sake of backwards compatibility although they are
   rarely used.
2. Allow the hardware to split the register in two. A write to the lower half
   of an extended register is resolved by splitting the register into two
   independent registers of smaller size. This method is used in Intel Pentium
   Pro through Pentium M to handle the 8-, 16- and 32-bit general-purpose
   registers. There is no false dependency as long as the two partial registers
   can be kept apart. But there is a penalty if the two halves have to be
   joined again by an instruction that reads the full register (for example
   for saving it on the stack). The two partial registers cannot be joined
   together until both have retired to the permanent register file, which
   takes 5 - 7 clock cycles.
3. Allow the hardware to split the register in two, but join them together
   again at the register read stage if needed. This method is used in some
   versions of Core 2 for the 8-, 16- and 32-bit registers. The register read
   stage in the pipeline will automatically insert an extra micro-operation
   when needed for joining the two partial registers into one. The delay is
   2 - 3 clock cycles.
4. Don't split the register into parts. This method is used in AMD processors
   and in Intel Pentium 4 for the 8-, 16- and 32-bit registers. There is no
   penalty for managing partial registers and for joining them together, but
   every write to a partial register has a false dependency on previous writes
   to the same register or any part of it. The instruction scheduler has an
   extra dependency to keep track of.
5. Any write to a partial register causes the rest of the register to be set
   to zero. This method is used for the transition from 32-bit to 64-bit
   general-purpose registers. There is no false dependency and no splitting
   into partial registers. 32- and 64-bit modes cannot be mixed, but XMM and
   YMM instructions can be mixed. This can cause the upper part of a YMM
   register to be lost when executing XMM instructions.
6. The programmer (or compiler) can remove false dependencies by zeroing the
   full register, or the upper part of it, before accessing the full register.
   The disadvantage is that the value of the full register cannot be preserved
   across a call to a legacy function that saves and restores the lower part
   of the register.
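The architectural difference between a merging partial write (methods 2 to 4, which differ only in how the hardware handles the merge) and a zeroing partial write (method 5) can be sketched in plain C. This is a minimal model of register contents, not a description of any real pipeline, and the function names `partial_write_merge` and `partial_write_zero` are made up for illustration. The point is that the merging version takes the old register value as an input, which is exactly the dependency the hardware has to deal with, while the zeroing version does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a 64-bit register held in a uint64_t. */

/* Merging write: the low `bits` bits are replaced, the rest is preserved.
   The result depends on `old`, so the write depends on the previous value. */
uint64_t partial_write_merge(uint64_t old, uint64_t value, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    uint64_t mask = (1ULL << bits) - 1;
    return (old & ~mask) | (value & mask);
}

/* Zeroing write: the low `bits` bits are replaced, the rest is cleared.
   The old value is not an input at all, so there is nothing to wait for. */
uint64_t partial_write_zero(uint64_t value, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    uint64_t mask = (1ULL << bits) - 1;
    return value & mask;
}
```

For example, `partial_write_merge(0x1122334455667788, 0xAABBCCDD, 32)` gives `0x11223344AABBCCDD`, while `partial_write_zero(0xAABBCCDD, 32)` gives `0x00000000AABBCCDD` whatever the register held before.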
The announced extension from 128-bit XMM to 256-bit YMM will use a combination of
the above methods, according to the preliminary info published by Intel
(http://softwareprojects.intel.com/avx/). To recap the documentation, all
instructions that write to an XMM register will have two versions: a legacy version
that modifies the lower half of the 256-bit register and leaves the upper part
unchanged, and a new version of the same instruction with a VEX prefix that
zeroes the upper half of the register. So the VEX version of a 128-bit
instruction uses method (5) above. It is not clear whether the legacy version of
128-bit instructions will use method (2), (3) or (4). A new instruction
VZEROUPPER clears the upper half of all the YMM registers, according to method
(6).
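The three behaviours recapped above can be modelled in a few lines of C. This is only a sketch of the register contents as I read the preliminary documentation; the type and function names (`ymm_t`, `ymm_file_t`, `model_legacy_write_xmm`, `model_vex_write_xmm`, `model_vzeroupper`) are hypothetical and say nothing about timing or how the hardware implements it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_YMM 16

/* Hypothetical model of one 256-bit register: lo is the XMM part. */
typedef struct { uint64_t lo[2]; uint64_t hi[2]; } ymm_t;
typedef struct { ymm_t r[NUM_YMM]; } ymm_file_t;

/* Legacy (non-VEX) 128-bit write: upper half is left unchanged. */
void model_legacy_write_xmm(ymm_file_t *f, unsigned reg, const uint64_t v[2])
{
    assert(reg < NUM_YMM);
    memcpy(f->r[reg].lo, v, sizeof f->r[reg].lo);
}

/* VEX-encoded 128-bit write: upper half is zeroed, as in method (5). */
void model_vex_write_xmm(ymm_file_t *f, unsigned reg, const uint64_t v[2])
{
    assert(reg < NUM_YMM);
    memcpy(f->r[reg].lo, v, sizeof f->r[reg].lo);
    memset(f->r[reg].hi, 0, sizeof f->r[reg].hi);
}

/* VZEROUPPER: upper halves of all registers are zeroed, as in method (6). */
void model_vzeroupper(ymm_file_t *f)
{
    for (unsigned i = 0; i < NUM_YMM; i++)
        memset(f->r[i].hi, 0, sizeof f->r[i].hi);
}
```

In this model a legacy write to register 6 leaves `r[6].hi` untouched, a VEX write clears it, and `model_vzeroupper` clears the upper half of every register while leaving all the lower halves alone.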
Now, I wonder if we really need the complexity of having two versions of all 128-bit
instructions. The possibility of writing to the lower half of a YMM register
and leaving the upper half unchanged is needed only in the following scenario: A
function using a full YMM register calls a legacy function which is unaware of
the YMM extension but saves the corresponding XMM register before using it, and
restores the value before returning. The calling function can then rely on the
full YMM register being unchanged.
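The scenario can be sketched with the same kind of toy model in C. Everything here is an assumption for illustration: `ymm_t`, `callee_legacy` and `callee_vex128` are made-up names, and the model only tracks register contents. The first callee saves, clobbers and restores the low 128 bits with legacy writes; the second does the same thing with VEX-encoded 128-bit writes, which zero the upper half.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 256-bit register: lo is the XMM part. */
typedef struct { uint64_t lo[2]; uint64_t hi[2]; } ymm_t;

/* Legacy 128-bit write: upper half unchanged. */
static void legacy_write(ymm_t *r, const uint64_t v[2])
{
    memcpy(r->lo, v, sizeof r->lo);
}

/* VEX 128-bit write: upper half zeroed. */
static void vex128_write(ymm_t *r, const uint64_t v[2])
{
    memcpy(r->lo, v, sizeof r->lo);
    memset(r->hi, 0, sizeof r->hi);
}

/* A callee unaware of YMM that treats XMM6 as callee-save. */
void callee_legacy(ymm_t *ymm6)
{
    uint64_t saved[2];
    const uint64_t scratch[2] = {0xDEAD, 0xBEEF};
    memcpy(saved, ymm6->lo, sizeof saved);  /* save XMM6, e.g. on the stack */
    legacy_write(ymm6, scratch);            /* use XMM6 for its own work    */
    legacy_write(ymm6, saved);              /* restore XMM6 before return   */
}

/* The same save/use/restore sequence, but with VEX 128-bit encodings. */
void callee_vex128(ymm_t *ymm6)
{
    uint64_t saved[2];
    const uint64_t scratch[2] = {0xDEAD, 0xBEEF};
    memcpy(saved, ymm6->lo, sizeof saved);
    vex128_write(ymm6, scratch);
    vex128_write(ymm6, saved);
}
```

After `callee_legacy` the caller sees all 256 bits unchanged. After `callee_vex128` the low half is restored but the upper half is zero, so a callee that wants to use VEX encodings and keep the register intact would have to save and restore the full YMM register.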
However, this scenario is only relevant if the legacy function saves and restores the
XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit
Windows specifies that registers XMM6 - XMM15 have callee-save status, i.e.
these registers must be saved and restored if they are used. All other x86
operating systems (32-bit Windows, 32- and 64-bit Linux, BSD and Mac) have no
XMM registers with callee-save status. So this discussion is relevant only to
64-bit Windows. There can be no problem in any other operating system because
there are no legacy functions that save these registers anyway.
The design of the AVX instruction set allows a possible amendment to the ABI for 64-bit
Windows, specifying that YMM6 - YMM15 should have callee-save status. The
advantage of callee-save registers is that local variables can be saved in
registers rather than in memory across a call to a library function.
The disadvantage of this hypothetical specification of callee-save status to YMM6 -
YMM15 in a future Windows 64 ABI is that we will have a penalty for reading a
full YMM register after saving and restoring the partial register. The cost of
this is unknown as long as it has not been revealed whether method (2), (3) or
(4) will be used. I assume, however, that the penalty will not be insignificant
because Intel designers wouldn't have defined the VZEROUPPER instruction and
recommended the use of it unless there is some situation where the penalty of
partial register access is higher than the cost of zeroing the upper half of
all sixteen YMM registers. But if VZEROUPPER is used for reducing the penalty
of partial register access then we have destroyed the advantage of callee-save
status because all the YMM registers are destroyed anyway. This is a catch-22
situation! If there is a significant penalty to partial register access then
there is no point in defining callee-save status to YMM registers. If
VZEROUPPER uses 16 micro-ops then I can't imagine any situation where it saves
time. Either VZEROUPPER is very fast, or the penalty for partial register
access is very high, or the use of VZEROUPPER is never advantageous. Can
somebody please clarify?
So if my assumptions are correct, then the advantage of having two different versions of
all 128-bit instructions is minimal at best. Now, let's look at the
disadvantages:
- There will be a penalty for mixing the legacy XMM instructions using partial
  register writes with any of the full YMM instructions. Is there a penalty
  only when reading a full register after writing to the partial register, or
  are there other situations where mixing instructions with and without VEX
  causes delays?
- Compilers will need a switch for compiling 128-bit XMM instructions with or
  without VEX prefix. Software developers will have problems avoiding the
  penalty of mixing code with and without VEX prefixes.
- It will be very hard for programmers using vector intrinsics to avoid mixing
  the different kinds of instructions.
Do you expect all function libraries using XMM registers to have two versions of every
function: a legacy version for backwards compatibility, and a version with VEX
prefixes on all XMM instructions for calling from procedures that use YMM
registers?
If we have two versions of every library function then we don't have to care about YMM
registers being saved across a call to a legacy library function, because the
compiler will insert a call to the VEX version of the function, which can save
and restore the full YMM registers if required by the ABI.
It would be nice to have some indication of whether the penalty for mixing VEX and non-VEX
XMM instructions is so high that we need separate VEX and non-VEX versions of
all library functions. It would also be nice to know if there are any
situations where the partial register penalty is higher than the time it takes
to execute the VZEROUPPER instruction.
The solution of having two versions of all XMM instructions looks to me like a
shortsighted patch in an otherwise well designed and future-oriented ISA
extension. The problem will appear again in all future extensions of the size
of the vector registers. How do you plan to solve the problem next time the
register size is increased? Will we have two versions of every YMM instruction
when ZMM is introduced, and two versions of every ZMM instruction when....
This would be a waste of the few unused bits that are left in the VEX prefix.
It is probably too late to change the AVX spec now, although it looks to me like a
draft, published prematurely as a response to AMD's SSE5 stunt?



