Assignment macro-op fusion

Hi all,

Are micro-instructions non-destructive? If so, wouldn't it make sense to fuse an assignment and dependent arithmetic instruction into one?

a = b; a += c; -> a = b + c;

This would make up for x86's lack of non-destructive instructions. Of course compilers would have to be made aware that pairing these instructions is faster, but that seems to be a simple case of defining a non-destructive instruction pattern which is implicitly encoded as two legacy instructions.

So I wonder if there's any reason not to do this...


Just to clarify, it would turn the following:

8B C3    mov eax, ebx
03 C1    add eax, ecx

into:

8B C3 03 C1    add eax, ebx, ecx

Nothing changes at the binary level. It just decodes it as one non-destructive micro-operation.
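To make the idea concrete, here is a minimal sketch of the fusion rule as a toy decoder. It is only a model of the proposal, not a description of how any real x86 decoder works. The function names (decode_rr, decode_fused) are made up for illustration, and it only understands the register-to-register forms 8B /r (mov) and 03 /r (add).

```python
# Toy model of the proposed fusion. All names here are hypothetical.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_rr(code, i):
    """Decode a register-direct 'mov' (8B /r) or 'add' (03 /r) at offset i."""
    if i + 1 >= len(code):
        return None
    opcode, modrm = code[i], code[i + 1]
    if opcode not in (0x8B, 0x03) or modrm >> 6 != 0b11:
        return None  # only the mod=11 (register) forms are modelled
    name = "mov" if opcode == 0x8B else "add"
    return name, (modrm >> 3) & 7, modrm & 7  # (mnemonic, dst, src)

def decode_fused(code):
    """Return micro-ops, fusing 'mov d, s; add d, t' into 'add d, s, t'."""
    uops, i = [], 0
    while i < len(code):
        first = decode_rr(code, i)
        if first is None:
            raise ValueError(f"unsupported bytes at offset {i}")
        second = decode_rr(code, i + 2)
        if (first[0] == "mov" and second is not None
                and second[0] == "add" and second[1] == first[1]):
            _, dst, src1 = first
            src2 = second[2]
            if src2 == dst:  # 'add d, d' after the mov reads the copied value
                src2 = src1
            uops.append(f"add {REGS[dst]}, {REGS[src1]}, {REGS[src2]}")
            i += 4
        else:
            uops.append(f"{first[0]} {REGS[first[1]]}, {REGS[first[2]]}")
            i += 2
    return uops

print(decode_fused(bytes.fromhex("8BC303C1")))  # ['add eax, ebx, ecx']
```

Since mov leaves the flags alone, the fused operation would produce the same flags as the add by itself, so at least in this simple register-only case nothing architecturally visible changes.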

Anyone? I realize this complicates the decoding a bit, but it seems like a big win in performance and power consumption to me. Or are there some additional complications I'm currently not aware of?

Is there an easy way to assess the potential performance gain? I know how to get a compiler to generate optimal code for this, but is there some freely available x86 simulator which would allow evaluating this macro-op fusion?
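Short of a real simulator, one very rough first estimate would be to count how often a fusible pair shows up in a disassembly listing. The sketch below does only that. It is a static count, not a performance measurement, and it makes several assumptions: Intel-syntax text with one instruction per line, 32-bit register operands only, and a hypothetical helper name count_fusible.

```python
import re

# Assumed input: Intel-syntax text, one instruction per line, e.g. "mov eax, ebx".
REGS = {"eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"}
ALU = {"add", "sub", "and", "or", "xor"}
LINE = re.compile(r"^\s*(\w+)\s+(\w+)\s*,\s*(\w+)\s*$")

def parse(line):
    """Return (mnemonic, dst, src) for register-register lines, else None."""
    m = LINE.match(line)
    if m and m.group(2) in REGS and m.group(3) in REGS:
        return m.groups()
    return None

def count_fusible(listing):
    """Count adjacent 'mov d, s' / 'op d, t' pairs; returns (pairs, instructions)."""
    parsed = [parse(line) for line in listing.splitlines() if line.strip()]
    pairs, i = 0, 0
    while i + 1 < len(parsed):
        a, b = parsed[i], parsed[i + 1]
        if a and b and a[0] == "mov" and b[0] in ALU and a[1] == b[1]:
            pairs += 1
            i += 2  # both instructions are consumed by one fused operation
        else:
            i += 1
    return pairs, len(parsed)

sample = """
mov eax, ebx
add eax, ecx
mov edx, eax
sub edx, esi
xor edi, edi
"""
print(count_fusible(sample))  # (2, 5)
```

A dynamic count weighted by how often each pair actually executes would be more meaningful, but even the static number gives a feel for how common the pattern is in compiled code.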

For FP instructions, AVX already provides you a solution: The VEX prefix allows a non-destructive operand, for example VADDSD xmm1, xmm2, xmm3.

Quoting Thomas Willhalm (Intel)

For FP instructions, AVX already provides you a solution: The VEX prefix allows a non-destructive operand, for example VADDSD xmm1, xmm2, xmm3.

I know. I'm specifically talking about the scalar integer instructions. In discussions about other architectures, people have claimed that x86 is crippled by the lack of non-destructive instructions and will never be able to make up for it (without a drastic redesign or lots of extra hardware which consumes more power). But since it's already largely a RISC architecture internally anyway, I wondered whether simply executing a move and an arithmetic operation as one instruction would make things more efficient at minimal cost.

As I wrote in "Copy and modify", I wonder why this optimization is not state of the art. Why? Too expensive in terms of die space? Not worth the effort?
In some cases it is possible to circumvent the problem by making a copy but modifying the original (thereby letting original and copy change roles) and relying on superscalar execution.

A related case might be that some "complex" commands such as jecxz, loop, enter (level 0) and leave are so slow, although their meaning is almost trivial and they are easily outperformed by a sequence of other commands. Why? Probably for the same reasons the gluing of mov and an arithmetic command is not performed:

It seems that RISC commands are still much faster than microcoded ones, and the effort to turn a glued or "complex" command into a new RISC command is estimated to be too high for the expected gain.
Let's see what the next generations will bring...

The next generation has just been presented as the Haswell new instructions.
With the introduction of ANDN, BEXTR, RORX, SARX, SHLX and SHRX, these new commands effectively solve our problem for some special cases, albeit with the aid of the compiler.
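As a small illustration of what "non-destructive" buys here, this sketch models the 32-bit semantics of three of those instructions in Python and compares ANDN against the legacy mov/not/and sequence. The register file is just a dictionary, and the function names simply mirror the mnemonics. It is a model of the documented behaviour, not a measurement of anything.

```python
MASK32 = 0xFFFFFFFF

def andn(src1, src2):
    """ANDN dest, src1, src2: dest = ~src1 & src2."""
    return ~src1 & src2 & MASK32

def shlx(src, count):
    """SHLX dest, src, count: dest = src << (count mod 32), flags untouched."""
    return (src << (count & 31)) & MASK32

def rorx(src, imm):
    """RORX dest, src, imm8: rotate right by imm mod 32, flags untouched."""
    n = imm & 31
    return ((src >> n) | (src << (32 - n))) & MASK32

regs = {"eax": 0, "ebx": 0xF0F0, "ecx": 0xFF00, "edx": 0}

# Legacy sequence: mov eax, ebx / not eax / and eax, ecx (three instructions).
regs["eax"] = regs["ebx"]
regs["eax"] = ~regs["eax"] & MASK32
regs["eax"] &= regs["ecx"]

# Haswell: andn edx, ebx, ecx (one instruction, both sources left intact).
regs["edx"] = andn(regs["ebx"], regs["ecx"])

print(hex(regs["eax"]), hex(regs["edx"]))  # 0xf00 0xf00
```

The compiler still has to pick these encodings explicitly, which is the "aid of the compiler" part: existing binaries built from mov plus a destructive instruction gain nothing from them.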

It seems that performing the optimization as described in "Copy and modify" is too complicated for the compiler writers, and that the hardware optimizer you describe is indeed more or less "mandatory".
Could it be that, for example, AMD already implemented your proposal, possibly years ago?
Has anyone done any benchmarking on processors other than the i7 and the Atom N450?
