There is a known performance penalty for mixing AVX code with legacy SSE code on today's Haswell processors.
- Will this same problem exist on the Skylake chips with AVX-512?
- Will the delay be even longer for the ZMM registers, since there are twice as many to save and restore and each is twice as long?
- The workaround instruction for this, VZEROUPPER, is not listed as changed in Intel's Instruction Extensions manual. Shouldn't there be changes, such as zeroing the high portion of the ZMM registers as well?
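For context, the pattern that triggers the penalty on Haswell looks roughly like this (a minimal x86-64 assembly sketch; the register choices and surrounding code are my own illustration, not taken from the cited manual):

```asm
    ; 256-bit AVX work dirties the upper halves of the YMM registers
    vaddps  ymm0, ymm1, ymm2      ; upper 128 bits of ymm0 are now "dirty"

    ; Without VZEROUPPER, the next legacy-encoded SSE instruction forces
    ; the hardware to preserve the dirty upper halves -- the ~50-cycle
    ; save Intel describes, with a matching restore on the way back:
    addps   xmm3, xmm4            ; legacy SSE -> transition penalty

    ; The documented fix: zero the upper halves before entering SSE code,
    ; so the legacy instruction starts from a clean state
    vzeroupper
    addps   xmm3, xmm4            ; no transition penalty now
```

The open question above is whether this same idiom is sufficient when the dirty state extends into the 512-bit ZMM registers.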
This is documented by Intel:
- in Section 11.3, "Mixing AVX Code with SSE Code", of Intel's Architecture Optimization Manual (248966-029);
- in a long 2008 discussion in this forum, "Software consequences of extending XMM to YMM", which includes comments from Mark Buxton of Intel.
Intel writes: "These save and restore operations have a high penalty. Frequent execution of these transitions causes significant performance loss", and on this forum commented that the penalty was "something like 50 cycles - still TBD" each for the save and the restore (so about 100 cycles round trip).
The remarkable thing is that system code triggers this during interrupts, so there is no way to avoid the penalty for anyone who uses the full YMM registers! (Such code can't execute VZEROUPPER because it still needs the YMM contents, and it can't save and restore manually because interrupts can occur silently at any moment.)
I am preparing some assembly language code for the new processors, so this information would be very helpful in deciding how to proceed. Thank you.