Performance penalty for mixed AVX512 code?

There is a known performance penalty for mixing AVX code with legacy SSE code on today's Haswell processors.

  • Will this same problem exist in the Skylake chips and AVX512?
  • Will the delay be even longer for the ZMMs, since there are twice as many to save & restore and they are twice as long?
  • The workaround instruction for this, VZEROUPPER, is not listed as changed in Intel's Instruction Extensions manual. Won't there be changes like zeroing the high portion of the ZMM register?

This is documented by Intel:

  • in section 11.3, "Mixing AVX with SSE Code", of Intel's Architecture Optimizations Manual (248966-029)
  • in a long 2008 discussion in this forum, "Software consequences of extending XMM to YMM", with contributions from Mark Buxton of Intel: https://software.intel.com/en-us/forums/topic/301853

Intel writes: "These save and restore operations have a high penalty. Frequent execution of these transitions causes significant performance loss." In the forum thread, Intel commented that the penalty was "something like 50 cycles - still TBD" each for backup and restore (so about 100 for a round trip).
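To put that figure in perspective, here is a rough back-of-the-envelope sketch in Python. The 50-cycle cost is the "still TBD" number quoted above; the interrupt rate and clock frequency are purely hypothetical inputs, and the function name is mine, not anything from Intel.

```python
def transition_overhead(interrupts_per_sec, clock_hz, cycles_per_transition=50):
    """Fraction of CPU cycles lost if every interrupt costs one save plus one restore.

    cycles_per_transition defaults to the "something like 50 cycles" figure
    from the forum thread, which Intel itself marked as still TBD.
    """
    round_trip = 2 * cycles_per_transition  # one save + one restore
    return interrupts_per_sec * round_trip / clock_hz


# Hypothetical machine: 1,000 interrupts per second on a 3 GHz core.
print(transition_overhead(1_000, 3.0e9))
```

With those assumed inputs the loss is tiny (roughly 0.003% of cycles), so the per-interrupt cost would only matter at much higher interrupt rates or if the real penalty turns out to be much larger.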

The remarkable thing is that system code triggers this during interrupts, so there is no way to avoid the penalty for someone who uses the full YMMs! (He can't use VZEROUPPER because he needs the YMMs, and he can't know when to back up and restore manually, since interrupts can occur silently at any moment.)

I am preparing some assembly language code for the new processors, so this information would be very helpful in deciding how to proceed. Thank you.

iliyapolak's picture

>>>The remarkable thing is that system code triggers this during interrupts, so there is no way to avoid the penalty>>>

Do you mean that the system uses a specific interrupt to handle such a transition?

>> Do you mean that the system uses a specific interrupt to handle such a transition?

No, it means that by using legacy SSE instructions (which alter only the lower half of the registers) the interrupt forces the processor to save the upper half of the registers, which my program is using, and then restore those registers before my program needs them. The sequence of events is:

  1. My program uses full YMM instructions, so the processor sets an internal flag indicating so. ("State 2")
  2. An interrupt handler interrupts my program and uses a legacy instruction. The processor internally saves the upper register halves, which takes time. ("State 3")
  3. My program wakes up and uses another YMM instruction. The processor restores the upper halves of the registers, which takes time. (Back to "State 2")

Intel describes this in the "Intel 64 and IA-32 Architectures Optimization Reference Manual", which can be downloaded from this web page:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Intel writes, in section 11.3:

"Initially the processor is in clean state (1), where Intel SSE and Intel AVX instructions are executed with no penalty. When a 256-bit Intel AVX instruction is executed, the processor marks that it is in the Dirty Upper state (2). While in this state, executing an Intel SSE instruction saves the upper 128 bits of all YMM registers and the state changes to Saved Dirty Upper state (3). Next time an Intel AVX instruction is executed the upper 128 bits of all YMM registers are restored and the processor is back at state (2). These save and restore operations have a high penalty. Frequent execution of these transitions causes significant performance loss."
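The three states in that quote can be modeled as a small state machine. This is only a toy sketch in Python to count transitions, not a description of the real hardware: the operation labels, the function names, and the 50-cycle cost (the "still TBD" figure from the forum) are assumptions, and treating VZEROUPPER as a free return to the clean state from any state is a simplification.

```python
from enum import Enum


class UpperState(Enum):
    CLEAN = 1              # state (1) in Intel's text
    DIRTY_UPPER = 2        # state (2)
    SAVED_DIRTY_UPPER = 3  # state (3)


# Assumed cost of one save or one restore ("something like 50 cycles - still TBD").
TRANSITION_CYCLES = 50


def step(state, op):
    """Return (next_state, penalty_cycles) for one instruction class.

    op is "sse", "avx256" or "vzeroupper"; these labels exist only in this sketch.
    """
    if op == "vzeroupper":
        # Simplification: always lands in the clean state at no modeled cost.
        return UpperState.CLEAN, 0
    if op == "avx256":
        if state is UpperState.SAVED_DIRTY_UPPER:
            # Upper halves are restored: state (3) -> state (2).
            return UpperState.DIRTY_UPPER, TRANSITION_CYCLES
        return UpperState.DIRTY_UPPER, 0
    if op == "sse":
        if state is UpperState.DIRTY_UPPER:
            # Upper halves are saved: state (2) -> state (3).
            return UpperState.SAVED_DIRTY_UPPER, TRANSITION_CYCLES
        return state, 0
    raise ValueError(f"unknown op: {op}")


def total_penalty(ops):
    """Sum the modeled penalty over a sequence of ops, starting from CLEAN."""
    state, total = UpperState.CLEAN, 0
    for op in ops:
        state, cycles = step(state, op)
        total += cycles
    return total


# My program (avx256), then an interrupt handler (sse), then my program again.
print(total_penalty(["avx256", "sse", "avx256"]))
```

In this model, inserting "vzeroupper" before the "sse" step removes both penalties, but that is exactly the option I don't have when my program still needs the upper halves across the interrupt.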

 

 

 

iliyapolak's picture

>>>No, it means that by using legacy SSE instructions (which alter only the lower half of the registers) the interrupt forces the processor to save the upper half of the registers, which my program is using, and then restore those registers before my program needs them>>>

I know this.

I asked because one of your sentences was not clear to me.

Regarding the transition penalty described in your post, I think we must wait for a response from Intel engineers.

I'm not an expert in kernel programming, but isn't the kernel or interrupt handler supposed to save the CPU state before clobbering any registers? After saving the state, the code can issue vzeroall to avoid the penalty in your interrupt handler.

 

iliyapolak's picture

>>>No, it means that by using legacy SSE instructions (which alter only the lower half of the registers) the interrupt forces the processor to save the upper half of the registers, which my program is using, and then restore those registers before my program needs them. The sequence of events is>>>

Are you developing for Windows or Linux? Is your program, or perhaps its kernel module, scheduling and handling the aforementioned interrupt?

 

 

iliyapolak's picture

Quote:

andysem wrote:

I'm not an expert in kernel programming, but isn't the kernel or interrupt handler supposed to save the CPU state before clobbering any registers? After saving the state, the code can issue vzeroall to avoid the penalty in your interrupt handler.

 

Yes, and on Windows, KeSaveFloatingPointState saves the floating-point context.

Intel, any comments from you?
