Hello,
the following code fragment shows some strange behaviour that I couldn't find described in the docs.
#include <immintrin.h>

int main() {
    float tmp = 0.0f;
    volatile __m256 b = _mm256_broadcast_ss(&tmp);
    asm volatile ("nop"); // align jump target to 16 bytes
    for (int i = 0; i < 1000000000; i++) {
        asm volatile("vpmuludq %xmm1, %xmm0, %xmm1");
    }
}
This code runs in approx. 5.5 billion cycles.
If I comment out the broadcast, the code runs in approx. 4.5 billion cycles. It seems that the broadcast sets the "saved" flag in the AVX status register, that the AVX instructions in the loop never clear this flag, and that all following AVX instructions therefore suffer a penalty.
If I add a "vzeroupper" instruction after the broadcast, the code runs in the expected time.
Have I missed something, or could this be a bug (in the documentation or the CPU)?
My two questions are:
1. Why is the "saved" flag set, without using any SSE instruction?
2. Why isn't that flag cleared after running the first AVX instruction?
See http://www.intel.com/content/www/my/en/architecture-and-technology/64-ia... page 512 for the penalty I'm referring to.
The behaviour is the same with both the Intel compiler and GCC.
Best regards,
Heiko



