I have noticed in production code that a single _mm256_load_ps often leads to several loads in the ASM dump (indexed addressing for one operand), even when register pressure is not high; the same happens in 64-bit builds. The simplified (and otherwise meaningless) example below shows the effect well:
void Strange(float *dst, const float *src, int size)
{
const __m256 k1 = _mm256_set1_ps(10.0), k2 = _mm256_set1_ps(20.0), k3 = _mm256_set1_ps(30.0), k4 = _mm256_set1_ps(40.0);
for (int i=0; i
.B8.3:: ; Preds .B8.3 .B8.2
vmulps ymm5, ymm3, YMMWORD PTR [rdx+rax*4] ;440.55
vmulps ymm6, ymm2, YMMWORD PTR [rdx+rax*4] ;440.75
vaddps ymm0, ymm5, ymm6 ;440.41
vmulps ymm5, ymm4, YMMWORD PTR [rdx+rax*4] ;441.55
vmulps ymm6, ymm1, YMMWORD PTR [rdx+rax*4] ;441.75
vaddps ymm5, ymm5, ymm6 ;441.41
vaddps ymm0, ymm0, ymm5 ;440.27
vmovups YMMWORD PTR [rcx+rax*4], ymm0 ;440.21
add rax, 8 ;437.25
cmp rax, r8 ;437.19
jl .B8.3 ; Prob 82% ;437.19
I was expecting a single move such as vmovups ymm7, YMMWORD PTR [rdx+rax*4] at the start of the loop, with ymm7 then used 4 times, instead of 4 separate loads. Am I missing something here?
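To make the expectation concrete, here is a minimal sketch of the same kind of loop with the load written once into a local variable. The function name ExplicitLoad and the AVX_TARGET macro are hypothetical, the arithmetic is only an assumption reconstructed from the ASM dump above (four multiplies of one loaded vector, summed), and it uses unaligned load/store so it runs on any buffer. It only expresses the intent in the source; the compiler is still free to fold the load back into memory operands.

```c
#include <assert.h>
#include <immintrin.h>

/* Convenience (assumption): lets GCC/Clang build this one function with AVX
   even without -mavx; with other compilers, enable AVX via the compiler option. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical variant of the example: the loop body is an assumption
   reconstructed from the ASM dump, not the original source. */
AVX_TARGET
void ExplicitLoad(float *dst, const float *src, int size)
{
    const __m256 k1 = _mm256_set1_ps(10.0f), k2 = _mm256_set1_ps(20.0f),
                 k3 = _mm256_set1_ps(30.0f), k4 = _mm256_set1_ps(40.0f);

    assert(size % 8 == 0); /* this sketch has no tail handling */

    for (int i = 0; i < size; i += 8) {
        /* The load appears exactly once per iteration in the source. */
        const __m256 v = _mm256_loadu_ps(src + i);

        /* v is reused four times, mirroring the four vmulps in the dump. */
        const __m256 a = _mm256_add_ps(_mm256_mul_ps(k1, v), _mm256_mul_ps(k2, v));
        const __m256 b = _mm256_add_ps(_mm256_mul_ps(k3, v), _mm256_mul_ps(k4, v));

        _mm256_storeu_ps(dst + i, _mm256_add_ps(a, b));
    }
}
```

With these constants every output element is 100 times the corresponding input, which makes it easy to check that a rewritten loop still computes the same thing while comparing the generated ASM.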



