I have an AVX vector register reg containing 4 double values; let's call them (in order) 0 - 2 - 3 - 4.
These values have to be added to distinct locations of an array A, namely to positions A[0], A[2], A[3], A[4].
In other words:
A[0] += reg[0], A[2] += reg[1], and so on.
This situation recurs frequently in my program: sequences of load-add-store operations that are "almost" unit-stride, but not quite.
At first I thought I could use some sort of shuffle instruction to shift the values in reg, i.e. getting 0 - x - 2 - 3 (and maybe treating the remaining element, the one destined for A[4], as a scalar value), and then perform standard 256-bit instructions. However, as far as I know, I can't reduce that kind of shifting to a single instruction, right?
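To make the shifting idea concrete, here is a minimal sketch using only plain AVX intrinsics (no AVX2). The function name add_0234 and the vals pointer are hypothetical; the register is loaded from memory only so that the sketch is self-contained. The target attribute is GCC/Clang-specific and is an assumption about the toolchain; compiling with -mavx would work as well.

```c
#include <immintrin.h>

/* Hypothetical helper: A[0] += r0, A[2] += r1, A[3] += r2, A[4] += r3,
 * where (r0, r1, r2, r3) are the four doubles in reg. */
__attribute__((target("avx")))
void add_0234(double *A, const double *vals)
{
    __m256d reg = _mm256_loadu_pd(vals);            /* r0 r1 | r2 r3 */
    __m128d lo  = _mm256_castpd256_pd128(reg);      /* r0 r1 (no instruction) */
    __m128d hi  = _mm256_extractf128_pd(reg, 1);    /* r2 r3 */

    /* Build (r0, 0 | r1, r2): the "0 - x - 2 - 3" layout with x = 0. */
    __m128d new_lo = _mm_unpacklo_pd(lo, _mm_setzero_pd()); /* r0 0  */
    __m128d new_hi = _mm_shuffle_pd(lo, hi, 1);             /* r1 r2 */
    __m256d shifted = _mm256_insertf128_pd(
        _mm256_castpd128_pd256(new_lo), new_hi, 1);

    /* One 256-bit load-add-store covers A[0..3]; A[1] only gets +0.0. */
    _mm256_storeu_pd(A, _mm256_add_pd(_mm256_loadu_pd(A), shifted));

    /* The last element (r3) is handled as a scalar. */
    A[4] += _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}
```

In this sketch the rearrangement takes an extract, two in-lane shuffles and an insert, so it is not a single instruction under plain AVX. AVX2 adds _mm256_permute4x64_pd, which can reorder 64-bit elements across lanes in one instruction, but that is outside plain AVX.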
Related to this question, let's say that reg is now 0 - 2 - 3 - 5. Should I treat all 4 values as scalar values, or is there a way of efficiently (1-2 instructions?) extracting the two values in the middle (i.e. those crossing the two 128-bit lanes) into a 128-bit register?
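For this second case, one possible sketch (again plain AVX, with the hypothetical name add_0235 and a vals pointer used only to keep the example self-contained) pulls the two middle values into a 128-bit register with one extract and one shuffle, and treats the two outer values as scalars. The target attribute is GCC/Clang-specific.

```c
#include <immintrin.h>

/* Hypothetical helper: A[0] += r0, A[2] += r1, A[3] += r2, A[5] += r3. */
__attribute__((target("avx")))
void add_0235(double *A, const double *vals)
{
    __m256d reg = _mm256_loadu_pd(vals);            /* r0 r1 | r2 r3 */
    __m128d lo  = _mm256_castpd256_pd128(reg);      /* r0 r1 (no instruction) */
    __m128d hi  = _mm256_extractf128_pd(reg, 1);    /* r2 r3 */

    /* Upper element of lo and lower element of hi: (r1, r2). */
    __m128d mid = _mm_shuffle_pd(lo, hi, 1);

    /* A[2] and A[3] are contiguous, so one 128-bit load-add-store suffices. */
    _mm_storeu_pd(A + 2, _mm_add_pd(_mm_loadu_pd(A + 2), mid));

    /* The outer two values are added as scalars. */
    A[0] += _mm_cvtsd_f64(lo);
    A[5] += _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}
```

The cast of the low half is free, so the middle pair costs one vextractf128 plus one vshufpd here; whether that beats four scalar add-stores would have to be measured.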