Hi, I am new to SSE programming and I was wondering if someone more experienced could explain this to me:
When I sum elements in an array like this:
for (i = 0; i < N; i += 16) {
    sumA += array[i];
    sumB += array[i+4];
    sumC += array[i+8];
    sumD += array[i+12];
}
Why is it not faster than: for (i = 0; i < N; i += 4) { sum += array[i]; } ?
It is actually a bit slower. I was under the impression that this instruction-level parallelism should speed it up. I checked it using Intel Architecture Code Analyzer and the performance-critical path is about the same (while the second version sums only one float4 at a time), so I would expect the first version to be faster. Can someone explain why my assumption is wrong and what is limiting the performance?
I have been testing it on 06_17H, compiled as both 32-bit and 64-bit with VS2010. The memory of the array is locked to get rid of page faults, and prefetches are in place.
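
For concreteness, the two loops above could be written with SSE intrinsics roughly as follows. This is only a sketch, not the exact code that was measured: the function names (sum_one_acc, sum_four_acc, hsum_ps) are made up for illustration, the data is assumed to be float, n is assumed to be a multiple of 16, unaligned loads are used so the array needs no special alignment, and the prefetches and memory locking mentioned above are left out.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical helper: adds the four lanes of v into one float. */
static float hsum_ps(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Second version from the question: a single accumulator, so every
   addition depends on the result of the previous one. */
float sum_one_acc(const float *array, size_t n)
{
    size_t i;
    __m128 sum = _mm_setzero_ps();
    for (i = 0; i < n; i += 4)
        sum = _mm_add_ps(sum, _mm_loadu_ps(array + i));
    return hsum_ps(sum);
}

/* First version from the question: four independent accumulators,
   combined only after the loop. Assumes n is a multiple of 16. */
float sum_four_acc(const float *array, size_t n)
{
    size_t i;
    __m128 sumA = _mm_setzero_ps();
    __m128 sumB = _mm_setzero_ps();
    __m128 sumC = _mm_setzero_ps();
    __m128 sumD = _mm_setzero_ps();
    for (i = 0; i < n; i += 16) {
        sumA = _mm_add_ps(sumA, _mm_loadu_ps(array + i));
        sumB = _mm_add_ps(sumB, _mm_loadu_ps(array + i + 4));
        sumC = _mm_add_ps(sumC, _mm_loadu_ps(array + i + 8));
        sumD = _mm_add_ps(sumD, _mm_loadu_ps(array + i + 12));
    }
    /* Merge the four partial sums, then reduce across lanes. */
    return hsum_ps(_mm_add_ps(_mm_add_ps(sumA, sumB),
                              _mm_add_ps(sumC, sumD)));
}
```

Both functions perform the same number of loads and additions per element; the only difference is how many independent dependency chains the additions form. Note that the two can return slightly different results for general float data, because the additions are carried out in a different order.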



