I have been trying to use the Intel DPPS instruction with either EXTRACTPS or BLENDPS. Essentially I have a loop in which
x1 = dot-product(y1,z1)
x2 = dot-product(y2,z2)
x3 = dot-product(y3,z3)
x4 = x1/(sqrt(x2)*sqrt(x3)
I can do x1,x2,x3 with the DPPS instruction and then use extractps. So 3 DPPS with 3 EXTRACTPS. Turns out I did not get any improvement in performance. To use lesser number of EXTRACTPS, I used BLENDPS.
x1_sse = dpps(y1,z1,241)
x2_sse = dpps(y2,z2,242)
x2_sse = blendps(x1_sse,x2_sse, 2);
x3_sse = dpps(y3,z3, 244)
x3_sse = blendps(x2_sse, x3_sse, 4)
x1 = x3_array
x2 = x3_array
x3 = x3_array
Turns out there is no improvement from this either, infact a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better or are these instruction latencies just too long for my use ? I guess I am dissapointed with the performance of the SSE 4.1 so far.