I have been trying to use the Intel DPPS instruction together with either EXTRACTPS or BLENDPS. Essentially, I have a loop that computes:

x1 = dot-product(y1,z1)

x2 = dot-product(y2,z2)

x3 = dot-product(y3,z3)

x4 = x1/(sqrt(x2)*sqrt(x3))

I can compute x1, x2, and x3 with the DPPS instruction and then read each result out with EXTRACTPS, so 3 DPPS plus 3 EXTRACTPS. It turns out I did not get any improvement in performance. To use fewer EXTRACTPS, I used BLENDPS:

x1_sse = dpps(y1,z1,241)

x2_sse = dpps(y2,z2,242)

x2_sse = blendps(x1_sse, x2_sse, 2)

x3_sse = dpps(y3,z3, 244)

x3_sse = blendps(x2_sse, x3_sse, 4)

storeps(x3_sse, x3_array)

x1 = x3_array[0]

x2 = x3_array[1]

x3 = x3_array[2]
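
For concreteness, the pseudocode above might be written with compiler intrinsics roughly as follows. This is only a sketch under assumptions: the helper name `dot3_blend` and the `SSE41_FN` macro are made up for illustration, the macro relies on the GCC/Clang-style per-function `target` attribute (with other compilers, SSE4.1 would have to be enabled on the command line instead), and all pointers are assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps, _mm_blend_ps */

/* Assumption: GCC/Clang-style target attribute, so this builds even when
   SSE4.1 is not enabled globally. SSE41_FN is a hypothetical macro name. */
#if defined(__GNUC__) && !defined(__SSE4_1__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Hypothetical helper: computes the three dot products and stores them in
   out[0..2]; out[3] ends up as 0. All pointers must be 16-byte aligned. */
static SSE41_FN void dot3_blend(const float *y1, const float *z1,
                                const float *y2, const float *z2,
                                const float *y3, const float *z3,
                                float *out)
{
    /* Spot-check the alignment of the output buffer only. */
    assert(((uintptr_t)out & 15) == 0);

    /* High nibble 0xF: multiply all four lanes.
       Low nibble: lane that receives the sum; the other lanes are zeroed. */
    __m128 x1 = _mm_dp_ps(_mm_load_ps(y1), _mm_load_ps(z1), 0xF1); /* lane 0 (241) */
    __m128 x2 = _mm_dp_ps(_mm_load_ps(y2), _mm_load_ps(z2), 0xF2); /* lane 1 (242) */
    __m128 x3 = _mm_dp_ps(_mm_load_ps(y3), _mm_load_ps(z3), 0xF4); /* lane 2 (244) */

    /* blend(a, b, m): lane i comes from b if bit i of m is set, else from a. */
    __m128 x12  = _mm_blend_ps(x1, x2, 0x2);
    __m128 x123 = _mm_blend_ps(x12, x3, 0x4);

    /* One aligned store instead of three separate extractions. */
    _mm_store_ps(out, x123);
}
```

Since each DPPS already zeroes the lanes its mask does not select, the two blends could presumably be replaced by `_mm_or_ps` with the same result; whether either form is any faster would have to be measured.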

It turns out there is no improvement from this either; in fact, there is a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better, or are these instruction latencies just too long for my use? I guess I am disappointed with the performance of SSE4.1 so far.