Recent posts
https://software.intel.com/en-us/recent/320628
SSE 4.1 instructions - DPPS/EXTRACTPS
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301624
<p>I have been trying to use the Intel DPPS instruction with either EXTRACTPS or BLENDPS. Essentially, I have a loop that computes:</p>
<p>x1 = dot-product(y1,z1)<br />x2 = dot-product(y2,z2)<br />x3 = dot-product(y3,z3)</p>
<p>x4 = x1/(sqrt(x2)*sqrt(x3))</p>
<p>I can compute x1, x2 and x3 with the DPPS instruction and then use EXTRACTPS, so 3 DPPS with 3 EXTRACTPS. It turns out I did not get any improvement in performance. To use fewer EXTRACTPS, I used BLENDPS:</p>
<p>x1_sse = dpps(y1,z1,241)<br />x2_sse = dpps(y2,z2,242)<br />x2_sse = blendps(x1_sse,x2_sse,2)<br />x3_sse = dpps(y3,z3,244)<br />x3_sse = blendps(x2_sse,x3_sse,4)</p>
<p>storeps(x3_sse, x3_array)<br />x1 = x3_array[0]<br />x2 = x3_array[1]<br />x3 = x3_array[2]</p>
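<p>For concreteness, the DPPS/BLENDPS sequence above could be written with compiler intrinsics roughly as follows. This is only a sketch: the function name <code>ratio_sse41</code> and the <code>dots</code> output parameter are hypothetical, the <code>target</code> attribute is a GCC/Clang extension used so the file builds without extra flags, and unaligned loads and stores are used so it works with any arrays (my real code uses aligned ones). It also takes the square roots with one SQRTPS on the blended vector instead of scalar sqrt calls, which is a variation and not what I measured.</p>

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics: _mm_dp_ps, _mm_blend_ps */

/* Hypothetical helper, not from the original code.
   Computes x1 = y1.z1, x2 = y2.z2, x3 = y3.z3 for 4-float vectors,
   writes them to dots[0..2] (dots[3] is 0), and returns
   x4 = x1 / (sqrt(x2) * sqrt(x3)). Assumes x2 > 0 and x3 > 0. */
__attribute__((target("sse4.1")))  /* GCC/Clang extension (assumption) */
static float ratio_sse41(const float *y1, const float *z1,
                         const float *y2, const float *z2,
                         const float *y3, const float *z3,
                         float dots[4])
{
    assert(dots != NULL);

    /* High nibble 0xF: multiply all four lanes.
       Low nibble: which output lane receives the sum (others are zeroed). */
    __m128 d1 = _mm_dp_ps(_mm_loadu_ps(y1), _mm_loadu_ps(z1), 0xF1); /* 241: lane 0 */
    __m128 d2 = _mm_dp_ps(_mm_loadu_ps(y2), _mm_loadu_ps(z2), 0xF2); /* 242: lane 1 */
    __m128 d3 = _mm_dp_ps(_mm_loadu_ps(y3), _mm_loadu_ps(z3), 0xF4); /* 244: lane 2 */

    /* Merge the three results into one register: (x1, x2, x3, 0). */
    __m128 r = _mm_blend_ps(d1, d2, 0x2);  /* take lane 1 from d2 */
    r = _mm_blend_ps(r, d3, 0x4);          /* take lane 2 from d3 */

    /* One packed sqrt covers both x2 and x3; lane 0 of the result is unused. */
    __m128 s = _mm_sqrt_ps(r);

    float roots[4];
    _mm_storeu_ps(dots, r);
    _mm_storeu_ps(roots, s);

    return dots[0] / (roots[1] * roots[2]);
}
```

<p>Because each DPPS mask zeroes the lanes it does not write, the two BLENDPS could presumably be replaced by ORPS or ADDPS with the same result; I have not measured whether that makes any difference.</p>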
<p>It turns out there is no improvement from this either, in fact a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better, or are these instruction latencies just too long for my use? I guess I am disappointed with the performance of SSE 4.1 so far.</p>
Sun, 06 Jul 08 11:50:57 -0700 | vsachde | 301624