I was trying to optimize the dot product of two vectors. Both vectors are laid out as arrays in aligned memory.
I wrote an assembly implementation, only to realize that the repeated additions are causing resource stalls (at least that's what I infer).
For example, consider this:
    . . .
    pxor   xmm7, xmm7           ; Result
    movapd xmm0, [esi]
    mulpd  xmm0, [edi]
    movapd xmm1, [esi+16]       ; Uses INTEGER_ADD port
    mulpd  xmm1, [edi+16]       ; Uses INTEGER_ADD port
    addpd  xmm7, xmm0           ; Uses SIMD ADD port
    movapd xmm2, [esi+32]       ; Uses INTEGER ADD port
    mulpd  xmm2, [edi+32]       ; Uses INTEGER ADD port
    addpd  xmm7, xmm1           ; Uses SIMD ADD port
    addpd  xmm7, xmm2           ; Uses SIMD ADD port
    . . .
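In case the intent is easier to read in C, here is a rough sketch of the same accumulation using SSE2 intrinsics. The function name `dot_sse2`, the loop, and the final horizontal sum are my own additions for illustration and are not part of the fragment above. It assumes both arrays are 16-byte aligned and that `n` is a multiple of 2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: dot product of two 16-byte-aligned double arrays.
   Assumes n is a multiple of 2. Uses a single accumulator, like xmm7 above. */
double dot_sse2(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();                 /* pxor   xmm7, xmm7      */
    for (size_t i = 0; i < n; i += 2) {
        __m128d x = _mm_load_pd(a + i);             /* movapd xmmN, [esi+off] */
        __m128d p = _mm_mul_pd(x, _mm_load_pd(b + i)); /* mulpd xmmN, [edi+off] */
        acc = _mm_add_pd(acc, p);                   /* addpd  xmm7, xmmN      */
    }
    /* Horizontal sum of the two lanes (not shown in the assembly fragment). */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
}
```

Every `_mm_add_pd` here depends on the result of the previous one, which mirrors the chain of `addpd xmm7, ...` instructions in the assembly.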
To me, the repeated use of the integer unit seems to cause a lot of stalls in the resulting code, which results in poor SSE performance. Can someone shed some light on this?
Is this the right way to write SIMD code for dot products? (Let us set the DPPS instruction aside for now; I want to understand SIMD correctly.)
Thanks for any help,