Hi everyone:

This is my first time posting to the forum. I have a lot of experience designing and optimizing assembly language routines for signal processing. Until recently, this was on what you might call predictable architectures (the Motorola 56k DSP and PowerPC). I am now doing it on x86, and I'm having difficulty understanding where the timing changes are coming from.

I'm only interested in optimizing one loop: a second-order section (a basic building block of digital filters). The loop processes one sample, reading it in from memory, doing the math and updating the internal states, and writing the sample out again (on top of where it was read from). The loop executes N times, where N is the number of samples in the buffer.
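For concreteness, the C++ loop is essentially this (a transposed direct form II section; `z1` and `z2` are the internal states, and the variable names match the comments in the assembly listings further down):

```cpp
#include <cstddef>

// Process one second-order section in place over N samples.
// Transposed direct form II: two states z1, z2 carried across samples.
void sos_process(double* samples, std::size_t N,
                 double b0, double b1, double b2,
                 double a1, double a2,
                 double& z1, double& z2)
{
    for (std::size_t i = 0; i < N; ++i) {
        double x  = samples[i];
        double v0 = b0 * x + z1;      // output sample
        z1 = b1 * x - a1 * v0 + z2;   // update first state
        z2 = b2 * x - a2 * v0;        // update second state
        samples[i] = v0;              // write back on top of the input
    }
}
```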

I began by writing the loop in C++, then turned on the SSE2 optimizations in the compiler (Visual Studio) and grabbed the disassembly. I then manually removed all the unnecessary re-loads of registers that never change, and ended up with roughly a 15% improvement in speed. So far so good.

All the arithmetic is double precision, with the core being five multiplies and four adds. The compiler did not vectorize, so I combined four of the multiplies into two packed multiplies, and three of the adds into two packed adds (one lane carries a redundant add of zero). Using both halves of the XMM registers also freed up enough space that all the constants could be loaded into registers outside the loop. The total code length is shorter. And yet, this version runs slightly *slower* than the previous version (by maybe 5%). I've verified that the code is in fact correct and that the numbers are not going denormal.
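In intrinsics terms, the pairing I did is roughly the following sketch (not my exact code, but the same arrangement as the vectorized assembly below: the two feedforward and two feedback products are packed, and both states live in one register as z2|z1):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Sketch of the vectorized second-order section. b12 = b2|b1 and
// a12 = a2|a1 are packed once outside the loop; z holds z2|z1.
void sos_process_sse2(double* samples, std::size_t N,
                      double b0, double b1, double b2,
                      double a1, double a2,
                      double& z1, double& z2)
{
    const __m128d b12 = _mm_set_pd(b2, b1);       // high|low = b2|b1
    const __m128d a12 = _mm_set_pd(a2, a1);       // high|low = a2|a1
    __m128d z = _mm_set_pd(z2, z1);               // z2|z1
    for (std::size_t i = 0; i < N; ++i) {
        double x  = samples[i];
        double v0 = b0 * x + _mm_cvtsd_f64(z);    // x*b0 + z1
        samples[i] = v0;
        __m128d xx = _mm_set1_pd(x);              // x|x
        __m128d vv = _mm_set1_pd(v0);             // v0|v0
        __m128d t  = _mm_sub_pd(_mm_mul_pd(xx, b12),
                                _mm_mul_pd(vv, a12)); // x*b2-v0*a2 | x*b1-v0*a1
        // Shift z right 8 bytes so z2 lands in the low lane (the high lane
        // becomes the redundant zero), then add to form the new z2|z1.
        z = _mm_add_pd(_mm_castsi128_pd(
                _mm_srli_si128(_mm_castpd_si128(z), 8)), t);
    }
    z1 = _mm_cvtsd_f64(z);                        // low lane back out
    z2 = _mm_cvtsd_f64(_mm_unpackhi_pd(z, z));    // high lane back out
}
```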

I've read a good chunk of Agner Fog's work on this topic, and I believe I've not done anything too foolish. I've tried moving instructions around in the loop, to no avail, and the loop code is aligned on a 16-byte boundary. None of this has made any difference. (This is all on a Core i7-860.) So at this point I'm pretty confused. If anyone can help me out with some understanding or things to try, I'd be very grateful. With that, here's the non-vectorized loop (I've omitted the preamble in which some registers are loaded up):

_loop:
    movsd   xmm7, mmword ptr [eax]   // x = samples[i]
    movapd  xmm6, xmm7               // x
    mulsd   xmm6, xmm1               // x*b0
    addsd   xmm6, xmm4               // v0 = x*b0 + z1
    movsd   xmm4, mmword ptr [b1]
    mulsd   xmm4, xmm7               // x*b1
    movsd   xmm0, mmword ptr [a1]
    mulsd   xmm0, xmm6               // v0*a1
    subsd   xmm4, xmm0               // x*b1 - v0*a1
    addsd   xmm4, xmm5               // x*b1 - v0*a1 + z2 (new z1)
    movsd   mmword ptr [eax], xmm6   // samples[i] = v0
    add     eax, 8                   // i++
    movsd   xmm5, mmword ptr [b2]
    mulsd   xmm5, xmm7               // x*b2
    mulsd   xmm6, mmword ptr [a2]    // v0*a2
    subsd   xmm5, xmm6               // x*b2 - v0*a2 (new z2)
    sub     ecx, 1
    jg      _loop

And here is the vectorized version (again omitting the preamble):

_loop:
    movsd   xmm7, mmword ptr [eax]   // x = samples[i]
    movapd  xmm6, xmm7               // x
    mulsd   xmm6, xmm1               // x*b0
    addsd   xmm6, xmm4               // v0 = x*b0 + z1
    movsd   mmword ptr [eax], xmm6   // samples[i] = v0
    add     eax, 8                   // i++
    shufpd  xmm7, xmm7, 0            // x | x
    mulpd   xmm7, xmm2               // x*b2 | x*b1
    shufpd  xmm6, xmm6, 0            // v0 | v0
    mulpd   xmm6, xmm3               // v0*a2 | v0*a1
    subpd   xmm7, xmm6               // x*b2 - v0*a2 | x*b1 - v0*a1
    psrldq  xmm4, 8                  // 0 | z2
    addpd   xmm4, xmm7               // x*b2 - v0*a2 | x*b1 - v0*a1 + z2 (new z2 | new z1)
    sub     ecx, 1
    jg      _loop

Thanks in advance for any help you can give.

Best wishes,

Tom Kite