Hi,

I wrote a benchmark to compare the possible speedup with SSE vs. scalar execution. But I don't understand the results I get.

The following loop:

loop:
movaps 0x10(%rax),%xmm1
cmpltps %xmm1,%xmm0
movaps 0x20(%rax),%xmm0
cmpltps %xmm0,%xmm1
movaps 0x30(%rax),%xmm1
cmpltps %xmm1,%xmm0
add $0x40,%rax
movaps (%rax),%xmm0
cmpltps %xmm0,%xmm1
cmp %rax,%rbx
ja loop

appears to require ~2 cycles per movaps+cmpltps pair (8 cycles per iteration) on a Nehalem processor. (The memory it iterates over is smaller than the L1 cache.)

The generated code for the scalar case looks like this:

loop:
movss 0x4(%rax),%xmm1
ucomiss %xmm0,%xmm1
seta %dl
movss 0x8(%rax),%xmm0
ucomiss %xmm1,%xmm0
seta %dl
movss 0xc(%rax),%xmm1
ucomiss %xmm0,%xmm1
seta %dl
add $0x10,%rax
movss (%rax),%xmm0
ucomiss %xmm1,%xmm0
seta %dl
cmp %rax,%rbx
ja loop

This requires ~1.33 cycles per ucomiss (i.e. 5.33 cycles per iteration) on the same processor. (Same memory size, too.)

The result is that comparing N floats takes about N/2 cycles with SSE (8 cycles per 16 floats) and about 1.33*N cycles without SSE (5.33 cycles per 4 floats). That's a speedup factor of about 2.66. I expected something closer to a factor of 4...
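To spell out the arithmetic behind that factor, here is a small C sketch. The function names are made up for this post, and the cycle counts are simply my measurements from above plugged in, so treat it as a sanity check of the numbers rather than as part of the benchmark itself:

```c
#include <assert.h>

/* Cycles spent per float, given the measured cycles per loop iteration
   and the number of floats compared in one iteration. */
double cycles_per_float(double cycles_per_iter, int floats_per_iter)
{
    assert(floats_per_iter > 0);
    return cycles_per_iter / floats_per_iter;
}

/* Speedup of the SSE loop over the scalar loop, using the measured
   values quoted in this post (~8 and ~5.33 cycles per iteration). */
double sse_speedup(void)
{
    /* SSE loop: 4 cmpltps, each comparing 4 floats -> 16 floats. */
    double sse = cycles_per_float(8.0, 16);
    /* Scalar loop: 4 ucomiss, each comparing 1 float -> 4 floats. */
    double scalar = cycles_per_float(5.33, 4);
    return scalar / sse;
}
```

With these inputs the SSE loop comes out at 0.5 cycles per float and the scalar loop at about 1.33 cycles per float, which is where the factor of roughly 2.66 comes from.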

Now I'm trying to understand where this comes from:

1. The cmpltps result is not used, so only the throughput should count, i.e. I should be able to execute one cmpltps per cycle. Does the movaps account for the second cycle? Could the movaps execute in parallel with the cmpltps if it used a different register?

2. The ucomiss instruction has a latency of 1 cycle. The result of seta is not used, so that instruction can run in parallel with everything else. The movss instruction can execute in parallel with the previous ucomiss and seta. So 1.33 looks sensible, but I can't fully explain where this number comes from.

Question: Does the second call to seta have to wait for the first one to retire because it writes to the same register?

Can anybody help me understand instruction-level parallelism better?