I wrote a benchmark to compare the possible speedup of SSE over scalar execution, but I don't understand the results I get.
The following loop:
appears to require ~2 cycles per movaps+cmpltps pair (8 cycles per iteration) on a Nehalem processor. (The memory it iterates over is smaller than the L1 cache.)
The generated code for the scalar case looks like this:
This requires ~1.33 cycles per ucomiss (i.e. 5.33 cycles per iteration) on the same processor. (Same memory size, too.)
The result is that to compare N floats with SSE I need N/2 cycles (8 cycles per iteration, 16 floats per iteration), while without SSE I need 1.33*N cycles (5.33 cycles for 4 floats). That's a speedup factor of only 2.66. I expected something closer to a factor of 4...
Now I'm trying to understand where this comes from:
1. The cmpltps result is not used, so only the throughput should count, i.e. I can execute one cmpltps per cycle. Do the movaps instructions account for the second cycle? Could the movaps execute in parallel with cmpltps if they used different registers?
2. The ucomiss instruction has a latency of 1 cycle. The result of seta is not used, so that instruction can run in parallel with everything else, and the movss can execute in parallel with the preceding ucomiss and seta. So 1.33 cycles looks plausible, but I can't fully explain where exactly that number comes from.
Question: Does the second seta have to wait for the first one to retire because both write to the same register?
Can anybody help me understand instruction-level parallelism better?