SIMD tuning with ASM pt. 3 - PS good, SS bad

If you recall, we left off in yesterday's post having compiled a test program with gcc, and we saw this code for the 'working' part of a loop. (Yes, I will get to the Intel C++ compiler in the next post, but I'll stick with what I've got so far, just so we can take baby steps.)


.LBB52:

        .loc 1 14 0

        movss   (%rbp,%rax,4), %xmm0

        addss   (%rdx,%rax,4), %xmm0

        movss   %xmm0, (%rbp,%rax,4)

        addq    $1, %rax

        .loc 1 13 0

        cmpq    $1000, %rax

        jne     .L4



My quiz left you with two questions:
1) Is this using SSE?
Answer: YES! ADDSS and MOVSS are most certainly SSE instructions. After all, they're using xmm registers, no? Dump your old Doom*, Lotus123* and Gato binaries...you won't find any XMM registers in there!

2) Is this using SIMD? (which was poorly worded - better phrased as "are we taking advantage of SIMD hardware?")
Answer: NO!

And this is where my lesson for the day comes in. Yes, ASM is hard to read. Yes, there is a lot of it. But you can go a LONG WAY just by being able to find the ASM for the code you care about and knowing that 'SS' (as in ADDSS) is bad, and 'PS' (as in ADDPS) is good.

May I direct you to the Intel Software Developer's Manual? It's our canonical tome of all our opcodes. I have to admit it's not exactly something I'd print out and take on a hazy summer day to a lakeside cottage, but you'll want to have it handy nonetheless. And for the first time ever, we have combined all the opcodes into one manual instead of splitting it in half!

Anyway, if we look up ADDSS we see that it does this: Add Scalar Single-Precision Floating-Point Values. The key word here is SCALAR: one value at a time. For SSE, each register is 128 bits wide. Since a single-precision floating-point value is 32 bits, four of them fit. I call those four 'lanes'. For AVX we've doubled that to 256 bits...but to keep things simple let's stick with good old SSE. Conceptually, at least, the SIMD register looks like this:

[Figure: an SSE register - 128 bits divided into four 32-bit lanes]

But ADDSS only does a SINGLE add...which means 3 out of 4 lanes are empty! Conceptually, if you have this assembly:


   addss %xmm1, %xmm2



You are getting this: the one value in the low lane of xmm1 is added to the low lane of xmm2 and stored back into xmm2's low lane, while the upper three lanes of xmm2 pass through untouched.


On the other hand, if you had ADDPS - ADDSS' packed cousin,


   addps %xmm1, %xmm2



You would get this: each of the four lanes of xmm1 added to the corresponding lane of xmm2, with all four sums stored in xmm2.



And now we're doing 4x the work in the same number of clock cycles!

I should point out I'm hung up on single-precision (32-bit) floats...if these were double-precision (64-bit) floating-point numbers, the desirable instruction would be ADDPD.

BTW, another clue that our vectorization was not quite right was the


        addq    $1, %rax



That is, the loop counter is adding '1' each time through. Pay attention to this...usually with a vectorized loop (depending on how memory is loaded) you'll see the counter incremented by 4, 8, or more each iteration.

So to sum up: with just a bit of ASM-dumping know-how and a smattering of SSE instruction knowledge, you can gain a lot of insight into what your code really does - no compiler reports or 'guess what the compiler did' games.

Next time I'll show you how to coax those PS's out.

So, to see who is paying attention, what other opcode in the ASM snippet is holding us back from using all the SIMD lanes?

For more complete information about compiler optimizations, see our Optimization Notice.