SIMD tuning with ASM pt. 3 - PS good, SS bad

If you recall where we left off in yesterday's post, we compiled a test program with gcc and saw this code for the 'working' part of a loop. (Yes, I will get to the Intel C++ compiler in the next post, but I'll stick with what I've got so far so we can take baby steps.)


        .loc 1 14 0
        movss   (%rbp,%rax,4), %xmm0
        addss   (%rdx,%rax,4), %xmm0
        movss   %xmm0, (%rbp,%rax,4)
        addq    $1, %rax
        .loc 1 13 0
        cmpq    $1000, %rax
        jne     .L4
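
In case you missed the previous post, a loop of roughly this shape is what sits behind a listing like the one above. Treat it as a sketch reconstructed from the assembly, not the original test program: the names `add_arrays`, `a`, `b` and `N` are assumptions, and only the bound of 1000 and the single-float add come from the listing.

```c
#include <stddef.h>

#define N 1000  /* assumed to match the `cmpq $1000, %rax` bound in the listing */

/* Hypothetical reconstruction: adds b[i] into a[i], one float per iteration. */
void add_arrays(float *a, const float *b)
{
    for (size_t i = 0; i < N; i++) {
        /* In the listing: movss (load a[i]), addss (add b[i]), movss (store a[i]). */
        a[i] = a[i] + b[i];
    }
}
```

Whether a given compiler emits the scalar code shown above for this loop depends on the compiler version and flags, which are not repeated here.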

My quiz left you with 2 questions:
1) Is this using SSE?
Answer: YES! ADDSS and MOVSS are most certainly SSE instructions. After all, they are using xmm registers, no? Dump your old Doom*, Lotus123* and Gato binaries...you won't find any XMM registers in there!

2) Is this using SIMD? (which was poorly worded - better phrased as "are we taking advantage of SIMD hardware?")
Answer: NO!

And this is where my lesson for the day comes in. Yes, ASM is hard to read. Yes, there is a lot of it. But you can go a LONG WAY just by being able to find the ASM for the code you care about and knowing that 'SS' (as in ADDSS) is bad, and 'PS' (as in ADDPS) is good.

May I direct you to the Intel Software Developer's Manual? It's our canonical tome of all our opcodes. I have to admit it's not exactly something I'd print out and take on a hazy summer day to a lakeside cottage, but you'll want to have it handy nonetheless. And for the first time ever, we have combined all the opcodes into one manual instead of splitting it in half!

Anyway, if we look up ADDSS we will see that it does this: Add Scalar Single-Precision Floating-Point Values. The key word here is SINGLE. For SSE, each register is 128 bits wide. Since a single-precision floating point value is 32 bits, four of them fit in one register. I call those 4 'lanes'. For AVX, we've doubled it to 256 bits...but to keep things simpler let's use good old SSE. Conceptually, at least, the SIMD register looks like this:

[Figure: SSE register]

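But ADDSS only does a SINGLE add...which means 3 out of 4 lanes are empty! Conceptually, if you have this assembly (written Intel-style with the destination first, unlike the gcc listing above, which puts the destination last):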

   addss xmm2, xmm1

You are getting this: one value from xmm1 and one value from xmm2 are added together and the result is stored into xmm2. The other three lanes do no work.

On the other hand, if you had ADDPS - ADDSS' packed cousin,

   addps xmm2, xmm1

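You would get this: each of the four lanes of xmm1 is added to the matching lane of xmm2, and all four results are stored into xmm2.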

And now we're doing 4x the work in the same number of clock cycles!
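
If it helps to see the difference spelled out, here is a small plain-C model of the two instructions. It is only an illustration of the lane behavior, not how the hardware is implemented, and the names `xmm_model`, `model_addss` and `model_addps` are made up for this sketch.

```c
/* A 128-bit XMM register modeled as four 32-bit float lanes (lane 0 is the low lane). */
typedef struct {
    float lane[4];
} xmm_model;

/* Model of `addss dst, src`: only lane 0 is added.
   Lanes 1-3 of dst pass through unchanged, so they do no useful work. */
xmm_model model_addss(xmm_model dst, xmm_model src)
{
    dst.lane[0] += src.lane[0];
    return dst;
}

/* Model of `addps dst, src`: all four lanes are added by the one instruction. */
xmm_model model_addps(xmm_model dst, xmm_model src)
{
    for (int i = 0; i < 4; i++) {
        dst.lane[i] += src.lane[i];
    }
    return dst;
}
```

With xmm2 holding {1, 2, 3, 4} and xmm1 holding {10, 20, 30, 40}, the ADDSS model leaves {11, 2, 3, 4} in xmm2, while the ADDPS model leaves {11, 22, 33, 44}.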

I should point out I'm hung up on single precision (32 bit) floats...if these were double precision (64 bit) floating point numbers, then the desirable instruction would be ADDPD, which works on two doubles per 128-bit register.

BTW, another clue that our vectorization was not quite right was the

        addq    $1, %rax

That is, the loop counter is adding '1' each time through. Pay attention to this...usually with a vectorized loop (depending on the way memory is loaded) you'll see this as an increment of 4 or 8 or more.
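
To make the "increment of 4" concrete, here is a hand-written sketch of the same kind of add loop using SSE intrinsics. It assumes an x86 target with SSE and the standard `<xmmintrin.h>` header; the function name `add_arrays_packed` and its parameters are hypothetical, and this is not necessarily what a compiler would generate for you.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: a[i] += b[i] for n floats, four lanes per iteration. */
void add_arrays_packed(float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: the counter advances by 4, one full XMM register per pass. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));  /* packed add, then store 4 floats */
    }

    /* Scalar tail for the 0-3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        a[i] += b[i];
    }
}
```

In the ASM for a loop like this you would expect to find ADDPS in the hot loop and a counter that steps by 4 elements (or 16 bytes) instead of the `addq $1` above. The unaligned loads are chosen here only so the sketch works for any pointers.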

So to sum up: with just a bit of ASM dumping know-how and a smattering of knowledge of SSE instructions you can gain a lot of insight into your true code - no compiler reports or 'guess what the compiler did' games.

Next time I'll show you how to coax those PS's out.

So, to see who is paying attention, what other opcode in the ASM snippet is holding us back from using all the SIMD lanes?


Just to give an analogy to this "PS is good and SS is bad" notion for people who want to do the same thing on an ARM device rather than an x86 device: older code and compilers use Vxxx for SIMD and Fxxx for scalar, e.g. VADD instead of FADD would add multiple numbers at the same time. But if you are using a newer version, Fxxx has been replaced with Vxxx, so you can't necessarily tell whether it is a scalar or a vector instruction unless you analyze the instruction to see whether it uses NEON instead of VFP. In general, though, you can just look at the registers used: if it is Q0 then it accesses the whole 128-bit register, but if it is Q0[0] then it just accesses a scalar number inside that vector.

Nice. Will the next article discuss memory alignment issues then? The Friendly Manual indicates that unaligned accesses really require two accesses, and doubling the time is certainly contrary to the goal of gaining speed. :)

I love all this assembly. Takes me back to college and my trusty microcontroller opcode reference.

> Add Scalar Single-Precision Floating-Point Values. The key word here is SINGLE.

The key word is actually Scalar, which has the counterpart of Packed. The word Single refers to the precision (i.e. 32-bit float), and it has the counterpart of Double, which is 64-bit float.