SIMD tuning with ASM pt. 4 - Vectorization & ICC

If you remember, in my first post I presented a program. Stripping out the setup, the part we care about - the part that does the work - is this:

        float x[PTS];

        float y[PTS];

        for (int i = 0; i < PTS; i++) { // line 13 in orig source

                x[i] += y[i];           // line 14 in orig source

        }
And if you'll remember from my last post, I showed that, at least with gcc and standard optimization flags, we were getting SSE but not packed (SIMD) SSE. That is, the processor is doing SSE adds, but they are the 'SS' variety that operates on only one element (not four) at a time.

Today I'm switching to using the Intel® C++ Compiler for Linux. I will call it by its pet name, icc, from here on in. You can certainly keep playing along with gcc too, though. Here's how you compile:

icc -msse4.2 -O3 blah.cpp                 # Intel® C++ Compiler for Linux

gcc -msse4a -O3 -ftree-vectorize blah.cpp # gcc

 Now let's see what we get:

..B1.3:                         # Preds ..B1.2

        xorl      %eax, %eax                                    #13.2

                                # LOE rax rbx r12 r13 r14 r15

..B1.4:                         # Preds ..B1.4 ..B1.3

        movaps        (%rsp,%rax,4), %xmm0                      #14.3

        movaps      16(%rsp,%rax,4), %xmm1                      #14.3

        addps     4000(%rsp,%rax,4), %xmm0                      #14.3

        addps     4016(%rsp,%rax,4), %xmm1                      #14.3

        movaps    %xmm0, (%rsp,%rax,4)                          #14.3

        movaps    %xmm1, 16(%rsp,%rax,4)                        #14.3

        addq      $8, %rax                                      #13.2

        cmpq      $1000, %rax                                   #13.2

        jb        ..B1.4        # Prob 99%                      #13.2

                                # LOE rax rbx r12 r13 r14 r15


If I impart nothing else to you, dear reader, it is that you come to recognize that code like this is good. Just as many aspiring astronomers perhaps first observe Orion's belt as a curiously aligned set of stars, but later blossom to where they can spot the heretofore-chaotic astral dots as the Hunter in all his glory in an instant, you too can learn to spot that this is good, and that the code I showed you before is bad.

(I can anticipate that the detail I go into below might be a bit overwhelming if this is your first time doing an ASM teardown. That's ok. It's important to know that you don't have to understand everything about everything. But I wanted to be thorough for the curious.)

Sure, there are other ways to tell that you've produced good code: vectorization reports from the compiler, measuring performance, guessing what the compiler will do and predicting clock counts. But reading the ASM is the only way to see the 'truth'.

Before I do my rundown, I feel I must confess that I snuck in a gcc command option, "-ftree-vectorize" above. Indeed, both icc and now gcc vectorize this loop. icc vectorizes by default; gcc requires an option.

So let's pull it apart one line at a time.

..B1.3:                         # Preds ..B1.2

As Joseph Pingenot pointed out, ..B1.3 is indeed a label. You'll notice, for instance, that at the bottom of the ASM code I included there is a jump to label ..B1.4. More on the jump in a bit.

But notice also this cute bit of info: Preds ..B1.2. Let's say you want to figure out all the different ways you could have gotten to this block. Maybe there are 5 different jumps that take you to this label. And maybe the code ahead of this just falls through to this point with no jump. Preds lets you figure out how you can get here. I will interject that these labels are also really definitions of basic blocks, which are spans of code with no jumps in and no jumps out (control enters at the top and leaves at the bottom). Basic block analysis is an important element of code generation and tuning, but I'll save that for another day.

Now let's move on to the MOVs.

 movaps      (%rsp,%rax,4), %xmm0                        #14.3

 movaps    16(%rsp,%rax,4), %xmm1                        #14.3

Note the #14.3 business on the right. I love this. This is one of the least-discussed and best features of the Intel compiler - clean, easy-to-read line number annotation. This means this ASM code came from line 14 of your code (which I have commented in my mini listing above). This, critically, is how I find the ASM I care about in the huge mass of assembly you usually get. In vim I just search for \#13\.\|\#14\. and let the highlighting do the work for me.

Now on to the move itself. This is on Linux, and Linux toolchains use AT&T assembly syntax, which means the destination is on the right. If you work on Windows™ platforms you will see the reverse. You will also see the reverse if you follow my advice and consult the Intel Software Developer's Manual. I dunno, I like having the destination on the right, but neither Federico nor Ted consulted with me.

Anyway, in both these lines we are storing to registers; xmm0 and xmm1. Pretty simple. We'll see those registers get consumed shortly.

But how about the source, the left-hand side? In both cases these are memory loads from addresses computed from registers.

 -    (%rsp, %rax, 4)  --> load from the address rsp + rax*4 (this is x[i])

 -  16(%rsp, %rax, 4)  --> load from the address rsp + rax*4 + 16

At first this looked overly complicated to me, but there is some genius going on in here. If you peek ahead you'll notice that we ultimately need to deal with 4 different memory locations each loop iteration.

    (%rsp, %rax, 4)  # points to x[i]

  16(%rsp, %rax, 4)  # points to x[i+4]

4000(%rsp, %rax, 4)  # points to y[i]

4016(%rsp, %rax, 4)  # points to y[i+4]

(The astute reader will wonder why, if we are doing 4 floats per instruction in SSE, we need 8... stay tuned)

If we just had one pointer for each memory location, we'd have to increment 4 pointers each loop iteration. By cleverly using this addressing mode we only have to increment one thing, which is the register rax.

As I've beaten into you by now, "PS is good and SS is bad". PS is good because it really does SIMD work - all the 'lanes' are used. So the MOVAPS means, load 4 sequential floats into the destination. That is, xmm0 and xmm1 will each have 4 ready-to-process floats.

Similarly the ADDPS will then add 4 floats in one shot. And truly, all things equal, if you measure the time an ADDPS takes vs. an ADDSS you will see (give or take) 4 times the throughput. This is the payoff!

        addps     4000(%rsp,%rax,4), %xmm0                      #14.3

        addps     4016(%rsp,%rax,4), %xmm1                      #14.3

But again, why are we doing 2 ADDPS's?

The answer is revealed as we look at the aforementioned increment of our loop variable, rax.

        addq      $8, %rax                                      #13.2

        cmpq      $1000, %rax                                   #13.2

        jb        ..B1.4        # Prob 99%                      #13.2

The addq (add a quadword, or 64-bit value) adds 8 to rax. Then cmpq compares the quadword rax against 1000. This sets flags; the jb ('jump if below') will jump to ..B1.4 if rax is below 1000 as an unsigned value.

We saw before that our MOVAPS and ADDPS have computed addresses that increase each loop by rax*4 bytes. Here we are adding 8 to rax, which means we jump 32 bytes each loop. But an SSE register only holds 16 bytes (128 bits). Aren't we skipping every other array slot?

The answer: look back at those mysterious second MOVAPS and ADDPS instructions. They load and process the next 4 array entries. Note both have an offset 16 bytes higher.

At the first level, the compiler can 'compress' or 'vectorize' (the proper term) the loop so that it does the work of 1000 trips in 250, doing 4 operations per trip via the 'PS' instructions. If you look back at my first post, you'll see the ASM there only increments rax by 1.

But the Intel® C++ Compiler takes it to the next level by cleverly cutting the trip count to 125. The registers aren't wide enough to hold 8 floats, so it just uses a second register. The upshot is, we do fewer trips and potentially increase performance by doing more sequential, adjacent loads at once.

The much more concise way of saying this, thanks to my colleague Greg Junker, is that the compiler has unrolled the vectorized loop once.

Ok, fingers worn out. Did you get this far on this epic posting? I realize someone asked about alignment and I skipped talking about it. But please continue to comment - the comments are my fuel to do more!



Joseph Pingenot:

Is there some sort of website where I can enter a mnemonic and get the description of it (opcode, modes, etc.)? The Friendly Manual is rather large and unwieldy.

Joseph Pingenot:

FWIW, with gcc 4.6 on ubuntu precise, gcc -g -S -march=native -O3 blah.cpp gives me significantly different ASM from what you list for gcc:
.loc 1 16 0 is_stmt 1 discriminator 2
movaps (%rbx,%rax), %xmm0
addps 4000(%rsp,%rax), %xmm0
movaps %xmm0, (%rbx,%rax)
addq $16, %rax
cmpq $4000, %rax
jne .L3
.loc 1 6 0
leaq 4000(%rsp), %r12
.loc 1 16 0
movq %rsp, %rbx
jmp .L7

Note that my line numbers are one larger than yours (I added a comment at the top about where the code came from)

Joseph Pingenot:

Two more things:
0) why ..B.4 instead of .B.4? Is this just a difference in syntax between the two compilers, or are the two periods salient?

1) What determines the label names? Surely icc doesn't roll a D20 and call it by the letter....

2) Why jb instead of jne? The opcodes are different and it looks from the manual like the side-effects may be different.

3) Sure. the "13" in "#13.2" is the line number; what's the second number?

4) The pred stuff is pretty neat.

5) For those interested in playing but don't have an icc license, it's worth pointing out that icc on linux is free for (very strict) non-commercial use.

Now I'll have to restart my browser. I put 5 points in 2 and probably trashed its stack.

Joseph Pingenot:

> Before I do my rundown, I feel I must confess that I snuck in a gcc command option, "-ftree-vectorize" above. Indeed, both icc and now gcc vectorize this loop. icc vectorizes by default; gcc requires an option.

What version of gcc are you using for these? -ftree-vectorize "is enabled by default at -O3" according to TFM for 4.6.3 (Ubuntu Precise).

> (The astute reader will wonder why, if we are doing 4 floats per instruction in SSE that we need 8...stay tuned)

The reader following your pointers from the second article will already have noted the $8 in the loop counter increment and realize that we're doing 8 iterations at once.. :)

> I realize someone asked about alignment and I skipped talking about it.

*cough* :)

So clearly the compiler's also taking advantage of the fact that x and y are adjacent in memory and aligned (perhaps because they're automatically allocated?)

How does one do this for two malloc'ed arrays? "This" meaning both the one-counter-two-pointers optimization and the alignment issue? Preferably from C, since gratuitous ASM usage is generally frowned upon.

What's a good way to deal with memory latency? In the example above, we effectively have two operation sets: load-add-save on xmm0 and load-add-save on xmm1. Are the fetches coalesced into one because they're adjacent? Are there other, more optimal arrangements so that we're doing something while the memory read/write is pending?

Many thanks for the explanation of the indirect addressing. I was just about to comment about it but you beat me to the punch. :)

The ARM NEON tutorial blog series also discussed how to deal optimally with the ragged-end case. Here, not every loop will be evenly divisible by 4 or 8. Will this be addressed soon? I saw some AVX masked instructions that look interesting, perhaps here.

How is it possible to address specific lanes? Is this the masked set of instructions?
