SIMD tuning with ASM pt. 4 - Vectorization & ICC

If you remember from my first post, I presented a program. Stripping out the setup, the part we care about - the part that does the work - is this:


        float x[PTS];
        float y[PTS];

        for (int i = 0; i < PTS; i++) { // line 13 in orig source
                x[i] += y[i];           // line 14 in orig source
        }
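
(If you want a complete file to feed to the compilers below, something like the following should do. The PTS value of 1000 is implied by the assembly we'll look at in a minute; the initialization and the final sum are just my stand-ins for the setup I stripped out, there mostly to keep the compiler from throwing the loop away - though a sufficiently clever compiler may see through my constant initialization where the real program's setup would not let it.)

// blah.cpp - my stand-in for the full program from the first post
#define PTS 1000

int main() {
        float x[PTS];
        float y[PTS];

        for (int i = 0; i < PTS; i++) {   // fill the arrays with something
                x[i] = (float)i;
                y[i] = (float)(2 * i);
        }

        for (int i = 0; i < PTS; i++) {   // the loop we care about
                x[i] += y[i];
        }

        float sum = 0.0f;                 // use the result so it isn't optimized away
        for (int i = 0; i < PTS; i++)
                sum += x[i];
        return (int)sum;
}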



And if you'll remember from my last post, I showed that at least with gcc and standard optimization flags we were getting SSE, but not packed (SIMD) SSE. That is, the processor is doing SSE adds, but they are the 'SS' variety that operates on only one element at a time (not four).
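
If it helps to see that difference in C terms rather than in ASM, here's a rough sketch using SSE intrinsics (my own illustration, not code from the original program): the 'SS' flavor touches one float per instruction, the 'PS' flavor touches four.

#include <xmmintrin.h>

// The 'SS' way: one float per ADDSS, like what gcc gave us last time.
void add_scalar(float *x, const float *y, int n) {
        for (int i = 0; i < n; i++) {
                __m128 a = _mm_load_ss(&x[i]);          // low lane only
                __m128 b = _mm_load_ss(&y[i]);
                _mm_store_ss(&x[i], _mm_add_ss(a, b));  // one add per trip
        }
}

// The 'PS' way: four floats per ADDPS. Assumes n is a multiple of 4 and
// that the arrays are 16-byte aligned (more on alignment another day).
void add_packed(float *x, const float *y, int n) {
        for (int i = 0; i < n; i += 4) {
                __m128 a = _mm_load_ps(&x[i]);          // x[i..i+3]
                __m128 b = _mm_load_ps(&y[i]);
                _mm_store_ps(&x[i], _mm_add_ps(a, b));  // four adds per trip
        }
}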

Today I'm switching to the Intel® C++ Compiler for Linux. I will call it by its pet name, icc, from here on in. You can certainly keep playing along with gcc too, though. Here's how you compile:


icc -msse4.2 -O3 blah.cpp                 # Intel® C++ Compiler for Linux
gcc -msse4a -O3 -ftree-vectorize blah.cpp # gcc
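
(And if you want to stare at the generated assembly yourself rather than take my word for it, adding -S to either command line makes the compiler write out a .s assembly file instead of an object file:)

icc -msse4.2 -O3 -S blah.cpp                 # writes blah.s
gcc -msse4a -O3 -ftree-vectorize -S blah.cpp # writes blah.s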



 Now let's see what we get:


..B1.3:                         # Preds ..B1.2
        xorl      %eax, %eax                                    #13.2
                                # LOE rax rbx r12 r13 r14 r15
..B1.4:                         # Preds ..B1.4 ..B1.3
        movaps        (%rsp,%rax,4), %xmm0                      #14.3
        movaps      16(%rsp,%rax,4), %xmm1                      #14.3
        addps     4000(%rsp,%rax,4), %xmm0                      #14.3
        addps     4016(%rsp,%rax,4), %xmm1                      #14.3
        movaps    %xmm0, (%rsp,%rax,4)                          #14.3
        movaps    %xmm1, 16(%rsp,%rax,4)                        #14.3
        addq      $8, %rax                                      #13.2
        cmpq      $1000, %rax                                   #13.2
        jb        ..B1.4        # Prob 99%                      #13.2
                                # LOE rax rbx r12 r13 r14 r15



If I impart nothing else to you, dear reader, let it be the ability to recognize that code like this is good. Just as many aspiring astronomers perhaps first observe Orion's Belt as a curiously aligned set of stars but later blossom to where they can spot the heretofore-chaotic astral dots as the Hunter in all his glory in an instant, you too can learn to spot that this code is good, and that the code I showed you before is bad.

(I can anticipate that the detail I go into below might be a bit overwhelming if this is your first time doing an ASM teardown. That's OK. It's important to know that you don't have to understand everything about everything. But I wanted to be thorough for the curious.)

Sure, there are other ways to tell that you've produced good code - vectorization reports from the compiler, measuring performance, guessing what the compiler will do and predicting clock counts. But this is the only way to see the 'truth'.

Before I do my rundown, I feel I must confess that I snuck in a gcc command-line option, "-ftree-vectorize", above. Indeed, both icc and now gcc vectorize this loop. icc vectorizes by default; gcc requires an option.
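
(If you'd rather have the compiler tell you what it vectorized - the vectorization reports I mentioned above - something along these lines should work, though the exact spelling of these flags has changed across compiler versions, so check your compiler's documentation:)

icc -msse4.2 -O3 -vec-report2 blah.cpp                                # icc vectorization report
gcc -msse4a -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 blah.cpp # gcc report (newer gccs use -fopt-info-vec)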

So let's pull it apart one line at a time.


..B1.3:                         # Preds ..B1.2



As Joseph Pingenot pointed out, ..B1.3 is indeed a label. You'll notice, for instance, that at the bottom of the ASM code I included there is a jump to the label ..B1.4. More on the jump in a bit.

But notice also this cute bit of info: Preds ..B1.2. Let's say you want to figure out all the different ways you could have gotten to this block. Maybe there are 5 different jumps that take you to this label. And maybe the code ahead of this just falls through to this point with no jump. Preds lets you figure out how you can get here. I will interject that these labels are also really definitions of basic blocks, which are chunks or spans of code with no jumps in except at the top and no jumps out except at the bottom. Basic block analysis is an important element of code generation and tuning, but I'll save that for another day.

Now let's move on to the MOVs.


 movaps      (%rsp,%rax,4), %xmm0                        #14.3
 movaps    16(%rsp,%rax,4), %xmm1                        #14.3



Note the #14.3 business on the right. I love this. This is one of the least-discussed and best features of the Intel compiler - clean, easy-to-read line-number annotation. It means this ASM code came from line 14 of your source (the .3 is the column), which I have commented in my mini listing above. This, critically, is how I find the ASM I care about in the huge mass of assembly you usually get. In vim I just search for \#13\.\|\#14\. and let the highlighting do the work for me.
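
(If vim isn't your thing, a plain grep over the .s file does the same trick; for example:)

grep -n '#1[34]\.' blah.s    # show every line of assembly that came from source lines 13 or 14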

Now on to the move itself. This is on Linux, and Linux uses AT&T assembly syntax, which means the destination is on the right. If you work on Windows™ platforms you will see the reverse (Intel syntax, with the destination on the left). You will also see the reverse if you follow my advice and consult the Intel Software Developer's Manual. I dunno, I like having the destination on the right, but neither Federico nor Ted consulted with me.

Anyway, in both these lines the destinations are registers: xmm0 and xmm1. Pretty simple. We'll see those registers get consumed shortly.

But how about the source, the left-hand side? In both cases these are memory loads, from addresses computed out of registers.


 -    (%rsp, %rax, 4)  --> load from the address rsp + rax*4       (this is x[i])
 -  16(%rsp, %rax, 4)  --> load from the address rsp + rax*4 + 16  (this is x[i+4])



At first this looked overly complicated to me, but there is some genius going on in here. If you peek ahead you'll notice that we ultimately need to deal with 4 different memory locations each loop iteration.


    (%rsp, %rax, 4)  # points to x[i]
  16(%rsp, %rax, 4)  # points to x[i+4]
4000(%rsp, %rax, 4)  # points to y[i]
4016(%rsp, %rax, 4)  # points to y[i+4]



(The astute reader will wonder why, if SSE does 4 floats per instruction, we need to touch 8 floats of each array per trip... stay tuned.)

If we just had one pointer for each memory location, we'd have to increment 4 pointers each loop iteration. By cleverly using this addressing mode we only have to increment one thing: the register rax. For example, on the second trip through the loop rax is 8, so (%rsp,%rax,4) resolves to rsp + 32, which is &x[8], and 4000(%rsp,%rax,4) resolves to rsp + 4032, which is &y[8].

As I've beaten into you by now, "PS is good and SS is bad". PS is good because it really does SIMD work - all the 'lanes' are used. So MOVAPS (move aligned packed single-precision) means: load 4 sequential floats into the destination. That is, xmm0 and xmm1 will each hold 4 ready-to-process floats.

Similarly, the ADDPS will then add 4 floats in one shot. And truly, all things being equal, if you measure the time an ADDPS takes vs. an ADDSS you will see (give or take) 4 times the throughput. This is the payoff!


        addps     4000(%rsp,%rax,4), %xmm0                      #14.3
        addps     4016(%rsp,%rax,4), %xmm1                      #14.3



But again, why are we doing 2 ADDPS's?

The answer is revealed as we look at the aforementioned increment of our loop variable, rax.


        addq      $8, %rax                                      #13.2
        cmpq      $1000, %rax                                   #13.2
        jb        ..B1.4        # Prob 99%                      #13.2



The addq (add a quadword, or 64-bit value) adds 8 to rax. Then cmpq compares that quadword against 1000, which sets the flags. The jb ('jump if below') then jumps back to ..B1.4 if rax is still below 1000. In C terms it's roughly: rax += 8; if (rax < 1000) goto top_of_loop;

We saw before that our MOVAPS and ADDPS compute their addresses by scaling rax by 4 bytes. Here we are adding 8 to rax, which means the addresses advance by 32 bytes each loop. But an SSE register only holds 16 bytes (128 bits). Aren't we skipping every other group of four array slots?

And the answer is no... now look back at those mysterious secondary MOVAPS's and ADDPS's. They load and process the next 4 array entries. Note that both carry an offset that is 16 bytes larger.

At the first level, the compiler can 'compress' or 'vectorize' (the proper term) the loop so that 1000 iterations' worth of work gets done in 250 trips - doing 4 operations per trip via the 'PS' instructions. If you look back at my first post, you'll see the ASM there only increments rax by 1.

But the Intel® C++ Compiler takes it to the next level by cleverly cutting the trip count to 125. The registers aren't wide enough to hold 8 floats, so it simply uses two of them, xmm0 and xmm1. The upshot is that we do fewer trips around the loop and potentially increase performance by issuing more sequential, adjacent loads at once.

The much more concise way of saying this, thanks to my colleague Greg Junker, is that the compiler has unrolled the vectorized loop by 1 - one extra copy of the body per trip.
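
If it's easier to see in C-ish terms, here is my rough reconstruction of what the transformed loop is doing, written with SSE intrinsics (an illustration, not something the compiler produced):

#include <xmmintrin.h>

#define PTS 1000   // implied by the cmpq $1000 and the 4000-byte offset to y

// 8 floats per trip, so 125 trips instead of 1000. Assumes x and y are
// 16-byte aligned, which MOVAPS requires.
void add_vectorized_unrolled(float *x, const float *y) {
        for (int i = 0; i < PTS; i += 8) {                   // addq $8, %rax / cmpq / jb
                __m128 x0 = _mm_load_ps(&x[i]);              // movaps     (%rsp,%rax,4), %xmm0
                __m128 x1 = _mm_load_ps(&x[i + 4]);          // movaps   16(%rsp,%rax,4), %xmm1
                x0 = _mm_add_ps(x0, _mm_load_ps(&y[i]));     // addps  4000(%rsp,%rax,4), %xmm0
                x1 = _mm_add_ps(x1, _mm_load_ps(&y[i + 4])); // addps  4016(%rsp,%rax,4), %xmm1
                _mm_store_ps(&x[i], x0);                     // movaps %xmm0,   (%rsp,%rax,4)
                _mm_store_ps(&x[i + 4], x1);                 // movaps %xmm1, 16(%rsp,%rax,4)
        }
}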

Ok, fingers worn out. Did you get this far on this epic posting? I realize someone asked about alignment and I skipped talking about it. But please continue to comment - the comments are my fuel to do more!

For more complete information about compiler optimizations, see our Optimization Notice.