SIMD tuning with ASM pt. 2 - Your First Dump

(here's part 1 in case you missed it)

Let's take a really, really simple program. It just adds one array of floats to another. This should be a SIMD slam dunk! I will call this program blah.cpp.

 #include <iostream>

 #define PTS 1000

 int main() {
     float x[PTS];
     float y[PTS];

     for (int i = 0; i < PTS; i++) {
         x[i] = 0.0f; // set up some data
         y[i] = ((float)i)*1.1f;
     }
     for (int i = 0; i < PTS; i++) {
         x[i] += y[i]; // do some math on the data
     }
     for (int i = 0; i < PTS; i++) {
         std::cout << x[i] << std::endl; // print the results
     }
     return 0;
 }

As in all my posts in this series, we'll use Linux to look at this. As I'm sure you know, to compile this into an executable you simply run one of:

g++ -msse4a -O3 blah.cpp # gcc

icc -msse4.2 -O3 blah.cpp # Intel® C++ Compiler for Linux

Why don't you go ahead and copy this program onto your machine and compile it, just to make sure it works? When you run it you should see a long list of numbers ending with '1098.9'.

But you came here for ASM. Let's do just a little bit today. To generate ASM, simply add the -S flag:

g++ -S -msse4a -O3 blah.cpp # gcc

icc -S -msse4.2 -O3 blah.cpp # Intel® C++ Compiler for Linux

Now you can guess that I am probably biased towards Intel products, and I am a bit. But I can say for certain that I prefer the Intel® compiler's assembly output for more reasons than one. Since everyone has gcc, and not everyone has icc, I will start by looking at the gcc output.

 So in your directory you should now have a file called blah.s. Open it up. My favorite way to work is with vim. I usually put the ASM on the top and the source (with line numbers turned on) on the bottom. Here's what my setup looks like with the gcc ASM.

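Now actually I skipped a step and jumped to the part I'm interested in. The cursor is already at the part where the work really happens - the add on line 14. The .loc 1 14 is compiler-debug-speak for file 1, line 14. (If your blah.s has no .loc lines, try adding -g to the compile command; gcc only writes them when debug info is turned on.)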


        .loc 1 14 0
        movss   (%rbp,%rax,4), %xmm0
        addss   (%rdx,%rax,4), %xmm0
        movss   %xmm0, (%rbp,%rax,4)
        addq    $1, %rax
        .loc 1 13 0
        cmpq    $1000, %rax
        jne     .L4
        xorl    %ebx, %ebx
        .p2align 4,,7

I'll pull this apart more next time, but for now, hopefully you can, perhaps through a glass darkly, see the glimmer of a loop...some adding, some comparing (cmpq), some jumping (jne), and some moving/storing (movss).

I'll leave you with a question - is this loop using SSE? Is it using SIMD?

Super bonus extra credit - try the Intel® C++ Compiler for Linux (icc) and compare. (I'll do that next time)
