SIMD tuning with ASM pt. 2 - Your First Dump

(here's part 1 in case you missed it)

Let's take a really, really simple program. It just adds one array of things to another. This should be a SIMD slam dunk! I will call this program blah.cpp.

 #include <iostream>
 #define PTS 1000

 int main()
 {
     float x[PTS];
     float y[PTS];

     for (int i = 0; i < PTS; i++) {
         x[i] = 0.0f; // set up some data
         y[i] = ((float)i)*1.1f;
     }
     for (int i = 0; i < PTS; i++) {
         x[i] += y[i]; // do some math on the data
     }
     for (int i = 0; i < PTS; i++) {
         std::cout << x[i] << std::endl;
     }
     return 0;
 }

 As in all my posts in this series, we'll use Linux to look at this. As I'm sure you know, to compile this into an executable you run one of the following:

g++ -msse4a -O3 blah.cpp # gcc

icc -msse4.2 -O3 blah.cpp # Intel® C++ Compiler for Linux

 Why don't you go ahead and copy this program onto your machine and compile it, just to make sure it works? When you run it you should see a long list of numbers ending with '1098.9'.

 But you came here for ASM. Let's do just a little bit today. To generate ASM, add the -S flag (plus -g for gcc, so that the output includes the .loc line markers we'll look at in a moment):

g++ -S -g -msse4a -O3 blah.cpp # gcc

icc -S -msse4.2 -O3 blah.cpp # Intel® C++ Compiler for Linux

 Now you can guess that I am probably biased towards Intel products, and I am, a bit. But I can say for certain that I prefer the Intel® compiler's assembly output, for more reasons than one. Since everyone has gcc and not everyone has icc, though, I will start by looking at the gcc output.

 So in your directory you should now have a file called blah.s. Open it up. My favorite way to work is with vim. I usually put the ASM on the top and the source (with line numbers turned on) on the bottom. Here's what my setup looks like with the gcc ASM.

 Now actually I skipped a step and jumped to the part I'm interested in. The cursor is already at the part where the work really happens: the add on line 14. The .loc 1 14 is compiler-debug-speak for file 1, line 14.


        .loc 1 14 0
        movss   (%rbp,%rax,4), %xmm0
        addss   (%rdx,%rax,4), %xmm0
        movss   %xmm0, (%rbp,%rax,4)
        addq    $1, %rax
        .loc 1 13 0
        cmpq    $1000, %rax
        jne     .L4
        xorl    %ebx, %ebx
        .p2align 4,,7

I'll pull this apart more next time, but for now, hopefully you can, perhaps through a glass darkly, see the glimmer of a loop: some adding (addss, addq), some comparing (cmpq), some jumping (jne), and some moving/storing (movss).
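If the listing still looks opaque, here is a rough C++ transliteration of that loop body, one statement per instruction. This is only a sketch: the function name add_like_the_asm is made up, the locals are named after the registers purely for illustration, and I am assuming from the load/add/store pattern that %rbp holds the address of x and %rdx the address of y.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: a hand-written mirror of the gcc loop shown above.
// Assumption: x plays the role of %rbp and y the role of %rdx.
void add_like_the_asm(float* x, const float* y, std::size_t n)
{
    assert(n > 0);             // the loop tests its condition at the bottom
    std::size_t rax = 0;       // loop index, lives in %rax
    do {
        float xmm0 = x[rax];   // movss (%rbp,%rax,4), %xmm0
        xmm0 += y[rax];        // addss (%rdx,%rax,4), %xmm0
        x[rax] = xmm0;         // movss %xmm0, (%rbp,%rax,4)
        rax += 1;              // addq  $1, %rax
    } while (rax != n);        // cmpq  $1000, %rax ; jne .L4
}
```

The (%rbp,%rax,4) operand reads as "base address plus index times 4", which is just array indexing on 4-byte floats. Calling add_like_the_asm(x, y, PTS) in place of the middle loop of blah.cpp should give the same results.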

I'll leave you with a question - is this loop using SSE? Is it using SIMD?

Super bonus extra credit - try the Intel® C++ Compiler for Linux (icc) and compare. (I'll do that next time)



Joseph Pingenot

the L4 is clearly a label for the loop jump; what are the LBE51 and LBB52? .p2align? Thanks!

Joseph Pingenot

Did someone kick the RSS feed? I just today saw this article and its successor and have been looking out for them for a while now.

Matt Walsh

Thanks, Tom! I deleted the -S by accident but put them back in.

Yes, Tom & Rotem & Dmitry you are right; the ADDSS is a scalar instruction. To an untrained eye it looks SIMD-esque (esp with XMM register usage) but in the end it only does one floating point operation per instruction, which is no better than a plain old non-SSE add.

doclight

I don't have time to test the theory at the moment, but I suspect the gcc compiler is missing information about the underlying architecture. Still an interesting and informative blog entry.

Dmitry Oganezov (Intel)

I'm not an expert in either gcc or SSE, but according to the wiki both instructions, MOVSS and ADDSS, are scalar.

So it seems like the loop does not use SSE. Hope you'll get it fixed without a headache ;)

anonymous

Your gcc command-lines need to be tweaked: at least I had to in order to build for SLES11.

1. gcc doesn't include the C++ runtime, you need to compile with g++ or add -lstdc++ to the gcc command-line

2. In the command-line to generate the source, you say to add -S to get the source listing, but the command-line you show doesn't include it. And, to get the location info in the generated assembly, you need to compile with -g.

Also, you don't specify which version of gcc you're using: the code generated with 4.3.4 (default on SLES11) and 4.6.1 is different from what you show.

And no, it doesn't look like GCC is doing much with SSE here.
