missed optimization on SSE multiply-add loop

missed optimization on SSE multiply-add loop

The following code

for (int i = 0; i < 1000000; ++i) {
x[0] = _mm_add_ps(_mm_mul_ps(alpha[0], x[0]), y[0]);
x[1] = _mm_add_ps(_mm_mul_ps(alpha[1], x[1]), y[1]);
x[2] = _mm_add_ps(_mm_mul_ps(alpha[2], x[2]), y[2]);
x[3] = _mm_add_ps(_mm_mul_ps(alpha[3], x[3]), y[3]);
}
produces this assembly on icc 11.1 on Linux: 403679: 33 c0 xor %eax,%eax
40367b: 0f 28 84 24 90 02 00 00 movaps 0x290(%rsp),%xmm0
403683: 0f 59 84 24 f0 01 00 00 mulps 0x1f0(%rsp),%xmm0
40368b: 0f 28 8c 24 d0 01 00 00 movaps 0x1d0(%rsp),%xmm1
403693: 0f 59 8c 24 00 02 00 00 mulps 0x200(%rsp),%xmm1
40369b: 0f 28 94 24 80 02 00 00 movaps 0x280(%rsp),%xmm2
4036a3: 0f 59 94 24 10 02 00 00 mulps 0x210(%rsp),%xmm2
4036ab: 0f 28 9c 24 70 02 00 00 movaps 0x270(%rsp),%xmm3
4036b3: 0f 59 9c 24 20 02 00 00 mulps 0x220(%rsp),%xmm3
4036bb: 0f 58 84 24 30 02 00 00 addps 0x230(%rsp),%xmm0
4036c3: 0f 58 8c 24 40 02 00 00 addps 0x240(%rsp),%xmm1
4036cb: 0f 58 94 24 50 02 00 00 addps 0x250(%rsp),%xmm2
4036d3: 0f 58 9c 24 60 02 00 00 addps 0x260(%rsp),%xmm3
4036db: 0f 29 84 24 90 02 00 00 movaps %xmm0,0x290(%rsp)
4036e3: 0f 29 8c 24 d0 01 00 00 movaps %xmm1,0x1d0(%rsp)
4036eb: 0f 29 94 24 80 02 00 00 movaps %xmm2,0x280(%rsp)
4036f3: 0f 29 9c 24 70 02 00 00 movaps %xmm3,0x270(%rsp)
4036fb: ff c0 inc %eax
4036fd: 3d 40 42 0f 00 cmp $0xf4240,%eax
403702: 0f 8c 73 ff ff ff jl 40367b That is about half the performance that is possible and which gcc 4.4.0 can reach. The gcc assembly looks like this: 402474: 0f 28 7c 24 10 movaps 0x10(%rsp),%xmm7
402479: 31 c0 xor %eax,%eax
40247b: 0f 28 34 24 movaps (%rsp),%xmm6
40247f: 0f 28 6c 24 30 movaps 0x30(%rsp),%xmm5
402484: 0f 28 64 24 20 movaps 0x20(%rsp),%xmm4
402489: 0f 28 5c 24 40 movaps 0x40(%rsp),%xmm3
40248e: 0f 28 54 24 50 movaps 0x50(%rsp),%xmm2
402493: 0f 28 4c 24 60 movaps 0x60(%rsp),%xmm1
402498: 0f 28 44 24 70 movaps 0x70(%rsp),%xmm0
40249d: 44 0f 28 84 24 80 00 00 00 movaps 0x80(%rsp),%xmm8
4024a6: 44 0f 28 8c 24 90 00 00 00 movaps 0x90(%rsp),%xmm9
4024af: 44 0f 28 94 24 a0 00 00 00 movaps 0xa0(%rsp),%xmm10
4024b8: 44 0f 28 9c 24 b0 00 00 00 movaps 0xb0(%rsp),%xmm11
4024c1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
4024c8: 44 0f 59 c7 mulps %xmm7,%xmm8
4024cc: 83 c0 01 add $0x1,%eax
4024cf: 3d 40 42 0f 00 cmp $0xf4240,%eax
4024d4: 44 0f 59 ce mulps %xmm6,%xmm9
4024d8: 44 0f 59 d5 mulps %xmm5,%xmm10
4024dc: 44 0f 58 c3 addps %xmm3,%xmm8
4024e0: 44 0f 59 dc mulps %xmm4,%xmm11
4024e4: 44 0f 58 ca addps %xmm2,%xmm9
4024e8: 44 0f 58 d1 addps %xmm1,%xmm10
4024ec: 44 0f 29 84 24 80 00 00 00 movaps %xmm8,0x80(%rsp)
4024f5: 44 0f 58 d8 addps %xmm0,%xmm11
4024f9: 44 0f 29 8c 24 90 00 00 00 movaps %xmm9,0x90(%rsp)
402502: 44 0f 29 94 24 a0 00 00 00 movaps %xmm10,0xa0(%rsp)
40250b: 44 0f 29 9c 24 b0 00 00 00 movaps %xmm11,0xb0(%rsp)
402514: 75 b2 jne 4024c8

This is 10 GFLOP/s on a Q6600, where the icc code reaches 5 GFLOP/s.
The details of why the icc code is slower is not entirely clear to me. Is it the instruction decoder? (because the loop just barely doesn't fit into the instruction cache anymore) Or is the icc code worse at making use of instruction level parallelism?

Is there anything I can do to make icc generate the most performant variant? (e.g. gcc needs to be forced to actually use different registers for x[])

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc
11 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

It would be nice if you can help us with code that we can compile. Also please help us with compiler options.

Quoting - Om Sachan (Intel)

It would be nice if you can help us with code that we can compile. Also please help us with compiler options.

Sure. The full thing can be found at http://gitorious.org/vc. The part of the testcase that is relevant here is:

#include "benchmark.h"
#include      
#include     

static const int factor = 1000000;

static float randomF(float min, float max)
{                                         
    const float delta = max - min;        
    return min + delta * rand() / RAND_MAX;
}                                          

static float randomF12() { return randomF(1.f, 2.f); }

int main()
{         
    int blackHole = true;
    {           
        Benchmark timer("SAXPY (reference)", 8. * float_v::Size * factor, "FLOP");
        for (int repetitions = 0; repetitions < 10; ++repetitions) {              
#ifdef USE_SSE                                                                    
            __m128 tmp = _mm_set1_ps(static_cast(repetitions));            
            const __m128 oPoint2 = _mm_set1_ps(randomF(.1f, .2f));                
            const __m128 oPoint1 = _mm_set1_ps(randomF(.1f, .2f));                
            const __m128 alpha[4] = {                                             
                _mm_add_ps(tmp, oPoint2),                                         
                _mm_sub_ps(tmp, oPoint2),                                         
                _mm_add_ps(tmp, oPoint1),                                         
                _mm_sub_ps(tmp, oPoint1)                                          
            };                                                                    
            __m128 x[4] = { _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()) };
            const __m128 y[4] = { _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()) };

            timer.Start();
            ///////////////////////////////////////

            for (int i = 0; i < factor; ++i) {
                    x[0] = _mm_add_ps(_mm_mul_ps(alpha[0], x[0]), y[0]);
                    x[1] = _mm_add_ps(_mm_mul_ps(alpha[1], x[1]), y[1]);
                    x[2] = _mm_add_ps(_mm_mul_ps(alpha[2], x[2]), y[2]);
                    x[3] = _mm_add_ps(_mm_mul_ps(alpha[3], x[3]), y[3]);
            }                                                           

            ///////////////////////////////////////
            timer.Stop();                          

            const int k = _mm_movemask_ps(_mm_add_ps(_mm_add_ps(x[0], x[1]), _mm_add_ps(x[2], x[3])));
            blackHole &= k;
        }
        timer.Print(Benchmark::PrintAverage);
    }
    if (blackHole != 0) {
        std::cout << std::endl;
    }
    return 0;
}

Find benchmark.h at http://gitorious.org/vc/vc/blobs/master/benchmarks/benchmark.h.

gcc compiles with -O3
icc compiles with -mieee-fp -O3 -ansi-alias

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

The details of why the icc code is slower is not entirely clear to me. Is it the instruction decoder? (because the loop just barely doesn't fit into the instruction cache anymore) Or is the icc code worse at making use of instruction level parallelism?

Is there anything I can do to make icc generate the most performant variant?

Matthias,

I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all these loads next to each other. This should put some stress on the OOO engine.

Have you tried to experiment with "#pragma unroll"? Reducing the unrolling might improve the situation.

Kind regards
Thomas

Quoting - Thomas Willhalm (Intel)

I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all these loads next to each other. This should put some stress on the OOO engine.

Have you tried to experiment with "#pragma unroll"? Reducing the unrolling might improve the situation.

The LSD is what I meant. I just didn't recall the right name...

No, I didn't try #pragma unroll. You mean I should try to let icc unroll the whole 1000000 iterations loop? I guess you meant something else.

The "manual unrolling" that is in the code is there to enable the code to make use of instruction level parallelism. If you loop only over the 0s then 1s... then you have a dependency chain between all the instructions in the loop and there's nothing left to execute in parallel. The four independent multiply-add make gcc able to reach peak-performance.

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

Quoting - Matthias Kretz
No, I didn't try #pragma unroll. You mean I should try to let icc unroll the whole 1000000 iterations loop? I guess you meant something else.

Matthias,

Actually I meant that you should reduce the loop unrolling. However, I must confess that I missed that the unrollingwas not done by the compiler but "by hand". Is this loop only there to increase the time for the measurement? I don't see that the loop body depends on the loop variable.

Both, the GNU compiler and the Intel compiler can unroll loops automatically and let thembother with it. You can give the Intel compiler a hint with "#pragma unroll(n)" if it doesn't guess the optimal unrolling.

Kind regards
Thomas

Quoting - Thomas Willhalm (Intel)

Actually I meant that you should reduce the loop unrolling. However, I must confess that I missed that the unrollingwas not done by the compiler but "by hand". Is this loop only there to increase the time for the measurement? I don't see that the loop body depends on the loop variable.

The loop that is there is only supposed to increase the time, yes. This is a benchmark of my SSE vector classes (i.e. how well they perform compared to directly using intrinsics).

The inner, "unrolled" loop is there to enable the CPU to make use of instruction level parallelism. The calculation is completely arbitrary, but there must be a dependency on the result of the calculation otherwise gcc will optimize everything away. Because of that data dependency instruction level parallelism doesn't work if there's just one multiply add line.

I actually started with just one line and you get about 1/4 of the performance then...

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

Hmm, would it be possible to come up with some benchmark that is based on a realistic example? Using arbitrary computation might easily end up in corner cases that is not too relevant for the final usage of your vector class.

Kind regards
Thomas

Quoting - Thomas Willhalm (Intel)

Hmm, would it be possible to come up with some benchmark that is based on a realistic example? Using arbitrary computation might easily end up in corner cases that is not too relevant for the final usage of your vector class.

That's certainly a good point. What I am testing in this benchmark is the comparison of the vector class against the intrinsics (the vector class really just boils down to the same intrinsics code if the compiler does its job right).

The part of the benchmark that I posted here is only the intrinsics reference code. The vector class part performs a factor of ~two worse than the reference on icc. So the final results is:
icc vector class: 2.9 GFLOP/s
icc intrinsics: 4.7 GFLOP/s
gcc vector class: 10.9 GFLOP/s
gcc intrinsics: 10.9 GFLOP/s

I didn't come around to analyzing the icc vector class code in depth yet. First I wanted to understand what happened to the intrinsics part of the code as that is the easier part. I am able to get full speed by editing the icc generated asm:

Compile with icpc -S and edit the output. Move the four movaps that copy from the stack to the xmm registers out of the loop, then compile and link and that part of the code now also shows 10.9 GFLOP/s.

Alternatively I tried to reduce the number of multiply-adds inside the loop to three which results in a loop that fits into the LSD. But that's even a worse result: 4.4 GFLOP/s.

May I assume that the LSD does not help the performance here, but only lowers the power consumption?

Is it then the correct explanation that the movaps from the stack to the xmm registers introduces so much extra data dependencies that the CPU cannot work at full speed?

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

I just tried to increase the number of multiply-add lines. With 6 lines I get 9.5 GFLOP/s (subsequent runs show numbers between 8.8 and 10.4) and with 8 lines 9.3 GFLOP/s (rather stable numbers).

So increasing the number of independent calculations reduces the problem of data dependencies and we get higher performance.

It would, of course, be better if the unnecessary data dependencies were not there in the first place, though.

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

...

Leave a Comment

Please sign in to add a comment. Not a member? Join today