missed optimization on SSE multiply-add loop
The following code
for (int i = 0; i < 1000000; ++i) { x[0] = _mm_add_ps(_mm_mul_ps(alpha[0], x[0]),
y[0]); x[1] = _mm_add_ps(_mm_mul_ps(alpha[1], x[1]), y[1]); x[2] = _mm_add_ps(_mm_mul_ps(alpha[2], x[2]),
y[2]); x[3] = _mm_add_ps(_mm_mul_ps(alpha[3], x[3]), y[3]); }
produces this assembly on icc 11.1 on Linux:
403679: 33 c0 xor %eax,%eax 40367b: 0f 28 84 24 90 02 00 00 movaps
0x290(%rsp),%xmm0 403683: 0f 59 84 24 f0 01 00 00 mulps 0x1f0(%rsp),%xmm0 40368b: 0f 28 8c 24 d0 01
00 00 movaps 0x1d0(%rsp),%xmm1 403693: 0f 59 8c 24 00 02 00 00 mulps 0x200(%rsp),%xmm1 40369b: 0f
28 94 24 80 02 00 00 movaps 0x280(%rsp),%xmm2 4036a3: 0f 59 94 24 10 02 00 00 mulps 0x210(%rsp),%xmm2 4036ab: 0f 28 9c 24 70 02 00 00 movaps 0x270(%rsp),%xmm3 4036b3: 0f 59 9c 24 20 02 00 00 mulps
0x220(%rsp),%xmm3 4036bb: 0f 58 84 24 30 02 00 00 addps 0x230(%rsp),%xmm0 4036c3: 0f 58 8c 24 40 02
00 00 addps 0x240(%rsp),%xmm1 4036cb: 0f 58 94 24 50 02 00 00 addps 0x250(%rsp),%xmm2 4036d3: 0f
58 9c 24 60 02 00 00 addps 0x260(%rsp),%xmm3 4036db: 0f 29 84 24 90 02 00 00 movaps %xmm0,0x290(%rsp) 4036e3: 0f 29 8c 24 d0 01 00 00 movaps %xmm1,0x1d0(%rsp) 4036eb: 0f 29 94 24 80 02 00 00 movaps
%xmm2,0x280(%rsp) 4036f3: 0f 29 9c 24 70 02 00 00 movaps %xmm3,0x270(%rsp) 4036fb: ff c0
inc %eax 4036fd: 3d 40 42 0f 00 cmp $0xf4240,%eax 403702: 0f 8c 73 ff ff ff
jl 40367b <main+0xfcb>
That is about half the performance that is possible and which gcc 4.4.0 can reach. The gcc assembly looks like this:
402474: 0f 28 7c 24 10 movaps 0x10(%rsp),%xmm7 402479: 31 c0 xor
%eax,%eax 40247b: 0f 28 34 24 movaps (%rsp),%xmm6 40247f: 0f 28 6c 24 30 movaps
0x30(%rsp),%xmm5 402484: 0f 28 64 24 20 movaps 0x20(%rsp),%xmm4 402489: 0f 28 5c 24 40
movaps 0x40(%rsp),%xmm3 40248e: 0f 28 54 24 50 movaps 0x50(%rsp),%xmm2 402493: 0f 28 4c 24 60
movaps 0x60(%rsp),%xmm1 402498: 0f 28 44 24 70 movaps 0x70(%rsp),%xmm0 40249d: 44 0f 28
84 24 80 00 00 00 movaps 0x80(%rsp),%xmm8 4024a6: 44 0f 28 8c 24 90 00 00 00 movaps 0x90(%rsp),%xmm9
4024af: 44 0f 28 94 24 a0 00 00 00 movaps 0xa0(%rsp),%xmm10 4024b8: 44 0f 28 9c 24 b0 00 00 00 movaps
0xb0(%rsp),%xmm11 4024c1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 4024c8: 44 0f 59 c7 mulps
%xmm7,%xmm8 4024cc: 83 c0 01 add $0x1,%eax 4024cf: 3d 40 42 0f 00 cmp
$0xf4240,%eax 4024d4: 44 0f 59 ce mulps %xmm6,%xmm9 4024d8: 44 0f 59 d5 mulps
%xmm5,%xmm10 4024dc: 44 0f 58 c3 addps %xmm3,%xmm8 4024e0: 44 0f 59 dc mulps
%xmm4,%xmm11 4024e4: 44 0f 58 ca addps %xmm2,%xmm9 4024e8: 44 0f 58 d1 addps
%xmm1,%xmm10 4024ec: 44 0f 29 84 24 80 00 00 00 movaps %xmm8,0x80(%rsp) 4024f5: 44 0f 58 d8
addps %xmm0,%xmm11 4024f9: 44 0f 29 8c 24 90 00 00 00 movaps %xmm9,0x90(%rsp) 402502: 44 0f 29 94
24 a0 00 00 00 movaps %xmm10,0xa0(%rsp) 40250b: 44 0f 29 9c 24 b0 00 00 00 movaps %xmm11,0xb0(%rsp)
402514: 75 b2 jne 4024c8
This is 10 GFLOP/s on a Q6600, where the icc code reaches 5 GFLOP/s. The details of why the icc code is slower
is not entirely clear to me. Is it the instruction decoder? (because the loop just barely doesn't fit into the
instruction cache anymore) Or is the icc code worse at making use of instruction level parallelism?
Is there anything I can do to make icc generate the most performant variant? (e.g. gcc needs to be forced to actually
use different registers for x[])
| |
Re: missed optimization on SSE multiply-add loop
It would be nice if you can help us with code that we can compile. Also please help us with compiler options.
| |
Re: missed optimization on SSE multiply-add loop
It would be nice if you can help us with code that we can compile.
Also please help us with compiler options.
Sure. The full thing can be found at http://gitorious.org/vc. The part of the testcase that is relevant here
is:
#include "benchmark.h"
#include <cstdio>
#include <cstdlib>
static const int factor = 1000000;
static float randomF(float min, float max)
{
const float delta = max - min;
return min + delta * rand() / RAND_MAX;
}
static float randomF12() { return randomF(1.f, 2.f); }
int main()
{
int blackHole = true;
{
Benchmark timer("SAXPY (reference)", 8. * float_v::Size * factor, "FLOP");
for (int repetitions = 0; repetitions < 10; ++repetitions) {
#ifdef USE_SSE
__m128 tmp = _mm_set1_ps(static_cast<float>(repetitions));
const __m128 oPoint2 = _mm_set1_ps(randomF(.1f, .2f));
const __m128 oPoint1 = _mm_set1_ps(randomF(.1f, .2f));
const __m128 alpha[4] = {
_mm_add_ps(tmp, oPoint2),
_mm_sub_ps(tmp, oPoint2),
_mm_add_ps(tmp, oPoint1),
_mm_sub_ps(tmp, oPoint1)
};
__m128 x[4] = { _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()),
_mm_set1_ps(randomF12()) };
const __m128 y[4] = { _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()), _mm_set1_ps(randomF12()),
_mm_set1_ps(randomF12()) };
timer.Start();
///////////////////////////////////////
for (int i = 0; i < factor; ++i) {
x[0] = _mm_add_ps(_mm_mul_ps(alpha[0], x[0]), y[0]);
x[1] = _mm_add_ps(_mm_mul_ps(alpha[1], x[1]), y[1]);
x[2] = _mm_add_ps(_mm_mul_ps(alpha[2], x[2]), y[2]);
x[3] = _mm_add_ps(_mm_mul_ps(alpha[3], x[3]), y[3]);
}
///////////////////////////////////////
timer.Stop();
const int k = _mm_movemask_ps(_mm_add_ps(_mm_add_ps(x[0], x[1]), _mm_add_ps(x[2], x[3])));
blackHole &= k;
}
timer.Print(Benchmark::PrintAverage);
}
if (blackHole != 0) {
std::cout << std::endl;
}
return 0;
}
Find benchmark.h at http://gitorious.org/vc/vc/blobs/master/benchmarks/benchmark.h.
gcc compiles with -O3 icc compiles with -mieee-fp -O3 -ansi-alias
| |
Re: missed optimization on SSE multiply-add loop
The details of why the icc code is slower is not entirely clear to me. Is it the instruction decoder? (because the
loop just barely doesn't fit into the instruction cache anymore) Or is the icc code worse at making use of instruction
level parallelism?
Is there anything I can do to make icc generate the most performant variant?
Matthias,
I am more concerned about the loop stream detector than the instruction cache. Furthermore, I
wonder why ICC puts all these loads next to each other. This should put some stress on the OOO engine.
Have
you tried to experiment with "#pragma unroll"? Reducing the unrolling might improve the situation.
Kind
regards Thomas
| |
Re: missed optimization on SSE multiply-add loop
I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all
these loads next to each other. This should put some stress on the OOO engine.
Have you tried to experiment
with "#pragma unroll"? Reducing the unrolling might improve the situation.
The LSD is what I meant. I just didn't recall the right name...
No, I didn't try #pragma unroll. You
mean I should try to let icc unroll the whole 1000000 iterations loop? I guess you meant something else.
The
"manual unrolling" that is in the code is there to enable the code to make use of instruction level parallelism. If you
loop only over the 0s then 1s... then you have a dependency chain between all the instructions in the loop and there's
nothing left to execute in parallel. The four independent multiply-add make gcc able to reach peak-performance.
| |
Re: missed optimization on SSE multiply-add loop
No, I didn't try #pragma unroll. You mean I should try to let icc unroll
the whole 1000000 iterations loop? I guess you meant something else.
Matthias,
Actually I meant that you should reduce the loop unrolling. However, I must confess that I missed
that the unrolling was not done by the compiler but "by hand". Is this loop only there to increase the time for the
measurement? I don't see that the loop body depends on the loop variable.
Both, the GNU compiler and the
Intel compiler can unroll loops automatically and let them bother with it. You can give the Intel compiler a hint with
"#pragma unroll(n)" if it doesn't guess the optimal unrolling.
Kind regards Thomas
| |
Re: missed optimization on SSE multiply-add loop
Actually I meant that you should reduce the loop unrolling. However, I must confess that I missed that the
unrolling was not done by the compiler but "by hand". Is this loop only there to increase the time for the measurement?
I don't see that the loop body depends on the loop variable.
The loop that is there is only supposed to increase the time, yes. This is a benchmark of my SSE vector classes
(i.e. how well they perform compared to directly using intrinsics).
The inner, "unrolled" loop is there to
enable the CPU to make use of instruction level parallelism. The calculation is completely arbitrary, but there must be
a dependency on the result of the calculation otherwise gcc will optimize everything away. Because of that data
dependency instruction level parallelism doesn't work if there's just one multiply add line.
I actually
started with just one line and you get about 1/4 of the performance then...
| |
Re: missed optimization on SSE multiply-add loop
Hmm, would it be possible to come up with some benchmark that is based on a realistic example?
Using arbitrary computation might easily end up in corner cases that is not too relevant for the final usage of your
vector class.
Kind regards Thomas
| |
Re: missed optimization on SSE multiply-add loop
Hmm, would it be possible to come up with some benchmark that is based on a realistic example?
Using arbitrary computation might easily end up in corner cases that is not too relevant for the final usage of your
vector class.
That's certainly a good point. What I am testing in this benchmark is the comparison of the vector class against
the intrinsics (the vector class really just boils down to the same intrinsics code if the compiler does its job
right).
The part of the benchmark that I posted here is only the intrinsics reference code. The vector class
part performs a factor of ~two worse than the reference on icc. So the final results is: icc vector class: 2.9
GFLOP/s icc intrinsics: 4.7 GFLOP/s gcc vector class: 10.9 GFLOP/s gcc intrinsics: 10.9 GFLOP/s
I didn't come around to analyzing the icc vector class code in depth yet. First I wanted to understand what happened
to the intrinsics part of the code as that is the easier part. I am able to get full speed by editing the icc generated
asm:
Compile with icpc -S and edit the output. Move the four movaps that copy from the stack to the xmm
registers out of the loop, then compile and link and that part of the code now also shows 10.9 GFLOP/s.
Alternatively I tried to reduce the number of multiply-adds inside the loop to three which results in a loop that fits
into the LSD. But that's even a worse result: 4.4 GFLOP/s.
May I assume that the LSD does not help the
performance here, but only lowers the power consumption?
Is it then the correct explanation that the movaps
from the stack to the xmm registers introduces so much extra data dependencies that the CPU cannot work at full
speed?
| |
Re: missed optimization on SSE multiply-add loop
I just tried to increase the number of multiply-add lines. With 6 lines I get 9.5 GFLOP/s (subsequent runs show numbers
between 8.8 and 10.4) and with 8 lines 9.3 GFLOP/s (rather stable numbers).
So increasing the number of
independent calculations reduces the problem of data dependencies and we get higher performance.
It would, of
course, be better if the unnecessary data dependencies were not there in the first place, though.
| |
Re: missed optimization on SSE multiply-add loop
That's certainly a good point. What I am testing in this benchmark is the comparison of the vector class against
the intrinsics (the vector class really just boils down to the same intrinsics code if the compiler does its job
right).
The part of the benchmark that I posted here is only the intrinsics reference code. The vector class
part performs a factor of ~two worse than the reference on icc. So the final results is: icc vector class: 2.9
GFLOP/s icc intrinsics: 4.7 GFLOP/s gcc vector class: 10.9 GFLOP/s gcc intrinsics: 10.9 GFLOP/s
I didn't come around to analyzing the icc vector class code in depth yet. First I wanted to understand what happened
to the intrinsics part of the code as that is the easier part. I am able to get full speed by editing the icc generated
asm:
Compile with icpc -S and edit the output. Move the four movaps that copy from the stack to the xmm
registers out of the loop, then compile and link and that part of the code now also shows 10.9 GFLOP/s.
Alternatively I tried to reduce the number of multiply-adds inside the loop to three which results in a loop that fits
into the LSD. But that's even a worse result: 4.4 GFLOP/s.
May I assume that the LSD does not help the
performance here, but only lowers the power consumption?
Is it then the correct explanation that the movaps
from the stack to the xmm registers introduces so much extra data dependencies that the CPU cannot work at full
speed?
My guess is that the icc implementation of the STL vector is not internally taking advantage of the restrict
keyword and 16 byte alignment like it could. The semantics of STL vectors allow for this as long as an addditional
practical restriction is adhered to in the implementation -- that a vector occupies a contiguous block of memory.
| | |