The following code
for (int i = 0; i < 1000000; ++i) {
    x[0] = _mm_add_ps(_mm_mul_ps(alpha[0], x[0]), y[0]);
    x[1] = _mm_add_ps(_mm_mul_ps(alpha[1], x[1]), y[1]);
    x[2] = _mm_add_ps(_mm_mul_ps(alpha[2], x[2]), y[2]);
    x[3] = _mm_add_ps(_mm_mul_ps(alpha[3], x[3]), y[3]);
}
produces this assembly on icc 11.1 on Linux:
403679: 33 c0 xor %eax,%eax
40367b: 0f 28 84 24 90 02 00 00 movaps 0x290(%rsp),%xmm0
403683: 0f 59 84 24 f0 01 00 00 mulps 0x1f0(%rsp),%xmm0
40368b: 0f 28 8c 24 d0 01 00 00 movaps 0x1d0(%rsp),%xmm1
403693: 0f 59 8c 24 00 02 00 00 mulps 0x200(%rsp),%xmm1
40369b: 0f 28 94 24 80 02 00 00 movaps 0x280(%rsp),%xmm2
4036a3: 0f 59 94 24 10 02 00 00 mulps 0x210(%rsp),%xmm2
4036ab: 0f 28 9c 24 70 02 00 00 movaps 0x270(%rsp),%xmm3
4036b3: 0f 59 9c 24 20 02 00 00 mulps 0x220(%rsp),%xmm3
4036bb: 0f 58 84 24 30 02 00 00 addps 0x230(%rsp),%xmm0
4036c3: 0f 58 8c 24 40 02 00 00 addps 0x240(%rsp),%xmm1
4036cb: 0f 58 94 24 50 02 00 00 addps 0x250(%rsp),%xmm2
4036d3: 0f 58 9c 24 60 02 00 00 addps 0x260(%rsp),%xmm3
4036db: 0f 29 84 24 90 02 00 00 movaps %xmm0,0x290(%rsp)
4036e3: 0f 29 8c 24 d0 01 00 00 movaps %xmm1,0x1d0(%rsp)
4036eb: 0f 29 94 24 80 02 00 00 movaps %xmm2,0x280(%rsp)
4036f3: 0f 29 9c 24 70 02 00 00 movaps %xmm3,0x270(%rsp)
4036fb: ff c0 inc %eax
4036fd: 3d 40 42 0f 00 cmp $0xf4240,%eax
403702: 0f 8c 73 ff ff ff jl 40367b <main+0xfcb>
This reaches only about half of the achievable performance, which gcc 4.4.0 does reach. The gcc assembly looks like this:
402474: 0f 28 7c 24 10 movaps 0x10(%rsp),%xmm7
402479: 31 c0 xor %eax,%eax
40247b: 0f 28 34 24 movaps (%rsp),%xmm6
40247f: 0f 28 6c 24 30 movaps 0x30(%rsp),%xmm5
402484: 0f 28 64 24 20 movaps 0x20(%rsp),%xmm4
402489: 0f 28 5c 24 40 movaps 0x40(%rsp),%xmm3
40248e: 0f 28 54 24 50 movaps 0x50(%rsp),%xmm2
402493: 0f 28 4c 24 60 movaps 0x60(%rsp),%xmm1
402498: 0f 28 44 24 70 movaps 0x70(%rsp),%xmm0
40249d: 44 0f 28 84 24 80 00 00 00 movaps 0x80(%rsp),%xmm8
4024a6: 44 0f 28 8c 24 90 00 00 00 movaps 0x90(%rsp),%xmm9
4024af: 44 0f 28 94 24 a0 00 00 00 movaps 0xa0(%rsp),%xmm10
4024b8: 44 0f 28 9c 24 b0 00 00 00 movaps 0xb0(%rsp),%xmm11
4024c1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
4024c8: 44 0f 59 c7 mulps %xmm7,%xmm8
4024cc: 83 c0 01 add $0x1,%eax
4024cf: 3d 40 42 0f 00 cmp $0xf4240,%eax
4024d4: 44 0f 59 ce mulps %xmm6,%xmm9
4024d8: 44 0f 59 d5 mulps %xmm5,%xmm10
4024dc: 44 0f 58 c3 addps %xmm3,%xmm8
4024e0: 44 0f 59 dc mulps %xmm4,%xmm11
4024e4: 44 0f 58 ca addps %xmm2,%xmm9
4024e8: 44 0f 58 d1 addps %xmm1,%xmm10
4024ec: 44 0f 29 84 24 80 00 00 00 movaps %xmm8,0x80(%rsp)
4024f5: 44 0f 58 d8 addps %xmm0,%xmm11
4024f9: 44 0f 29 8c 24 90 00 00 00 movaps %xmm9,0x90(%rsp)
402502: 44 0f 29 94 24 a0 00 00 00 movaps %xmm10,0xa0(%rsp)
40250b: 44 0f 29 9c 24 b0 00 00 00 movaps %xmm11,0xb0(%rsp)
402514: 75 b2 jne 4024c8
The gcc code runs at 10 GFLOP/s on a Q6600, whereas the icc code reaches only 5 GFLOP/s.
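For reference, here is a small sketch of how such GFLOP/s figures can be derived from a timing. Each iteration executes 4 mulps and 4 addps on 4 packed floats, i.e. 32 FLOP. The function name `gflops` and the elapsed times used below are assumptions for illustration only; the actual timing method is not part of this post.

```c
#include <assert.h>

/* FLOP per loop iteration: 4 mulps + 4 addps, each operating on 4 packed floats. */
enum { FLOP_PER_ITERATION = (4 + 4) * 4 };

/* Hypothetical helper: converts an iteration count and a measured wall time
   (in seconds) into GFLOP/s for the loop shown above. */
double gflops(long iterations, double seconds)
{
    assert(seconds > 0.0);
    return (double)iterations * FLOP_PER_ITERATION / seconds / 1e9;
}
```

Under these assumptions, 1000000 iterations in roughly 3.2 ms correspond to about 10 GFLOP/s, and the same iterations in roughly 6.4 ms to about 5 GFLOP/s.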
It is not entirely clear to me why the icc code is slower. Is it the instruction decoder (because the loop just barely no longer fits into the instruction cache)? Or is the icc code worse at exploiting instruction-level parallelism?
Is there anything I can do to make icc generate the faster variant? (For example, gcc needs to be forced to actually use different registers for x[].)
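One variant that might be worth comparing is sketched below: the loop-carried values are copied into local `__m128` variables so that they have no array slot the compiler needs to keep up to date inside the loop. This is only a sketch under assumptions. The function name `axpy4_local` and its signature are hypothetical, and whether icc 11.1 then actually keeps the values in registers would have to be checked in the disassembly.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: computes x[k] = alpha[k] * x[k] + y[k] for k = 0..3,
   repeated `iterations` times, using only local variables inside the loop. */
void axpy4_local(__m128 x[4], const __m128 alpha[4], const __m128 y[4],
                 int iterations)
{
    assert(iterations >= 0);

    /* Copy everything into locals; x is written back once after the loop. */
    __m128 x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    const __m128 a0 = alpha[0], a1 = alpha[1], a2 = alpha[2], a3 = alpha[3];
    const __m128 y0 = y[0], y1 = y[1], y2 = y[2], y3 = y[3];

    for (int i = 0; i < iterations; ++i) {
        /* Four independent dependency chains, one per local. */
        x0 = _mm_add_ps(_mm_mul_ps(a0, x0), y0);
        x1 = _mm_add_ps(_mm_mul_ps(a1, x1), y1);
        x2 = _mm_add_ps(_mm_mul_ps(a2, x2), y2);
        x3 = _mm_add_ps(_mm_mul_ps(a3, x3), y3);
    }

    x[0] = x0; x[1] = x1; x[2] = x2; x[3] = x3;
}
```

Note that this differs from both listings above, which store x back to the stack in every iteration; only the final contents of x are the same.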