"#pragma vector aligned" makes slower and longer codes...? is it a feature?

"#pragma vector aligned" makes slower and longer codes...? is it a feature?

I am just wondering... is this a feature or not? It's weird.

It happens on Linux with Intel C++ 8.1 (icpc).

In the environment variables,
CFLAGS='-xP -O3 -ipo -ipo_obj -I/usr/local/boinc-14/include -DREV=14'
and
CXXFLAGS='-xP -O3 -ipo -ipo_obj -I/usr/local/boinc-14/include -DREV=14'
are set. (The former doesn't matter for this C++ code.)

When I add "#pragma vector aligned" before a loop, it makes a longer and slower code, although the compiler says "(col. 3) remark: LOOP WAS VECTORIZED".

Withoug this #pragma, compiler doesn't say "LOOP WAS VECTORIZED", but makes a shorter and faster code although the compiler doesn't say "LOOP WAS VECTORIZED"...

my code is as follows:

------
void v_GetPowerSpectrum(
    fftwf_complex* FreqData,
    float* PowerSpectrum,
    int NumDataPoints
) {
    int i;

#pragma vector aligned // by Tetsuji Maverick Rai (this is line 531)
    for (i = 0; i < NumDataPoints; i++) {
        PowerSpectrum[i] = FreqData[i][0] * FreqData[i][0]
                         + FreqData[i][1] * FreqData[i][1];
    }
}
------

When this pragma is active, the produced code is very long ("#pragma vector always" produces the same or very similar code):

----------
0000257c <_Z18v_GetPowerSpectrumPA2_fPfi>:
257c: 55 push %ebp
257d: 89 e5 mov %esp,%ebp
257f: 83 e4 f0 and $0xfffffff0,%esp
2582: 57 push %edi
2583: 56 push %esi
2584: 53 push %ebx
2585: 56 push %esi
2586: 8b 75 08 mov 0x8(%ebp),%esi
2589: 8b 5d 0c mov 0xc(%ebp),%ebx
258c: 8b 4d 10 mov 0x10(%ebp),%ecx
258f: 85 c9 test %ecx,%ecx
2591: 0f 8e 0c 01 00 00 jle 26a3 <_Z18v_GetPowerSpectrumPA2_fPfi+0x127>
2597: 31 d2 xor %edx,%edx
2599: 83 f9 08 cmp $0x8,%ecx
259c: 0f 82 de 00 00 00 jb 2680 <_Z18v_GetPowerSpectrumPA2_fPfi+0x104>
25a2: 8d 04 8b lea (%ebx,%ecx,4),%eax
25a5: 39 f0 cmp %esi,%eax
25a7: 76 0b jbe 25b4 <_Z18v_GetPowerSpectrumPA2_fPfi+0x38>
25a9: 8d 3c ce lea (%esi,%ecx,8),%edi
25ac: 39 fb cmp %edi,%ebx
25ae: 0f 82 cc 00 00 00 jb 2680 <_Z18v_GetPowerSpectrumPA2_fPfi+0x104>
25b4: 8d 7e 04 lea 0x4(%esi),%edi
25b7: 39 f8 cmp %edi,%eax
25b9: 76 0c jbe 25c7 <_Z18v_GetPowerSpectrumPA2_fPfi+0x4b>
25bb: 8d 44 ce 04 lea 0x4(%esi,%ecx,8),%eax
25bf: 39 c3 cmp %eax,%ebx
25c1: 0f 82 b9 00 00 00 jb 2680 <_Z18v_GetPowerSpectrumPA2_fPfi+0x104>
25c7: 89 cf mov %ecx,%edi
25c9: 83 e7 07 and $0x7,%edi
25cc: f7 df neg %edi
25ce: 01 cf add %ecx,%edi
25d0: 8d 04 d6 lea (%esi,%edx,8),%eax
25d3: 90 nop
25d4: f3 0f 10 3c d6 movss (%esi,%edx,8),%xmm7
25d9: f3 0f 10 44 d6 10 movss 0x10(%esi,%edx,8),%xmm0
25df: f3 0f 10 54 d6 08 movss 0x8(%esi,%edx,8),%xmm2
25e5: f3 0f 10 48 18 movss 0x18(%eax),%xmm1
25ea: f3 0f 10 74 d6 04 movss 0x4(%esi,%edx,8),%xmm6
25f0: f3 0f 10 5c d6 14 movss 0x14(%esi,%edx,8),%xmm3
25f6: f3 0f 10 6c d6 0c movss 0xc(%esi,%edx,8),%xmm5
25fc: f3 0f 10 60 1c movss 0x1c(%eax),%xmm4
2601: 0f 14 f8 unpcklps %xmm0,%xmm7
2604: 0f 14 d1 unpcklps %xmm1,%xmm2
2607: 0f 14 fa unpcklps %xmm2,%xmm7
260a: 0f 59 ff mulps %xmm7,%xmm7
260d: 0f 14 f3 unpcklps %xmm3,%xmm6
2610: 0f 14 ec unpcklps %xmm4,%xmm5
2613: 0f 14 f5 unpcklps %xmm5,%xmm6
2616: 0f 59 f6 mulps %xmm6,%xmm6
2619: 0f 58 fe addps %xmm6,%xmm7
261c: 0f 29 3c 93 movaps %xmm7,(%ebx,%edx,4)
2620: f3 0f 10 5c d6 20 movss 0x20(%esi,%edx,8),%xmm3
2626: f3 0f 10 44 d6 30 movss 0x30(%esi,%edx,8),%xmm0
262c: f3 0f 10 4c d6 28 movss 0x28(%esi,%edx,8),%xmm1
2632: f3 0f 10 54 d6 24 movss 0x24(%esi,%edx,8),%xmm2
2638: 0f 14 d8 unpcklps %xmm0,%xmm3
263b: f3 0f 10 40 38 movss 0x38(%eax),%xmm0
2640: 0f 14 c8 unpcklps %xmm0,%xmm1
2643: f3 0f 10 44 d6 34 movss 0x34(%esi,%edx,8),%xmm0
2649: 0f 14 d9 unpcklps %xmm1,%xmm3
264c: 0f 59 db mulps %xmm3,%xmm3
264f: f3 0f 10 4c d6 2c movss 0x2c(%esi,%edx,8),%xmm1
2655: 0f 14 d0 unpcklps %xmm0,%xmm2
2658: f3 0f 10 40 3c movss 0x3c(%eax),%xmm0
265d: 0f 14 c8 unpcklps %xmm0,%xmm1
2660: 0f 14 d1 unpcklps %xmm1,%xmm2
2663: 0f 59 d2 mulps %xmm2,%xmm2
2666: 83 c0 40 add $0x40,%eax
2669: 0f 58 da addps %xmm2,%xmm3
266c: 0f 29 5c 93 10 movaps %xmm3,0x10(%ebx,%edx,4)
2671: 83 c2 08 add $0x8,%edx
2674: 39 fa cmp %edi,%edx
2676: 0f 82 58 ff ff ff jb 25d4 <_Z18v_GetPowerSpectrumPA2_fPfi+0x58>
267c: 39 ca cmp %ecx,%edx
267e: 73 23 jae 26a3 <_Z18v_GetPowerSpectrumPA2_fPfi+0x127>
2680: f3 0f 10 44 d6 04 movss 0x4(%esi,%edx,8),%xmm0
2686: f3 0f 59 c0 mulss %xmm0,%xmm0
268a: f3 0f 10 0c d6 movss (%esi,%edx,8),%xmm1
268f: f3 0f 59 c9 mulss %xmm1,%xmm1
2693: f3 0f 58 c8 addss %xmm0,%xmm1
2697: f3 0f 11 0c 93 movss %xmm1,(%ebx,%edx,4)
269c: 83 c2 01 add $0x1,%edx
269f: 39 ca cmp %ecx,%edx
26a1: 72 dd jb 2680 <_Z18v_GetPowerSpectrumPA2_fPfi+0x104>
26a3: 59 pop %ecx
26a4: 5b pop %ebx
26a5: 5e pop %esi
26a6: 5f pop %edi
26a7: 89 ec mov %ebp,%esp
26a9: 5d pop %ebp
26aa: c3 ret
26ab: 90 nop
----------------

But without this pragma, the code looks like this:

----------------
00002470 <_Z18v_GetPowerSpectrumPA2_fPfi>:
2470: 56 push %esi
2471: 8b 4c 24 08 mov 0x8(%esp),%ecx
2475: 8b 54 24 10 mov 0x10(%esp),%edx
2479: 31 c0 xor %eax,%eax
247b: 85 d2 test %edx,%edx
247d: 7e 2f jle 24ae <_Z18v_GetPowerSpectrumPA2_fPfi+0x3e>
247f: 89 1c 24 mov %ebx,(%esp)
2482: 8b 5c 24 0c mov 0xc(%esp),%ebx
2486: 89 f6 mov %esi,%esi
2488: f3 0f 10 44 c1 04 movss 0x4(%ecx,%eax,8),%xmm0
248e: f3 0f 59 c0 mulss %xmm0,%xmm0
2492: f3 0f 10 0c c1 movss (%ecx,%eax,8),%xmm1
2497: f3 0f 59 c9 mulss %xmm1,%xmm1
249b: f3 0f 58 c8 addss %xmm0,%xmm1
249f: f3 0f 11 0c 83 movss %xmm1,(%ebx,%eax,4)
24a4: 83 c0 01 add $0x1,%eax
24a7: 39 d0 cmp %edx,%eax
24a9: 7c dd jl 2488 <_Z18v_GetPowerSpectrumPA2_fPfi+0x18>
24ab: 8b 1c 24 mov (%esp),%ebx
24ae: 59 pop %ecx
24af: c3 ret
------------------

And in the latter case, the "LOOP WAS VECTORIZED" remark doesn't appear!! The latter code is much smarter and actually faster.

Is this a feature, a bug, or my misunderstanding? I hope you can reproduce it easily with this code.
If it is a feature or my misunderstanding, this #pragma should be used carefully.

Thanks in advance!!


Dear Abik,

Thank you very much!! I communicated with you by email and mentioned that when the real part and the imaginary part are defined in separate arrays, icpc produces much faster vectorized code that calculates 4 elements at once using SSE (and unrolls 2 loops into 1). It's amazing... nice compiler vectorization :)

The code produced to calculate 4 elements at a time is:

32: 0f 28 0c 97 movaps (%edi,%edx,4),%xmm1 // load real parts
36: 0f 28 04 96 movaps (%esi,%edx,4),%xmm0 // load imaginary parts
3a: 0f 59 c9 mulps %xmm1,%xmm1 // square real
3d: 0f 59 c0 mulps %xmm0,%xmm0 // square imaginary
40: 0f 58 c8 addps %xmm0,%xmm1 // add them
43: 0f 29 0c 93 movaps %xmm1,(%ebx,%edx,4) // store results

It's not hand-assembled... the compiler did it :) This is real vectorization.
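
For reference, a minimal sketch of what the split-array source might look like (the array names FreqReal and FreqImag, the function name, and the 16-byte alignment of all three arrays are assumptions; the actual code was only exchanged by email):

----------
/* Sketch only: real and imaginary parts kept in separate arrays
   ("structure of arrays"), so both reads and the write are unit stride.
   FreqReal, FreqImag and PowerSpectrum are assumed to be 16-byte aligned;
   otherwise the pragma is unsafe. */
void v_GetPowerSpectrumSplit(
    const float* FreqReal,
    const float* FreqImag,
    float* PowerSpectrum,
    int NumDataPoints
) {
    int i;

#pragma vector aligned
    for (i = 0; i < NumDataPoints; i++) {
        PowerSpectrum[i] = FreqReal[i] * FreqReal[i]
                         + FreqImag[i] * FreqImag[i];
    }
}
----------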

PS: Actually, I don't think the latter code (without the pragma) is vectorized, but the former isn't really vectorized either... am I correct?

There's a reason the compiler doesn't vectorize this without the pragma. If you could reverse the subscripts of FreqData[][] so as to enable use of parallel loads, no doubt you would see an advantage in vectorization for NumDataPoints >= 24 or so. You will still see more code generated, due to the remainder loop which has to be generated to take care of leftovers.

Forcing vectorization of a loop with non-unity read strides and unity write stride isn't likely to pay off, unless you are writing to more than 6 different array sections, so that you run into Write Combine buffer thrashing.

Thank you for your reply; I understand the general point. There must be proper preparation before vectorization, and that is what made the code so big... so vectorization will pay off only when the loop iteration count is large enough.

However, I still don't know where the figure "24 or so" comes from. If I knew that, I could write faster code. And how can I tell the compiler that the iteration count is a multiple of 4? Should I write 4 calculations in the loop body and add 4 to the loop index each time?

I don't mind the code size so much; I just want speed :) I like this compiler even better now.

PS: I modified this code as follows, but then it wasn't vectorized!

----------
void v_GetPowerSpectrum(
    fftwf_complex* FreqData,
    float* PowerSpectrum,
    int NumDataPoints
) {
    int i;

#pragma vector always // #pragma vector aligned doesn't work either
    for (i = 0; i < NumDataPoints; i += 4) {
        PowerSpectrum[i]   = FreqData[i][0] * FreqData[i][0]
                           + FreqData[i][1] * FreqData[i][1];
        PowerSpectrum[i+1] = FreqData[i+1][0] * FreqData[i+1][0]
                           + FreqData[i+1][1] * FreqData[i+1][1];
        PowerSpectrum[i+2] = FreqData[i+2][0] * FreqData[i+2][0]
                           + FreqData[i+2][1] * FreqData[i+2][1];
        PowerSpectrum[i+3] = FreqData[i+3][0] * FreqData[i+3][0]
                           + FreqData[i+3][1] * FreqData[i+3][1];
    }
}
-----------

I'll follow the compiler :)


As vectorization here typically means performing 8 loop iterations in parallel, it will hurt performance for short loops whose trip count is not a multiple of 8. When you unroll manually, you depend on icpc undoing that, or "re-rolling" your source, to vectorize effectively. In principle you're right: if a compiler re-rolls your loop, it could take advantage of the fact that you have shown the trip count is a multiple of 4.
Your vector aligned pragma does part of the job, by telling the compiler not to generate an initial remainder loop.
In this case, if you can't alter the loop to make it usefully vectorizable, unrolling either by writing it out or persuading the compiler with an unroll pragma may help.
It's possible that one of the other compilers which performs vectorization might be helped by unrolling by 2 or 4, but no compiler will swap your subscripts to improve optimization.
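
For reference, a minimal sketch of the unroll-pragma variant mentioned above (the "#pragma unroll(4)" spelling and the function name are assumptions; the pragma is documented for later Intel compilers, and icpc 8.1 may or may not honor it):

----------
/* Sketch only: keep the original rolled loop and ask the compiler to unroll
   it, instead of writing the four statements out by hand. */
void v_GetPowerSpectrumUnrolled(
    fftwf_complex* FreqData,
    float* PowerSpectrum,
    int NumDataPoints
) {
    int i;

#pragma unroll(4)
    for (i = 0; i < NumDataPoints; i++) {
        PowerSpectrum[i] = FreqData[i][0] * FreqData[i][0]
                         + FreqData[i][1] * FreqData[i][1];
    }
}
----------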

Dear maverick,

Pragmas vector always and vector aligned both tell the compiler to vectorize the loop regardless of the outcome of the efficiency heuristics. The latter also tells the compiler that it is safe to use aligned data movement instructions for full vector load and store operations (without doing the proper analysis on arrays and subscripts, or generating peeling loops to force aligned access patterns at run time, as would be done in all other cases). Consequently, both pragmas force vectorization of this loop, which would otherwise be deemed unprofitable due to the non-unit strides.

Transposing the matrix to obtain more unit strides, as alluded to by Tim, may help in getting better vector performance. Unrolling loops by hand is generally not recommended, because this may disable certain optimizations (even though the compiler tries to re-roll loops before the analysis), and the compiler unrolls loops automatically where deemed profitable anyway.

If you could provide me with a small example that contains all data types (feel free to email me directly at aart.bik@intel.com if you don't want to post more code), I may be able to give you some more hints. For programming guidelines related to vectorization, see the online article at http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm or The Software Vectorization Handbook at http://www.intel.com/intelpress/sum_vmmx.htm.
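
As an illustration of that alignment contract (a sketch based on assumptions, not code from the thread): the pragma is only safe when every array accessed in the loop really is 16-byte aligned, which holds, for example, when the buffers come from fftwf_malloc, which FFTW documents as returning memory suitably aligned for SIMD:

----------
#include <fftw3.h>

/* Illustration only: allocate the arrays with fftwf_malloc so the promise
   made by "#pragma vector aligned" (16-byte alignment for SSE movaps)
   actually holds at run time. */
int main(void) {
    const int NumDataPoints = 1024;

    fftwf_complex* FreqData =
        (fftwf_complex*) fftwf_malloc(NumDataPoints * sizeof(fftwf_complex));
    float* PowerSpectrum =
        (float*) fftwf_malloc(NumDataPoints * sizeof(float));

    /* ... fill FreqData, call v_GetPowerSpectrum(FreqData, PowerSpectrum,
       NumDataPoints), use the result ... */

    fftwf_free(PowerSpectrum);
    fftwf_free(FreqData);
    return 0;
}
----------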

Hope this helps.

Aart Bik
http://www.aartbik.com/

Thanks tim and abik,

I'll investigate, try it on my code, and ask again if necessary. Thank you!!

regards,
