Missed compiler optimization opportunity

I posted a thread on the AVX Instruction forum illustrating a missed compiler optimization issue:

http://software.intel.com/en-us/forums/showthread.php?t=107112&o=a&s=lr

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

I've taken a look, with the following results:

  1. It's not related to AVX; it occurs with SSE, too. In the following I'll only use/refer to SSE. For AVX the argumentation is similar.
  2. When not using IPO both loops are equal in performance. When turning on IPO the first loop (auto-vectorized version) becomes faster. Comparing the assembly of both loops then reveals nearly no difference. So, why the difference in performance? Let's take a closer look at the memory layout...
  3. The first loop profits from the contiguous elements in the three arrays ("a1", "a2" & "a3") - array "a4" is not used at all.
    The reason is that the first loop can pre-fetch and keep array elements on full cache-lines - each array very likely gets its own cache-lines, which amplifies the effect further.
    The second loop uses an array of structures and needs to load a new cache-line with every iteration. You hit exactly the worst case here: your structure is exactly 64 bytes (SSE version) - the size of a cache-line. For AVX it's even two cache-lines. In the SSE case, remove member "a4" and you'll notice a speedup already (see the layout sketch just below).
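
To make the layout difference concrete, here is a minimal sketch (the type and array names are illustrative, not taken from the original source). The SoA version walks three independent, contiguous float streams; the AoS version steps through 64-byte records and thus touches a fresh cache-line on every iteration, even though one field ("a4") is never used:

[cpp]enum { N = 1024 };                  // element count, illustrative only

struct Rec                          // sizeof(Rec) == 64 bytes == one cache-line (SSE case)
{
    float a1[4], a2[4], a3[4], a4[4];
};

// SoA: three separate, contiguous streams - the hardware prefetcher
// can run ahead independently on each of them.
float s1[N], s2[N], s3[N];

// AoS: rec[0], rec[1], rec[2], ... each start on a new cache-line,
// and the unused "a4" still occupies a quarter of every line fetched.
Rec rec[N / 4];[/cpp]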

Conclusion:
The effect you're seeing is that IPO optimizes the first loop's code to the same level as the second loop's. However, the first loop profits from caching effects and is thus faster.

Edit:
I should also mention that, in summary, the total number of fetched cache-lines is identical for both loops. However, the timing is different. Processors are very good at detecting contiguous access to data in memory and pre-fetch the next cache-line before it is actually needed. The load instructions in the loop then add no latency.
For arrays of structures that's less likely, since the processor has no knowledge of the structure layout. It can only load the data once a load instruction supplies the exact (non-contiguous) address. Inside a loop this adds undesired latency.
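
One way to give the processor that missing knowledge is an explicit software prefetch of a record a few iterations ahead. The sketch below only illustrates the idea (s, Nvecs, coeff1/coeff2, aosoa and F32vec8 refer to Jim's test case; the prefetch distance and the arithmetic are placeholders), and whether it actually helps has to be measured:

[cpp]#include <xmmintrin.h>              // _mm_prefetch, _MM_HINT_T0
#include "dvec.h"                   // F32vec8 (Intel C++ class library)

// Hypothetical variant of the AoS test with a manual prefetch.
void Test2_withPrefetch(aosoa* s, int Nvecs, float coeff1, float coeff2)
{
    const int PF_DIST = 4;          // guessed distance; would need tuning
    F32vec8 c1(coeff1), c2(coeff2);
    for (int i = 0; i < Nvecs; ++i)
    {
        // Prefetching past the end of s[] is harmless - prefetch never faults.
        _mm_prefetch((const char*)&s[i + PF_DIST], _MM_HINT_T0);
        F32vec8 x = s[i].a2;        // placeholder arithmetic mirroring the tests
        s[i].a3 = x * x * c1 * s[i].a3 - x + c2 * s[i].a1;
    }
}[/cpp]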

Best regards,

Georg Zitzlsberger

Georg,

You are not quite right. The reason loop 1 runs faster is that it is interleaving fetches.

AOS

		Block 41:
0x1400014bb	74	vmovups ymm0, ymmword ptr [r9+rcx*4]
0x1400014c1	74	vmovups xmm4, xmmword ptr [r8+rcx*4]
0x1400014c7	74	vmulps ymm3, ymm0, ymm0
0x1400014cb	74	vmulps ymm5, ymm3, ymm2
0x1400014cf	74	vinsertf128 ymm9, ymm4, xmmword ptr [r8+rcx*4+0x10], 0x1
0x1400014d7	74	vmulps ymm3, ymm5, ymm9
0x1400014dc	74	vsubps ymm5, ymm3, ymm0
0x1400014e0	74	vmovups xmm0, xmmword ptr [r10+rcx*4]
0x1400014e6	74	vinsertf128 ymm4, ymm0, xmmword ptr [r10+rcx*4+0x10], 0x1
0x1400014ee	74	vmulps ymm9, ymm1, ymm4
0x1400014f2	74	vaddps ymm0, ymm5, ymm9
0x1400014f7	72	vmovups xmmword ptr [r8+rcx*4], xmm0
0x1400014fd	72	vextractf128 xmmword ptr [r8+rcx*4+0x10], ymm0, 0x1
0x140001505	72	add rcx, 0x8
0x140001509	72	cmp rcx, rdx
0x14000150c	72	jb 0x1400014bb

See how the fetch into ymm0 is interleaved with the fetch into xmm4 _before_ ymm0 is used.
Similarly, other instructions are inserted between the fetch into xmm4 and its later use as ymm4.
The next uses of xmm0 and ymm0 are not interleaved.
And the stores do not matter as much.
The above has four interleaved reads. This avoids 4 pipeline stalls waiting for data.

Loop 2 has

PAOS

		Block 45:
0x14000154d	81	vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
0x140001553	79	inc rcx
0x140001556	81	vmulps ymm2, ymm5, ymm5
0x14000155a	81	vmulps ymm3, ymm2, ymm1
0x14000155e	81	vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
0x140001564	81	vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
0x140001569	81	vsubps ymm2, ymm4, ymm5
0x14000156d	81	vaddps ymm4, ymm2, ymm3
0x140001571	81	vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
0x140001577	79	add rax, 0x80
0x14000157d	79	cmp rcx, 0x800000
0x140001584	79	jl 0x14000154d

See that (other than for the inc rcx) all the reads are immediately followed by a use of the read's target register. This causes a pipeline stall waiting for data.

The way to fix this is to unroll the loop 2x. Something like the following:

PAOS (unrolled 2x)

		Block 45:
0x14000154d	81	vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
			vmovups ymm15, ymmword ptr [rax+rdx*1+0x20+0x80]
0x140001553	79	add rcx, 2
0x140001556	81	vmulps ymm2, ymm5, ymm5
			vmulps ymm12, ymm15, ymm15
0x14000155a	81	vmulps ymm3, ymm2, ymm1
			vmulps ymm13, ymm12, ymm11
0x14000155e	81	vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
			vmulps ymm14, ymm13, ymmword ptr [rax+rdx*1+0x40+0x80]
0x140001564	81	vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
			vmulps ymm13, ymm10, ymmword ptr [rax+rdx*1+0x80]
0x140001569	81	vsubps ymm2, ymm4, ymm5
			vsubps ymm12, ymm14, ymm15
0x14000156d	81	vaddps ymm4, ymm2, ymm3
			vaddps ymm14, ymm12, ymm13
0x140001571	81	vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
			vmovups ymmword ptr [rax+rdx*1+0x40+0x80], ymm14
0x140001577	79	add rax, 0x80+0x80
0x14000157d	79	cmp rcx, 0x800000
0x140001584	79	jl 0x14000154d
In the above, for clarity I +10'd the register numbers for the second section of the unrolled loop.
Also, I wrote ...+0x80 (e.g. +0x20+0x80) where the compiler would combine the displacement into a single base (0xA0).
Now when you read the code you can see that the reads into the ymm registers are fully interleaved.
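
At the source level the same transformation can be sketched with the F32vec8 class from dvec.h. This is a hedged illustration only: s, Nvecs and coeff1/coeff2 are the names from the test program, the field mapping is inferred from the disassembly above, and the compiler remains free to schedule the generated loads differently:

[cpp]// Inside the PAOS test, using the globals of the test program.
F32vec8 c1(coeff1), c2(coeff2);
for (int i = 0; i < Nvecs; i += 2)          // 2x unroll (Nvecs assumed even)
{
    // Issue all loads for both iterations before any result is consumed.
    F32vec8 x0 = s[i].a2,  x1 = s[i+1].a2;
    F32vec8 y0 = s[i].a3,  y1 = s[i+1].a3;
    F32vec8 z0 = s[i].a1,  z1 = s[i+1].a1;

    // Two independent dependency chains the scheduler can interleave.
    s[i].a3   = x0 * x0 * c1 * y0 - x0 + c2 * z0;
    s[i+1].a3 = x1 * x1 * c1 * y1 - x1 + c2 * z1;
}[/cpp]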

The compiler optimizer should easily be able to do this, since the compiler can already unroll loops .AND. it is already smart enough to interleave instructions.

I think this case may be a trivial fix.

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

you're assuming that the latency of loads can be amortized by interleaving just a few non-load/store operations. That's not always true. If you're transferring lots of (different) data, which is the case in your example, cache misses are frequent and the latency is even worse. Interleaving only helps if the data is already in cache. Any access further down the cache hierarchy towards main memory adds another order of magnitude of latency.

In your example you're comparing apples with oranges: consecutive (homogeneous) data in dedicated arrays vs. a single array of structures (AoS).

Let's do the following and compare test/loop 1 with something that has the same memory access pattern:

[bash]F32vec8* b1; // [N];
F32vec8* b2; // [N];
F32vec8* b3; // [N];
F32vec8* b4; // [N];

...

b1 = new F32vec8[Nvecs];
b2 = new F32vec8[Nvecs];
b3 = new F32vec8[Nvecs];
b4 = new F32vec8[Nvecs];

...
// Added after test 2:

// test 3
for (int i=0; i<Nvecs; i++)
    ... // body analogous to test 1, using the b* arrays
[/bash]

You'll notice that test/loop 3 has the same performance as test/loop 1. The access pattern is the same here but it definitely is not for test/loop 2!

Does this answer your question?

Best regards,

Georg Zitzlsberger

>>you're assuming that the latency of loads can be amortized by interleaving just a few non-load/store operations

No Georg,

I am stating (assuming) that the latency of loads can be amortized by interleaving with a load/store that is not dependent on the prior load/store.

Looking at the higher performing AOS code:

  vmovups ymm0, ymmword ptr [r9+rcx*4]
  vmovups xmm4, xmmword ptr [r8+rcx*4]
  vmulps ymm3, ymm0, ymm0
  vmulps ymm5, ymm3, ymm2
  vinsertf128 ymm9, ymm4, xmmword ptr [r8+rcx*4+0x10], 0x1

You find a load of cache/mem into ymm0, and ymm0 is not used until after a load of cache/mem into ymm4. The execution of the request to load xmm4 is practically free (and it retires while the fetch into xmm4 is still pending). The last instruction may or may not stall, or stall to a lesser degree, waiting for ymm0, depending on which cache level (if any) holds ymmword ptr [r8+rcx*4]. And the last instruction's memory reference, [r8+rcx*4+0x10], has a 3/4 probability of being in L1. The pipeline designer may also have been smart enough to recognize that the vmovups xmm... and vinsertf128 ... xmmword ptr come from the same cache line and bypass the load from L1 (saving possibly 3 clock ticks).

The last instruction above uses ymm4 (the result of the load issued by the second instruction) 3 (2+) instruction slots after the load request.

This lack of stall (or lesser degree of stall) waiting for the fetch to complete is one of the reasons for the performance difference.

A secondary issue, looking at the complete AOS assembly in the post above, is that the AOS is using 3 data streams (off of r8, r9, r10) whereas the PAOS (in this example) is using 1 data stream (off of rax). The combination of interleaved loads/stores .AND. three streams increases the L1 hit ratio. Introducing interleaved loads/stores in the PAOS code would reduce pipeline stalls.

The loop need not be unrolled 2x as in the earlier post. Consider the original and two alternate proposals:

PAOS

---- original ----
  vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
  inc rcx
  vmulps ymm2, ymm5, ymm5
  vmulps ymm3, ymm2, ymm1
  vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
  vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
  vsubps ymm2, ymm4, ymm5
  vaddps ymm4, ymm2, ymm3
  vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
  add rax, 0x80
  cmp rcx, 0x800000
  jl 0x14000154d

---- proposed 1 ----
  vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
  vmovups ymm15, ymmword ptr [rax+rdx*1+0x40] ; load lifted
  inc rcx
  vmulps ymm2, ymm5, ymm5
  vmovups ymm16, ymmword ptr [rax+rdx*1] ; load lifted
  vmulps ymm3, ymm2, ymm1
  vmulps ymm4, ymm3, ymm15 ; was ymmword ptr [rax+rdx*1+0x40]
  vmulps ymm3, ymm0, ymm16 ; was ymmword ptr [rax+rdx*1]
  vsubps ymm2, ymm4, ymm5
  vaddps ymm4, ymm2, ymm3
  vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
  add rax, 0x80
  cmp rcx, 0x800000
  jl 0x14000154d

---- proposed 2 ----
  vmovups ymm16, ymmword ptr [rax+rdx*1] ; load lifted
  vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
  vmovups ymm15, ymmword ptr [rax+rdx*1+0x40] ; load lifted
  inc rcx
  vmulps ymm2, ymm5, ymm5
  vmulps ymm3, ymm2, ymm1
  vmulps ymm4, ymm3, ymm15 ; was ymmword ptr [rax+rdx*1+0x40]
  vmulps ymm3, ymm0, ymm16 ; was ymmword ptr [rax+rdx*1]
  vsubps ymm2, ymm4, ymm5
  vaddps ymm4, ymm2, ymm3
  vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
  add rax, 0x80
  cmp rcx, 0x800000
  jl 0x14000154d

The second one lifts a load further but also results in a sequential access pattern (+0, +0x20, +0x40) which may aid the hardware prefetcher.
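
A hedged source-level counterpart of "proposed 2" (illustrative only; same placeholder arithmetic and names as before, and the compiler may of course re-order these loads on its own) simply reads all three operands into locals, in address order, before any arithmetic:

[cpp]// Inside the PAOS test, using the globals of the test program.
F32vec8 c1(coeff1), c2(coeff2);
for (int i = 0; i < Nvecs; ++i)
{
    // Lift the loads to the top (+0x00, +0x20, +0x40 within the record)
    // so no load result is consumed by the very next instruction.
    F32vec8 z = s[i].a1;
    F32vec8 x = s[i].a2;
    F32vec8 y = s[i].a3;
    s[i].a3 = x * x * c1 * y - x + c2 * z;
}[/cpp]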

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

you still see test 1 performing differently than test 3? That surprises me because both the SSE(2) & AVX versions are at the very same level for me.

I've attached your initial example, extended by test 3 (compile with "-DUSE_AVX" & "-xAVX" for AVX, nothing else for SSE2).

I got these results (all 64-bit Linux systems - the only systems I have at hand right now):

Core i7-2600 CPU @ 3.40GHz (AVX):
$ icpc openmp.cpp -ipo -o openmp -openmp -DUSE_AVX -xAVX # AVX
$ ./openmp
1.76201 6.20015 1.74814

$ icpc openmp.cpp -ipo -o openmp -openmp # SSE2
$ ./openmp
1.84652 6.30253 1.8449

Intel Core i7 CPU X 980 @ 3.33GHz
$ icpc openmp.cpp -ipo -o openmp -openmp
$ ./openmp
2.24444 7.25693 2.18517

Of course all systems have Intel Hyper-Threading, SpeedStep & Intel Turbo Boost disabled to provide comparable results.
Please note that I'm using optimization level 2 and IPO.

Discussing details like timing of instructions, dependencies, (SW) pipelining etc. is difficult because the underlying processors do lots of optimizations that only show up at run-time and even change across generations.
I'd like to see hard numbers first so I can reproduce this. So far I'm seeing the serial version (baseline) at the same level as the revised C++ class library version (test 3). Thus the C++ class library works well with the compiler optimizations.

In case you described more hypothetical improvements above, I'd like to ask for an example (e.g. with in-line assembly) on which we can work.

Thank you & best regards,

Georg Zitzlsberger

Modified source:
openmp.cpp

Georg,

Thanks for the edits

I receive

3.18165 5.98651 3.29903

Core i7 2600K (Turbo On, HT On), standard clock setting. Unwilling to reboot at this time to run the test w/o Turbo or HT.

Your edits construct a scenario of 4 independent arrays (3 used, as you stated earlier), resulting in 3 input streams. Essentially the same configuration as AOS. Performance should be approximately the same.

My system showed slightly slower, yours slightly faster.

In looking at the assembler code we see:

		Block 64:
  vmovups ymm5, ymmword ptr [rax+rcx*1]
  inc r9
  vmulps ymm2, ymm5, ymm5
  vmulps ymm3, ymm2, ymm1
  vmulps ymm4, ymm3, ymmword ptr [rdx]
  vmulps ymm3, ymm0, ymmword ptr [rax+r8*1]
  vsubps ymm2, ymm4, ymm5
  vaddps ymm4, ymm2, ymm3
  add rax, 0x20
  vmovups ymmword ptr [rdx], ymm4
  add rdx, 0x20
  cmp r9, 0x800000
  jl 0x1400016af

The load of ymm5 (line 1) via the rax+rcx stream is used "immediately" thereafter (the inc occurs concurrently in a different path), so a stall could occur there every other iteration. The result of the second memory reference (the rdx stream) is not used until after the fetch is issued for the rax+r8 stream.

This instruction sequence has approximately the same interleaved-fetch characteristics as loop 1 has.

The following loop (hand-written) may yield better performance:

		Block 64:
  vmovups ymm5, ymmword ptr [rax+rcx*1]
  vmovups ymm15, ymmword ptr [rdx] ; lifted load
  vmovups ymm16, ymmword ptr [rax+r8*1] ; lifted load
  inc r9
  vmulps ymm2, ymm5, ymm5
  vmulps ymm3, ymm2, ymm1
  vmulps ymm4, ymm3, ymm15 ; was ymmword ptr [rdx]
  vmulps ymm3, ymm0, ymm16 ; was ymmword ptr [rax+r8*1]
  vsubps ymm2, ymm4, ymm5
  vaddps ymm4, ymm2, ymm3
  add rax, 0x20
  vmovups ymmword ptr [rdx], ymm4
  add rdx, 0x20
  cmp r9, 0x800000
  jl 0x1400016af

Though the placement of the 3rd read may need to be deferred by one or two instructions depending on the architecture.

Jim Dempsey

www.quickthreadprogramming.com

Georg,

Here is something new to play with.

I added out-of-line functions for the three loops that are conditionally compiled in or out (see the #define near the top).

I also added an out-of-line function with an __asm block that optimizes the fetching for Test2 (original PAOS). This produces the 4th output number.

As configured, on my system (no out-of-line functions, multi-file IPO, full optimization, AVX, #pragmas disabled) I see:

3.181 5.92313 3.29127 2.97794

So the minor code change I introduced into the (single-stream) PAOS doubled its performance, and it is now faster than the prior AOS and the four-stream variant you produced for PAOS2.

Now then for some interesting observations:

The complete program is a single file.

Compiling with multi-file IPO produces much faster code than compiling with single-file IPO - why?
(Note there is only 1 procedure/function.)

Compiling with single-file IPO and with USE_functionCalls causes the dvec.h member functions to not be inlined - why?

This file may be a good test program for your optimization team to use to figure out what is going on (and to improve your product). Depending on options (all with high optimization levels) there is a wide variation in the performance of each loop (up to a 6x difference).
[cpp]// Felix.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "dvec.h"
#include "omp.h"
#include

#define USE_AVX
#define USE_functionCalls
//#define USE_pragmas
// 3.181 5.92313 3.29127 2.97794

#ifdef USE_AVX
struct aosoa
{
	F32vec8 a1;
	F32vec8 a2;
	F32vec8 a3;
	F32vec8 a4;
};
const int VecWidth = 8;
#else
struct aosoa
{
	F32vec4 a1;
	F32vec4 a2;
	F32vec4 a3;
	F32vec4 a4;
};
const int VecWidth = 4;
#endif

const int N = 64*1024*1024;
const int Nvecs = N / VecWidth;

float* a1; // [N];
float* a2; // [N];
float* a3; // [N];
float* a4; // [N];

#ifdef USE_AVX
F32vec8* b1; // [N];
F32vec8* b2; // [N];
F32vec8* b3; // [N];
F32vec8* b4; // [N];
#else
F32vec4* b1; // [N];
F32vec4* b2; // [N];
F32vec4* b3; // [N];
F32vec4* b4; // [N];
#endif

aosoa* s; // [Nvecs];

float coeff1 = 1.2345f;
float coeff2 = .987654321f;

#ifdef USE_functionCalls
void Test1()
{
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
	for (int i=0; i<N; i++)
		...
[/cpp]

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

why did you increment by 2?

[sectionBody]B4_2:
vmovups ymm5, YMMWORD PTR [32+rax+rdx]
add rcx, 2 <= HERE; should be "incq rcx"?!
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymm7 ; was YMMWORD PTR [64+rax+rdx]
vmovups ymm7, YMMWORD PTR [192+rax+rdx]
vmulps ymm3, ymm0, ymm6 ; was YMMWORD PTR [rax+rdx]
vmovups ymm6, YMMWORD PTR [128+rax+rdx]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
vmovups YMMWORD PTR [64+rax+rdx], ymm4
add rax, 128
cmp rcx, 8388608
jl B4_2 [/sectionBody]
The original only increments by 1. So you'd have computed only half of the values...

Regarding IPO:
Even though (multi-file) IPO shouldn't make a difference to the single-file version, there are still different optimization paths executed in the compiler, thus resulting in different binaries. In theory both variants should show identical results but reality is different. It's the same as for optimization level 2 vs. 3 - sometimes 2 is better than 3 and vice versa. It always depends on the use-case and is nothing we could ultimately guarantee with small changes/fixes to the compiler.

Regarding in-lining:
The compiler might not in-line functions if it does not see any benefit. You can override that by various options (e.g. /Qinline-forceinline or /Qinline-factor). Or you can use the pragmas [my personal favorites btw.]:
#pragma [force]inline
#pragma noinline
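
For illustration, a minimal sketch of how these statement-level pragmas are typically placed (the wrapper function is hypothetical, not part of the test program):

[cpp]void Test1();                  // declared elsewhere

void RunTests()
{
#pragma forceinline
    Test1();                   // ask the compiler to in-line this particular call

#pragma noinline
    Test1();                   // keep this call out-of-line
}[/cpp]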

Best regards,

Georg Zitzlsberger

>>why did you increment by 2?

Oops - error on my part, good catch.

I tried the unroll x2 and other variations and missed resetting "incq" (new instruction :). With the corrected code, the single-stream pre-fetch is not showing a significant difference. So this confirms that having multiple streams improves the performance.

Thank you for taking the time to address this. The insight on the use of multi-stream input will be a valuable lesson in future assessments of performance issues (at least for Sandy Bridge).

Jim Dempsey

www.quickthreadprogramming.com

RE: no/mf/sf IPO and inlining

>> different optimization paths executed in the compiler

Granted; however, it appears that when a given optimization is added to mf IPO it is not determined whether that particular optimization would also be suitable for sf IPO and/or no IPO - most likely because this would require time in regression testing.

I think it is a false assumption on Intel's part that mf IPO is the assumed preference for optimization.

Consider the situation of a software provider. Examples:

Intel with MKL, IPP, etc...
A central IT department providing libraries
An Independent Software Vendor providing libraries

Consider now a bug report coming in whereby the problem report contains the source files exhibiting the error, and the error interacts with the provider's libraries. The provider's support person will want source-level debugging for their libraries .AND. a compilation equivalent to that of the user site, which does not have the source to the library. The support person does NOT want mf IPO to pull in files from the libraries to which the user does not have source access.

I may have a configuration issue here, but on my site, when I compile a test program using mf IPO with a solution NOT containing the projects for the "private" library, VS (or Intel C++) is able to locate the project and sources of the "private" library and include these in the mf IPO. I do not want to be placed in the position of requiring two test machines and copying of object files. Yet I want (require) full source-level debugging.

Jim

www.quickthreadprogramming.com

Hello Jim,

I think there is a basic misunderstanding regarding multi-file IPO: IPO does not in-line more code than the linker would collect anyway. IPO only works on what's already been put together by the linker.

So, if you see anything being added to your application binary that should not be there, it's a matter of setting up the linker. Same, vice versa.

Also, anything that was compiled w/o IPO won't be subject to IPO optimization. This is especially true for 3rd party libraries (for which you don't have the source code and hence did not compile with IPO).

Best regards,

Georg Zitzlsberger

Georg,

Are you saying single file IPO is post linker?

The problem is when the party behind the 3rd party library is yourself, and when you want to use mf-IPO yet keep the library out of the IPO consideration - because you are trying to compile with the same optimization selections that the user has, yet have full debug symbols for the user app + your library (not co-mingled by optimizations).

Jim

www.quickthreadprogramming.com

Hello Jim,

Not single-file IPO; only multi-file IPO is done during the linker stage.

Did you try "/Qipo-c" (or "-ipo-c")? This allows you to optimize a library with IPO but drops all the intermediate representation (IR) afterwards. The final (single & optimized) object file can easily be linked into a library using the usual tools. As it won't contain IR it won't be part of the IPO of your main program, yet it still takes advantage of (partial*) IPO.
Of course, if you use "/Qipo" (or "-ipo") for every module/library/component then you also enable all of them for IPO. There are lots of other options that allow you to control how IPO should be applied.

*: Partial, because the inner core of the library will be optimized by IPO but not the interfaces.

Best regards,

Georg Zitzlsberger

Georg,

I did not try /Qipo-c (I didn't know it was available). The pull-down in Visual Studio does not include this option in the project properties. This should be considered as something to add (although the -c part may be a Librarian function). My preference would be to include this choice on the:

(Project) | Optimization [Intel C++] | Interprocedural Optimization

property page (keep it in one place). Can you post this request to the development team.

Also consider including it on

(file.cpp) | Optimization [Intel C++] | Interprocedural Optimization

There /Qipo-c would permit multi-file IPO to be incorporated into file.cpp while excluding the resultant output from being incorporated into the callers of file.cpp's .obj functions (procedures).

Thanks for your tip.

Jim Dempsey

www.quickthreadprogramming.com
