Just compiling with OpenMP (not using it anywhere in the code) gives me a huge performance boost. How is that possible?

Hi,

I'm developing a piece of code and, to my surprise, I have found that if I add the -openmp option when compiling I get around a 36% reduction in execution time (while not using any OMP directives in the code and, just in case, setting OMP_NUM_THREADS to 1).

Regular compilation is with -fast

If I compile with -fast -openmp then I get this 36% time reduction.

This happens on a 64-bit Ubuntu 11.10 box, running ifort 12.1.3.

How is this possible? I read the documentation and I only found that -openmp adds the -automatic option, but removing -openmp and adding -automatic doesn't give any performance boost. Any clues?
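(To make the comparison concrete, this is essentially what I'm doing; the source file name here is made up, the real build involves several files:)

ifort -fast mancha.f90 -o mancha.x
ifort -fast -openmp mancha.f90 -o mancha.x
OMP_NUM_THREADS=1 ./mancha.x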

Thanks a lot,
Ángel de Vicente

---------------

angelv@carro:~/SPIA.WC/mancha_src$ ifort -V
Intel Fortran Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.3.293 Build 20120212
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

angelv@carro:~/SPIA.WC/mancha_src$ uname -a
Linux carro 3.0.0-19-generic #33-Ubuntu SMP Thu Apr 19 19:05:14 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Besides changing the default allocation of local arrays to the stack, including those in the Fortran main program, -openmp starts up the OpenMP runtime.
Your hardware identity may be of interest. If you have a dual-CPU NUMA platform, it's possible this may assist in placing those arrays on the memory banks local to the CPU which is running. If so, it could have more effect if your compilation is vectorized with AVX. You may want to watch your thread placement to see if locality to a core is improved, and check whether setting KMP_AFFINITY makes a difference.
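As an illustration of the allocation change (a minimal sketch, not from the poster's code): with ifort's defaults a large local array like BUF below has static storage, while -openmp or -automatic makes it an automatic (stack) array, so on a NUMA system its pages can be first-touched on the memory node of the CPU running the routine.

! Hypothetical routine: BUF is statically allocated by default,
! but becomes a stack (automatic) array under -openmp or -automatic.
subroutine smooth(n, x)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: x(n)
  real    :: buf(1000000)   ! local scratch; only its storage class changes (assumes n <= size(buf))
  integer :: i
  do i = 1, n
     buf(i) = 0.5 * x(i)    ! first touch happens on the running CPU's memory node
  end do
  x(1:n) = buf(1:n)
end subroutine smooth

With the Intel OpenMP runtime, setting KMP_AFFINITY=verbose makes it report where each thread is bound, which is one way to watch thread placement.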

If your program is performing WRITEs you may see an improvement due to buffering changes.

As TimP indicated, stack placement on NUMA may contribute to some of the advantage - though not as much as 36%.

Jim Dempsey

Hi,

I'm running this in a box with the following processor:
http://ark.intel.com/products/35365/Intel-Core2-Quad-Processor-Q9400-%286M-Cache-2_66-GHz-1333-MHz-FSB%29

Correct me if I am wrong, but I thought that compiling with -fast (well, in fact anything at -O2 or above) already did auto-vectorization, so shouldn't I see the benefits of AVX even without -openmp?

How can I see thread placement?

Thanks,
Ángel de Vicente

Hi Jim,

no WRITEs (well, a few, but of no consequence; the routines that get the biggest improvement don't do any I/O), so I want to follow the NUMA path. Which tools would you recommend for looking at stack placement, hardware counters, etc.?

Ángel de Vicente

Hi,
just to add some extra information. I have run the code again and collected some performance counters with perf (https://perf.wiki.kernel.org/index.php/Main_Page). (I also profiled cache misses, though they are not shown here; the cache-miss counts were similar for both versions, though actually a bit higher for the version compiled with -openmp.) When compiled with -openmp, the number of instructions goes from around 50,000M to 33,000M?!
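(These are the counters that perf stat prints by default; the numbers below were collected with something like: perf stat ../../../mancha2D_h5fc.x mancha.trol.)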

Any clues? The details are below:

Thanks,
Ángel de Vicente

The code compiled with -fast -openmp gives:

Timing report:
Timer Number Iterations Mean real time Mean CPU time Minimum Maximum
---------------------------------------- (s) (s) (s) (s)
pefrompg 14 0.7004E+00 0.7003E+00 0.6908E+00 0.7088E+00

Performance counter stats for '../../../mancha2D_h5fc.x mancha.trol':

14453.279816 task-clock # 1.000 CPUs utilized
510 context-switches # 0.000 M/sec
19 CPU-migrations # 0.000 M/sec
26,817 page-faults # 0.002 M/sec
38,429,014,812 cycles # 2.659 GHz
stalled-cycles-frontend
stalled-cycles-backend
33,733,041,117 instructions # 0.88 insns per cycle
1,759,229,047 branches # 121.718 M/sec
18,322,523 branch-misses # 1.04% of all branches

14.460452583 seconds time elapsed

The code compiled only with -fast gives:

Timing report:
Timer Number Iterations Mean real time Mean CPU time Minimum Maximum
---------------------------------------- (s) (s) (s) (s)
pefrompg 14 0.1252E+01 0.1252E+01 0.1247E+01 0.1258E+01

Performance counter stats for '../../../mancha2D_h5fc.x mancha.trol':

22691.325492 task-clock # 1.000 CPUs utilized
527 context-switches # 0.000 M/sec
6 CPU-migrations # 0.000 M/sec
26,759 page-faults # 0.001 M/sec
60,334,580,636 cycles # 2.659 GHz
stalled-cycles-frontend
stalled-cycles-backend
50,161,667,724 instructions # 0.83 insns per cycle
1,781,648,031 branches # 78.517 M/sec
19,022,240 branch-misses # 1.07% of all branches

22.698564007 seconds time elapsed

Try compiling for the host architecture (include only the instruction path of your host system).

OpenMP may insert fewer alternate code paths.

Note, if you are shipping compiled code, then consider the impact of -arch:host (or whatever the option switch is for your variant of the compiler).

Jim Dempsey

Adding OpenMP certainly will increase the number of instructions executed, and should distribute those additional instructions effectively among hardware threads. It will be a big boost to those who like to see a higher rate of instruction execution.
You would need to find sections of your code which speed up and investigate locally in more detail to gain understanding of how OpenMP might have helped practical performance.
You haven't told us enough about your application to guess whether AVX shows an advantage, but it's true that -fast engages AVX vectorization when compiling on an AVX CPU. Compilers are getting smarter about choosing between AVX-128 and AVX-256, which is a particularly critical choice for the current AVX CPUs.
If your invocation of OpenMP does improve data placement, that would be crucial for seeing an advantage from AVX-256 vectorization.

Hi Jim,

Try compiling for the host architecture (include only the instruction path of your host system).

OpenMP may insert fewer alternate code paths.

Note, if you are shipping compiled code, then consider the impact of -arch:host (or whatever the option switch is for your variant of the compiler).

Jim Dempsey

Thanks for the suggestion, but I thought (correct me if I'm wrong) that not including -arch:host would put more instructions into the executable file (code paths for other architecture types) but would not much influence the number of instructions actually executed: the code would choose one route or another depending on the architecture, so only a few extra instructions would be executed to decide which architecture it was running on. Yet I'm seeing a big difference in the number of instructions executed.

In any case, I just tried again, compiling with -O3 -arch ssse3,

and perf still tells me that the instructions executed are almost exactly the same as in my previous post:

Timing report:
Timer Number Iterations Mean real time Mean CPU time Minimum Maximum
---------------------------------------- (s) (s) (s) (s)
pefrompg 14 0.1167E+01 0.1166E+01 0.1164E+01 0.1169E+01

Performance counter stats for '../../../mancha2D_h5fc.x mancha.trol':

23691.749803 task-clock # 0.999 CPUs utilized
538 context-switches # 0.000 M/sec
21 CPU-migrations # 0.000 M/sec
26,982 page-faults # 0.001 M/sec
62,995,709,319 cycles # 2.659 GHz
stalled-cycles-frontend
stalled-cycles-backend
50,516,716,464 instructions # 0.80 insns per cycle
1,836,493,981 branches # 77.516 M/sec
16,682,960 branch-misses # 0.91% of all branches

23.709022449 seconds time elapsed

Thanks,
Ángel de Vicente

You get multiple execution paths only with the -ax compile options. As Jim suggested, -xHost gives you the same architecture choice as -fast without the IPO, setting -O3 only if you want it. -fast is useful mainly in situations where there is a rule against using more than 2 or 3 compile options. However, I'm not betting on that being relevant to your question.
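For example (just a sketch; check the exact option spellings in your compiler version's documentation, and the file name is made up):

ifort -O3 -xHost prog.f90    # one code path, tuned for the machine you compile on
ifort -O3 -axAVX prog.f90    # baseline path plus an AVX alternate, dispatched at run time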

Hi Tim,

Adding OpenMP certainly will increase the number of instructions executed, and should distribute those additional instructions effectively among hardware threads. It will be a big boost to those who like to see a higher rate of instruction execution.

You would need to find sections of your code which speed up and investigate locally in more detail to gain understanding of how OpenMP might have helped practical performance.

You haven't told us enough about your application to guess whether AVX shows an advantage, but it's true that -fast engages AVX vectorization when compiling on an AVX CPU. Compilers are getting smarter about choosing between AVX-128 and AVX-256, which is a particularly critical choice for the current AVX CPUs.

If your invocation of OpenMP does improve data placement, that would be crucial for seeing an advantage from AVX-256 vectorization.

But the funny thing is that in my code, compiling with OpenMP does not increase the number of instructions executed; it decreases it (and the code doesn't have any OpenMP directives).

Is there any way for me to see whether AVX-128 or AVX-256 is being chosen for the instructions in the routine I'm most interested in? (I will try to produce a reduced toy version of the code I'm compiling and see if I still get similar results, so I can post it here.)

Thanks for your help,
Ángel

I didn't entirely understand your statement about the increase or decrease in the number of instructions executed. Poor data locality would increase the number of cache misses.
Compiling with -S to make a .s of your important Fortran files, or running in a profiler such as VTune or oprofile, would enable you to see whether AVX-128 or AVX-256 is executed in your important loops. AVX-128 code uses xmm registers (like SSE and scalar instructions) while AVX-256 uses ymm registers. On current AVX implementations, AVX-128 may be faster where there isn't 32-byte alignment, or where loops are short, and the compiler tries to make decisions on those questions.
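As a sketch of that check (the kernel below is hypothetical, not from the poster's code): compile the file with ifort -S, then search the generated .s for ymm, e.g. grep ymm kernel.s; hits mean AVX-256 code was generated, while xmm-only code indicates AVX-128 or SSE.

! Hypothetical kernel: a loop like this is a prime vectorization candidate.
subroutine axpy(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a, x(n)
  real,    intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = y(i) + a * x(i)   ! AVX-256 would process eight singles per ymm operation
  end do
end subroutine axpy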

Hi Tim,

I didn't entirely understand your statement about the increase or decrease in the number of instructions executed. Poor data locality would increase the number of cache misses.

Well, perhaps I misunderstood one of your earlier posts, but I thought you meant that compiling with OpenMP would increase the number of instructions that my code would end up executing; according to the perf counters, though, the version of my code compiled with -fast -openmp ends up executing around a third fewer instructions than the one compiled with only -fast. (According to the same tool, the cache misses are nearly identical for both executables.) All this in Fortran source code that doesn't have any OpenMP directives.

Compiling with -S to make a .s of your important Fortran files, or running in a profiler such as VTune or oprofile, would enable you to see whether AVX-128 or AVX-256 is executed in your important loops. AVX-128 code uses xmm registers (like SSE and scalar instructions) while AVX-256 uses ymm registers. On current AVX implementations, AVX-128 may be faster where there isn't 32-byte alignment, or where loops are short, and the compiler tries to make decisions on those questions.

Thanks, I will follow this up.
