ICC vs. GCC: Strange scaling behavior of an OpenMP parallelized benchmark

I am currently preparing two benchmarks for a new 240-core E7-x890v2 server. On a 60-core test machine (four sockets with E7-4890v2, HyperThreading and TurboBoost enabled; RHEL 7 with Transparent Huge Pages active) I get the following timings for "Benchmark A" with ICC v14.0.2 and GCC v4.8.2:

 

==  60 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 187.65 second(s) CPU time, 3.144 second(s) WALL time.

        Finished in 186.34 second(s) CPU time, 3.122 second(s) WALL time.

        Finished in 205.52 second(s) CPU time, 3.461 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 819.70 second(s) CPU time, 13.649 second(s) WALL time.

        Finished in 779.00 second(s) CPU time, 12.974 second(s) WALL time.

        Finished in 822.83 second(s) CPU time, 13.703 second(s) WALL time.

==  32 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 169.27 second(s) CPU time, 5.295 second(s) WALL time.

        Finished in 169.35 second(s) CPU time, 5.295 second(s) WALL time.

        Finished in 169.25 second(s) CPU time, 5.292 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 369.26 second(s) CPU time, 11.529 second(s) WALL time.

        Finished in 410.28 second(s) CPU time, 12.809 second(s) WALL time.

        Finished in 343.93 second(s) CPU time, 10.739 second(s) WALL time.

==  16 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 172.54 second(s) CPU time, 10.776 second(s) WALL time.

        Finished in 168.89 second(s) CPU time, 10.546 second(s) WALL time.

        Finished in 196.67 second(s) CPU time, 12.284 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 216.70 second(s) CPU time, 13.529 second(s) WALL time.

        Finished in 264.84 second(s) CPU time, 16.540 second(s) WALL time.

        Finished in 214.90 second(s) CPU time, 13.419 second(s) WALL time.

==   8 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 183.34 second(s) CPU time, 22.893 second(s) WALL time.

        Finished in 183.68 second(s) CPU time, 22.937 second(s) WALL time.

        Finished in 183.40 second(s) CPU time, 22.902 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 177.59 second(s) CPU time, 22.176 second(s) WALL time.

        Finished in 179.41 second(s) CPU time, 22.402 second(s) WALL time.

        Finished in 179.39 second(s) CPU time, 22.401 second(s) WALL time.

==   4 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 159.01 second(s) CPU time, 39.709 second(s) WALL time.

        Finished in 159.29 second(s) CPU time, 39.780 second(s) WALL time.

        Finished in 160.02 second(s) CPU time, 39.962 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 171.26 second(s) CPU time, 42.769 second(s) WALL time.

        Finished in 169.51 second(s) CPU time, 42.333 second(s) WALL time.

        Finished in 170.76 second(s) CPU time, 42.642 second(s) WALL time.

==   2 cores  ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 158.64 second(s) CPU time, 79.233 second(s) WALL time.

        Finished in 160.43 second(s) CPU time, 80.127 second(s) WALL time.

        Finished in 158.63 second(s) CPU time, 79.228 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 168.97 second(s) CPU time, 84.384 second(s) WALL time.

        Finished in 168.77 second(s) CPU time, 84.287 second(s) WALL time.

        Finished in 168.78 second(s) CPU time, 84.291 second(s) WALL time.

==   1 core   ===============================================================

----  GCC executable  -------------------------------------------------------

        Finished in 158.25 second(s) CPU time, 158.079 second(s) WALL time.

        Finished in 160.93 second(s) CPU time, 160.756 second(s) WALL time.

        Finished in 158.13 second(s) CPU time, 157.961 second(s) WALL time.

----  ICC executable  -------------------------------------------------------

        Finished in 167.94 second(s) CPU time, 167.762 second(s) WALL time.

        Finished in 168.40 second(s) CPU time, 168.213 second(s) WALL time.

        Finished in 169.90 second(s) CPU time, 169.717 second(s) WALL time.

 

 

Both executables were compiled at optimization level '-O2'. On an Itanium 9560 server of the same size (HP-UX 11.31, aCC compiler), this benchmark behaves, more or less, like the GCC executable does on the Linux server.

 

I would be very grateful for any thoughts, comments, explanations, suggestions… Please, do not hesitate to ask for more information.

 

Thank you for reading.


You didn't say anything about your affinity settings. Both gcc and icc should understand OMP_PLACES; I would start with something like OMP_PLACES=cores.

I don't have experience myself with combining OMP_PLACES and OMP_PROC_BIND. If your application depends on using contiguous cores for cache or NUMA locality, that would be OMP_PROC_BIND=close; if you prefer to spread the threads across as many cores as possible, OMP_PROC_BIND=scatter.

Intel libiomp should give you a full report of the affinity settings via KMP_AFFINITY=verbose, even if you use the standard OMP_* settings. To avoid using OMP_PLACES and OMP_PROC_BIND you might try

KMP_AFFINITY="proclist=[1-120:1],explicit,verbose"

I don't know why documentation of these isn't clearer.

The Intel 15.0 compiler announced the extension of KMP_PLACE_THREADS to Xeon. That would be

KMP_PLACE_THREADS=60c,1t

meaning: use up to 60 cores in sequence, one thread per core.

If you don't want "strange behavior" in the absence of affinity setting, disabling HyperThreading is a likely step.

Thank you very much for your thoughts and suggestions. The benchmark runs—as far as I know—on all tested architectures and operating systems with the default affinity settings; all other OpenMP parameters are hard-coded (schedule clauses etc.) or set by library functions (number of threads).

 

'OMP_PLACES=cores' lowers the execution time (combining it with 'OMP_PROC_BIND=close' has no measurable effect, and 'OMP_PROC_BIND=scatter' is rejected: "OMP: Warning #42: OMP_PROC_BIND: "scatter" is an invalid value; ignored."):

 

==  60 cores  ========================================

----  GCC executable  --------------------------------

        Finished in 205.78 second(s) CPU time, 3.460 second(s) WALL Time.

        Finished in 174.98 second(s) CPU time, 2.934 second(s) WALL Time.

        Finished in 183.70 second(s) CPU time, 3.076 second(s) WALL Time.

----  ICC executable  --------------------------------

        Finished in 633.43 second(s) CPU time, 10.546 second(s) WALL Time.

        Finished in 629.45 second(s) CPU time, 10.481 second(s) WALL Time.

        Finished in 629.35 second(s) CPU time, 10.480 second(s) WALL Time.

==  120 cores  =======================================

----  GCC executable  --------------------------------

        Finished in 305.42 second(s) CPU time, 2.584 second(s) WALL Time.

        Finished in 301.16 second(s) CPU time, 2.545 second(s) WALL Time.

        Finished in 300.13 second(s) CPU time, 2.536 second(s) WALL Time.

----  ICC executable  --------------------------------

        Finished in 2137.62 second(s) CPU time, 17.795 second(s) WALL Time.

        Finished in 2148.57 second(s) CPU time, 17.889 second(s) WALL Time.

        Finished in 2146.36 second(s) CPU time, 17.868 second(s) WALL Time.

 

 

'KMP_AFFINITY="proclist=[0-119:1],explicit"' works like 'OMP_PLACES=cores':

 

==  60 cores  ========================================

----  ICC executable  --------------------------------

        Finished in 630.36 second(s) CPU time, 10.498 second(s) WALL Time.

        Finished in 633.60 second(s) CPU time, 10.551 second(s) WALL Time.

        Finished in 629.74 second(s) CPU time, 10.486 second(s) WALL Time.

----  GCC executable  --------------------------------

        Finished in 180.32 second(s) CPU time, 3.022 second(s) WALL Time.

        Finished in 178.92 second(s) CPU time, 2.999 second(s) WALL Time.

        Finished in 181.14 second(s) CPU time, 3.055 second(s) WALL Time.

==  120 cores  =======================================

----  ICC executable  --------------------------------

        Finished in 2150.54 second(s) CPU time, 17.902 second(s) WALL Time.

        Finished in 2149.38 second(s) CPU time, 17.894 second(s) WALL Time.

        Finished in 2134.36 second(s) CPU time, 17.771 second(s) WALL Time.

----  GCC executable  --------------------------------

        Finished in 299.85 second(s) CPU time, 2.534 second(s) WALL Time.

        Finished in 306.25 second(s) CPU time, 2.598 second(s) WALL Time.

        Finished in 300.12 second(s) CPU time, 2.541 second(s) WALL Time.

 

 

I ran this benchmark a week ago with HyperThreading disabled, unfortunately with a similar result… (I only have ssh access to this system, so I cannot change BIOS options myself, and a proprietary Linux tool for doing so before a reboot isn't installed.)

I did some tests with a modified version of the benchmark's source code, in particular with an altered OpenMP schedule clause on the central loop, but without any success.

Does anybody have further ideas or explanations for this ruinous performance of the ICC-generated machine code? Could the interplay of Red Hat Enterprise Linux 7 with ICC 14.0.2 and GCC 4.8.2 cause this problem? (The upcoming 240-core E7v2 server will, unfortunately, only be certified for RHEL 7.)

On the 'Westmere-EX' CPU generation the ICC (v13.x) executable was up to 35% faster (!) than the one from GCC (v4.7.x), but with the current result it is impossible for us to convince the major customer behind this project to buy Intel compiler/development suites for the new systems…

Just curious: is your testing done under a released Red Hat version or a RHEL7 beta?

I could imagine performance problems with a not-yet-fully-supported platform combination. Among those problems might be a failure of the KMP_* and OMP_* pinning environment variables. This could be investigated by setting the KMP_AFFINITY verbose option and, if necessary, using the KMP_AFFINITY="proclist=[....]" options, not an easy task with 120 slots to handle. The full proclist might make it possible to work around deficiencies of automatic placement (and offer hope for timely resolution of a problem report). For example, I would hope that 60 threads could be spread across 60 HyperThreaded cores by KMP_AFFINITY="proclist=[1-119:1],explicit,verbose" or some such incantation.

I couldn't find direct references about plans to include RHEL7 support in future icc releases. There is documentation that RHEL5 support (i.e., full testing) was to be dropped and then was reinstated but deprecated, which seems to imply a delay for RHEL7 support.
