icc does not scale compared to gcc

Hi,

I wrote a very simple OpenMP program to analyze how well it scales with the number of threads:

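int main (void)
{
    const int N = 1000000;
    double sum = 0.0;  /* initialized so the reduction starts from a defined value */

    /* each thread accumulates a private partial sum, combined at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double x = (double) i;
        double y = 1. / ((double) i + 1);
        sum += x * y;
    }
    return 0;
}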

Compiling with "icc -o inner_prod -std=c99 -Wl,-rpath,/opt/intel/lib/intel64 -openmp inner_prod.c" and setting the number of threads using OMP_NUM_THREADS leads to odd behaviour: with more than one thread, the execution time does not vary and is rather slow (number of threads: time):
1: 0.00524879 s
2: 0.20569937 s
4: 0.20614765 s
8: 0.20417278 s
When compiling with gcc, everything behaves as expected:
1: 0.00670819 s
2: 0.00448828 s
4: 0.00276818 s
8: 0.00621841 s

My OS is Ubuntu 14.04 and the processor is an Intel i7 4770k. The icc executable uses libiomp5.so, and the gcc executable libgomp.so.1. I am using icc (ICC) 14.0.2 20140120. Has anybody experienced similar problems, or has got any idea what might be wrong?

Best regards,
Gerald

If it is of any help, I experience the same behaviour with clang and libiomp. Br, Gerald

The Intel C++ forum seems better suited to such questions. First, you must decide whether you want to test how a compiler optimizes away redundant operations. Next, neither compiler can be expected to assume the omp simd clause, but this is not an ideal way to test.

The issue was simply the size of the problem. I checked the binary with VTune, and there were a couple of kmp routines, namely kmp_wait_sleep, kmp_x86_pause and kmp_static_yield, which were idle most of the time, leading to overhead. When N is set to a larger value, the program scales as expected with the number of threads. For those of you who are interested, icc performed better than gcc. Nonetheless, I still find it odd that gcc can cope with smaller problems.

There is definitely something odd going on, though. The routines you name are where you would expect runtime threads to be waiting at barriers or during a serial section. I am suspicious of the fact that there's apparently a 200ms difference between the values you might expect and what you are measuring, since 200ms is the default value of KMP_BLOCKTIME (the amount of time a thread actively waits before deciding to wait in the kernel).

If you try with KMP_BLOCKTIME=infinite and KMP_BLOCKTIME=0, what do you see?

(And, how are you actually timing this?)
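
For anyone who would rather experiment from inside the program than through the environment, here is a minimal C sketch. It assumes the Intel OpenMP runtime, whose omp.h declares kmp_set_blocktime() as an extension. HAVE_KMP_EXTENSIONS and request_blocktime_ms are hypothetical names made up for this sketch, and whether the call changes the timings reported above has not been tested.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Ask the runtime to use the given blocktime (in milliseconds).
   Returns 1 if the request was passed to the runtime, 0 otherwise.
   HAVE_KMP_EXTENSIONS is a hypothetical macro for this sketch: define it
   only when building against a runtime whose omp.h declares
   kmp_set_blocktime(), such as the Intel one. */
static int request_blocktime_ms(int ms)
{
#if defined(_OPENMP) && defined(HAVE_KMP_EXTENSIONS)
    kmp_set_blocktime(ms);   /* Intel runtime extension, not standard OpenMP */
    return 1;
#else
    (void)ms;                /* no extension available: nothing to do */
    return 0;
#endif
}
```

Calling request_blocktime_ms(0) at the start of main, before the first parallel region, would be the in-program counterpart of trying KMP_BLOCKTIME=0.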

Thanks, KMP_BLOCKTIME=0 did the trick. However, I am still confused why a thread would wait instead of starting right away.

I'm using time.time() in Python and cross checked with the shell command time.
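
Timing the whole process from Python or the shell also includes the startup and shutdown of the OpenMP runtime. A rough sketch of timing only the loop from inside the program might look like the following. wall_seconds, inner_prod and time_inner_prod are hypothetical helper names; omp_get_wtime() is the standard OpenMP wall-clock routine, and a POSIX clock is used as a fallback when the code is built without OpenMP.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the fallback timer */
#endif
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* The loop from the original post, with sum initialized and returned. */
static double inner_prod(int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = (double)i;
        double y = 1.0 / ((double)i + 1.0);
        sum += x * y;
    }
    return sum;
}

/* Runs inner_prod(n) once, stores the sum in *result, and returns the
   elapsed wall-clock seconds for that call alone. */
static double time_inner_prod(int n, double *result)
{
    double t0 = wall_seconds();
    *result = inner_prod(n);
    return wall_seconds() - t0;
}
```

Calling time_inner_prod(1000000, &sum) from main and printing the returned value gives the loop time, which can then be compared with the whole-process time measured from outside.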

Something somewhere is broken. Please can you run with KMP_SETTINGS=1 and KMP_VERSION=1. That should give us enough information to submit a bug report and see what it is...

Thanks

There you go:

Intel(R) OMP Copyright (C) 1997-2013, Intel Corporation. All Rights Reserved.
Intel(R) OMP version: 5.0.20131126
Intel(R) OMP library type: performance
Intel(R) OMP link type: dynamic
Intel(R) OMP build time: 2013-11-26 15:53:55 UTC
Intel(R) OMP build compiler: Intel C++ Compiler 14.0
Intel(R) OMP alternative compiler support: yes
Intel(R) OMP API version: 4.0 (201307)
Intel(R) OMP dynamic error checking: no
Intel(R) OMP thread affinity support: not used
Intel(R) OMP debugger support version: 1.1

User settings:

   KMP_SETTINGS=1
   KMP_VERSION=1

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ABORT_IF_NO_IRML=false
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=128
   KMP_ALL_THREADS=2147483647
   KMP_ASAT_DEC=1
   KMP_ASAT_FAVOR=0
   KMP_ASAT_INC=4
   KMP_ASAT_INTERVAL=5
   KMP_ASAT_TRIGGER=5000
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=200
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=0
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_INIT_AT_FORK=true
   KMP_INIT_WAIT=2048
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_MONITOR_STACKSIZE: value is not defined
   KMP_NEXT_WAIT=1024
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
   KMP_REDUCTION_BARRIER='1,1'
   KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
   KMP_SCHEDULE='static,balanced;guided,iterative'
   KMP_SETTINGS=true
   KMP_STACKOFFSET=0
   KMP_STACKSIZE=4M
   KMP_STORAGE_MAP=false
   KMP_TASKING=2
   KMP_TASK_STEALING_CONSTRAINT=1
   KMP_USE_IRML=false
   KMP_VERSION=true
   KMP_WARNINGS=true
   OMP_CANCELLATION=false
   OMP_DISPLAY_ENV=false
   OMP_DYNAMIC=false
   OMP_MAX_ACTIVE_LEVELS=2147483647
   OMP_NESTED=false
   OMP_NUM_THREADS: value is not defined
   OMP_PLACES: value is not defined
   OMP_PROC_BIND='false'
   OMP_SCHEDULE='static'
   OMP_STACKSIZE=4M
   OMP_THREAD_LIMIT=2147483647
   OMP_WAIT_POLICY=PASSIVE
   KMP_AFFINITY='noverbose,warnings,respect,granularity=core,duplicates,none'

Intel(R) OMP Intel(R) RML support: not using

 

Thanks for your patience and assistance, I've submitted a bug CQ DPD200256355 in case you need to refer to it again.

Just to let you know: we found a problem and have a fix that will make it into a future release. We hadn't noticed it before because most people do not use OpenMP in codes whose whole execution takes only a few ms...

Thanks again for drawing our attention to this issue.
