Hi, I'm threading loops with OpenMP and pthreads. I am working on a machine with four quad-core Xeon E7340 processors (4 packages with 4 cores per package). Below, (a) and (b) are two loops that I threaded with OpenMP. The only difference between them is the expression computed in the loop body. I set the number of threads via "export OMP_NUM_THREADS=4" and the affinity via KMP_AFFINITY="explicit,proclist=[....]".

The odd thing is that if I pin the 4 threads to cores that reside on different packages, I get a factor of 4 speedup for both loops. However, if I pin the threads to 4 different cores on the same package, I get a factor of 4 speedup for loop (b) but no speedup for loop (a).

I don't believe there is any cache thrashing going on (each core has its own L1 cache, and 2 cores per package share the same 4 MB L2 cache), because I can set the OpenMP parameters so that each thread works on chunks larger than the cache. The effect also occurs for loop sizes spanning many orders of magnitude (tested up to 2^28). I also don't believe this is an issue with OpenMP/pthread initialization, since I do get the speedup for loop (a) when all cores reside on different packages. The same thing happens with pthreads, where the affinity is set via pthread_setaffinity_np(....). (Also, if I use all 16 cores I get a factor of 16 speedup for loop (b), but only 4 for loop (a).)
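For the pthread variant, the pinning can be sketched roughly as follows. This is only a minimal illustration, assuming Linux with glibc; the helper names pin_self_to_cpu and first_allowed_cpu are hypothetical and are not the code I actually ran.

```c
#define _GNU_SOURCE   /* needed for pthread_setaffinity_np and the CPU_* macros */
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to one logical CPU.
 * Returns 0 on success, or the error number from pthread_setaffinity_np. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered CPU the calling thread is
 * currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each worker thread would call pin_self_to_cpu() with its own CPU number before entering the loop. Which logical CPU numbers belong to which package depends on the machine and has to be looked up (for example in /proc/cpuinfo).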
I am using the 64-bit version of the MKL library 20100414Z, linking the following libraries: -L$(MKL_PATH) -liomp5 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lrt and using the following compiler options: -O3 -ip -axPTW -D_GNU_SOURCE -openmp (similar results occur if I turn off optimizations with -O0).
Can somebody explain why I get the expected speedup for both loops when the threads are pinned to cores on different packages, but a speedup for only some loops when the threads are pinned to cores on the same package?
(a) #pragma omp parallel for default(none) shared(a, b, c, n)
for (i = 0; i < n; i++) c[i] = a[i] * b[i];
(b) #pragma omp parallel for default(none) shared(a, b, c, n)
for (i = 0; i < n; i++) c[i] = (sin(a[i]) + cos(b[i])) * exp(-a[i]);