Less performance on 16 cores than on 4?!

Hi there,

I evaluated my Cilk application using "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.
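For reference, a sweep over core counts along those lines might look like the following (a hypothetical script; MYPROGRAM is a placeholder for the actual binary, which is not shown in this thread). The commands are only echoed here so the core-list logic can be checked without the binary:

```shell
# Hypothetical scaling sweep; MYPROGRAM stands in for the real program.
# Each run pins the process to cores 0..(n-1) via taskset.
for n in 2 4 8 12 16; do
  last=$((n - 1))
  echo "cores=$n: taskset -c 0-$last ./MYPROGRAM"
done
```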

I was very surprised to see that the performance increases up to a certain number of cores but decreases afterwards.

With 2 cores I gain a speedup of 1.85; with 4, 3.15; with 8, 4.34. But with 12 cores the speedup drops down close to the speedup gained by 2 cores (1.99).
16 cores perform slightly better (2.11).

How is such a behaviour possible? Either an idle thread can steal work or it can't! Or might the work packets be too coarse-grained, so that the stealing overhead destroys the performance when too many cores are in use?


Can you post your example, or an example with similar behavior?  Does your machine have a single socket (processor chip) or multiple sockets?  Some ways the behavior can happen:

  • As you note, work chunks might be too coarse, and so some threads starve and just run interference trying to steal.  
  • The communication cost of stealing work might exceed the gains from parallelism.  For example, short back-to-back cilk_for loops with high bandwidth/compute ratios are sometimes slowed down by random work-stealing.
  • A thread calls a package that is internally multithreaded, thus causing oversubscription.

Shouldn't CILK_NWORKERS also be set for such a purpose?

As my 12-core box is a dual-socket Westmere (HT disabled), to run on 8 cores with Cilk(tm) Plus using data initialized under OpenMP, I use:

export KMP_AFFINITY="proclist=[1,3,4,5,7,9,10,11],explicit"    (effectively setting OMP_NUM_THREADS=8)

export CILK_NWORKERS=8

taskset -c 1,3,4,5,7,9,10,11 ./a.out

I was hoping to set things up so as to use a single core on each of the 4 cache paths per CPU. I get the whole expected range of scaling characteristics: some cases run 50% faster with 12 workers than with 8, and one runs 60% faster on 8 cores than on 12. I don't see an obvious way to predict which characteristic a given case will show. Given the 4 cache paths per CPU, it's not entirely a surprise to see some cases max out at 8 workers (possibly sooner when the work is not carefully distributed).

Even with the verbose option set, it's not clear how taskset affects KMP_AFFINITY, but the assumption that taskset doesn't influence OpenMP affinity while it does (as you found) limit the Cilk runtime seems to work out.
