Very bad scaling with many cores

Hello everyone!

First of all, I would like to thank everyone for the wonderful TBB library. It looks very promising, and I am currently prototyping its use in the open-source project Stan, which is an MCMC program.

The key bottleneck of Stan is the calculation of the log-likelihood and its gradients with respect to the parameters. I have successfully implemented a toy example which iteratively evaluates the log-likelihood of a Poisson model. The toy example performs nicely, with these timings (units are ns):

1 core:  BM_tbbM_median        501958 ns     500734 ns       1208
2 cores: BM_tbbM_median        281973 ns     279780 ns       2413
4 cores: BM_tbbM_median        177745 ns     176584 ns       3890
6 cores: BM_tbbM_median        146703 ns     145824 ns       4433

I am getting terrible performance when running the same thing in the actual application:

       37.30 real        37.11 user         0.07 sys
       21.50 real        42.35 user         0.26 sys
       17.23 real        50.69 user         0.44 sys
       44.45 real       174.29 user         2.17 sys
      241.85 real      1270.00 user        52.29 sys

As you can see, things speed up with 2-3 cores, but beyond that the execution times explode. The real example includes 4000 terms, and I have set a grainsize of 100 (a grainsize of 1000 gives basically the same results).

I am really lost as to why this happens. It looks to me as if the TBB scheduler goes off the rails because of the longer breaks between the evaluations. So I was wondering whether putting threads to sleep can be avoided, but I am really just guessing in the dark here.

The toy example is here:

The TBB parallel_reduce is here:

The actual application code pieces would be here: (but that's maybe not too helpful)

I would very much appreciate any hints on how to debug this. If more information is needed, please let me know.

Many thanks in advance.

