First of all, I would like to thank everyone for the wonderful TBB library; it looks very promising, and I am currently prototyping its use for the open-source project Stan, which is an MCMC program.
The key bottleneck in Stan is the calculation of the log-likelihood and its gradients with respect to the parameters. I have successfully implemented a toy example which iteratively evaluates the log-likelihood of a Poisson model (a sketch of the reduce pattern follows the timings below). The toy example performs nicely, with these timings (units are ns):
cores   Benchmark        Time       CPU        Iterations
1       BM_tbbM_median   501958 ns  500734 ns  1208
2       BM_tbbM_median   281973 ns  279780 ns  2413
4       BM_tbbM_median   177745 ns  176584 ns  3890
6       BM_tbbM_median   146703 ns  145824 ns  4433
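For context, here is a minimal, self-contained sketch of the kind of parallel_reduce I am using; the function names and the exact Poisson term are only for illustration, the real code is in the links further down:

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Sketch only: sum the Poisson log-likelihood terms
//   y[i] * log(lambda) - lambda - lgamma(y[i] + 1)
// over a blocked_range with an explicit grainsize.
double poisson_log_lik(const std::vector<int>& y, double lambda,
                       std::size_t grainsize) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, y.size(), grainsize),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += y[i] * std::log(lambda) - lambda - std::lgamma(y[i] + 1.0);
        return partial;
      },
      std::plus<double>());
}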
However, I am getting terrible performance when running the same thing in the actual application:
37.30 real 37.11 user 0.07 sys
21.50 real 42.35 user 0.26 sys
17.23 real 50.69 user 0.44 sys
44.45 real 174.29 user 2.17 sys
241.85 real 1270.00 user 52.29 sys
So you can see that with 2-3 cores things speed up, but then the execution times explode. The real example includes 4000 terms, and I have set a grainsize of 100 (basically the same results with a grainsize of 1000).
I am really lost at the moment as to why this happens. It looks to me as if the TBB scheduler gets totally off the rails due to the longer breaks between evaluations. So I was wondering whether putting the worker threads to sleep can be avoided, but I am really guessing in the dark here.
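In case it matters for reproducing the runs above, here is a minimal sketch of how I understand the TBB worker count can be capped per run (this assumes the tbb::global_control API of recent TBB releases; older releases used tbb::task_scheduler_init for the same purpose):

#include <tbb/global_control.h>

int main() {
  // Limit the TBB worker pool to 4 threads for the lifetime of this object
  // (assumes a TBB release that ships tbb::global_control).
  tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 4);

  // ... run the benchmark / sampler here ...
  return 0;
}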
The toy example is here: https://github.com/wds15/perf-math/blob/tbb/tbb-scale.cpp#L124
The TBB parallel_reduce is here: https://github.com/stan-dev/math/blob/1b6abbfc389cb8bfd803b5bac759dbd196...
The actual application code pieces would be here: https://discourse.mc-stan.org/t/proposed-parallelism-rfc-stan-language-b... (but that's maybe not too helpful)
I would very much appreciate any hints on how to debug this. If more information is needed, please let me know.
Many thanks in advance.