I have one of those embarrassingly parallel applications that never really is embarrassingly parallel, and I've hit a weird scheduling issue that I don't know how to solve.
The basic workflow is that there are many tasks (one per pixel) that run independently. However, some of the data they work on is produced lazily. When that happens, I want all tasks to pitch in and help as much as possible, so that everyone can carry on. This LazyTask is mostly serial but has some parallel components (a parallel_for, really). My problem is this: when I run it in isolation, that is, I just run the task from the master thread, I get the expected speedup (see WorksFast in the example). However, when I run it so that inside the main parallel_for (over each pixel) I create a task_group containing just this lazy task and then wait on it, it becomes really slow (see SlowerThanExpected). It is actually almost 2x slower than if I don't try to parallelize the lazy task at all (see WantToSpeedUp).
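To make the shape of the workflow concrete, here is a minimal sketch, not my actual code. It deliberately avoids TBB and uses only the standard library: plain `std::thread` workers stand in for the outer parallel_for, and `std::call_once` stands in for the lazy production. The names `LazyData` and `process_pixels` and the placeholder payload are made up for illustration. Because the other workers simply block while the data is produced, this corresponds to the WantToSpeedUp variant, assuming that variant waits on a lock-like primitive.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one lazily produced data set.
struct LazyData {
    std::once_flag once;       // guards the one-time production
    std::vector<int> values;   // the expensive result

    // Every pixel task calls this. Only the first caller runs the body;
    // concurrent callers block until it has finished and cannot help.
    const std::vector<int>& get() {
        std::call_once(once, [this] {
            values.assign(1000, 1);  // placeholder for the real LazyTask
        });
        return values;
    }
};

// Processes `pixels` pixels on `workers` threads and returns the sum
// of the per-pixel results.
long process_pixels(std::size_t pixels, unsigned workers) {
    assert(workers > 0);
    LazyData lazy;
    std::vector<long> partial(workers, 0);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&lazy, &partial, pixels, workers, w] {
            // Static striping of pixels over threads (the real code
            // lets the scheduler distribute the range instead).
            for (std::size_t p = w; p < pixels; p += workers) {
                const std::vector<int>& d = lazy.get();
                partial[w] += d[p % d.size()];
            }
        });
    }
    for (auto& t : pool) t.join();
    long total = 0;
    for (long v : partial) total += v;
    return total;
}
```

Calling `process_pixels(100, 4)` runs the placeholder lazy body exactly once, no matter which worker gets there first. The SlowerThanExpected variant differs in that the waiting workers are not parked but keep asking the scheduler for work.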
Looking at htop, I can see that when I run the lazy task in isolation, it clearly runs single-threaded except where the extra threads are needed. However, in the task_group-inside-parallel_for scenario, all threads are busy all the time (probably polling the scheduler for tasks that are mostly not there). Is there any good way out of this mess?
I understand that "WantToSpeedUp" cannot possibly utilize the parallel_for inside the LazyTask, since I've blocked all the other tasks. I thought the 2x slowdown could be at least partially the fault of Turbo Boost not kicking in when all the other threads are spinning madly, but I checked the frequency in both cases and there isn't much of a difference. I run on 2x 12 cores (+ HT, so 48 threads altogether). LazyTask takes about 45s single-threaded, with 10s in the for loop. In "WorksFast", the loop runs in 1.5s and the overall time is 33s. In "WantToSpeedUp", it runs the same as single-threaded. In "SlowerThanExpected", the LazyTask takes about 74s altogether, even though the parallel_for does its job and takes only 1.5s. So it seems that the constant pressure of the tasks on the scheduling system is the culprit here, but maybe my pattern is just really bad and can be done in a better way? (The LazyTask cannot be run before the main loop, as there are many different such lazy tasks, all of them are memory hungry, and not all are needed each time.)
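A back-of-the-envelope check of those numbers, as a small sketch. The `Timing` struct and both helper functions are names made up for this calculation, and the inputs are just the rough timings quoted above.

```cpp
#include <cassert>

// One measurement of LazyTask: total wall time and the part spent
// inside its parallel_for, both in seconds.
struct Timing {
    double total_s;
    double loop_s;
};

// Time spent outside the parallel_for, i.e. the serial part.
double serial_part(Timing t) {
    assert(t.total_s >= t.loop_s);
    return t.total_s - t.loop_s;
}

// Total time one would expect if only the loop got faster and the
// serial part stayed as in the single-threaded run.
double predicted_total(Timing single_threaded, double parallel_loop_s) {
    return serial_part(single_threaded) + parallel_loop_s;
}
```

With the single-threaded run as `Timing{45.0, 10.0}`, the serial part is 35s, so `predicted_total` with a 1.5s loop gives 36.5s, in the same ballpark as the 33s measured for WorksFast. For SlowerThanExpected, `serial_part(Timing{74.0, 1.5})` is 72.5s, so it is the serial part, not the loop, that takes roughly twice as long as its 35s baseline.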
Sample code -- EDIT: I removed the sample code because it contained bugs. The code in the next post is an actual working example.