Have we made task_group a little too easy to use?

One of the things introduced with the TBB 2.2 release was support for functional parallelism, or the high-level task scheduling API, whichever you prefer to call it. This is the task_group class, which allows users to spawn tasks that wrap user functions that could potentially run in parallel; more functions can be added to the same task group dynamically. The TBB Reference Manual explains task_group using the ubiquitous Fibonacci example. There is another situation where task_group can be a very good fit as well:



task_group g;
g.run( [&]() { TickPhysics(...); } );
g.run( [&]() { TickParticles(...); } );
g.run( [&]() { TickAI(...); } );
// master thread does not join the team right away, but rather does UI,
// or monitors scene update until it's time to render.
g.wait();
// then the master makes sure Physics, Particles and AI are done (helping if necessary)
// and goes on with rendering the scene using updated data

So why did I say "another situation …"? Because really there are just two good use cases for the task_group.

The first case is when a function spawned through a task_group spawns more functions itself - that's what we see in the Fibonacci example. This is called recursive task spawning, the purpose of which is to minimize parallel overhead by (1) minimizing the amount of stealing during job distribution by ensuring that as many victim threads as possible have spare tasks in their "pockets", and (2) minimizing the amount of stealing during job execution by keeping a sufficient amount of locally available tasks.
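To make the recursive pattern concrete, here is a minimal Fibonacci sketch. In real TBB code the forks would be task_group::run() calls, as in the Reference Manual example; std::async stands in here only so the sketch is self-contained, and the cutoff of 16 is an arbitrary assumption.

```cpp
#include <future>

// Recursive task spawning, with std::async standing in for
// tbb::task_group's run()/wait(). Each level forks half of the
// remaining work, so idle threads quickly find large stealable
// chunks instead of queueing on a single task pool.
long fib(int n) {
    if (n < 16)  // cutoff: below this, spawning overhead dominates, go serial
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    auto left = std::async(std::launch::async, fib, n - 1);  // fork one half
    long right = fib(n - 2);   // keep the other half on this thread
    return left.get() + right; // join, like task_group::wait()
}
```

The serial cutoff matters for the same reason recursive spawning does: it keeps per-task work large enough that scheduling overhead stays negligible.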

The second case is exemplified by the pseudo-code above. Generally it's the situation where there are just a few functions that should be executed in parallel, and you possibly want to keep the main thread to yourself to do polling, UI, etc. without over-subscribing the application. In that case, if each of those functions takes at least a few hundred thousand CPU cycles to execute, the overhead of finding and stealing the task that represents it is still negligible.

It's noteworthy that even in the case of a few large functions, if the number of functions is fixed and there's no need to hold back the main thread (meaning we can easily have the master help the rest of the team), it's better to use parallel_invoke(), which also spawns functions for potentially parallel execution, but does so recursively, thereby helping idle threads find and steal a task faster.
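To illustrate why the recursive spawning inside parallel_invoke() helps, here is a sketch that distributes a fixed set of functions by splitting the list in half at each level, so the spawn tree is logarithmically deep and thieves find whole subtrees of work. invoke_all is a hypothetical helper, not a TBB API, and std::async stands in for TBB task spawning.

```cpp
#include <functional>
#include <future>
#include <vector>

// Run all functions in fs[lo, hi) with binary splitting: each level
// forks the left half and recurses into the right half on the current
// thread, instead of the caller queuing every task itself.
void invoke_all(const std::vector<std::function<void()>>& fs,
                std::size_t lo, std::size_t hi) {
    if (hi <= lo) return;
    if (hi - lo == 1) { fs[lo](); return; }      // leaf: just call it
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,    // fork the left half
                           [&] { invoke_all(fs, lo, mid); });
    invoke_all(fs, mid, hi);                      // right half runs here
    left.get();                                   // join
}
```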

What I'm seeing lately is people misusing the task_group API by putting something like this into their code:



task_group g;
for(int i = 0; i < N; ++i) {
g.run([i] { TickAIObject(i); });  // run() takes a nullary functor, so capture i by value
}
g.wait();

Unless the function TickAIObject() spawns a large amount of parallel work itself, this code will not scale! All the available TBB worker threads try to steal from a random neighbor, but only the master thread has tasks in its task pool that could be stolen. Thus most of the workers' attempts to steal from other workers will be unsuccessful (say the application runs on a 4- or 8-core machine) and time-consuming. Eventually, the workers will line up in front of the master thread's task pool to take out one task at a time. Only when a worker finally gets a task can it go off and execute it, in parallel with the other workers that have acquired a task so far. If that amount of parallel overhead is not shocking, then I don't know what is :).
The correct solution for the case above would be parallel_for() since it works for dynamic ranges and spawns tasks recursively internally.
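For illustration, here is a sketch of the recursive range splitting that parallel_for() performs internally; in real code you would simply write tbb::parallel_for over the range and let the library choose the grain size. parallel_for_sketch and its fixed grain parameter are assumptions of this sketch, with std::async standing in for TBB tasks.

```cpp
#include <future>

// Recursively split [lo, hi) and run leaves serially, so work enters
// the scheduler as large stealable subranges rather than N tiny tasks.
// In TBB this is roughly: tbb::parallel_for(0, N, [](int i){ body(i); });
template <typename Body>
void parallel_for_sketch(int lo, int hi, int grain, const Body& body) {
    if (hi - lo <= grain) {                // small enough: run serially
        for (int i = lo; i < hi; ++i) body(i);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,   // fork the left half
                           [&] { parallel_for_sketch(lo, mid, grain, body); });
    parallel_for_sketch(mid, hi, grain, body);   // right half runs here
    left.get();                                  // join
}
```

Compare this with the loop above: instead of the master pushing N tasks one by one, each thief that steals a subrange immediately has half of it to offer to the next thief.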

But all of this makes me wonder if we made the high-level task scheduling API a little too attractive. I've not seen anyone do anything like that with the raw task API … I guess it was scary enough for its own good ;).
