Assume you have N tasks to complete, all identically structured in terms of input and output data. If your data is stored in arrays, the simplest programming structure is to perform the calculations in a loop that iterates through the input arrays and writes results into the corresponding output array locations. A multithreaded program that performs these tasks could divide the tasks into equal groups and assign each group to a single thread. If each array index task takes about the same time to complete, and there are many more tasks than processors, then this will be a fairly efficient threading model.
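The static division described above can be sketched in a few lines. This is a minimal illustration, not code from the original article: `task` is a hypothetical per-element function standing in for the real per-index work, and each thread gets one contiguous block of indexes.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-element task; any pure function of the input will do.
int task(int x) { return x * x; }

// Static partition: divide the N index tasks into equal contiguous
// blocks, one block per thread.
void run_static(const std::vector<int>& in, std::vector<int>& out,
                unsigned num_threads) {
    const std::size_t n = in.size();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Integer arithmetic spreads any remainder across the blocks.
        std::size_t begin = n * t / num_threads;
        std::size_t end   = n * (t + 1) / num_threads;
        workers.emplace_back([&in, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = task(in[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

When every index takes roughly the same time and N greatly exceeds the thread count, this simple scheme keeps all processors busy.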
But what if the time to complete each array index task is highly variable, and cannot be predicted in advance? This condition adds significant complexity to the problem if your goal is to efficiently utilize the available power on a multiprocessor system. If you take the simple path described above, where each thread is assigned a range of indexes, some of the threads may complete their work long before others, leaving the system mostly idle while the program waits for the last thread(s) to complete.
Certainly there are ways to deal with this situation using traditional threads. You can create a queue of index tasks and assign a single task to each thread; then, each time a thread signals completion, lock the queue, grab the next index on behalf of the idle thread, unlock the queue, set the thread running on the new index task, and await the next completion notification. This type of scheduling will certainly work, but it is far more cumbersome (and complex in terms of programming) than simply assigning each thread an index subrange.
TBB's "task stealing" scheduling
Having worked on lots of problems of the second type, where scheduling has to be implemented if you're going to efficiently utilize your available processing power, I have from the start been very interested in TBB's "task stealing" scheduler strategy. Here's what the TBB Reference manual says (Section 8.1):
The scheduler employs task stealing. Each thread keeps a "ready pool" of tasks that are ready to run. The ready pool is structured as an array of lists of task, where the list for the ith element corresponds to the tasks at level i in the tree. The lists are manipulated in last-in first-out order. A task at level i spawns child tasks at level i+1. A thread pulls tasks from the deepest non-empty list in the array. If there are no non-empty lists, the thread randomly steals a task from the shallowest list of another thread. A thread also implicitly steals if it completes the last child, in which case it starts executing the task that was waiting on the children.
This describes a much more sophisticated scheduling capability than my simple case requires, where loop indices merely take different amounts of time to complete. But we can see that, in my example case, the TBB scheduler should attain a level of efficiency at least similar to what my manually-created scheduling queue would attain.
So, if the performance is merely "similar," why would I choose to use TBB rather than create my own thread scheduler? Here are three reasons:
- Development time: a few lines of TBB code are required, versus perhaps 40 or 50 lines of traditional thread coding, including mutex locks for the queue, thread waits, etc., for my custom scheduler
- Reliability: TBB is well-tested software; my custom scheduler will require significant (and difficult) debugging
- Convenience: once a reliable wheel has been invented, I have no desire to continue spending my time constructing my own piecemeal versions, starting from the base materials, each time I need a "wheel" for a new project. When a "FILL" command became available to color the interior of a rectangle on the screen, I stopped stuffing individual pixels. I look at threading the same way.
Task scheduling complexity
Near the end of Chapter 1 of Intel Threading Building Blocks, James Reinders writes:
A number of concepts are fundamental to making the parallelism model of Threading Building Blocks intuitive. Most fundamental is the reliance on breaking problems up recursively as required to get to the right level of parallel tasks. It turns out that this works much better than the more obvious static division of work. It also fits perfectly with the use of task stealing instead of a global task queue. This is a critical design decision that avoids using a global resource as important as a task queue, which would limit scalability.
Clearly, a process that automates finding "the right level of parallel tasks," and does so successfully, has enormous benefits for developers faced with the task of multithreading existing applications, or even new ones. Selecting the level at which you will apply multithreading is a big decision. For existing code, the level at which you choose to multithread has a drastic effect on the code changes that must be made (different sets of variables will be accessed by concurrent threads, and so on).
My simple top-level loop queue may work pretty well today on the computer for which I designed my multithreading strategy, but it hard-wires a specific limit on scalability: because I'm queuing the individual top-level index tasks, if someday my app runs on a system with more processors than top-level tasks, my queue will leave processors idle. In addition, on that system the program will take at least as long to complete as the longest individual loop index task. Overall processor use could be a tiny fraction of what's actually available.
TBB promises to do much better than this, through the methods James describes in the excerpt. I've already started experimenting with the scheduling and task-stealing aspects of TBB, and I'll have some preliminary results for you shortly!