This article introduces tasking as a general approach for performance scaling on multi-core processors. An example Task Manager is shown, backed by 3 scheduling APIs: Intel® Threading Building Blocks, Microsoft Concurrency Runtime, and a custom Windows* Threads based scheduler. The performance of these Task Managers is examined on multi-core processors running computations from two game development samples: Morphological Anti-Aliasing and Skeletal Animation [1,2]. All three back ends perform comparably, demonstrating that tasking provides a performance benefit not dependent on the use of a particular API in the scheduler.
IntroductionSeveral programming approaches have been developed to take advantage of the multiple cores in modern processors. Functional parallelism is an approach that splits a program into separate, mostly independent, processing units for each subsystem or type of computation. This approach is of limited effectiveness because there is an upper limit to the number of separate units a program can be split into. As an example: The primary systems in a game engine-such as rendering, physics, AI, input, and audio-can all be separated into individual threads. However, this adds extra complexity when communicating between systems and also will not evenly distribute the work between the cores.
An alternative approach is a form of data parallelism called fork and join. With fork and join, the program runs serially until it reaches a point where it can exploit the extra cores to run an algorithm concurrently. This approach will scale with the number of cores in the system during the concurrent segments, but performance will not improve during the serial portions. A game might use this approach to update particle systems or animations. But since many programs and games have limited fork and join opportunities, this approach alone will provide limited performance gains.
In attempt to solve these problems, the tasking approach was developed. As with fork and join, tasking scales to the number of cores but can use the extra cores more frequently than an otherwise serial program. To accomplish this, the program is split into tasks that are given to a scheduler to be executed as the extra cores have time. After the tasks have been submitted to the scheduler, the thread that submitted them is free to continue with the work it was previously working on or it can yield itself to the scheduler to be used to help with the work. Since not all of the tasks are going to be completely independent of each other synchronization points are needed. To help manage the complexities, many developers write a simple Task Manager to provide a higher level interface than working directly with the threads. This article explores the performance of a simple Task Manager written with Intel® Threading Building Blocks (TBB) as the back end scheduler. For performance comparison and to show that the gross performance benefits of tasking are not dependent on a specific scheduler implementation, alternate back ends were written with Microsoft’s Concurrency Runtime (ConcRT) and a custom scheduler using the Windows Threading API.
Task Manager Interface
The interface for the Task Manager is the program’s only interaction with the tasks and the scheduler. The TaskMgr class is responsible for the creation and provides functionality to wait on a given task set for completion. The interface is built around the idea that tasks will be submitted with other similar tasks and those tasks should be tracked as a set.
The program supplies a function pointer, data pointer, and the number of tasks in the set, and any dependencies to the TaskMgr. The function that the program supplies will take the data pointer, the ID of the thread executing the task, the task ID, the total number of tasks in the set, and any task sets this one requires to be completed before it executes. The thread and task IDs can allow access to data specific to the task or thread without the need for locks. The thread ID, for example, can be used to access a thread specific rendering context or to keep track of thread specific statistics. The task ID can be used to access the specific data needed for the task. The code for the Intel® sample of Multi-Threaded Animation uses the task ID to determine which model to update. The dependencies are needed in situations like the Morphological Anti-Aliasing sample where each task set needs data from the set before it. 
The TaskMgr gives the program a handle to a task set when it is created. The handle is used by the program to determine if a task set has been completed by waiting on it. By waiting on the task set to be completed, the thread yields itself to the scheduler to execute tasks until the set is completed.
Task Manager with Intel® Threading Building Blocks
Intel® Threading Building Blocks uses an inheritance based interface for creating and submitting tasks. All tasks inherit from the tbb::task class. This class holds all of the bookkeeping information TBB needs such as the current state of the task and what thread it is currently assigned to. Through inheritance a program can store all of the information it needs in the tasks as well.
To manage the task sets, the TaskMgr class keeps track of TaskSet instances. The TaskSet class stores all of the data that was passed into the creation function as well as a list of dependent TaskSets and a count of the completed tasks. When the TaskSet is ready to submit the tasks to the scheduler it spawns GenericTask instances. GenericTasks store all of the information needed to execute an individual task: function pointer, data pointer, task ID, task set size, and a handle to the parent TaskSet. As tasks that are passed to TBB, GenericTasks derive from tbb::task. To use the TBB wait functionality to yield a thread to the scheduler and to be notified when all of the GenericTasks are complete, TaskSet must also be derived from tbb::task. By deriving TaskSet from tbb::task, the TaskSet instance can be set as a parent task for the GenericTasks that it spawns. As far as TBB is concerned, the TaskSet is a tbb::task that is not completed until the child GenericTasks are completed.
Task Manager with Microsoft Concurrency Runtime
Concurrency Runtime takes a different approach to creating tasks than TBB. ConcRT uses a functional interface. Tasks are submitted as function pointers, function objects, or lambda functions. The preferred method of creating and submitting tasks is through the use of task_group objects. The task_group takes functions and passes them to the scheduler and is used to by the program to query the state of the tasks. 
As the tasks are now submitted as functions instead of objects, the GenericTask class is no longer needed. Instead the GenericTask::Execute method is integrated into the TaskSet class and tasks are submitted as lambda functions that call the new method. Lambdas are anonymous functions, functions with only a body and no name. Because these functions are defined in the body of another function they can capture data and variables from the creating function. This functionality is used to capture both the pointer to the TaskSet and the Id of the task to be executed. This allows each task to have a unique ID without locking or atomic operations.
Unlike TBB, ConcRT does not provide an interface for executing custom code when a new worker thread is created. This causes a problem due to the user supplied task functions which require a unique thread context id in the range of [0,n), n being the number of hardware threads. To get around the inability to run code at the creation of the thread to get the ID, the code was placed in the TaskSet::ExecuteTask method.
Task Manager with Windows Threads
Unlike the other back ends that use TBB and ConcRT to handle the scheduling of tasks, the Windows Threads back end required a custom scheduler. From the ConcRT back end it was seen that there is no need for a class to operate on individual tasks; the TaskSets could handle that. That idea was taken one step further in the Windows Thread back end: The scheduler does not accept submissions as individual tasks, only task sets. As a result, because task sets cannot be broken up into individual tasks a single work queue is shared between all of the worker threads. The work queue has to be a data structure that has minimal overhead and can allow for multiple readers and writers. For this an array of TaskSetHandles was used. The readers and writers hold IDs that match the slots each thread is currently interacting with.
To reduce wait times, each thread has its own reader. The array is treated as a circular buffer; when a reader goes past the end, it starts the search again from the beginning of the array. When an incomplete task set is found, the thread will take a task by incrementing a counter inside of the task set and then execute the task. Once the TaskSet is completed, the thread will mark the slot as empty and continue to the next TaskSet.
The writers are also unique to each thread. To prevent threads from searching the entire array every time a new task set is added, the scheduler keeps track of the last location that was written to. The thread that is writing will then start from that location and search until it finds an empty slot. When an empty slot is found, an atomic compare exchange is done to ensure no other thread has written to that slot.
The task scheduler keeps track of the total number of task sets that are available to be executed. This count is incremented first when a new task set is added to the work queue. This ensures that threads will not go to sleep immediately before new work shows up. Threads wait on a semaphore when the task set counter reaches zero. The semaphore is released when a new task set is added to the queue, waking enough threads to do the work. An alternative approach would have been to suspend the threads when there was no work with a Windows event. The event would have required an extra lock to ensure that it was not set or reset incorrectly when new task sets were being added to the queue, so a semaphore was used instead.
Performance ComparisonsBy keeping the interface the same between all three Task Manager back ends, it is possible to easily replace one with another for comparison. Here are some comparisons between each of the Task Managers on two to six core systems on the Morphological Anti-Aliasing and Multi-Threaded Animation samples to show that the multi-core performance scaling is comparable. An Intel® CoreTM i7 Extreme 965 was used in the 2 and 4 core system, with 2 cores disabled, and an Intel® CoreTM i7 Extreme 980 was used in the 6 core system.
It is clear from these results that tasking is a valuable performance scaling technique on modern, multi-core processors. The animation example showed an average of 43% performance improvement when going from 2 to 4 cores between the 3 task managers and another 25% average increase when increasing the core count from 4 to 6. The MLAA example the average increase is 60% between 2 and 4 cores and 20% between 4 and 6 cores. The use of a particular threading API does not have a significant impact on the performance benefits of tasking.
The results also show that the performance gains vary depending on the workload. The animation sample is made up of many small tasks with no dependency chains. The small tasks show the overhead of the schedulers, most notable with TBB and ConcRT. Both of those schedulers use a work stealing algorithm to balance to workload between the worker threads. The work steal algorithms work best when handling larger, irregular tasks. MLAA on the other hand is limited by dependencies. There are four main task sets that are used: two for horizontally processing the image and two more for vertical processing it. The dependencies are necessary to get the correct results, but leave the worker threads idle between sets.
Updating SamplesThe Morphological Anti-Aliasing and Multi-Threaded Animation samples were written using the original TBB based TaskMgr API. Because the alternate back ends were written to use the same API switching between the managers at compile time became changing a single compile time constant. To help compare the different scheduler implementations, runtime changing of schedulers was implemented. This feature was added by making two small changes: The first is that the TaskMgr is no longer accessed by a static global variable; a pointer to the TaskMgr is used instead. The second change is all three TaskMgr classes’ implementations of an interface, allowing all three to be accessed through the pointer. This does add a small amount of overhead due to the virtual function calls into the API that were previously not there when switching the back end scheduler at compile time.
- Morphological Anti-Aliasing
- Multi-Threaded Animation
- Using Tasking to Scale Game Engine Systems
- Microsoft Concurrency Runtime
About the AuthorJames Lenell is a software engineer intern in the Platform Application Engineering division. He is currently finishing a Bachelor’s degree in Real Time Interactive Simulation at DigiPen Institute of Technology.