Maximizing Performance with Fine-Grained Parallelism

Published:03/06/2012   Last Updated:03/05/2012

Implementing functional decomposition in games to get the most performance from the processor

By Andrew Binstock

In my previous article, I described how to use functional decomposition to break up games and visual programming apps into tasks that could be allocated to independent threads. By assigning tasks to individual threads, you enable those tasks to run in parallel. This gives you better overall performance and, because it relieves the main thread from doing all the work, it permits faster and snappier visual presentation.

To review quickly, functional decomposition divides work into discrete tasks. In other words, it breaks (or decomposes) the work by the function being performed, hence the name functional decomposition. The other principal approach, called data decomposition, decomposes work by giving different chunks of data to individual threads. These threads generally perform the same operation on the data chunks and then have some way of integrating the results or the transformed data. I will discuss data decomposition in an upcoming article.

To understand how functional decomposition is handled, we need to examine how threading models work. Once we see the fundamental dynamics, we can look at how threading has traditionally been done and how new thinking, influenced in part by the proliferation of cores in today’s processors, enables us to use different approaches that are far more efficient.

The Life and Times of a Typical Thread

Most threading APIs support a model in which a thread is created, assigned a task, and then terminated when it completes its work. The task assignment is done by giving the thread a single function to perform. This function, informally referred to as the ThreadFunc, can, of course, call numerous other functions and tasks. Listing 1 shows a simple example that employs Pthreads, the threading library used in Linux* and several versions of UNIX*. (A well-implemented version of Pthreads is available for Windows* at

01 #include <stdio.h>
02 #include <stdlib.h>
03 #include <pthread.h>
05 void *PrintThreads ( void * );
07 #define NUM_THREADS 9
09 int main()
10 {
11    int i, ret;
12    pthread_t a_thread;
14    int thdNum [NUM_THREADS];  //thread numbers go here   
15    for ( i = 0; i < NUM_THREADS; i++ )
16        thdNum[i] = i;
18    for ( i = 0; i < NUM_THREADS; i++ )
19    {
20        ret = pthread_create (
21                &a_thread,
22                NULL,
23                PrintThreads,
24                (void *) &thdNum[i] );
26        if ( ret == 0 )
27            printf ( "Thread launched successfully\n" );
28    }
30    printf ( "Press any key to exit..." );
31    i = getchar();
32    return ( 0 );
33 }
35 // Make the threads print out their thread number.
37 void *PrintThreads ( void *num )
38 {
39    int i;
41    for ( i = 0; i < 3; i++ )
42        printf ( "Thread number is %d%d%d\n",
43            *((int*)num), *((int*)num),*((int*)num) );
45    return ( NULL );
46 }

Listing 1. A simple example of creating threads and assigning them work in Pthreads. (From “Programming with Hyper-Threading Technology,” courtesy of Intel Press.)

This code creates NUM_THREADS amount of threads (line 20) and passes each one the address of a ThreadFunc it should execute (line 23) and the address of the parameter the ThreadFunc takes (line 24). The ThreadFunc (lines 37-46) simply prints out the number of the thread several times and returns.

When the ThreadFunc returns, the threading API terminates it. When all the dependent threads are killed and the main thread (the one that launched the threads) is complete, the program exits.

If this example were of the functional decomposition for an interactive poker game, for example, the individual tasks would be created as separate ThreadFuncs each one assigned to a separate thread. So, if the game’s start-up tasks included elements from this list: initialize graphics layer, connect to player database, initialize poker engine, begin card shuffler and load card shoes, locate sound files, etc., these tasks would be assigned to their own threads and executed in parallel. This implementation would depend on additional threading capabilities, such as waiting on a thread to know it’s complete (so that you can proceed to dependent tasks), and mutual-exclusion functions that enable threads to share data or capabilities without causing unwanted side effects. (These capabilities are part of all standard threading APIs and won’t be discussed in this article.)

You need to know about several aspects of the ThreadFuncs that affect how you deploy them. The first is that there is no guarantee that the threads will execute in any particular sequence. The scheduling of threads is an operating-system operation over which developers can exercise very little influence. Mature operating systems are excellent at this task. They schedule threads in the way that in most cases will allow for excellent performance given the threading needs and the available hardware resources. Because the OS juggles execution to deliver the best performance, threads often execute out of order. And, in fact, if you were to run the above code you would likely find that the thread numbers printed to the console are not consecutive-even though the threads were launched in order.

When a thread is created, the OS allocates the thread its own stack. Windows*, for example, defaults to a minimum of 1MB of RAM assigned to each thread. The OS also allocates various control structures for the thread and creates entries in a kernel table for the thread to be managed by the scheduler. As a result, thread creation is an expensive process, so you typically don’t want to create threads willy-nilly. Rather, you tend to prefer to keep threads running a long time.

For example, let’s examine the card-shoe preparation thread. This thread uses a random-number generator to deal decks of cards that it then places into shoes for the dealer to use at the table. At game start-up, the thread creates enough card shoes for all the tables and puts some additional ones in a queue. At that point, you don’t want to let the thread die as you would need to create a new one every time a card shoe is needed. Instead, you put it to sleep and wake it up when a new shoe is needed. Threading APIs all have capabilities for waking threads either by signaling them or on a scheduled basis. In this case, the shoe thread would more likely be signaled, as shown in the pseudo code that follows.

Pseudo-Code for Card-Shoe Preparation Thread (Coarse-Grained)

For every table, prepare two card shoes.
Prepare a card shoe:
Set up random-number generator
Deal first deck of cards
Deal second deck of cards
Put both decks into card shoe
Repeat second shoe (similarly)
Place card shoes in shoe queue for each table

During game:
Wait for signal from a table to prepare more card shoes
Prepare, as above
Return to waiting state for table signal

The model I have just presented is referred to as “coarse-grained” parallelism. It’s well-proven and works capably. However, it has some important limitations that can deprive an application of some of the performance benefits of parallelization. Let’s look at these.

Limitations of Coarse-Grained Parallelism

Coarse-grained parallelism in which one thread is associated primarily with one task has drawbacks that have become more pressing in the era of multi- and many-core computers. The central problem that appears in several forms is that it is easy for this model to use processor resources poorly.

For example, how many threads should be created? If you follow a pure functional decomposition model, you create as many threads as there are tasks to be done at one time. In the previous example, I enumerated five different start-up tasks to be performed by five threads (in addition to the main thread, which always performs the graphics rendering). So, this would be a six-thread application. On a modern processor such as the Intel® Core™ i7 chip, which has eight processing pipelines (four cores with two pipelines each), this application would leave two pipelines unused-or 25% of the available silicon. The same thing would occur if this app were running on an Intel® Xeon® dual-processor quad-core server.

Another limitation of coarse-grained parallelism is the poor use of the processor pipeline when a thread is not doing any work. Suppose we were running our game on a six-pipeline machine, so that all pipelines were in use. What happens when the card-shoe thread is asleep, waiting to be awakened to deal more cards? In one word - nothing. One pipeline lies fallow, while another could be processing at 100% capacity.

If this app were running on a four-pipeline machine, the OS scheduler would swap out the waiting but dormant thread and swap in a thread that has processing needs. But since we have carefully run the app on a system with exactly as many threads as we have pipelines, there is nothing to swap in and so the pipeline simply waits. There is no concept of workload balancing.

What we need is a mechanism that uses as many pipelines as are available in the hardware and that employs all pipelines to best advantage, such as when a thread is dormant. Enter fine-grained parallelism.

Fine-Grained Parallelism

Fine-grained parallelism approaches threading, not at the thread level, but at the task level. Returning to our previous example, the card-shoe thread comprises two principal tasks: capture/record a request for a new card shoe, and prepare the new shoe. That’s all it does - listen and deal.

Now, if those tasks were assigned to any generic thread they could be performed on any given processor pipeline and, in fact, could be handled when one pipeline had free time available to it because its existing thread went dormant.

The question is how to create a pool of threads that can accept tasks intermittently? The solution lies in a structure known as a thread pool. Most threading APIs have some concept of a thread pool although the design and quality of implementation varies. Typically they work as follows.

When the thread pool is started up by the program, it creates a certain number of so-called “worker” threads. These threads are just regular threads, except that when they complete their assigned task, the thread pool does not let them die off. Rather, it keeps them alive by putting them into a dormant state until there is new work to be done. These threads, in other words, stay alive as long as the thread pool is alive, which is generally until the program shuts down. This approach avoids the cost of creating and terminating threads, which is invariably an expensive process.

As shown in Figure 1, the front end of the pool consists of a queue into which tasks are added by the app. These tasks are then assigned available worker threads, which complete them and return for more work. Good thread pools are designed to allocate the right number of threads and maximize the use of the available processing pipelines. There is no question of leaving pipelines unused. And, if there is work sitting in the queue while a thread is waiting on a slow process or an event, the waiting thread is swapped out and the work is assigned to a thread that can start working right away. In this way, fine-grained parallelism enables load balancing across processor pipelines.

Figure 1. A thread queue showing the work queue one the left and a task being assigned to an idle thread

The thread pool, as described so far, is a model that exists in many of the major threading APIs; Java*, Windows* (Win32), and .NET. However, it does not exist in Pthreads. If you use Pthreads, you can find some implementations of thread pools built from Pthreads primitives on the Web. But you should first look at a similar concept made available at no cost in Intel® Threading Building Blocks (Intel® TBB), (/en-us/articles/intel-tbb/), which includes the full source code.

Intel® TBB uses a slightly different approach that it refers to as a thread’s ready pool. This is a TBB primitive that is usually abstracted in the library by higher-level constructs. It consists of a scheduler that assigns tasks to specific task queues that are associated with individual processor pipelines. The scheduler uses a complex algorithm for task assignment that takes into account factors such as the nature of the task, cache availability of data, and so forth. TBB is a C++ library that contains far more than just this solution. It also contains high-speed threaded data structures that can be very helpful. If you’re using C++, you should consider the TBB solution even if your thread API includes thread pools, as it provides convenience features that most APIs simply do not include.

Using Fine-Grained Parallelism

To make effective use of fine-grained parallelism, it becomes important to think of processing not so much in terms of long functional threads but smaller-sized tasks that can be handed to a thread pool (or work scheduler as in TBB). [From here on, when I refer to thread pool I am including the Intel® TBB implementation of the concept]. So, returning to our example of the card shoe, there is now just one task; preparing a card shoe. The main program no longer sends a signal that the card-shoe thread listens for, rather the program simply puts a card-shoe task into the queue of the thread pool. As shown in the following pseudo-code, this operation is now a series of one-off tasks, rather than an intermittent communication with a long-running thread.

Pseudo-Code for Card-Shoe Preparation Thread (Fine-Grained)

The best tasks for this kind of parallelism are ones, like the card shoe, that depend very little or not at all on other threads. The more of these tasks you can package as standalone items, the better your performance will be. Tasks that depend on other tasks or other threads are more complex and should be created only once you’re comfortable with how mutual exclusion works.

Under ideal circumstances, the workload from the tasks will keep the thread pool “always active.”  In your initial parallel coding, you should monitor the work queue. If it’s nearly always empty, chances are that you have not parallelized enough of the operations. [The opposite is not always true: a backlog of work in the queue at all times shows that the processor is being used fully by the application. However, if the queue is severely backed up, this would suggest the need to add more processing power. Fortunately, for most apps that use a thread pool, that means migrating the app to a larger hardware platform. The thread pool should automatically adjust for the extra processor pipelines without any change to the code.]

Fine-grained functional decomposition is one of the best approaches to extracting the excellent performance from your hardware. Look for opportunities to use it and you will find your games and even your business apps will speed along like never before.

Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for InfoWorld and a columnist for SD Times. He is the author or co-author of several books on programming, including “Programming With Hyper-Threading Technology” from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project, Platypus, ( He can be reached through his blog at

Java is a trademark of Sun Microsystems, Inc. in the United States, other countries, or both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Intel and Xeon are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804