How to use functional decomposition to get better performance by using all your processor cores-while making only small changes to your existing code.
Games and other software that present an extensive graphical or multimedia experience to users are often constrained by performance issues. Because of this, techniques that can maximize the performance of the hardware layer are of special importance in delivering the best possible user experience. While many game developers know this intuitively, they often shrink from the one technique that provides the greatest performance lift: parallel processing. By executing tasks in parallel, software can do more work faster. On multi-core processors, such as dual- and quad-core chips available from Intel today, much of the processor lies untapped if the software it’s running is written as one single thread. However, if that software is broken into chunks that can run across multiple threads, then all the processor cores can contribute to performance.
As yet, users have not complained much about software that sacrifices performance because it fails to use all available hardware resources. But that reticence is only a function of the fact that quad-core chips are a recent addition to desktop PCs and so users’ expectations have not been reset. As Intel delivers chips with more than four cores (what are called many-core processors), software that loads only one or even two cores while leaving the others unused will be abandoned in favor of better-performing products whose speed derives from their use of all available cores. Hence, the use of parallel processing is a smart move for performance reasons now; and it will become a basic requirement during the next few years.
Writing software that uses only a single thread means that all tasks have to be done sequentially. For example, suppose you’ve developing a blackjack game. When the dealer’s card shoe is empty, the software must shuffle the required number of decks. While the cards are being shuffled, the players are waiting. A better user experience would be to provide the dealer with a new pre-shuffled shoe of cards as soon as the previous shoe is exhausted. In this way, play can continue with no delay. To do prepare a card shoe while the game is in progress, you would write the program using threads. Whenever a new shoe is put into service, a separate thread is launched to load up a new shoe of cards. In this manner, the game and the shoe-filling happen in parallel.
This example exemplifies one of the two principal techniques for determining what work should be broken off to be handled by a thread. In the blackjack game, we decomposed the work into two parts: one part that consists of the main game thread, and another that consists of preparing the card shoe. This division is known as functional decomposition; that is, we define the work of new threads based on tasks. (The other approach, data decomposition, assigns tasks on the basis of some fraction of the work to be done. It’s used in cases where a similar operation has to be done to a large amount of data, and it gives each thread responsibility for a portion of that data. I will cover this data decomposition in an upcoming article on this portal.)
The card-shoe example of functional decomposition is typical of user-facing software. One thread controls the main program loop and the rendering of graphics, while various worker threads handle tasks that can be broken off cleanly. For example, many word processors use one thread to handle spell checking, another one do pagination, and so forth. This enables the user to work on a document while it’s being paginated in background.
In software such as games, there are many opportunities for functional decomposition. In particular, multiple threads can be used to handle the startup delay. Today, many games have a noticeable pause at startup that is often covered over by distracting the user with a splash screen and video effects. However, in a parallel programming model, many of the individual start up tasks are stand-alone chores and could be done in parallel to reduce the start-up delay. For example, reading data from disk, getting user preferences, making network connections, and so forth are tasks that can easily be done in parallel.
Another area in which functional decomposition is particularly helpful in game development is anticipating tasks-such as the example of preparing the card shoe. However, there are many, many other examples varying from the simple to the complex. As many developers know, movement prediction is an important part of providing a smooth visual experience to the user. In many cases, the prediction is easy to anticipate such as in a car racing game, for example, preparing to render objects that are known to be on the course and are about to come into view. In other interactive games, the ability to precompute actions and visual effects based on the possibility that the user will turn left or right, say, at a given crossroads are obvious cases.
You might wonder in the last example how you should guess whether the player will turn left or right. But when you have lots of cores at your disposal, you don’t have to guess: you do the calculations for both possibilities-each option being handled by a separate thread running on a separate core. Then, if the user turns left, you use the results of the left-turning thread and discard those from the right-turning thread. And vice-versa.
This concept of discardable computing cycles might seem a bit foreign if you’re accustomed to the single-thread model, because in a purely sequential approach, performance requirements steer us away from computation whose results we don’t use. However, today’s PCs have lots of spare processing pipelines. Consider that the typical quad-core chip with four pipelines will become the de facto silicon platform during the next year. And in November 2008, Intel began shipping its new i7 desktop processors that have four cores, each of which can run two threads simultaneously. The chip, therefore, provides eight processing pipelines. Moreover, Intel has made clear that it anticipates shipping processors with many more pipelines in the years ahead.
With so many pipelines available, speculative execution of instructions is not a detriment but a performance benefit and one you should begin exploiting. (Silicon note: under the hood, Intel processors have been performing speculative execution of instructions for more than a decade. Inside the Pentium, Core, and i7 processors is logic that examines the sequence of instructions that are scheduled for execution. The logic searches for branch instructions and tries to determine whether the code will actually take the branch or not. If it expects the code to take the branch, execution units in the processor begin pre-executing instructions along that branch. If the processor guessed correctly and the branch is taken, then the pre-executed instructions save time. However, if the branch is not taken, the pre-executed instructions are simply discarded.)
One nice thing about threads is that you do not need to rewrite your code in order to use them. You can add them incrementally and begin enjoying their performance benefits right away. The simplest way to find easy-to-thread functionally decomposable tasks is to look for actions that fulfill the following guidelines. The actions:
The refilling of the card shoe is an example that fulfills these requirements. It’s not part of the main thread; it’s conceptually one discrete task; it can be performed entirely independently of the other threads; and it shares no data items with the other running threads. (Note at some point, the filled shoe is handed over as a data item to the thread doing the dealing.)
Since all actions must integrate with some portion of the larger software’s operation, there will be points of contact with other threads. All the complexity that you might have heard of in regards to parallel programming derive from those points of contact. Essentially, there are two principal issues: one thread waiting on another thread and how to handle two threads changing a data item at the same time. Both areas have problems that can be subtle and difficult to debug, so unless you’re already familiar with parallel programming, it’s best to initially break out tasks that do not wait on other threads and that share few data points with them. The card shoe is an excellent example. Let’s see how it looks in rough pseudo-code.
I presume you have a pre-written routine that fills the card shoe. We create a thread that is passed a card shoe object to fill. We will also use a flag (that is, a boolean variable) that tells us when a card shoe is needed. The typical cycle looks like this:
When the main dealing program needs a new card shoe, it grabs the filled card shoe and sets the flag to true, which will awaken the worker thread to fill a new shoe. If the main dealing program needs a shoe but none is ready, it waits until the flag is reset to false, as this will tell it a shoe is now available.
In this example, one variable could be used by both threads at the same time: the flag. Hence, the operations to this flag need to use a locking mechanism so that only one thread can change it at a time. All threading APIs provide a lock called a mutex, which prohibits more than one thread at a time from executing code (here, the reset of the flag). All APIs provide other locking mechanisms as well.
If you were writing a casino game, you would need more than one shoe in reserve, as several tables could need shoes at the same time. In such a case, you’d likely maintain a queue of filled card shoes. Tables would remove the shoes from the head of the queue, while a producer thread would fill card shoes and put them in at the tail end of the queue. This arrangement in which one thread produces data items that it puts into a queue for consumer threads to use is a common pattern in parallel programming and it is frequently referred to simply as “producer-consumer.”
Producer-consumer situations are rife software and they are easy to segregate into separate threads. A good example comes from reading data from slow media or slow network connections. For example, many games at installation must read data from a CD or DVD. Optical media is notoriously slow, so you want to be working on the data while the read is going on. Producer-consumer comes to the rescue: one thread reads blocks of data from a file and puts them in a queue. A second thread will then retrieve the blocks and begin processing them. In this way, little time is lost.
Other possibilities for functional decomposition in a large poker game is to use one thread for each player. This boosts scalability. Now, expanding the number of players or tables scales well. Operating systems are good at balancing the needs of dozens, even hundreds of threads simultaneously. And by using parallel programming, you relieve the main thread of the overhead of handling each new player. Without threading, the design could not scale at all and each additional player would slow the game’s performance.
As I mentioned earlier, 3D games likewise have many opportunities for functional decomposition: start-up tasks, discrete subtasks, and especially prediction of motion and of possible actions that players or animated sprites might make.
There are numerous threading APIs available on x86 processors today. All of them provide the fundamentals of threads: thread creation, waking a thread, putting a thread to sleep, killing a thread, having a thread wait for the change in value of a variable, and various locking mechanisms. Most published threading APIs offer a superset of these capabilities.
For games written in C or C++, there are numerous interfaces. Native Windows thread APIs are among the most wide-ranging available. MSDN is a great place to start. The Windows APIs, however, are not portable, so you might prefer to use Pthreads, which runs on Windows and is the native thread API for Linux and most versions of UNIX-based OSs. The Pthreads implementation for native Windows is available at http://sourceware.org/pthreads-win32/ and it works well. The .NET languages all have a Windows API, as does Java. The one gaming language that lacks threads is Lua. If you’re writing in Lua, you won’t be able to use its thread implementation to do the tasks described above. Lua threads are really co-routines, a similar concept but with an implementation that is crucially different and does not allow processing in parallel.
My next column will continue examining functional decomposition for visual software in greater depth. We’ll look at some source code and examine how to decompose problems to deliver the maximum benefits to the user by employing fine-grained multithreading.
Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for InfoWorld and a columnist for SD Times. He is the author or co-author of several books on programming, including “Programming With Hyper-Threading Technology” from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project, Platypus. He can be reached through his blog at http://binstock.blogspot.com
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804