by Will Damon
Multiprocessor machines are becoming more commonplace on the desktop with the introduction of Hyper-Threading Technology to the Intel® Pentium® 4 processor. While multithreaded programming is not new to games, the widespread adoption of multiprocessor machines means that programmers can seriously consider the threaded performance of algorithms, and software system design. This paper aims to introduce the two domains of threading methodology as they apply to game development, and address the most common pitfalls of game programming for Hyper-Threading Technology (HT) enabled systems.
Parallelism, the simultaneous processing of different data or tasks, is achieved through two models: data- and functional- decomposition (also known as data and task parallelism). As the names imply, these two models represent very different yet complementary methods of applying multiple threads to achieve a higher level of performance within a single process (like a game). Each parallel model has its own section below; data decomposition is covered first, followed by functional decomposition.
Data decomposition means that the same independent operation is applied to different data. Compute-intensive loops, like those found in game physics simulation solvers, for example, are good candidates for data parallelization. Each game has different situations in which data parallelism applies, but just for illustration, here is a short list that is by no means exhaustive:
- Generating procedural terrain, textures, or geometry
- Physics solvers (for collision, behavior, and even sound)
- Skinning and/or animation
- Various artificial intelligence algorithms
- Path finding
- Audio processing
Many other applications are also possible. The trick is determining when to use threads to implement parallel computation for higher performance.
The easiest way to find an area in a game that might benefit from data decomposition is to profile the game on a multiprocessor or HT-enabled system. The Intel® VTune™ Performance Analyzer is great for this because it tells exactly how much time is spent idle, as well as how much time running threads are consuming of each logical processor. A profile of a game on an HT-enabled system will give a clear indication of where the whole system is spending time, and how well the workload is balanced. Focusing on the heaviest time-consuming modules will point you to which modules may benefit from data decomposition, not to mention highlighting serial optimization opportunities, too. While threading can improve performance, additional wins may be achieved with other Intel® NetBurst™ microarchitecture optimizations. See item  in the Additional Resources section for more information on NetBurst microarchitecture optimizations.
Functional decomposition works at a higher level than data decomposition. In functional decomposition, also known as task-parallelism, independent work is mapped to asynchronous threads. This model applies to games particularly well because there are typically so many different subsystems executing simultaneously. Some professional game developers have been threading their games for years, spawning threads to handle various subsystems, well before the introduction of Hyper-Threading Technology on the desktop. The advantage of functionally decomposing a game today, however, reaches beyond the scope of programmer convenience. Threads can run concurrently on HT-enabled hardware, yielding higher performance and more responsiveness; something which end users will appreciate. Additionally, having one ready-to-run thread per logical processor means that your game can consume idle processor cycles, again working towards higher performance. The number of ways in which you can separate subsystems is infinite; however, always keep in mind the general rule of thumb for multithreaded programming: make sure running threads conflict as little as possible!
A quick example to illustrate one possible approach to threading a game engine is to allow separate threads to handle the following subsystems:
- Asynchronous Streaming (I/O)
- Other subsystems (video playback, garbage collection, etc.)
Potentially, it could make sense to combine some of these together into a single thread, but the idea is the same - separate decoupled systems into asynchronous threads for higher performance and more responsiveness in multithreading desktop environments. Threading your engine may allow you to do some interesting things that could potentially degrade performance on a serial (single-threaded) model, like execute more complex AI or run larger physics simulations. Additionally, depending on your game, threading might allow you to more easily run various decoupled subsystems at different refresh rates (e.g. physics at 100Hz, graphics at 60 Hz), yielding a smoother experience in a richer game environment for your players.
Three Threading Models
Understanding parallelism and how it may be applied are important, but knowing how it is most effectively implemented is equally important. The next three sections each discuss one of the three approaches to achieve parallelism; they are ordered from least invasive to most invasive in terms of the code base and time investment.
The simplest means of parallelizing loops (data decomposition) is to let the compiler do all the work for you. The Intel C++ Compiler, version 7.0 and later, has support to analyze loops and parallelize those that it considers good candidates. The -Qparallel switch enables the feature, and following a few simple guidelines increases the likelihood that the compiler will identify a parallel loop.
- First, expose the trip count of the loop. This does not mean defining the trip count as a compiler directive. This means that the compiler must be able to determi ne whether the trip count varies during loop execution. For example, altering a for-loop counter inside the loop body could violate this first rule, and the compiler may not identify the loop as parallel.
- Next, avoid branching from the loop. Branching can have a similar effect to failing the first guideline (expose trip count), in which the compiler cannot determine whether the loop is a parallel candidate.
- The third guideline is to avoid referencing global data or procedure calls from within the loop body. The compiler cannot guarantee correct parallel execution on the volatile elements, so loops with such references may not generate parallel code. However, referencing a pure function, a function which has no side effects, does not disrupt parallelism.
Failure to adhere to these guidelines does not automatically disqualify a loop from automatic parallelization. Use the -Qpar_report3 switch to have the compiler generate a report on which loops were successfully parallelized and the dependencies that prevent parallelization of others. (-Qpar_report[n] will allow varying degrees of reporting.)
Compiler-directed Parallelism with OpenMP
While automatic parallelization is great, there are times when the developer can, or would like to direct the compiler what to thread. After all, you know much more about the overlying algorithm than the compiler could ever attempt to understand. The OpenMP standard allows the developer the kind of flexibility s/he desires, while limiting the time investment necessary to thread code. OpenMP works on an explicit fork/join method, so the programmer must specify the start and end of a parallel region.
The Intel C++ Compiler version 7.0 supports OpenMP, but not all C++ compilers at the time of this writing do so. Use the -Qopenmp switch to enable processing of the OpenMP directives or pragmas to generate threaded code. Without this switch, the Intel C++ Compiler will do the same thing as compilers that do not support OpenMP, and simply ignore the directives.
The cool thing about OpenMP is that it allows easy parallelization of loops or regions of your code without large-scale modifications. In fact, the original serial code is left largely intact. The following example that calculates a height field for a patch of terrain demonstrates:
In the above example, the loop to generate all the terrain data is threaded with three short lines of OpenMP directives. Efficient parallelization is achieved with very minimal programming effort, and the function speeds up (versus the serial version) by a measurable factor on a Pentium 4 processor with Hyper-Threading Technology.
Last, but not least, the developer can use the thread libraries (e.g. Win32 threads, or pthreads) to drive explicitly defined threads. Thread libraries are flexible enough to address both data and functional decomposition, but using them for data decomposition is invasive, time consuming, and error-prone. Computations must be separated into functions that can be mapped to threads, and the work must be manually divided among the threads. Additionally, explicit synchronization must be added in order to guarantee corre ct results. To illustrate, we revisit the example from the previous section, and this time we'll thread it using Win32 threads:
As you can see, there is a bit more code involved here to accomplish the same data decomposition as in the previous example. However, the benefit of the threading libraries is the powerful flexibility and direct programmer control they provide. Using the threaded libraries correctly is important. Notice here that we create and delete the thread handles in the same class method. Generally this is something you should avoid except in one-time initialization situations. Creating threads is expensive; therefore, if using the threading libraries, it is probably worthwhile to generate a thread pool during initialization, and set events to wake up threads to perform their tasks when necessary. When the threads finish their tasks, they can go back to a resting state, releasing hardware resources for other tasks. See items  and  in the Additional Resources section for more information on thread pools.
Coding for Hyper-Threading Technology
Whether you're developing for a high-end 4-way or 8-way Intel® Xeon® processor system or a Hyper-Threading Technology enabled Pentium® 4 processor-based desktop system, paying attention to how threads are executing is important. Employing techniques to achieve good load balancing, cache blocking, and general effective utilization of the hardware is very important on an HT-enabled processor because the two logical processors share common execution resources. This means that even though two threads are running concurrently, they are competing for some of the same hardware resources. Keep in mind some of the following common programming pitfalls when targeting a Hyper-Threading Technology system, and avoid hard-to-diagnose performance problems later on during the optimization cycle.
Make sure running threads conflict as little as possible!
For synchronization on variables that you know are going to be imminently available, a simple spin-wait loop may suffice. Be careful, though, because spin-waits can quickly become a hindrance on performance. A spin-wait locks up processor resources because it runs so efficiently, therefore stalling both logical processors. Inserting an assembly-level pause instruction can alleviate the detrimental effects of spin-waits, and looks like so:
The pause instruction hints to the processor that the loop is spin-wait, and inserts a momentary pause into execution to allow the processor resources to be freed up for the other logical processor. Another way to slow down the tight spin-wait is to put the waiting thread to sleep. Either use the ANSI standard sleep(), or the Win32 Sleep(). The argument for the ANSI standard sleep() is the number of seconds for the thread to sleep, while the Win32 Sleep() uses milliseconds. Use whichever makes more sense for your game.
Lastly, when you know the thread will be waiting for some indeterminate amount of time, the best way to synchronize is to tell the OS what the thread is doing. Under Windows* using the thread blocking APIs (e.g. WaitForSingleObject(), WaitForMultipleObjects()) will let you do just that.
In addition to all the explicit synchronization your code is aware of, be careful of implicit synchronization. For example, allocating memory off the heap will cause synchronization to occur. Avoid such thread synchronization by using the thread local stack allocation API TlsAlloc() or _alloca(). Also be careful of accessing global variables, as doing so may also cause synchronization.
64k Alias Conflicts
A 64K-aliasing conflict happens when a virtual memory address references a cache line that is modulo 64 kilobytes apart from another cache line that already resides in the L1 cache (see item  in the Additional Resources section). If the VTune™ Analyzer reports a 64K-aliasing event number increase of 5x or more on an HT-enabled system versus the same system with Hyper-Threading Technology disabled, then it is likely possible to achieve a meaningful speedup by alleviating some of the 64K-aliasing pressure. Generally two cases cause such an impact when Hyper-Threading Technology is enabled versus not:
- Thread stacks aligned on one and two megabyte boundaries
- Threads accessing data structures which happen to fall 64K (or modulo 64K) apart
Avoid these two common pitfalls by following a couple of guidelines provided by item  in the Additional Resources section and reinforced with further explanation here. First, try adjusting thread stack allocation thereby changing the relative addresses between offending accesses. This is an important guideline because the Win32 default for thread stack allocation is on 1MB boundaries, and thus has the potential to cause serious performance penalties on HT-enabled systems. In the case of interleaving or interspersed loads and stores from the same 64KB address space, try doing multiple loads then multiple stores. As far as data structures are concerned, there are a few things to try. One is to simply have your memory allocator allocate structures of 68KB instead of 64KB. Alternatively, your memory allocator might pad and return random offsets that are multiples of 128 bytes (for cache-friendliness and avoiding stalls due to unaligned loads). Finally, for multiple arrays, pad those that are multiple of 64KB in size, especially those that are accessed with the same index, so that they will not be allocated contiguously in memory. The simplest way to pad arrays is to artificially increase their sizes, or interleave their declarations with the declaration of other variables.
For more information about 64K-aliasing and how to avoid it see items  and  in the Additional Resources section.
Effective Cache Usage and Locality
Effective cache utilization is critical to top-end performance on both single- and multi-threading processors. Two common issues rela ted to the cache that occur with multithreaded programming that can be avoided by developers are false sharing and poor data locality.
False sharing can cause some serious performance degradation on both dual- or multi-processor and HT-enabled systems. False sharing happens when multiple threads own private data on the same cache block. For Pentium 4 processors and Xeon processors, a cache block is effectively 128-bytes. Determining whether false sharing is an issue is easy with a profile from the VTune Analyzer, and fixing the problem is as easy as padding out data structures. Perhaps an even better fix is to structure data in a cache-friendly manner, on or at 128-byte boundaries. Note that these recommendations are very complimentary to those for avoiding 64K-aliasing, so watching out for one pitfall actually helps you prevent two or more! See item  in the Additional Resources section for a more in-depth explanation of false sharing.
The Cache Blocking Technique
The cache blocking technique restructures loops with frequent iterations over large data arrays by sub-dividing the large array into smaller blocks, or tiles. Each data element in the array is reused within the data block, such that the block of data fits within the data cache, before operating on the next block or tile. This technique is widely used in linear algebra, and is a common transformation applied by compilers and game programmers. Because HT-enabled processors share the L2 cache, the most effective target block size (or the sweet spot) is be somewhere around ~133KB per thread, or roughly a quarter of the full L2 cache size. See item  in the Additional Resources section for further details.
Write-Combining Store Buffers
Write-Combining Store Buffers are also a shared resource on HT-enabled systems. The Pentium 4 processor has six WC buffers, and in the past it has been recommended that programmers target four of the six, as the OS and other processes are likely consuming at least two. An example of targeting four WC buffers is writing to four addresses that fall on unique cache lines. With the WC buffers being shared, the recommendation for processors with HT technology is to only target two WC buffers per running thread. Again, the VTune Analyzer can help you determine whether WC buffer evictions are impacting performance. The solution to avoiding unwanted WC buffer evictions is to break apart a loop such that it only targets the recommended number of WC buffers.
While you can keep an eye out for some issues and common pitfalls when developing code, you will find yourself debugging and/or fine-tuning some part of your game. Intel offers a full suite of tools to help you both reduce development time and increase your productivity.
- Intel® C++ Compiler
- The VTune™ Performance Analyzer
- Intel® Thread Checker
- Thread Profiler
- Parallel Libraries
In this exhaustive introduction we visited the models to achieve parallelism in your code, and discussed some of the common pitfalls and respective workarounds for developing threaded code for Hyper-Threading processors. We wrapped up our discussion with a very brief introduction to the software development tools provided by Intel. I encourage you to try out the tools right away and start experimenting with what is possible on today's advanced high-end desktop systems. Be aggressive with the application of the tools and their features; and profile often!
- Threading Methodology: Principles and Practice
- Gerber, Richard. Optimizing for Hyper-Threading Technology Using Intel® VTune™ Performance Analyzer
- Kerly, Phil. Cache Blocking Technique on Hyper-Threading Technology Enabled Processors
- Parallel Programming and Multi-Core Developer Community
- VTune™ Performance Analyzer
- Intel® 64 and IA-32 Architectures Optimization Reference Manual
About the Author
Will Damon was a Technical Marketing Engineer within Intel's Software Solutions Group. He has a bachelor's degree in Computer Science from Virginia Polytechnic Institute and State University*, where he graduated with honors. He has been with Intel for over a year, helping game developers enable their titles to achieve the highest performance possible on Intel® Pentium® 4 processor-based PCs. He welcomes email regarding optimization, mathematics, physics, artificial intelligence, or anything else related to real-time 3D graphics, and gaming.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.