by Jeff Andrews
Take advantage of Intel® dual-core processors in your game development and learn how Intel® Compilers can optimize your code.
Even with the benefits provided by Hyper-Threading Technology (HT Technology), threading game engines for performance has usually been avoided because of its potential complexity. With the advent of the Intel® Pentium® D processor with dual-core capabilities and the Intel® Pentium® 4 processor Extreme Edition with dual-core and HT Technology capabilities, for a total of four logical processors, the potential to nearly double the processing power of a game (or even more) is now at hand. To realize this impressive speed increase, however, games need to be threaded for performance.
It is natural to feel wary of threading due to the potential complexities involved. This is especially true for those not familiar with programming for servers, since the desktop area has only recently seen a need to thread for performance. Games, particularly with their complex data structures and functional interdependence, seem almost impossible to thread. The reason games and other desktop applications seem difficult to thread, however, is that most weren't designed with threading in mind.
This paper is designed to inform and persuade game architects and developers to develop for threading. We hope you will start threading games currently under development to take advantage of the two and four processors currently available with Intel® products. Thinking ahead, future processor products with multiple available processors in one package will create an even greater demand for games to be threaded and provide a greater opportunity to do more with the available processing power.
We define "threading" as running different sections of code simultaneously on different processors. The diagram below shows the difference between single- and dual-threaded processes. The single-threaded code executes serially throughout its run. Notice that the dual-threaded code has two sections that execute concurrently during its run. By running these two parallelizable sections of code at the same time, the entire run now takes less time to execute (notice the dashed area).
Figure 1 shows an imbalance in the diagram: the two blocks of parallelized code do not complete their executions at the same time. They do not finish simultaneously because they do not have an equal amount of work to accomplish. This is important to note, as balancing threads properly is crucial when threading a game.
The barrier in the diagram signifies a "wait" until all threads have finished their execution. Waiting is crucial, because the parallelized code must not overlap its execution with the serial code. (In this example, the serial code depends on the results of the parallelized section.)
There are two types of parallelism models: data decomposition and functional decomposition. In a nutshell, data decomposition is threading a single function to operate on two or more blocks of data at the same time. Functional decomposition, on the other hand, has to do with placing different functional blocks of code on separate threads in order to take advantage of parallelism.
This paper deals only with functional decomposition because (a) it applies to more games than the data decomposition model does, and (b) it is more difficult to implement. For an in-depth discussion of the two parallelism models in games, refer to the "Additional Resources" section of this paper.
Before the actual threading of your game can begin, there are a few steps that will help your threaded code run optimally. These steps are necessary when threading a game with the functional decomposition model.
Determine Functional Blocks
First, determine what the different functional blocks in your game are. This should be fairly easy, because a typical game design already revolves around a document that lays out the functional blocks. This step also includes determining the interaction between the blocks and the program execution order, and it should result in a block flow diagram.
Note that this block flow diagram probably will need to be redesigned based on the outcome of the steps in the next two sections. These changes will occur during the design and implementation phases of your project.
Determine Dependencies Between Blocks
Once you have listed the functional blocks and the interconnections between them, determine the data dependencies between all the blocks. This information helps you decide whether a grouping of data should be duplicated or whether access to it should be synchronized. Several factors feed into this decision, but the primary one is the size of the data: depending on it, it may be faster either to make a copy of the data in the serial section of the code or to synchronize access to the data in the parallel section.
Determine Thread Division
Different blocks of code can be matched up based on data dependency and processor usage, so decide carefully which blocks of code to place on which thread. Place functions in each of the two or four threads so that the threads complete their execution at roughly the same time. That way, no thread wastes time waiting too long for another to finish.
The other thing to consider is whether the chosen execution balance will cause your program to spend more time making data copies or synchronizing data access. Functions that access the same data should be placed on the same thread if possible; they can then execute sequentially and avoid the need for synchronization. The method you choose depends on how the data is accessed: if the data is accessed sequentially but the processor balance favors having the two functions on separate threads, it is better to synchronize access.
Unfortunately, there is not a one-size-fits-all solution when it comes to determining the balance. Some experimentation is necessary to get the best possible combination. Remember that even if the threading implementation isn’t perfect, it’s still better than letting a processor go to waste.
In thinking about balancing performance on multiple processors, it is important not to get too attached to a specific number of processors. While you are optimizing the code for threading, you should also design it for an unknown number of processors. Some of the chips we are developing support dozens of processors, so there is almost no way to predict how many processors your code will be able to utilize in the future. Design the parallelization so that functions are queued up and executed as processors become available. That way the code will support one to n processors in a system, where n is the number of tasks you manage to parallelize.
Once you have determined the blocks to be processed in parallel, you are ready to create the application's threads. There are several methods you can use to thread the code: APIs provided by the target OS, a cross-platform API, or OpenMP*, which provides built-in threading support within a compiler. The simplest of these is OpenMP, which takes care of thread creation and synchronization for you. Certain compilers, such as the Intel® C++ Compiler, provide this support.
The OpenMP sections directive is a quick way to thread different functional blocks. The section of code in Figure 4 shows how easy it is to break up execution of four different functions into two threads. Here it takes only three additional lines of code (not counting enclosing braces) to thread these functions.
While this is a convenient way to do functional threading, there is a drawback: the number of threads OpenMP creates is limited to the number of section groups defined or the number of processors available, whichever is less. So even if the code in the example were run on a system with four processors, only two processors would be utilized. Conversely, if there are more section blocks than processors, OpenMP decides how to schedule the blocks for execution, which may not be optimal.
The barrier in the example shows where the application waits for all the threads to complete.
Note that it is still possible to create unsafe threaded code with OpenMP; the programmer must consider all threading conditions even when using it.
Intel's Task Queue
The Intel C++ Compiler includes an extension to the OpenMP specification that allows tasks to be queued within a thread pool. This makes it easier to take advantage of all the processors in the system, provided you have at least as many parallelizable functions as processors.
The code in Figure 5 shows how to use task queuing to thread functions. Each task directive places the function into the queue. When a processor becomes available, the next function in the queue is executed on it.
This makes it much easier to target a variable number of processors. However, this method requires more careful planning of how functions are issued to the queue, since the number of processors running the parallelized section is unknown. Ultimately, this method lets OpenMP code sections take advantage of all available processors.
Threading a game need not be a difficult task. With proper planning, threading can be designed into a game from the start to avoid many potential hurdles. As the number of processors in desktop systems increases, threading will grow more and more important. Designing the threading algorithm with more than two processors in mind helps guarantee that the largest amount of processing power is being made available to your game.
In addition to support for OpenMP in the compiler, Intel provides several other tools to help simplify the threading process. You can find these tools on the Intel Software Developer Products Web site.
For a more in-depth discussion of threading a game engine or threading in general, refer to the following papers (and much more) located at http://software.intel.com/en-us/performance:
Jeff Andrews is an Application Engineer with Intel Corporation specializing in optimizing code for ISVs. He also researches software technologies that enhance the performance of applications on Intel® processors.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804