This paper describes the steps in the process of characterizing and optimizing the already parallel SMOKE Gaming Demo  using Intel's software suite of tools. Code characterization was done mainly with the Intel® Thread Profiler component of the Intel® VTune™ Performance Analyzer optimization tool, while the parallel optimizations were done mainly using the Intel® Threading Building Blocks (Intel® TBB)  template library. It is demonstrated that the use of Intel® TBB's work-stealing task scheduler [3, 4] can significantly improve CPU utilization and frame update rate for the SMOKE Gaming-Demo. We argue that the techniques used here are applicable to other gaming codes as well.
The main frame processing loop of a typical computer game is composed of a few functional blocks. Typically these are the Artificial Intelligence (AI), Physics, Particles, and Rendering computational functions. A popular strategy for parallelizing this loop is to represent the functional blocks as stages of a pipeline which can then be run in parallel across available processors, where the normal dependencies between the stages of the pipeline apply. Naturally, in this approach theoretical scalability is also limited by the number and weights of the stages in the pipeline. Most implementations of this approach assign one stage to a physical processor. As a result, on machines where the number of processors is greater than the number of pipeline stages, this approach obtains no benefit from the additional processor resources available. On the plus side, the method can be implemented to maintain data locality with respect to processor cache as each data item is run through the pipeline.
In pure serial implementations of the main computational loop, the heterogeneous tasks which make up the body of the loop are executed in an ordered fashion. In the parallel case one may attempt to relax this ordering and attempt a more asynchronous execution of the tasks as long as the fidelity of the output is not affected or is maintained to some acceptable degree. Such an approach to parallelization of the frame loop is intriguing in that scalability is now limited only by the availability of tasks ready to execute and cores to process them - subject of-course to any and all constraints that apply between the different tasks. Further, one may parallelize each of the high level tasks into subtasks and exploit any fine grained parallelization that may be available. If one employs a thread pool, to avoid oversubscription of the cpus, the problem then becomes one of scheduling a pool of tasks to a pool of threads subject to any constraints that may apply.
This paper examines the realizable benefits of the afore-mentioned task parallel approach as applied to the SMOKE computer game demo. The task parallel implementation is done with the Intel® TBB C++ template library. In a nutshell, the library provides generic parallel algorithms and concurrent containers [5, 6] which enable users to write parallel programs without directly creating and managing threads. Indeed, with this library, users need only focus on representing code in terms of tasks. This step that can be done implicitly via library provided high level algorithms or explicitly via derivations of a base task class, also provided by the library. All aspects of thread management and the mapping of tasks to threads are handled by the library in a manner transparent to the user. Internally, the library treats tasks as user-level objects that are scheduled for execution by the Intel® TBB task scheduler. The task scheduler maintains a pool of native threads and a set of queues (one queue per thread) of tasks ready for execution. At initialization, the Intel® TBB task scheduler creates an appropriate number of threads in the pool (by default, 1 per hardware thread). During code execution the scheduler distributes tasks to threads using a randomized work-stealing algorithm. The decentralized (each thread has its own queue of tasks) work stealing mechanism is what enables the scheduler to achieve near optimal load balance and high scalability of parallel programs [3, 4]. All Intel® TBB algorithms are tested and tuned for the current generation of multi-core processors, and they are designed to scale as the core count continues to increase.
A detailed review of Intel® TBB features specifically as they apply to computer games has already been provided by one of us in the past . In this paper, we only focus on how Intel® TBB has been implemented in an optimized version of the SMOKE gaming demo. The source code we discuss here is publicly available on the Intel Developer Zone .
The organization of the remainder of the paper is as follows: the next section describes the use of Intel® Thread Profiler to obtain a detailed characterization of the SMOKE code. The section also covers the code changes made to alleviate the main bottlenecks detected by Intel® Thread Profiler. Following that, we report the performance improvements obtained as a result of the code changes. We end with a summary of the present work and point to some directions for future investigations.
Code Characterization, Parallelization
SMOKE was originally developed as a multi-threaded computer game demo implementing most of the major features of modern 3D games. It has been used to demonstrate efficient parallelization techniques applicable to the wide range of CPU-intensive computer games. As always, one can find room for improvement, and indeed initial runs of the SMOKE code under Intel® Thread Profiler revealed a sub-optimal overall concurrency level. A screenshot of the actual profile obtained is shown in Figure 1. The profile view (top pane) shows a concurrency level of slightly greater than 2 on 4 cores. The timeline view (bottom pane) shows the time based execution flow of the parallel code, namely the initialization step, the main computational loop, and the application wrap-up before termination. For SMOKE, this view also shows a region of considerable contention (marked in yellow by the tool). This part of the timeline corresponds to the main computation loop of the SMOKE code.
Figure 1:Intel Thread Profiler output for the initial run of the SMOKE binary. The top pane of the picture is the "Profile View"; the bottom pane is the "Timeline View".
With the initial profile in hand, the next step is to take a closer look at the timeline view. This is conveniently done using the zoom feature of Intel® Thread Profiler. One can select and magnify any region of the timeline view and, in the process, not only get a better view of the behavior of the threads but also automatically get the average concurrency level for the selected region. A zoomed view of just a few iterations (Figure 2) of the main computation loop showed two obvious things:
- The concurrency level did not change from iteration to iteration, implying that a very small subset could be chosen for detailed analysis
- The iterations suffer from ending with a critical section, where changes to the game scene get distributed among worker threads before they start working on the next frame
Figure 2:A zoomed-in view of a section of the timeline view is shown in the lower pane. This view focuses on only a few sample iterations of the main computational loop in the SMOKE code. The serialization between iterations is also clearly identified in this view. The concurrency level for these few iterations is shown in the top pane.
Ultimately, any analysis with Intel® Thread Profiler has only one main goal: to uncover issues with an application's multi-threading design or implementation and associate those issues with particular lines in source code. As such, the last step of our investigation is to isolate the source code corresponding to the serialization mentioned above. Intel® Thread Profiler allows one to do this via a menu item (obtained by right clicking anywhere near but not on the yellow line of a transition) called "Transition Source View". In this case the view shows the transition from four worker threads to one main thread, when it enters the critical section. The source view of the tool also displays the function call stack, and walking up this stack for the main thread, we arrive at the location of two function calls at the top level of the main computation loop. These function calls distribute changes made to various game objects in the iteration just completed. This is as far as Intel® Thread Profiler can take the analysis, and now the scrupulous work of reading the code begins.
Further investigation of the source showed that the "DistributeChanges()" functions responsible for updating game world objects and scenes cannot be parallelized in the manner written. Both of the functions contain data dependencies and use shared data structures. Their underlying data structures and the way threads work with them had to be modified in order for this functionality to become parallelizable. The changes to the relevant data structures were approached in the following manner:
- SMOKE uses an associative container (a map) to collect the change notifications made to game world objects. Find-and-Erase functionality in a thread-safe map introduces noticeable overhead. This overhead can be avoided if the map is replaced by a vector for which each thread has its own range of elements holding notifications for only it to process.
- The number of change notifications is not known in advance, so in order to be able to divide up the vector between all threads, each thread first needs to generate the notifications in Thread Local Storage (TLS), count them, grow the vector atomically to the necessary length, and then concurrently copy change notifications from TLS to the common container. For all of this no synchronization is required.
- After the above changes were implemented it turned out those functions that process different types of change notifications are not thread-safe either. Guarding access to certain common resources fixed this problem.
The changes outlined above were introduced into the code and Intel® Thread Profiler was used to assess the effect. Figure 3 below depicts a zoomed-in view of a previously single-threaded section of application run time. It is seen that a majority of the application run time is now spent at the concurrency levels of three or four.
Figure 3:Zoomed in view of a previously single threaded section of code. The changes described in the text have rendered the section parallel with a concurrency level of ~3 to 4.
The yellow bar in Profile View represents the time spent on synchronization between worker threads (i.e. overhead). However, it should also be noted that even with this overhead the changes made deliver an improvement to overall code performance. As with any synchronization point, the mechanism chosen to effect the synchronization is quite critical. The most commonly used mechanisms are: atomic operations, user-level and kernel-level mutexes, fair mutexes, and reader-writer mutexes. Each has its own advantages and disadvantages, and some experimentation is required when making the choice of mechanism for a particular application and workload. Here, Intel® Thread Profiler can assist in the following way. The timeline view of the region of interest shows that for this application and workload, worker threads "ping-pong" on the mutex/lock often and for a very short period of time. This indicates that user-level spin mutexes could be a good fit. Naturally, to be sure of the choice, one has to benchmark the application with respect to all possible mutex types and over the most commonly processed workloads on each of the target platforms. This approach will help account for hardware and software variations in the many possible environments that the code may be run in. In the case of SMOKE, these kinds of benchmarks confirmed the advantage of user-level synchronization. For this purpose then either an Intel® TBB spin_mutex or a Critical Section with a spin count would be good candidates.
Having addressed the serialization-between-iterations issue, we now turn to something fundamental in parallel processing: load balance. For the case at hand, the main loop is composed of a few different types of tasks each of which take different times to execute and offer different levels of potential parallelism. For example, the Rendering Task usually takes a long time and, barring any low level parallelism, is a single instance for each frame. AI tasks on the other hand usually take relatively small amounts of time to complete, with several instances possible per frame. AI tasks represent every "thinking" object in the game, and usually all of them can be executed in parallel on all available worker threads. Temporally, Physics tasks represent something in between AI and Rendering tasks. Physics tasks are therefore of "average" size and can appear in large numbers. The collision computation associated with them usually implies communication, which limits the concurrency level achievable in the handling of these tasks.
To understand the implications of all this in practice, we once again rely on the timeline view of Intel® Thread Profiler. We utilize a standard feature that allows one to mark certain activities for each worker thread. This allows a detailed analysis of tasks sizes, their execution order, and therefore worker thread load balance. Figure 4 illustrates the situation for SMOKE.
Figure 4:Intel® Thread Profiler screenshot showing load imbalance
To address this situation, we use the task scheduler available in Intel® TBB library. The scheduler maps available tasks to a thread pool in a manner designed to maximize concurrency. As mentioned in the introduction, the pool of threads is created and managed by the library in a fashion transparent to the user, so the user only focuses on representing code in terms of tasks. Tasks themselves are distributed to thread local queues and when a thread runs out of tasks in its own queue it can steal from another thread's queue. This is the feature that enhances concurrency [2-4].
This is only part of the story, because the order in which tasks are spawned (and thus submitted to the scheduler) will also be expected to have an effect on execution times. Indeed, if AI tasks get spawned first, Intel® TBB will balance the load via task stealing, and the threads of the thread pool will finish their execution of this group of tasks at almost the same time. Spawning the Rendering Task last will mean that when all worker threads finally finish with the AI tasks for example, only one thread will be able to pick up the serial Rendering task. The rest of the team will have to wait for the Rendering task to be executed.
With these considerations in mind the Intel® TBB task scheduler was used to improve the load balance on the CPUs. For the present work, explicit task-to-thread affinity was enforced to effect task prioritization. Indeed this could also have been done using Intel® TBB's affinity partitioner. In this approach, an affinity ID (a relative Intel® TBB thread ID) is set for a certain task making it a prioritized task in a way. A thread with this ID picks up the task as soon as it runs out of local work (tasks in the local queue). Thus, the high priority tasks, which in our case are Rendering and Physics, would get assigned an affinity ID. As soon as an iteration starts, one of the worker threads picks up the Rendering task, and then another thread picks the main Physics task, while the AI, Geometry and Particles tasks get divided among the rest of the worker threads by normal stealing. This would presumably result in finer balanced load and reduced wait times.
In the preceeding section, two main optimizations to SMOKE were described. First the underlying data structures for object change notifications and the way worker threads accessed and processed these notifications were restructured. This gave roughly a 12-15% overall performance improvement.
The other optimization was the inclusion of the Intel® TBB task scheduler to improve the load balance of the parallel execution. Prior to rework, the CPU load on a 2x 4 core machine  was in the range of 55% to 65%. After rework the load on the cores was in the 90% to 95% range. Frame rate improvements due to rework also improved by roughly 45% to 60%.
Summary and Conclusions
In summary, straightforward use of Intel® Thread Profiler identified that (i) the code spent a noticeable amount of time undersubscribed (ii) a significant amount of serialization existed in the main computational loop (iii) the concurrency levels did not change from iteration to iteration of the main computational loop (iv) under subscription occurred as a result of synchronization between iterations (v) under-subscription was root caused to two functions responsible for object change notifications.
Examination of the source code pointed the way for the functions to be restructured and parallelized with limited points of synchronization. The resulting code however still suffered from load imbalance. The Intel® TBB task scheduler was used to improve the overall CPU load and the balance of the code.
This work has demonstrated the effectiveness of the Intel® Thread Profiler in conjunction with the Intel® TBB library to achieve performance improvements to the SMOKE gaming demo code. One future opportunity for performance gain could be to examine the use of Intel® TBB's affinity partitioner. Other avenues have been suggested and discussed in . All of these considerations are expected to apply to gaming codes in general with only limited specificity if any to SMOKE itself.
We acknowledge the generous support received from Intel's Developer Products Division, Visual Computing Software Division, and Software Solutions Group during the course of this work & the writing of this paper.
- SMOKE Game-Technology Demo, Intel Developer Zone, available at http://software.intel.com/en-us/articles/smoke-game-technology-demo
- James Reinders, Intel Threading Building Blocks, O'Reilly Media, Inc, Sebastopol, CA, 2007.
- Robert D. Blumofe and Charles E. Leiserson, "Scheduling Multithreaded Computations by Work-Stealing," in Proceedings of the 35th Annual IEEE Conference on Foundations of Computer Science, Sante Fe, New Mexico, November 20-22, 1994.
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall and Yuli Zhou, "Cilk: An Efficient Multithreaded Runtime System," in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '95), Santa Barbara, California, July 19-21, 1995.
- Michael Voss, "Demystify Scalable Parallelism with Intel Threading Building Block's Generic Parallel Algorithms," DevX.com, Jupiter Media, October 2006, at http://www.devx.com/cplus/Article/32935.
- Michael Voss, "Enable Safe, Scalable Parallelism with Intel Threading Building Block's Concurrent Containers," DevX.com, Jupiter Media, December 2006, at http://www.devx.com/cplus/Article/33334.
- Bradley Werth, "Optimizing Game Architectures with TBB", at http://www.gamasutra.com/view/feature/3970/sponsored_feature_optimizing_game_.php.
- The system used for testing was an Intel X5355: 2x4, 2.66 GHz, 8G RAM, Windows XP x64 Pro SP2, GeForce 8800 GTX. For more info see: http://en.wikipedia.org/wiki/Xeon#5300-series_.22Clovertown.22
About the Authors
Andrei Marochko is a Senior Development Engineer in the Performance Analysis and Threading (PAT) group in Intel's Developer Products Division. With almost 20 years of experience as a software developer, he has worked in a wide range of areas including numerical analysis, GUI development for distributed client-server applications, threading libraries, threading runtimes and threading tools. He holds a MS degree from Ivanovo State Chemistry and Technology University. Bradley Werth is a Senior Software Engineer in Intel's Entertainment Technical Marketing Engineering group.
Bradley Werth is a Senior Software Engineer in Intel's Entertainment Technical Marketing Engineering group. He received his Computer Science MS from University of Oregon in 2005 and his BS from USC in 1996. His focus at Intel is on developing and optimizing game features that take maximum advantage of the PC platform. He has spoken at GDC and Austin GDC about effective methods for threading game architectures.
Anton Pegushin has worked for Intel, Russia, for a little over 6 years. He started as part of the development team for the Intel® MPI Library and related cluster tools products. He is currently a Senior Technical Consulting Engineer for Intel's Threading and Performance Analysis group specializing in threading and in the use of Intel® Threading Building Blocks. He holds a MS in Applied Math from Nizhny Novgorod State University (NNSU) which he received in 2003. In 2007 he earned a PhD from Saratov State University.
Michael D'Mello has spent the last 18 years in the computer industry specializing in Parallel Computing. Past employers include Thinking Machines Corporation, Convex Computer Corporation, and the Hewlett-Packard Company. He has been with Intel since 2003 and is currently a Senior Technical Consulting Engineer in Intel's Developer Products Division. He holds a Ph.D. in Chemical Dynamics from the University of Texas, Austin.
*Other names and brands may be claimed as the property of others.