Optimizing Game Engines with the New Intel® Graphics Performance Analyzers 3.0 Platform View

  • This article was originally written using Intel GPA Version 3. However, many of the performance hints and techniques discussed here are independent of a specific version of Intel GPA.
  • The latest version of Intel GPA can be downloaded from the Intel GPA Home Page.
  • Though this article mentions being able to use Intel GPA with Intel TBB, the latest versions of Intel GPA and Intel TBB use different instrumentation APIs, so you will not be able to see Intel TBB activity within the Intel GPA Platform Analyzer tool. Your only option at this time is to manually instrument your code using the ITT API as documented in the Intel GPA SDK Reference Guide (available from the Windows Start Menu for Intel GPA).


Game software developers need analysis tools that enable them to not only measure performance over time but also quickly and easily zero in on bottlenecks and other issues that may be hindering performance. To do this effectively, developers must be able to map performance back to their specific game engine environments. The new Intel® Graphics Performance Analyzers (Intel® GPA) 3.0 release includes a tracing infrastructure that makes this contextual analysis possible. Intended as a "how-to" guide for developers, this paper details the process of using Intel GPA 3.0 trace capabilities to achieve new levels of analysis and time-saving efficiency.


Intel GPA is a suite of software tools that provides in-depth, real-time analysis for games running on multi-core platforms. The new 3.0 release of Intel GPA includes a tasking feature set designed to provide full system-level performance analysis for Microsoft DirectX* games. This new feature set is based on tracing technology.

Intel GPA 3.0 ships with a tracing API that enables developers to instrument games contextually. After tracing a game engine, the developer can then use Intel GPA Platform View to obtain context-specific performance analysis based on the areas of the engine that were instrumented. Mapping performance to the specific game context in this way makes it easier to precisely identify bottlenecks and experiment with changes.
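The idea of contextual instrumentation can be illustrated with a minimal sketch. The code below is not the Intel GPA tracing API; `TraceLog`, `ScopedTask`, and `update_fire` are hypothetical names invented for this example. It only shows the general pattern of wrapping a region of engine code in a named, timed task so that the measurement carries game-specific context. The log is deliberately not thread-safe, to keep the sketch short.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tracing API; this is NOT the Intel GPA interface.
struct TraceEvent {
    std::string name;    // engine-specific context, e.g. "Fire::Update"
    double start_ms;     // start time relative to the creation of the log
    double duration_ms;  // wall time spent inside the traced region
};

// Collects finished events. Not thread-safe: single-threaded use only.
class TraceLog {
public:
    using Clock = std::chrono::steady_clock;

    TraceLog() : origin_(Clock::now()) {}

    double now_ms() const {
        return std::chrono::duration<double, std::milli>(Clock::now() - origin_).count();
    }
    void record(TraceEvent e) { events_.push_back(std::move(e)); }
    const std::vector<TraceEvent>& events() const { return events_; }

private:
    Clock::time_point origin_;
    std::vector<TraceEvent> events_;
};

// RAII helper: the task begins at construction and ends at scope exit.
class ScopedTask {
public:
    ScopedTask(TraceLog& log, std::string name)
        : log_(log), name_(std::move(name)), start_ms_(log.now_ms()) {}
    ~ScopedTask() {
        log_.record({std::move(name_), start_ms_, log_.now_ms() - start_ms_});
    }
    ScopedTask(const ScopedTask&) = delete;
    ScopedTask& operator=(const ScopedTask&) = delete;

private:
    TraceLog& log_;
    std::string name_;
    double start_ms_;
};

// Hypothetical engine function instrumented with one named task.
void update_fire(TraceLog& log) {
    ScopedTask task(log, "Fire::Update");
    // ... subsystem work would run here ...
}
```

Because each event is recorded when its scope exits, nested regions appear in the log in completion order, with inner tasks before the outer task that contains them.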

To provide objective best practices for using the tasking feature set and tracing technology, Intel developers instrumented and analyzed Smoke, a game engine application similar to game software used by Intel customers (Figure 1). Smoke was developed in-house at Intel and employs Intel® Threading Building Blocks (Intel® TBB) for scalable parallelization. The process of instrumenting and analyzing Smoke was quick and easy, requiring only a few hours. This white paper offers key takeaways for developers planning to use the Intel GPA tracing technology and Intel GPA Platform View in their own environments.

Figure 1: The Smoke game engine with its internal performance HUD enabled

Exploring the Value of Instrumented Intel TBB

For game engines that use Intel Threading Building Blocks, dropping in an instrumented version of Intel TBB (available online on the Intel TBB Web page) is a highly efficient and useful way to speed analysis. The Intel developer team began by swapping out the existing version of Intel TBB in the Smoke code and replacing it with an instrumented version. The team simply rebuilt the code, ran Smoke, and created a trace file (Figure 2).

The team knew that Smoke was running at a poor frame rate and gained its first insight into the cause using this instrumented Intel TBB-only configuration. Note that the vtasks bar chart in Figure 2 is displaying CPU durations for individual parallel-for executions. The visualization shows that a single parallel-for construct is consuming all threads within the system for 75 percent of the frame time. It does not pinpoint the offending code, but by highlighting a parallel-for construct, it indicates the area of the game engine to investigate.

The Intel TBB-only approach, with no other tracing enabled, allows a developer to immediately view the utilization of the system from the perspective of the Intel TBB API. If a game engine is attempting to use Intel TBB for all parallelization needs, the developer can see how effectively the game engine is scaling on the current machine simply by zooming out to a single frame level and viewing the density of the Intel TBB constructs (how much white space exists). Figure 2, for example, shows that Intel TBB is consuming seven out of eight threads. The scheduled utilization across those seven threads is quite good.

Figure 2: Tracing Smoke using the instrumented Intel TBB-only approach

When developing code to an API, viewing performance at the API level helps determine how long each API call takes relative to the others. However, it is often more useful to extend the trace visualization into the bodies of the Intel TBB parallel constructs (parallel-for in Smoke). This adds game engine-specific context to the generic Intel TBB trace. Once this is done and visualized, the developer can understand the engine-specific usage of Intel TBB. For example, a long parallel-for can now be unambiguously identified and linked to a specific game engine code segment.

Figure 3 shows how this context can definitively pinpoint game engine bottlenecks. Note that the user can now tell exactly what is going on with the parallel-for that was taking 75 percent of frame time. In this case, one thread on the CPU is stuck in a vertex buffer lock call and all other threads are blocked within a spin-lock waiting for that resource. This is an actionable item that can be understood and possibly fixed.

Figure 3: Close-up view showing four of eight threads, revealing spin-lock behavior

Determining If a Game Is GPU or CPU Bound

The Intel team assumed at first that the Smoke game engine would be bound by the CPU on most available systems that have a combination of a CPU and a GPU. Prior to using Intel GPA, the team had done basic analysis of CPU utilization from OS perfmon counters as well as CPU/GPU frame wall time analysis. In the first case, the perfmon counters from the Microsoft OS showed that the CPUs were heavily utilized, in most cases above 80 percent. The frame wall time analysis showed that the CPU frame was slightly longer than the GPU frame. Both of these indicators created the perception that the game was CPU bound. However, this was not the case. The workload was actually GPU bound, and the team was able to determine this by using Intel GPA.

Initial analysis with Intel GPA, accomplished by simply looking at the CPU and GPU frame timelines in the GPA tool GUI, revealed that the GPU was usually a couple of frames behind (Figure 4). This frame visualization did not require any end-user tracing, but rather was performed by GPA automatically "under the covers." Further analysis revealed a single parallel-for that consumed nearly the entirety of the CPU frame. This parallel-for affected all threads in the system.

Figure 4: GPA automatically visualizes CPU and GPU frame durations and relationships to each other.

After adding some trace code to Smoke itself, the Intel team soon determined the reason that the parallel-for was taking so long: the code running underneath was spin-locked the entire time. The code in question was within the Ogre* middleware used in the demo for graphics scene management and rendering. Ogre spends time updating data for the next frame as part of the post-rendering process. With the addition of a few more traces, it became clear that one thread was waiting in a vertex buffer lock call while all other threads were in a spin-lock waiting to access that buffer or some other data structure.

With Ogre found to be stuck in a vertex buffer lock call, the team focused on the graphics processor. Although the hardware system in use had a high-end, multi-core CPU, the GPU was an entry-level NVIDIA* 9500. The team swapped out the 9500 for a more powerful NVIDIA 8800 GTS and captured a new version of the same trace. Comparing full-frame views of the NVIDIA 9500 trace (Figure 5a) and the 8800 GTS trace (Figure 5b), the shorter red horizontal bars in the second visualization showed the problem was much shorter in duration with the faster GPU, confirming the game was GPU bound. With an even faster GPU, the lock would no longer consume any time at all.
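The reasoning behind this finding can be sketched as a simple heuristic. The code below is an illustration only, not something Intel GPA computes: `FrameSample`, `classify`, and the field names are hypothetical, and the numbers in the usage note are invented rather than taken from the Smoke traces. The point is that CPU time spent spinning or stalled in a buffer lock counts as "busy" in utilization counters and in frame wall time, so it should be subtracted before comparing the CPU frame against the GPU frame.

```cpp
// Hypothetical heuristic; names and thresholds are assumptions for illustration.
enum class Bound { Cpu, Gpu };

struct FrameSample {
    double cpu_frame_ms;  // CPU frame wall time
    double cpu_wait_ms;   // part of the CPU frame spent in spin-locks or lock stalls
    double gpu_frame_ms;  // GPU frame time
};

// Compare only the useful CPU work against the GPU frame time.
Bound classify(const FrameSample& f) {
    const double useful_cpu_ms = f.cpu_frame_ms - f.cpu_wait_ms;
    return useful_cpu_ms < f.gpu_frame_ms ? Bound::Gpu : Bound::Cpu;
}
```

For example, a sample of `{40.0, 30.0, 38.0}` has a CPU frame slightly longer than its GPU frame, which looks CPU bound at first glance, yet `classify` reports `Bound::Gpu` because only 10 ms of the CPU frame is useful work.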

Once a workload is found to be GPU bound, Intel GPA enables the developer to immediately take a frame capture and transition to the Intel GPA Frame Analyzer for deep GPU analysis (see the Intel GPA Web site for more details on the GPA Frame Analyzer). Best-practice workflow with Intel GPA is to use the Platform View as a starting point to determine if a game is CPU or GPU bound, then transition to Intel GPA Frame Analyzer if GPU bound, or continue to use the Platform View or Intel® Parallel Studio if CPU bound.

Figure 5a: Trace visualization with Smoke running on an NVIDIA 9500

Figure 5b: The same trace with Smoke running on an NVIDIA 8800 GTS

Examining Back-End Game Engine Processing

After working through the GPU-bound finding, the team inspected the back-end trace once again. Visible in both the timeline view and the textual summary view was the fact that all the threads were taking turns writing to thread-synchronized data structures. Most of the time, the majority of threads were stuck in a spin-lock, waiting to access a vertex buffer or other structure.

The team found that Intel TBB parallel-for was used to parallelize the Smoke code across N threads, but access to the shared data structures was effectively serialized. A view of back-end frame processing just after rendering completed proved instructive on this point (Figure 6). Red tasks in the visualization are spin-locks, and dark green areas represent the time that the thread "owned" the mutex protected by the spin-lock. The Intel team concluded that its Smoke engine code would benefit from increased parallelization of access to the shared data structures, a valuable insight provided by Intel GPA analysis.
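The serialization pattern, and one common way around it, can be sketched in a few lines. This is not the Smoke or Ogre code, and it uses plain `std::thread` rather than Intel TBB; `SpinLock`, `fill_shared`, and `fill_partitioned` are hypothetical names. The first function funnels every write through a single spin-lock, so the threads take turns. The second gives each thread a private buffer and merges once at the end, which is one possible approach and not necessarily the change the Intel team made.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal spin-lock, for illustration only.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Serialized: all threads contend for one lock around one shared container.
std::vector<int> fill_shared(std::size_t threads, std::size_t per_thread) {
    std::vector<int> shared;
    SpinLock lock;
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&shared, &lock, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                lock.lock();
                shared.push_back(static_cast<int>(t));
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}

// Partitioned: each thread writes to its own buffer; one merge after the join.
std::vector<int> fill_partitioned(std::size_t threads, std::size_t per_thread) {
    std::vector<std::vector<int>> local(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&local, t, per_thread] {
            local[t].reserve(per_thread);
            for (std::size_t i = 0; i < per_thread; ++i) {
                local[t].push_back(static_cast<int>(t));
            }
        });
    }
    for (auto& th : pool) th.join();

    std::vector<int> merged;
    merged.reserve(threads * per_thread);
    for (const auto& buf : local) {
        merged.insert(merged.end(), buf.begin(), buf.end());
    }
    return merged;
}
```

Both functions produce the same number of elements, but only the partitioned version lets the threads write concurrently. In a trace, the first variant would show the alternating lock ownership described above.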

Figure 6: Smoke back-end serialization, indicating the need for parallel access to data structures

Using Virtual Tasks for Subsystem Analysis

Virtual tasks provide the ability to analyze performance at any level of context that a developer desires. In the case of games, it is very helpful to understand the relative impact of each subsystem, per frame, within the game engine. The Intel team was able to gain this understanding of Smoke by instrumenting all the tasks within the game engine and then adding virtual tasks, one per subsystem, with a relationship defined between each task and the subsystem virtual task.
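Conceptually, a per-subsystem virtual task is an aggregate over the traced tasks that belong to that subsystem within a frame. The sketch below shows that aggregation on plain data; it does not use the Intel GPA API, and `Task`, `VirtualTaskTable`, and `build_virtual_tasks` are hypothetical names chosen for this example.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One traced task, tagged with the frame and subsystem it belongs to.
struct Task {
    int frame;
    std::string subsystem;  // e.g. "Fire" or "AI"
    std::string name;       // e.g. "Fire::Simulate"
    double cpu_ms;
};

// Key: (frame, subsystem). Value: total CPU time of the related tasks.
using VirtualTaskTable = std::map<std::pair<int, std::string>, double>;

// Sum task durations into one virtual task per subsystem per frame.
VirtualTaskTable build_virtual_tasks(const std::vector<Task>& tasks) {
    VirtualTaskTable table;
    for (const Task& t : tasks) {
        table[std::make_pair(t.frame, t.subsystem)] += t.cpu_ms;
    }
    return table;
}
```

Each entry in the resulting table corresponds to one bar in a per-frame bar chart: its height is the summed time of every task related to that subsystem in that frame.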

Virtual tasks are visualized in a bar chart at the top of the user interface (Figure 7). The set of displayed virtual tasks is configurable by the end user via a pull-down combo box at the top left of the bar chart. In this example, two subsystems have been selected for visualization: artificial intelligence (AI) and Fire. The tall bars represent the time spent in the Fire subsystem per frame, and the smaller bars are the time spent in the AI subsystem per frame. The results are not surprising, since the Smoke demo is heavy on Fire processing time.

The tool GUI automatically mirrors selection sets between the virtual task bar chart and the task timeline, as well as all other visualization panels. This feature allows the user to select a tall virtual task bar in the bar chart and let the timeline automatically select and highlight all tasks that are related to that virtual task. In this example, the user can select a tall Fire subsystem bar and immediately see all tasks that contribute to the heavy Fire subsystem activity. This Intel GPA view is extremely helpful because it enables the developer to determine the impact of each subsystem, the balance between subsystems, and the interactions between subsystems. With Smoke, for example, another subsystem might be starving the Fire simulation.

Intel GPA can make subsystem analysis much more efficient and less time-consuming for the developer. For example, as more and more fire is added to the scene in Smoke, or as the camera pans across the scene, a developer would be able to see how those changes affected the balance within various game engine subsystems. By selecting a subsystem bar in the bar chart at the top of the screen, the developer can highlight all related tasks in the timeline below, and other work can be hidden (Figure 8). The user can toggle the "hide state" for nonselected tasks by hitting the H key while the mouse is highlighting the timeline. The selected tasks are also displayed in the summary panels to the right.

Once the tasks of a subsystem are selected, the developer can sort the summary view by "percent total time" and have a link from "slow frame" to "expensive engine subsystem" to "expensive tasks within a subsystem," which ultimately provides an actionable block of code to optimize. Intel GPA is the only toolset that provides this performance analysis capability.
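The "percent total time" ordering can be sketched on plain data. This is an illustration of the ranking step only, not how the Intel GPA summary view is implemented; `TaskShare` and `rank_by_percent` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct TaskShare {
    std::string name;
    double cpu_ms;
    double percent;  // share of the selected tasks' total time
};

// Rank tasks by their share of total time, most expensive first.
std::vector<TaskShare> rank_by_percent(
    const std::vector<std::pair<std::string, double>>& tasks) {
    double total_ms = 0.0;
    for (const auto& t : tasks) total_ms += t.second;

    std::vector<TaskShare> ranked;
    ranked.reserve(tasks.size());
    for (const auto& t : tasks) {
        // Guard against an empty or all-zero selection.
        const double pct = total_ms > 0.0 ? 100.0 * t.second / total_ms : 0.0;
        ranked.push_back({t.first, t.second, pct});
    }
    std::sort(ranked.begin(), ranked.end(),
              [](const TaskShare& a, const TaskShare& b) {
                  return a.percent > b.percent;
              });
    return ranked;
}
```

The first entry of the result is the task that dominates the selected subsystem, which is the natural place to start optimizing.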

Figure 7: Timeline and summary visualization of the Smoke front end with Fire subsystem task set selected

Figure 8: Timeline and summary visualization with the same Fire task set selected and other work hidden


With the 3.0 release of Intel GPA, the goal was to provide developers with a highly contextual multi-core processor framework for game analysis, one that maps performance to the specific game engine. The tasking feature set and tracing technology in Intel GPA 3.0 enable developers to easily instrument game engine applications with tracing code and analyze them in depth and in context. Capabilities include immediately viewing the utilization of the system, determining if a game is GPU or CPU bound, examining game engine processing in detail, and using virtual tasks for subsystem analysis.

Together, these capabilities add up to industry-leading performance analysis. Task tracing in Intel GPA 3.0 is also efficient in terms of overhead: with the Smoke game engine fully instrumented, its frame rate was nearly identical to the original pre-instrumentation frame rate, indicating near-zero trace overhead for Intel GPA 3.0.

For more information about Intel GPA, visit www.intel.com/software/gpa and for more information about Ogre, visit www.ogre3d.org.

About the Author

Christopher Cormack is the lead product designer for the Intel GPA suite of graphics performance tools at Intel, most recently Intel GPA 3.0.