by David Landau and Xin Liu
The University of Southern California’s GamePipe Laboratory is part of the USC Viterbi School of Engineering and one of the leading game development degree programs in the United States. Each academic year, senior undergraduate and master's students from the GamePipe program collaborate across disciplines with students in the USC School of Cinematic Arts’ Interactive Media program to take faculty-approved projects from concept to a highly polished game.
Tales from the Minus Lab was one of the chosen projects for the 2011‒2012 academic year, with development beginning in the Fall of 2011. Minus Lab is a first-person adventure/exploration game built around the player’s ability to change size between 6 millimeters and 1.5 meters while remaining in the same environment. Due to the unusual physics precision and implementation required for this mechanic, we identified the need to build a custom game engine that could integrate with Havok Physics for the project.
Midway through the project, the team began to notice performance degradation as more assets were added to the game. Using Intel® Graphics Performance Analyzers (Intel® GPA) and the Intel® Parallel Studio XE suite of tools, we were able to identify several performance bottlenecks. Based on our analysis of the results, the most effective improvement was to split the rendering step into its own thread and run it asynchronously from the main game update. Using Intel® Threading Building Blocks (Intel® TBB), we achieved an average performance increase of 15‒20% across most test systems, and as much as 100% on systems that were particularly CPU-bottlenecked. The result was a smoother, more responsive experience: input handling was now processed synchronously with rendering, which, combined with the higher frame rates, reduced input lag.
Identifying the Bottleneck
We started by using Intel GPA’s in-game HUD to monitor our various CPU, GPU, and Microsoft Direct3D* metrics. We quickly discovered that the game was CPU-bound—making as many as 900 draw calls in a frame in certain situations due to the lack of sophisticated visibility culling in our rendering middleware. This was a limitation that could not be worked around without substantially modifying the rendering system and suffering a significant amount of downtime in production, something the team could not afford. Thus, we felt that minimizing the amount of time the CPU spent preparing each frame was the best way to improve rendering speed.
Detailed CPU Performance Analysis
Intel® VTune™ Amplifier provides a suite of tools for tracing CPU performance, from lightweight activity sampling to detailed tracing, capturing, and concurrency analysis, giving insight into all aspects of the CPU performance picture. To facilitate CPU profiling with VTune Amplifier, we integrated calls to the Instrumentation and Tracing Technology (ITT) API into the engine. This allows CPU capturing to be toggled in-game while the profiler is attached, so the captured data is more accurate and relevant. By capturing data only when the engine was most bottlenecked, we discovered that around 80% of the CPU time was spent in the rendering system, with the remaining 20% spent in the game update.
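As a rough illustration of this kind of in-game toggle, here is a minimal sketch, not the engine's actual code. It relies on the ITT collection-control calls __itt_pause() and __itt_resume() from ittnotify.h. The CaptureToggle class, the wrapper functions, and the USE_ITTNOTIFY build switch are hypothetical names introduced for this example. Without the switch, the wrappers compile to no-ops so the sketch builds on a machine without the Intel headers.

```cpp
// Define USE_ITTNOTIFY and link against the ITT library to drive the real
// collector; without it the wrappers below are no-ops (assumption made so
// that this sketch builds anywhere).
#ifdef USE_ITTNOTIFY
#include <ittnotify.h>
#endif

// Thin wrappers around the ITT collection-control calls.
inline void pauseCollection() {
#ifdef USE_ITTNOTIFY
    __itt_pause();
#endif
}

inline void resumeCollection() {
#ifdef USE_ITTNOTIFY
    __itt_resume();
#endif
}

// Hypothetical helper: remembers whether the profiler is collecting and
// flips that state on request (for example, from a debug key handler).
class CaptureToggle {
public:
    // startCollecting should mirror how the profiler was launched
    // (false if collection was started paused).
    explicit CaptureToggle(bool startCollecting)
        : collecting_(startCollecting) {
        apply();
    }

    // Flips the capture state and returns the new state.
    bool toggle() {
        collecting_ = !collecting_;
        apply();
        return collecting_;
    }

    bool collecting() const { return collecting_; }

private:
    void apply() const {
        if (collecting_) {
            resumeCollection();
        } else {
            pauseCollection();
        }
    }

    bool collecting_;
};
```

In a game loop, toggle() would typically be bound to a debug key so that collection runs only during the scenes of interest.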
Hotspot Analysis, showing 84% of our CPU time being used to render a frame.
We also noticed that Havok was being directly accessed by portions of our draw loop. Havok Physics is inherently multithreaded when run as a continuous physics simulation, but it should have been powered entirely by the update loop. To look into any possible performance implications caused by physics being mixed in with rendering, we ran a Locks and Waits analysis in VTune. This test allows the user to see how much time threads and functions spend waiting for and acquiring locks.
Locks and Waits analysis in single-threaded mode. Havok [name] is causing wait time in our Engine Draw.
This led to the discovery of the second bottleneck: the draw loop was blocking, and being blocked by, the physics simulation. Because our engine ran Havok in fixed-timestep simulation mode, Havok required that the physics world be updated a minimum number of times per frame. This in turn meant that if physics had to be stepped several times in a frame to catch up with the elapsed game time, rendering would be blocked waiting for that to complete. With rendering itself potentially taking a long time to complete, this sometimes produced a vicious cycle (slow frames required more physics steps, which made frames slower still) that would grind the engine to a halt.
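The catch-up behavior can be illustrated with a generic fixed-timestep accumulator. This is a minimal sketch under stated assumptions, not Havok's API or the engine's code: the FixedStepClock type and the 16 ms step are hypothetical, and times are whole milliseconds to keep the arithmetic exact.

```cpp
#include <cassert>

// Hypothetical fixed-timestep bookkeeping: each frame adds its elapsed time
// to an accumulator, and physics must be stepped once per full step of
// accumulated time before the frame can be drawn.
struct FixedStepClock {
    int stepMs;          // fixed physics step length in milliseconds
    int accumulatorMs;   // game time not yet simulated

    explicit FixedStepClock(int step) : stepMs(step), accumulatorMs(0) {
        assert(step > 0);
    }

    // Returns how many physics steps are owed for a frame that took
    // frameMs milliseconds. A single-threaded renderer waits for all of them.
    int stepsForFrame(int frameMs) {
        assert(frameMs >= 0);
        accumulatorMs += frameMs;
        int steps = 0;
        while (accumulatorMs >= stepMs) {
            accumulatorMs -= stepMs;
            ++steps;
        }
        return steps;
    }
};
```

With a 16 ms step, a 16 ms frame owes one physics step, while a 50 ms frame owes three. If those extra steps push the next frame past 50 ms, the step count keeps growing, which is the cycle described above.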
We therefore decided to separate the rendering and updating loops into their own threads using Intel TBB.
Breaking Off Rendering
To maximize the performance gained by separating rendering and updating into their own threads, each system must synchronize as little as possible with the other. Through analysis of the engine design, we determined that the rendering thread only needed to poll the physics world for transformation data to draw the objects in the world. With the use of Intel TBB’s concurrent_queue and spin_rw_mutex, we were able to keep our synchronization overhead per frame very low and avoid expensive operating system context switches. However, some overhead could not be avoided as other gameplay systems, predominantly UI updating, were too tightly coupled with rendering and had to be made thread-safe by locking down rendering while the actions were performed.
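The general shape of this arrangement can be sketched as follows. This is an illustration only, not the engine's code. To stay self-contained it uses standard C++17 primitives as stand-ins: std::shared_mutex in place of tbb::spin_rw_mutex, and a small mutex-guarded queue in place of tbb::concurrent_queue, mirroring its push and try_pop calls. TransformStore, LockedQueue, runDemo, and the use of the queue for a quit event are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <shared_mutex>
#include <thread>
#include <utility>
#include <vector>

struct Transform { float x, y, z; };

// Stand-in for the physics world's transform data: the update thread
// writes, the render thread only reads.
class TransformStore {
public:
    explicit TransformStore(std::size_t count) : transforms_(count) {}

    void write(std::size_t i, const Transform& t) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive
        assert(i < transforms_.size());
        transforms_[i] = t;
    }

    // Copies the transforms under a shared (reader) lock.
    std::vector<Transform> snapshot() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return transforms_;
    }

private:
    mutable std::shared_mutex mutex_;
    std::vector<Transform> transforms_;
};

// Minimal thread-safe queue with a non-blocking pop.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }

    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};

// Runs a tiny update/render pair and returns the last x value the
// "renderer" observed. Event id 0 is a hypothetical quit signal.
inline float runDemo(int updates) {
    TransformStore world(1);
    LockedQueue<int> events;
    float lastSeen = 0.0f;

    std::thread render([&] {
        for (;;) {
            lastSeen = world.snapshot()[0].x;  // "draw" the latest data
            int ev = -1;
            if (events.try_pop(ev) && ev == 0) break;
        }
        lastSeen = world.snapshot()[0].x;      // final frame after quit
    });

    std::thread update([&] {
        for (int i = 1; i <= updates; ++i) {
            world.write(0, Transform{static_cast<float>(i), 0.0f, 0.0f});
        }
        events.push(0);
    });

    update.join();
    render.join();
    return lastSeen;
}
```

The render loop never waits for the update loop to finish a frame. It only takes a short reader lock to copy transforms, so a burst of physics steps delays rendering by at most the length of one write.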
After these steps were completed, we used Intel GPA to measure our performance gains:
Multithreaded performance. We see an FPS gain of 20% and a 2.43x increase in CPU utilization.
As the results above show, by almost eliminating the dependency of rendering on updating and allowing rendering to run as fast as it can, we gained an additional 10‒20% in FPS essentially for free, without needing to reduce the level of detail or visual quality.
With multi-core computing becoming a staple in mainstream PCs, developers have recognized the need to parallelize their games to better utilize the hardware and enable more sophisticated visual effects and gameplay. Using Intel GPA and Intel Parallel Studio XE, we identified several key bottlenecks. We then addressed them with some basic parallelization of our game engine using Intel Threading Building Blocks, which netted us a meaningful performance boost without sacrificing any gameplay features or visual quality. With Intel TBB being freely available for academic and non-commercial use, high-performance, multi-core-enabled game engines are no longer purely the realm of AAA studios.
Intel provides developers with tools that help them analyze and optimize their games, media, and graphics-intensive applications. To find out more about the tools mentioned in this article, see the following links:
To learn more about the USC GamePipe Laboratory, visit:
About the Authors
Xin Liu is a recent graduate from USC’s GamePipe program (class of 2012) with a B.S. in Computer Science (Games). He is currently a Graphics Engineer at EA Maxis Emeryville where he works on the rendering engine for the upcoming SimCity (2013).
David Landau is currently pursuing an M.S. in Computer Science with a specialization in game development at USC. In his free time he is typically being a badass in online games with his wife and contributing to open source projects. He is the lead engineer on the upcoming student project Core Overload.