NOTE: This article is based on optimizations performed with Intel GPA Version 3.0. Although this is not the latest version of Intel GPA, the techniques discussed here are independent of any specific version of the product. Check out the GPA Home Page to download the latest release.
GSC Game World is the leading Ukrainian game development company and one of the leading studios in Eastern Europe. Based in Kiev, they are known worldwide for their S.T.A.L.K.E.R. and Cossacks game series. In February of 2010 the company released their latest game, "S.T.A.L.K.E.R.: Call of Pripyat" in the United States as well as internationally. This game continues the successful tradition of letting players experience the post-apocalyptic atmosphere of "The Zone" and Chernobyl Nuclear Station.
GSC's programmers joined the Intel® Graphics Performance Analyzers (Intel® GPA) beta program in late 2008, and since that time have actively used the tools in their development process. During the development of S.T.A.L.K.E.R.: Call of Pripyat, they used Intel GPA to optimize the game for Intel Integrated Graphics-based systems as well as for a variety of non-Intel DirectX*-based GPUs, which Intel GPA also supports. This article discusses the specific methods and features of Intel GPA that GSC used to identify bottlenecks in the game and improve its performance, maximizing the range of playable PC configurations.
To analyze performance with Intel® GPA, the developers used a two-PC configuration: an analysis machine equipped with an Intel® GMA X4500 Series graphics processor, which ran the game, and a client machine, which ran the Intel GPA tools. The client PC was linked to the development engineer's workstation through a standard TCP/IP network.
Analyzing System Performance
Modern first-person shooter (FPS) games have the potential to push graphics hardware to its limits, so optimizing a game to run on the wide variety of hardware that players are likely to use is critical. GSC Game World uses its own in-house game engine, which supports the latest graphics APIs (Microsoft DirectX* 9, 10, and 10.1). The engine is designed to offer the best possible balance between stunning graphics and high performance.
The first step of GSC's system performance analysis was to run game benchmarks at different resolutions and settings. This allowed them to determine which settings offered reasonable performance on the targeted graphics hardware. Intel GPA System Analyzer helped them quickly monitor performance and identify bottlenecks and potential optimizations. It showed the game's "heartbeat" through a list of real-time graphs for CPU and GPU counters, Microsoft DirectX* calls, and memory subsystem usage. It also provided the ability to perform "what if" experiments, such as enabling "null driver" or "null hardware" modes and disabling z-test or alpha blending. These experiments helped GSC determine whether their game was CPU bound or GPU bound, and which GPU pipeline stage was most likely to be the bottleneck. The real-time graphs also allowed GSC developers to see drops in frame rate for particular game scenes or situations, so they could focus their analysis and optimizations on the most likely performance hotspots.
Figure 1. Intel GPA System Analyzer screenshot showing the S.T.A.L.K.E.R.: Call of Pripyat "heartbeat".
The S.T.A.L.K.E.R.: Call of Pripyat game engine has several lighting models. The benchmark results showed that the Static Lighting mode provided the most reasonable frame rate on the Intel® GMA X4500 chipset, so GSC decided to use it as both the default setting and the optimization target. GSC's benchmarks also indicated that the frame rate was significantly lower in scenes with multiple characters on the screen. To ensure the game would use the proper settings on players' PCs, the developers pre-defined the configuration based on the deviceID reported through Microsoft Direct3D* (D3D). A list of Intel deviceIDs is publicly available on the Visual Adrenaline site: /en-us/articles/intel-gma-3000-and-x3000-developers-guide.
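As a rough illustration of deviceID-based defaults, the sketch below maps a vendor/device ID pair to a lighting preset. It is not GSC's code: the `LightingPreset` names, the `choosePreset` function, the fallback for unknown hardware, and the two device IDs in the table are all assumptions made for the example, and real entries should be taken from the published deviceID list. Only the Intel PCI vendor ID (0x8086) is a fixed value.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical quality presets; the names are illustrative, not the game's own.
enum class LightingPreset { Static, DynamicFull };

constexpr std::uint32_t kIntelVendorId = 0x8086;  // PCI vendor ID assigned to Intel

// Placeholder table. The device IDs are example values only and must be
// checked against the published Intel deviceID list before real use.
inline const std::unordered_map<std::uint32_t, LightingPreset>& knownIntelDevices() {
    static const std::unordered_map<std::uint32_t, LightingPreset> table = {
        {0x2A42, LightingPreset::Static},
        {0x2E22, LightingPreset::Static},
    };
    return table;
}

// In a Direct3D 9 application the two IDs would typically be read from the
// VendorId and DeviceId fields of D3DADAPTER_IDENTIFIER9, filled in by
// IDirect3D9::GetAdapterIdentifier. Here they are plain parameters so the
// lookup logic can be exercised without a graphics device.
inline LightingPreset choosePreset(std::uint32_t vendorId, std::uint32_t deviceId) {
    if (vendorId == kIntelVendorId) {
        const auto& table = knownIntelDevices();
        const auto it = table.find(deviceId);
        if (it != table.end()) {
            return it->second;
        }
    }
    // Assumed fallback for hardware that is not in the table.
    return LightingPreset::DynamicFull;
}
```

Keeping the lookup separate from the Direct3D query makes the default-selection policy easy to test and to extend as new deviceIDs are published.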
Moving to Frame Analysis
Once the specific bottleneck had been identified using Intel GPA System Analyzer, GSC Game World used the Frame Analyzer to drill down into specific low-performance frames and draw calls, taking advantage of the tool's rich set of instruments for pinpointing bottlenecks and problems.
The Erg Visualization Panel (the upper histogram in Figure 2) displays a histogram of captured frame events (draw calls and render/z-buffer clears). The X-axis represents events, and the Y-axis represents event time in milliseconds. GSC developers also took advantage of Intel GPA Frame Analyzer's other modes for displaying the histogram; for example, GPU breakdown mode displays the percentage of vertex and pixel shader workload for every draw call, which makes it easy to find the calls that are shading limited or geometry limited. GSC developers found that character rendering was vertex shader limited, making it the most significant bottleneck in frames where multiple characters are visible on the screen. Game developers typically want a good frame rate in dynamic scenes where many characters are visible, such as combat situations, so character rendering was identified as one optimization target for improving game playability.
Figure 2. Intel GPA Frame Analyzer showing an un-optimized frame.
Figure 2 shows an original frame from S.T.A.L.K.E.R.: Call of Pripyat. The GPU breakdown graph indicates that there are two groups of vertex shader limited calls, which correspond to two character rendering passes: rendering to a shadow map and the main rendering pass. In this figure, white indicates the percentage of time spent in the vertex shader. Also notice that the Shader tab shows that the vertex shader for one of these calls has a relatively high 46 instructions, which may result in slower performance for any primitives using this vertex shader.
Optimization Strategy and Results
The Intel® GMA 4500 Series has a unified architecture, which means that an array of unified Execution Units on the chip can process pixel, vertex, and geometry shaders, or other special threads (Figure 3). This gives the GPU the ability to load balance among the available workloads, so optimizing one type of workload frees up resources for the others. For example, optimizing the geometry workload may indirectly improve pixel processing performance. Consider these trade-offs as you optimize your game: after making improvements in one area, re-analyze the game's performance to see the impact on the frame rate.
Figure 3. Intel® Series Express Chipsets Architecture Diagram
In S.T.A.L.K.E.R.: Call of Pripyat, the vertex shaders used in character rendering utilized vertex skinning animation. GSC decided to implement software vertex skinning operations to offload expensive calculations from the GPU to the CPU. This approach was beneficial for the following reasons:
- Vertex skinning is well suited to a software implementation (particularly when your game renders the same animated models for shadows, mirrors, and reflections), since the animation data can be processed once per frame and the transformed geometry reused for all the rendering passes.
- By offloading geometry processing to the CPU, more resources are available in the GPU for pixel processing, thereby improving the frame rate.
- Integrated Graphics chipsets use system memory for the GPU, so there is no need to transfer large chunks of data from the CPU to the GPU, which is an expensive operation on discrete GPUs.
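The first point above can be sketched as a plain CPU implementation of linear blend skinning whose output is computed once and then shared by every pass. This is a minimal illustration under stated assumptions, not GSC's implementation: the `Vec3`, `Bone`, and `SkinnedVertex` types, the row-major 3x4 bone layout, the limit of four bone influences per vertex, and the `skinOnCpu` function are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Assumed layout: row-major 3x4 bone matrix, translation in the last column.
using Bone = std::array<float, 12>;

struct SkinnedVertex {
    Vec3 position;                    // bind-pose position
    std::array<int, 4> boneIndex;     // up to four influencing bones
    std::array<float, 4> boneWeight;  // expected to sum to 1
};

inline Vec3 transformPoint(const Bone& m, const Vec3& p) {
    return {
        m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3],
        m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7],
        m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11],
    };
}

// Linear blend skinning: each output position is the weighted sum of the
// bind-pose position transformed by each influencing bone.
inline std::vector<Vec3> skinOnCpu(const std::vector<SkinnedVertex>& verts,
                                   const std::vector<Bone>& bones) {
    std::vector<Vec3> out;
    out.reserve(verts.size());
    for (const SkinnedVertex& v : verts) {
        Vec3 acc{0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < 4; ++i) {
            const float w = v.boneWeight[i];
            if (w == 0.0f) continue;  // unused influence slot
            assert(v.boneIndex[i] >= 0 &&
                   static_cast<std::size_t>(v.boneIndex[i]) < bones.size());
            const Bone& bone = bones[static_cast<std::size_t>(v.boneIndex[i])];
            const Vec3 p = transformPoint(bone, v.position);
            acc.x += w * p.x;
            acc.y += w * p.y;
            acc.z += w * p.z;
        }
        out.push_back(acc);
    }
    return out;
}
```

In a renderer built this way, `skinOnCpu` would run once per character per frame, and the resulting positions would be written to a vertex buffer that the shadow-map pass and the main pass both draw from with a much simpler vertex shader.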
For software skinning, DirectX 9 offers default and mixed software vertex processing, but it does not support reusing the processed geometry across several passes. Because the Static Lighting model in the game engine requires this reuse, the GSC programmers developed their own implementation. Table 1 shows the results for the different implementations that were evaluated:
Table 1. Frame rates for the evaluated skinning implementations.
|Skinning implementation|Frame rate|
|Hardware skinning (vertex shaders)|30 fps|
|Default software skinning (software vertex processing)|53 fps|
|SSE-intrinsics optimized skinning|80 fps|
|Fully SSE-optimized assembler code skinning|102 fps|
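To give a sense of what an SSE-intrinsics version of the inner loop might look like, the sketch below transforms points by a 4x4 matrix using four-wide SSE multiplies and adds. It is an illustration only and does not reproduce GSC's code: the `Mat4Sse` type, the column-major layout, the packed (x, y, z, 1) point format, and both function names are assumptions. A full skinning routine would additionally blend several such transforms by bone weight.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (_mm_mul_ps, _mm_add_ps, ...)

// Assumed layout: column-major 4x4 matrix, one SSE register per column.
struct Mat4Sse {
    __m128 col[4];
};

// Loads 16 floats stored in column-major order (no alignment required).
inline Mat4Sse loadMat4(const float* m) {
    Mat4Sse r;
    for (int i = 0; i < 4; ++i) {
        r.col[i] = _mm_loadu_ps(m + 4 * i);
    }
    return r;
}

// Transforms `count` points stored as packed (x, y, z, w) floats.
// Each result is col0*x + col1*y + col2*z + col3*w, computed four lanes at a time.
inline void transformPointsSse(const Mat4Sse& m, const float* in, float* out,
                               std::size_t count) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        const float* p = in + 4 * i;
        __m128 r = _mm_mul_ps(m.col[0], _mm_set1_ps(p[0]));
        r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_set1_ps(p[1])));
        r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_set1_ps(p[2])));
        r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_set1_ps(p[3])));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

The column-major layout is chosen here because it lets each matrix column be scaled by a broadcast component of the point, avoiding horizontal adds. Unaligned loads and stores keep the sketch simple; an optimized version would more likely use 16-byte-aligned buffers.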
Figure 4. Intel GPA Frame Analyzer showing optimized frame.
Figure 4 shows the final results: the frame with software skinning for character rendering now has a much faster execution time. The vertex shader was also significantly simplified, reduced to just five instructions for the shadow mapping pass compared to 46 in the previous implementation.
This success story for S.T.A.L.K.E.R.: Call of Pripyat shows how game developers can use Intel® Graphics Performance Analyzers to improve the performance of games running on multiple PC configurations, including those with Integrated Graphics. With the help of Intel GPA, the GSC Game World programmers improved the frame rate in combat scenes by up to 2X!
In the meantime, GSC Game World continues to use Intel GPA in their development pipeline, and has just started using the newest release of the product, version 3.0 of Intel® GPA. Intel GPA 3.0 includes a new feature, Intel® GPA Platform View, which helps analyze multi-core performance within Intel GPA. This release also introduces many other new features, such as support for 64-bit applications and new hardware metrics on systems based on Intel® HD Graphics chipsets. GSC's graphics programmer Serguei Ivantsov said: "Our first impression is that the new version of Intel GPA is very powerful and useful. We will definitely be profiling our new engine with it. By the way, our new simple test scene with deferred shading works at 80-100 fps with Intel HD Graphics, very impressive!"
About the Author
Philipp Gerasimov is a technical consulting engineer in Intel's Advanced Visual Computing group within the Intel® Software and Services Group (SSG). Philipp works with different development groups at Intel, and with independent software vendors (ISVs), on Intel® GPA support and future graphics architecture development. He also presents at numerous game development and computer development conferences. His game performance optimization work can be seen in many modern computer games, such as Crysis*, Painkiller 2*, Call of Juarez*, and Pacific Fighters*.