Section 8: Prejudice Case Study

Abstract

This case study demonstrates how the performance of Section 8: Prejudice was increased 1.17x, from 24FPS to 28FPS, on the 2nd Generation Intel® Core™ processor family (codenamed Sandy Bridge). Intel engineers analyzed the game's built-in benchmark scene using the Intel® Graphics Performance Analyzers (Intel® GPA). Our mission was to reach 25FPS, and we had only two weeks to do it. Bottlenecks were identified in the terrain rendering and post processing effects. Intel engineers wrote up the analysis and shared it with TimeGate Studios engineers, who were able to remove some of the bottlenecks. Turning off ambient occlusion (AO) was sufficient to bring the performance up to a playable 28FPS.

Introduction/Goal

The 2nd Generation Intel® Core™ processor family couples powerful CPU and graphics performance into a single chip. These processors are ideally suited for playing games on mainstream machines. PC game developers can tune their games for these popular processors, as TimeGate Studios* did with Section 8: Prejudice*.

Section 8: Prejudice* includes a built-in benchmark, showing a fly through of a small scene in which 10 soldiers are battling each other (Figure 1). The scene shows off many of the effects that a player would see in any typical game level. This scene was chosen because it offered an exactly replicable scene in which to test any changes.

The resolution chosen to profile was 1280x800, as this represents the typical first-run experience. The base settings are referred to as “medium”, and consist of all effects sliders at level 3 (Figure 2).

The minimum frame rate for playability for this particular title is 25FPS. To achieve this on a 2nd Generation Intel® Core™ processor, Intel engineers worked with TimeGate* to identify draw calls that were expensive relative to their visual effect, and then to change the game and its settings so that these calls were modified or removed.


Figure 1: The benchmark scene

 


Figure 2: Default, "medium" settings

 

Analysis with Intel® Graphics Performance Analyzer

The first time the benchmark level was run, the results were as follows:


Figure 3: The original results

Intel® GPA was used to determine exactly what the scene’s bottlenecks were:


Figure 4: The test scene with Intel® GPA running

A capture of this frame was taken and analyzed in Frame Analyzer as shown in Figure 5. Changing the X axis to GPU Duration and Y axis to GPU Breakdown gives a better understanding of costly draw calls. GPU Breakdown sets the height of each call to GPU duration and colors each call based on time spent in each stage: Blue=>fixed function (e.g. clipper), White=>vertex shader, and Black=>pixel shader.


Figure 5: An overview of all the draw calls in the scene

The tallest bars in Figure 5 represent the most expensive draw calls in the scene, which then became the focal points for deeper analysis.
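Picking out the tallest bars amounts to sorting the draw calls by GPU duration and looking at each call's share of the frame. The sketch below illustrates that idea only; the `DrawCall` type and the durations are hypothetical and are not taken from the Frame Analyzer capture or from any Intel® GPA API.

```python
from dataclasses import dataclass


@dataclass
class DrawCall:
    index: int      # position of the call within the frame (hypothetical)
    gpu_us: float   # GPU duration in microseconds (hypothetical)


def most_expensive(calls, top_n):
    """Return (index, share of total GPU time) for the top_n longest calls."""
    total = sum(c.gpu_us for c in calls)
    ranked = sorted(calls, key=lambda c: c.gpu_us, reverse=True)[:top_n]
    return [(c.index, c.gpu_us / total) for c in ranked]


# Made-up durations, only to show the shape of the analysis.
frame = [
    DrawCall(0, 400.0),
    DrawCall(1, 2500.0),
    DrawCall(2, 100.0),
    DrawCall(3, 1800.0),
    DrawCall(4, 200.0),
]
top = most_expensive(frame, 2)
for index, share in top:
    print(f"draw call {index}: {share:.1%} of GPU time")
```

With these invented numbers, two of the five calls account for most of the frame, which is the kind of pattern that makes a handful of calls worth studying individually.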

Multiple overrides are available with GPA. These overrides alter the render pipeline behavior, and can help illuminate specific areas of the GPU workload that are a problem. Intel® engineers used 5 overrides in their analysis: Null Hardware, Disabled Draw Calls, 2x2 Textures, Simple PS and 1x1 Scissor Rect. The effect of each of these overrides is described below.

Null Hardware disables the hardware, but not the driver overhead. Using this override helps determine if the hardware is the bottleneck. If the frame rate improves significantly with Null Hardware, it means that the driver was waiting for the hardware to finish the rendering. Disabled Draw Calls helps determine if the driver is the bottleneck instead of the hardware. When Disabled Draw Calls is on, the draw calls are never even passed to the driver. The driver just returns as soon as anything is called. If this yields a significant performance benefit, it tends to mean that the driver is the bottleneck. For this code, Null Hardware did not provide a significant performance benefit.

2x2 Textures is an override to determine if texture sampling is one of the main bottlenecks. Textures in games are often very large. This can add a significant amount of work to the texture sampler that has to load the giant texture from memory. 2x2 Textures reduces that load by effectively replacing all textures, no matter the size, with a simple 2x2 texture that is much faster to sample. If this yields a performance benefit, it means that texture sampling is one of the slow aspects of rendering.

Simple PS replaces every pixel shader in the game with one that simply returns a color. This is the smallest, simplest pixel shader possible. Most games are pixel bound, and this override can help determine how much of a bottleneck pixel shading is.

1x1 Scissor Rect is an override that removes pixel processing from the pipeline. On Intel® HD Graphics, this involves removing all pixels except for one pixel in the upper left hand corner. All pixel work (sampling, etc.) still happens, but for only one pixel. This helps determine if pixel shader complexity is a problem. This behavior is hardware/driver specific, so the results may vary when using anything other than Intel® HD Graphics. See the Intel® GPA Documentation for more information.

The Null Hardware override and hardware metrics are only available with Intel® HD Graphics.

The increase in FPS from 2x2 Textures, Simple PS, and 1x1 Scissor Rect directs our attention to the pixel shader work, with texture resolution, loading, and sampling likely to be a major contributor.
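The reasoning behind the overrides can be expressed as a small comparison: measure the frame rate with each override, compute the gain over the unmodified frame, and treat the overrides with large gains as pointers to the bottleneck. The sketch below assumes an arbitrary 10% threshold and uses invented frame rates; they are not the values shown in Figure 6.

```python
def override_gains(baseline_fps, override_fps):
    """Relative FPS gain of each override over the unmodified frame."""
    return {name: fps / baseline_fps - 1.0 for name, fps in override_fps.items()}


def significant(gains, threshold=0.10):
    """Overrides whose gain exceeds the threshold, largest gain first."""
    hits = [name for name, gain in gains.items() if gain > threshold]
    return sorted(hits, key=lambda name: gains[name], reverse=True)


# Invented numbers for illustration; these are NOT the values in Figure 6.
baseline = 24.0
measured = {
    "2x2 Textures": 30.0,
    "Simple PS": 48.0,
    "1x1 Scissor Rect": 60.0,
}
gains = override_gains(baseline, measured)
suspects = significant(gains)
print(suspects)
```

The threshold is a judgment call. A stricter value simply narrows the list to the overrides with the largest effect.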


Figure 6: GPU Overrides

*Null hardware override is applied for 2 seconds

 

Most expensive draw calls

The single most expensive draw call in the scene is a post processing effect that implements a light-scattering effect on the scene, helping to “fade” further away objects:


Figure 7: The “fade” draw call selected in Frame Analyzer


Figure 8: The post processing effect being applied to the scene

In Figure 8, only changed pixels are shown with any color - unaffected pixels are shown in black. Figure 8 demonstrates that the impact of this draw call is relatively small for its GPU duration cost. Given the low visual impact on the final frame that this draw call has, removing this step could be a potential performance improvement with minimal visual side effects.

The next two most expensive draw calls are both rendering the terrain. Terrain rendering is often a very expensive bottleneck in games due to the resolution and size of the terrain.


Figure 9: The two expensive terrain draw calls selected

 


Figure 10: The first terrain draw call

 


Figure 11: The second terrain draw call

Figures 10 and 11 show the terrain rendering. These calls are expensive because rendering the terrain involves vertex processing of a large number of vertices and a heavy amount of texture sampling and compositing in the pixel shader.

The fourth most expensive draw call is an ambient occlusion pass. Ambient occlusion adds small but high detail lighting to the nooks and crannies in the scene.


Figure 12: The ambient occlusion draw call selected

 


Figure 13: The ambient occlusion that is applied to the scene.

Figure 13 demonstrates how much of the scene is modified by the ambient occlusion pass. This was the first pass found that took a substantial amount of GPU time without making a dramatic difference to graphics quality. This draw call was ultimately eliminated to improve performance while sacrificing little visual quality.

Beyond single calls, the most expensive set of draw calls is the post processing, which in total accounts for 25.6% of the frame time. Each call included a render target change. Render target changes aren't particularly costly, but they aren't free: every change comes with some overhead. When working with multiple render targets, redundant rebinding of render targets should be avoided.
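One common way to avoid redundant rebinding is to remember the currently bound render target and skip any bind request that would not change it. The following is a minimal sketch of that idea, not the game's code: `RenderTargetCache` and the `set_render_target` callable are hypothetical, and the callable stands in for whatever function the renderer uses to bind a target through the graphics API.

```python
_UNSET = object()  # sentinel meaning "nothing has been bound yet"


class RenderTargetCache:
    """Forwards a bind to the API only when the target actually changes."""

    def __init__(self, set_render_target):
        self._set = set_render_target  # hypothetical callable doing the real bind
        self._current = _UNSET
        self.binds = 0     # binds forwarded to the API
        self.skipped = 0   # redundant binds filtered out

    def bind(self, target):
        if target == self._current:
            self.skipped += 1
            return
        self._set(target)
        self._current = target
        self.binds += 1


# Usage: record the binds that would reach the API.
api_calls = []
cache = RenderTargetCache(api_calls.append)
for target in ["scene", "scene", "bloom", "bloom", "scene"]:
    cache.bind(target)
print(cache.binds, cache.skipped)
```

A cache like this only removes binds that repeat the current target. Reordering passes so that work on the same target is grouped together is a separate, larger change.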


Figure 14: All the post processing effects selected.

 

Execution Unit Usage

Intel® GPA allows you to see the performance metrics of the execution units (EUs) in both the vertex and pixel shader stages of the rendering pipeline. The metrics were collected for each of the most expensive draw calls:


Figure 15: Execution unit activity for expensive draw calls

The pixel shaders are stalled so much because they're waiting on results from the texture samplers. This indicates that the game might be texture bound, and further performance improvement may be possible by reducing texture resolution, using a more compact texture format, or reducing the total number of textures sampled. In contrast, the post processing effects are relatively independent of the scene, and can be tweaked much more easily.
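Since post processing was measured at 25.6% of the frame time, it is possible to put a rough ceiling on what tweaking it could gain. The sketch below applies the usual "remove a fraction of the work" bound. It assumes the frame rate is limited by GPU time and that the removed work does not overlap with anything else, which is a simplification.

```python
def max_speedup(fraction_removed):
    """Upper bound on speedup if this fraction of frame time disappears entirely."""
    return 1.0 / (1.0 - fraction_removed)


post_processing_share = 0.256          # from the frame capture analysis
bound = max_speedup(post_processing_share)
fps_ceiling = 24.0 * bound             # starting from the 24FPS baseline

print(f"at most {bound:.2f}x, about {fps_ceiling:.1f} FPS")
```

In practice only some effects can be removed without hurting image quality, so the achievable gain is smaller than this bound.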

Pixel Shaders

Here is a list of all of the pixel shaders that take over 2% of the final frame time:


Figure 16: Pixel Shader Distribution

The most expensive of these pixel shaders is used only 14 times and has very simple assembly. This shader appears to just apply post-processing effects to the final frame buffer. Because of this, all the textures it samples are high-resolution and heavily tax the texture samplers.


Figure 17: The draw calls using the most expensive pixel shader

    //
    // Generated by Microsoft (R) HLSL Shader Compiler 9.26.952.2844
    //
    // Parameters:
    //
    //   float4 TextureComponentReplicateAlpha;
    //   sampler2D _Texture;
    //
    //
    // Registers:
    //
    //   Name                           Reg   Size
    //   ------------------------------ ----- ----
    //   TextureComponentReplicateAlpha c0       1
    //   _Texture                       s0       1
    //

        ps_3_0
        dcl_texcoord v0.xy
        dcl_texcoord1 v1
        dcl_2d s0
        texld r0, v0, s0
        dp4 r0.w, r0, c0
        mul oC0, r0, v1

    // approximately 3 instruction slots used (1 texture, 2 arithmetic)

Figure 18: The most expensive pixel shader
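To make the three instructions in Figure 18 concrete, the sketch below emulates them on the CPU for a single pixel. It is an illustration only: the function names are hypothetical, the texture fetch is replaced by a texel passed in directly, and the reading of `TextureComponentReplicateAlpha` as a channel-selection mask is a guess based on its name.

```python
def dot4(a, b):
    """Four-component dot product, as computed by dp4."""
    return sum(x * y for x, y in zip(a, b))


def pixel_shader(texel, c0, v1):
    """Emulate the shader for one pixel.

    texel: RGBA value that 'texld r0, v0, s0' would fetch from _Texture
    c0:    TextureComponentReplicateAlpha constant
    v1:    interpolated texcoord1 input, used here as a per-pixel multiplier
    """
    r0 = list(texel)
    r0[3] = dot4(r0, c0)                          # dp4 r0.w, r0, c0
    return tuple(a * b for a, b in zip(r0, v1))   # mul oC0, r0, v1


texel = (0.5, 0.25, 0.125, 1.0)

# c0 = (0, 0, 0, 1) keeps the sampled alpha unchanged.
unchanged = pixel_shader(texel, (0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0))

# c0 = (1, 0, 0, 0) copies the red channel into alpha before the multiply.
red_as_alpha = pixel_shader(texel, (1.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 0.5))

print(unchanged, red_as_alpha)
```

The arithmetic per pixel is trivial, which is consistent with the observation that the cost of this shader comes from texture sampling rather than from shader complexity.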

 

Texture Resolution

Because there is so much expensive texture sampling, additional analysis was done after reducing the texture detail. With lower resolution textures, the sampling should become faster. Using the in-game graphics configuration, texture detail was brought down to 2 (from 3) while all other settings stayed the same.


Figure 19: Lower texture detail settings

Unfortunately, this did not give a very substantial difference in the frame rate. This is because lowering the texture detail only affects the textures on the objects. Post-processing is where the largest amount of texture work occurs, but these textures don’t decrease in resolution or detail.


Figure 20: The performance results of lower texture detail

Further analysis was done with modified culling. Culling was made more aggressive, but this didn't change performance at all. This is because the scene is very small and everything was in view at once, so nothing could be culled.

When TimeGate* engineers interpreted the Intel® analysis, they recommended disabling the ambient occlusion effect. This was done with a command-line option, “DisableAO”. The settings were reverted to Medium before the performance metrics were recorded again.


Figure 21: The performance with Ambient Occlusion disabled

 


Figure 22: Frame capture with AO off

 

The performance improved by 17%, bringing the title above the 25FPS target, with almost no perceivable difference in visual quality.
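Using the rounded frame rates quoted in the abstract (24FPS before, 28FPS after), the gain can be worked out both as a speedup and as frame time saved. The short calculation below is a worked example under that assumption; the measured values in the figures may carry more decimal places.

```python
def speedup(before_fps, after_fps):
    """Ratio of the new frame rate to the old one."""
    return after_fps / before_fps


def frame_time_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


before, after = 24.0, 28.0
gain = speedup(before, after)
saved_ms = frame_time_ms(before) - frame_time_ms(after)
saved_share = saved_ms / frame_time_ms(before)

print(f"{gain:.2f}x faster, {saved_ms:.2f} ms saved per frame "
      f"({saved_share:.1%} of the original frame time)")
```

The frame rate rises by roughly 17% while the frame time falls by roughly 14%. Both numbers describe the same change, so it helps to state which one is meant when quoting a percentage.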


Figure 23: The scene with ambient occlusion off


Figure 24: The original scene

Conclusion

Intel® GPA was crucial in determining the actual bottlenecks, and in confirming that removing these calls wouldn’t negatively impact the scene. When those results were shared with TimeGate* engineers, they very quickly offered suggestions about how to improve the performance. Although disabling one post processing effect is a relatively small change, that doesn’t make it any less impactful. A 17% boost in frame rate with no significant impact on the final quality is a very good outcome.

 

About the Author

Kyle Weicht is a software engineer in the Intel Software and Services Group, where he supports Intel graphics solutions in the Developer Relations Division. He holds a B.S. in Game Development from Full Sail University.

 

Appendix A: System Information

Hardware

Software

Game Configuration

Appendix B: Tools

  • Intel® Graphics Performance Analyzer 4.0

Appendix C: Additional GPU Counters
