Let’s look at a typical game analysis case. You want to find the main bottlenecks in a game, so you can optimize the right places in the game. Intel® Graphics Performance Analyzers (Intel® GPA) include a set of tools that help you do that.
To start, install your game and Intel GPA on your test system and Intel GPA on an analysis system. Start the Intel GPA Monitor on both systems. To help you get up and running, check out this blog about simple self-testing.
Now that Intel GPA and your game are installed and the Intel GPA Monitor is running on both systems, start the game on your test system. Start the Intel GPA System Analyzer on the analysis system. Make sure the game is running with typical settings, and capture the settings for your records. Your options, quality settings, and resolution will definitely affect how the game runs and where you’ll see bottlenecks.
We usually just take screenshots of the config menu screens. In System Analyzer, connect to the game, and bring up some simple metrics as described in the earlier blog.
A good first step is to just watch how the game runs. Get the game into steady gameplay. Watch the frame rate and CPU/GPU utilization over time, although you may need a second person to do this. A good way to do this is with the Intel GPA System Analyzer, which can show you a collection of metrics in real-time.
Do you see any large spikes of activity on any one metric? Is the frame rate consistent, or are there occasional (or worse, frequent) slow frames? Get a feeling for how your game is running overall.
As you continue playing, use the CSV button in System Analyzer to capture a CSV file with the currently-displayed metrics. Keep playing the game continuously, let the capture run for 30 seconds or so, and then stop it. This file will let you study those real-time metrics offline, which can show you patterns that you may have missed when watching them live. Now, take two more captures in System Analyzer. Each capture takes a few seconds to complete, so wait a few seconds between captures for the system to return to normal. Take a frame capture with the “camera” button, wait briefly, then take a trace capture with the red “record” button.
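As a rough illustration of studying the CSV offline, here is a minimal Python sketch that flags samples whose frame rate drops well below the median. The column names ("Time", "FPS", "GPU Busy") and the sample values are assumptions made up for this example; the columns in your file depend on which metrics you had displayed in System Analyzer.

```python
import csv
import io
import statistics

def find_slow_frames(csv_text, fps_column="FPS", threshold=0.75):
    """Return the median FPS and the (row index, fps) pairs that fall
    below threshold * median. The column name is a hypothetical default."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fps = [float(row[fps_column]) for row in rows]
    median = statistics.median(fps)
    slow = [(i, value) for i, value in enumerate(fps) if value < threshold * median]
    return median, slow

# Made-up sample data standing in for a real capture file.
sample = """Time,FPS,GPU Busy
0.0,60,85
0.5,59,86
1.0,31,97
1.5,60,84
2.0,58,85
"""

median_fps, slow = find_slow_frames(sample)
print("median FPS:", median_fps)
print("slow samples:", slow)
```

For a real capture you would read the file from disk instead of the inline string, and pick a threshold that matches how sensitive you want the check to be.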
This is also a good time to run some of the real-time experiments in System Analyzer, but we’ll talk about those another day.
You’ve just collected all the data you need, so you can exit the game.
Now that the running game won’t interfere with the analyzer tools, you can switch over to the test system to run all analysis. It’s possible to run all the analyzers with the two-system configuration, but there’s a lot of data to copy between systems, so it’s faster if you can just run the tools on the test system. Capture files are stored on the test system. (CSV files are stored on the analysis system, though, in case you’re wondering where they are.)
First, use Intel GPA Platform Analyzer to open the trace capture file you just took. Here you can confirm if the game is CPU-bound or GPU-bound. You’ll see a section for GPU Engine (Render and GPGPU). This shows the GPU activity across the capture. You can watch how a frame progresses through the GPU queue by watching for the cross-hatch marked calls in the GPU activity. Below GPU activity, you'll see Thread Lifetime, which shows the CPU activity. If there are any gaps in the GPU activity, then it is not GPU-bound. If there are gaps in the CPU activity, it is not CPU-bound.
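The gap rule above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something Platform Analyzer exports: the busy intervals, the time units, and the 5% tolerance are all assumptions for the example.

```python
def idle_fraction(busy_intervals, start, end):
    """Fraction of [start, end] not covered by any busy interval."""
    covered = 0.0
    cursor = start
    for s, e in sorted(busy_intervals):
        s, e = max(s, cursor), min(e, end)  # clip and skip overlap already counted
        if e > s:
            covered += e - s
            cursor = e
    return 1.0 - covered / (end - start)

def classify(gpu_busy, cpu_busy, start, end, tolerance=0.05):
    """Apply the gap rule: whichever side has (almost) no gaps is the limit."""
    gpu_idle = idle_fraction(gpu_busy, start, end)
    cpu_idle = idle_fraction(cpu_busy, start, end)
    if gpu_idle <= tolerance < cpu_idle:
        return "GPU-bound"
    if cpu_idle <= tolerance < gpu_idle:
        return "CPU-bound"
    return "unclear"

# Hypothetical intervals in milliseconds over two frames.
gpu_busy = [(0, 16), (16, 33)]   # GPU never idle
cpu_busy = [(0, 5), (16, 22)]    # CPU mostly waiting
verdict = classify(gpu_busy, cpu_busy, 0, 33)
print(verdict)
```

An "unclear" result means both sides show gaps (or neither does), in which case the trace deserves a closer look by eye.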
If the game is CPU-bound, you'll want to study this platform trace in more detail. This is mostly common now on the lowest-power systems; more typical laptop systems tend to be GPU-bound.
If the game is GPU-bound, open the frame capture in Frame Analyzer. First, look for any patterns in the units of work ("ergs") across the top. Are there a few long-running ones? Are there lots of little ones? Are there a lot of calls (>1500)? If there are a lot of calls, regardless of their length, the game is more likely to be CPU-bound in the game or driver. If there are a few long-running ones, the game is more likely GPU-bound. Compare what you see here with what you learned from the platform trace earlier.
Next, look for any very large ergs. Adjust the scale at the top of Frame Analyzer to show some time-based scale, if necessary. Pick the longest-running erg or ergs by clicking them, and remember them. Note how long they run. You'll come back to them in a minute. In the experiments tab on the bottom right, choose "disable ergs". This will eliminate these ergs temporarily from the frame workload, so you can get a better sense of how the rest of the frame behaves. This will also reset the scale for duration so you can get a more accurate view of how long the remaining slow ergs run. This will tell you overall how much you'll need to focus on one or a few ergs vs. many of them.
To return to the original view, click the "revert all changes" button in the top bar of the window.
Now you should use experiments to understand the slowest ergs. For any changes that you make, Frame Analyzer will apply the changes right away, re-render the frame, and show you an updated execution time in the windows on the left. This gives you quick feedback about each change. First, pick "2x2 textures" to see how much time is spent in texture lookups. It replaces all textures with a simple 2x2 texture, so lookups are very fast. Then, look at the "simple pixel shader". This will show how much time is spent doing both texture lookups and math in the shader. Compare the two results. If the simple pixel shader is a lot faster, you should look at the amount and type of math in shaders and try to simplify it.
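To make the comparison concrete, here is a small Python sketch of the arithmetic. The timings are invented, and the subtraction is only a rough estimate, since texture and math costs can overlap on real hardware; treat the numbers as a guide to where to look, not a precise breakdown.

```python
def estimate_shader_costs(baseline_us, tex2x2_us, simple_ps_us):
    """Rough split of erg time using the two experiments.

    baseline_us  : erg time with no experiments enabled
    tex2x2_us    : erg time with "2x2 textures" enabled
    simple_ps_us : erg time with "simple pixel shader" enabled
    """
    texture_us = baseline_us - tex2x2_us      # time saved by cheap lookups
    shader_us = baseline_us - simple_ps_us    # lookups plus shader math
    math_us = max(shader_us - texture_us, 0.0)
    return {"texture": texture_us, "texture_plus_math": shader_us, "math": math_us}

# Hypothetical readings, in microseconds, copied by hand from Frame Analyzer.
costs = estimate_shader_costs(baseline_us=1000.0, tex2x2_us=900.0, simple_ps_us=400.0)
print(costs)
```

In this made-up case the math estimate dwarfs the texture estimate, which would point you toward simplifying shader math first.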
Now uncheck the others and check "1x1 scissor rect", which runs a single pixel through your game's pipeline. This will show the work done to render a single pixel, which can tell you if your game is doing too much work per pixel.
It's common to find that 2x2 textures doesn't speed up the game very much. In these cases, the game can often afford to do some more texture lookups, especially if that lets you eliminate math in the shader. For cases doing too much math in the shader, look at ways to reduce this load. Perhaps data can be calculated on the CPU side and passed as a constant. Maybe it can be passed as a texture. Look for any work being done per-frame that can be reused on later frames. Find some way to offload and reduce the overall workload.
Check what resolution the game was using when the data was collected. Consider if it's the "right" resolution for the job. Any reduction in resolution will reduce the number of pixels, which will tend to soften any of the pixel limits you explored above.
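As a quick worked example of how much a resolution change reduces the pixel count, here is a short Python sketch. The two resolutions are just sample values, and real frame time will not scale perfectly with pixel count.

```python
def pixel_ratio(old, new):
    """Ratio of pixels at the new (width, height) to the old one."""
    return (new[0] * new[1]) / (old[0] * old[1])

# Example: dropping from 1920x1080 to 1600x900.
ratio = pixel_ratio((1920, 1080), (1600, 900))
print(f"{ratio:.3f} of the original pixels")
```

Here the lower resolution shades roughly 69% of the original pixels, so any purely pixel-bound work should shrink by a similar amount.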
Look at the longest running bars in the top window, and click on them. In the Shaders tab, check the GPU time of each, and select the longest running shader. Check how many execution slots are used by it; look in the Assembly or HLSL view and scroll to the bottom to see how many instruction slots were used. If it's using a lot, consider shortening the shader. To better understand the relative costs of the operations across the shader, pick "profile" to run the shader multiple times and report execution times of each line. Once you know how long the different parts of the shader take, you can look at ways to modify the shader to run faster.
Check the shader for any [const] loads, which access memory and will be slow.
Once you have studied a slower shader, you can right-click on it in the shaders tab and pick "select all ergs that use this shader". This will show you the relative impact of the shader, and give you a hint how much you may speed up the frame by modifying this shader.
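If you jot down the erg times by hand, the relative impact of a shader is just its share of the total GPU time. A minimal Python sketch, with made-up erg records and shader names:

```python
def shader_share(ergs, shader_name):
    """Fraction of total GPU time spent in ergs that use the given pixel shader."""
    total = sum(erg["gpu_us"] for erg in ergs)
    used = sum(erg["gpu_us"] for erg in ergs if erg["pixel_shader"] == shader_name)
    return used / total

# Hypothetical ergs; the names and timings are invented for illustration.
ergs = [
    {"id": 1, "pixel_shader": "ps_terrain", "gpu_us": 400.0},
    {"id": 2, "pixel_shader": "ps_tree", "gpu_us": 250.0},
    {"id": 3, "pixel_shader": "ps_terrain", "gpu_us": 350.0},
]

share = shader_share(ergs, "ps_terrain")
print(f"ps_terrain accounts for {share:.0%} of GPU time")
```

A shader with a large share is the one where a modest speedup buys the most frame time.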
As you look at individual shaders, in case it's not clear what the shader does, you can disable the erg and then study the next erg. Does the absence of this erg's output make the next erg look different? Perhaps the next erg uses it as input. This can also give you a sense of what the first erg does. For example, maybe the next erg is doing an SSAO calculation with the first erg's output.
Another way to study the frame is to switch the top bar to show render targets. If it's currently showing "erg graph", you can switch the dropdown to "render targets". More than several RTs might indicate a performance overhead. To better understand whether they're really all separate render targets, you can go to the View menu and pick "API details". In the filter box, type "SetRenderTarget". Study the results.
If there are multiple render targets set in a row, this may indicate a redundant RT. You can see the render targets used along the left edge of the main window, and compare the RTs used in the ergs vs. the API details shown for each. Study the early RTs and check what they might be doing. Are they perhaps building up shadow maps? If you scroll the RT list up to the top in the main window, you can pick the "color" view for an RT. If it's not clear what the RT contains, expand the histogram portion of the viewer, and move the slider to change the gray scale and see if patterns emerge.
Back in the API details view, look for multiple uses of the same RT number, and study when and how they're used. As you scroll through the API view, watch for redundancies.
If you find early ergs are writing to a shadow map, pick those ergs and then pick the state tab in the main window. This shows all the state settings used on this call. If there's a Z write enable, check if there's a color write enable too. These writes will usually write only a depth value. But in DirectX 9, depth values are written to color since you can't directly read from the depth buffer. In DirectX 11 you would not see this pattern so color write enable would not be set.
If you do find a large erg writing to R and B, then perhaps the other 2 channels of the target can be eliminated since they're unused. Check if the RT can be modified and experiment with that.
Check further down the state tab to see if culling is being done. If there's inconsistency, study the ones without culling set. You can look at their textures (on the Texture tab) to understand what they are. Often they'll be trees. Check if the targets have alpha and if it's used. Study the PS code for these ergs, and profile them to spot slow operations. Sometimes texkill operations can be slow here on different hardware.
Check if Z writes are enabled and if the shader contains texkill or if stencil is on.
For ergs where culling is being done, usually true for other terrain, look at the depth pass and check cull mode. Be sure there is some cull mode on the depth prepass or shadow mode.
Once you find code that does shadows, look for cascaded shadow map or shadows from different light sources. If it's a cascade, you can tell by seeing a detailed front section with complex geometry up to a certain depth. If the distant shadows are rough but close shadows are good, this usually indicates a cascade. Perhaps both are generated and a runtime decision is made which to use? Depending on the quality settings and resolution, perhaps the shadows are too detailed.
In the main window of Frame Analyzer, look at the Frame Overview, and scroll through looking for any values that seem excessive. For large numbers of Early Z Failed, perhaps there's some view frustum culling that can be added? Check where the geometry is removed and see whether a lot of geometry is lost. If so, culling should be added.
If you find an object with early Z kills but that's still having samples written, then it's partially in the scene.
Can you find any items submitted that aren't written?
In cases where there are a lot of PS invocations, that might indicate excessive draws, or it might indicate the use of lots of RTs. To better understand it, switch the top view to render target view, and show pixels rendered on the Y axis. If there's one or a few RTs that render far more pixels, perhaps they're doing too much overdraw.
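One simple way to judge overdraw is to divide the pixels rendered to an RT by the RT's area. The Python sketch below does this for a couple of invented render targets; the names, sizes, and pixel counts are assumptions, and you would fill in the values you read from the render target view.

```python
def overdraw_factor(pixels_rendered, width, height):
    """Average number of times each pixel of the RT was written."""
    return pixels_rendered / (width * height)

# Hypothetical render targets: (width, height, pixels rendered).
render_targets = {
    "RT0": (1280, 720, 3_686_400),
    "RT1": (1024, 1024, 1_048_576),
}

factors = {
    name: overdraw_factor(pixels, w, h)
    for name, (w, h, pixels) in render_targets.items()
}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```

A factor near 1 means each pixel is written about once; a much higher factor suggests the RT is a candidate for reducing overdraw.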
Study the regions of the frame as it's drawn, and figure out what each pass does. In the depth pass, look at the shaders tab and study the vertex shader(s). Study the vertex signature. For code like "decl position v0", check if the value is used in the shader. Check especially for code like a depth pass that is taking color values, since they will often be unneeded. This may help you find cases where the vertex signature is bigger than it needs to be. If position is the only part being used, then there may be extra bytes in the vertex signature that could be eliminated.
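To get a feel for how many bytes might be trimmed, you can add up the sizes of the vertex elements the shader never reads. The signature below is a made-up example, not taken from any particular game, and the element formats are assumptions.

```python
# Byte sizes for a few common vertex element formats.
SIZES = {"float2": 8, "float3": 12, "float4": 16, "ubyte4": 4}

def wasted_bytes(signature, used):
    """Return (total bytes per vertex, bytes belonging to unused elements)."""
    total = sum(SIZES[fmt] for _, fmt in signature)
    needed = sum(SIZES[fmt] for name, fmt in signature if name in used)
    return total, total - needed

# Hypothetical vertex signature fed to a depth-only pass.
signature = [
    ("position", "float3"),
    ("normal", "float3"),
    ("color", "ubyte4"),
    ("texcoord", "float2"),
]

total, wasted = wasted_bytes(signature, used={"position"})
print(f"{wasted} of {total} bytes per vertex are unused in this pass")
```

In this example two thirds of each vertex is fetched for nothing, which is the kind of case where a position-only vertex stream for the depth pass may be worth trying.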
For tree ergs, they're using alpha so they'll need texture coordinates. Check those vertex signatures for possible trimming.
If the number of samples written is high, perhaps MSAA is enabled. Try comparing captures taken with MSAA enabled and disabled, to see the impact of each. For MSAA cases that run slowly, perhaps a post-processing (PP) AA approach can be used instead?
We've looked at many different parts of studying your game's performance, and we hope that this helps you spot and fix performance bottlenecks!