GPU Metrics for Cherry Trail

This section describes all metrics accessible with Intel® GPA when analyzing OpenGL ES* workloads on Cherry Trail.

Metric Name

Description

Avg GPU Core Frequency

Represents the average GPU core frequency during the measurement period.

GPU Frequency

Represents the GPU frequency during the measurement period. The latest Intel GPUs support the Intel® Turbo Boost Technology 2.0 and can dynamically change frequency depending on CPU and GPU workloads.

GPU Busy

Represents the percentage of time when the GPU was busy processing ergs.

Examples

For GPU-bound workloads, the value of the GPU Busy metric is 100%. A value less than 100% indicates that the GPU is spending time in an idle state waiting for data from the CPU (in which case your game or application might be CPU-bound).

Improving Performance

If GPU Busy is consistently less than 100% and you are encountering performance issues, consider threading your game and using the Trace Analyzer to understand the interaction between the CPU and GPU.
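
As a rough illustration of how the GPU Busy reading above might be applied, the following Python sketch classifies a series of GPU Busy samples. The function name and the 95% cutoff are assumptions made for this example; they are not part of Intel® GPA, and a suitable cutoff depends on your workload.

```python
def classify_gpu_busy(samples, threshold=95.0):
    """Classify a workload from GPU Busy samples given as percentages (0-100)."""
    if not samples:
        raise ValueError("at least one GPU Busy sample is required")
    average = sum(samples) / len(samples)
    # Hypothetical cutoff: treat an average close to 100% as GPU-bound.
    if average >= threshold:
        return "likely GPU-bound"
    # Below the cutoff the GPU spends time idle, possibly waiting on the CPU.
    return "possibly CPU-bound"


print(classify_gpu_busy([99.0, 100.0, 98.5]))
print(classify_gpu_busy([62.0, 70.5, 58.0]))
```

A "possibly CPU-bound" result is only a hint to look at CPU and GPU interaction more closely, for example with the Trace Analyzer.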

GPU Core Clocks

Represents the total number of GPU core clocks elapsed.

GPU Duration

Represents the total GPU time for the frame, or for the selected ergs within that frame.

Graphics driver 9.17.10 introduces the notion of deferred clears. As an optimization, the driver can defer the actual rendering of a clear call in case subsequent clear and draw calls make it unnecessary. As a result, when clear calls are deferred, the Graphics Frame Analyzer shows their GPU Duration and Samples Written as zero. If it later turns out that a deferred clear call needs to be drawn, the work associated with that clear call is included in the duration of the erg that was being drawn when the clear call was deferred, which is not necessarily a clear call. This means that in the Graphics Frame Analyzer, metrics associated with a clear call accurately reflect the real work associated with that erg.

Examples

If GPU Duration is 80,000 (microseconds), the GPU spends about 80 milliseconds rendering the selected ergs.
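
The arithmetic in this example can be sketched in Python as follows. The sketch assumes that GPU Duration is reported in microseconds, as the example implies. Deriving an average frequency from GPU Core Clocks is plain arithmetic added here for illustration, not a documented Intel® GPA feature, and it assumes both metrics cover the same ergs. The function names are hypothetical.

```python
def duration_ms(gpu_duration_us):
    """Convert a GPU Duration value from microseconds to milliseconds."""
    return gpu_duration_us / 1000.0


def average_frequency_mhz(gpu_core_clocks, gpu_duration_us):
    """Estimate the average core frequency over the measured interval."""
    if gpu_duration_us <= 0:
        raise ValueError("GPU Duration must be positive")
    # Clocks per microsecond is numerically equal to MHz.
    return gpu_core_clocks / gpu_duration_us


print(duration_ms(80_000))
print(average_frequency_mhz(32_000_000, 80_000))
```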

Improving Performance

When using GPU Duration as a metric to help understand the performance of your game or application, it is important to understand the following:

  • If this value is too large, examine the underlying components of the rendering pipeline to see whether one or more of them are too complex and are therefore causing performance bottlenecks.
    Check: Pixel Shader Duration, Vertex Shader Duration, Geometry Shader Duration metrics.
  • Consider how effectively the GPU is working for the selected ergs.
    Check: GPU EUs Active, GPU EUs Stalled.

 

Primitive Count

Represents the number of rendering primitives assembled and put into the input assembly stage of the pipeline.

 

Represents the number of vertices that entered the pipeline.

 

TCS Invocations

Represents the number of Tessellation Control Shader (TCS) invocations. The Tessellation Control Shader is invoked once per patch.

Improving Performance

The Tessellation Control Shader is not usually a performance bottleneck, but it can cause performance issues further down the rendering pipeline. If the Tessellation Control Shader specifies large tessellation factors, or as the TCS Invocations value increases, the result is more work for the fixed function tessellator as well as an increased number of TES Invocations and GS Invocations.

 

TES Invocations

Represents the number of Tessellation Evaluation Shader (TES) invocations. The Tessellation Evaluation Shader is invoked once per fixed function tessellator output point.

Improving Performance

The purpose of a Tessellation Evaluation Shader is to calculate the vertex positions for subdivided points that are output by the fixed function tessellator. The best way to improve performance is to minimize the number of TES Invocations, which can be done by decreasing the amount of tessellation performed. You can do this by either decreasing the number of Tessellation Control Shader invocations or decreasing the tessellation factors in the Tessellation Control Shader.

 

Represents the number of vertex shader invocations. The vertex shader is invoked once per vertex. The number of vertex shader invocations depends on the vertex and primitive counts and on the operation of the post-transform vertex cache (VCache). In an optimal situation, the GPU fetches already-processed vertices from the cache rather than recalculating this data, which could impact the value of this metric.

 

GS Invocations

Represents the number of geometry shader invocations. The value is 0 if no geometry shader is associated with the rendering call.

Examples

If GS Per Triangle Invocations is 1000, it means that the geometry shader was invoked for 1000 primitives.

Improving Performance

The only way to minimize the number of geometry shader invocations is to minimize the number of input primitives. The impact on rendering performance of reducing the invocation count is highly dependent upon your specific game or application.

Post-GS Primitives

Represents the number of primitives that flowed out of the geometry shader (GS), if enabled, to the clipper. This metric is important if a geometry shader was associated with the selected rendering calls, and even more important if the number of primitives spawned by geometry shader code is dynamic.

NOTE

If the GS was not enabled for the selected rendering calls, the metric returns a value of 0.

Examples

If Post-GS Primitives is 1000 and Primitive Count is 100, it means that 1000 primitives were constructed in the geometry shader from the original 100.
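
One way to read these two metrics together is as an amplification factor. The Python sketch below computes it for the example values above. The function name is hypothetical and the calculation is simple arithmetic, not an Intel® GPA metric.

```python
def gs_amplification(post_gs_primitives, primitive_count):
    """Return how many primitives the geometry shader emitted per input primitive."""
    if primitive_count <= 0:
        raise ValueError("Primitive Count must be positive")
    # A result of 0.0 is expected when no geometry shader is enabled,
    # because Post-GS Primitives is then reported as 0.
    return post_gs_primitives / primitive_count


print(gs_amplification(1000, 100))
```

A factor that varies widely between frames suggests that the number of primitives spawned by the geometry shader is dynamic.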

Clipper Invocations

Represents the number of primitives processed by the clipper.

 

Clipper Primitives

Represents the number of primitives that flowed out of the clipper. The metric includes original primitives that passed the trivial clipping test (trivial accept) and new primitives that were created by the clipper as a result of the clipping operation.

Examples

  • If you render 100 triangles, clipping is enabled, and all the triangles are trivially accepted, Clipper Primitives is 100.
  • If you render 100 triangles, clipping is enabled, and all the triangles are trivially rejected, Clipper Primitives is 0.
  • If you render 100 triangles, clipping is enabled, and one or more triangles are partially located within the viewing frustum, Clipper Primitives returns a value that could be more or less than 100, depending on the number of triangles that were clipped. If the value is significantly higher than 100, many triangles were partially clipped and the clipper created additional triangles.
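
The three cases above can be summarized in a small Python sketch. The function name and the returned labels are hypothetical and only illustrate how the value might be read. An output equal to the input is consistent with trivial accept but does not prove it, because rejected and newly created triangles could cancel out.

```python
def describe_clipper_result(submitted, clipper_primitives):
    """Interpret Clipper Primitives relative to the number of primitives submitted."""
    if submitted < 0 or clipper_primitives < 0:
        raise ValueError("counts must be non-negative")
    if clipper_primitives == 0:
        return "nothing left the clipper"
    if clipper_primitives > submitted:
        return "clipper created additional primitives"
    if clipper_primitives < submitted:
        return "some primitives were rejected"
    return "consistent with trivial accept"


print(describe_clipper_result(100, 100))
print(describe_clipper_result(100, 0))
print(describe_clipper_result(100, 130))
```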

Improving Performance

In most cases you do not need to worry about clipper performance on Intel® HD Graphics 2000/3000 GPUs because these graphics processors implement an efficient clipping algorithm in silicon.

Early Hi-Depth Test Fails

Represents the total number of pixels dropped on the early hierarchical depth test.

Early Depth Test Fails

Represents the number of pixels that failed the early depth/stencil tests.

 

Represents the number of fragment shader invocations.

 

Represents the total number of samples dropped in fragment shaders.

Samples Blended

Represents the total number of blended samples or pixels written to all render targets.

Samples Written

Represents the number of pixels/samples written to render targets.

Graphics driver 9.17.10 introduces the notion of deferred clears. As an optimization, the driver can defer the actual rendering of a clear call in case subsequent clear and draw calls make it unnecessary. As a result, when clear calls are deferred, the Graphics Frame Analyzer shows their GPU Duration and Samples Written as zero. If it later turns out that a deferred clear call needs to be drawn, the work associated with that clear call is included in the duration of the erg that was being drawn when the clear call was deferred, which is not necessarily a clear call. This means that in the Graphics Frame Analyzer, metrics associated with a clear call accurately reflect the real work associated with that erg.

 

CS Invocations

Represents the number of compute shader invocations. The Compute Shader is invoked once per thread per thread group. The number of threads per thread group is defined by the Compute Shader’s numthreads attribute (numthreads(tX, tY, tZ)). The number of thread groups executed is determined by the parameters to the Dispatch call (Dispatch(gX, gY, gZ)). CS Invocations is equal to (gX*gY*gZ)*(tX*tY*tZ).

Examples

  • If the numthreads attribute is numthreads(4, 4, 1) and Dispatch is called as Dispatch(16, 16, 16), the CS Invocations value is equal to (16*16*16)*(4*4*1) = 65536.
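
The formula above can be checked with a short Python sketch. The function name and the tuple-based parameters are chosen for illustration only and do not correspond to any Intel® GPA or graphics API call.

```python
def cs_invocations(groups, threads_per_group):
    """Compute (gX*gY*gZ)*(tX*tY*tZ) for a single Dispatch call."""
    g_x, g_y, g_z = groups               # arguments passed to Dispatch
    t_x, t_y, t_z = threads_per_group    # values of the numthreads attribute
    return (g_x * g_y * g_z) * (t_x * t_y * t_z)


# Dispatch(16, 16, 16) with numthreads(4, 4, 1)
print(cs_invocations((16, 16, 16), (4, 4, 1)))
```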

GTI Read Throughput

Represents the total number of GPU memory bytes read from GTI.

GTI Write Throughput

Represents the total number of GPU memory bytes written to GTI.

EU Active %

Represents the percentage of time during which the GPU execution units (EUs) were actively executing pixel, geometry, or vertex shader instructions.

Examples

If EU Active % is 80, it means that EUs were active for 80% of the rendering time for the selected ergs.

Improving Performance

If the EUs are not active, it means that they are either stalled waiting for a request to be fulfilled, or idle. You can see how much of the non-active time is caused by stalls by examining the EU Stall % metric. If the total EU busy time (EU Active % + EU Stall %) is significantly lower than 100%, this indicates that there are stalls elsewhere in the rendering pipeline.
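
As a minimal sketch of this breakdown, the following Python function derives the remaining idle share from EU Active % and EU Stall %. It assumes that active, stalled, and idle time together account for 100% of the measured time. The function name is hypothetical.

```python
def eu_idle_percent(eu_active, eu_stall):
    """Return the share of time the EUs were neither active nor stalled."""
    busy = eu_active + eu_stall
    if not 0 <= busy <= 100:
        raise ValueError("EU Active % + EU Stall % must be between 0 and 100")
    # Whatever is not active or stalled is treated as idle time.
    return 100.0 - busy


print(eu_idle_percent(80, 15))
```

A large result points to a bottleneck outside the execution units.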

EU Stall %

Represents the percentage of time during which the GPU execution units (EUs) were stalled. An EU becomes stalled when all of its threads are waiting for results from fixed function units (for example, a pixel shader requests texels from the texture sampler).

Examples

  • If EU Stall % is 50, it means that EUs were stalled for 50% of the rendering time for the selected ergs.
  • If EU Stall % is 0, it means that there were no stalls in the EUs or that the stall time was very small.

Improving Performance

If this metric is unexpectedly high, especially when compared with the EU Active % metric, you can analyze where the stalls happen by looking at the per-shader stall metrics (such as VS EU Stall and FS EU Stall). If these metrics show that most of the stall time is in one particular shader, examine your shader code to determine why this shader would be causing the EUs to stall.
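
The comparison described above can be sketched in Python as follows. The function name and the dictionary layout are assumptions for this example, and the metric values are made up for illustration.

```python
def dominant_stall(stalls):
    """Return the per-shader stall metric with the largest value and its share."""
    if not stalls:
        raise ValueError("at least one per-shader stall metric is required")
    name = max(stalls, key=stalls.get)
    total = sum(stalls.values())
    # Share of the summed per-shader stall time attributed to this shader.
    share = stalls[name] / total if total else 0.0
    return name, share


# Illustrative values only, not measured data.
print(dominant_stall({"VS EU Stall": 5.0, "FS EU Stall": 35.0}))
```

A share close to 1.0 suggests concentrating on that one shader first.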

VS EU Active

Represents the percentage of overall GPU time that the execution units (EUs) were actively executing Vertex Shader instructions.

Examples

  • If VS EU Active is 50%, half of the overall GPU time was spent actively executing Vertex Shader instructions.
  • If VS EU Active is 0%, no Vertex Shader was associated with the selected draw calls, or the amount of time actively executing Vertex Shader instructions was negligible.

Improving Performance

  • This metric is important if vertex processing seems to be a bottleneck for selected rendering calls. If VS EU Active accounts for most of the EU active time, then to improve performance you should simplify the vertex shader or simplify and optimize the geometry of your primitives.
  • If VS EU Active is significant, examine your vertex shader code to find what makes it expensive to execute.

VS EU Stall

Represents the percentage of overall GPU time that the execution units (EUs) were stalled in Vertex Shader instructions.

NOTE

This metric does not include the total amount of time stalled in the vertex shader, but only the fraction of the time when vertex shader stalls were causing the entire EU to stall. The entire EU stalls when all of its threads are stalled.

Examples

  • If VS EU Stall is 50%, it means that half of the overall GPU time was spent stalled on Vertex Shader instructions.
  • If VS EU Stall is 0%, it means that no Vertex Shader was associated with the selected rendering calls or that Vertex Shader threads were not causing EU stalls.

Improving Performance

  • This metric is important if vertex processing seems to be the bottleneck for selected rendering calls. If VS EU Stall accounts for most of the EU stall time, then to improve performance you might need to simplify the vertex shader or simplify and optimize geometry.
  • If VS EU Stall is significant, you need to concentrate on vertex shader code to find the reasons that are causing stalls.

VS FPU0 Pipe Active

Represents the percentage of time in which the execution unit (EU) FPU0 pipeline was actively processing a vertex shader instruction.

VS FPU1 Pipe Active

Represents the percentage of time in which the execution unit (EU) FPU1 pipeline was actively processing a vertex shader instruction.

VS Send Pipe Active

Represents the percentage of time in which the execution unit (EU) send pipeline was actively processing a vertex shader instruction.

 

FS EU Active

Represents the percentage of overall GPU time that the execution units (EUs) were actively executing Fragment Shader instructions.

Examples

  • If FS EU Active is 50%, it means that half of the overall GPU time was spent actively executing Fragment Shader instructions.
  • If FS EU Active is 0%, it means that no Fragment Shader was associated with the selected draw calls, or that the amount of time actively executing Fragment Shader instructions was negligible.

Improving Performance

  • This metric is important if fragment shading seems to be the bottleneck for selected rendering calls.
  • If FS EU Active accounts for most of the EU active time, then to improve performance you might need to simplify the fragment shader.
  • If FS EU Active is larger than you would expect and you are encountering slow rendering times, examine the fragment shader code for expensive operations that keep the EUs busy.

 

FS EU Stall

Represents the percentage of overall GPU time that the execution units (EUs) were stalled in Fragment Shader instructions.

NOTE

This metric does not show total amount of stalled time in the fragment shader, but only the fraction of time when fragment shader stalls caused the entire EU to stall. The entire EU stalls when all of its threads are stalled.

Examples

  • If FS EU Stall is 50%, it means that half of the overall GPU time was spent stalled on Fragment Shader instructions.
  • If FS EU Stall is 0%, it means that no Fragment Shader was associated with selected rendering calls or Fragment Shader threads were not causing EU stalls.

Improving Performance

  • This metric is important if fragment shading seems to be the bottleneck for selected rendering calls. If FS EU Stall accounts for most of the EU stall time, then to improve performance you might need to simplify the fragment shader.
  • If FS EU Stall is larger than you expect and you are encountering slow rendering times, you need to concentrate on fragment shader code to find reasons for these stalls.

 

FS FPU0 Pipe Active

Represents the percentage of time in which the execution unit (EU) FPU0 pipeline was actively processing a fragment shader instruction.

FS FPU1 Pipe Active

Represents the percentage of time in which the execution unit (EU) FPU1 pipeline was actively processing a fragment shader instruction.

FS Send Pipe Active

Represents the percentage of time in which the execution unit (EU) send pipeline was actively processing a fragment shader instruction.

 

Represents the percentage of time in which both execution unit (EU) FPU pipelines were actively processing.

Samplers Bottleneck

Represents the percentage of time that the texture sampler is a bottleneck. During this time, the sampler stalls execution units (EUs) because its input FIFO is full, starves EUs because results are not yet available, or both.

NOTE

This metric is unreliable when protected HD media content is being played back on a system with an Intel® HD Graphics 5000/4600/4400/4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples

If Samplers Bottleneck is 90, then the texture sampler is a bottleneck (stalling some EUs and/or causing other EUs to idle) 90% of the time.

Improving Performance

The following techniques could improve the texture sampler performance:

  • Reducing the size of textures by using a lower resolution or lower color precision (such as RGBA4444 instead of RGBA8888).
  • Using texture compression to reduce the amount of memory to transfer textures.
  • Using mipmapping, so that smaller textures (mipmaps) can be used.
  • Reducing the number of textures in the scene.
  • Using a different filtering algorithm. For example, anisotropic filtering is more expensive to compute than a simpler algorithm, such as bilinear filtering.

To help minimize overhead in this area, capture a typical frame while the game is running, use this frame as input to the Graphics Frame Analyzer, and try one or more of the following techniques:

  • Run the 2x2 Textures experiment in the Experiments tab to see if textures are a bottleneck.
  • Check the Texture tab to see the texture size, format, and mip level.

NOTE

This metric might show incorrect results and will be disabled with the next driver update.

Samplers Busy

Represents the percentage of time the texture sampler was busy handling texel fetch requests (that is, was either active or stalled).   

NOTE

This metric is unreliable when protected HD media content is being played back on a system with an Intel® HD Graphics 5000/4600/4400/4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples

  • If Samplers Busy is 50, it means that the texture sampler was busy for 50% of the rendering time for the selected ergs.
  • If Samplers Busy is 0, it means that the texture sampler was not used or that the time when it was busy was very small.

Improving Performance

When the texture sampler is busy, this might lead to execution unit stalls, especially if texture fetch latency is not hidden by mathematical instructions executing in parallel (the shader compiler attempts to optimize shader code to cover such latencies). Examine the EU Stall % metric to see the amount of EU stalls. If the percentage is high and Samplers Busy is close to 100%, most likely you have a texturing bottleneck. Try the 2x2 Textures experiment to see if this is the case.
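
The check described above can be expressed as a small Python sketch. The function name and both thresholds are hypothetical values picked for illustration; the text above only says "high" and "close to 100%", so tune them for your workload.

```python
def likely_texture_bottleneck(eu_stall, samplers_busy,
                              stall_threshold=50.0, busy_threshold=90.0):
    """Flag a probable texturing bottleneck from two percentage metrics."""
    # Both conditions must hold: the EUs stall a lot AND the sampler is nearly saturated.
    return eu_stall >= stall_threshold and samplers_busy >= busy_threshold


print(likely_texture_bottleneck(eu_stall=65.0, samplers_busy=97.0))
print(likely_texture_bottleneck(eu_stall=65.0, samplers_busy=40.0))
```

A positive result is a prompt to run the 2x2 Textures experiment, not a confirmation of the bottleneck.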

Sampler Texels

Represents the number of texels returned from the texture sampler.

NOTE

This metric is unreliable when protected HD media content is being played back on a system with an Intel® HD Graphics 5000/4600/4400/4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples

If Sampler Texels is 1000, it means that 1000 texels were delivered to execution units (EUs) from the texture sampler.

Improving Performance

A high number of texels fetched from textures leads to a higher texture bandwidth and a higher number of texture sampler unit stalls, which might cause a high number of EU stalls caused by shaders awaiting texels from the sampler unit.

Note that this metric could indicate that the shader stalls while fetching texture data inside branching logic. For example, if the shader fetches texture samples only inside an if() block in the code, this metric can help you understand how often the shader takes the branch.

NOTE

This metric is accurate only to four texels, and generally is slightly larger than the actual number of texels used. This is because the texture sampler returns data in 2x2 texel quads. When sampling along angular edges, this inaccuracy becomes more pronounced.

EU Idle

Represents the percentage of time when the GPU execution units (EUs) were idle. An EU is idle when it is neither actively executing shader instructions nor stalled trying to execute shader instructions.

Examples

  • If EU Idle is 50, it means that EUs were idle for 50% of the rendering time for the selected ergs.
  • If EU Idle is 0, it means that the EUs were either active or stalled for the entire duration of the rendering time for the selected ergs.

Improving Performance

If EU Idle is significantly higher than 0%, this indicates that there are stalls elsewhere in the rendering pipeline.

 
