User Guide

  • 2020.3
  • 07/10/2020
  • Public Content

This section describes all GPU metrics accessible from Intel® GPA.
The table below provides an overview of all GPU metrics available for Intel GPUs starting from the 3rd Generation Intel Core Processors.
NOTE
For DirectX 11 targets, metrics are collected for the application being profiled. For DirectX 12 targets, metrics are collected system-wide, including all running applications. While profiling a DirectX 12 application, it is recommended to stop all other running graphics applications.
 
NOTE
For products formerly named Kaby Lake G, see GPU metrics description at https://gpuperfapi.readthedocs.io/en/latest/counters.html.
 
Metric Name
Description
Main Metrics
GPU Duration
Represents the total GPU time for the frame, or for the selected event for Graphics Frame Analyzer within that frame.
Examples
If
GPU Duration
is 80,000, it means that the GPU spends around 80 milliseconds to render the selected ergs.
Improving Performance
When using
GPU Duration
as a metric to help analyze the performance of your game or application, it is important to understand the following:
  • If this value is too large, examine the underlying components of the rendering pipeline to see if one or more of these areas are too complex, resulting in potential performance bottlenecks. Check:
    Pixel Shader Duration
    ,
    Vertex Shader Duration
    ,
    Geometry Shader Duration
    metrics.
  • How effectively is the GPU working for the selected ergs? Check:
    GPU EUs Active
    ,
    GPU EUs Stalled
    .
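The first check above amounts to comparing each stage duration with GPU Duration. The following is a minimal sketch, assuming the values are reported in microseconds (which the 80,000 ≈ 80 milliseconds example implies) and have been copied out of the tool by hand. The function name and the sample readings are hypothetical and are not part of Intel GPA.

```python
def stage_shares(gpu_duration_us, stage_durations_us):
    """Return (stage, milliseconds, percent of GPU Duration), largest first."""
    if gpu_duration_us <= 0:
        raise ValueError("GPU Duration must be positive")
    rows = [
        (name, us / 1000.0, 100.0 * us / gpu_duration_us)
        for name, us in stage_durations_us.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical readings for one selection, in microseconds.
readings = {
    "Pixel Shader Duration": 52_000,
    "Vertex Shader Duration": 12_000,
    "Geometry Shader Duration": 4_000,
}
for name, ms, pct in stage_shares(80_000, readings):
    print(f"{name}: {ms:.1f} ms ({pct:.0f}% of GPU Duration)")
```

The stage durations are approximations, so treat the shares as a rough ranking of where to look first rather than an exact budget.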
GPU Frequency
Represents the average GPU core frequency during the measurement period. The latest Intel GPUs support the Intel® Turbo Boost Technology 2.0 and can dynamically change frequency depending on CPU and GPU workloads.
Examples
For Intel® HD Graphics 3000, the
GPU Frequency
increases to its maximum frequency when a heavy GPU load occurs.
Improving Performance
Typically the system automatically adjusts the
GPU Frequency
to optimize total system performance between the CPU and the GPU.
When running the System Analyzer HUD, if the GPU frequency is always at its peak value for a particular system configuration, this could indicate that your system is GPU bound. If the GPU frequency is always at the lower end of the range, this could indicate that you are CPU bound, that the GPU is not being fully utilized, or both.
When running the Graphics Frame Analyzer, currently this metric does not provide an accurate measure of GPU performance, since the CPU is not being utilized as it would be during the running of your game when the frame was captured.
Note
If the Intel graphics device supports multiple GPU frequencies, to minimize variation in metric values the Graphics Frame Analyzer locks the GPU at the maximum frequency available.
Avg GPU Core Frequency, MHz
Represents the average GPU Core Frequency in the measurement.
GPU Core Clocks
Represents the total number of GPU core clocks elapsed during the measurement period.
GPU Busy
Represents the percentage of time when the GPU is busy.
Examples
For GPU-bound workloads, the value of the
GPU Busy
metric is 100%. A value less than 100% indicates that the GPU is spending time in an idle state, waiting for data from the CPU, in which case your game or application might be CPU-bound.
Improving Performance
If
GPU Busy
is consistently less than 100% and you are encountering performance issues, consider threading your game and using the Graphics Trace Analyzer to understand the interaction between the CPU and GPU.
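The reasoning above can be applied to a series of GPU Busy readings. This is a minimal sketch, assuming the readings were noted down by hand as percentages. The function name and the one-point tolerance are hypothetical choices, not values defined by Intel GPA.

```python
def likely_bound(gpu_busy_samples, tolerance=1.0):
    """Classify a run from GPU Busy readings given in percent (0-100)."""
    if not gpu_busy_samples:
        raise ValueError("need at least one sample")
    threshold = 100.0 - tolerance  # assumed allowance for measurement noise
    if all(sample >= threshold for sample in gpu_busy_samples):
        return "GPU-bound"
    if all(sample < threshold for sample in gpu_busy_samples):
        return "possibly CPU-bound"
    return "mixed"


print(likely_bound([100.0, 99.5, 100.0]))
print(likely_bound([70.0, 82.0, 65.0]))
```

A "possibly CPU-bound" result is only a hint that the GPU is waiting; the Graphics Trace Analyzer remains the place to confirm where the time goes.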
HUD Overhead Time
Represents the heads-up display (HUD) overhead time.
Non-Culled Polygons
Represents the number of polygons processed that were not culled.
GTI Metrics
GTI Write Throughput
Represents the total number of GPU memory bytes written to GTI.
GTI Read Throughput
Represents the total number of GPU memory bytes read from GTI.
DRAM LLC Throughput, bytes
Represents the approximate amount of GPU memory bytes transferred between LLC and DRAM controller.
LLC GPU Accesses, messages
Represents the total number of LLC cache lookups done from the GPU (64B reads, 32B writes).
NOTE
This metric might show incorrect results and will be disabled with the next driver update.
LLC GPU Throughput, bytes
Represents the total number of GPU memory bytes transferred between GPU and LLC.
LLC GPU Hits, messages
Represents the total number of successful LLC cache lookups done from the GPU.
NOTE
This metric might show incorrect results and will be disabled with the next driver update.
EU Array Metrics
EU Idle %
Represents the percentage of time when the GPU execution units (EUs) were idle. An EU is idle when it is neither actively executing shader instructions nor stalled trying to execute shader instructions.
Examples
  • If
    EU Idle %
    is 50, it means that the EUs were idle for 50% of the rendering time for selected ergs.
  • If
    EU Idle %
    is 0, it means that the EUs were either active or stalled for the entire duration of the rendering time for the selected erg.
Improving Performance
If
EU Idle %
is significantly higher than 0%, this indicates that there are stalls elsewhere in the rendering pipeline.
EU Active %
Represents the percentage of time when the GPU execution units (EUs) were actively executing pixel, geometry, or vertex shader instructions.
Examples
If
EU Active %
is 80, it means that the EUs were active 80% of the rendering time for the selected events.
Improving Performance
If the EUs are not active, it means that they are either stalled waiting for a request to be fulfilled, or idle. You can see how much of the non-active time is caused by stalls by examining the
EU Stall %
metric. If the total EU busy time (
EU Active %
+
EU Stall %
) is significantly lower than 100%, this indicates that there are stalls elsewhere in the rendering pipeline.
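Because an EU is idle exactly when it is neither active nor stalled, the three EU metrics can be cross-checked against each other. This is a minimal sketch, assuming the two percentages were read from the tool by hand. The function name and the sample values are hypothetical.

```python
def eu_breakdown(active_pct, stall_pct):
    """Derive EU busy and idle percentages from EU Active % and EU Stall %."""
    if active_pct < 0 or stall_pct < 0:
        raise ValueError("percentages must be non-negative")
    busy = active_pct + stall_pct
    # Clamp at zero in case rounding in the readings pushes the sum past 100.
    idle = max(0.0, 100.0 - busy)
    return {"active": active_pct, "stall": stall_pct, "busy": busy, "idle": idle}


result = eu_breakdown(80.0, 15.0)
print(f"busy {result['busy']:.0f}%, idle {result['idle']:.0f}%")
```

A large derived idle share points at the same conclusion as a high EU Idle % reading: the EUs are waiting on something elsewhere in the pipeline.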
EU Stall %
Represents the percentage of time when the GPU execution units (EUs) were stalled. An EU becomes stalled when all of its threads are waiting for results from fixed function units (for example, a pixel shader requests texels from the texture sampler).
Examples
  • If
    EU Stall %
    is 50, it means that EUs were stalled for 50% of the rendering time for the selected ergs.
  • If
    EU Stall %
    is 0, it means that there were no stalls in EUs or the stall time is very small.
Improving Performance
If this metric is unexpectedly high, especially when compared with the
EU Active %
metric, you can analyze where the stalls happen by looking at the
VS EU Stall %
|
GS EU Stall %
|
PS EU Stall %
metrics. If any of these metrics show that most of the stall time is in one particular shader, examine your shader code in the Graphics Frame Analyzer to determine why this shader might be causing the EUs to stall.
EU AVG IPC Rate, Number
Represents the average rate of IPC calculated for two FPU pipelines.
NOTE
This metric might show incorrect results and will be disabled with the next driver update.
VS Duration
Represents an approximation of the total GPU time spent executing vertex shader code.
Examples
  • If
    Vertex Shader Duration
    is 50,000, it means that the GPU spends around 50 milliseconds to execute vertex shaders for the selected ergs.
  • If
    Vertex Shader Duration
    is 0, it means that time spent in vertex shaders for selected ergs is very small.
Improving Performance
If the
Vertex Shader Duration
time is significant compared to
GPU Duration
, vertex processing optimizations might be needed. In this situation, optimize the geometry by minimizing the
Vertex Count
,
Primitive Count
, and
Vertex Shader Invocations Count
. If you are using triangle lists, try to convert them to a single triangle strip to minimize the number of vertices sent to pipeline. Also optimize the geometry for VCache (see
Vertex Shader Invocations Count
metric description).
To see whether optimizations are possible, examine your vertex shader code in the Graphics Frame Analyzer. Refer to the Graphics API Performance Guide to find recommendations for vertex shader optimizations.
VS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Vertex Shader instructions. 
Examples
  • If
    VS EU Active %
    is 50%, half of the overall GPU time was spent actively executing Vertex Shader instructions.
  • If
    VS EU Active %
    is 0%, either no Vertex Shader was associated with the selected draw calls, or the amount of time spent actively executing Vertex Shader instructions was negligible.
Improving Performance
  • This metric is important if vertex processing seems to be a bottleneck for selected rendering calls. If
    VS EU Active %
    accounts for most of the EU active time, then to improve performance you should simplify the vertex shader or simplify and optimize the geometry of your primitives.
  • If
    VS EU Active %
    is significant, you should examine your vertex shader code to find reasons that might be causing stalls.
Inspect the shader code in the Graphics Frame Analyzer.
VS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Vertex Shader instructions. 
NOTE
This metric does not include the total amount of time stalled in the vertex shader, but only the fraction of the time when vertex shader stalls were causing the entire EU to stall. The entire EU stalls when all of its threads are stalled.
Examples
  • If
    VS EU Stall %
    is 50% it means that half of the overall GPU time was spent stalled on Vertex Shader instructions.
  • If
    VS EU Stall %
    is 0% it means that no Vertex Shader was associated with selected rendering calls or Vertex Shader threads were not causing EU stalls.
Improving Performance
  • This metric is important if vertex processing seems to be the bottleneck for selected rendering calls. If
    VS EU Stall %
    accounts for most of the EU stall time, then to improve performance you may need to simplify the vertex shader or simplify and optimize geometry.
  • If
    VS EU Stall %
    is significant, examine the vertex shader code to find what is causing the stalls.
Inspect the shader code in the Graphics Frame Analyzer.
VS Invocations
Represents the number of vertex shader invocations - the vertex shader is invoked once per vertex. The number of vertex shader invocations depends both on the vertex and primitive counts and the operation of the post-transform vertex cache (VCache). In an optimal situation the GPU fetches already-processed vertices from the cache rather than recalculating this data, which could impact the value of this metric.
Therefore, when the
VS Invocations
and the
Vertex Count
have similar values, it means that the geometry is not optimized to take advantage of the VCache.
Examples
The OptimizedMesh sample from the Microsoft* DirectX* SDK is a good example to illustrate the
Vertex Count
and VCache optimizations:
  • When rendering one un-optimized mesh as a triangle list, the
    Vertex Count
    is equal to 141K and the
    VS Invocations
    is 112K.
  • When rendering the same mesh as a triangle list that has been reordered for optimum VCache usage, the
    Vertex Count
    is still the same but the
    VS Invocations
    number drops to 27K, which is roughly four times fewer.
  • When rendering the same mesh as a VCache-optimized triangle strip, the
    Vertex Count
    drops to 52K and the
    VS Invocations
    drops to 25K.
Improving Performance
To improve vertex processing performance and reduce the number of vertex shader invocations, try to reorder the geometry for optimum VCache usage. The D3DX utility library contains functions that reorder the geometry to improve VCache utilization (
ID3DXMesh::Optimize, D3DXOptimizeFaces, D3DXOptimizeVertices
).
NOTE
If you render point sprites, the metric is always equal to
Vertex Count
and
Primitive Count
(that is, no optimizations are necessary).
The size of the VCache varies for different GPU models, so you may see different metric values when using the same geometry on different hardware.
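One way to read the OptimizedMesh numbers is as vertex shader invocations per submitted vertex. The following is a minimal sketch using the rounded figures quoted in the example. The helper name is hypothetical, and the ratio is an illustration rather than a metric reported by Intel GPA.

```python
def invocations_per_vertex(vs_invocations, vertex_count):
    """Ratio of VS Invocations to Vertex Count; lower means more VCache reuse."""
    if vertex_count <= 0:
        raise ValueError("Vertex Count must be positive")
    return vs_invocations / vertex_count


# (VS Invocations, Vertex Count) from the OptimizedMesh example, rounded to thousands.
cases = {
    "un-optimized triangle list": (112_000, 141_000),
    "VCache-ordered triangle list": (27_000, 141_000),
    "VCache-optimized triangle strip": (25_000, 52_000),
}
for label, (invocations, vertices) in cases.items():
    ratio = invocations_per_vertex(invocations, vertices)
    print(f"{label}: {ratio:.2f} invocations per vertex")
```

The strip shows a higher ratio than the reordered list only because its Vertex Count is much smaller; its absolute VS Invocations value is still the lowest of the three, so compare both numbers.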
VS Send Pipe Active %
Represents the percentage of time in which EU send pipeline was actively processing a vertex shader instruction.
VS FPU0 Pipe Active %
Represents the percentage of time in which EU FPU0 pipeline was actively processing a vertex shader instruction.
VS FPU1 Pipe Active %
Represents the percentage of time in which EU FPU1 pipeline was actively processing a vertex shader instruction.
HS Duration
Represents the total amount of time the GPU spent executing hull shader code.
Examples
  • If
    HS Duration
    is 50,000 it means that the GPU spent 50 milliseconds executing hull shader code for the selected ergs.
  • If
    HS Duration
    is 0, it means that either the time spent executing hull shader code was negligible, or there was no hull shader in use.
Improving Performance
If the
HS Duration
is larger than you expect, you can examine your hull shader code in the Graphics Frame Analyzer to investigate possible optimizations.
HS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Hull Shader instructions.
HS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Hull Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.
NOTE
This metric does not include the total amount of stalled time in the Hull Shader, but only the amount of time when the Hull Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.
Improving Performance
If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls.
Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be ‘hidden’. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities.
Inspect the shader code that was executed for a given draw call and experiment with optimizations in the Graphics Frame Analyzer.
HS Invocations
Represents the number of Hull Shader invocations. The Hull Shader is invoked once per patch.
Examples
The SimpleBezier11 sample from the Microsoft* DirectX* SDK is a good example to understand Hull Shaders. This sample renders a Mobius strip comprised of four patches with 64 control points per patch. Execution of this sample will result in an
HS Invocations
value of four.
Improving Performance
The Hull Shader is not usually a performance bottleneck, but it can definitely cause performance issues further down the rendering pipeline. If the Hull Shader specifies large tessellation factors, or as the
HS Invocations
value increases, it will result in more work for the fixed function tessellator as well as an increased number of
DS Invocations
and
GS Invocations
.
DS Duration
Represents the total amount of time the GPU spent executing domain shader code.
Examples
  • If
    DS Duration
    is 50,000 it means that the GPU spent 50 milliseconds executing domain shader code for the selected ergs.
  • If
    DS Duration
    is 0, it means that either the time spent executing domain shader code was negligible, or there was no domain shader in use.
Improving Performance
If
DS Duration
is larger than you expect, you can examine your domain shader code in the Graphics Frame Analyzer to investigate possible optimizations.
DS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Domain Shader instructions.
DS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Domain Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.
NOTE
This metric does not include the total amount of stalled time in the Domain Shader, but only the amount of time when the Domain Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.
Improving Performance
If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls.
Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be ‘hidden’. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities.
You can inspect the shader code that was executed for a given draw call and experiment with optimizations in Graphics Frame Analyzer.
DS Invocations
Represents the number of Domain Shader invocations. The Domain Shader is invoked once per fixed function tessellator output point.
Examples
The SimpleBezier11 sample from the Microsoft* DirectX* SDK is a good example to understand Domain Shaders. This sample renders a Mobius strip comprised of 4 patches with 64 control points per patch.
Increasing the Patch Divisions slider increases the tessellation factors of the Hull Shader, which results in an increased number of inputs into the Domain Shader. When the Patch Divisions slider is set to 4.0, the DS Invocations value will be 192. When the Patch Divisions slider is set to 5.0, the
DS Invocations
value will be 320.
Improving Performance
The purpose of a Domain Shader is to calculate the vertex positions for subdivided points output by the fixed function tessellator. The best way to improve performance is to minimize the number of
DS Invocations
. This can be done by decreasing the amount of tessellation performed, either by decreasing the number of Hull Shader invocations or by decreasing the tessellation factors in the Hull Shader.
GS Duration
Represents the approximate total GPU time spent executing geometry shader code.  
Examples
  • If
    GS Duration
    is 50,000, it means that the GPU spends around 50 milliseconds to execute geometry shaders for the selected ergs.
  • If
    GS Duration
    is 0, it means that time spent in geometry shaders for selected ergs is very small or no geometry shaders were associated with selected ergs.
Improving Performance
If you are encountering performance issues and the
GS Duration
time is more than 20% to 40% of the total
GPU Duration
, geometry shader code optimizations may be needed.
Examine geometry shader code in the Graphics Frame Analyzer to see if optimizations are possible.
Refer to the Graphics API Performance Guide for recommendations on how to optimize the geometry shader.
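The 20% to 40% guideline above can be checked with simple arithmetic. This is a minimal sketch, assuming both durations are in the same unit and were read from the tool by hand. The function name is hypothetical, and the band limits are parameters so that they can be adjusted for a particular workload.

```python
def gs_share_band(gs_duration, gpu_duration, low=20.0, high=40.0):
    """Return GS Duration as a percentage of GPU Duration and where it falls."""
    if gpu_duration <= 0:
        raise ValueError("GPU Duration must be positive")
    share = 100.0 * gs_duration / gpu_duration
    if share < low:
        band = "below the range"
    elif share <= high:
        band = "within the range"
    else:
        band = "above the range"
    return share, band


share, band = gs_share_band(50_000, 200_000)
print(f"GS share: {share:.0f}% ({band})")
```

A share within or above the range only suggests that the geometry shader deserves a closer look when you are also seeing performance issues.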
GS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Geometry Shader instructions.
Examples
  • If
    GS EU Active %
    is 50% it means that half of the overall GPU time was spent actively executing geometry shader instructions.
  • If
    GS EU Active %
    is 0% it means that no geometry shader was associated with the selected draw calls, or that the amount of time actively executing geometry shader instructions was negligible.
Improving Performance
  • This metric is important if geometry shader seems to be the bottleneck for selected rendering calls. If
    GS EU Active %
    accounts for most of the EU active time, then to improve performance you may need to simplify the geometry shader or simplify and optimize the geometry of the scene.
  • If
    GS EU Active %
    is more than a nominal amount, you may need to examine your geometry shader code to find reasons for what might be causing these stalls.
Inspect the shader code using the Graphics Frame Analyzer.
GS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Geometry Shader instructions. 
NOTE
This metric does not include the total amount of stalled time in the geometry shader but only the fraction of time when the geometry shader stalls were causing the entire EU to stall. The entire EU stalls when all of its threads are stalled.
Examples
  • If
    GS EU Stall %
    is 50%, it means that half of the overall GPU time was spent stalled on Geometry Shader instructions.
  • If
    GS EU Stall %
    is 0%, it means that no Geometry Shader was associated with selected rendering calls or Geometry Shader threads were not causing EU stalls.
Improving Performance
  • This metric will be important if you think that geometry shader looks like the bottleneck for selected rendering calls. If
    GS EU Stall %
    accounts for most of the EU stall time, then to improve performance you may need to simplify the geometry shader or simplify and optimize geometry.
  • If
    GS EU Stall %
    is more than a nominal amount, you may need to examine your geometry shader code to find reasons for what might be causing these stalls.
Inspect the shader code using the Graphics Frame Analyzer.
GS Invocations
Represents the number of geometry shader invocations. The value is 0 if no geometry shader is associated with the rendering call.
NOTE
See Microsoft* DirectX* SDK for a description of the shader invocation count.
Examples
If
GS Invocations
is 1000 it means that the geometry shader was invoked for 1000 primitives.
Improving Performance
The only way to minimize the number of geometry shader invocations is to minimize the number of input primitives. The impact on rendering performance of reducing the invocation count is highly dependent upon your specific game or application.
Post-GS Primitives
Represents the number of primitives that flowed out of the geometry shader (GS), if enabled, to the clipper. This metric is important if a geometry shader was associated with the selected rendering calls, and even more important if the number of primitives spawned by geometry shader code is dynamic.
NOTE
If the GS was not enabled for the selected rendering calls, the metric returns a value of 0.
Examples
If
Post-GS Primitives
is 1000 and
Primitive Count
is 100, it means that 1000 primitives were constructed in the geometry shader from the original 100.
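The relationship in this example can be expressed as an amplification factor. This is a minimal sketch with a hypothetical helper name. It treats a Post-GS Primitives value of 0 as "no geometry shader enabled", following the note above.

```python
def gs_amplification(post_gs_primitives, primitive_count):
    """Primitives emitted by the geometry shader per input primitive."""
    if primitive_count <= 0:
        raise ValueError("Primitive Count must be positive")
    if post_gs_primitives == 0:
        return None  # the metric reads 0 when no geometry shader is enabled
    return post_gs_primitives / primitive_count


print(gs_amplification(1000, 100))
```

Tracking this factor across frames can show how much the output varies when the number of primitives spawned by the geometry shader is dynamic.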
Improving Performance
Analyze the geometry shader code using Graphics Frame Analyzer.
PS Duration
Represents an approximation of the total GPU time spent executing pixel shader code.  
Examples
  • If
    Pixel Shader Duration
    is 50,000 it means that the GPU spends around 50 milliseconds to execute pixel shaders for the selected ergs.
  • If
    Pixel Shader Duration
    is 0 it means that time spent in pixel shaders for selected ergs is very small.
Improving Performance
Examine the
Pixel Shader Duration
time versus the
GPU Duration
; when
Pixel Shader Duration
is high you may improve overall rendering performance by optimizing your pixel shader code.
Refer to the Graphics API Performance Guide to find advice for pixel shader optimizations.
PS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Pixel Shader instructions.
Examples
  • If
    PS EU Active %
    is 50% it means that half of the overall GPU time was spent actively executing Pixel Shader instructions.
  • If
    PS EU Active %
    is 0% it means that no Pixel Shader was associated with the selected draw calls, or that the amount of time actively executing Pixel Shader instructions was negligible.
Improving Performance
  • This metric is important if pixel shading seems to be the bottleneck for selected rendering calls.
  • If
    PS EU Active %
    accounts for most of the EU active time, then to improve performance you may need to simplify the pixel shader.
  • If
    PS EU Active %
    is larger than you would expect and you are encountering slow rendering times, you should examine the pixel shader code for potential reasons why these stalls may be occurring.
PS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Pixel Shader instructions. 
NOTE
This metric does not show total amount of stalled time in the pixel shader, but only the fraction of time when pixel shader stalls caused the entire EU to stall. The entire EU stalls when all of its threads are stalled.
Examples
  • If
    PS EU Stall %
    is 50% it means that half of the overall GPU time was spent stalled on Pixel Shader instructions.
  • If
    PS EU Stall %
    is 0% it means that no Pixel Shader was associated with selected rendering calls or Pixel Shader threads were not causing EU stalls.
Improving Performance
  • This metric is important if pixel shading seems to be the bottleneck for selected rendering calls. If
    PS EU Stall %
    accounts for most of the EU stall time, then to improve performance you may need to simplify the pixel shader.
  • If
    PS EU Stall %
    is larger than you expect and you are encountering slow rendering times, you need to concentrate on pixel shader code to find reasons for these stalls.
PS Invocations
Represents the number of pixel shader invocations. The pixel shader is invoked once per pixel.
Examples
If you render an 8x8-pixel quad that is located entirely within the viewing frustum, the
Pixel Shader Invocation Count
is 64.
Improving Performance
Usually,
PS Invocations
workloads are among the most expensive in the rendering pipeline because of the processing time required within the pixel shader. Therefore, keeping the number of invocations as low as possible will likely improve your rendering performance.
NOTE
For Intel® microarchitecture code name Ivy Bridge and Bay Trail, this metric includes pixels rejected by Early-Depth test, even though the pixel shader was not actually invoked for these pixels.
PS Send Pipeline Active %
Represents the percentage of time in which EU send pipeline was actively processing a pixel shader instruction.
PS FPU0 Pipe Active %
Represents the percentage of time in which EU FPU0 pipeline was actively processing a pixel shader instruction.
PS FPU1 Pipe Active %
Represents the percentage of time in which EU FPU1 pipeline was actively processing a pixel shader instruction.
EU FPU0 Pipe Active %
Represents the percentage of time during which the EU FPU0 pipeline was actively processing.
EU FPU1 Pipe Active %
Represents the percentage of time during which the EU FPU1 pipeline was actively processing.
EU Both FPU Pipes Active %
Represents the percentage of time in which both EU FPU pipelines were actively processing.
EU Send Pipe Active %
Represents the percentage of time during which the EU Send pipeline was actively processing.
CS Duration
Represents the total amount of time the GPU spent executing compute shader code.
Examples
  • If
    CS Duration
    is 50,000 it means that the GPU spent 50 milliseconds executing compute shader code for the selected ergs.
  • If
    CS Duration
    is 0 it means that either the time spent executing compute shader code was negligible, or there was no compute shader in use.
Improving Performance
If
CS Duration
is larger than you expect, you can examine your compute shader code in the Graphics Frame Analyzer to investigate possible optimizations.
CS EU Active %
Represents the percentage of overall GPU time that the EUs were actively executing Compute Shader instructions.
Examples
  • If
    CS EU Active %
    is 0%, it means that no compute shader was associated with the selected draw calls, or that the amount of time actively executing compute shader instructions was negligible.
CS EU Stall %
Represents the percentage of overall GPU time that the EUs were stalled in Compute Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.
Examples
  • If
    CS EU Stall %
    is 0%, it means that no Compute shader was associated with the selected draw calls, or that the amount of time stalled on Compute shader instructions was negligible.
NOTE
This metric does not include the total amount of stalled time in the Compute Shader, but only the amount of time when the Compute Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.
Improving Performance
If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls.
Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be ‘hidden’. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities.
Inspect the shader code that was executed for a given draw call and experiment with optimizations in Graphics Frame Analyzer.
CS Invocations
Represents the number of compute shader invocations. The Compute Shader is invoked once per thread per thread group. The number of threads per thread group is defined by the Compute Shader’s
numthreads
attribute (
numthreads(tX, tY, tZ)
). The number of thread groups executed is determined by the parameters to the Dispatch call (
Dispatch(gX, gY, gZ)
).
CS Invocations
is equal to
(gX*gY*gZ)*(tX*tY*tZ)
.
Examples
  • If the numthreads attribute is numthreads(4, 4, 1) and Dispatch is called as Dispatch(16, 16, 16), the
    CS Invocations
    value will be equal to (16*16*16)*(4*4*1) = 65536.
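The formula above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is hypothetical, and the tuples simply mirror the arguments of Dispatch(gX, gY, gZ) and numthreads(tX, tY, tZ).

```python
from math import prod


def cs_invocations(dispatch, numthreads):
    """Expected CS Invocations: thread groups times threads per group."""
    if len(dispatch) != 3 or len(numthreads) != 3:
        raise ValueError("expected three dimensions for each argument")
    return prod(dispatch) * prod(numthreads)


# Dispatch(16, 16, 16) with numthreads(4, 4, 1), as in the example above.
print(cs_invocations((16, 16, 16), (4, 4, 1)))  # 65536
```

Comparing the computed value with the reported metric is a quick way to confirm that a dispatch covers the thread count you intended.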
Sampler Metrics
Sampler Busy %
Represents the percentage of time the texture sampler was busy handling texel fetch requests (that is, was either active or stalled).
NOTE
This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.
Examples
  • If
    Sampler Busy %
    is 50, it means that the texture sampler was busy 50% of the rendering time for the selected ergs.
  • If
    Sampler Busy %
    is 0, it means that texture sampler was not used or the time during which it was active is very small.
Improving Performance
When
Sampler Busy %
is high, this might lead to execution unit stalls, especially if texture fetch latency does not occur in parallel with mathematical instructions (as the shader compiler attempts to optimize shader code to cover such latencies). Examine the
EU Stall %
metric to see the amount of EU stalls. If the percentage is high and the
Sampler Busy %
is close to 100%, most likely you have a texturing bottleneck. Try the 2x2 textures experiment in the Experiments Tab in the Graphics Frame Analyzer to see if this is the case.
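The heuristic above combines two readings. This is a minimal sketch; the function name and both thresholds are hypothetical, since the text says only "high" and "close to 100%", so adjust them to suit your workload.

```python
def texturing_bottleneck_suspected(sampler_busy_pct, eu_stall_pct,
                                   busy_threshold=95.0, stall_threshold=50.0):
    """True when the sampler is nearly saturated and EU stalls are high."""
    # Both thresholds are assumed values, not limits defined by Intel GPA.
    return sampler_busy_pct >= busy_threshold and eu_stall_pct >= stall_threshold


print(texturing_bottleneck_suspected(98.0, 62.0))
print(texturing_bottleneck_suspected(98.0, 10.0))
```

A positive result is a prompt to run the 2x2 textures experiment, not a confirmation of the bottleneck.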
Sampler Texels, texels
Represents the number of texels returned from the texture sampler.
NOTE
This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.
Examples
If
Sampler Texels, texels
is 1000, it means that 1000 texels were delivered to the execution units (EUs) from the texture sampler.
Improving Performance
A high number of texels fetched from textures leads to higher texture bandwidth and more texture sampler stalls, which in turn might cause a high number of EU stalls as shaders wait for texels from the sampler unit.
Note that this metric could indicate that the shader stalls while fetching texture data inside branching logic. For example, if the shader fetches texture samples only inside an
if()
block in the code, this metric can help you understand how often the shader takes the branch.
NOTE
This metric is accurate only to four texels, and generally is slightly larger than the actual number of texels used. This is because the texture sampler returns data in 2x2 texel quads. When sampling along angular edges, this inaccuracy becomes more pronounced.
Sampler Cache Misses, messages
Represents the number of bytes of texture data read from memory by the GPU due to texture cache misses when rendering this frame. Note that the Texture Sampler reads data from memory in 64-byte blocks, so the number of texture cache misses can be calculated by dividing the value of this metric by 64.
Examples
  • If
    Sampler Cache Misses, messages
    is 64000, it means the Texture Sampler missed the cache 1000 times and needed to read 64000 bytes of memory.
  • If
    Sampler Cache Misses, messages
    is 0, it means that no texture data was read from memory for the selected ergs.
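The conversion used in the first example can be sketched as follows. The constant and function names are hypothetical. The only fact taken from this guide is the 64-byte read size of the Texture Sampler.

```python
# The Texture Sampler reads memory in 64-byte blocks (stated in this guide).
SAMPLER_READ_BLOCK_BYTES = 64


def texture_cache_misses(bytes_read):
    """Convert the metric value (bytes read) into a count of cache misses."""
    return bytes_read // SAMPLER_READ_BLOCK_BYTES


print(texture_cache_misses(64000))  # 1000
print(texture_cache_misses(0))      # 0
```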
NOTE
This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.
Improving Performance
Usually a higher value for this metric leads to a higher percentage of Texture Sampler stalls. Therefore, use techniques that minimize the number of texture reads, such as those shown in the "Improving Performance" section of the Sampler Stalled metric.
Sampler Bottleneck %
Represents the percentage of time that the texture sampler is a bottleneck. The sampler is stalling Execution Units (EUs) due to a full input FIFO and starving EUs due to a lack of results.
NOTE
This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.
Examples
If
Sampler Bottleneck %
is 90, then the texture sampler is a bottleneck (stalling some EUs and/or causing other EUs to idle) 90% of the time.
Improving Performance
The following techniques may improve the texture sampler performance:
  • Reducing the size of textures, by using a lower resolution or lower color precision (such as RGBA4444 instead of RGBA8888)
  • Using texture compression to reduce the amount of data transferred for textures
  • Using mipmapping, so that smaller textures (mipmaps) can be used
  • Reducing the number of textures in the scene
  • Using a different filtering algorithm
For example, anisotropic filtering is more expensive to compute than a simpler algorithm, such as bilinear filtering. To help minimize overhead in this area, capture a typical frame while the game is running, use this frame as input to the Graphics Frame Analyzer, and try one or more of the following techniques:
  • the
    2x2 Textures
    experiment in the
    Experiments
    tab to see if textures are a bottleneck
  • the
    Texture
    tab to see the texture size, format, and mip level
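To put the texture-size techniques listed above into numbers, the following sketch estimates the memory footprint of an uncompressed texture. It is a simplified model under stated assumptions: the function name is hypothetical, RGBA8888 is taken as 4 bytes per texel and RGBA4444 as 2 bytes per texel, and driver-specific padding, tiling, and compression are ignored.

```python
def texture_bytes(width, height, bytes_per_texel, mipmapped=False):
    """Estimate the size of an uncompressed 2D texture in bytes.

    With mipmapped=True the full mip chain down to 1x1 is included.
    Padding and tiling applied by a real driver are not modeled.
    """
    total = 0
    w, h = width, height
    while True:
        total += w * h * bytes_per_texel
        if not mipmapped or (w == 1 and h == 1):
            break
        # Each mip level halves both dimensions, never going below 1.
        w, h = max(1, w // 2), max(1, h // 2)
    return total


print(texture_bytes(1024, 1024, 4))  # RGBA8888: 4194304
print(texture_bytes(1024, 1024, 2))  # RGBA4444: 2097152
print(texture_bytes(512, 512, 4))    # half resolution: 1048576
```

Halving the color precision halves the footprint, and halving the resolution in both dimensions cuts it to a quarter. A full mip chain adds roughly one third to the storage, but it lets the sampler read from smaller levels for distant surfaces.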
NOTE
This metric might show incorrect results and will be disabled with the next driver update.
Sampler Stalled
Represents the percentage of time the texture sampler was stalled. The texture sampler is stalled when its output queue is full, which can occur when it returns texture results faster than the EUs can process them. When the texture sampler is stalled, it cannot process new requests.
Examples
  • If
    Sampler Stalled
    is 50%, it means that half of the time when the texture sampler was busy, it was waiting for space to open up in its output queue.
  • If
    Sampler Stalled
    is 0%, it means that the texture sampler never stalled.
Improving Performance
  • Reduce the number of texture fetches in the shader code.
  • Reduce the texture size and texture filtering settings under the Texture Tab in the Graphics Frame Analyzer to see if this helps improve performance without adversely affecting image quality.
  • Minimize anisotropic filtering, because it requires a high number of additional texel fetches and is therefore "expensive" to use.
  • Modify the texture fetching pattern in the shader code to optimize texture cache utilization.
To inspect shader code, see the Shaders Tab in the Graphics Frame Analyzer.
3D Pipe Metrics
Early Hi-Depth Test Fails, pixels
Represents the total number of pixels dropped on the early hierarchical depth test.
Early Depth Test Fails, pixels
Represents the number of pixels that failed the early depth/stencil tests.
Clipper Invocations
Represents the number of primitives processed by the Clipper.
Examples
  • If you render 100 triangles and clipping is enabled, the
    Clipper Invocations
    is 100.
  • If you render 100 triangles and clipping is disabled, the
    Clipper Invocations
    is 0.
Improving Performance
In most cases, you do not need to worry about clipper performance on Intel® HD Graphics 2000/3000 GPUs because these graphics processors use a fast clipping algorithm implemented in silicon.
For more information on enabling/disabling hardware clipping, read the Microsoft* DirectX* SDK documentation.
Post-Clip Primitives
Represents the number of primitives that flowed out of the clipper. The metric includes original primitives that passed the trivial clipping test (trivial accept), and new primitives that were created by the clipper as a result of the clipping operation.
Examples
  • If you render 100 triangles and clipping is enabled and all the triangles are trivially accepted, the
    Post-Clip Primitives
    is 100.
  • If you render 100 triangles and clipping is enabled and all the triangles are trivially rejected, the
    Post-Clip Primitives
    is 0.
  • If you render 100 triangles and clipping is enabled and one or more triangles are partially located within the viewing frustum, the
    Post-Clip Primitives
    count returns a value which could be more or less than 100 depending on the number of triangles that were clipped. If the value is significantly higher than 100, it means that many triangles were partially clipped, and the clipper created additional triangles.
Improving Performance
In most cases, you do not need to worry about clipper performance on Intel® HD Graphics 2000/3000 GPUs because these graphics processors implement an efficient clipping algorithm in silicon.
For more information on enabling/disabling hardware clipping, read the Microsoft* DirectX* SDK documentation.
Samples Killed in PS, pixels
Represents the total number of samples or pixels dropped in pixel shaders.
Primitive Count
Represents the number of primitives sent to the 3D hardware.
NOTE
For Microsoft* DirectX* 9: the
Primitive Count
metric matches the
PrimitiveCount
parameter in the rendering calls.
Examples
  • If you render 100 points, the IA stage assembles 100 point primitives and the
    Primitive Count
    is 100.
  • If you render two triangles as a triangle list, the IA stage assembles two triangles and the
    Primitive Count
    is two.
  • If you render two triangles as a triangle strip, the IA stage assembles two triangles and the
    Primitive Count
    is two.
Improving Performance
If geometry/vertex processing becomes a bottleneck, try to reduce the number of primitives sent to the GPU for each frame by:
  • Simplifying your rendering geometry; for example, show small geometry details using bump maps instead of triangles, use lower detail models for far away objects, or use textures with multiple mip maps.
  • Optimizing your scene through various culling techniques; for example, use Binary Space Partitioning (BSP), Portal rendering, or Octrees.
Vertex Count
Represents the number of vertices sent to the 3D hardware pipeline during the D3D Input Assembler (IA) stage. The number of vertices depends on the primitive type and the number of primitives. The following formulas are used:
Primitive type    Vertex Count
Point list        Number of Primitives
Triangle list     Number of Primitives * 3
Triangle strip    Number of Primitives + 2
Line list         Number of Primitives * 2
Line strip        Number of Primitives + 1
NOTE
For Microsoft* DirectX* 9: When rendering indexed primitives the Vertex Count metric does not match the NumVertices parameter in the ::DrawIndexedPrimitive, ::DrawIndexedPrimitiveUP functions because the Input Assembler counts shared vertices multiple times.
For Microsoft* DirectX* 10 and later: The Vertex Count metric does not include vertices created during the geometry shader stage.
Examples
  • If you render 100 points, the IA stage assembles 100 point primitives with 100 vertices total, and the
    Vertex Count
    is 100.
  • If you render two triangles as a triangle list, the IA stage assembles two triangles with six vertices total, and the
    Vertex Count
    is six.
  • If you render two triangles as a triangle strip, the IA stage assembles two triangles with four vertices total, and the
    Vertex Count
    is four.
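The per-primitive-type formulas above can be expressed as a small lookup. This is a sketch for checking expected values against the metric. The dictionary and function names are hypothetical, and the formulas assume at least one primitive is rendered.

```python
# Formulas taken from the table in this section; n is the number of primitives.
VERTEX_COUNT_FORMULAS = {
    "point list": lambda n: n,
    "triangle list": lambda n: n * 3,
    "triangle strip": lambda n: n + 2,
    "line list": lambda n: n * 2,
    "line strip": lambda n: n + 1,
}


def vertex_count(primitive_type, primitive_count):
    """Return the expected Vertex Count for one draw call (primitive_count >= 1)."""
    return VERTEX_COUNT_FORMULAS[primitive_type](primitive_count)


print(vertex_count("point list", 100))    # 100
print(vertex_count("triangle list", 2))   # 6
print(vertex_count("triangle strip", 2))  # 4
```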
Improving Performance
To minimize the number of vertices sent to the pipeline and thereby improve vertex processing performance, use graphics primitives that minimize the amount of data sent to and processed by the GPU, such as single triangle strips.
Samples Blended, pixels
Represents the total number of samples or pixels written to all render targets.
Samples Written
Represents the number of pixels/samples written to render targets.
The graphics driver 9.17.10 introduces a new notion of deferred clears. For the sake of optimization, the driver decides whether to defer the actual rendering of clear calls in case subsequent clear and draw calls make it unnecessary. As a result, when clear calls are deferred, the Graphics Frame Analyzer shows their
GPU Duration
and
Samples Written
as zero. If it later turns out that a clear call needs to be drawn, the work associated with that clear call is included in the duration of the erg that was being drawn when this clear call was deferred, which is not necessarily a clear call. This means that in the Graphics Frame Analyzer, metrics associated with a clear call accurately reflect the real work associated with that erg.
Alpha Test Fails
Represents the number of pixels that failed the alpha test and were ignored (not written to the surface).
Examples
If
Alpha Test Fails
is 5000, then 5000 pixels failed the alpha test and were not written to the surface.
Pixels Rendered
Represents the number of pixels that passed the depth-test (both Z-buffer and Stencil if enabled). If the depth-test was disabled,
Pixels Rendered
counts all the pixels that passed through from the previous pipeline stage.
NOTE
Pixels that passed the depth-test might not necessarily appear in the render target, which could occur if the color buffer write mask is set to 0.
Examples
  • If you render a quad with 8x8 pixels, located entirely within the viewing frustum and all the pixels passed depth test,
    Pixels Rendered
    is 64.
  • If you render a quad with 8x8 pixels, located entirely within the viewing frustum and half of the pixels are rejected by depth tests or other stages of the graphics pipeline,
    Pixels Rendered
    is 32.
Improving Performance
A high number of rendered pixels results in a high number of pixel shader executions, which requires more rendering time. To keep the number of rendered pixels as low as possible, optimize the rendering order to maximize
Early-Z
benefit or use a
Z-only
pass if possible.
To find areas with high depth complexity, use the Overdraw option in the Graphics Frame Analyzer.
