User Guide

gpu-hotspots
Command Line Analysis

Use the gpu-hotspots value to launch the GPU Compute/Media Hotspots analysis to:
  • Explore GPU kernels with high GPU utilization, estimate the effectiveness of this utilization, and identify possible reasons for stalls or low occupancy and the options for addressing them.
  • Explore the performance of your application per selected GPU metrics over time.
  • Analyze the hottest DPC++ or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item configuration.

Configure Characterization Analysis

This type of analysis requires root/administrative privileges and is available for Windows*, Linux* and Android* targets.
Use the Characterization configuration option to monitor the Render and GPGPU engine usage (Intel Graphics only), identify which parts of the engine are loaded, and correlate GPU and CPU data. When you select the Characterization radio button, the configuration section expands to show additional options.
The Characterization drop-down menu provides platform-specific presets of the GPU metrics. All presets, except for Dynamic Instruction Count, collect data about execution unit (EU) activity: EU Array Active, EU Array Stalled, EU Array Idle, Computing Threads Started, and Core Frequency. Each preset also adds its own metrics:
  • The Overview metric set includes additional metrics that track general GPU memory accesses, such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.
  • The Compute Basic (with global/local memory accesses) metric group includes additional metrics that distinguish accesses to different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
  • The Compute Extended metric group includes additional metrics targeted only at GPU analysis on the Intel processor code name Broadwell and later. For other systems, this preset is not available.
  • The Full Compute metric group is a combination of the Overview and Compute Basic event sets.
  • The Dynamic Instruction Count metric group counts the execution frequency of specific classes of instructions. With this metric group, you also get an insight into the efficiency of SIMD utilization by each kernel.
For the Characterization analysis, you can also collect additional data:
  • Use the Trace GPU programming APIs option to analyze DPC++, OpenCL™, or Intel Media SDK programs running on Intel Processor Graphics. This option may affect the performance of your application on the CPU side.
    For DPC++ or OpenCL applications, you can identify the hottest kernels and the GPU architecture block where a performance issue for a particular kernel was detected.
    For Intel Media SDK programs, you can explore the execution of Intel Media SDK tasks on the timeline and correlate this data with the GPU usage at each moment in time.
    Support limitations:
    • OpenCL kernel analysis is possible for Windows and Linux targets running on Intel Graphics.
    • Intel Media SDK program analysis is possible for Windows and Linux targets running on Intel Graphics.
    • Only the Launch Application and Attach to Process target types are supported.
    In the Attach to Process mode, if you attach to a process after the computing queue has already been created, VTune Profiler does not display data for the OpenCL kernels in that queue.
  • Use the Analyze memory bandwidth option to collect the data required to compute memory bandwidth. This type of analysis requires Intel sampling drivers to be installed.
  • Use the GPU sampling interval, ms field to specify an interval (in milliseconds) between GPU samples for GPU hardware metrics collection. By default, VTune Profiler uses a 1 ms interval.

Configure Source Analysis

In the Source Analysis mode, VTune Profiler helps you identify performance-critical basic blocks and issues caused by memory accesses in the GPU kernels.
  • The Basic Block Latency option helps you identify issues caused by algorithm inefficiencies. In this mode, VTune Profiler measures the execution time of all basic blocks. A basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end of the sequence. During post-processing, VTune Profiler calculates the execution time for each instruction in the basic block, so this mode helps you understand which operations are more expensive.
  • The Memory Latency option helps you identify latency issues caused by memory accesses. In this mode, VTune Profiler profiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option if you ran the GPU Compute/Media Hotspots analysis in the Characterization mode, identified that the GPU kernel is throughput-bound or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take more time.
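The basic block definition above can be illustrated with a minimal sketch that splits a toy instruction list into basic blocks using the usual "leader" rule: a block starts at the first instruction, at any branch target, and right after any branch. The instruction format, the `split_basic_blocks` helper, and the sample kernel are hypothetical and exist only for illustration; this is not how VTune Profiler represents or instruments GPU code.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (opcode, branch_target_index or None).
# Only instructions that carry a target are treated as branches here,
# which is a simplification of a real instruction set.
Instr = Tuple[str, Optional[int]]


def split_basic_blocks(code: List[Instr]) -> List[List[int]]:
    """Return basic blocks as lists of instruction indices."""
    if not code:
        return []
    leaders = {0}  # the first instruction always starts a block
    for i, (_, target) in enumerate(code):
        if target is not None:
            leaders.add(target)          # a branch target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)       # so does the instruction after a branch
    ordered = sorted(leaders)
    ends = ordered[1:] + [len(code)]
    return [list(range(start, end)) for start, end in zip(ordered, ends)]


toy_kernel: List[Instr] = [
    ("mov", None),   # 0
    ("cmp", None),   # 1
    ("jmpi", 5),     # 2: jump to instruction 5
    ("add", None),   # 3
    ("mul", None),   # 4
    ("send", None),  # 5
    ("ret", None),   # 6
]

blocks = split_basic_blocks(toy_kernel)
print(blocks)  # [[0, 1, 2], [3, 4], [5, 6]]
```

Each resulting block has one entry (its first index) and one exit (its last index), which is the property that lets a per-block time be distributed over the instructions inside it.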
In the Basic Block Latency or Memory Latency profiling modes, the GPU Compute/Media Hotspots analysis uses these metrics:
  • Estimated GPU Cycles: The average number of cycles spent by the GPU executing the profiled instructions.
  • Average Latency: The average latency of the memory read and synchronization instructions, in cycles.
  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per one kernel instance.
  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per one kernel instance.
If you enable the Instruction count profiling mode, VTune Profiler shows a breakdown of instructions executed by the kernel in the following groups:
  • Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and mov, add instructions that explicitly change the ip register.
  • Send & Wait group: send, sends, sendc, sendsc, wait
  • Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups:
    • Bit operations (only for integer types): and, or, xor, and others.
    • Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
    • Vector arithmetic operations: line, dp2, dp4, and others.
    • Extended math operations.
  • Other group: contains all other operations, including nop.
In the Instruction count mode, VTune Profiler also provides Operations per second metrics calculated as a weighted sum of the following executed instructions:
  • Bit operations (only for integer types):
    • and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol (weight 1)
  • Arithmetic operations:
    • add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub (weight 1)
    • avg, frc, mac, mach, mad, madm (weight 2)
  • Vector arithmetic operations:
    • line (weight 2)
    • dp2, sad2 (weight 3)
    • lrp, pln, sada2 (weight 4)
    • dp3 (weight 5)
    • dph (weight 6)
    • dp4 (weight 7)
    • dp4a (weight 8)
  • Extended math operations:
    • math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos (weight 4)
    • math.fdiv, math.pow (weight 8)
The type of an operation is determined by the type of a destination operand.
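As a rough illustration of the weighted sum described above, the following sketch applies the listed weights to a set of per-opcode instruction counts. The weights come from the list above; the `weighted_operations` and `operations_per_second` helpers, the sample counts, and the division by elapsed time are assumptions made for illustration. The exact formula VTune Profiler uses, including how it splits results by operand type, is not specified here, and this sketch ignores operand types.

```python
from typing import Dict

# (weight, opcodes) pairs transcribed from the list above.
_WEIGHT_TABLE = [
    (1, "and not or xor asr shr shl bfrev bfe bfi1 bfi2 ror rol"),
    (1, "add addc cmp cmpn mul rndu rndd rnde rndz sub"),
    (2, "avg frc mac mach mad madm"),
    (2, "line"),
    (3, "dp2 sad2"),
    (4, "lrp pln sada2"),
    (5, "dp3"),
    (6, "dph"),
    (7, "dp4"),
    (8, "dp4a"),
    (4, "math.inv math.log math.exp math.sqrt math.rsq math.sin math.cos"),
    (8, "math.fdiv math.pow"),
]
WEIGHTS: Dict[str, int] = {
    op: weight for weight, ops in _WEIGHT_TABLE for op in ops.split()
}


def weighted_operations(counts: Dict[str, int]) -> int:
    """Weighted sum of executed instructions; unlisted opcodes count as 0."""
    return sum(WEIGHTS.get(op, 0) * n for op, n in counts.items())


def operations_per_second(counts: Dict[str, int], elapsed_seconds: float) -> float:
    """Assumed rate: weighted operation count divided by elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return weighted_operations(counts) / elapsed_seconds


# Hypothetical per-opcode counts for one kernel; "send" carries no weight.
sample_counts = {"add": 1000, "mad": 500, "dp4": 100, "math.pow": 10, "send": 50}

print(weighted_operations(sample_counts))         # 2780
print(operations_per_second(sample_counts, 0.5))  # 5560.0
```

In this example, 1000 add instructions contribute 1000, 500 mad instructions contribute 1000, 100 dp4 instructions contribute 700, and 10 math.pow instructions contribute 80, for a total of 2780.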

Syntax

vtune -collect gpu-hotspots [-knob <knobName=knobValue>] -- <target> [target_options]
For the most current information on available knobs (configuration options) for the GPU Compute/Media Hotspots analysis, enter:
vtune -help collect gpu-hotspots

Example

This example runs the gpu-hotspots analysis in the default characterization mode with the default overview GPU hardware metric preset:
vtune -collect gpu-hotspots -knob enable-gpu-runtimes=true -- /home/test/myApplication
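When the collection is launched from a script, the command line can be assembled from the syntax above. The following is a minimal sketch: the `build_gpu_hotspots_command` helper is hypothetical, and the only knob it is shown with is the one from the example above. Check `vtune -help collect gpu-hotspots` for the knobs your version supports before passing any others.

```python
import shlex
from typing import Dict, List, Optional


def build_gpu_hotspots_command(
    target: str,
    knobs: Optional[Dict[str, str]] = None,
    target_options: Optional[List[str]] = None,
) -> List[str]:
    """Assemble: vtune -collect gpu-hotspots [-knob k=v] -- <target> [options]."""
    cmd = ["vtune", "-collect", "gpu-hotspots"]
    for name, value in (knobs or {}).items():
        cmd += ["-knob", f"{name}={value}"]  # one -knob flag per setting
    cmd += ["--", target]
    cmd += list(target_options or [])
    return cmd


cmd = build_gpu_hotspots_command(
    "/home/test/myApplication",
    knobs={"enable-gpu-runtimes": "true"},
)
print(shlex.join(cmd))
# vtune -collect gpu-hotspots -knob enable-gpu-runtimes=true -- /home/test/myApplication
```

Keeping the command as a list of arguments means it can be passed directly to `subprocess.run(cmd)` without shell quoting issues, assuming `vtune` is on the PATH of the machine that runs the collection.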

What's Next

When the data collection is complete, do one of the following to view the result:
