User Guide

  • 2020
  • 06/18/2020
  • Public Content

Command Line Analysis

Use the gpu-hotspots value to launch the GPU Compute/Media Hotspots analysis to:
  • Explore GPU kernels with high GPU utilization, estimate the effectiveness of this utilization, and identify possible reasons for stalls or low occupancy, along with options to address them.
  • Explore the performance of your application per selected GPU metrics over time.
  • Analyze the hottest DPC++ or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item configuration.
In the Basic Block Latency and Memory Latency profiling modes, the GPU Compute/Media Hotspots analysis uses these metrics:
  • Estimated GPU Cycles: The average number of cycles spent by the GPU executing the profiled instructions.
  • Average Latency: The average latency of the memory read and synchronization instructions, in cycles.
  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.
  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per kernel instance.
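The two instruction-count metrics can be illustrated with a small calculation. This is a minimal sketch of one plausible reading of the definitions above, not the profiler's actual implementation: the function name and the input format (one pair of total instructions and thread count per kernel instance) are assumptions made for illustration.

```python
from statistics import mean


def per_instance_metrics(instances):
    """Average instruction counts over kernel instances.

    `instances` is a hypothetical input format: a list of
    (instructions_executed, thread_count) pairs, one per kernel instance.
    """
    # Average number of instructions executed per kernel instance.
    per_instance = mean(instr for instr, _ in instances)
    # Average number of instructions executed by one thread per instance
    # (assumes the work is spread evenly across the threads of an instance).
    per_thread = mean(instr / threads for instr, threads in instances)
    return per_instance, per_thread
```

For example, two instances that execute 1000 and 3000 instructions on 10 threads each average 2000 instructions per instance and 200 instructions per thread.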
If you enable the Instruction count profiling mode, the analysis shows a breakdown of instructions executed by the kernel in the following groups:
Instruction count profiling mode
  • Control Flow: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and the mov and add instructions that explicitly change the ip register.
  • Send & Wait: send, sends, sendc, sendsc, wait
  • Int16 & HP Float, Int32 & SP Float, Int64 & DP Float:
    • Bit operations (only for integer types): and, or, xor, and others.
    • Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm
    • Vector arithmetic operations: line, dp2, dp4, and others.
    • Extended math operations.
Contains all other operations including
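The grouping above can be sketched as a small classifier. This is an illustrative sketch only, not the profiler's implementation. It assumes that the typed groups are selected by the bit width of the destination operand (16, 32, or 64), that the computational mnemonics are the ones this page lists for the operations-per-second metric, and it uses "Other" as a placeholder label because the name of the last group is not given in the text. The function name and the `dest_bits` and `writes_ip` parameters are hypothetical.

```python
CONTROL_FLOW = set(
    "if else endif while break cont call calla ret goto jmpi brd brc join halt".split()
)
SEND_WAIT = set("send sends sendc sendsc wait".split())
# Computational mnemonics taken from the lists on this page (assumed complete here).
COMPUTE = set(
    "and not or xor asr shr shl bfrev bfe bfi1 bfi2 ror rol "
    "add addc cmp cmpn mul rndu rndd rnde rndz sub avg frc mac mach mad madm "
    "line dp2 sad2 lrp pln sada2 dp3 dph dp4 dp4a "
    "math.inv math.log math.exp math.sqrt math.rsq math.sin math.cos "
    "math.fdiv math.pow".split()
)
TYPED_GROUPS = {16: "Int16 & HP Float", 32: "Int32 & SP Float", 64: "Int64 & DP Float"}
FALLBACK_GROUP = "Other"  # placeholder label; the group name is not given in the text


def classify(mnemonic, dest_bits=None, writes_ip=False):
    """Return the instruction group for one instruction.

    dest_bits: width of the destination operand in bits (assumption).
    writes_ip: True when a mov or add explicitly changes the ip register.
    """
    if mnemonic in CONTROL_FLOW or (mnemonic in {"mov", "add"} and writes_ip):
        return "Control Flow"
    if mnemonic in SEND_WAIT:
        return "Send & Wait"
    if mnemonic in COMPUTE and dest_bits in TYPED_GROUPS:
        return TYPED_GROUPS[dest_bits]
    return FALLBACK_GROUP
```

Checking control flow first matters for add: the same mnemonic counts as Control Flow when it writes the ip register and as an arithmetic operation otherwise.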
In the Instruction count mode, the analysis also provides Operations per second metrics calculated as a weighted sum of the following executed instructions:
  • Bit operations (only for integer types):
    • and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol (weight 1)
  • Arithmetic operations:
    • add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub (weight 1)
    • avg, frc, mac, mach, mad, madm (weight 2)
  • Vector arithmetic operations:
    • line (weight 2)
    • dp2, sad2 (weight 3)
    • lrp, pln, sada2 (weight 4)
    • dp3 (weight 5)
    • dph (weight 6)
    • dp4 (weight 7)
    • dp4a (weight 8)
  • Extended math operations:
    • math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos (weight 4)
    • math.fdiv, math.pow (weight 8)
The type of an operation is determined by the type of a destination operand.
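The weighted sum can be illustrated with a short calculation. This is a minimal sketch that uses the weights listed above. The function names, the input format (a mapping from mnemonic to executed count), and the division by elapsed time are assumptions for illustration; the profiler's exact formula is not given here. The sketch also omits the check that bit operations count only for integer types.

```python
# Weights from the list above, grouped by weight value.
WEIGHT_TABLE = {
    1: ["and", "not", "or", "xor", "asr", "shr", "shl", "bfrev", "bfe",
        "bfi1", "bfi2", "ror", "rol",
        "add", "addc", "cmp", "cmpn", "mul", "rndu", "rndd", "rnde", "rndz", "sub"],
    2: ["avg", "frc", "mac", "mach", "mad", "madm", "line"],
    3: ["dp2", "sad2"],
    4: ["lrp", "pln", "sada2", "math.inv", "math.log", "math.exp",
        "math.sqrt", "math.rsq", "math.sin", "math.cos"],
    5: ["dp3"],
    6: ["dph"],
    7: ["dp4"],
    8: ["dp4a", "math.fdiv", "math.pow"],
}
WEIGHTS = {m: w for w, names in WEIGHT_TABLE.items() for m in names}


def weighted_operations(counts):
    """Weighted sum of executed instructions; unlisted mnemonics contribute 0."""
    return sum(WEIGHTS.get(m, 0) * n for m, n in counts.items())


def operations_per_second(counts, elapsed_seconds):
    """Weighted operation count divided by elapsed time (assumed definition)."""
    return weighted_operations(counts) / elapsed_seconds
```

For example, 10 add, 5 mad, 2 dp4, and 1 math.pow instructions give 10×1 + 5×2 + 2×7 + 1×8 = 42 weighted operations; send instructions are not in the list and add nothing.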


-collect gpu-hotspots [-knob <…>] -- <…> […]
For the most current information on available knobs (configuration options) for the GPU Compute/Media Hotspots analysis, enter:
-help collect gpu-hotspots
This example runs the GPU Compute/Media Hotspots analysis in the default mode with the default GPU hardware metric preset:
-collect gpu-hotspots -knob enable-gpu-runtimes=true -- /home/test/myApplication
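When the analysis is launched from a script, the same command line can be assembled programmatically. The sketch below is an assumption-laden illustration: the name of the command-line executable is not shown in the text here, so the caller passes it in as `tool`, and the function name, its parameters, and the trailing target options are hypothetical. Only the -collect gpu-hotspots, -knob, and -- elements come from the example above.

```python
def build_gpu_hotspots_command(tool, target, knobs=None, target_args=()):
    """Build an argument list for a GPU Compute/Media Hotspots collection.

    tool:        name or path of the command-line executable (supplied by
                 the caller; not specified in this text).
    target:      path to the application to profile.
    knobs:       optional mapping of knob name to value.
    target_args: optional arguments passed to the application (assumption).
    """
    cmd = [tool, "-collect", "gpu-hotspots"]
    for name, value in (knobs or {}).items():
        cmd += ["-knob", f"{name}={value}"]
    # Everything after "--" is the profiled application and its arguments.
    cmd += ["--", target, *target_args]
    return cmd
```

The resulting list can be passed to `subprocess.run` as is. Building a list rather than a single string avoids shell quoting problems when the application path contains spaces.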

What's Next

When the data collection is complete, do one of the following to view the result:
