User Guide


GPU Compute/Media Hotspots View

Use the Intel® VTune™ Profiler GPU Compute/Media Hotspots viewpoint to analyze how your GPU-bound code is utilizing GPU and CPU resources.
Depending on the profiling mode selected for the GPU Compute/Media Hotspots analysis, you can explore your GPU-side code performance from different perspectives:

Analyze Memory Accesses

Characterization mode, which is enabled by default in the GPU Compute/Media Hotspots configuration, is a recommended entry-level analysis for GPU-bound applications.
The Hottest GPU Computing Task section of the Summary window displays the most time-consuming GPU tasks. Click such a task to switch to the Graphics tab and explore the GPU hardware metrics (by default, the Overview set of metrics) collected for this hotspot:
[Figure: Analyze memory accesses in GPU Compute/Media Hotspots analysis]

Analyze GPU Instruction Execution

If you enabled the Dynamic Instruction Count preset as part of the Characterization analysis configuration, the Graphics tab shows a breakdown of instructions executed by the kernel in the following groups:
Control Flow: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and the mov and add instructions that explicitly change the ip register.
Send & Wait: send, sends, sendc, sendsc
Int16 & HP Float, Int32 & SP Float, Int64 & DP Float:
    Bit operations (only for integer types): and, or, xor, and others.
    Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
    Vector arithmetic operations: line, dp2, dp4, and others.
    Extended math operations.
Other: contains all other operations.
The type of an operation is determined by the type of its destination operand.
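The grouping rules above can be read as a small decision procedure. The following C sketch is only an illustration of that reading, not the profiler's implementation: the enum names, the `classify_instruction` function, and the convention that a destination size of 0 means "no typed destination" are all assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Illustrative group labels; the identifiers are hypothetical. */
typedef enum {
    GROUP_CONTROL_FLOW,
    GROUP_SEND,
    GROUP_INT16_HP_FLOAT,
    GROUP_INT32_SP_FLOAT,
    GROUP_INT64_DP_FLOAT,
    GROUP_OTHER
} InstrGroup;

/* Return 1 if name matches one of the n entries in list, else 0. */
static int in_list(const char *name, const char *const *list, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (strcmp(name, list[i]) == 0) return 1;
    return 0;
}

/*
 * mnemonic   : instruction name, for example "add"
 * dest_bytes : size of the destination operand type (2, 4, or 8);
 *              0 is this sketch's convention for "no typed destination"
 * dest_is_ip : nonzero when the destination is the ip register
 */
InstrGroup classify_instruction(const char *mnemonic, int dest_bytes, int dest_is_ip) {
    static const char *const control[] = {
        "if", "else", "endif", "while", "break", "cont", "call", "calla",
        "ret", "goto", "jmpi", "brd", "brc", "join", "halt"
    };
    static const char *const send[] = { "send", "sends", "sendc", "sendsc" };

    assert(mnemonic != NULL);

    if (in_list(mnemonic, control, sizeof control / sizeof control[0]))
        return GROUP_CONTROL_FLOW;
    /* mov and add count as control flow only when they write the ip register. */
    if (dest_is_ip && (strcmp(mnemonic, "mov") == 0 || strcmp(mnemonic, "add") == 0))
        return GROUP_CONTROL_FLOW;
    if (in_list(mnemonic, send, sizeof send / sizeof send[0]))
        return GROUP_SEND;

    /* Everything else is bucketed by the width of the destination type. */
    switch (dest_bytes) {
    case 2:  return GROUP_INT16_HP_FLOAT;
    case 4:  return GROUP_INT32_SP_FLOAT;
    case 8:  return GROUP_INT64_DP_FLOAT;
    default: return GROUP_OTHER;
    }
}
```

For example, under these assumptions `classify_instruction("add", 4, 0)` lands in the 32-bit group, while the same mnemonic with `dest_is_ip` set is counted as control flow.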
In the Graphics tab, Intel® VTune™ Profiler also provides the SIMD Utilization metric. This metric helps identify kernels that underutilize the GPU by producing instructions that cause thread divergence. A common cause of low SIMD utilization is conditional branching within the kernel: the threads execute all of the execution paths sequentially, with each thread executing one path while the other threads are stalled.
To get additional information, double-click the hottest function to open the source view. Enable both the Source and Assembly panes to get a side-by-side view of the source code and the resulting assembly code. You can then locate the assembly instructions with low SIMD Utilization values and map them to specific lines of code by clicking an instruction. This allows you to identify and optimize the kernels that do not meet your desired SIMD Utilization criteria.
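To make the effect of branch divergence concrete, the following C sketch models a single if/else executed by one SIMD group. It is a simplified, hypothetical model and not the formula the profiler uses: the function name, the lane mask argument, and the per-path instruction counts are assumptions chosen for illustration.

```c
#include <assert.h>

/* Count set bits among the lowest `width` bits of `mask`. */
static int active_lanes(unsigned mask, int width) {
    int n = 0;
    for (int lane = 0; lane < width; ++lane)
        n += (int)((mask >> lane) & 1u);
    return n;
}

/*
 * Fraction of lane slots doing useful work when an if/else is executed
 * by a SIMD group of `width` lanes.
 *   then_mask   : bit i is set when lane i takes the "then" path
 *   then_instrs : instructions on the "then" path
 *   else_instrs : instructions on the "else" path
 * Both paths are issued one after the other; a path with no lanes is skipped.
 */
double divergent_simd_utilization(unsigned then_mask, int width,
                                  int then_instrs, int else_instrs) {
    assert(width > 0 && width <= 32);
    assert(then_instrs >= 0 && else_instrs >= 0);

    int n_then = active_lanes(then_mask, width);
    int n_else = width - n_then;

    int issued = (n_then ? then_instrs : 0) + (n_else ? else_instrs : 0);
    if (issued == 0) return 1.0;  /* nothing executed, nothing wasted */

    double useful = (double)n_then * then_instrs + (double)n_else * else_instrs;
    return useful / ((double)issued * width);
}
```

In this model, a 16-lane group in which every lane takes the same path reaches a utilization of 1.0, while a group split between two paths of equal length drops to 0.5, because each issued instruction keeps only part of the lanes busy.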

Analyze Source

If you selected the Source Analysis mode for the GPU Compute/Media Hotspots analysis, you can analyze a kernel of interest for basic block latency or memory latency issues. To do this, in the Graphics tab, expand the kernel node and double-click the function name. Intel® VTune™ Profiler redirects you to the hottest source line for the selected function.
The GPU Compute/Media Hotspots analysis provides a full-scale analysis of the kernel source per code line. The hottest kernel code line is highlighted by default.
To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly view.
If your OpenCL kernel uses inline functions, make sure to enable the Inline Mode on the filter toolbar to get a correct attribution of the GPU Cycles per function. See the examples below.

Examine Energy Consumption by your GPU

In Linux environments, when you run the GPU Compute/Media Hotspots analysis on an Intel® Iris® Xe MAX graphics discrete GPU, you can see energy consumption information for the GPU device. To collect this information, make sure you check the Analyze power usage option when you configure the analysis.
[Figure: Analyze Power Usage by GPU]
Once the analysis completes, see the energy consumption data in these sections of your results.
In the Graphics window, observe the Energy Consumption column in the Computing Task grouping. Sort this column to identify the GPU kernels that consumed the most energy. You can also see this information mapped in the timeline.
Tune for Power Usage
Once you have located the individual GPU kernels that consume the most energy, start by tuning the top energy hotspot for optimum power efficiency.
Tune for Processing Time
If your goal is to optimize GPU processing time, keep a check on energy consumption metrics per kernel to monitor the tradeoff between processing time and power use.
Move the Energy Consumption column next to Total Time to make this comparison easier.
[Figure: Energy Consumption in GPU by Computing Task]
You may notice that the correlation between power use and processing time is not direct. The kernels that compute the fastest may not be the kernels that consume the least energy. Check whether larger values of power usage correspond to longer stalls or wait periods.
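If you export the per-kernel values from the grid, a comparison of energy against time can be sketched as follows. This is a hypothetical post-processing example, not part of the profiler: the `KernelStats` record, its field names, and the choice of seconds and millijoules as units are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-kernel record; field names and units are assumptions. */
typedef struct {
    const char *name;
    double total_time_s;  /* total time of the kernel, in seconds */
    double energy_mj;     /* energy consumed by the kernel, in millijoules */
} KernelStats;

/* Average power in watts: energy (J) divided by time (s). */
double average_power_w(const KernelStats *k) {
    assert(k != NULL && k->total_time_s > 0.0);
    return (k->energy_mj / 1000.0) / k->total_time_s;
}

/* Index of the kernel that consumed the most energy. */
size_t top_energy_hotspot(const KernelStats *ks, size_t n) {
    assert(ks != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (ks[i].energy_mj > ks[best].energy_mj) best = i;
    return best;
}

/* Index of the kernel with the longest total time. */
size_t top_time_hotspot(const KernelStats *ks, size_t n) {
    assert(ks != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (ks[i].total_time_s > ks[best].total_time_s) best = i;
    return best;
}
```

When `top_energy_hotspot` and `top_time_hotspot` return different indices, the kernel worth tuning depends on whether the goal is power efficiency or processing time, which is the tradeoff described above.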
Energy consumption metrics are not displayed in GPU profiling analyses that scan Intel® Iris® Xe MAX graphics on Windows machines.
Example: Basic Block Latency Profiling
You have an OpenCL kernel that performs compute operations:
__kernel void viete_formula_comp(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);  // alternating sign of the series term
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;                      // apply the running compensation term
        t = sum + p;
        c = (t - sum) - p;           // rounding error of this addition
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}
To compare the cost of these operations, run the GPU In-kernel Profiling in the Basic block latency mode and double-click the kernel in the grid to open the Source view.
The Source view analysis highlights the pown() call as the most expensive operation in this kernel.
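If the expensive operation is the `pown(3.f, i)` power computation, one possible follow-up is to replace it with a running product that is updated once per iteration. The C sketch below checks on the host that both variants produce the same value. It is an illustration under stated assumptions, not a measured optimization: `pown_ref` is a stand-in for the OpenCL `pown` built-in, `SQRT_12` replaces `sqrt(12.f)` with a constant, and both function names are hypothetical.

```c
#include <assert.h>

/* Stand-in for OpenCL's pown(): integer power by repeated multiplication. */
static float pown_ref(float base, unsigned n) {
    float r = 1.0f;
    for (unsigned k = 0; k < n; ++k) r *= base;
    return r;
}

#define SQRT_12 3.4641016f  /* approximates sqrt(12.f) */

/* Same arithmetic as the kernel body, for a single work-item. */
float viete_with_pown(void) {
    float c = 0.0f, sum = 0.0f;
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pown_ref(3.0f, i);
        p -= c;
        float t = sum + p;
        c = (t - sum) - p;
        sum = t;
    }
    return sum * SQRT_12;
}

/* Variant that carries 3^i across iterations instead of recomputing it. */
float viete_running_power(void) {
    float c = 0.0f, sum = 0.0f;
    float pow3 = 1.0f;  /* holds 3^i at the top of each iteration */
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pow3;
        p -= c;
        float t = sum + p;
        c = (t - sum) - p;
        sum = t;
        pow3 *= 3.0f;
    }
    return sum * SQRT_12;
}
```

Whether this change actually reduces the latency reported for the kernel depends on the compiler and the device, so re-run the Basic block latency analysis after any such change.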
Example: Memory Latency Profiling
You have an OpenCL kernel that performs several memory reads (marked with comments in the listing):
__kernel void viete_formula_mem(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;
    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;
        t = data[gid] + p;           // memory read
        c = (t - data[gid]) - p;     // memory read
        data[gid] = t;
    }
    data[gid] *= sqrt(12.f);         // memory read
}
To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode.
The Source view analysis shows that the compiler understands that each thread works only with its own element from the input buffer and generates code that performs the read only once. The value from the input buffer is stored in a register and reused in other operations, so the compiler does not generate additional reads.
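The single-read behavior described above corresponds to keeping the buffer element in a local variable. The C sketch below contrasts the kernel body as written with an explicit single-read form, for one work-item on the host. It only illustrates the transformation and is not the code the GPU compiler emits: `pown_ref` stands in for the OpenCL `pown` built-in, `SQRT_12` replaces `sqrt(12.f)`, and the function names are hypothetical.

```c
#include <assert.h>

/* Stand-in for OpenCL's pown(): integer power by repeated multiplication. */
static float pown_ref(float base, unsigned n) {
    float r = 1.0f;
    for (unsigned k = 0; k < n; ++k) r *= base;
    return r;
}

#define SQRT_12 3.4641016f  /* approximates sqrt(12.f) */

/* Kernel body as written: data[gid] is named in every iteration. */
void viete_mem_as_written(float *data, int gid) {
    float c = 0.0f;
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pown_ref(3.0f, i);
        p -= c;
        float t = data[gid] + p;
        c = (t - data[gid]) - p;
        data[gid] = t;
    }
    data[gid] *= SQRT_12;
}

/* Explicit form: one read into a local, one write back at the end. */
void viete_mem_single_read(float *data, int gid) {
    float c = 0.0f;
    float sum = data[gid];  /* the only read of the buffer */
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pown_ref(3.0f, i);
        p -= c;
        float t = sum + p;
        c = (t - sum) - p;
        sum = t;
    }
    data[gid] = sum * SQRT_12;  /* the only write to the buffer */
}
```

Both functions leave the same value in `data[gid]`, which is why the compiler is free to make this substitution when each work-item touches only its own element.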
