User Guide


GPU Compute/Media Hotspots View

Use the Intel® VTune™ GPU Compute/Media Hotspots viewpoint to analyze how your GPU-bound code uses GPU and CPU resources.
Depending on the profiling mode selected for the GPU Compute/Media Hotspots analysis, you can explore your GPU-side code performance from different perspectives:

Analyze Memory Accesses

The Characterization mode, which is enabled by default in the GPU Compute/Media Hotspots configuration, is the recommended entry-level analysis for GPU-bound applications.
The Hottest GPU Computing Task section displays the most time-consuming GPU tasks. Click a task to switch tabs and explore the GPU hardware metrics (a default metric set, unless configured otherwise) collected for this hotspot.
 In-Kernel Analysis

Analyze GPU Instruction Execution

If you enabled the Dynamic Instruction Count preset as part of the Characterization analysis configuration, the results show a breakdown of the instructions executed by the kernel in the following groups:
Control Flow: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, plus mov and add instructions that explicitly change the ip register.
Send instructions (send, sends, sendc, sendsc) are counted as a separate group.
Int16 & HP Float, Int32 & SP Float, Int64 & DP Float: each of these groups contains the following operations on the corresponding types:
    Bit operations (only for integer types): and, or, xor, and others.
    Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
    Vector arithmetic operations: line, dp2, dp4, and others.
    Extended math operations.
All other operations are counted in a separate group.
The type of an operation is determined by the type of its destination operand.
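The grouping rule above can be sketched as a small classifier. This is only an illustration of the rule as stated in this section, not how the profiler is implemented: the enum, the function name instruction_group, the writes_ip flag, and the labels "Send" and "Other" are assumptions made for the sketch. The group that collects all remaining operations is not modeled, because this section does not list its members.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Destination operand types (hypothetical enum for this sketch). */
enum dst_type { DST_INT16, DST_HALF, DST_INT32, DST_FLOAT, DST_INT64, DST_DOUBLE };

/* Returns 1 if `name` appears in the NULL-terminated `list`, else 0. */
static int in_list(const char *name, const char *const *list)
{
    for (size_t i = 0; list[i] != NULL; ++i)
        if (strcmp(name, list[i]) == 0)
            return 1;
    return 0;
}

/*
 * Maps one instruction to a group label.
 * writes_ip: nonzero if the instruction explicitly changes the ip register.
 */
const char *instruction_group(const char *mnemonic, enum dst_type dst, int writes_ip)
{
    static const char *const control_flow[] = {
        "if", "else", "endif", "while", "break", "cont", "call", "calla",
        "ret", "goto", "jmpi", "brd", "brc", "join", "halt", NULL
    };
    static const char *const send[] = { "send", "sends", "sendc", "sendsc", NULL };

    assert(mnemonic != NULL);

    if (in_list(mnemonic, control_flow))
        return "Control Flow";
    /* mov and add count as control flow only when they write the ip register. */
    if (writes_ip && (strcmp(mnemonic, "mov") == 0 || strcmp(mnemonic, "add") == 0))
        return "Control Flow";
    if (in_list(mnemonic, send))
        return "Send"; /* assumed label */

    /* Otherwise the destination operand type decides the group. */
    switch (dst) {
    case DST_INT16: case DST_HALF:   return "Int16 & HP Float";
    case DST_INT32: case DST_FLOAT:  return "Int32 & SP Float";
    case DST_INT64: case DST_DOUBLE: return "Int64 & DP Float";
    }
    return "Other"; /* assumed label; not reached for valid enum values */
}
```

For example, instruction_group("add", DST_INT32, 0) lands in Int32 & SP Float, while the same mnemonic with writes_ip set is counted as Control Flow.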
The SIMD Utilization metric is also provided. It helps identify kernels that underutilize the GPU by producing instructions that cause thread divergence. A common cause of low SIMD utilization is conditional branching within the kernel: the execution paths run one after another, and while one path executes, the threads that did not take it are stalled.
To get additional information, double-click the hottest function to open the source view. Enable both the Source and Assembly panes to get a side-by-side view of the source code and the resulting assembly code. You can then locate the assembly instructions with low SIMD Utilization values and click an instruction to map it to a specific line of code. This lets you find and optimize the kernels that do not meet your SIMD Utilization criteria.
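To get a feel for how a divergent branch lowers SIMD utilization, the following host-side C sketch models a single if/else. It is a simplified model under stated assumptions, not the formula the profiler uses: the function name and parameters are hypothetical, every instruction is assumed to cost one issue slot per lane, and a path is assumed to be skipped entirely when no lane takes it.

```c
#include <assert.h>

/*
 * Models one if/else executed across `simd_width` lanes.
 * `lanes_taking_if` lanes run `if_len` instructions; the remaining lanes then
 * run `else_len` instructions. While one path executes, the lanes that did not
 * take it are idle. Returns active lane slots divided by issued lane slots.
 */
double divergent_branch_utilization(unsigned simd_width, unsigned lanes_taking_if,
                                    unsigned if_len, unsigned else_len)
{
    assert(simd_width > 0);
    assert(lanes_taking_if <= simd_width);

    unsigned lanes_taking_else = simd_width - lanes_taking_if;

    /* Assumption: a path nobody takes is skipped and costs nothing. */
    unsigned executed_if = lanes_taking_if ? if_len : 0;
    unsigned executed_else = lanes_taking_else ? else_len : 0;

    double active = (double)lanes_taking_if * executed_if
                  + (double)lanes_taking_else * executed_else;
    double issued = (double)simd_width * (executed_if + executed_else);

    /* No instructions issued means nothing was wasted. */
    return issued > 0.0 ? active / issued : 1.0;
}
```

In this model, divergent_branch_utilization(8, 2, 3, 3) yields 0.5: with two of eight lanes on one path and six on the other, half of the issued lane slots do no work. When all lanes agree on the branch, the result is 1.0.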

Analyze Source

If you selected the Source Analysis mode for the GPU Compute/Media Hotspots analysis, you can analyze a kernel of interest for basic block latency or memory latency issues. To do this, expand the kernel node in the grid and double-click the function name.
VTune redirects you to the hottest source line for the selected function.
The GPU Compute/Media Hotspots analysis provides a full-scale, line-by-line analysis of the kernel source. The hottest kernel code line is highlighted by default.
To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly view.
If your OpenCL kernel uses inline functions, enable the Inline Mode on the filter toolbar so that GPU Cycles are attributed to the correct function. See the examples below.
Example: Basic Block Latency Profiling
You have an OpenCL kernel that performs compute operations:
    __kernel void viete_formula_comp(__global float* data)
    {
        int gid = get_global_id(0);
        float c = 0, sum = 0;   // c is the compensation term of the running sum
        for (unsigned i = 0; i < 50; ++i)
        {
            float t = 0;
            float p = (i % 2 ? -1 : 1);
            p /= i*2 + 1;
            p /= pown(3.f, i);
            p -= c;
            t = sum + p;
            c = (t - sum) - p;
            sum = t;
        }
        data[gid] = sum * sqrt(12.f);
    }
To compare these operations, run the GPU In-kernel Profiling in the Basic block latency mode and double-click the kernel in the grid to open the Source view.
The Source view analysis highlights the pown() call as the most expensive operation in this kernel.
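One possible follow-up to this finding, sketched here as host-side C rather than OpenCL so it can be compiled and checked without a GPU, is to replace the per-iteration power call with a running product. This is an illustration only, not a recommendation from the original guide: pown_model is a hypothetical stand-in for pown(3.f, i), the function names are invented for the sketch, and sqrt(12) is hard-coded as a constant.

```c
#include <assert.h>

#define SQRT_12 3.4641016f /* sqrt(12), hard-coded so the sketch needs no math library */

/* Hypothetical stand-in for OpenCL pown(): base raised to a non-negative integer power. */
static float pown_model(float base, unsigned n)
{
    float r = 1.0f;
    for (unsigned k = 0; k < n; ++k)
        r *= base;
    return r;
}

/* Same arithmetic as the kernel body: one power call per iteration. */
float viete_with_pown(void)
{
    float c = 0.0f, sum = 0.0f;
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pown_model(3.0f, i);
        p -= c;                    /* apply the compensation term */
        float t = sum + p;
        c = (t - sum) - p;         /* rounding error of this addition */
        sum = t;
    }
    return sum * SQRT_12;
}

/* Variant that carries 3^i across iterations instead of recomputing it. */
float viete_running_power(void)
{
    float c = 0.0f, sum = 0.0f;
    float pow3 = 1.0f;             /* holds 3^i at the top of each iteration */
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= pow3;
        p -= c;
        float t = sum + p;
        c = (t - sum) - p;
        sum = t;
        pow3 *= 3.0f;              /* one multiply replaces the power call */
    }
    return sum * SQRT_12;
}
```

Both functions evaluate the same series, which converges to pi, so their results can be compared directly. Whether the change pays off on a given GPU is something to confirm by re-running the Basic block latency analysis.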
Example: Memory Latency Profiling
You have an OpenCL kernel that performs several memory reads of data[gid]: two inside the loop and one in the final scaling statement:
    __kernel void viete_formula_mem(__global float* data)
    {
        int gid = get_global_id(0);
        float c = 0;
        for (unsigned i = 0; i < 50; ++i)
        {
            float t = 0;
            float p = (i % 2 ? -1 : 1);
            p /= i*2 + 1;
            p /= pown(3.f, i);
            p -= c;
            t = data[gid] + p;          // read
            c = (t - data[gid]) - p;    // read
            data[gid] = t;
        }
        data[gid] *= sqrt(12.f);        // read
    }
To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode.
The Source view analysis shows that the compiler recognizes that each thread works only with its own element of the input buffer and generates code that performs the read only once. The value read from the buffer is kept in a register and reused by the other operations, so the compiler does not generate additional reads.
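The effect described above can be written out by hand. The host-side C sketch below places the kernel's access pattern next to a variant that loads the element once into a local variable and stores it once at the end, which is roughly what the generated code is described as doing. It is an approximation for illustration: the function names, the int_pow stand-in for pown(), the hard-coded sqrt(12) constant, and the run_on_zero helper are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SQRT_12_F 3.4641016f /* sqrt(12), hard-coded so the sketch needs no math library */

/* Hypothetical stand-in for OpenCL pown(): base raised to a non-negative integer power. */
static float int_pow(float base, unsigned n)
{
    float r = 1.0f;
    for (unsigned k = 0; k < n; ++k)
        r *= base;
    return r;
}

/* Mirrors the kernel as written: data[gid] appears in several statements. */
void viete_mem_as_written(float *data, size_t gid)
{
    assert(data != NULL);
    float c = 0.0f;
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= int_pow(3.0f, i);
        p -= c;
        float t = data[gid] + p;
        c = (t - data[gid]) - p;
        data[gid] = t;
    }
    data[gid] *= SQRT_12_F;
}

/* One explicit load into a local, all arithmetic on the local, one store at the end. */
void viete_mem_single_read(float *data, size_t gid)
{
    assert(data != NULL);
    float acc = data[gid];         /* the only read of the buffer */
    float c = 0.0f;
    for (unsigned i = 0; i < 50; ++i) {
        float p = (i % 2 ? -1.0f : 1.0f);
        p /= (float)(i * 2 + 1);
        p /= int_pow(3.0f, i);
        p -= c;
        float t = acc + p;
        c = (t - acc) - p;
        acc = t;
    }
    data[gid] = acc * SQRT_12_F;   /* the only write to the buffer */
}

/* Hypothetical helper: runs one variant on a zero-initialized element and returns it. */
float run_on_zero(void (*variant)(float *, size_t))
{
    float buf[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    variant(buf, 2);
    return buf[2];
}
```

Both variants produce the same value for the same input, which is why a compiler is free to make this transformation when it can prove that no other thread touches the element.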
