GPU Compute/Media Hotspots View

Use the Intel® VTune™ Profiler's GPU Compute/Media Hotspots viewpoint to analyze how your GPU-bound code is utilizing GPU and CPU resources.

Depending on the profiling mode selected for the GPU Compute/Media Hotspots analysis, you can explore your GPU-side code performance from different perspectives:

Analyze Memory Accesses

The Characterization mode, which is enabled by default in the GPU Compute/Media Hotspots configuration, is a recommended entry-level analysis for GPU-bound applications.

The Hottest GPU Computing Task section of the Summary window displays the most time-consuming GPU tasks. Click a task to switch to the Graphics window and explore the GPU hardware metrics (by default, the Overview set of metrics) collected for this hotspot.

In-Kernel Analysis

Analyze GPU Instruction Execution

If you enabled the Dynamic Instruction Count preset as part of the Characterization analysis configuration, in the Assembly view the VTune Profiler shows a breakdown of instructions executed by the kernel in the following groups:

Control Flow group

if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and the mov and add instructions that explicitly change the ip register.

Send group

send, sends, sendc, sendsc

Synchronization group

wait

Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups

Bit operations (only for integer types): and, or, xor, and others.

Arithmetic operations: mul, sub, avg, frc, mac, mach, mad, madm, and others.

Vector arithmetic operations: line, dp2, dp4, and others.

Extended math operations: math.sin, math.cos, math.sqrt, and others.

Other group

Contains all other operations including nop.

Note

The type of an operation is determined by the type of a destination operand.

Analyze Source

If you selected the Source Analysis mode for the GPU Compute/Media Hotspots analysis, you can analyze a kernel of interest for basic block latency or memory latency issues. To do this, in the Graphics window, expand the kernel node and double-click the function name. VTune Profiler opens the Source view at the hottest source line for the selected function.

The GPU Compute/Media Hotspots analysis provides a full-scale analysis of the kernel source per code line. The hottest kernel code line is highlighted by default.

To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly view.

Note

If your OpenCL kernel uses inline functions, make sure to enable Inline Mode on the filter toolbar so that GPU Cycles are attributed to the correct function. See the examples below.

Example: Basic Block Latency Profiling

You have an OpenCL kernel that performs compute operations:

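__kernel void viete_formula_comp(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;   // c: running compensation, sum: running total

    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);   // alternating sign of the i-th term
        p /= i*2 + 1;
        p /= pown(3.f, i);            // divide by 3^i
        p -= c;                       // apply the compensation from the previous step

        t = sum + p;
        c = (t - sum) - p;            // rounding error of the addition above
        sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}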

To compare the cost of these operations, run the GPU In-kernel Profiling in the Basic block latency mode and double-click the kernel in the grid to open the Source view.

The Source view analysis highlights the pown() call as the most expensive operation in this kernel.

Example: Memory Latency Profiling

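You have an OpenCL kernel that performs several memory reads of data[gid]: two inside the loop (in the statements that compute t and c) and one in the final data[gid] *= sqrt(12.f) statement: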

__kernel void viete_formula_mem(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;

    for (unsigned i = 0; i < 50; ++i)
    {
        float t = 0;
        float p = (i % 2 ? -1 : 1);
        p /= i*2 + 1;
        p /= pown(3.f, i);
        p -= c;

        t = data[gid] + p;
        c = (t - data[gid]) - p;

        data[gid] = t;
    }
    data[gid] *= sqrt(12.f);
}

To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode.

The Source view analysis shows that the compiler recognizes that each thread works only with its own element of the input buffer and generates code that performs the read only once. The value from the input buffer is stored in a register and reused in the other operations, so the compiler does not generate additional reads.
