Use the Intel® VTune™ Amplifier's GPU Compute/Media Hotspots viewpoint to identify causes of poor CPU utilization such as inefficient synchronization.

Use the following workflow to analyze results collected by the GPU In-kernel Profiling analysis type:

  1. Identify the hottest OpenCL kernels.

  2. Dive to source to identify the most expensive operations and analyze instruction execution.

Identify the Hottest OpenCL™ Kernels

In the GPU Compute/Media Hotspots viewpoint, you can start with the Summary window to identify the hottest GPU Computing Task, click it to navigate to the Graphics window and explore metrics collected for this hotspot:

GPU In-Kernel Profiling

Analyze Instruction Execution

Double-clicking the hot kernel in the Graphics window opens its source code:

The GPU In-kernel Profiling provides a full-scale analysis of the kernel source per code line. The hottest kernel code line is highlighted by default.

To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly view:

Note

If your OpenCL kernel uses inline functions, make sure to enable the Inline Mode on the filter toolbar to have a correct attribution of the GPU Cycles per function. See examples.

Example: Basic Block Latency Profiling

You have a kernel that performs compute operations:

__kernel void viete_formula_comp(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0, sum = 0;

    for (unsigned i = 0; i < 50; ++i)
    {
            float t = 0;
            float p = (i % 2 ? -1 : 1);
            p /= i*2 + 1;
            p /= pown(3.f, i);
            p -=c;

            t = sum + p;
            c = (t - sum) - p;
            sum = t;
    }
    data[gid] = sum * sqrt(12.f);
}

To compare these operations, run the GPU In-kernel profiling in the Basic block latency mode and double-click the kernel in the grid to open the Source view:

The Source view analysis highlights the pown() call as the most expensive operation in this kernel.

Example: Memory Latency Profiling

You have a kernel that performs several memory reads (lines 14, 15 and 20):

 __kernel void viete_formula_mem(__global float* data)
{
    int gid = get_global_id(0);
    float c = 0;

    for (unsigned i = 0; i < 50; ++i)
    {
            float t = 0;
            float p = (i % 2 ? -1 : 1);
            p /= i*2 + 1;
            p /= pown(3.f, i);
            p -=c;

            t = data[gid] + p;
            c = (t - data[gid]) - p;

            data[gid] = t;
    }
    data[gid] *= sqrt(12.f);
}

To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory latency mode:

The Source view analysis shows that the compiler understands that each thread works only with its own element from the input buffer and generates the code that performs the read only once. The value from the input buffer is stored in the registry and reused in other operations, so the compiler does not generate additional reads.

优化通知: 

standard

有关编译器优化的更完整信息,请参阅优化通知
选择粘滞按钮颜色: 
Orange (only for download buttons)