Use the GPU In-kernel Profiling analysis to examine GPU kernel execution per code line, identify performance issues caused by memory latency or inefficient kernel algorithms, and analyze how frequently specific types of GPU instructions are executed.


  • GPU In-kernel Profiling is available on processors based on Intel® microarchitecture code name Broadwell and later.

  • Since the GPU In-kernel Profiling incurs higher performance overhead than the GPU Compute/Media Hotspots analysis, you may consider first running the GPU Compute/Media Hotspots analysis to identify the hottest GPU computing task (GPU kernel) and then exploring this kernel with the GPU In-kernel Profiling.

How It Works

The GPU In-kernel Profiling analysis instruments your code and, depending on your configuration settings, helps you identify performance-critical basic blocks and issues caused by memory accesses in the GPU kernels.

In the Basic block latency and Memory latency profiling modes, the GPU In-kernel Profiling analysis reports the following key metrics:

  • Estimated GPU Cycles: The average number of GPU cycles per kernel instance.

  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.

  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by a single thread per kernel instance.
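To show how the three metrics above differ, here is a minimal Python sketch that averages hypothetical per-instance totals. The `KernelInstance` record, its fields, and the exact formulas are assumptions made for illustration. The source does not state how VTune Amplifier computes these values internally.

```python
from dataclasses import dataclass


@dataclass
class KernelInstance:
    # Hypothetical per-instance totals; not a VTune Amplifier data structure.
    gpu_cycles: int
    instructions: int
    threads: int


def average_metrics(instances):
    """Average the per-instance totals over all profiled kernel instances."""
    n = len(instances)
    if n == 0:
        raise ValueError("at least one kernel instance is required")
    return {
        # Average GPU cycles per kernel instance.
        "estimated_gpu_cycles": sum(i.gpu_cycles for i in instances) / n,
        # Average instructions executed per kernel instance.
        "instructions_per_instance": sum(i.instructions for i in instances) / n,
        # Assumed reading: instructions per thread within each instance,
        # then averaged across instances.
        "instructions_per_thread": sum(i.instructions / i.threads for i in instances) / n,
    }


metrics = average_metrics([
    KernelInstance(gpu_cycles=1000, instructions=400, threads=4),
    KernelInstance(gpu_cycles=3000, instructions=800, threads=8),
])
```

In this made-up data set the two instances execute different numbers of instructions in total but the same number per thread, which is the kind of difference the per-thread metric is meant to expose.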

If you enable the Instruction count profiling mode, VTune Amplifier shows a breakdown of the instructions executed by the kernel into the following groups:

  • Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and the mov and add instructions that explicitly change the ip register.

  • Send & Wait group: send, sends, sendc, sendsc, wait.

  • Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups:

    • Bit operations (only for integer types): and, or, xor, and others.

    • Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.

    • Vector arithmetic operations: line, dp2, dp4, and others.

    • Extended math operations.

  • Other group: all other operations, including nop.

In the Instruction count mode, VTune Amplifier also provides Operations per second metrics, calculated as a weighted sum of the following executed instructions:

  • Bit operations (only for integer types):

    • and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol - weight 1
  • Arithmetic operations:

    • add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub - weight 1

    • avg, frc, mac, mach, mad, madm - weight 2

  • Vector arithmetic operations:

    • line - weight 2
    • dp2, sad2 - weight 3
    • lrp, pln, sada2 - weight 4
    • dp3 - weight 5
    • dph - weight 6
    • dp4 - weight 7
    • dp4a - weight 8
  • Extended math operations:

    • math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos - weight 4

    • math.fdiv, math.pow - weight 8


The type of an operation is determined by the type of its destination operand.
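As a rough illustration of the weighted sum described above, the following Python sketch applies the listed weights to a dictionary of instruction counts. The names `OPERATION_WEIGHTS`, `weighted_operations`, and `operations_per_second` are hypothetical. Dividing the weighted sum by the elapsed kernel time is an assumption about how a per-second rate could be derived, since the source gives only the weights. The sketch also assumes the counts already belong to a single type group and ignores any other factors the tool may apply.

```python
# Weights copied from the list above; instructions not listed get weight 0.
OPERATION_WEIGHTS = {}


def _register(weight, mnemonics):
    """Assign one weight to a space-separated list of mnemonics."""
    for mnemonic in mnemonics.split():
        OPERATION_WEIGHTS[mnemonic] = weight


# Bit operations (integer types only).
_register(1, "and not or xor asr shr shl bfrev bfe bfi1 bfi2 ror rol")
# Arithmetic operations.
_register(1, "add addc cmp cmpn mul rndu rndd rnde rndz sub")
_register(2, "avg frc mac mach mad madm")
# Vector arithmetic operations.
_register(2, "line")
_register(3, "dp2 sad2")
_register(4, "lrp pln sada2")
_register(5, "dp3")
_register(6, "dph")
_register(7, "dp4")
_register(8, "dp4a")
# Extended math operations.
_register(4, "math.inv math.log math.exp math.sqrt math.rsq math.sin math.cos")
_register(8, "math.fdiv math.pow")


def weighted_operations(counts):
    """Return the weighted sum of executed instructions."""
    return sum(OPERATION_WEIGHTS.get(m, 0) * n for m, n in counts.items())


def operations_per_second(counts, elapsed_seconds):
    """Assumed rate: weighted operations divided by elapsed kernel time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return weighted_operations(counts) / elapsed_seconds


# Made-up counts for one kernel; send is not in the weighted list.
counts = {"add": 10, "mad": 5, "dp4": 2, "send": 100, "math.pow": 1}
total = weighted_operations(counts)        # 10*1 + 5*2 + 2*7 + 0 + 1*8 = 42
rate = operations_per_second(counts, 2.0)  # 21.0
```

Instructions outside the weighted list, such as send, contribute nothing to the sum, so a kernel dominated by memory traffic shows a low rate even when its total instruction count is high.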

Configure and Run Analysis

To run the GPU In-kernel Profiling analysis:

Prerequisites: Create a project and specify an analysis target and system.

  1. Click the Configure Analysis button on the Intel® VTune™ Amplifier toolbar.

    The New Amplifier Result tab opens.

  2. From the HOW pane, click the Browse button and select Platform Analysis > GPU In-kernel Profiling.

  3. From the Profiling mode drop-down menu, select a type of issues you want to analyze:

    • Basic blocks latency option helps you identify issues caused by algorithm inefficiencies. In this mode, VTune Amplifier measures the execution time of all basic blocks. A basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end of the sequence. During post-processing, VTune Amplifier calculates the execution time for each instruction in the basic block, so this mode helps you understand which compute instructions are more expensive. Example: Basic Block Latency Profiling.

    • Memory latency option helps you identify latency issues caused by memory accesses. In this mode, VTune Amplifier profiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option if you ran the GPU Compute/Media Hotspots analysis, found that the GPU kernel is throughput-bound or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take more time. Example: Memory Latency Profiling.

    • Instruction count (preview) option counts the execution frequency of specific classes of instructions. It also enables measuring arithmetic operations per second, which is a common metric for comparing different algorithms or variants of their implementation. While you optimize a kernel, this mode provides an estimate of how close an implementation comes to the theoretical arithmetic peak performance of the target device, so you can use it to track your optimization progress.


      For the theoretical peak throughput of the compute architecture of Intel processor graphics, see The Compute Architecture of Intel® Processor Graphics Gen9 guide.

  4. Optionally, to narrow down the analysis to specific kernels and minimize the profiling overhead, specify the kernels of interest to profile. If required, modify the Instance step for each kernel, which is a sampling interval (in the number of kernels).

  5. Click Start to run the analysis.
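To make the basic block definition from step 3 concrete, here is a minimal Python sketch that splits a flat list of instruction mnemonics into basic blocks. The function `split_basic_blocks` and the `jump_targets` parameter are hypothetical, and treating every Control Flow group mnemonic as the end of a block is a simplifying assumption. It is not a description of how VTune Amplifier or the GPU compiler builds basic blocks.

```python
# Control-flow mnemonics taken from the Control Flow group; assumed to end a block.
BRANCH_MNEMONICS = {
    "if", "else", "endif", "while", "break", "cont", "call", "calla",
    "ret", "goto", "jmpi", "brd", "brc", "join", "halt",
}


def split_basic_blocks(instructions, jump_targets=()):
    """Split a list of mnemonics into straight-line blocks.

    instructions: list of mnemonic strings in program order.
    jump_targets: indices of instructions that a branch may jump to.
    """
    # A block starts at the first instruction, at every jump target,
    # and right after every control-flow instruction.
    leaders = {0} | set(jump_targets)
    for index, mnemonic in enumerate(instructions):
        if mnemonic in BRANCH_MNEMONICS:
            leaders.add(index + 1)
    ordered = sorted(i for i in leaders if 0 <= i < len(instructions))
    # Each block runs from one leader up to (not including) the next.
    ends = ordered[1:] + [len(instructions)]
    return [instructions[start:end] for start, end in zip(ordered, ends)]


kernel = ["mov", "add", "if", "mul", "else", "sub", "endif", "send"]
blocks = split_basic_blocks(kernel)
```

Each resulting block has a single entry at its first instruction and a single exit at its last, matching the definition in step 3, so a time measured for the whole block can be attributed to the instructions inside it.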

View Data

VTune Amplifier runs the analysis and opens the data in the GPU Compute/Media Hotspots viewpoint, which provides platform data in the following windows:

  • Summary window provides high-level statistics on how your application uses GPU resources and helps you identify the hottest GPU Computing Tasks.
  • Graphics window displays GPU metrics per kernel instance.
