gpu-profiling Command Line Analysis

Use the GPU In-kernel Profiling analysis to examine GPU kernel execution per code line, identify performance issues caused by memory latency or inefficient kernel algorithms, and estimate how often specific categories of instructions are executed.

How It Works

The GPU In-kernel Profiling analysis instruments your code and, depending on your configuration settings, helps identify performance-critical basic blocks and issues caused by memory accesses in the GPU kernels.

In the Basic block latency and Memory latency profiling modes, the GPU In-kernel Profiling analysis provides the following key metrics:

  • Estimated GPU Cycles: The average number of GPU cycles per kernel instance.

  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.

  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per kernel instance.

If you enable the Instruction count profiling mode, the VTune Amplifier shows a breakdown of instructions executed by the kernel in the following groups:

  • Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and any mov or add instructions that explicitly change the ip register.

  • Send & Wait group: send, sends, sendc, sendsc, wait.

  • Int16 & HP Float, Int32 & SP Float, and Int64 & DP Float groups:

    • Bit operations (only for integer types): and, or, xor, and others.

    • Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.

    • Vector arithmetic operations: line, dp2, dp4, and others.

    • Extended math operations.

  • Other group: all other operations, including nop.
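The grouping rules above can be illustrated with a small Python sketch. It is not how VTune Amplifier classifies instructions internally. The names `CONTROL_FLOW`, `SEND_WAIT`, and `classify`, and the `writes_ip` flag, are hypothetical and exist only for this illustration. The sketch resolves only the groups whose members are listed in full. Anything else is reported as unresolved, because placing it in a compute group requires the destination operand type.

```python
# Illustrative sketch only. Mnemonics are copied from the group descriptions
# above; every identifier here is hypothetical, not part of any VTune API.
CONTROL_FLOW = {
    "if", "else", "endif", "while", "break", "cont", "call", "calla",
    "ret", "goto", "jmpi", "brd", "brc", "join", "halt",
}
SEND_WAIT = {"send", "sends", "sendc", "sendsc", "wait"}


def classify(mnemonic, writes_ip=False):
    """Return the group name for a mnemonic, or "unresolved".

    writes_ip marks a mov/add whose destination is the ip register,
    which the documentation counts as control flow.
    """
    m = mnemonic.lower()
    if m in CONTROL_FLOW or (m in {"mov", "add"} and writes_ip):
        return "Control Flow"
    if m in SEND_WAIT:
        return "Send & Wait"
    if m == "nop":
        return "Other"
    # Either one of the Int/Float compute groups (decided by the destination
    # operand type) or the Other group; this sketch does not decide which.
    return "unresolved"
```

Under these assumptions, `classify("mov", writes_ip=True)` falls into the Control Flow group, while a plain `classify("mov")` stays unresolved.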

In the Instruction count mode, the VTune Amplifier also provides Operations per second metrics calculated as a weighted sum of the following executed instructions:

  • Bit operations (only for integer types):

    • and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol - weight 1
  • Arithmetic operations:

    • add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub - weight 1

    • avg, frc, mac, mach, mad, madm - weight 2

  • Vector arithmetic operations:

    • line - weight 2
    • dp2, sad2 - weight 3
    • lrp, pln, sada2 - weight 4
    • dp3 - weight 5
    • dph - weight 6
    • dp4 - weight 7
    • dp4a - weight 8
  • Extended math operations:

    • math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos - weight 4

    • math.fdiv, math.pow - weight 8


The type of an operation is determined by the type of a destination operand.
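As a rough illustration of the weighted sum, the following Python sketch applies the weights listed above to a dictionary of per-instruction execution counts. The function names and the input format are hypothetical. Two further points are assumptions and are not stated in this topic: instructions without a listed weight contribute zero, and the per-second value is the weighted sum divided by elapsed time. The sketch also ignores operand types, so it does not enforce the integer-only rule for bit operations.

```python
# Illustrative sketch only: weights are copied from the list above.
# Function names and the counts format are hypothetical.
_WEIGHT_TABLE = {
    1: "and not or xor asr shr shl bfrev bfe bfi1 bfi2 ror rol "
       "add addc cmp cmpn mul rndu rndd rnde rndz sub",
    2: "avg frc mac mach mad madm line",
    3: "dp2 sad2",
    4: "lrp pln sada2 math.inv math.log math.exp math.sqrt "
       "math.rsq math.sin math.cos",
    5: "dp3",
    6: "dph",
    7: "dp4",
    8: "dp4a math.fdiv math.pow",
}

# Flatten to {mnemonic: weight} for direct lookup.
WEIGHTS = {
    name: weight
    for weight, names in _WEIGHT_TABLE.items()
    for name in names.split()
}


def weighted_operations(counts):
    """Sum count * weight over all instructions.

    Assumption: mnemonics without a listed weight (for example, send)
    contribute nothing to the total.
    """
    return sum(WEIGHTS.get(name, 0) * n for name, n in counts.items())


def operations_per_second(counts, elapsed_seconds):
    """Assumption: the metric is the weighted sum divided by elapsed time."""
    return weighted_operations(counts) / elapsed_seconds
```

For example, counts of 100 mad, 50 add, 10 dp4, and 5 send instructions give a weighted sum of 100×2 + 50×1 + 10×7 + 0 = 320 operations.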


Syntax

$ amplxe-cl -collect gpu-profiling [-knob <knobName=knobValue>] -- <target> [target_options]

Knobs: gpu-profiling-mode=bblatency | memlatency | inscount, kernels-to-profile.


For the most current information on available knobs (configuration options) for the GPU In-kernel Profiling, enter:

$ amplxe-cl -help collect gpu-profiling


This example runs the GPU In-kernel Profiling analysis on a Linux target in the memory latency mode and profiles only the specified kernels, kernel1 and kernel2, with a sampling interval of 10 kernels.

$ amplxe-cl -collect gpu-profiling -knob gpu-profiling-mode=memlatency -knob kernels-to-profile=kernel1:1:10:4294967185,kernel2:1:10:4294967185 -- home/test/myApplication
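If you launch collections from a script, the command line can be assembled programmatically. The Python sketch below is a hypothetical helper, not part of VTune Amplifier. It uses only the option and knob names shown in this topic and does not run the tool. The function name `build_gpu_profiling_command` and its parameters are assumptions made for illustration.

```python
import shlex

# Mode values as listed for the gpu-profiling-mode knob in this topic.
VALID_MODES = {"bblatency", "memlatency", "inscount"}


def build_gpu_profiling_command(target, mode=None, kernels=None,
                                target_options=()):
    """Return the amplxe-cl argument list for a gpu-profiling collection.

    Hypothetical helper: each entry in `kernels` is passed through
    unchanged as one comma-separated item of kernels-to-profile.
    """
    cmd = ["amplxe-cl", "-collect", "gpu-profiling"]
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown gpu-profiling-mode: {mode}")
        cmd += ["-knob", f"gpu-profiling-mode={mode}"]
    if kernels:
        cmd += ["-knob", "kernels-to-profile=" + ",".join(kernels)]
    # Everything after "--" is the target and its own options.
    cmd += ["--", target, *target_options]
    return cmd


example = build_gpu_profiling_command(
    "home/test/myApplication",
    mode="memlatency",
    kernels=["kernel1:1:10:4294967185", "kernel2:1:10:4294967185"],
)
# shlex.join (Python 3.8+) renders the list as a single shell command line.
print(shlex.join(example))
```

Printing the joined list reproduces the example command above. The same list could be handed to `subprocess.run` on a machine where `amplxe-cl` is installed and on the PATH.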

What's Next

When the data collection is complete, do one of the following to view the result:
