gpu-profiling
Command Line Analysis

Use the GPU In-kernel Profiling to analyze GPU kernel execution per code line, identify performance issues caused by memory latency or inefficient kernel algorithms, and estimate the execution frequency of specific instruction categories.
This analysis is deprecated starting with VTune Profiler Beta. Use the analysis instead.

How It Works

The GPU In-kernel Profiling instruments your code and, depending on your configuration settings, helps identify performance-critical basic blocks or issues caused by memory accesses in the GPU kernels.
In the Basic block latency and Memory latency profiling modes, the GPU In-kernel Profiling introduces the following key metrics:
  • Estimated GPU Cycles: The average number of GPU cycles per kernel instance.
  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.
  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per kernel instance.
If you enable the Instruction count profiling mode, the VTune Profiler shows a breakdown of instructions executed by the kernel in the following groups:
  • Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, as well as mov and add instructions that explicitly change the ip register.
  • Send & Wait group: send, sends, sendc, sendsc, wait.
  • Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups:
    • Bit operations (only for integer types): and, or, xor, and others.
    • Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
    • Vector arithmetic operations: line, dp2, dp4, and others.
    • Extended math operations.
  • Other group: Contains all other operations, including nop.
In the Instruction count mode, the VTune Profiler also provides Operations per second metrics calculated as a weighted sum of the following executed instructions:
  • Bit operations (only for integer types):
    • and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol: weight 1
  • Arithmetic operations:
    • add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub: weight 1
    • avg, frc, mac, mach, mad, madm: weight 2
  • Vector arithmetic operations:
    • line: weight 2
    • dp2, sad2: weight 3
    • lrp, pln, sada2: weight 4
    • dp3: weight 5
    • dph: weight 6
    • dp4: weight 7
    • dp4a: weight 8
  • Extended math operations:
    • math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos: weight 4
    • math.fdiv, math.pow: weight 8
The type of an operation is determined by the type of a destination operand.
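The weighted sum described above can be illustrated with a minimal Python sketch. The weights are copied from the list; everything else is an assumption made for illustration. The function names, the input format (a mapping from instruction mnemonic to execution count), and the division by an elapsed time in seconds are hypothetical and are not taken from the VTune Profiler. The sketch also does not split results by data type, although the profiler reports them per destination operand type.

```python
# Weights copied from the list above. Mnemonics that are not listed
# (for example send or nop) contribute nothing to the sum.
OP_WEIGHTS = {}
for weight, mnemonics in [
    (1, "and not or xor asr shr shl bfrev bfe bfi1 bfi2 ror rol"),
    (1, "add addc cmp cmpn mul rndu rndd rnde rndz sub"),
    (2, "avg frc mac mach mad madm line"),
    (3, "dp2 sad2"),
    (4, "lrp pln sada2"),
    (5, "dp3"),
    (6, "dph"),
    (7, "dp4"),
    (8, "dp4a"),
    (4, "math.inv math.log math.exp math.sqrt math.rsq math.sin math.cos"),
    (8, "math.fdiv math.pow"),
]:
    for name in mnemonics.split():
        OP_WEIGHTS[name] = weight


def weighted_operations(counts):
    """Return the weighted sum of executed instructions.

    counts maps an instruction mnemonic to the number of times it executed
    (hypothetical input format, chosen for this sketch).
    """
    return sum(OP_WEIGHTS.get(name, 0) * n for name, n in counts.items())


def operations_per_second(counts, elapsed_seconds):
    """Divide the weighted sum by an elapsed time (an assumption of this sketch)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return weighted_operations(counts) / elapsed_seconds


# Made-up counts, for illustration only.
sample = {"add": 100, "mad": 50, "dp4": 10, "send": 20}
print(weighted_operations(sample))         # 100*1 + 50*2 + 10*7 + 0 = 270
print(operations_per_second(sample, 0.5))  # 540.0
```

In this sketch send has no weight, so it counts toward the Send & Wait group in the instruction breakdown but adds nothing to the weighted sum.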

Syntax

vtune -collect gpu-profiling [-knob <knobName=knobValue>] -- <target> [target_options]
Knobs: gpu-profiling-mode=bblatency | memlatency | inscount, computing-task-of-interest.
For the most current information on available knobs (configuration options) for the GPU In-kernel Profiling, enter:
vtune -help collect gpu-profiling
Example
This example runs GPU In-kernel Profiling for a Linux target to identify memory latency issues. It analyzes only the specified kernels, kernel1 and kernel2, with a sampling interval of 10 kernels.
vtune -collect gpu-profiling -knob gpu-profiling-mode=memlatency -knob computing-task-of-interest=kernel1:1:10:4294967185,kernel2:1:10:4294967185 -- home/test/myApplication
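If you launch collections from a script, the command line above can be assembled programmatically. The following Python sketch is one possible way to do that. The helper name build_gpu_profiling_command and its parameters are hypothetical. Only the vtune options and knob values shown in this section are used, and the task strings are passed through unchanged.

```python
import shlex

# Mode values listed under Knobs in this section.
VALID_MODES = ("bblatency", "memlatency", "inscount")


def build_gpu_profiling_command(target, mode=None, tasks=None, target_options=()):
    """Assemble the argument list for a gpu-profiling collection.

    Hypothetical helper: mode becomes the gpu-profiling-mode knob, and tasks
    is a list of strings joined with commas into computing-task-of-interest.
    """
    argv = ["vtune", "-collect", "gpu-profiling"]
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown gpu-profiling-mode: {mode}")
        argv += ["-knob", f"gpu-profiling-mode={mode}"]
    if tasks:
        argv += ["-knob", "computing-task-of-interest=" + ",".join(tasks)]
    # Everything after "--" is the target application and its options.
    argv += ["--", target, *target_options]
    return argv


argv = build_gpu_profiling_command(
    "home/test/myApplication",
    mode="memlatency",
    tasks=["kernel1:1:10:4294967185", "kernel2:1:10:4294967185"],
)
# Print the command for inspection; requires Python 3.8+ for shlex.join.
print(shlex.join(argv))
```

Keeping the arguments as a list means the same value can be passed to subprocess.run(argv) without shell quoting concerns.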

What's Next

When the data collection is complete, do one of the following to view the result:
