User Guide

knob

Set configuration options for the specified analysis type or collector type.

Syntax

-knob | -k <knob-name>=<knob-value>
Arguments
knob-name
An analysis type or collector type may have one or more configuration options (knobs) that provide additional instructions for performing the specified type of analysis. To use a knob, you must specify the knob name and knob value.
Multiple knob options are allowed and can be followed by additional action-options, as well as global-options, if needed.
knob-value
Each knob accepts a defined set of values. In most cases this is a Boolean value, so for Boolean knobs, specify <knob-name>=true to enable the knob.
Knob behavior may vary depending on the analysis type or collector type.
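As a minimal sketch of how several knobs combine on one command line, the snippet below assembles a Threading analysis with two Boolean knobs from the table on this page. The target path ./my_app is a placeholder, and the command is printed rather than executed so it can be inspected first.

```shell
# Placeholder target; replace with your own application (assumption).
target="./my_app"

# Each knob is passed as its own -knob <knob-name>=<knob-value> pair.
cmd="vtune -collect threading"
cmd="$cmd -knob analyze-openmp=true"
cmd="$cmd -knob enable-user-sync=true"
cmd="$cmd -- $target"

# Print the assembled command; to run it, use: eval "$cmd"
echo "$cmd"
```

Both knobs list threading under Supported analysis, which is why they can share one command here.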
<knob-name>
Description
accurate-cpu-time-detection=true | false
(Windows only)
Default:
true
Collect more accurate CPU time data. This option requires additional disk space and post-processing time. Administrator privileges are required.
Supported analysis:
runss
analyze-loops=true | false
Default:
false
Extend loop analysis to collect advanced loops information such as instruction set usage and display analysis results by loops and functions.
Supported analysis:
runss, runsa
analyze-mem-objects=true | false
Default:
false
Enable the instrumentation of memory allocation/de-allocation and map hardware events to memory objects. This option is supported only for Linux targets running on the Intel microarchitecture code name Sandy Bridge (or later).
Supported analysis:
memory-access
analyze-openmp=true | false
Default:
true for the HPC Performance Characterization analysis; false for other analysis types.
Instrument the OpenMP* runtimes in your application to group performance data by regions/work-sharing constructs and detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. Using this option may cause higher overhead and increase the result size.
Supported analysis:
hotspots, threading, hpc-performance, memory-access, uarch-exploration, runsa
atrace-config=<event>
Available events are gfx, input, view, webview, wm, am, audio, video, camera, hal, res, dalvik.
Collect Android framework events from Systrace*.
Supported analysis:
runsa
characterization-mode=overview | global-local-accesses | compute-extended | full-compute | instruction-count
Default:
overview
Monitor the Render and GPGPU engine usage (Intel Graphics only), identify which parts of the engine are loaded, and correlate GPU and CPU data.
The Characterization mode uses platform-specific presets of the GPU metrics. All presets except instruction-count collect data about execution unit (EU) activity: EU Array Active, EU Array Stalled, EU Array Idle, Computing Threads Started, and Core Frequency. Each preset adds its own metrics:
  • overview metric set includes additional metrics that track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.
  • global-local-accesses metric group includes additional metrics that distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
  • compute-extended metric group includes additional metrics targeted only for GPU analysis on the Intel processor code name Broadwell and higher. For other systems, this preset is not available.
  • full-compute metric group is a combination of the overview and global-local-accesses event sets.
  • instruction-count metric group counts the execution frequency of specific classes of instructions.
Supported analysis:
gpu-hotspots, graphics-rendering, runsa
chipset-event-config="event1,event2,..."
Specify a comma-separated list of Android chipset events (up to 5 events) to monitor with the hardware event-based sampling collector.
Supported analysis:
runsa
source-analysis=bb-latency | mem-latency
Default value:
bb-latency
Collect data on performance-critical basic blocks and issues caused by memory accesses in the GPU kernels. Choose one of the following modes:
  • bb-latency mode helps you identify issues caused by algorithm inefficiencies. In this mode, VTune Profiler measures the execution time of all basic blocks. A basic block is a straight-line code sequence that has a single entry point at the beginning of the sequence and a single exit point at the end of the sequence. During post-processing, VTune Profiler calculates the execution time for each instruction in the basic block, so this mode helps you understand which operations are more expensive.
  • mem-latency mode helps you identify latency issues caused by memory accesses. In this mode, VTune Profiler profiles memory read/synchronization instructions to estimate their impact on the kernel execution time. Consider using this option if you ran the gpu-hotspots analysis in the Characterization mode, identified that the GPU kernel is throughput or memory-bound, and want to explore which memory read/synchronization instructions from the same basic block take more time.
Supported analysis:
gpu-hotspots
collect-bad-speculation=true | false
Default value:
true
Collect the minimum set of data required to compute top-level metrics and all Bad Speculation sub-metrics.
Supported analysis:
uarch-exploration, runsa
collect-core-bound=true | false
Default:
false
Collect the minimum set of data required to compute top-level metrics and all Core Bound sub-metrics.
Supported analysis:
uarch-exploration, runsa
collect-frontend-bound=true | false
Default value:
true
Collect the minimum set of data required to compute top-level metrics and all Front-End Bound sub-metrics.
Supported analysis:
uarch-exploration, runsa
collect-io-waits=true | false
Default:
false
Analyze the percentage of time each thread and CPU spends in I/O wait state.
Supported analysis:
runsa
collect-memory-bandwidth=true | false
Default: depends on analysis type
Collect data to identify where your application is generating significant bandwidth to DRAM. To view collected data in GUI, enable the Analyze memory bandwidth option.
Supported analysis:
performance-snapshot, uarch-exploration, hpc-performance, gpu-hotspots, gpu-offload, runsa
collect-memory-bound=true | false
Default value:
true
Collect the minimum set of data required to compute top-level metrics and all Memory Bound sub-metrics.
Supported analysis:
uarch-exploration, hpc-performance
collect-programming-api=true | false
Default for gpu-hotspots: true, for runss: false.
Analyze execution of DPC++ apps, OpenCL™ kernels and Intel® Media SDK programs on Intel HD Graphics and Intel® Iris® Graphics. This option may affect the performance of your application on the CPU side.
Supported analysis:
gpu-hotspots, gpu-offload, runsa
collect-retiring=true | false
Default value:
true
Collect the minimum set of data required to compute top-level metrics and all Retiring sub-metrics.
Supported analysis:
uarch-exploration, runsa
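The collect-* knobs for the top-level categories can be combined to narrow a Microarchitecture Exploration run to one category. The sketch below keeps Memory Bound and switches the others off. The target path is a placeholder, and the effect of disabling categories on overhead is not stated on this page, so treat that as something to measure.

```shell
# Top-down category to keep; the other collect-* knobs are switched off.
keep="memory-bound"
target="./my_app"   # placeholder target (assumption)

cmd="vtune -collect uarch-exploration"
for category in frontend-bound bad-speculation memory-bound core-bound retiring; do
    if [ "$category" = "$keep" ]; then
        value=true
    else
        value=false
    fi
    # Knob names follow the collect-<category> pattern used in the table.
    cmd="$cmd -knob collect-$category=$value"
done
cmd="$cmd -- $target"

# Print the assembled command; to run it, use: eval "$cmd"
echo "$cmd"
```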
collecting-mode=hw-sampling | hw-tracing
Default value:
hw-sampling
Specify the system-wide collection mode to either explore CPU, GPU, and I/O resources utilization with the default event-based sampling mode, or enable the low-overhead hardware tracing and identify a root cause of latency issues.
Supported analysis:
system-overview, runsa
computing-task-of-interest=computing_task_name[#start_idx#step#stop_idx]
Specify a comma-separated list of GPU computing task names and invocations. computing_task_name is the name of a GPU computing task; start_idx is the number of the first invocation; and stop_idx is the number of the last invocation to be profiled.
Supported analysis:
gpu-hotspots, runsa
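The value format for computing-task-of-interest is easy to get wrong by hand, so here is a small sketch that builds it. The kernel names matmul_kernel and reduce_kernel are hypothetical, and the step field is assumed to be the stride between profiled invocations, which the description above does not spell out.

```shell
# Build one entry of the form <name>#<start_idx>#<step>#<stop_idx>.
task_entry() {
    printf '%s#%s#%s#%s' "$1" "$2" "$3" "$4"
}

# Hypothetical kernel names; a bare name omits the optional invocation range.
tasks="$(task_entry matmul_kernel 1 2 9),reduce_kernel"

# Quote the knob value so the shell passes the '#' characters through unchanged.
cmd="vtune -collect gpu-hotspots -knob computing-task-of-interest=\"$tasks\" -- ./my_app"
echo "$cmd"
```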
counting-mode=true | false
Default:
false
Choose between collecting detailed context data for each PMU event (such as code or hardware context) or the counts of events. Counting mode introduces less overhead but gives less information.
Supported analysis:
runsa
cpu-samples-mode=off | stack | nostack
Default:
false
Enable to periodically sample the application. Samples can be collected with or without stacks.
Supported analysis:
runss
dpdk=true | false
Default:
false
Profile DPDK IO API.
Supported analysis:
io
dram-bandwidth-limits=true | false
Default:
true for the HPC Performance Characterization and Microarchitecture Exploration analysis with the collect-memory-bandwidth knob enabled; true for the Memory Access and Microarchitecture Exploration analysis.
Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.
Supported analysis:
performance-snapshot, memory-access, uarch-exploration, hpc-performance, runsa
enable-characterization-insights=true | false
Get additional performance insights such as the efficiency of hardware usage, and learn next steps.
Supported analysis:
gpu-offload
enable-context-switches=true | false
Default:
false
Analyze detailed scheduling layout for all threads in your application, explore time spent on a context switch, and identify the nature of context switches for a thread (preemption or synchronization).
Supported analysis:
runsa
enable-driverless-collection=true | false
Default:
false
Enable driverless Linux Perf collection when possible.
Supported analysis:
runsa
enable-gpu-usage=true | false
Default:
false
Analyze frame rate and usage of Intel HD Graphics and Intel® Iris® Graphics engines and identify whether your application is GPU or CPU bound.
Supported analysis:
runss, runsa
enable-interrupt-collection=true | false
Default:
false
Collect interrupt events that alter a normal execution flow of a program. Such events can be generated by hardware devices or by CPUs. Use this data to identify slow interrupts that affect your code performance.
Supported analysis:
system-overview
enable-parallel-fs-collection=true | false
Default:
false
Analyze Lustre* file system performance statistics, including Bandwidth, Package Rate, Average Packet Size, and others.
Supported analysis:
runsa
enable-stack-collection=true | false
Default:
false
Supported analysis:
hotspots, hpc-performance, gpu-offload, runsa
enable-system-cswitch=true | false
Default:
false
Analyze detailed scheduling layout for all threads on the system and identify the nature of context switches for a thread (preemption or synchronization).
Supported analysis:
runsa
enable-thread-affinity=true | false
Default:
false
Analyze thread pinning to sockets, physical cores, and logical cores. Identify incorrect affinity that utilizes logical cores instead of physical cores and contributes to poor physical CPU utilization.
Affinity information is collected at the end of the thread lifetime, so the resulting data may not show the whole issue for dynamic affinity that is changed during the thread lifetime.
enable-user-sync=true | false
Default:
false
Collect synchronization data via the User-Defined Synchronization API.
Supported analysis:
threading, runss
enable-user-tasks=true | false
Default:
false
Analyze tasks, events and counters specified in your application via the Task API. This option causes higher overhead and increases result size.
Supported analysis:
hotspots, threading, runss, uarch-exploration, runsa
event-config=<event_name1>,<event_name2>,...
Configure PMU events to collect with the hardware event-based sampling collector. Multiple events can be specified as a comma-separated list (no spaces).
To display a list of events available on the target PMU, enter:
vtune -collect-with runsa -knob event-config=? <target>
The command returns names and short descriptions of available events. For more information on the events, use Intel Processor Events Reference.
Supported analysis:
runsa
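Because event-config takes a comma-separated list with no spaces, it can help to build the value programmatically. The sketch below joins two event names and prints both the listing command and the collection command. CPU_CLK_UNHALTED.REF_TSC appears in the example on this page; INST_RETIRED.ANY is assumed to be available on the target PMU, and ./my_app is a placeholder target.

```shell
# Join event names with commas and no spaces, as event-config requires.
events=""
for event in CPU_CLK_UNHALTED.REF_TSC INST_RETIRED.ANY; do
    events="${events:+$events,}$event"
done

target="./my_app"   # placeholder target (assumption)

# Quote '?' so the shell does not try to expand it as a glob pattern.
list_cmd="vtune -collect-with runsa -knob 'event-config=?' $target"
run_cmd="vtune -collect-with runsa -knob event-config=$events -- $target"

echo "$list_cmd"
echo "$run_cmd"
```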
event-mode=all | user | os
Default:
all
Limit event-based sampling collection to OS or USER mode.
Supported analysis:
hotspots, runsa
ftrace-config=<event_name>
Available events are freq, idle, sched, disk, filesystem, irq, kvm, workq, softirq, sync.
Default for Linux targets:
sched,freq,idle,workq,irq,softirq
Default for Android targets:
sched,freq,idle,workq,filesystem,irq,softirq,sync,disk
Collect Linux Ftrace* framework events.
Supported analysis:
runsa, runss
gpu-sampling-interval=<number>
A number between 0.1 and 1000 (in milliseconds).
Default: 1
Specify an interval between GPU samples (in milliseconds).
Supported analysis:
gpu-hotspots, graphics-rendering, runss, runsa
io-mode=off | stack | nostack
Default:
off
Enable to identify where threads are waiting or compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
Supported analysis:
runss, runsa
ipt-regions-to-load=<number>
A number between 10 and 5000.
Default:
1000
Specify the maximum number (10-5000) of code regions to load for detailed analysis.
Supported analysis:
anomaly-detection
kernel-stack=true | false
Default:
true
Profile system disk IO API.
Supported analysis:
io
max-region-duration=<number>
A number between 0.001 and 1000 (in milliseconds).
Default:
100
Specify the maximum duration (0.001-1000ms) of analysis per code region.
Supported analysis:
anomaly-detection
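The two Anomaly Detection knobs take bounded numeric values. As a rough sketch, the snippet below checks the integer range for ipt-regions-to-load before assembling the command. The chosen values and the ./my_app target are placeholders, and max-region-duration is passed through unchecked because it accepts decimals.

```shell
# Return success when $1 is a non-negative integer within [$2, $3].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

regions=2000       # ipt-regions-to-load: 10-5000, default 1000
duration_ms=50     # max-region-duration: 0.001-1000 ms, default 100

if in_range "$regions" 10 5000; then
    cmd="vtune -collect anomaly-detection"
    cmd="$cmd -knob ipt-regions-to-load=$regions"
    cmd="$cmd -knob max-region-duration=$duration_ms"
    cmd="$cmd -- ./my_app"    # placeholder target (assumption)
    echo "$cmd"
else
    echo "ipt-regions-to-load must be between 10 and 5000" >&2
fi
```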
mem-object-size-min-thres=<number>
Default: 1024 bytes
Specify a minimal size of memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation.
This option is supported only for Linux targets running on the Intel microarchitecture code name Sandy Bridge (or later).
Supported analysis:
memory-access
mrte-type=java,dotnet | java,dotnet,python | python
Default:
java,dotnet
Specify a type of managed runtime to analyze. Available values: combined .NET* and Java* analysis, combined Java, .NET and Python* analysis, and Python only.
Supported analysis:
runss, runsa
no-altstack=true | false
Default:
false
Disable using alternative stacks for signal handlers. Consider this option for profiling standard Python 3 code on Linux.
Supported analysis:
runss
pmu-collection-mode=detailed | summary
Default:
detailed
Choose the detailed sampling-based collection mode to view data breakdown per function and other hotspots. Use the summary counting-based mode for an overview of the whole profiling run. The summary mode has lower collection overhead and faster post-processing.
Supported analysis:
uarch-exploration
profiling-mode=characterization | code-level-analysis
Default:
characterization
Select a profiling mode to either characterize GPU performance issues based on GPU hardware metric presets or enable a source analysis to identify basic blocks latency due to algorithm inefficiencies, or memory latency due to memory access issues.
Supported analysis:
gpu-hotspots, runsa
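The profiling mode is typically paired with a second knob that refines it. The sketch below assumes, based on the descriptions on this page, that characterization-mode applies in characterization mode and source-analysis applies in code-level-analysis mode. The helper name gpu_hotspots_cmd and the ./my_app target are illustrative only.

```shell
# Print a gpu-hotspots command for a profiling mode and its refining knob value.
gpu_hotspots_cmd() {
    mode="$1"     # characterization | code-level-analysis
    detail="$2"   # value for the refining knob
    case "$mode" in
        characterization)    refine="characterization-mode=$detail" ;;
        code-level-analysis) refine="source-analysis=$detail" ;;
        *) echo "unknown profiling mode: $mode" >&2; return 1 ;;
    esac
    echo "vtune -collect gpu-hotspots -knob profiling-mode=$mode -knob $refine -- ./my_app"
}

gpu_hotspots_cmd characterization global-local-accesses
gpu_hotspots_cmd code-level-analysis mem-latency
```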
sampling-interval=<number>
For user-mode sampling and tracing types: a number (in milliseconds) between 1 and 1000. Default: 10
For hardware event-based sampling types: a number (in milliseconds) between 0.01 and 1000. Default: 1.
Specify a sampling interval (in milliseconds) between CPU samples.
Supported analysis:
hotspots, runss, threading, runsa, system-overview, memory-access, hpc-performance
sampling-mode=sw | hw
Default:
sw
Specify a profiling mode.
Use sw to identify CPU hotspots and explore a call flow of your program. This mode does not require sampling drivers to be installed but incurs more collection overhead.
Use hw to identify application hotspots based on such basic hardware events as Clockticks and Instructions Retired. This is a low-overhead collection mode but it requires the sampling driver to be installed on your system.
Supported analysis:
hotspots, threading
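One way to act on the sw/hw trade-off is to pick the mode from a driver check and pair it with the matching default sampling interval. In this sketch the driver_installed variable is a stand-in for whatever check your environment provides, and ./my_app is a placeholder target.

```shell
# hw needs the sampling driver; sw does not but has higher overhead.
driver_installed="no"   # assumption: set this from your own environment check

if [ "$driver_installed" = "yes" ]; then
    mode=hw
    interval=1     # hardware event-based sampling: 0.01-1000 ms, default 1
else
    mode=sw
    interval=10    # user-mode sampling: 1-1000 ms, default 10
fi

cmd="vtune -collect hotspots -knob sampling-mode=$mode -knob sampling-interval=$interval -- ./my_app"
echo "$cmd"
```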
signals-mode=off | objects | stack | nostack
Default:
off
Enable to view synchronization transitions in the timeline and signalling call stacks for associated waits. The collector instruments signalling APIs, which causes higher overhead and increases result size.
Supported analysis:
runss
spdk=true | false
Default:
false
Profile SPDK IO API.
Supported analysis:
io
stack-size=<number>
A number between 0 and 2147483647. Default is 0 (unlimited stack size).
Reduce the collection overhead and limit the stack size (in bytes) processed by the VTune Profiler.
Supported analysis:
runsa
stack-stitching=true | false
Default:
true
For Intel TBB-based applications, restructure the call flow to attach stacks to a point introducing a parallel workload.
Supported analysis:
runss
stack-type=software | lbr
Default:
software
Choose between software stack and hardware LBR-based stack types. Software stacks have no depth limitations and provide more data while hardware stacks introduce less overhead. Typically, software stack type is recommended unless the collection overhead becomes significant. Note that hardware LBR stack type may not be available on all platforms.
Supported analysis:
runsa
stackwalk-mode=online | offline
Default:
offline
Choose between online (during collection) and offline (after collection) modes to analyze stacks. Offline mode reduces analysis overhead and is typically recommended.
Supported analysis:
runss
waits-mode=off | stack | nostack
Default:
off
Enable to identify where threads are waiting or compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
Supported analysis:
runss
uncore-sampling-interval=<number>
For hardware event-based sampling types: a number (in milliseconds) between 1 and 1000. Default: 10.
Specify an interval (in milliseconds) between uncore event samples.
Supported analysis:
runsa
Actions Modified
Description
Use the knob action-option to configure knob settings for a collect (predefined analysis types) or collect-with (custom analysis types) action where the analysis type supports one or more knobs. Each analysis type or collector type supports a specific set of knobs, and each knob requires a value. In most cases the knob value is Boolean, so you would use true to enable the knob.
To see all knobs available for a predefined analysis type:
vtune -help collect <analysis_type>
To see knobs for a custom analysis type:
vtune -help collect-with <analysis_type>
This example returns a list of knobs for the Threading analysis type:
vtune -help collect threading
This example runs a custom event-based sampling data collection on an Android system, enabling collection of Android framework and chipset events:
vtune -collect-with runss -target-system=android -knob sampling-interval=2 -knob cpu-samples-mode=stack -knob atrace-config=gfx,dalvik -knob chipset-event-config="GMCH_PARTIAL_WR_DRAM.ANY,GMCH_CORE_CLKS" --target-process com.intel.tbb.example.tachyon
This example configures and runs a custom event-based sampling data collection with the stack size limited to 8192 bytes:
vtune -collect-with runsa -knob enable-stack-collection=true -knob stack-size=8192 -knob enable-call-counts=true -knob event-config=CPU_CLK_UNHALTED.REF_TSC:sa=1800000,CPU_CLK_UNHALTED
