Run Command Line Analysis

Set Up Environment Variables

Set up the environment variables for the standalone VTune Amplifier by executing the amplxe-vars script:

Linux* OS:

  • csh/tcsh users: source <install_dir>/amplxe-vars.csh
  • bash users: source <install_dir>/amplxe-vars.sh

By default, the <install_dir> is:

  • For root users: /opt/intel/vtune_amplifier_version
  • For non-root users: $HOME/intel/vtune_amplifier_version

Windows* OS: <install_dir>\amplxe-vars.bat, where the default <install_dir> is C:\[Program Files]\IntelSWTools\VTune Amplifier version.

When you run the script, it displays the product name and the build number. You can now use the amplxe-cl and amplxe-gui commands.
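
As a minimal sketch of the Linux steps above for bash users, the following snippet sources the script and then checks whether amplxe-cl is reachable. INSTALL_DIR is a hypothetical variable introduced only for this example; set it to your actual <install_dir> (the "version" part of the default path is the placeholder used above, not a real directory name).

```shell
# INSTALL_DIR is a hypothetical variable for this sketch; set it to your <install_dir>.
INSTALL_DIR="${INSTALL_DIR:-$HOME/intel/vtune_amplifier_version}"

# Source the environment script only if it exists at that location.
if [ -f "$INSTALL_DIR/amplxe-vars.sh" ]; then
    . "$INSTALL_DIR/amplxe-vars.sh"   # "." has the same effect as "source" in bash
fi

# Report whether the command-line tool is now on the PATH.
if command -v amplxe-cl >/dev/null 2>&1; then
    vtune_status="available"
else
    vtune_status="missing"
fi
echo "amplxe-cl is $vtune_status"
```

If the snippet reports "missing", the most likely cause is that INSTALL_DIR does not point at the directory that contains amplxe-vars.sh.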

Run Predefined Analysis

The predefined analysis configurations already have most of the knobs (configuration options) set by default for your convenience. To run a predefined performance analysis, use the -collect action:

> amplxe-cl -collect <analysis_type> [-target-system=<system>] [-knob <knobName=knobValue>] [--] <target>

where:

  • <analysis_type> is the type of analysis to run. To see the list of available analysis types, enter:

    amplxe-cl -help collect

  • -target-system is an option targeted for remote analysis and specifies a remote Linux* system or an Android* device

  • -knob is a configuration option that modifies the analysis

  • <knobName=knobValue> is the name of the specified knob and its value

  • <target> is the path and name of the application to analyze. If you need to analyze a process, use the -target-process or -target-pid option to specify the process name or ID. For a system-wide analysis, no target specification is required.
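
The syntax above can be illustrated with a few concrete invocations. This is a sketch under stated assumptions: ./my_app and the PID 1234 are placeholders, run_or_print is a hypothetical helper that only prints each command unless RUN=1 is set and the tool is installed, and the sampling-mode knob is assumed to be available for the hotspots analysis in your version (check with amplxe-cl -help collect hotspots).

```shell
# Hypothetical helper: print the command unless RUN=1 is set and the tool exists.
run_or_print() {
    if [ "${RUN:-0}" = "1" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$*"
    fi
}

# Launch a placeholder application under the predefined hotspots analysis.
run_or_print amplxe-cl -collect hotspots -- ./my_app

# Same analysis with one knob changed. sampling-mode=hw is an assumption;
# list the knobs your version supports with: amplxe-cl -help collect hotspots
run_or_print amplxe-cl -collect hotspots -knob sampling-mode=hw -- ./my_app

# Attach to an already running process instead of launching a target.
# 1234 is a placeholder process ID.
run_or_print amplxe-cl -collect hotspots -target-pid 1234
```

Running the snippet as is prints the three command lines, which makes it easy to review them before setting RUN=1.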

Intel® VTune™ Amplifier supports the following predefined analysis types:

Analysis Type

Description

hotspots

Analyze application flow and identify sections of code that take a long time to execute (hotspots).

advanced-hotspots (deprecated)

Extend the hotspots analysis by collecting call stacks, context switch and statistical call count data as well as analyzing the CPI (Cycles Per Instruction) metric.

threading

Collect data on how an application is using available logical CPU cores, discover where parallelism is incurring synchronization overhead, identify where an application is waiting on synchronization objects or I/O operations, and discover how waits affect application performance.

concurrency (deprecated)

Collect data on how an application is using available logical CPU cores, discover where parallelism is incurring synchronization overhead, and identify potential candidates for parallelization.

locksandwaits (deprecated)

Identify where an application is waiting on synchronization objects or I/O operations, and discover how these waits affect the application performance.

hpc-performance

Identify opportunities to optimize CPU, memory, and FPU utilization for compute-intensive or throughput applications. The HPC Performance Characterization analysis type is a starting point for understanding the performance landscape of your application. Use this analysis type to improve application performance by increasing the number of floating-point operations per second (GFLOPS) and reducing the overall application run time. The analysis collects data related to CPU, memory, and FPU utilization. Additional scalability metrics are available for applications that use OpenMP* or MPI runtime libraries.

memory-consumption

Analyze memory consumption by your Linux application, its distinct memory objects and their allocation stacks.

uarch-exploration (formerly general-exploration)

Collect hardware events for analyzing a typical client application. This analysis calculates a set of predefined ratios used for the metrics and facilitates identifying hardware-level performance problems.

memory-access

Identify memory-related issues, such as NUMA problems and bandwidth-limited accesses, and attribute performance events to memory objects (data structures). This attribution is possible because the analysis instruments memory allocations/de-allocations and retrieves static/global variables from symbol information.

sgx-hotspots (deprecated)

Analyze hotspots inside security enclaves for systems with the Intel Software Guard Extensions (Intel SGX) feature enabled. This analysis type uses the INST_RETIRED.PREC_DIST hardware event that emulates precise clockticks and helps identify performance-critical program units inside enclaves.

tsx-exploration (deprecated)

Collect events that help understand Intel Transactional Synchronization Extensions (Intel TSX) behavior and causes of transactional aborts.

tsx-hotspots (deprecated)

Monitor the UOPS_RETIRED.ALL_PS hardware event that emulates precise clockticks and identify performance-critical program units inside transactions.

cpugpu-concurrency

Explore code execution on the various CPU and GPU cores in your system, correlate CPU and GPU activity and identify whether your application is GPU or CPU bound.

gpu-hotspots

Identify Graphics Processing Unit (GPU) tasks with high GPU utilization and estimate the effectiveness of this utilization. This analysis type is intended for analysis of applications that use a GPU for rendering, video processing, and computations with explicit support of Intel® Media SDK and OpenCL™ software technology.

gpu-profiling

Use the GPU In-kernel Profiling to analyze GPU kernel execution per code line and identify performance issues caused by memory latency or inefficient kernel algorithms.

graphics-rendering (preview)

Analyze the CPU/GPU utilization of your code running on the Xen virtualization platform. Explore GPU usage per GPU engine and GPU hardware metrics that help understand where performance improvements are possible. If applicable, this analysis also detects OpenGL-ES API calls and displays them on the timeline.

fpga-interaction (preview)

Analyze CPU/FPGA interaction issues by exploring OpenCL kernels running on the FPGA, and identify the most time-consuming FPGA kernels.

io

Monitor utilization of the IO subsystems, CPU and processor buses. This analysis type uses the hardware event-based sampling collection and system-wide Ftrace* collection (for Linux* and Android* targets)/ETW collection (Windows* targets) to provide a consistent view of the storage sub-system combined with hardware events and an easy-to-use method to match user-level source code with I/O packets executed by the hardware.

Note

This is a PREVIEW FEATURE on Windows* OS. A preview feature may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please send your feedback to parallel.studio.support@intel.com or to intelsystemstudio@intel.com.

system-overview

Evaluate general behavior of Linux* or Android* target systems and correlate power and performance metrics with IRQ handling.
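
Because several analysis types in the table above are marked as deprecated, preview, or renamed, a script that accepts an analysis type as input may want to flag those names before calling amplxe-cl. The following is a minimal sketch; analysis_status is a hypothetical helper, and its lists simply mirror the labels in the table above rather than anything reported by the tool itself.

```shell
# Hypothetical helper: classify an analysis type name using the labels in the table above.
analysis_status() {
    case "$1" in
        advanced-hotspots|concurrency|locksandwaits|sgx-hotspots|tsx-exploration|tsx-hotspots)
            echo "deprecated" ;;
        graphics-rendering|fpga-interaction)
            echo "preview" ;;
        general-exploration)
            echo "renamed to uarch-exploration" ;;
        *)
            # Not labeled in the table; this does not prove the name is valid.
            echo "unflagged" ;;
    esac
}

analysis_status concurrency
analysis_status general-exploration
analysis_status hotspots
```

For the authoritative list in your installed version, rely on amplxe-cl -help collect.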

Run Custom Analysis

If you need to run a modified version of a predefined analysis type, use the -collect-with action to specify a data collection type and the required configuration options (knobs):

> amplxe-cl -collect-with <collection_type> [-target-system=<system>] [-knob <knobName=knobValue>] [--] <target>

where:

  • <collection_type> is the type of data collection to run. To see the list of available collection types, enter:

    amplxe-cl -help collect-with

  • -target-system is an option targeted for remote analysis and specifies a remote Linux* system or an Android* device

  • -knob is an option that configures the analysis

  • <knobName=knobValue> is the name of the specified knob and its value

  • <target> is the path and name of the application to analyze. If you need to analyze a process, use the -target-process or -target-pid option to specify the process name or ID. For a system-wide analysis, no target specification is required.
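
A custom collection might look like the sketch below. The event-config knob and the two event names are assumptions that depend on your product version and processor; confirm them with amplxe-cl -help collect-with runsa before use. ./my_app is a placeholder target, and run_or_print is a hypothetical helper that only prints the command unless RUN=1 is set and the tool is installed.

```shell
# Hypothetical helper: print the command unless RUN=1 is set and the tool exists.
run_or_print() {
    if [ "${RUN:-0}" = "1" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$*"
    fi
}

# Assumed knob and event names; verify them for your system with:
#   amplxe-cl -help collect-with runsa
EVENTS="CPU_CLK_UNHALTED.REF_TSC,INST_RETIRED.ANY"

# Hardware event-based collection (runsa) restricted to the events above.
run_or_print amplxe-cl -collect-with runsa -knob event-config="$EVENTS" -- ./my_app
```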

Intel® VTune™ Amplifier supports the following collection types:

Collector

Description

runsa

Profile your application using the counter overflow feature of the Performance Monitoring Unit (PMU).

runss

Profile the application execution and take snapshots of how that application utilizes the processors in the system. The collector interrupts a process, collects the value of all active instruction addresses and captures a calling sequence for each of these samples.

Next Steps

When the collection is complete, the VTune Amplifier saves the data as an analysis result in the default or specified result directory. You can either view the result in the GUI or generate a formatted analysis report.
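
The two follow-up paths can be sketched as shown here. The result directory name my_result and the target ./my_app are placeholders, the -result-dir and -report summary options and passing a result directory to amplxe-gui are assumed to behave this way in your version (check amplxe-cl -help), and run_or_print is a hypothetical helper that only prints each command unless RUN=1 is set and the tool is installed.

```shell
# Hypothetical helper: print the command unless RUN=1 is set and the tool exists.
run_or_print() {
    if [ "${RUN:-0}" = "1" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$*"
    fi
}

RESULT_DIR="my_result"   # placeholder result directory name

# Collect into an explicitly named result directory.
run_or_print amplxe-cl -collect hotspots -result-dir "$RESULT_DIR" -- ./my_app

# Option 1: generate a formatted summary report on the command line.
run_or_print amplxe-cl -report summary -result-dir "$RESULT_DIR"

# Option 2: open the same result in the GUI.
run_or_print amplxe-gui "$RESULT_DIR"
```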
