User Guide

What's New in Intel® VTune™ Profiler

Intel® VTune™ Profiler 2021.2.0

Download this version of Intel® VTune™ Profiler from the product download page.
This version of Intel® VTune™ Profiler contains the following additions:
  • User Interface
    • This release introduces a new main vertical toolbar. All controls previously located in the main horizontal toolbar are now located on this toolbar, which is designed to enhance your experience with clear, bright controls.
  • Hardware Support
    • This version includes support for Intel Atom® Processor P Series code named Snow Ridge, including Hotspots, Microarchitecture Exploration, Memory Access, and Input and Output analyses.
  • GPU Accelerators
    • Source-level analysis for DPC++ and OpenMP applications running on GPU over Level Zero
      The following modes in GPU Compute/Media Hotspots analysis are now available when profiling Level Zero applications:
      Support also includes full-scale analysis of the kernel source per code line, including Source/Assembly mapping.
  • Input and Output Analysis
    • New major features in Input and Output analysis
      • This release introduces the Platform Diagram, a new starting point for the Input and Output analysis. It reveals system topology and high-level utilization metrics for hardware resources including PCIe devices, Intel® Ultra Path Interconnect, and memory. It enables you to examine the utilization of your hardware at a glance.
        This feature is enabled for 1st and 2nd Generation Intel® Xeon® Scalable Processors in up to four-socket configurations, excluding the Intel® Xeon® Platinum 9200 series processors code named Cascade Lake AP. This feature is also supported on Intel Atom® Processors P Series code named Snow Ridge.
      • Intel® Data Direct I/O (Intel DDIO) utilization efficiency metrics are extended with average Inbound PCIe read/write latency and a core/IO contention indicator.
      • It is now possible to perform Linux perf-based data collection without root access on 1st and 2nd Generation Intel® Xeon® Scalable Processors on Linux kernel versions 5.10 and newer.
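
The kernel requirement in the last item can be checked before you attempt a rootless collection. The following is a minimal Python sketch and is not part of VTune Profiler: the function names are hypothetical, and it tests only the kernel version (5.10 or newer). It does not check the processor generation or any perf permission settings on the system.

```python
import platform
import re
from typing import Tuple

# Minimum kernel version stated in the release note above.
MIN_KERNEL = (5, 10)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.10.0-8-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_minimum(release: str,
                         minimum: Tuple[int, int] = MIN_KERNEL) -> bool:
    """Compare versions as integer tuples, so that 5.9 sorts below 5.10."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__" and platform.system() == "Linux":
    current = platform.release()
    print(f"{current}: kernel is 5.10 or newer = {kernel_meets_minimum(current)}")
```

Comparing integer tuples rather than strings matters here: as text, "5.9" would sort after "5.10".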

Intel® VTune™ Profiler 2021.1.2

This version of Intel® VTune™ Profiler contains the following additions:
  • Software Enhancement
    • Fix for an issue where command-line analysis based on User-Mode Sampling does not work when using a non-root account
      If VTune Profiler was installed by a root/sudo user, some executable files required root permissions to run an analysis based on the User-Mode Sampling collector, such as the Hotspots analysis. This issue is fixed in this release.
  • Documentation
    • Guidance resource on GPU-profiling features in Intel® VTune™ Profiler
      A new article captures learning pathways to profile GPUs and illustrates techniques to Optimize Applications for Intel® GPUs with Intel® VTune™ Profiler. Use this article to understand the Intel® VTune™ Profiler workflow to profile and optimize GPUs. The article also points to several key resources, including procedural topics, cookbook recipes, and webinars, that explain GPU compute profiling and graphics profiling with Intel software analyzer products.

Intel® VTune™ Profiler 2021

This version of Intel® VTune™ Profiler contains improvements and additions in these areas:
  • GPU Accelerators
    • GPU Adapter Selection for Profiling Analyses in Multi-GPU Systems
      When you have multiple Intel GPUs connected to your system, you can now select a specific GPU adapter directly in the user interface for your GPU Offload Analysis or GPU Compute/Media Hotspots Analysis. The Target GPU pulldown menu appears in the HOW pane of the analysis configuration when VTune Profiler detects multiple Intel GPUs on your system. The menu lists available GPU adapters with their Bus/Device/Function (BDF) values.
    • Energy Consumption Metrics in GPU Compute/Media Hotspots Analysis
      When you run the GPU Compute/Media Hotspots Analysis on an Intel® Iris® Xe MAX graphics discrete GPU in a Linux environment, you can now use the Analyze power usage option to collect information about energy consumed by the GPU. The analysis results display energy consumption metrics over time and per discrete GPU kernel. Use this data to weigh power usage against processing time and optimize for either one.
    • Data Transfer Information in GPU Offload Analysis
      Kernel information in the GPU Offload Analysis combines data transfer times to and from the GPU kernel with the execution time. In the Summary window, you can now see the total time for computing tasks along with the execution time. Previously, this display included only the execution time. In the Graphics window, the total time for computing tasks by kernel now combines the data transfer time between device and host with the actual execution time. The Graphics window now also displays the size of data transfers between the host (CPU) and the device (GPU).
    • Support for oneAPI Level Zero Specification for DPC++ Applications
      Intel® VTune™ Profiler now supports version 1.0.4 of the oneAPI Level Zero API when you run GPU analyses (GPU Offload analysis and GPU Compute/Media Hotspots) on DPC++ applications in Windows and Linux environments.
    • Update to IP Architecture Diagram
      The IP Architecture Diagram of the GPU Compute/Media Hotspots analysis is renamed the Memory Hierarchy Diagram. The diagram features a new design that can make metrics more intuitive to understand. The diagram also highlights metrics with the same markers that indicate performance or data issues in the Summary and Grid displays. This gives the diagram a consistent look and feel and helps you correlate metrics between both displays.
    • SIMD utilization metrics at kernel and instruction level
      The GPU Compute/Media Hotspots analysis in the Dynamic Instruction Count mode now includes SIMD utilization metrics at the kernel and instruction level. These metrics help identify instructions in the OpenCL kernel that utilize SIMD poorly.
    • GPU metrics in APS and HPC Performance Characterization analysis
      The GPU utilization analysis in Application Performance Snapshot (APS) and the HPC Performance Characterization analysis now includes these GPU computation metrics:
      • GPU Time
      • GPU IPC
      • GPU Utilization
      • Percentage of stalled and idle EUs
      The GPU Compute metric set of Application Performance Snapshot has been enhanced with OpenMP Offload Efficiency metrics, including offload region overhead. These metrics are available for binaries compiled with the Intel® C/C++ Compiler included in several Intel® oneAPI Toolkits 2021.1-beta05 or newer.
    • Simplified dependency on Intel® Metrics Discovery API library
      There is now a simplified dependency on the Intel® Metrics Discovery API library to collect GPU hardware statistics on Linux* systems. Intel® VTune™ Profiler now automatically selects the latest libstdc++ available at runtime to satisfy the GPU analysis requirements. For older versions of the product, follow procedures to enable manual configuration.
    • GPU Compute/Media Hotspots analysis extended with GPU in-kernel analysis for OpenCL™ code and an option to filter by a kernel of interest
    • Extension to Command Line Analysis
      The report generated when you run an analysis from the command line now includes GPU analysis data. Apply the computing-task and computing-instance groupings to your collected data to focus on time-consuming computing tasks.
    • Dynamic Instruction Count Collection in GPU Compute/Media Hotspots Analysis
      The GPU Compute/Media Hotspots analysis has been improved to include Dynamic Instruction Count collection. The analysis results provide better accuracy for basic block Assembly analysis.
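
As an illustration of the groupings described under Extension to Command Line Analysis, the sketch below assembles a report command in Python. Treat it as an example under stated assumptions rather than a reference: `build_report_command` is a hypothetical helper, the result directory name is invented, and the combination of `-report hotspots`, `-result-dir`, and `-group-by` is an assumption to verify against the command-line reference for your VTune Profiler version. Only the two grouping names come from the text above.

```python
import shlex
from typing import List

# Grouping names taken from the release note; everything else is assumed.
GPU_GROUPINGS = ("computing-task", "computing-instance")


def build_report_command(result_dir: str,
                         grouping: str = "computing-task") -> List[str]:
    """Return an argument list for a grouped report (flags are assumptions)."""
    if grouping not in GPU_GROUPINGS:
        raise ValueError(f"grouping must be one of {GPU_GROUPINGS}")
    return [
        "vtune",
        "-report", "hotspots",      # assumed report type
        "-result-dir", result_dir,  # assumed name of an existing result
        "-group-by", grouping,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so no VTune install is needed.
    print(shlex.join(build_report_command("r001gh")))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if the result path contains spaces.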
  • FPGA Accelerators
    • Multiple enhancements to CPU/FPGA Interaction Analysis
      The CPU/FPGA Interaction analysis type features several new additions to enhance your FPGA profiling experience.
      • Analysis results now display Activity percentage and Idle percentage metrics to describe the proportion of cycles when a channel instruction was enabled or absent.
      • The analysis type can now profile loops and display occupancy information for them.
      • You can now adjust the depth of channels using the Average depth and Maximum depth information displayed in the analysis results.
  • Performance Summary:
    • Performance Snapshot Analysis Type for Quick Summary
      Use Performance Snapshot as the starting point for your performance analysis. Get a quick overview of issues that affect your application performance. Performance Snapshot provides recommendations for next steps to help you select other analyses for deeper profiling. It also characterizes the workload on the system.
  • Algorithm Group
    • Anomaly Detection Analysis for Performance Anomalies
      Use the Anomaly Detection analysis type in the Algorithm group to detect performance anomalies in frequently recurring code intervals, including loop iterations. Anomaly Detection uses Intel® Processor Trace (Intel® PT) technology to perform detailed analysis at the microsecond level. These are some metrics that get highlighted in analysis results when Intel® VTune™ Profiler identifies a performance anomaly:
      • Instructions Retired
      • Kernel CPU Time
      • User CPU Time
      • Inactive/Wait Time
      • CPU Frequency
      Anomaly Detection can also detect hypervisors that do not have support for processor trace virtualization through Intel® Processor Trace (Intel® PT).
  • Parallelism:
    • Support for OpenMP Offload in HPC Analysis
      The HPC Performance Characterization analysis type supports the offload of OpenMP regions. The summary pane now includes a breakdown of OpenMP offload time by Compute, Data Transfer, and Overhead. The bottom-up pane now allows grouping by OpenMP Offload Region. With this grouping active, the grid displays several new columns. The timeline shows scale markers that indicate the span of OpenMP offload regions and OpenMP operations internal to those regions.
  • I/O Analysis
    • Improvements and Changes to Input and Output Analysis
      • The Input and Output analysis type features a new methodology for locating sources of reads and writes targeting Memory-Mapped I/O (MMIO) address space regions to which I/O devices are mapped. Such MMIO reads and writes are expensive loads and stores resulting in Outbound PCIe traffic.
      • The collection of source-level Memory-Mapped I/O (MMIO) data in the Input and Output analysis supports InfiniBand* devices.
      • Platform I/O metrics can now be attributed to individual devices managed by Intel® VMD technology.
      • Per-device metrics are now available when running Input and Output analysis as a non-root user, as long as the sampling driver is loaded.
      • Enhanced profiling for servers based on Intel® processor microarchitectures codenamed Skylake and Cascade Lake by highlighting code that potentially performs MMIO reads.
      • This analysis type features Inbound PCIe Read/Write L3 Hit/Miss Ratio metrics that show the utilization efficiency of Intel® Data Direct I/O (Intel® DDIO) hardware technology. There are new metrics for Intel® Xeon® Scalable processors that allow data breakdown by PCIe devices. Input and Output analysis is deprecated in the Windows version of Intel® VTune™ Profiler.
  • Energy Analysis
    • Rootless Data Collection on Linux Systems
      You do not require root privileges to run energy analysis using Intel® VTune™ Profiler in a Linux environment. You can run this analysis without root privileges once your system administrator installs sampling drivers for Intel® VTune™ Profiler and configures relevant permissions for the drivers. Administrator privileges are required to collect energy data on Windows machines.
    • Processor Package Energy Consumption
      Options for Energy analysis, based on the Intel SoC Watch data collector, have been extended to monitor processor package energy consumption over time and identify how it correlates with CPU throttling.
  • Platform Analysis:
    • Enhancements to System Overview Analysis
      Use the System Overview analysis as an entry point to platform analysis. Assess your system (IO, accelerators and CPU) performance and get guidance for further analysis steps.
      • The System Overview analysis can display energy consumption data. Enable the Analyze energy usage option to get energy consumption characterization on the Summary tab with the total energy consumed by CPU packages and DRAM, as well as energy consumption data over time on the Platform tab.
      • The Hardware Tracing mode in the System Overview analysis enables application analysis at the microsecond level and helps you to identify causes for latency. These are some metrics you can collect:
        • User/kernel metrics
        • OS Kernel Activity
        • OS Scheduling
        • Thread/Hardware grouping
        • Module entry points
        The metrics help identify anomaly issues caused by unexpected kernel activity or preemptions.
    • Overview and Memory views are extended with new metrics to analyze Non-Uniform Memory Access (NUMA) behavior
    • User authentication and authorization have been added to enable access control to your data
    • There is a new option to choose or modify the location of Platform Profiler data files
  • VTune Profiler Server for HPC Environment
    • A quality-of-life improvement was added to VTune Profiler Server. If you use the VTune Profiler CLI to run data collection using a scheduler in an HPC cluster and put the results into a mounted shared location, you can now point VTune Profiler Server to an arbitrarily structured folder in this shared location. VTune Profiler now discovers all results in a directory and allows you to seamlessly navigate your arbitrary folder structure and open any result.
  • HPC Analysis
    • Application Performance Snapshot includes Max and Bound Bandwidth metrics to better estimate the efficiency of the DRAM, MCDRAM, Intel Persistent Memory and Intel® Omni-Path usage
  • Cloud and Containerization
    • Use Containerization support to install and run VTune Profiler in a Docker* container and profile targets both inside the same container and outside it
    • This release extends container profiling capabilities to display the container name instead of its ID for ease of identification.
    • You can profile applications running in Amazon Web Services* (AWS) EC2 Instances based on Intel microarchitecture code name Cascade Lake X.
  • Connection Types
    • New TCP/IP Communication Agent
      Use the TCP/IP communication agent as a connection type to profile embedded systems running real-time operating systems. You can profile the kernel of an arbitrary real-time operating system and the applications running on it. This requires the development of a custom agent (Analysis Communication Agent). A reference solution based on Linux OS is available through the Analysis Communication Agent GitHub* repository. Detailed information on developing an agent for a specific real-time operating system is available in the ACA documentation.
    • Remote Linux (SSH) Connection Type
      The Remote Linux (SSH) connection type has been improved to make automated target package deployment more transparent. VTune Profiler now checks for the presence of the target package on the remote system and, if the package is not found, offers to deploy it automatically with a single click.
  • Quality and Usability
    • Symbol resolution for effective source-level analysis enabled for crossgen (Ahead-of-JIT compilation) functions on Linux* systems
    • Interactive Help Tour available from the Welcome page, guiding you through the product interface using a sample project
    • Third-party components updated to the most recent versions to include functional and security changes. Updating your product to the latest version is recommended.
  • Profiling Support for OpenSHMEM Applications
    Use the Fabric Profiler feature in VTune Profiler to identify detailed characteristics of the runtime behavior for an OpenSHMEM application.
  • Profiling Support for Applications Annotated with ITT API
    • Average Task Time and Average Frame Time metrics are now included in analysis results when you profile applications annotated with ITT API.
  • Profiling Remote Amazon Web Services* Instances
    • Remote profiling is supported for applications running in Amazon Web Services* (AWS) EC2 instances.
  • Support for DPC++ Applications
    • Demangling of Lambda Functions
      This release implements the demangling of DPC++ lambda function names, which are used as DPC++ kernel names.
  • Analysis Configuration:
    • Wrapper Script Option for Quick Profiling Environment Setup
      Use the wrapper script to run a custom set of commands that prepares the profiling environment before the analysis starts. For example, you can create a script with a custom set of commands that sets environment variables. Include the script in the WHAT pane when you configure the analysis. The commands get executed on the target system before the analysis begins. You can also provide the wrapper script through the command-line interface by using the --wrapper-script-path option.
  • Documentation:
    • PDF version of User Guide
      The Intel® VTune™ Profiler User Guide is available in PDF format as well as HTML. If you are viewing this content online, click Download as PDF at the top of this page to use the PDF version.
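
For the --wrapper-script-path option described under Analysis Configuration, the following Python sketch shows one way a launcher script might assemble the command line. It is an illustration under assumptions, not a documented recipe: `build_collect_command` is a hypothetical helper, the script and application names are placeholders, and the surrounding `-collect hotspots` and `--` arguments should be checked against the command-line reference for your version. Only the --wrapper-script-path option name is taken from the text above.

```python
import shlex
from typing import List, Sequence


def build_collect_command(wrapper_script: str,
                          app_command: Sequence[str],
                          analysis: str = "hotspots") -> List[str]:
    """Return an argument list that passes a wrapper script to a collection."""
    if not app_command:
        raise ValueError("app_command must name the application to profile")
    return [
        "vtune",
        "-collect", analysis,                     # assumed analysis type
        "--wrapper-script-path", wrapper_script,  # option named in the note
        "--",                                     # separates target command
        *app_command,
    ]


if __name__ == "__main__":
    # Placeholder script and application names, printed rather than executed.
    cmd = build_collect_command("./prepare_env.sh", ["./my_app", "--size", "1024"])
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a system where VTune Profiler is installed and on the PATH.
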
As a part of the Intel oneAPI Base Toolkit, VTune Profiler includes these features:
  • Support for Data Parallel C++ (DPC++) code profiling added across CPUs and multiple accelerator architectures, including GPUs and FPGAs
  • GPU Offload and GPU Compute/Media Hotspots types extended to support profiling DPC++ code and OpenMP* code offloaded to the GPU
  • CPU/FPGA Interaction analysis extended with FPGA device-side metrics, like Stalls, Global Bandwidth and Occupancy, and mapping FPGA kernel performance data to the source code
  • GPU Time and Utilization metrics added to Application Performance Snapshot to help you triage your performance issues and identify whether your code is CPU or GPU bound
For a full list of platforms that support Intel® VTune™ Profiler, see the VTune Profiler Release Notes.
Documentation for versions of VTune Profiler prior to the 2021 release is available for download only. For a list of available documentation downloads by product version, see these pages:

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.