Intel® VTune™ Amplifier Release Notes and New Features

This page provides the current Release Notes for Intel® VTune™ Amplifier (Intel® VTune™ Amplifier XE for versions 2017 and older). The notes are categorized by major version, from newest to oldest, with individual releases listed within each version section. 

Click a release to expand it into a summary of new features and changes in that version since the last release. The expanded summary also contains download buttons for the detailed release notes, which include important information such as prerequisites, software compatibility, installation instructions, and known issues.

The installation guides are posted separately:
Linux* | macOS* | Windows*

All files are in PDF format - Adobe Reader* (or compatible) required.
To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2019

Update 6

Release Notes

Overview

  • Bug fixes and security update.
  • Intel VTune Amplifier has been updated to include more recent versions of third-party components, which include functional and security updates. Users should update to the latest version.

Update 5

Release Notes

Overview

  • Relaxed limitations on collecting GPU hardware metrics for users without Administrator/root privileges.
  • Added support for HW-based analysis on systems running under Hyper-V.
  • Microarchitecture analysis improvements:
    • Relaxed limitations for the Perf* driverless collection on Linux for users with perf_event_paranoid value set to 2. For such environments, event-based sampling is supported in the user space.
    • Memory Access analysis updated with a new UPI Utilization metric for Intel microarchitectures code named Cascade Lake and Skylake.
  • Application Performance Snapshot improvements:
    • Max and Bound metrics added to estimate the efficiency of the DRAM, MCDRAM, and Persistent Memory usage.
  • Quality and usability improvements:
    • Interactive Help Tour available from the Welcome page and guiding you through the product interface using a sample project.
    • Simplified configuration of a Windows-to-Linux remote collection supporting automated password-less access to a Remote Linux (SSH) target.
  • Platform Profiler improvements:
    • Enhanced system overview including CPU and memory utilization summary, CPU I/O Wait information (Linux) and CPU utilization breakdown for key CPU-stall reasons
    • Initial implementation of a custom view that allows users to specify the metrics to be visualized
    • Full support for Intel® Optane™ DC Persistent Memory metrics on 2nd Generation Intel® Xeon® Scalable Processor server platforms (formerly Cascade Lake).
    • A simplified and consistent command line interface for collecting data that conforms better to CLI conventions. The previous command line interface is supported for backward compatibility. See the Intel® VTune™ Amplifier User Guide for more details.
  • Support for new operating systems:
    • Android Q
    • Red Hat* Enterprise Linux* 8
    • Fedora* 30

NOTE: 32-bit operating systems are deprecated in the 2019 version, and support will be removed in an upcoming release. VTune™ Amplifier can still profile 32-bit applications on 64-bit operating systems (cross mode).
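
The perf_event_paranoid condition mentioned for driverless collection can be checked before a run. The sketch below reads the standard Linux procfs setting and summarizes what the release note says about the value 2. The helper names and the wording of the returned strings are illustrative assumptions, not VTune Amplifier output.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the setting.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current value, or None if it cannot be read (e.g. not Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_sampling_scope(level: Optional[int]) -> str:
    """Hypothetical helper: summarize the sampling scope for a given value."""
    if level is None:
        return "unknown: perf_event_paranoid is not readable on this system"
    if level == 2:
        # Per the release note: event-based sampling is supported in user space.
        return "user-space sampling only"
    if level < 2:
        return "less restrictive than 2"
    return "more restrictive than 2: not covered by the release note"


if __name__ == "__main__":
    print(describe_sampling_scope(read_perf_event_paranoid()))
```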

Update 4

Release Notes

Overview:

  • GPU analysis improvements:
    • Inline Mode filter option added to GPU In-Kernel Profiling viewpoint, to display GPU-side call stacks with OpenCL™ inline functions and correctly attribute GPU Cycles statistics per function. By default, Inline Mode is switched off.
    • Source/Assembly analysis available for OpenCL programs created with IL (intermediate language), if the intermediate SPIR-V binary was built with the -gline-tables-only -s <cl_source_file_name> option.
    • (PREVIEW) New Instruction Count profiling mode added to the GPU In-Kernel Profiling to analyze GPU instructions executed by an OpenCL kernel and classified per instruction type. This mode helps you compare the performance of the same OpenCL kernel on different hardware or explore instruction count for different implementations of the same algorithms on the same hardware. 
  • Microarchitecture analysis improvements:
    • Default driverless mode for hardware event-based collections with stacks, such as Hotspots and Threading. Driver-based collection can still be run by setting the "Stack size" option to the unlimited value (0) or disabling the "Enable driverless collection" option in a custom analysis.
    • The Precise column added to the Summary of the Hardware Events viewpoint to clearly identify precise events. Using precise events in your configurations provides more accurate Assembly analysis with no event skid.
  • Quality and usability improvements:
    • Improved integration with the Microsoft* Visual Studio* IDE, with quick access to VTune Amplifier options via a smart integrated Welcome page.
    • Overlay help with quick tips for the Bottom-up tab highlights important interface elements to efficiently manage analysis data.
    • Added Linux kernel 5.0 support.

Update 3

Release Notes

Overview:

  • Support for Intel® Optane™ DC persistent memory and the latest microarchitecture code-named Cascade Lake. This includes new hardware event support and enhanced memory analysis to design and optimize for the new persistent memory technology.
    Learn more about the next generation of memory!
  • Resolve performance bottlenecks where network workloads are consuming high I/O bandwidth. Enhanced PCIe device metrics for I/O traffic in the Input and Output analysis help you understand the interactions between Cores and Network Interface Cards (NICs).
  • MPI improvements:
    • Easier control of data collection for MPI applications using the standard MPI_Pcontrol API. Collect only the data you need with a few quick changes and no dependency on the ITT API.
    • Easier MPI communication pattern diagnosis with Application Performance Snapshot’s rank to rank communication diagram by message volume.
  • Usability improvements:
    • Friendlier welcome page provides fast access to technical content and project controls.
    • Improved importing process for traces and result files. It’s now possible to import whole result directories to a project and use project search directories for symbol and source/assembly resolution.
    • Simplified installation and licensing (serial numbers and license files are no longer required for this product).
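
The MPI_Pcontrol-based collection control listed under MPI improvements could be wrapped as follows. This is a minimal sketch that assumes the profiler follows the MPI standard's convention (level 0 disables profiling, level 1 enables it). The `collected_region` helper and the recording stand-in are hypothetical; in an mpi4py program you would pass `MPI.Pcontrol` as the callable instead.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List

PAUSE, RESUME = 0, 1  # MPI standard: level 0 disables profiling, level 1 enables it


@contextmanager
def collected_region(pcontrol: Callable[[int], object]) -> Iterator[None]:
    """Resume collection on entry and pause it again on exit."""
    pcontrol(RESUME)
    try:
        yield
    finally:
        pcontrol(PAUSE)


def demo() -> List[int]:
    """Run a toy workload and return the sequence of Pcontrol levels issued."""
    calls: List[int] = []
    pcontrol = calls.append  # stand-in recorder for a real MPI_Pcontrol binding
    pcontrol(PAUSE)          # exclude setup from collection
    with collected_region(pcontrol):
        sum(range(1000))     # the region of interest
    return calls


if __name__ == "__main__":
    print(demo())
```

Passing the control function in as a parameter keeps the sketch runnable without an MPI installation.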

Update 2

Release Notes

Overview:

  • Intel® VTune™ Amplifier 2019 Update 2 includes functional and security updates. Users should update to the latest version.
  • Microarchitecture analysis improvements:
    • Configuration for the Microarchitecture Exploration analysis optimized to provide you with the control over collected hardware metrics and data collection overhead in general. By default, the analysis provides you with a full set of top-level hardware metrics and their sub-metrics that show how your code uses hardware resources. With a new configuration option, you can choose to narrow down the scope and collect sub-metrics only for the selected top-level metrics.
  • System Analyzer tool for monitoring real-time metrics on a target system added to the VTune Amplifier as a PREVIEW feature.
  • HPC workload profiling improvements:
    • Full-featured support of OpenMPI targets in Application Performance Snapshot
    • Vectorization metrics streamlined for the HPC Performance Characterization analysis
    • PREVIEW: HTML report added to show process/thread affinity along with CPU execution and remote access information
  • Support for managed Linux and Windows targets with tiered compilation for .NET* Core 3.0 Preview 1 and .NET Core 2.2
  • Quality and usability improvements:
    • Improved support for standalone command-line results imported into a VTune Amplifier GUI project. Search directories specified in the command line configuration are preserved and applied for proper module resolution in the graphical viewpoints.

Update 1

Release Notes

Overview:

  • Threading analysis extended with a lower-overhead hardware event-based sampling mode. This mode helps analyze the impact of thread preemption and context switching. On Windows*, this analysis configuration requires the sampling driver. On Linux*, the analysis is available both with the sampling driver and with the Linux Perf* collector for kernels 4.4 and higher.
  • Quality and usability improvements:
    • Summary command line report for the Hotspots analysis enriched with metrics and a Top 5 Hotspots table that is also available from the GUI Summary view.
    • A sample matrix project added to the Project Navigator to help you get started with the product, review a sample pre-collected Hotspots result, and test other analysis types and source view options. A pre-built version of the matrix sample application and associated source files are installed with VTune Amplifier.
    • Support for Linux Perf* collection extended with VTune Amplifier metrics, with a further option to import the Perf trace to the VTune Amplifier GUI and benefit from predefined viewpoints. This solution could be useful for performance analysis in data centers.
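
The kernel requirement for the Perf-based Threading analysis (4.4 and higher) can be checked with a small script. The sketch below parses a kernel release string and compares it against that minimum. The function names are illustrative, and the check covers only the version requirement stated above, not any other prerequisite.

```python
import platform
import re
from typing import Optional, Tuple

MIN_PERF_KERNEL = (4, 4)  # minimum kernel version stated in the release note


def parse_kernel_release(release: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_perf_kernel_requirement(release: str) -> bool:
    """True if the release string parses and is at least 4.4."""
    version = parse_kernel_release(release)
    return version is not None and version >= MIN_PERF_KERNEL


if __name__ == "__main__":
    if platform.system() == "Linux":
        print(meets_perf_kernel_requirement(platform.release()))
```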

Initial release

Release Notes

Overview:

  • New, easier tuning workflow and simplified setup
  • New Platform Profiler. Longer data collection finds hardware configuration issues and poorly tuned applications.
  • Application Performance Snapshot adds utilization of logical vs. physical cores, pause and resume, and Intel Trace Analyzer and Collector integration

2018

Update 4

Release Notes

Overview:

  • Support for new operating systems:
    • SUSE* Linux* Enterprise Server (SLES) 12 SP3, SUSE* Linux* Enterprise Server (SLES) 15
    • Red Hat* Enterprise Linux* 6.10
    • Fedora 28
    • Microsoft Windows* 10 RS4

Update 3

Release Notes

Overview:

  • Analysis on embedded platforms and accelerators:
    • New CPU/FPGA Interaction analysis (PREVIEW) to assess the balance between the CPU and FPGA on systems with a discrete Intel® Arria® 10 FPGA running OpenCL™ applications
    • New Graphics Rendering analysis (PREVIEW) for CPU/GPU utilization of your code running on the Xen* virtualization platform installed on a remote embedded target
    • Support for the sampling command-line analysis on remote QNX* embedded systems via ethernet connection
  • HPC workload profiling improvements:
    • CPU Utilization metric refined to differentiate the utilization on logical vs. physical cores, which is particularly important for HPC applications running on processors from the Intel® Xeon® processor family
  • Managed runtime analysis improvements:
    • Extended JIT profiling for server-side applications running on the LLVM* or HHVM* PHP servers to support the event-based sampling analysis in the attach mode
    • Extended Java* code analysis with support for OpenJDK* 9 and Oracle* JDK 9
    • Enabled Advanced Hotspots analysis for .NET* Core applications running on Linux and Windows systems in the Launch Application mode
  • Application Performance Snapshot improvements:
    • Added the ability to pause/resume collection with MPI_Pcontrol and the ITT API. The -start-paused option was added to exclude application execution from collection between the start of the run and the first collection resume call.
    • Enabled selection of which data types are collected to reduce overhead. The choices include MPI tracing, OpenMP tracing, hardware counter based collection, or a combination of the three.
    • Exposed the CPU Utilization metric by physical cores on processors that support proper hardware events.
    • Significantly reduced MPI tracing overhead when there are a large number of ranks.
    • Enriched MPI statistics generated by the aps-report utility: it now shows information about the communicators used in the application and can group and filter collective operations by communicator.
    • Improved integration with Intel® Trace Analyzer and Collector by adding the ability to generate profiling configuration files with the aps-report option.
  • Quality and usability improvements:
    • Hardware event-based analysis supported for targets running in the Hyper-V* environment on Windows* 10 Fall Creators Update (Redstone 3)
  • Support for new operating systems and IDEs including:
    • Fedora*
    • Ubuntu* 17.10

Update 2

Release Notes

Overview:

  • Mitigated impact of OS security updates: https://software.intel.com/en-us/articles/intel-vtune-amplifier-impact-of-recent-os-security-updates
  • Collect only the data you need with Application Performance Snapshot’s new data selection options and pause/resume API support. Get better answers with lower overhead.
  • Assess the balance between the CPU and FPGA with a new CPU/FPGA Interaction analysis (PREVIEW)
  • CPU utilization for physical and logical cores improves analysis of hyper-threading and thread migration performance effects.
  • Improvements to JIT profiling for server-side applications and support for OpenJDK* 9 and Oracle* JDK 9.
  • Profile .NET* Core applications running on Linux* or Windows* systems with Advanced Hotspots analysis
  • Hardware event-based analysis supported for targets running in the Hyper-V* environment on Windows* 10 Fall Creators Update (version 1709)

Initial Release and Update 1

Release Notes

Overview:

  • Easier tuning of threaded MPI applications. HPC analysis adds enhanced metrics for MPI including MPI imbalance & performance of critical path rank. Application Performance Snapshot merges MPI + Application data, includes richer metrics, and adds MPICH compatibility.
  • Optimize private cloud-based applications. Profile inside Docker & Mesos containers and attach to running Java services and daemons.
  • Easier analysis of remote Linux* systems. Automated install of performance collectors on a remote Linux target.

2017

Update 5

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Support for Microsoft Visual Studio* 2017 Update 3
  • Bug fixes and performance improvements

Update 4

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2)

Update 3

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Application Performance Snapshot (Preview) provides a quick look at your application performance and helps you understand where your application will benefit from tuning. The revised tool shows metrics on MPI parallelism (Linux* only), OpenMP* parallelism, memory access, FPU utilization, and I/O efficiency with recommendations on further in-depth analysis.
  • Support for Intel® Xeon Phi™ coprocessor targets codenamed Knights Landing
  • Improved insight into parallelism inefficiencies for applications using Intel Threading Building Blocks (Intel TBB) with extended classification of high Overhead and Spin time.
  • Automated installation of the VTune Amplifier collectors on a remote Linux target system. This feature is helpful if you profile a target on a shared resource without VTune Amplifier installed or on an embedded platform where targets may be reset frequently.
  • Support for Microsoft Visual Studio* 2017

Update 2

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

Details:

HPC Performance Characterization Analysis improvements

The HPC Performance Characterization Analysis has received several improvements.

Increased detail and structure for the vector efficiency metrics based on FLOP counters in the FPU Utilization section help diagnose the reason for low utilization connected with poor vector code generation. Relevant metrics include:

  • Vector Capacity Usage
  • FP Instruction Mix
  • FP Arithmetic Instructions per Memory Read or Write
  • SP FLOPs per Cycle (may indicate memory bandwidth bound code)

For MPI applications, the MPI Imbalance metric shows CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. Issue detection for this metric is based on the minimal MPI Busy Wait time across ranks. If the minimal MPI Busy Wait time is not significant, then the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
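
The arithmetic behind this description can be sketched with a few lines of code. The example below averages per-rank busy-wait time over the ranks on a node and picks the rank with the minimal busy-wait time as the likely critical-path rank. The function names and the sample timings are hypothetical, and the exact formula VTune Amplifier uses may differ in detail.

```python
from typing import Dict, Tuple


def mpi_imbalance(busy_wait_s: Dict[int, float]) -> float:
    """Busy-wait CPU time summed over ranks, normalized by the rank count."""
    if not busy_wait_s:
        raise ValueError("at least one rank is required")
    return sum(busy_wait_s.values()) / len(busy_wait_s)


def likely_critical_rank(busy_wait_s: Dict[int, float]) -> Tuple[int, float]:
    """Return (rank, busy_wait_seconds) for the rank that waits the least."""
    rank = min(busy_wait_s, key=lambda r: busy_wait_s[r])
    return rank, busy_wait_s[rank]


# Hypothetical busy-wait seconds for four ranks on one node.
sample = {0: 4.0, 1: 0.5, 2: 3.5, 3: 4.0}
print(mpi_imbalance(sample))         # 3.0
print(likely_critical_rank(sample))  # (1, 0.5)
```

In this sample, rank 1 barely waits while the others spin for several seconds, so rank 1 is the one whose CPU utilization is worth reviewing first.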

The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating point operations, sorted by CPU time. The FPU Utilization column provides issue descriptions based on whether a loop/function is bandwidth bound, whether it is vectorized or scalar, and what instruction set it's using.

For Intel Xeon Phi processors (codenamed Knights Landing), the following FPU metrics are available instead of FLOP counters:

  • SIMD instructions per cycle
  • Fraction of packed SIMD instructions vs scalar SIMD instructions per cycle
  • Vector instruction set for loops based on static analysis

DRAM Bandwidth Bound metric

A new metric is available in the Memory Usage viewpoint for the Memory Access and HPC Performance Characterization analyses which indicates whether your system spent much time heavily utilizing the DRAM bandwidth. The calculation of this metric relies on accurate maximum system DRAM bandwidth measurement, and depends on the number of sockets on your system.

GPU Hotspots Summary improvements

The GPU Hotspots viewpoint's Summary tab has been extended to display more information. The GPU Usage section can be used to identify whether the GPU was properly utilized. The Packet Queue Depth Histogram can be used to estimate the GPU software queue depth per GPU engine during the target run. Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for the zero queue depth.
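
As a rough illustration of what such a histogram summarizes, the sketch below builds a time-weighted queue depth histogram for one engine from a list of depth changes. The input format, function name, and sample numbers are assumptions made for the example; VTune Amplifier derives its histogram from collected trace data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def queue_depth_histogram(
    changes: List[Tuple[float, int]], end_time: float
) -> Dict[int, float]:
    """Return seconds spent at each queue depth.

    `changes` is a time-sorted list of (timestamp_s, new_depth) pairs;
    the last depth is held until `end_time`.
    """
    hist: Dict[int, float] = defaultdict(float)
    next_times = [t for t, _ in changes[1:]] + [end_time]
    for (start, depth), stop in zip(changes, next_times):
        hist[depth] += stop - start
    return dict(hist)


# Hypothetical trace for a single GPU engine over a 5-second run.
sample = [(0.0, 0), (1.0, 2), (3.0, 1), (4.0, 0)]
hist = queue_depth_histogram(sample, end_time=5.0)
zero_depth_share = hist.get(0, 0.0) / 5.0  # fraction of the run with an empty queue
print(hist, zero_depth_share)
```

Here the queue is empty for 2 of 5 seconds, which is the kind of zero-depth duration the text recommends minimizing.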

For a high-level view of the DMA packet execution during the target run, review the Packet Duration Histogram. Select a required packet type from the drop-down menu and identify how effectively these packets were executed on the GPU. Having high packet count values for the minimal duration is optimal.

KVM Guest OS Profiling

If you are a system developer and interested in the performance analysis of a guest Linux* system, use Intel VTune Amplifier for performance analysis of this guest Linux* OS via Kernel-based Virtual Machine (KVM) from the host system. Depending on your analysis target, you may choose either of the following usage models for KVM guest OS profiling:

Locks & Waits analysis for Python

Locks and Waits analysis can now be used to tune threaded performance of mixed Python* and native code. View Sync Objects in the grid, see Python frames in the Call Stack, and define which sync objects are the Global Interpreter Lock (GIL), either by wait count or by call stack. Drill down to Python source to explore thread synchronization issues at code level. For more information on how to configure the analysis, see the Python* Code Analysis product help article.

Update 1

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • Support for the Average Latency metric in the Memory Access analysis based on the driverless collection
  • Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
  • Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory and FPU performance aspects including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use a new report-knob show-issues option.
  • Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots
  • GPU Hotspots analysis extended to detect hottest computing tasks bound by GPU L3 bandwidth
  • PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and allows you to see all detected GPU stalled/idle issues in the same view.
  • Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view

Initial Release

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

Overview:

Details:

Intel® Xeon Phi™ Processor Support

Intel® VTune™ Amplifier now supports the Intel® Xeon Phi™ Processor codenamed Knights Landing.

Decide how to use MCDRAM (the high bandwidth memory) effectively using Memory Access Analysis, analyze the scalability of MPI and OpenMP* with HPC Performance Characterization Analysis, and explore the microarchitecture efficiency with General Exploration Analysis.

HPC Performance

The HPC Performance Characterization Analysis explores the three key performance aspects of application scalability:

  • Threading: CPU Utilization with parallel efficiency for MPI and OpenMP*. Explore the serial vs parallel time and the top OpenMP regions by potential gain.
  • Memory Access Efficiency: includes bandwidth utilization and stalls by memory hierarchy.
  • FPU utilization: includes basic vectorization metrics.

See the analysis usage example in the Analyzing an OpenMP and MPI Application web-based tutorial, which provides a hands-on exercise to identify memory utilization inefficiencies and load imbalance for a sample hybrid application.

Memory Access Analysis

The Memory Access Analysis has been improved. In addition to support for the Intel Xeon Phi processors, it now supports custom memory allocators, and includes automatic detection of maximum system DRAM bandwidth characteristics and scaling bandwidth data from that maximum. This allows users to easily see how they actually utilize the available DRAM bandwidth, rather than just raw GB/s values. The QPI bandwidth has been split into Total, Outgoing, and Incoming, instead of just the total. The workflow has been optimized for identifying the top memory objects with high bandwidth utilization per domain. Finally, no special drivers are required on Linux*; this analysis type can now use standard Linux* perf to collect data, eliminating the need for root to install other drivers.
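
Scaling bandwidth data against the detected maximum amounts to expressing an observed value as a share of that maximum. The sketch below shows this arithmetic; the function name and the sample figures are hypothetical and do not come from a real measurement.

```python
def bandwidth_utilization_percent(observed_gb_s: float, max_gb_s: float) -> float:
    """Express observed DRAM bandwidth as a percentage of the measured maximum."""
    if max_gb_s <= 0:
        raise ValueError("maximum bandwidth must be positive")
    return 100.0 * observed_gb_s / max_gb_s


# Hypothetical values: 45 GB/s observed on a system that peaks at 60 GB/s.
print(bandwidth_utilization_percent(45.0, 60.0))  # 75.0
```

A raw value of 45 GB/s says little on its own, whereas 75% of the achievable maximum makes the remaining headroom explicit.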

Disk I/O Analysis (Preview)

The Disk Input and Output analysis for HDD, SATA, or NVMe SSD monitors utilization of the disk subsystem, CPU, and PCIe buses, and helps to identify long latency of I/O requests and imbalance between I/O and compute operations.

See the Analyzing Input/Output Waits tutorial for a hands-on exercise with sample code on Linux*.

GPU analysis improvements

GPU Hotspots Analysis is intended for GPU-bound applications, and provides options to analyze execution of OpenCL™ kernels and Intel® Media SDK tasks.

The GPU Analysis Summary provides a set of metrics to estimate the GPU utilization per engine, identify stalled or idle execution units, and explore the most typical problems with low occupancy or frequent sampler accesses. Navigate from the Hottest GPU computing tasks summary to the details provided in the graphics tab.

Intel VTune Amplifier now also supports the detection of OpenCL 2.0 Shared Virtual Memory (SVM) usage types per kernel instance.

For more information, see Using Intel VTune Amplifier to Optimize Media & Video Applications.

Usability Improvements

Remote usage and Command Line usage have been improved. Use the Arbitrary target GUI configuration to generate a command line for performance analysis on a system that is not accessible from the current host.

MPI analysis has been extended with the event-based sampling collection supported for multiple ranks per node with an arbitrary MPI launcher and natural syntax. Use the MPI launcher option in the arbitrary targets configuration to automatically generate a command line for MPI analysis from the GUI.

An option for enabling and disabling the OpenMP regions analysis has been added to selected analysis configurations.

Support has been added for the Attach To Process target type with event-based sampling for low-privilege Java* daemons on Linux*.

The event selection mechanism for custom hardware event based sampling has been extended with filtering options.

The grid views and the identification of performance issues have received UI improvements.

Intel® Performance Snapshot (Preview)

The Application Performance Snapshot tool provides a quick look at your application performance and helps you understand whether your application will benefit from tuning.

It identifies how effectively your application uses the hardware platform and displays basic performance enhancement opportunities.

The Storage Performance Snapshot tool analyzes your system's storage, CPU, memory, and network usage and displays basic performance enhancement opportunities for systems using Intel hardware.

For more complete information about compiler optimizations, see our Optimization Notice.