Intel® VTune™ Profiler Release Notes and New Features

By Mayank Tiwari

Published: 10/23/2014   Last Updated: 06/26/2020

This page provides the current Release Notes for Intel® VTune™ Profiler (starting with Intel® VTune™ Amplifier XE 2017). The notes are categorized by major version, from newest to oldest, with individual releases listed within each version section.

NOTE: Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with version 2020.

Click a release to expand it into a summary of new features and changes in that version since the last release. The expanded summary also contains download buttons for the detailed release notes, which include important information such as prerequisites, software compatibility, installation instructions, and known issues.

You can copy a link to a specific release's section by clicking the chain icon next to its name.

The installation guides are posted separately:
Linux* macOS* Windows*

All files are in PDF format - Adobe Reader* (or compatible) required.
To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2020

Update 3

Release Notes

Overview

Intel® VTune™ Profiler has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.

Hardware Support:

  • Added support for 11th Gen Intel® Core™ processors codenamed Tiger Lake, including Hotspots, Microarchitecture Exploration, Memory Access, and GPU analyses.

Input and Output Analysis:

  • Source-level Memory Mapped I/O (MMIO) analysis now supports InfiniBand devices.

Profiling Applications Annotated with ITT API:

  • Profiling of applications annotated with ITT API has been enhanced by the introduction of additional Average Task Time and Average Frame Time metrics. 

Profiling Remote Amazon Web Services* Instances:

  • Added support for remote profiling of applications running in Amazon Web Services* (AWS) EC2 instances.

Update 2

Release Notes

Overview

Intel® VTune™ Profiler has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.

Performance Snapshot analysis type for quick summary:

  • This release introduces the Performance Snapshot analysis type. Start with this analysis and get a quick overview of issues that affect your application performance. Performance Snapshot characterizes the workload on the system. It also provides recommendations for next steps to help you select other analyses for deeper profiling.

Platform Analysis:

  • Platform I/O metrics can now be attributed to individual devices managed by Intel® VMD technology.
  • I/O Analysis has been enhanced for Skylake and Cascade Lake servers by highlighting code potentially performing MMIO reads.

New hardware/operating systems/IDEs support:

  • Support for Intel’s processors code named Cooper Lake and Comet Lake
  • Ubuntu* 20.04, Fedora* 32
  • Microsoft* Windows* 10, May 2020 Update

Deprecation:

  • We recommend that Storage Snapshot users switch to using the Platform Profiler feature of Intel® VTune™ Profiler. It provides a more informative set of data with a similar low overhead. We are deprecating Storage Snapshot and will discontinue it in our next major release.
  • Preview of Input and Output analysis on Windows* is deprecated and will be removed in a future release. This analysis continues to be supported on Linux* OS.

Update 1

Release Notes

Overview

This version of Intel® VTune™ Profiler contains improvements and additions in these areas:

  • Microarchitecture Exploration analysis is now supported on Intel processors codenamed Ice Lake.
  • GPU accelerators support:
    • GPU Compute/Media Hotspots analysis in the Dynamic Instruction Count mode has been extended to include SIMD utilization metrics at the kernel and instruction level. These metrics help identify instructions in the OpenCL™ kernel that utilize SIMD poorly.
    • A deeper GPU utilization analysis has been introduced in Application Performance Snapshot (APS) and the HPC Performance Characterization analysis. The GPU utilization analysis now includes these GPU computation metrics:
      • GPU Time
      • GPU IPC
      • GPU Utilization
      • % of Stalled and Idle EUs.
    • There is now a simplified dependency on the Intel® Metric Discovery API library to collect GPU hardware statistics on Linux* systems. VTune Profiler now automatically selects the latest libstdc++ available at runtime to satisfy the GPU analysis requirements. For older versions of the product, follow procedures to enable manual configuration.

Platform analysis improvements:

  • CPU/FPGA Interaction analysis has been extended to process data sources collected either with AOCL Profiler (new mode) or via OpenCL Profiling API (legacy mode). Specify the name of your application target and its parameters directly in the WHAT pane.
  • The Hardware Tracing mode in the System Overview analysis has been extended to include new metrics to make the analysis more kernel-aware:
    • OS Kernel Activity and OS Scheduling metrics identify anomalies caused by unexpected kernel activity or preemptions.
    • The CPU Time metric has been split into User Time and Kernel Time metrics to show the number of kernel mode switches and their frequency (switches per second).
  • A new Module Entry Point grouping level has been added to the System Overview viewpoint to display data collected in the Hardware Tracing mode. The grouping shifts the focus to precise CPU time spent within system calls, interrupts, or a particular API of the runtime library.

Initial Release

Release Notes

Overview

  • There is a new, more descriptive name: “Intel® VTune™ Profiler” (formerly “Intel® VTune™ Amplifier”).
    • The command line interface amplxe-cl and the GUI interface amplxe-gui were renamed to vtune and vtune-gui, respectively.
  • Intel® VTune™ Profiler has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.
  • GPU accelerators support:
    • New GPU Offload analysis added to explore and correlate code execution across CPUs and GPUs. You can identify a kernel of interest for GPU-bound applications and explore further with GPU Compute/Media Hotspots analysis.
    • GPU Compute/Media Hotspots analysis updated with GPU in-kernel analysis for OpenCL™ code and an option to filter by a kernel of interest.
    • Command line hotspots report now supports GPU analysis types. You can apply the computing-task and computing-instance groupings to your collected data to focus on time-intensive computing tasks.
    • Dynamic instruction count collection (available as part of the GPU Compute/Media Hotspots Analysis) improved to provide better accuracy for basic block Assembly analysis.
    • Support for Intel® Processor Graphics Gen11.
  • Platform analysis support:
    • System Overview analysis updated to serve as an entry point to platform analysis. Use this analysis to assess system (IO, accelerators and CPU) performance and review guidance for next steps.
    • New Hardware Tracing mode in the System Overview analysis enables application analysis on the micro-second level and identification of causes for latency issues.
  • HPC analysis improvements:
    • Max and Bound Bandwidth metrics added to Application Performance Snapshot to better estimate the efficiency of the DRAM, MCDRAM, Persistent Memory and Intel® Omni-Path usage.
  • Platform Profiler new features and improvements:
    • Overview and Memory views extended with new metrics to analyze Non-Uniform Memory Access (NUMA) behavior.
    • User authentication and authorization implemented to enable access control to user data.
    • Added a new option for users to choose or modify the location of Platform Profiler data files.
  • Energy analysis improvements:
    • New Throttling analysis added to identify causes for system throttling, including violation of safe thermal or power limits.
    • Options for Energy analysis, based on the Intel SoC Watch data collector, extended to monitor processor package energy consumption over time and identify how it correlates with CPU throttling.
  • Cloud and containerization support:
    • Containerization support extended with an option to install and run VTune™ Profiler in a Docker* container and profile targets inside and outside the same container.
    • Added support to profile applications running in Amazon Web Services* (AWS) EC2 Instances based on Intel microarchitecture code name Cascade Lake X.
  • New Fabric Profiler performance tool added to VTune™ Profiler in Preview mode. Use Fabric Profiler to identify detailed characteristics of the runtime behavior for an OpenSHMEM application.
  • Quality and usability improvements:
    • Symbol resolution for effective source-level analysis enabled for crossgen (Ahead-of-JIT compilation) functions on Linux* systems.
    • Interactive Help Tour (available on the Welcome page) guides you through the product interface using a sample project.
  • New hardware/operating systems/IDEs support:
    • 10th Gen Intel® Core™ processors
    • Ubuntu* 19.10
    • Microsoft* Windows* 10, November 2019 Update
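For scripts written against earlier releases, the command-line rename noted above (amplxe-cl to vtune, amplxe-gui to vtune-gui) can be handled with a small translation step. The following is a minimal Python sketch, not an official migration tool. The function name modernize_command and the sample command line are illustrative assumptions; the sketch rewrites only the executable name and does not adjust installation paths or options, which may also differ between versions.

```python
import os
import shlex

# Executable names before and after the 2020 rename described in these notes.
RENAMED_TOOLS = {"amplxe-cl": "vtune", "amplxe-gui": "vtune-gui"}


def modernize_command(command_line: str) -> str:
    """Replace a legacy executable name with its 2020 equivalent.

    Hypothetical helper: only the first token (the executable) is rewritten;
    every option and argument is passed through unchanged.
    """
    parts = shlex.split(command_line)
    if not parts:
        return command_line
    directory, name = os.path.split(parts[0])
    new_name = RENAMED_TOOLS.get(name, name)
    # Keep any directory prefix as written; install locations are not adjusted.
    parts[0] = os.path.join(directory, new_name) if directory else new_name
    return shlex.join(parts)


# Illustrative command line; the target ./my_app is a placeholder.
print(modernize_command("amplxe-cl -collect hotspots -- ./my_app"))
```

Commands that do not start with one of the legacy names are returned unchanged, so the helper can be applied to a whole script line by line.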

2019

Update 8

Release Notes

Overview

  • Bug fixes and security updates.
  • Intel VTune Amplifier has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.

Update 7

  • No public release of Update 7 is available for Intel VTune Amplifier.

Update 6

Release Notes

Overview

  • Bug fixes and security updates.

Update 5

Release Notes

Overview

  • Relaxed limitations on collecting GPU hardware metrics for users without Administrator/root privileges.
  • Added support for HW-based analysis on systems running under Hyper-V.
  • Microarchitecture analysis improvements:
    • Relaxed limitations for the Perf* driverless collection on Linux for users with perf_event_paranoid value set to 2. For such environments, event-based sampling is supported in the user space.
    • Memory Access analysis updated with a new UPI Utilization metric for Intel microarchitectures code named Cascade Lake and Skylake.
  • Application Performance Snapshot improvements:
    • Max and Bound metrics added to estimate the efficiency of the DRAM, MCDRAM, and Persistent Memory usage.
  • Quality and usability improvements:
    • Interactive Help Tour is available from the Welcome page and guides you through the product interface using a sample project.
    • Simplified configuration of a Windows-to-Linux remote collection supporting automated password-less access to a Remote Linux (SSH) target.
  • Platform Profiler improvements:
    • Enhanced system overview including CPU and memory utilization summary, CPU I/O Wait information (Linux) and CPU utilization breakdown for key CPU-stall reasons
    • Initial implementation of a custom view that allows users to specify the metrics to be visualized
    • Full support for Intel® Optane™ DC Persistent Memory metrics on 2nd Generation Intel® Xeon® Scalable Processor server platforms (formerly Cascade Lake).
    • A simplified and consistent command line interface for collecting data that conforms better to CLI conventions. The previous command line interface is supported for backward compatibility. See the Intel® VTune™ Amplifier User Guide for more details.
  • Support for new operating systems:
    • Android Q
    • Red Hat* Enterprise Linux* 8
    • Fedora* 30
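The perf_event_paranoid condition mentioned above for driverless collection can be checked before launching an analysis. Below is a minimal sketch, assuming a Linux system that exposes the setting at /proc/sys/kernel/perf_event_paranoid. The helper names are hypothetical, and the interpretation of value 2 (event-based sampling in user space) is taken from the note above rather than from a complete statement of the product's requirements.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the kernel setting on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path: Path = PARANOID_PATH) -> Optional[int]:
    """Return the current setting, or None if it cannot be read (e.g. non-Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def describe_sampling_scope(level: Optional[int]) -> str:
    """Summarize what these release notes say about a given level (hypothetical helper)."""
    if level is None:
        return "unknown: perf_event_paranoid could not be read"
    if level == 2:
        return "user-space event-based sampling (per the 2019 Update 5 notes)"
    if level < 2:
        return "less restrictive than 2; consult the product documentation"
    return "more restrictive than 2; consult the product documentation"


print(describe_sampling_scope(read_perf_event_paranoid()))
```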

NOTE: 32-bit operating systems are deprecated in the 2019 version, and support will be removed in an upcoming release. VTune™ Amplifier can still profile 32-bit applications on 64-bit operating systems (cross mode).

Update 4

Release Notes

Overview:

  • GPU analysis improvements:
    • Inline Mode filter option added to GPU In-Kernel Profiling viewpoint, to display GPU-side call stacks with OpenCL™ inline functions and correctly attribute GPU Cycles statistics per function. By default, Inline Mode is switched off.
    • Source/Assembly analysis available for OpenCL programs created with IL (intermediate language), if the intermediate SPIR-V binary was built with the -gline-tables-only -s <cl_source_file_name> option.
    • (PREVIEW) New Instruction Count profiling mode added to the GPU In-Kernel Profiling to analyze GPU instructions executed by an OpenCL kernel and classified per instruction type. This mode helps you compare the performance of the same OpenCL kernel on different hardware or explore instruction count for different implementations of the same algorithms on the same hardware.
  • Microarchitecture analysis improvements:
    • Default driverless mode for hardware event-based collections with stacks, such as Hotspots and Threading. Driver-based collection can still be run by setting the "Stack size" option to the unlimited value (0) or disabling the "Enable driverless collection" option in a custom analysis.
    • The Precise column added to the Summary of the Hardware Events viewpoint to clearly identify precise events. Using precise events in your configurations provides more accurate Assembly analysis with no event skid.
  • Quality and usability improvements:
    • Improved integration with the Microsoft* Visual Studio* IDE, with quick access to VTune Amplifier options via a smart integrated Welcome page.
    • Overlay help with quick tips for the Bottom-up tab highlights important interface elements to efficiently manage analysis data.
    • Added Linux kernel 5.0 support

Update 3

Release Notes

Overview:

  • Support for Intel® Optane™ DC persistent memory and the latest microarchitecture code-named Cascade Lake. This includes new hardware event support and enhanced memory analysis to design and optimize for the new persistent memory technology.
    Learn more about the next generation of memory!
  • Resolve performance bottlenecks where network workloads are consuming high I/O bandwidth. Enhanced PCIe device metrics for I/O traffic in the Input and Output analysis help you understand the interactions between Cores and Network Interface Cards (NICs).
  • MPI improvements:
    • Easier control of data collection for MPI applications using the standard MPI_Pcontrol API. Collect only the data you need with a few quick changes and no dependency on the ITT API.
    • Easier MPI communication pattern diagnosis with Application Performance Snapshot’s rank-to-rank communication diagram by message volume.
  • Usability improvements:
    • Friendlier welcome page provides fast access to technical content and project controls.
    • Improved importing process for traces and result files. It’s now possible to import whole result directories to a project and use project search directories for symbol and source/assembly resolution.
    • Simplified installation and licensing (serial numbers and license files are no longer required for this product).
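The MPI_Pcontrol-based collection control mentioned above can be sketched as follows. This is an illustrative Python example, assuming the mpi4py package (which exposes the call as MPI.Pcontrol) and the conventional MPI meaning of the level argument (0 disables profiling, 1 enables it at the default level of detail). The context-manager name collection_paused and its pcontrol parameter are hypothetical, and the exact levels honored by the collector should be checked in the product documentation.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


def _default_pcontrol() -> Callable[[int], None]:
    # Imported lazily so the sketch can be read and tested without MPI installed.
    from mpi4py import MPI  # assumption: mpi4py is available in real runs
    return MPI.Pcontrol


@contextmanager
def collection_paused(
    pcontrol: Optional[Callable[[int], None]] = None,
) -> Iterator[None]:
    """Disable profiling for the enclosed block, then re-enable it."""
    control = pcontrol if pcontrol is not None else _default_pcontrol()
    control(0)  # level 0: profiling disabled
    try:
        yield
    finally:
        control(1)  # level 1: profiling enabled at the default level
```

In an MPI program, wrapping an initialization phase in `with collection_paused():` would keep that phase out of the collected data, while the rest of the run is profiled as usual. Passing a custom callable as pcontrol allows the logic to be exercised without an MPI runtime.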

Update 2

Release Notes

Overview:

  • Intel® VTune™ Amplifier 2019 Update 2 includes functional and security updates. Users should update to the latest version.
  • Microarchitecture analysis improvements:
    • Configuration for the Microarchitecture Exploration analysis optimized to give you control over the collected hardware metrics and the overall data collection overhead. By default, the analysis provides you with a full set of top-level hardware metrics and their sub-metrics that show how your code uses hardware resources. With a new configuration option, you can choose to narrow down the scope and collect sub-metrics only for the selected top-level metrics.
  • System Analyzer tool for monitoring real-time metrics on a target system added to the VTune Amplifier as a PREVIEW feature.
  • HPC workload profiling improvements:
    • Full-featured support of OpenMPI targets in Application Performance Snapshot
    • Vectorization metrics streamlined for the HPC Performance Characterization analysis
    • PREVIEW: HTML report added to show process/thread affinity along with CPU execution and remote access information
  • Supported managed Linux and Windows targets with tiered compilation for .NET* Core 3.0 Preview 1 and .NET Core 2.2
  • Quality and usability improvements:
    • Improved support for standalone command-line results imported into a VTune Amplifier GUI project. Search directories specified in the command line configuration are preserved and applied for proper module resolution in the graphical viewpoints.

Update 1

Release Notes

Overview:

  • Threading analysis extended with the lower overhead hardware event-based sampling mode. This mode helps analyze the impact of thread preemption and context switching. On Windows*, this analysis configuration requires the sampling driver. On Linux*, the analysis is available both with the sampling driver and with the Linux Perf* collector for kernels 4.4 and higher.
  • Quality and usability improvements:
    • The summary command line report for the Hotspots analysis is enriched with metrics and a Top 5 Hotspots table that is also available from the GUI Summary view.
    • A sample matrix project added to the Project Navigator to help you get started with the product, review a sample pre-collected Hotspots result, and test other analysis types and source view options. A pre-built version of the matrix sample application and associated source files are installed with VTune Amplifier.
    • Support for Linux Perf* collection extended with VTune Amplifier metrics, with a further option to import the Perf trace to the VTune Amplifier GUI and benefit from predefined viewpoints. This solution could be useful for performance analysis in data centers.

Initial release

Release Notes

Overview:

  • New, easier tuning workflow and simplified setup
  • New Platform Profiler. Longer data collection finds hardware configuration issues and poorly tuned applications.
  • Application Performance Snapshot adds utilization of logical vs. physical cores, pause and resume and Intel Trace Analyzer and Collector integration

2018

Update 4

Release Notes

Overview:

  • Support for new operating systems:
    • SUSE* Linux* Enterprise Server (SLES) 12 SP3, SUSE* Linux* Enterprise Server (SLES) 15
    • Red Hat* Enterprise Linux* 6.10
    • Fedora 28
    • Microsoft Windows* 10 RS4

Update 3

Release Notes

Overview:

  • Analysis on embedded platforms and accelerators:
    • New CPU/FPGA Interaction analysis (PREVIEW) to assess the balance between the CPU and FPGA on systems with a discrete Intel® Arria® 10 FPGA running OpenCL™ applications
    • New Graphics Rendering analysis (PREVIEW) for CPU/GPU utilization of your code running on the Xen* virtualization platform installed on a remote embedded target
    • Support for the sampling command-line analysis on remote QNX* embedded systems via an Ethernet connection
  • HPC workload profiling improvements:
    • CPU Utilization metric refined to differentiate the utilization on logical vs. physical cores, which is particularly important for HPC applications running on Intel® Xeon® processor family processors
  • Managed runtime analysis improvements:
    • Extended JIT profiling for server-side applications running on the LLVM* or HHVM* PHP servers to support the event-based sampling analysis in the attach mode
    • Extended Java* code analysis with support for OpenJDK* 9 and Oracle* JDK 9
    • Enabled Advanced Hotspots analysis for .NET* Core applications running on Linux and Windows systems in the Launch Application mode
  • Application Performance Snapshot improvements:
    • Added the ability to pause/resume collection with MPI_Pcontrol and the ITT API. The -start-paused option was added to exclude application execution from collection from the start to the first collection resume occurrence.
    • Enabled selection of which data types are collected to reduce overhead. The choices include MPI tracing, OpenMP tracing, hardware counter based collection, or a combination of the three.
    • Exposed the CPU Utilization metric by physical cores on processors that support proper hardware events.
    • Significantly reduced MPI tracing overhead when there are a large number of ranks.
    • Enriched MPI statistics generated by the aps-report utility with information about communicators used in the application and the ability to group and filter collective operations by communicator.
    • Improved integration with Intel® Trace Analyzer and Collector by adding the ability to generate profiling configuration files with the aps-report option.
  • Quality and usability improvements:
    • Hardware event-based analysis supported for targets running in the Hyper-V* environment on Windows* 10 Fall Creators Update (RedStone3)
  • Support for new operating systems and IDEs including:
    • Fedora*
    • Ubuntu* 17.10

Update 2

Release Notes

Overview:

  • Mitigated impact of OS security updates: https://software.intel.com/content/www/us/en/develop/articles/intel-vtune-amplifier-impact-of-recent-os-security-updates.html
  • Collect only the data you need with Application Performance Snapshot’s new data selection options and pause/resume API support. Get better answers with lower overhead.
  • Assess the balance between the CPU and FPGA with a new CPU/FPGA Interaction analysis (PREVIEW)
  • CPU utilization for physical and logical cores improves analysis of hyper-threading and thread migration performance effects.
  • Improvements to JIT profiling for server-side applications and support for OpenJDK* 9 and Oracle* JDK 9.
  • Profile .NET* Core applications running on Linux* or Windows* systems with Advanced Hotspots analysis.
  • Hardware event-based analysis supported for targets running in the Hyper-V* environment on Windows* 10 Fall Creators Update (build 1709)

Initial Release and Update 1

Release Notes

Overview:

  • Easier tuning of threaded MPI applications. HPC analysis adds enhanced metrics for MPI including MPI imbalance & performance of critical path rank. Application Performance Snapshot merges MPI + Application data, includes richer metrics, and adds MPICH compatibility.
  • Optimize private cloud-based applications. Profile inside Docker & Mesos containers and attach to running Java services and daemons.
  • Easier analysis of remote Linux* systems. Automated install of performance collectors on a remote Linux target.

2017

Update 5

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Support for Microsoft Visual Studio* 2017 Update 3
  • Bug fixes and performance improvements

Update 4

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2)

Update 3

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Application Performance Snapshot (Preview) provides a quick look at your application performance and helps you understand where your application will benefit from tuning. The revised tool shows metrics on MPI parallelism (Linux* only), OpenMP* parallelism, memory access, FPU utilization, and I/O efficiency with recommendations on further in-depth analysis.
  • Support for Intel® Xeon Phi™ coprocessor targets codenamed Knights Landing
  • Improved insight into parallelism inefficiencies for applications using Intel Threading Building Blocks (Intel TBB) with extended classification of high Overhead and Spin time.
  • Automated installation of the VTune Amplifier collectors on a remote Linux target system. This feature is helpful if you profile a target on a shared resource without VTune Amplifier installed or on an embedded platform where targets may be reset frequently.
  • Support for Microsoft Visual Studio* 2017

Update 2

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

This update is optional unless you need the new features.

Overview:

Details:

HPC Performance Characterization Analysis improvements

The HPC Performance Characterization Analysis has received several improvements.

Increased detail and structure for the vector efficiency metrics based on FLOP counters in the FPU Utilization section help diagnose the reason for low utilization connected with poor vector code generation. Relevant metrics include:

  • Vector Capacity Usage
  • FP Instruction Mix
  • FP Arithmetic Instructions per Memory Read or Write
  • SP FLOPs per Cycle (may indicate memory bandwidth bound code)

For MPI applications, the MPI Imbalance metric shows CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. The issue description for this metric is generated from the minimal MPI Busy Wait time across ranks. If the minimal MPI Busy Wait time is not significant, the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
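As a worked illustration of the description above, the sketch below averages per-rank busy-wait times and picks out the rank with the least wait, which the text identifies as the likely critical-path rank. The function name, the input format (a mapping from rank to seconds of busy wait), and the sample numbers are assumptions made for illustration; this is not the product's actual computation.

```python
from typing import Dict, Tuple


def mpi_imbalance(busy_wait_by_rank: Dict[int, float]) -> Tuple[float, int]:
    """Return (mean busy-wait time per rank, rank with the minimal busy wait)."""
    if not busy_wait_by_rank:
        raise ValueError("at least one rank is required")
    # Total spin-wait time normalized by the number of ranks on the node.
    mean_wait = sum(busy_wait_by_rank.values()) / len(busy_wait_by_rank)
    # The rank that waits least is the candidate for the critical path.
    critical_rank = min(busy_wait_by_rank, key=busy_wait_by_rank.get)
    return mean_wait, critical_rank


# Hypothetical per-rank busy-wait times in seconds.
imbalance, rank = mpi_imbalance({0: 4.0, 1: 0.5, 2: 3.5, 3: 4.0})
print(imbalance, rank)  # prints: 3.0 1
```

With these sample values, rank 1 spends far less time waiting than the others, so its CPU utilization is the first place to look.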

The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating point operations, sorted by CPU time. The FPU Utilization column provides issue descriptions based on whether a loop/function is bandwidth bound, whether it is vectorized or scalar, and what instruction set it's using.

For Intel Xeon Phi processors (codenamed Knights Landing), the following FPU metrics are available instead of FLOP counters:

  • SIMD instructions per cycle
  • Fraction of packed SIMD instructions vs scalar SIMD instructions per cycle
  • Vector instruction set for loops based on static analysis

DRAM Bandwidth Bound metric

A new metric is available in the Memory Usage viewpoint for the Memory Access and HPC Performance Characterization analyses which indicates whether your system spent much time heavily utilizing the DRAM bandwidth. The calculation of this metric relies on accurate maximum system DRAM bandwidth measurement, and depends on the number of sockets on your system.
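One way to picture such a metric is as the fraction of sampled intervals in which observed bandwidth sits close to the measured maximum. The sketch below only illustrates that idea: the function name, the 70% threshold, and the sample values are assumptions, and the product's real calculation is not described in these notes.

```python
from typing import Sequence


def dram_bandwidth_bound(
    samples_gb_s: Sequence[float],
    max_bandwidth_gb_s: float,
    threshold: float = 0.7,  # assumed cutoff for "heavily utilized"
) -> float:
    """Fraction of samples at or above `threshold` of the maximum bandwidth."""
    if max_bandwidth_gb_s <= 0:
        raise ValueError("maximum bandwidth must be positive")
    if not samples_gb_s:
        return 0.0
    high = sum(1 for s in samples_gb_s if s / max_bandwidth_gb_s >= threshold)
    return high / len(samples_gb_s)


# Hypothetical bandwidth samples (GB/s) against a 100 GB/s measured maximum.
print(dram_bandwidth_bound([10.0, 80.0, 90.0, 40.0], 100.0))  # prints: 0.5
```

Because the result is relative to the measured maximum, the same raw bandwidth can count as heavy use on one system and light use on another, which is why the text stresses an accurate maximum measurement.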

GPU Hotspots Summary improvements

The GPU Hotspots viewpoint's Summary tab has been extended to display more information. The GPU Usage section can be used to identify whether the GPU was properly utilized. The Packet Queue Depth Histogram can be used to estimate the GPU software queue depth per GPU engine during the target run. Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for the zero queue depth.

For a high-level view of the DMA packet execution during the target run, review the Packet Duration Histogram. Select a required packet type from the drop-down menu and identify how effectively these packets were executed on the GPU. Having high packet count values for the minimal duration is optimal.

KVM Guest OS Profiling

If you are a system developer and interested in the performance analysis of a guest Linux* system, use Intel VTune Amplifier for performance analysis of this guest Linux* OS via Kernel-based Virtual Machine (KVM) from the host system. Depending on your analysis target, you may choose either of the following usage models for KVM guest OS profiling:

Locks & Waits analysis for Python

Locks and Waits analysis can now be used to tune threaded performance of mixed Python* and native code. View Sync Objects in the grid, see Python frames in the Call Stack, and define which sync objects are the Global Interpreter Lock (GIL), either by wait count or by call stack. Drill down to Python source to explore thread synchronization issues at code level. For more information on how to configure the analysis, see the Python* Code Analysis product help article.

Update 1

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

  • Support for the Average Latency metric in the Memory Access analysis based on the driverless collection
  • Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
  • Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory and FPU performance aspects including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use a new report-knob show-issues option.
  • Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots
  • GPU Hotspots analysis extended to detect hottest computing tasks bound by GPU L3 bandwidth
  • PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and lets you see all detected GPU stalled/idle issues in the same view.
  • Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view

Initial Release

Linux* Release Notes   macOS* Release Notes   Windows* Release Notes

Overview:

Details:

Intel® Xeon Phi™ Processor Support

Intel® VTune™ Amplifier now supports the Intel® Xeon Phi™ Processor codenamed Knights Landing.

Decide how to use MCDRAM (the high bandwidth memory) effectively using Memory Access Analysis, analyze the scalability of MPI and OpenMP* with HPC Performance Characterization Analysis, and explore the microarchitecture efficiency with General Exploration Analysis.

HPC Performance

The HPC Performance Characterization Analysis explores the three key performance aspects of application scalability:

  • Threading: CPU Utilization with parallel efficiency for MPI and OpenMP*. Explore the serial vs parallel time and the top OpenMP regions by potential gain.
  • Memory Access Efficiency: includes bandwidth utilization and stalls by memory hierarchy.
  • FPU utilization: includes basic vectorization metrics.

See the analysis usage example in the Analyzing an OpenMP and MPI Application web-based tutorial, which provides a hands-on exercise to identify memory utilization inefficiencies and load imbalance for a sample hybrid application.

Memory Access Analysis

The Memory Access Analysis has been improved. In addition to support for the Intel Xeon Phi processors, it now supports custom memory allocators, and includes automatic detection of maximum system DRAM bandwidth characteristics and scaling bandwidth data from that maximum. This allows users to easily see how they actually utilize the available DRAM bandwidth, rather than just raw GB/s values. The QPI bandwidth has been split into Total, Outgoing, and Incoming, instead of just the total. The workflow has been optimized for identifying the top memory objects with high bandwidth utilization per domain. Finally, no special drivers are required on Linux*; this analysis type can now use standard Linux* perf to collect data, eliminating the need for root to install other drivers.

Disk I/O Analysis (Preview)

The Disk Input and Output analysis for HDD, SATA, or NVMe SSD monitors utilization of the disk subsystem, CPU, and PCIe buses, and helps to identify long latency of I/O requests and imbalance between I/O and compute operations.

See the Analyzing Input/Output Waits tutorial for a hands-on exercise with sample code on Linux*.

GPU analysis improvements

GPU Hotspots Analysis is intended for GPU-bound applications, and provides options to analyze execution of OpenCL™ kernels and Intel® Media SDK tasks.

The GPU Analysis Summary provides a set of metrics to estimate the GPU utilization per engine, identify stalled or idle execution units, and explore the most typical problems with low occupancy or frequent sampler accesses. Navigate from the Hottest GPU computing tasks summary to the details provided in the graphics tab.

Intel VTune Amplifier now also supports the detection of OpenCL 2.0 Shared Virtual Memory (SVM) usage types per kernel instance.

For more information, see Using Intel VTune Amplifier to Optimize Media & Video Applications.

Usability Improvements

Remote usage and Command Line usage have been improved. Use the Arbitrary target GUI configuration to generate a command line for performance analysis on a system that is not accessible from the current host.

MPI analysis has been extended with the event-based sampling collection supported for multiple ranks per node with an arbitrary MPI launcher and natural syntax. Use the MPI launcher option in the arbitrary targets configuration to automatically generate a command line for MPI analysis from the GUI.

An option for enabling and disabling the OpenMP regions analysis has been added to selected analysis configurations.

Support has been added for the Attach To Process target type with event-based sampling for low-privilege Java* daemons on Linux*.

The event selection mechanism for custom hardware event based sampling has been extended with filtering options.

The UI for grid views and for identifying performance issues has been improved.

Intel® Performance Snapshot (Preview)

The Application Performance Snapshot tool provides a quick look at your application performance and helps you understand whether your application will benefit from tuning.

It identifies how effectively your application uses the hardware platform and displays basic performance enhancement opportunities.

The Storage Performance Snapshot tool analyzes your system's storage, CPU, memory, and network usage and displays basic performance enhancement opportunities for systems using Intel hardware.

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804