Intel® VTune™ Amplifier Release Notes and New Features

This page provides the current Release Notes for Intel® VTune™ Amplifier (Intel® VTune™ Amplifier XE for versions 2017 and older). The notes are categorized by major version, from newest to oldest, with individual releases listed within each version section.

Click a release to expand it into a summary of new features and changes in that version since the last release. The expanded summary also contains download buttons for the detailed release notes, which include important information such as prerequisites, software compatibility, installation instructions, and known issues.

You can copy a link to a specific release's section by clicking the chain icon next to its name.

The installation guides for version 2016 and later are posted separately:
Linux* | macOS* | Windows*

All files are in PDF format - Adobe Reader* (or compatible) required.
To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2018

Initial Release

Release Notes

Overview:

  • Easier tuning of threaded MPI applications. HPC analysis adds enhanced metrics for MPI including MPI imbalance & performance of critical path rank. Application Performance Snapshot merges MPI + Application data, includes richer metrics, and adds MPICH compatibility.
  • Optimize private cloud-based applications. Profile inside Docker & Mesos containers and attach to running Java services and daemons.
  • Easier analysis of remote Linux* systems. Automated install of performance collectors on a remote Linux target.

2017

Update 5

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Support for Microsoft Visual Studio* 2017 Update 3
  • Bug fixes and performance improvements
Update 4

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2)
Update 3

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

  • Application Performance Snapshot (Preview) provides a quick look at your application performance and helps you understand where your application will benefit from tuning. The revised tool shows metrics on MPI parallelism (Linux* only), OpenMP* parallelism, memory access, FPU utilization, and I/O efficiency with recommendations on further in-depth analysis.
  • Support for Intel® Xeon Phi™ coprocessor targets codenamed Knights Landing
  • Improved insight into parallelism inefficiencies for applications using Intel Threading Building Blocks (Intel TBB) with extended classification of high Overhead and Spin time.
  • Automated installation of the VTune Amplifier collectors on a remote Linux target system. This feature is helpful if you profile a target on a shared resource without VTune Amplifier installed or on an embedded platform where targets may be reset frequently.
  • Support for Microsoft Visual Studio* 2017
Update 2

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

This update is optional unless you need the new features.

Overview:

Details:

HPC Performance Characterization Analysis improvements

The HPC Performance Characterization Analysis has received several improvements.

Increased detail and structure for the vector efficiency metrics based on FLOP counters in the FPU Utilization section help diagnose the reason for low utilization connected with poor vector code generation. Relevant metrics include:

  • Vector Capacity Usage
  • FP Instruction Mix
  • FP Arithmetic Instructions per Memory Read or Write
  • SP FLOPs per Cycle (may indicate memory bandwidth bound code)

For MPI applications, the MPI Imbalance metric shows CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. The issue description for this metric is generated from the minimal MPI Busy Wait time across ranks. If that minimal time is not significant, the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
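As a rough illustration of the reasoning above, the following sketch takes hypothetical per-rank MPI Busy Wait times (not actual VTune Amplifier output), normalizes the total wait time by the rank count, and picks the rank with the minimal wait time as the likely critical-path rank. The function names and data layout are assumptions made for this example only.

```python
def mpi_imbalance(busy_wait_by_rank):
    """Total busy-wait time divided by the number of ranks on the node."""
    if not busy_wait_by_rank:
        raise ValueError("at least one rank is required")
    return sum(busy_wait_by_rank.values()) / len(busy_wait_by_rank)


def likely_critical_rank(busy_wait_by_rank):
    """Rank with the minimal busy-wait time (ties go to the lowest rank id)."""
    return min(sorted(busy_wait_by_rank), key=busy_wait_by_rank.get)


# Hypothetical busy-wait seconds per rank on one profiling node.
waits = {0: 4.0, 1: 0.5, 2: 3.5, 3: 4.0}
print(mpi_imbalance(waits))         # 3.0
print(likely_critical_rank(waits))  # 1
```

In this made-up data set, rank 1 waits the least, so its CPU utilization metrics would be the first place to look.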

The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating point operations, sorted by CPU time. The FPU Utilization column provides issue descriptions based on whether a loop/function is bandwidth bound, whether it is vectorized or scalar, and what instruction set it's using.

For Intel Xeon Phi processors (codenamed Knights Landing), the following FPU metrics are available instead of FLOP counters:

  • SIMD instructions per cycle
  • Fraction of packed SIMD instructions vs scalar SIMD instructions per cycle
  • Vector instruction set for loops based on static analysis
DRAM Bandwidth Bound metric

A new metric is available in the Memory Usage viewpoint for the Memory Access and HPC Performance Characterization analyses which indicates whether your system spent much time heavily utilizing the DRAM bandwidth. The calculation of this metric relies on accurate maximum system DRAM bandwidth measurement, and depends on the number of sockets on your system.
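One plausible way to picture such a metric is the share of bandwidth samples that sit close to the measured maximum. The sketch below is only an assumption-based illustration: the 70% cutoff, the function name, and the sample values are hypothetical and do not describe how VTune Amplifier actually computes the metric.

```python
def dram_bandwidth_bound(samples_gb_s, max_gb_s, high_fraction=0.7):
    """Fraction of samples at or above high_fraction of the maximum bandwidth.

    high_fraction is a hypothetical cutoff chosen for this example.
    """
    if max_gb_s <= 0:
        raise ValueError("maximum bandwidth must be positive")
    if not samples_gb_s:
        return 0.0
    threshold = high_fraction * max_gb_s
    high = sum(1 for sample in samples_gb_s if sample >= threshold)
    return high / len(samples_gb_s)


# Hypothetical bandwidth samples (GB/s) against a measured maximum of 50 GB/s.
print(dram_bandwidth_bound([10, 40, 45, 20], max_gb_s=50))  # 0.5
```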

GPU Hotspots Summary improvements

The GPU Hotspots viewpoint's Summary tab has been extended to display more information. The GPU Usage section can be used to identify whether the GPU was properly utilized. The Packet Queue Depth Histogram can be used to estimate the GPU software queue depth per GPU engine during the target run. Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for the zero queue depth.
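To make the idea of a queue depth histogram concrete, the sketch below accumulates the time spent at each software queue depth from a list of hypothetical (duration, depth) intervals and reports the share of time at depth zero. The input format and function names are assumptions for illustration, not a VTune Amplifier data format.

```python
from collections import defaultdict


def queue_depth_histogram(intervals):
    """Sum the time spent at each queue depth from (duration, depth) pairs."""
    histogram = defaultdict(float)
    for duration, depth in intervals:
        histogram[depth] += duration
    return dict(histogram)


def zero_depth_share(histogram):
    """Share of the total time during which the queue was empty."""
    total = sum(histogram.values())
    return histogram.get(0, 0.0) / total if total else 0.0


# Hypothetical intervals for one GPU engine: (seconds, queue depth).
hist = queue_depth_histogram([(2.0, 0), (3.0, 1), (1.0, 0), (4.0, 2)])
print(hist)                    # {0: 3.0, 1: 3.0, 2: 4.0}
print(zero_depth_share(hist))  # 0.3
```

A smaller zero-depth share corresponds to the goal stated above: minimal duration for the zero queue depth.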

For a high-level view of the DMA packet execution during the target run, review the Packet Duration Histogram. Select a required packet type from the drop-down menu and identify how effectively these packets were executed on the GPU. Having high packet count values for the minimal duration is optimal.

KVM Guest OS Profiling

If you are a system developer and interested in the performance analysis of a guest Linux* system, use Intel VTune Amplifier for performance analysis of this guest Linux* OS via Kernel-based Virtual Machine (KVM) from the host system. Depending on your analysis target, you may choose either of the following usage models for KVM guest OS profiling:

Locks & Waits analysis for Python

Locks and Waits analysis can now be used to tune threaded performance of mixed Python* and native code. View Sync Objects in the grid, see Python frames in the Call Stack, and define which sync objects are the Global Interpreter Lock (GIL), either by wait count or by call stack. Drill down to Python source to explore thread synchronization issues at the code level. For more information on how to configure the analysis, see the Python* Code Analysis product help article.

Update 1

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • Support for the Average Latency metric in the Memory Access analysis based on the driverless collection
  • Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
  • Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory and FPU performance aspects including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use a new report-knob show-issues option.
  • Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots
  • GPU Hotspots analysis extended to detect hottest computing tasks bound by GPU L3 bandwidth
  • PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and lets you see all detected GPU stalled/idle issues in the same view.
  • Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view
Initial Release

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

Overview:

Details:

Intel® Xeon Phi™ Processor Support

Intel® VTune™ Amplifier now supports the Intel® Xeon Phi™ Processor codenamed Knights Landing.

Decide how to use MCDRAM (the high bandwidth memory) effectively using Memory Access Analysis, analyze the scalability of MPI and OpenMP* with HPC Performance Characterization Analysis, and explore the microarchitecture efficiency with General Exploration Analysis.

HPC Performance

The HPC Performance Characterization Analysis explores the three key performance aspects of application scalability:

  • Threading: CPU Utilization with parallel efficiency for MPI and OpenMP*. Explore the serial vs parallel time and the top OpenMP regions by potential gain.
  • Memory Access Efficiency: includes bandwidth utilization and stalls by memory hierarchy.
  • FPU utilization: includes basic vectorization metrics.

See the analysis usage example in the Analyzing an OpenMP and MPI Application web-based tutorial, which provides a hands-on exercise to identify memory utilization inefficiencies and load imbalance for a sample hybrid application.

Memory Access Analysis

The Memory Access Analysis has been improved. In addition to support for the Intel Xeon Phi processors, it now supports custom memory allocators, automatically detects the maximum system DRAM bandwidth, and scales bandwidth data relative to that maximum. This lets users easily see how much of the available DRAM bandwidth they actually use, rather than just raw GB/s values. The QPI bandwidth is now split into Total, Outgoing, and Incoming, instead of showing only the total. The workflow has been optimized for identifying the top memory objects with high bandwidth utilization per domain. Finally, no special drivers are required on Linux*; this analysis type can now use standard Linux* perf to collect data, eliminating the need for root access to install other drivers.

Disk I/O Analysis (Preview)

The Disk Input and Output analysis for HDD, SATA, or NVMe SSD monitors utilization of the disk subsystem, CPU, and PCIe buses, and helps to identify long latency of I/O requests and imbalance between I/O and compute operations.

See the Analyzing Input/Output Waits tutorial for a hands-on exercise with sample code on Linux*.

GPU analysis improvements

GPU Hotspots Analysis is intended for GPU-bound applications, and provides options to analyze execution of OpenCL™ kernels and Intel® Media SDK tasks.

The GPU Analysis Summary provides a set of metrics to estimate the GPU utilization per engine, identify stalled or idle execution units, and explore the most typical problems with low occupancy or frequent sampler accesses. Navigate from the Hottest GPU computing tasks summary to the details provided in the graphics tab.

Intel VTune Amplifier now also supports the detection of OpenCL 2.0 Shared Virtual Memory (SVM) usage types per kernel instance.

For more information, see Using Intel VTune Amplifier to Optimize Media & Video Applications.

Usability Improvements

Remote usage and Command Line usage have been improved. Use the Arbitrary target GUI configuration to generate a command line for performance analysis on a system that is not accessible from the current host.

MPI analysis has been extended with the event-based sampling collection supported for multiple ranks per node with an arbitrary MPI launcher and natural syntax. Use the MPI launcher option in the arbitrary targets configuration to automatically generate a command line for MPI analysis from the GUI.

An option for enabling and disabling the OpenMP regions analysis has been added to selected analysis configurations.

Support has been added for the Attach To Process target type with event-based sampling for low-privilege Java* daemons on Linux*.

The event selection mechanism for custom hardware event based sampling has been extended with filtering options.

The UI for the grid views and for identifying performance issues has been improved.

Intel® Performance Snapshot (Preview)

The Application Performance Snapshot tool provides a quick look at your application performance and helps you understand whether your application will benefit from tuning.

It identifies how effectively your application uses the hardware platform and displays basic performance enhancement opportunities.

The Storage Performance Snapshot tool analyzes your system's storage, CPU, memory, and network usage and displays basic performance enhancement opportunities for systems using Intel hardware.

2016

Update 4

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • Support for the Intel® Xeon Phi™ Processor Codenamed Knights Landing (KNL) including General Exploration, Memory Access, HPC Performance Characterization analysis and PMU event reference.
  • PMU event reference for Intel® Xeon® Processor E5 v4 Family (formerly codenamed "Broadwell-EP")

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

Update 3

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • Support for the next generation Intel® Xeon® Processor E5 v4 Family (formerly codenamed "Broadwell-EP")
  • Detection of the OpenCL™ 2.0 Shared Virtual Memory (SVM) usage types per kernel instance
  • Arbitrary targets command line configuration extended with MPI launcher options
  • New option for enabling/disabling the OpenMP* regions analysis added to selected analysis configurations (default is now off)
  • Driverless event-based sampling collection for uncore events enabled for the Memory Access analysis
  • Preview features:
    • Disk Input and Output analysis that monitors utilization of the disk subsystem, CPU and processor buses, helps identify long latency of I/O requests and imbalance between I/O and compute operations
    • GPU Hotspots analysis targeted for GPU-bound applications and providing options to analyze execution of OpenCL™ kernels and Intel Media SDK tasks
    • Basic Hotspots analysis extended to support Python* applications running via the Launch Application or Attach to Process modes
  • Support for the Microsoft* Visual Studio 2015 Update 2
  • Ability to load/unload VTune Amplifier product environment with modulefile for Environment Modules system
  • Support for Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.7

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

Update 2

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • Source/Assembly analysis available for OpenCL™ kernels
  • SGX Hotspots analysis support for identifying hotspots inside security enclaves for systems with the Intel Software Guard Extensions (Intel SGX) feature enabled
  • HPC Characterization analysis (preview) that monitors utilization of the CPU, memory, and FPU for a compute-intensive or throughput application and helps identify floating point operation and memory optimization opportunities.
  • Metric-based navigation between call stack types replacing the former Data of Interest selection
  • Updated filter bar options, including the selection of a filtering metric used to calculate the contribution of the selected program unit (module, thread, and so on)
  • Default project configuration changed to apply existing target (thresholds for frame rate, region/interrupt/function duration) and filtering (call stack mode, inline mode, loop mode) settings to all subsequent results generated for this project
  • New option to measure the maximum local bandwidth and use this data to scale the DRAM Bandwidth overtime view and calculate the bandwidth histogram thresholds
  • Support for Fedora* 23 and Ubuntu* 15.10
  • Support for the Microsoft Windows* 10 November update

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

Update 1

Linux* Release Notes | macOS* Release Notes | Windows* Release Notes

  • General Exploration analysis for Intel® microarchitecture code name Cherry Trail
  • Event-based sampling collection for multiple ranks per node with an arbitrary MPI launcher
  • Command-line option -knob event-config extended to display a list of PMU events available on the target system
  • Algorithm analysis views extended to display confidence indication (greyed out font) for metrics lacking sufficient samples
  • Event-based sampling collection support for .NET* processes (.NET 4.0 and higher) in the attach mode
  • Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.6 support
  • Linux* kernel 4.1, 4.2 and 4.3 support

Note: you may receive a warning message about "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems. The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with digital SHA-2 certificate key for compliance with Windows* 10 requirements. To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add functionality for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929.

Initial Release

Linux* Release Notes | Windows* Release Notes


Tune OpenMP Scalability Faster

Using the enhanced OpenMP* analysis you can effectively identify common performance bottlenecks of your parallel implementation, such as:

  • Execution of serial portions (outside of any parallel region): When the master thread is executing a serial region and when the worker threads are in the OpenMP runtime library waiting for the next parallel region.
  • Load imbalance: When a thread finishes its part of workload in a parallel region, it waits at a barrier for the other threads to finish.
  • Not enough parallel work: The number of loop iterations is less than the number of working threads so several threads from the team are waiting at the barrier not doing useful work at all.
  • Synchronization on locks: When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource.
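Two of the bottlenecks listed above, load imbalance and insufficient parallel work, can be expressed with simple arithmetic. The sketch below is a minimal illustration on hypothetical per-thread work times; the function names are invented for this example and are not part of any VTune Amplifier or OpenMP API.

```python
def barrier_wait_times(work_times):
    """Time each thread waits at the region barrier for the slowest thread."""
    longest = max(work_times)
    return [longest - t for t in work_times]


def idle_threads(loop_iterations, team_size):
    """Threads left without any iteration when the loop is too short."""
    return max(team_size - loop_iterations, 0)


# Hypothetical work time (seconds) of four threads in one parallel region.
print(barrier_wait_times([4.0, 3.0, 1.0, 4.0]))  # [0.0, 1.0, 3.0, 0.0]

# A 3-iteration loop executed by a team of 8 threads leaves 5 threads idle.
print(idle_threads(loop_iterations=3, team_size=8))  # 5
```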

VTune Amplifier, together with the Intel OpenMP runtime library from Intel Composer XE 2015 Update 3 or higher, helps you understand how an application utilizes available CPUs and identifies causes of CPU under-utilization.

For a detailed description and result interpretation instructions please refer to the OpenMP Analysis and Interpreting OpenMP* Analysis Data product help topics.

Easier MPI+OpenMP Multi-Rank Analysis

Identify ranks with low MPI communication spin time to have the highest impact when tuning OpenMP performance.

For MPI analysis results including more than one process with OpenMP regions, the Summary window shows a section with the top processes lying on the critical path of MPI execution, with Serial Time and OpenMP Potential Gain aggregated per process.

For a detailed description and result interpretation instructions please refer to the MPI Analysis product help topic.

Memory Access Analysis & Better Bandwidth Analysis for Non-Uniform Memory

The new Memory Access analysis helps to identify memory-related issues, like NUMA problems and bandwidth-limited accesses.  Memory Access analysis replaced the former Bandwidth Analysis.

For Linux targets based on Sandy Bridge or later processors, Memory Access analysis can be configured to attribute performance events to memory objects (data structures). VTune Amplifier uses instrumentation of memory allocations/de-allocations and symbol information for static/global variables to map memory access addresses to application memory objects.

For a detailed description and result interpretation instructions, please refer to the Memory Access Analysis and Interpreting Memory Usage Data product help topics.

GPU Analysis Improvements

For a detailed description and result interpretation instructions please refer to the GPU Analysis, Analyzing Applications Using Intel® HD Graphics and Interpreting GPU OpenCL™ Application Analysis Data product help topics.

Easier Profiling in Virtualized Environments

Event-based sampling (EBS) analysis within a virtual machine is available in the following environments virtualizing the on-chip Performance Monitoring Unit (PMU):

  • VMware* Fusion* 5 and higher supporting EBS analysis via the SEP kernel driver
  • KVM* supporting driverless EBS collection via Linux perf* tool starting with Linux Kernel 3.2 and QEMU 1.4
  • XEN* 4.0 and higher supporting driverless EBS collection via Linux perf tool for User Domain (domU)

VTune Amplifier installation detects virtual environments and disables sampling drivers installation to avoid system conflicts.  Hardware Event-based Sampling analyses use the driver-less collection mode via the Linux perf tool.

There are two separate applications for this support: profiling the guest operating system for system developers and profiling user applications within the guest operating system.

Microsoft Windows* 10 Support

All analysis types are supported, validated on Windows 10 Client OS Build #10240

Known limitations:

  • Built-in Hyper-V can be enabled by default on some systems with Windows 10 installed. Hyper-V doesn’t allow PMU EBS analysis for other tools. If you need to perform hardware analysis, follow the troubleshooting instructions in the VTune Amplifier product help to disable Hyper-V.
  • Profiling of Windows Store applications is not supported
  • Windows* 10 Mobile Edition is not supported

2015

Update 4

Linux* Release Notes | Windows* Release Notes

  • Support for Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.5
  • Support for Intel® Atom™ x7 Z8700 & x5 Z8500/X8400 processor series (codename Cherry Trail) including GPU analysis
  • MPI rank ID embedded to an MPI process name to better distinguish multiple ranks for MPI analysis
  • Support for the __itt_detach API to detach collection from all processes
  • Attach to process target type supported for user-mode sampling and tracing analysis of Java* applications on Linux OS
  • Information on packet submission available on the Timeline for Intel® Media SDK applications running on Linux systems with Intel® HD Graphics
  • Microsoft* Visual Studio* 2015 IDE integration
Update 3

Linux* Release Notes | Windows* Release Notes

Note: The Release Notes included in the product installation mistakenly state that the "Tasks and Frames" tab has been replaced.

All Operating Systems

OpenMP Enhancements

Potential Gain expansion by parallelization inefficiencies representing their wall time cost

CPU time-based classification of Spin and Overhead time in OpenMP runtime does not reveal the elapsed-time impact of a parallel region inefficiency because it depends on the number of working threads. VTune Amplifier’s new per-OpenMP region metrics that are based on CPU time are now normalized by the number of threads in the region and represented as an expansion of the “potential gain” metric.

Besides reporting the potential gain metric in absolute elapsed time, the VTune Amplifier display breaks down the impact of various issues by percentage of total application elapsed time.
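The normalization described above can be sketched in a few lines: CPU time lost to each inefficiency is divided by the number of threads in the region to estimate its elapsed-time cost, which is then expressed as a percentage of the total application elapsed time. The category names, values, and function names below are hypothetical and serve only to illustrate the arithmetic.

```python
def potential_gain(inefficiency_cpu_time, threads_in_region):
    """Normalize per-inefficiency CPU time by the region's thread count."""
    if threads_in_region <= 0:
        raise ValueError("a region needs at least one thread")
    return {
        name: cpu_time / threads_in_region
        for name, cpu_time in inefficiency_cpu_time.items()
    }


def share_of_elapsed(gain, app_elapsed):
    """Express each potential gain as a percentage of application elapsed time."""
    return {name: 100.0 * value / app_elapsed for name, value in gain.items()}


# Hypothetical CPU seconds summed over the 4 threads of one region.
gain = potential_gain({"imbalance": 8.0, "lock contention": 2.0}, 4)
print(gain)                          # {'imbalance': 2.0, 'lock contention': 0.5}
print(share_of_elapsed(gain, 20.0))  # {'imbalance': 10.0, 'lock contention': 2.5}
```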

Precise trace-based imbalance calculation that is especially useful for profiling of small region instances

Imbalance of working threads on barriers is a major performance issue that prevents efficient CPU utilization by OpenMP applications. VTune Amplifier’s sampling method may miss certain situations of imbalance, for example when:

  • region instances are shorter than the sampling interval,
  • the number of parallel region instances is insufficient to get statistically correct results, or
  • threads enter a passive wait on a barrier and don’t consume CPU time on a busy wait (e.g. for KMP_BLOCKTIME=0).

To avoid these situations, the Intel OpenMP runtime library from Parallel Studio XE 2016 Beta reports to VTune Amplifier the precise imbalance time. This additional information from the OpenMP runtime does not add overhead since the reporting is done on a per-barrier basis. The precise imbalance metrics are displayed when the OpenMP Potential Gain metric is expanded.

Detailed analysis by barrier-to-barrier region segments to explore performance of OpenMP work-sharing constructs and barrier cost inside a region

When an OpenMP region contains multiple constructs with barriers (e.g., loops with implicit barriers, a ‘single’ construct, or a user barrier), it is useful to distribute inefficiency metrics by barrier-to-barrier segments. For example, a region may consist of four barrier-to-barrier segments.

The Intel OpenMP runtime from Intel Composer XE 2015 Update 3 (or higher) instruments barriers for VTune Amplifier to enhance its inefficiency metrics. The barrier type is added to the segment name – loop, single, reduction, etc. The runtime also emits additional information for parallel loops with implicit barriers, such as loop scheduling and chunk size, that is useful in understanding imbalance or the nature of the scheduling overhead. Use the /Barrier-to-Barrier Segment grouping to view the statistical distribution by barrier-to-barrier segments.

Please note that the same lexical loop constructs with different schedule types or chunk sizes will be displayed separately in different rows. For example, if one instance had a chunk size of 1000 and another had a chunk size of 1563, there would be two entries for the construct with the same name but different sizes in the OpenMP Loop Chunk column.
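The row separation described above amounts to grouping loop instances by a composite key of loop name, schedule type, and chunk size. The sketch below illustrates that grouping on hypothetical data; the loop name and record layout are made up for this example and are not a VTune Amplifier format.

```python
from collections import Counter


def rows_by_construct(instances):
    """Count loop instances per (loop name, schedule type, chunk size) row."""
    return Counter(
        (inst["loop"], inst["schedule"], inst["chunk"]) for inst in instances
    )


# Hypothetical instances of one lexical loop; the loop name is invented.
instances = [
    {"loop": "compute_loop", "schedule": "dynamic", "chunk": 1000},
    {"loop": "compute_loop", "schedule": "dynamic", "chunk": 1563},
    {"loop": "compute_loop", "schedule": "dynamic", "chunk": 1000},
]
for row, count in sorted(rows_by_construct(instances).items()):
    print(row, count)
```

With this input the same loop name appears in two rows, one per chunk size, which mirrors the two entries described above.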

Barrier-to-Barrier Segments are also available on the timeline.


Intel® MPI and OpenMP Multi-rank Analysis on a Compute Node

Per-rank Intel MPI communication busy wait time detection and showing the metric in summary, grid and timeline view

For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. VTune Amplifier recognizes samples in Intel MPI communication busy wait functions and shows metrics based on that information. For multi-rank OpenMP results, VTune Amplifier’s Summary view is enriched with a table of Top MPI ranks with OpenMP metrics sorted by MPI Communication Spin time from low to high values. The lower the Communication time the more the rank was executing (vs. spinning) and the more impact OpenMP tuning will have on the application elapsed time.

Process names are hyperlinked to the Bottom-up view with ‘/Process /OpenMP Region/ …’ groupings to get details of the OpenMP metrics aggregated per-process, with the ability to expand the results by Regions and Barrier-to-Barrier Segments.

MPI Communication Spin time is highlighted on the timeline.

Intel MPI selective rank profiling configuration option, including EBS analysis for multiple ranks on a node

To simplify selective rank profiling configuration for VTune Amplifier analysis of MPI applications, Intel MPI introduced the ‘-gtool’ option in version 5.0.2. The option syntax is:

$ mpirun -genvall -gtool "amplxe-cl -r <result_dir> -collect <analysis_type>:<rank_set>[=exclusive]" -n <N> <my_app> [my_app_options]

where <rank_set> specifies the rank range to be included in the VTune Amplifier analysis. Separate ranks with a comma or use the “-” symbol for a set of contiguous ranks. Use the ‘all’ value to configure profiling on all the ranks. Exclusive launch mode helps prevent running more than one collection per node, which is a limitation of EBS profiling.

Starting with Intel MPI version 5.0.3, the ‘node-wide’ clause can be used instead of ‘exclusive’ to collect on all ranks of the nodes on which the <rank_set> resides, or on all nodes in the case of ‘all’ ranks. In this case, VTune Amplifier creates a result directory per node, with the host name as a suffix of the result directory name. This is particularly convenient for EBS collection, where there are limitations on simultaneous profiling by multiple VTune Amplifier command lines.

Below is an example of node level profiling:

$ mpirun -gtool "amplxe-cl -c advanced-hotspots -r my_dir:all=node-wide" -n 4 -ppn 2 my_mpi_app
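The rank selection rules described above (comma-separated ranks, “-” for contiguous ranks, and ‘all’) can be pictured with a small parser. This is only a sketch of the described syntax under the assumption that ranks are numbered from 0; it is not the parser used by Intel MPI, and the function name is hypothetical.

```python
def parse_rank_set(spec, total_ranks):
    """Expand a rank set such as '0,2-4,7' or 'all' into a sorted rank list."""
    if spec == "all":
        return list(range(total_ranks))
    ranks = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part}")
            ranks.update(range(first, last + 1))
        else:
            ranks.add(int(part))
    out_of_range = [r for r in ranks if r < 0 or r >= total_ranks]
    if out_of_range:
        raise ValueError(f"ranks outside 0..{total_ranks - 1}: {out_of_range}")
    return sorted(ranks)


print(parse_rank_set("0,2-4,7", total_ranks=8))  # [0, 2, 3, 4, 7]
print(parse_rank_set("all", total_ranks=4))      # [0, 1, 2, 3]
```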

VTune Amplifier XE command line generation for selective rank profiling through Intel® Trace Analyzer and Collector (ITAC) user interface

With Intel Trace Analyzer and Collector 9.0.2 and later, you can generate VTune Amplifier hotspot analysis command lines for ranks selected in the ITAC graphical user interface, either from the “Event Timeline” chart or from the ‘Function Profile/Load Balance’ grid view. You can also copy the generated command line for the most CPU-bound process from the ITAC summary page.

User Interface Enhancements

General Exploration analysis with confidence indication

Some of the metrics in VTune Amplifier views may now be marked as unreliable by greying out the values in the following views: Summary, Bottom-up, and Source. This can happen when the amount of collected event samples is too low to reliably calculate the metric.

Currently it is used for EBS metrics on General Exploration analysis but it may be extended to more metrics in the future if the feedback is favorable.
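The confidence indication boils down to a check on the number of samples behind a metric. The sketch below illustrates that idea only; the cutoff of 100 samples and the function name are hypothetical, since the actual threshold used by VTune Amplifier is not stated here.

```python
MIN_SAMPLES = 100  # hypothetical cutoff, not a documented VTune Amplifier value


def annotate_metric(value, samples, min_samples=MIN_SAMPLES):
    """Flag a metric as greyed out when too few samples back it."""
    return {"value": value, "greyed_out": samples < min_samples}


print(annotate_metric(0.42, samples=12))    # {'value': 0.42, 'greyed_out': True}
print(annotate_metric(0.42, samples=5000))  # {'value': 0.42, 'greyed_out': False}
```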

Timeline “Super Tiny” bird’s-eye view

When you analyze core utilization on modern servers and many-core coprocessor cards with a large number of ranks/threads, a bird’s-eye view of the timeline helps you recognize application phases and behavioral patterns before zooming and filtering the data. VTune Amplifier’s “Super Tiny” view shows all application threads at once, using pixel color intensity to reflect the Efficient, Spin & Overhead, and MPI Communication Time metrics. With hierarchical timeline grouping, “Super Tiny” shows only the leaves, grouped according to the grouping hierarchy.

Access the new view in the timeline context menu.

New Filtering Mode for Command Line Reports

To display only particular columns providing metrics/event data, use the -column option and specify a full name of the required column(s) or its substring.

Examples:

  • Show grouping and data columns only for event columns with the *INST_RETIRED.* string in the title:
$ amplxe-cl -R hw-events -r r000ah --column=INST_RETIRED.
  • Show grouping and data columns only for columns with the Idle and Spin strings in the title:
$ amplxe-cl -R hotspots -r r001hs --column=Idle,Spin
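
The substring matching used by these examples can be pictured with a short Python sketch. It is not VTune Amplifier's implementation: the function name `filter_columns` and the sample column titles are assumptions made for illustration, and plain case-sensitive substring matching is assumed.

```python
def filter_columns(titles, patterns):
    """Keep the titles that contain at least one of the given substrings."""
    return [t for t in titles if any(p in t for p in patterns)]


# Hypothetical column titles, invented for this sketch.
columns = [
    "Hardware Event Count:INST_RETIRED.ANY",
    "Hardware Event Count:CPU_CLK_UNHALTED.THREAD",
    "CPU Time:Idle",
    "CPU Time:Spin Time",
]

# Mirrors --column=INST_RETIRED. (one substring)
inst_only = filter_columns(columns, ["INST_RETIRED."])

# Mirrors --column=Idle,Spin (comma-separated substrings)
idle_spin = filter_columns(columns, "Idle,Spin".split(","))
```

A title is kept as soon as any one of the listed substrings matches, which is why `Idle,Spin` selects both the Idle and the Spin columns.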
Update 2

Linux* Release NotesWindows* Release Notes

All Operating Systems

TSX Hotspots analysis providing precise clockticks data for Intel® Transactional Synchronization Extensions (Intel® TSX) on the Intel microarchitecture code name Haswell

TSX Hotspots analysis type uses event-based sampling collection and is targeted for the Intel® microarchitecture code name Haswell with Intel® Transactional Synchronization Extensions (Intel® TSX).

Due to PMU limitations, Advanced Hotspots cannot be collected inside an Intel® TSX transaction. Thus, the new “TSX Hotspots” analysis type has been added to help identify performance-critical program units inside transactions.

To launch the analysis, select Microarchitecture Analysis > CPU Specific Analysis > Haswell Analysis > TSX Hotspots in the analysis tree, or select the Collect precise clockticks option for the custom event-based sampling configuration. The collected result opens in the default TSX Exploration viewpoint.

For more details please refer to the “TSX Hotspots Analysis” topic in the product help.

Bandwidth analysis improvements

  • Regular DDR memory bandwidth analysis for the 5th Generation Intel® Core™ processors (code name: Broadwell) and Intel microarchitecture code name Silvermont
  • Intel® QuickPath Interconnect (Intel® QPI) “QPI Bandwidth” data analysis with grouping by package/Intel® QPI link for server platforms (codenamed Ivy Town & Haswell-EP)
  • Total, Read and Write Bandwidth timeline areas merged into a single area, making it easier to see all bandwidth activity
  • Grouping by package for the CPU Time timeline area

GPU Architecture Diagram

On Windows* systems with Intel HD Graphics you may find it easier to analyze your OpenCL application by exploring the GPU hardware metrics per GPU architecture blocks.

To do this, choose the Computing Task grouping level in the Graphics window, select an OpenCL kernel of interest and click the Architecture Diagram tab in the Timeline pane. VTune Amplifier updates the architecture diagram for your platform with performance data per GPU hardware metrics for the time range in which the selected kernel was executed.

GPU analysis on Linux

GPU analysis on Linux* targets is now available in VTune Amplifier XE, including:

  • Support for the OpenCL application analysis (for Intel HD Graphics) and GPU usage analysis (except for the GPU hardware metrics)
    Refer to the “GPU Analysis”, “GPU Usage” and “Interpreting GPU OpenCL™ Application Analysis Data” help topics for details on analysis configuration and results interpretation.
  • Intel® Media SDK program analysis for Linux systems with Intel HD Graphics.

    To perform analysis of Intel® Media SDK tasks execution over time, make sure to configure your Linux kernel according to the “Intel® Media SDK Program Analysis Configuration” topic in the VTune Amplifier help.

    Select the Trace OpenCL and Intel Media SDK programs (Intel HD Graphics only) option in one of the Algorithm or Custom analysis types.
    To analyze Intel Media SDK tasks, focus on the Timeline pane.

    If you also enable the Analyze GPU usage option for the collection, use the Graphics window to correlate data for the Intel Media SDK tasks execution with the GPU software queue data.
  • Compute extended counter set support added for GPU hardware metrics analysis on the 5th Generation Intel® Core™ processors (code name: Broadwell).
  • The Global/local accesses hardware event set for GPU analysis has been renamed Compute basic (with global/local memory accesses) to better represent the collected data. See the description in the "GPU Metrics" topic of the product help for detailed metrics.
Update 1

Linux* Release NotesWindows* Release Notes

All Operating Systems

Support for the Intel® Xeon® processor E5 v3 family based on the Intel microarchitecture code name Haswell-E, including General Exploration, Bandwidth and TSX Exploration analysis

Also, a performance tuning guide for the v3 family is available at the Tuning Guides and Performance Analysis Papers web page.

NOTE: TSX exploration analysis is supported only for Intel processors with the Intel® Transactional Synchronization Extensions (Intel® TSX) Exploration feature enabled.

Support for the Intel microarchitecture code name Broadwell, including General Exploration analysis

OpenMP* analysis classifications of Spin and Overhead Time

The Bottom-up view now classifies inefficiencies by presenting a breakdown of Spin and Overhead CPU time spent in the region.

Click the expand icon to expand the Spin and Overhead Time column and get more details about the reasons for high Spin or Overhead Time values:

  • Spin Time reasons for OpenMP regions:
    • Imbalance or Serial Spinning is CPU time when OpenMP working threads are spinning on a synchronization barrier consuming CPU resources
    • Lock Contention is CPU time when OpenMP working threads are spinning on a lock consuming CPU resources
  • Overhead Time reasons for OpenMP regions:
    • Creation is CPU time that an OpenMP runtime library spends on organizing parallel work
    • Scheduling is CPU time that an OpenMP runtime library spends on work assignment for threads
    • Reduction is CPU time that an OpenMP* runtime library spends on loop or region reduction operations

For more details, please refer to the OpenMP* analysis topic in the product help.

Drill down to source from OpenMP analysis when using the /OpenMP Region/.. grouping

To analyze the source of a performance-critical OpenMP parallel region, double-click the region identifier in the grid, sorted by the OpenMP Region/.. grouping level.

VTune Amplifier opens the source view at the beginning of the selected OpenMP region in the pseudo function created by the Intel compiler.

Serial Time metric recalculated when timeline filtering applied

To analyze serially executed code (outside of any parallel region), switch to the Bottom-up window, select the /OpenMP Region/Thread/Function grouping, and examine the [Serial - outside any region] row.

Apply a time filter on Timeline to recalculate Serial Time for the selected time interval.

Initial Release

Linux* Release NotesWindows* Release Notes

All Operating Systems

Windows* Operating Systems

Linux Operating Systems

Enhanced OpenMP* region analysis on Intel® Xeon® and Xeon Phi® systems

With enhanced OpenMP* region analysis, identify common performance bottlenecks, such as load imbalance, granularity issues or synchronization issues. See serial and parallel times for your application and potential tuning gains for parallel regions. For more details refer to the “OpenMP* Analysis” topic in the product help.

Example of new OpenMP* support

Second example of OpenMP* support

Easier data collection on Intel® Xeon Phi™ coprocessors

Collecting data on Intel® Xeon Phi™ coprocessors is easier than ever with an improved analysis workflow via the new target system configuration options. Call stack collection is also now supported for Intel Xeon Phi coprocessors. ITT API collection (including OpenMP* analysis) now works out of the box on the Intel Xeon Phi coprocessor, without the need to set any environment variables, for both native and offload applications. For more details, refer to the “Intel Xeon Phi Coprocessor Analysis Workflow” topic in the product help.

Easier to use General Exploration and Bandwidth Analysis

Stop worrying about which microarchitecture you’re profiling and use the new General Exploration and Bandwidth analysis types, enabling you to use the same command line on any supported system! For more details, please refer to the “About Performance Analysis with VTune Amplifier” topic in the product help.

The hardware event-based sampling analysis tree has been re-structured to introduce cross-CPU basic configurations and separate advanced CPU-specific analysis configurations. General Exploration and Bandwidth analysis types are shared between all supported CPUs. All tuning opportunities are covered by the General Exploration analysis type for newer processor families, e.g., Ivy Bridge and beyond. Review the Tuning Guides to take full advantage of the General Exploration analysis type. CPU specific analysis types, when available, are expanded automatically according to the detected processor type for older processor families (see note below).

NOTE: The Ivy Bridge family of processors no longer has separate advanced analysis types, only General Exploration and Bandwidth. The Sandy Bridge advanced analysis types that used to be available for Ivy Bridge did not work on Ivy Bridge processors because of hardware incompatibilities and the metrics of interest are now included in the General Exploration analysis type. Also, the Haswell processor family does not have separate advanced analysis types. Again, use the General Exploration metrics and the Haswell tuning guide.

Custom Groupings

Many new ways to group and order the performance data, including custom groupings in grid views and new groupings in the timeline pane.
To see how to create a custom grouping please refer to the "Grouping Data" and "Dialog Box: Custom Grouping" topics in the product help.

Use the Timeline grouping menu to group the data by program units. A grouping level depends on the analysis type. For more details, please refer to “Managing Timeline View” in the product help.

Enhanced navigation in the clickable Summary pane

Hyperlinks open the Bottom-up view sorted by the selected metric or directly to the selected function or OpenMP region.

Easier remote collection

Use the graphical interface running on a Windows* or Linux* host system to collect data on a remote Linux* system via SSH. Configure remote collection via the “remote Linux (SSH)” Target system configuration option in the Project Properties dialog:

NOTE:

  1. ssh/scp or plink/pscp tools must be available in the PATH
  2. When collecting data remotely, VTune Amplifier XE looks for the compatible collector on the remote system in the default install location: /opt/intel/vtune_amplifier_xe_. It also temporarily stores performance results on the target system in the /tmp directory. If you installed VTune Amplifier XE to a different location on the target and need to specify another temporary directory, use the appropriate configuration options in the Project Properties/Target tab in the GUI, or the collection knobs -target-install-dir and -target-tmp-dir in the command line.
  3. If your target application requires a custom working directory or user-defined environment variables, you can specify them via a launching script and use the script as the application to launch.
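
Before configuring remote collection, the first requirement in the note can be checked on the host with a few lines of Python. This is only a sketch: the helper name `find_remote_tools` is an assumption, and it relies on the standard `shutil.which` lookup to decide whether a tool is on the PATH.

```python
import shutil


def find_remote_tools(which=shutil.which):
    """Return the first complete tool pair found on the PATH, or None.

    `which` is injectable so the check can be exercised without the tools
    being installed; by default it is the standard shutil.which lookup.
    """
    for pair in (("ssh", "scp"), ("plink", "pscp")):
        if all(which(tool) for tool in pair):
            return pair
    return None


tools = find_remote_tools()
if tools is None:
    print("Neither ssh/scp nor plink/pscp was found on the PATH")
else:
    print("Remote collection can use:", ", ".join(tools))
```

Both tools of a pair must be present, since one is used to run commands on the target and the other to copy files.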

For more details please refer to the "Collecting Data Remotely from the VTune Amplifier GUI" topic in the product help.

Analyze Linux* or Windows* profiling data on your OS X* host

Use a Mac* computer as your main system? Now you can host the VTune Amplifier GUI on Mac computers running OS X to view remotely collected results, including the ability to configure and launch remote collection to supported Linux systems.

Once you have registered your Windows or Linux product, an OS X viewer is available for download without additional cost (see below). It will use your existing Windows or Linux license. Note: performance profiling on Mac computers is not available.

The VTune Amplifier XE viewer for OS X is available as a separate download in the Intel Software Development Products Registration Center, e.g.:

After clicking on the "Version 2015" in the right column, you will see the following. Click on the .dmg file to download it, or use the download manager.

After downloading the vtune_amplifier_xe_2015.dmg file, follow these steps to install the software:

  • Install instructions
    • Open up permissions to "/Users/Shared/Library/Application Support" to allow the installation of the license file.
    • Start the 'Finder' application on your OS X* system.
    • Find the file 'vtune_amplifier_xe_2015.dmg'
    • Open/Click on the .dmg file to mount the disk-image.
    • In the newly opened window, double-click the 'vtune_amplifier_xe_2015.mpkg' item to start installation.
    • Respond to the installation procedure/wizard specifying license/registration type.
    • All GUI applications use the 'Applications' folder as their destination. As a result of a successful installation, 'VTune Amplifier XE 2015' should be created in 'Applications' folder.
    • You may start VTune Amplifier XE 2015 by double-clicking on it in the 'Applications' folder.
  • Un-install instructions
    • Ensure that the 'VTune Amplifier XE 2015' application is closed.
    • Open the 'Finder' application
    • Drag the 'VTune Amplifier XE 2015' application from the 'Applications' directory (or other install directory) and drop it in the 'Trash' on the desktop.

Reduce overhead by limiting stack depth

Reduce collection overhead for custom event-based sampling analysis types using the new option to limit call stack depth (in system pages). Use the '-stack-depth' collector knob in the command line and the corresponding GUI control "Stack size" in the Custom Analysis dialog for the hardware-based sampling.

Import externally collected data

Extend your analysis by importing externally collected data into existing results. VTune Amplifier provides the ability to correlate interval or discrete data, provided by an external collector, with the regular data collected by the profiler. To learn more, refer to the “Adding External Data to the Intel® VTune™ Amplifier” topic in the product help.

You can extend standard VTune Amplifier performance analysis and launch a custom data collector directly from VTune Amplifier. Your custom collector can be the application you analyze with VTune Amplifier or a collector that VTune Amplifier can launch. Learn more about configuring and launching a custom collector from the GUI and the command line in the “Using a Custom Collector” help topic.

> amplxe-cl -collect hotspots -knob custom-collector="python.exe C:\work\custom_collector.py" -- notepad.exe 

VTune Amplifier can process and integrate performance statistics collected externally with a custom collector or with your target application in parallel with the native VTune Amplifier analysis. To achieve this, provide the collected custom data as a csv file with a predefined structure and save this file to the VTune Amplifier result directory.
VTune Amplifier can load and process the following data types:

  • Interval data with start time and end time
  • Samples with a set of counters

To make VTune Amplifier interpret the custom statistics from the csv file, make sure the file format meets the requirements specified in the “Creating a CSV File with External Data” help topic.

Intel® Transactional Synchronization Extensions (Intel® TSX) Exploration analysis

Use the TSX Exploration analysis for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel® TSX, see Web resources about Intel® Transactional Synchronization Extensions.

NOTE: The analysis is supported only for Intel processors with the Intel® TSX feature enabled. Due to recently published errata, systems may have this feature disabled by default.

The tuning process consists of 2 steps:

  1. Measuring transactional success
    The first step is to measure the transactional success in an application.
    Select 'TSX Exploration' analysis type and choose ‘1. Transactional success’ from the ‘Analysis Step’ combo box, as shown below:

    Three metrics are collected:
    • Clockticks – total number of unhalted cycles collected
    • Transactional Cycles – number of cycles spent during transactions. If it is near zero then the application is either not using lock-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.
    • Abort Cycles - number of cycles spent during transactions which were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (but not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. The next step would be to identify the causes for transactional aborts and reduce them, which leads us to the next step.
  2. Sampling transactional aborts
    Select the 'TSX Exploration' analysis type and choose ‘2. Aborts’ option from the ‘Analysis Step’ combo box, as shown below:

    As a result of this analysis, you’ll see where the transaction aborts are happening and for what reason. Possible reasons include:
    • Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort in the implementation.
    • Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.
    • Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.
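
The way the three step-1 metrics are read against each other can be sketched in Python. The function name `classify_tsx` and all three thresholds are assumptions chosen for illustration; VTune Amplifier does not define these cut-off values.

```python
def classify_tsx(clockticks, transactional_cycles, abort_cycles,
                 near_zero=0.01, low_abort=0.10, high_abort=0.90):
    """Interpret the step-1 metrics. All thresholds are illustrative assumptions."""
    if clockticks <= 0:
        raise ValueError("clockticks must be positive")
    # Transactional Cycles near zero: no transactional execution to tune.
    if transactional_cycles == 0 or transactional_cycles / clockticks < near_zero:
        return "no transactional execution"
    abort_ratio = abort_cycles / transactional_cycles
    # Abort Cycles small relative to Transactional Cycles: success rate is high.
    if abort_ratio <= low_abort:
        return "high success rate"
    # Abort Cycles almost equal to Transactional Cycles: most regions abort.
    if abort_ratio >= high_abort:
        return "mostly aborting"
    return "partially aborting"


print(classify_tsx(1_000_000, 400_000, 390_000))
```

A result of "mostly aborting" or "partially aborting" is the case in which the second step, sampling transactional aborts, becomes worthwhile.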

OpenCL™ Software Technology Kernel Analysis

OpenCL software technology kernel analysis just got better with metrics for memory transfers and visualization of APIs, computing queues and SIMD widths.

If your application uses OpenCL software technology and is doing substantial computational work on the GPU, capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics by enabling the 'Trace OpenCL kernels on Processor Graphics' option during analysis configuration. To view information about all OpenCL kernels running on the GPU, in the Graphics tab of the analysis results switch the grouping to 'Computing Task Purpose / Computing Task (GPU) / Instance'. VTune Amplifier identifies the following computing task purposes:

  • Compute (kernels)
  • Transfer (OpenCL routines responsible for transferring data from the host to a GPU)
  • Synchronization (for example, clEnqueueBarrierWithWaitList)

The corresponding columns show the overall time a kernel ran on the GPU and the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), working group sizes, as well as averaged GPU hardware metrics collected for a kernel. The cell is highlighted (pink) when there is a potential tuning opportunity. Hover over the cell to read the issue description.

To view details on OpenCL kernels submission, in particular distinguish the order of submission and execution, and analyze the time spent in the queue, zoom in and explore the Computing Queue data in the Timeline pane. You can click a kernel task to highlight the whole queue to the execution displayed at the top layer:

Synchronization tasks are marked with vertical hatching. Data transfers are marked with cross-diagonal hatching. For more details please refer to the “Analyzing Applications Using Intel® HD Graphics” and “Interpreting GPU OpenCL™ Application Analysis Data” topics in the product help.

Auto-driver rebuild

Did you update your Linux kernel and now the sampling driver won’t load? No worries! With the new auto-rebuild feature, the sampling driver detects the kernel update and automatically attempts to rebuild and load the driver.

Starting with this release, if the boot scripts have been installed so that the sampling drivers are automatically loaded during boot time, the boot scripts will check for a change in the kernel and automatically rebuild the driver, at boot time. If successfully rebuilt, new drivers will be loaded so that samples can be collected with the updated kernel. Make sure to update the kernel sources when updating the running kernel for this feature to work.

Driver-less Event-Based Sampling collection

Can’t install the Intel event-based sampling driver on Linux because IT won’t let you have root access? Advanced analysis is available even if you can’t install the Intel event-based sampling driver.

Driver-less event-based sampling is supported for the Advanced Hotspots, General Exploration and Custom analysis types on Linux* operating systems based on kernel 2.6.32 or higher, which exports CPU PMU programming details over /sys/bus/event_source/devices/cpu/format file system. This driver-less sampling collection mode is based on the Linux perf* functionality. VTune Amplifier automatically enables the driver-less collection if the Intel event-based sampling driver cannot be installed during product installation.
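
Whether a given Linux system exposes the PMU programming details mentioned above can be checked with a short Python sketch. The helper name `perf_pmu_format_available` is an assumption, and the check only looks for a non-empty directory at that path; it does not reproduce the checks VTune Amplifier itself performs.

```python
import os

# Path quoted in the text above.
PERF_FORMAT_DIR = "/sys/bus/event_source/devices/cpu/format"


def perf_pmu_format_available(path=PERF_FORMAT_DIR):
    """Return True if `path` is a directory that contains at least one entry."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


if perf_pmu_format_available():
    print("Kernel exports CPU PMU format details; perf-based collection is possible")
else:
    print("CPU PMU format details not found at", PERF_FORMAT_DIR)
```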

NOTE: The Intel event-based sampling driver provides additional features not available in perf, such as:

  • Stacks
  • Uncore events
  • Multiple precise events
  • New events for the latest processors, even on older OSes

NMI Watchdog timer automatically disabled during EBS data collection

The Non Maskable Interrupt (NMI) watchdog timer causes incorrect results in the PMU event-based sampling (EBS) analysis.
Before, VTune Amplifier XE refused to perform EBS collection if the nmi_watchdog was ON, and the user had to disable it manually.
Now the nmi_watchdog timer is disabled automatically for the EBS collection period. No more hassles turning it on and off. Profiling just works!
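
To see the current watchdog state yourself, the kernel setting can be read from `/proc/sys/kernel/nmi_watchdog`, which holds `1` when the watchdog is enabled and `0` when it is disabled. The Python sketch below only reads that value; the function names are assumptions, and it does not show how VTune Amplifier toggles the timer.

```python
NMI_WATCHDOG_PATH = "/proc/sys/kernel/nmi_watchdog"


def parse_nmi_watchdog(text):
    """Interpret the file content: '1' means enabled, anything else disabled."""
    return text.strip() == "1"


def nmi_watchdog_enabled(path=NMI_WATCHDOG_PATH):
    """Return True or False for the watchdog state, or None if it cannot be read."""
    try:
        with open(path) as f:
            return parse_nmi_watchdog(f.read())
    except OSError:
        return None


print("nmi_watchdog enabled:", nmi_watchdog_enabled())
```

Checking the value before and after a collection is a simple way to confirm that the timer was restored.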

Perf data visualization

Are you collecting event-based sampling data with the Linux ‘perf’ tool? Visualize it now in the VTune Amplifier GUI for enhanced analysis!

Run the perf collection with the predefined command line options:

  • For application analysis:
    > perf record -o <trace_file_name>.perf -e cpu-cycles,instructions <application>
  • For process analysis:
    > perf record -o <trace_file_name>.perf -e cpu-cycles,instructions -p <PID> sleep 15

where the -e option specifies a comma-separated list of events to collect.

Then import the *.perf file(s) into the VTune Amplifier project by using the Import option in GUI or command line.
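
As a worked illustration, the two perf command lines can be assembled in Python. The helper name `perf_record_cmd`, the trace name `run1`, the application `./my_app` and the process ID `1234` are all hypothetical. The sketch assumes that the application form ends with the command to launch and that `-p` is followed by the ID of the process to attach to, which is how `perf record` works in general.

```python
import shlex


def perf_record_cmd(trace_name, events, app_argv=None, pid=None, seconds=15):
    """Build a perf record argument list for launch mode or attach mode."""
    cmd = ["perf", "record", "-o", f"{trace_name}.perf", "-e", ",".join(events)]
    if pid is not None:
        # Attach mode: sample the running process while `sleep` runs.
        cmd += ["-p", str(pid), "sleep", str(seconds)]
    elif app_argv:
        # Launch mode: perf starts the application itself.
        cmd += list(app_argv)
    else:
        raise ValueError("provide either app_argv or pid")
    return cmd


events = ["cpu-cycles", "instructions"]
app_cmd = shlex.join(perf_record_cmd("run1", events, app_argv=["./my_app"]))
attach_cmd = shlex.join(perf_record_cmd("run1", events, pid=1234))
print(app_cmd)
print(attach_cmd)
```

Either command produces a `run1.perf` file that can then be imported into a project.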

Linux build-id feature support

VTune Amplifier automatically resolves symbols for modules with build-id and separate files with debug information.

2013

Update 17

Linux* Release NotesWindows* Release Notes

New for Update 17!

Improved OpenMP* region analysis

Common sources of OpenMP* overhead in an OpenMP program are serial time and load imbalance. OpenMP is a fork-join parallel model, which means that an OpenMP program starts with a single master thread executing serial code. Parallel regions cause the master thread to fork into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for synchronization. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code. In such a program, the time is spent waiting in the OpenMP runtime in two cases:

  • Serial time: When the master thread is executing a serial region, the slave threads in the OpenMP runtime are waiting for the next parallel region.
  • Load imbalance: When a thread finishes a parallel region, it waits in a barrier for the other threads to finish.

Intel® VTune™ Amplifier together with Intel Composer XE 2013 Update 2 or later helps you understand where an OpenMP program is serial and where it is imbalanced. It also provides a mechanism to correlate the time spent in the OpenMP runtime with the source code of the program. The OpenMP runtime library in the Intel Composer XE contains markers that can be used by the VTune Amplifier to break out the time in OpenMP by parallel region and serial code. The following paragraphs highlight the enhancements.

Summary pane: Use the OpenMP Region Duration histogram to analyze instances of each OpenMP region and explore the time distribution per instance. Identify Fast/Good/Slow region instances, then focus on the performance outlier instances in the Grid/Timeline views. The initial distribution of region instances into the Fast/Good/Slow categories is done as a ratio of 20/40/20 between the min and max region time values.

Region Duration Histogram

Bottom-up pane: Select the OpenMP Region grouping level and analyze CPU, Spin and Overhead time spent in OpenMP regions. High Spin time values signal a parallel region imbalance. As a potential solution, you may set dynamic scheduling to reduce the imbalance. High Overhead time values can result from too fine-grain parallel work with a high scheduling cost. In this case consider increasing the parallel work executed by a working thread, for example, defining the region for an outer loop.

OpenMP* Region Grouping
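
Why dynamic scheduling can reduce imbalance is easy to see in a small model. The Python sketch below is not OpenMP code: it simulates how loop iterations with uneven costs are distributed among threads under a static block split and under a dynamic policy in which each idle thread takes the next iteration. The function names and the iteration costs are assumptions made for illustration.

```python
import heapq


def static_loads(costs, n_threads):
    """Split iterations into contiguous blocks of equal count (static-like)."""
    chunk = -(-len(costs) // n_threads)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_threads)]


def dynamic_loads(costs, n_threads):
    """Give each next iteration to the thread that becomes idle first."""
    heap = [(0, t) for t in range(n_threads)]  # (busy time, thread id)
    for cost in costs:
        busy, t = heapq.heappop(heap)
        heapq.heappush(heap, (busy + cost, t))
    loads = [0] * n_threads
    for busy, t in heap:
        loads[t] = busy
    return loads


def barrier_wait(loads):
    """Total time threads spend waiting at the barrier for the slowest one."""
    return sum(max(loads) - load for load in loads)


costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical per-iteration costs
print(static_loads(costs, 2), barrier_wait(static_loads(costs, 2)))
print(dynamic_loads(costs, 2), barrier_wait(dynamic_loads(costs, 2)))
```

In this model the barrier wait plays the role of the Spin time reported for an imbalanced region. The model ignores the scheduling cost of the dynamic policy, which is the source of the Overhead time mentioned above when the work per iteration is too small.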

Top-down Tree pane: Explore the logical program flow of OpenMP regions. Call stacks of worker threads are properly joined with the corresponding fork point (OMP parallel for or OMP parallel directives) in the master thread so you can see the full control flow graph for a hotspot in worker threads.

Timeline pane: Explore markers on the Timeline ruler area corresponding to OpenMP region instance duration. Hover over a marker to see the details on the region instance executed at this particular moment of time or click the marker to select the region on the timeline and filter data by region time.

Added new analysis type “TSX Exploration” for 4th generation Intel® Core™ processors

With Intel® Core™ processors based on the Intel microarchitecture code name Haswell, use the special VTune Amplifier analysis type TSX Exploration for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel TSX, see Web Resources about Intel® Transactional Synchronization Extensions.

NOTE: You need to perform the analysis on Haswell processors without the "K" designator; e.g., the Intel® Core™ i7-4770K does not support Intel TSX.

The tuning process consists of 2 steps:

  1. Measuring transactional success
    The first step is to measure the transactional success in an application. Select TSX Exploration analysis type and choose 1. Transactional success from the Analysis Step combo box, as shown below:
    TSX Transactional Success
    Note that three metrics are collected:
    • Clockticks – total number of unhalted cycles collected
    • Transactional Cycles – number of cycles spent during transactions. If it is near zero then the application is either not using Intel TSX-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.
    • Abort Cycles - number of cycles spent during transactions which were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (but not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. The next step would be to identify the causes for transactional aborts and reduce them – see next step.
  2. Sampling transactional aborts
    Select the TSX Exploration analysis type and choose 2. Aborts option from the Analysis Step combo box, as shown below:
    TSX Abort Type Analysis
    As a result of this analysis, you’ll see where the transaction aborts are happening and for what reason. Possible reasons include:
    • Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort.
    • Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.
    • Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.
    Example of abort types analysis
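
The cache-line granularity behind the Data Conflict reason can be made concrete with a little address arithmetic. The Python sketch below assumes a 64-byte cache line; the helper names and the sample addresses are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size for this sketch


def cache_line(addr, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that contains byte address `addr`."""
    return addr // line_bytes


def may_conflict(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if two addresses share a line, so accesses can be seen as a conflict."""
    return cache_line(addr_a, line_bytes) == cache_line(addr_b, line_bytes)


def padded_offsets(n_items, item_bytes, line_bytes=CACHE_LINE_BYTES):
    """Offsets that place each item at the start of its own cache line."""
    stride = -(-item_bytes // line_bytes) * line_bytes  # round up to whole lines
    return [i * stride for i in range(n_items)]


# Two unrelated 8-byte counters 56 bytes apart still share one line.
print(may_conflict(0x1000, 0x1038))
# Padding each counter to a full line separates them.
print(padded_offsets(3, 8))
```

Placing independently updated data on separate lines, as `padded_offsets` does, is a common way to avoid conflicts between unrelated locations.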

"Multiplexing reliability" metric for General Exploration

A new multiplexing (MUX) reliability metric is now available for the General Exploration analysis type. Use this metric to know whether the data for your collection was statistically valid. Values close to 90% (i.e., 0.900) are desirable. Please see the documentation for more information on multiplexing events.

MUX Reliability Metric in Summary tab

An example of when multiplexing events can reduce precision is a short collection duration, so that a statistically relevant number of events is not counted during the collection period.

MUX Reliability Metric in Bottom Up View

In this case, either check the "Allow multiple runs" option in the Project Properties or increase the collection time, e.g., increase the workload of your application so that it runs longer.

Allow multiple runs in project properties

Extended Summary window with hyperlinks for Top Hotspots and performance metrics navigating to the Bottom-up grid view

The Summary Pane has been enriched with hyperlinks for Top Hotspots, performance metrics and General Exploration issues, which navigate a user to the Bottom-up grid view with the respective function item selected or column with the metric sorted.

Summary pane with links

Changed the default Call Stack Mode setting from “Only user functions” to “User functions + 1” for better understanding of library usage

Default settings for the Call Stack Mode drop-down menu on the filter bar have been changed to "User functions + 1".

With the former default Call Stack Mode, "Only user functions", customers were often surprised not to see library code in the results even though they were sure their application used MKL, IPP or some other library. VTune Amplifier usually treats such libraries as “system” code, and in this mode all system code is attributed back to the calling user function. Attributing everything to user functions created some confusion.

The User functions + 1 mode filters all system functions except those directly called from user functions, so a user can see which top function is hot and which function calls it.
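
The difference between the two modes can be sketched as a filter over a single call stack. This Python sketch is an illustration only, not VTune Amplifier's attribution logic: the function name `apply_call_stack_mode`, the mode strings and the frame names are assumptions, and each frame is simply tagged as user or system code.

```python
def apply_call_stack_mode(stack, mode):
    """Filter a call stack given as (name, is_user) pairs, caller first.

    "only-user": keep user frames only.
    "user+1":    also keep a system frame called directly from a user frame.
    """
    if mode not in ("only-user", "user+1"):
        raise ValueError("unknown mode: " + mode)
    kept = []
    caller_is_user = False
    for name, is_user in stack:
        if is_user or (mode == "user+1" and caller_is_user):
            kept.append(name)
        caller_is_user = is_user
    return kept


# Hypothetical stack: user code calling into a math library.
stack = [
    ("main", True),
    ("compute", True),
    ("cblas_dgemm", False),          # library entry point called by user code
    ("library_internal_kernel", False),
]

print(apply_call_stack_mode(stack, "only-user"))
print(apply_call_stack_mode(stack, "user+1"))
```

With "only-user" the time of the dropped frames lands on `compute`; with "user+1" it lands on `cblas_dgemm`, which makes the library usage visible.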

NOTE: The changes will only be visible for newly created VTune Amplifier projects or if you never changed the Call Stack mode in your existing project, otherwise the Call Stack mode will be inherited from the project properties.

New call stack mode default

Updated product toolbar

Updated the product toolbar providing quick access to the product documentation with the new Help button and to the Import dialog box (standalone only) with the Import Result button.

updated toolbar

Added remote system configuration options

The Target tab of the Project Properties has been enhanced to specify a path to the VTune Amplifier installed on the remote machine and a path to a remote temporary directory used for storing performance results.

When collecting data remotely, VTune Amplifier XE looks for the collectors on the target system in the default install location: /opt/intel/vtune_amplifier_xe_2013. It also temporarily stores performance results on the target system in the /tmp directory.

If you installed the VTune Amplifier XE to a different location on target and need to specify another temporary directory, use the appropriate configuration options in the Project Properties:Target tab in GUI or command line collection knobs -target-install-dir and -target-tmp-dir:

Automatically disabled NMI Watchdog (nmi_watchdog) timer on Linux* only during data collection

You no longer need to disable the NMI Watchdog Timer on Linux* to use the VTune Amplifier hardware-based sampling support! Now, VTune Amplifier will automatically turn it off during collection. One more thing that you don't have to ask your admin to do!

Previous releases of the VTune Amplifier XE refused to perform hardware-based collection if the Non-Maskable Interrupt (NMI) watchdog timer was enabled because it would cause incorrect results, so the user had to disable it manually.

Effective with the VTune Amplifier XE 2013 Update 17 release, the timer is automatically disabled only during the hardware-based collection period. It is automatically re-enabled after collection completes. A message to that effect is displayed in the collection log window.
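If you want to confirm the watchdog state before and after a collection, one possible check reads the standard Linux procfs entry. This is a minimal sketch, not part of the product; the function name nmi_watchdog_state is made up for the example.

```shell
# Report the NMI watchdog state from the kernel's procfs entry.
# /proc/sys/kernel/nmi_watchdog contains 1 when enabled and 0 when disabled.
# An alternative file can be passed as the first argument.
nmi_watchdog_state() {
    f="${1:-/proc/sys/kernel/nmi_watchdog}"
    if [ ! -r "$f" ]; then
        echo "unknown"
        return
    fi
    case "$(cat "$f")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

echo "NMI watchdog: $(nmi_watchdog_state)"
```

Running it once before collection and once after collection completes should report the same state if the timer was re-enabled as described.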

Update 16

Linux* Release NotesWindows* Release Notes

New for Update 16!

Graphical interface for remote data collection on Linux* systems via SSH

You may use the VTune Amplifier XE graphical interface running on a Windows* or Linux* host system to collect data on a remote Linux* system via SSH. To configure remote collection:

  1. Go to the Project Properties dialog Target tab
  2. Select the remote Linux (SSH) from the Target system drop-down menu
  3. In the SSH details field, enter the username and hostname for your remote Linux system in username@hostname format
  4. Select your profiling target from the Target type drop-down menu. You may select any type of profiling target: application, process, or system analysis
  5. Configure other Project properties if required and click OK to save your settings and close the Project Properties dialog box

Start a New Analysis

NOTE:

  1. ssh/scp or plink/pscp tools must be available in the PATH
  2. When collecting data remotely, VTune Amplifier XE looks for the compatible collector on the remote system in the default install location: /opt/intel/vtune_amplifier_xe_. It also temporarily stores performance results on the target system in the /tmp directory. If you installed the VTune Amplifier XE on the remote system to a different location or need to specify another temporary directory, set the following environment variables on the host before starting amplxe-gui:
    AMPLXE_TARGET_PRODUCT_DIR=
    AMPLXE_TARGET_TMP_DIR=
  3. If your target application requires custom working directory or user-defined environment variables you can specify them via a launching script and use the script as an application to launch.
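The launching script mentioned in the last note could be sketched as follows. The script name, directory, variables, and binary are all hypothetical placeholders; the snippet writes the script to a file so that you can copy it to the target and specify it as the application to launch.

```shell
# Write a hypothetical launch script; every name in it is a placeholder.
cat > run_myapp.sh <<'EOF'
#!/bin/sh
# Custom working directory and binary (overridable through the environment)
APP_DIR="${APP_DIR:-/home/user/myApp_dir}"
APP_BIN="${APP_BIN:-./myApp}"

# User-defined environment variables required by the application
export OMP_NUM_THREADS=4
export MYAPP_CONFIG="$APP_DIR/config.ini"

# Switch to the working directory, then replace the shell with the
# application so that the profiled process is the application itself
cd "$APP_DIR" || exit 1
exec "$APP_BIN" "$@"
EOF
chmod +x run_myapp.sh
```

Using exec keeps the process tree flat; without it, the shell would stay alive as the parent of the application.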

For more details please refer to the "Collecting Data Remotely from the VTune Amplifier GUI" topic in the product help.

Simplified User API collection setting for native analysis on the Intel® Xeon Phi™ coprocessor

Update 16 simplifies setting up ITT API collection for native analysis on the Intel® Xeon Phi™ coprocessor. If you chose the default installation options, with the libittnotify library installed to the coprocessor (/usr/lib64/libittnotify.so exists on your card), set the KMP_FOR_TPROFILE=1 environment variable for the application you launch on the card, either through the ssh command or through your launch script:

  • [host]$ amplxe-cl -c knc-hotspots -- ssh mic0 KMP_FOR_TPROFILE=1 /home/user/myApp

For more details please refer to the "ITT API Collection on the Intel® Xeon Phi™ Coprocessor" topic in the product help.

Support for external data collectors

VTune Amplifier can launch external data collection using the Custom collector target configuration option or the -custom-collector command line option. For more details, please refer to the “Using a Custom Collector” topic in the product help.

Import externally collected data in CSV format into existing VTune Amplifier results

To import a csv file with externally collected data into an existing VTune Amplifier result, use the Import from CSV option in the Analysis Target or Analysis Type tab in the GUI, or the -import option in the command line interface. Importing a csv file does not affect symbol resolution in the existing result. For more details please refer to the “About Adding External Data to the Intel® VTune™ Amplifier” topic in the product help.

Support for importing a csv file that does not specify a hostname for the target system

You can import a csv file that does not specify a hostname for the target system but contains time stamps represented in the UTC format. In this case, the VTune Amplifier displays global data (not attributed to specific threads/processes) only. For more details please refer to the “Creating a CSV File with External Data” topic in the product help.

Search functionality for the grid views added to the toolbar

A Find button is available on the grid toolbar. It invokes the search dialog in the same way as Ctrl+F. See “Searching for Data” for more details.

Hardware event-based analysis types now support collection data limit

Hardware event-based analysis types now support the collection data limit to prevent collecting large amounts of data, which may slow down data processing. For more details, please refer to “Limiting Data Collection Size”.

Usability improvements for the structure of hardware event-based sampling analysis types

The hardware event-based sampling analysis tree was re-structured to introduce cross-CPU basic configurations and separate advanced CPU-specific analysis configurations. General Exploration and Bandwidth analysis types are shared between all supported CPUs. All tuning opportunities are covered by the General Exploration analysis type for newer processor families, e.g., IVB and beyond. Users should review the Tuning Guides to take full advantage of the General Exploration analysis type. CPU specific analysis types, when available, are expanded automatically according to the detected system type for older processor families (see note below). For more details, please refer to “About Performance Analysis with VTune Amplifier”

NOTE: The Ivy Bridge family of processors no longer has separate advanced analysis types, only General Exploration and Bandwidth. The Sandy Bridge advanced analysis types that used to be available for Ivy Bridge did not work on Ivy Bridge processors because of hardware incompatibilities and the metrics of interest are now included in the General Exploration analysis type. The Haswell processor family does not have separate advanced analysis types, either.

Timeline grouping options

Use the Timeline grouping menu to group the data by program units. A grouping level depends on the analysis type. For more details, please refer to “Managing Timeline View” in the product help.

Event reference help for Intel microarchitecture code name Haswell and Intel® Xeon Phi™ coprocessor

Event reference help is available for Intel microarchitecture code name Haswell processor and Intel® Xeon Phi™ coprocessor (code name: Knights Corner): go to Help > Intel Processor Event Reference menu in the standalone interface or Help > Intel VTune Amplifier XE 2013 > Intel Processor Event Reference in the Microsoft* Visual Studio* IDE.

Auto-rebuild of sampling and power drivers at system boot time after Linux kernel update

A Linux kernel update can lead to incompatibility with VTune Amplifier XE drivers for event-based sampling (EBS) analysis and power analysis. If the system has installed VTune Amplifier XE boot scripts, the drivers will be automatically re-built by the boot scripts at system boot time. Note: kernel development sources that are needed on the system for driver rebuild must correspond to the Linux kernel update.
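A quick way to check whether kernel development sources matching the running kernel are present is sketched below. It assumes the common layout in which /lib/modules/<release>/build points to the kernel build tree; some distributions may place the sources elsewhere.

```shell
# Compare the running kernel release with the installed kernel build tree.
KREL="$(uname -r)"
BUILD_DIR="/lib/modules/$KREL/build"

if [ -d "$BUILD_DIR" ]; then
    echo "Kernel development sources found for $KREL: $BUILD_DIR"
else
    echo "No kernel development sources found for $KREL; the driver rebuild may fail"
fi
```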

Update 15

Linux* Release NotesWindows* Release Notes

New for Update 15!

Ability to change the focus function from the Caller/Callee panes

  • Change a focus function from the Callers or Callees panes by double-clicking a function of interest. Alternatively, right-click a function and choose the Change Focus Function context menu option. For more details please refer to the "Window: Caller/Callee" topic in the product help.

Ability to collapse recursive functions in the Call Stack pane

  • To collapse all recursive functions into one entry in the Call Stack pane, select the Collapse Recursion option from the context menu.
  • VTune Amplifier updates the view and marks the entry with collapsed recursion as follows:

Improved Frame Rate Histogram representation per frame domain in the Summary Window

  • If there are many frame domains, use the Domain drop-down menu in the Summary pane to choose a frame domain to analyze with the frame rate histogram. If only one domain is available, the drop-down menu is grayed out. For more details please refer to the "Window: Summary" topic in the product help.

Automatic positioning of the hottest line in the Source/Assembly window after drilling down from the grid

  • The VTune Amplifier Source/Assembly window now automatically positions on the hottest line after you drill down from the grid:

Support for importing global discrete counters collected externally

  • You can import global discrete counters without specifying PID/TID. In that case the performance counter timestamp is not bound to a particular process/thread and is visualized in the Timeline pane in the new area for global counters, with a separate row for each counter type. For more details please refer to the "Creating a CSV File with External Data" topic in the product help, paragraph "Format for Discrete Values" and examples.
Update 14

Linux* Release NotesWindows* Release Notes

New for Update 14!

GPU OpenCL™ kernel analysis extended with memory transfers, visualization of OpenCL API and computing queue (Windows only)

If your application uses OpenCL software technology and is doing substantial computation work on the GPU, you may capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics by enabling the Trace OpenCL kernels on Processor Graphics option during analysis configuration. To view information about all OpenCL kernels running on the GPU, in the Graphics window switch Grouping to Computing Task Purpose / Computing Task (GPU) / Instance. VTune Amplifier identifies the following computing task purposes: Compute (kernels), Transfer (OpenCL routines responsible for transferring data from the host to a GPU), and Synchronization (for example, clEnqueueBarrierWithWaitList). The “Data Transferred” column shows the amount of data transferred along with the average bandwidth:

To view details on OpenCL kernels submission, in particular distinguish the order of submission and execution, and analyze the time spent in the queue, zoom in and explore the Computing Queue data in the Timeline pane. You can click a kernel task to highlight the whole queue to the execution displayed at the top layer:

Synchronization tasks are marked with vertical hatching. Data transfers are marked with cross-diagonal hatching.

For more details please refer to the “Analyzing Applications Using Intel® HD Graphics” and “Interpreting GPU OpenCL™ Application Analysis Data” topics in the product help.

Standalone interface improved to provide more workspace for the analysis results

In the VTune Amplifier XE Update 14 standalone interface, the menu and toolbar layout was improved to provide more vertical space for exploring analysis results. The menu is now invoked by the button at the top right corner. Use it to control result collection, define and view project properties, and set various options:

For more details on the user interface controls please refer to the “Standalone VTune Amplifier Interface” topic in the product help.

Ability to cache source files and explore collected performance statistics later even if the source file has been changed

Save your source files in the cache. You can go back to the cached sources at any time in the future and explore the performance data collected per code line at that moment of time. To enable the option, go to Menu > Options > Intel VTune Amplifier XE 2013 > Source/Assembly and select the Cache source files check box. VTune Amplifier then caches your sources in the result database when you open the Source window for the first time and provides the following message:

When you open the Source window for this result for the second time and if the source file has been changed, the VTune Amplifier opens the source from the cached file with the proper notification. For more details please refer to the “Pane: Options - Source/Assembly” topic in the product help.

Event-based stack sampling analysis of system processes for kernels and drivers (Windows only)

You can use the VTune Amplifier to profile the Windows kernel-mode process and analyze all privileged resource operations (for example, memory management, paging) it is responsible for or to explore your multithreaded kernel-mode drivers running in the context of this process. If you are a driver developer, this option can help you profile asynchronous driver threads and identify system resource utilization issues (for example, issues caused by frequent page allocations). To analyze the system process, run the VTune Amplifier with administrative privileges and configure the analysis target to attach to PID 4. For more details please refer to the “Attaching to a Process” topic in the product help.

Ability to show kernel stacks as continuation of user stacks

To view kernel stacks in the user functions stacks select the User/system functions call stack mode on the filter toolbar:

To locate the call of the kernel function in the assembly code, double click the function in the Call Stack pane:

Support for Intel® microarchitecture code named Silvermont

With the VTune Amplifier XE 2013 Update 14 you may perform hardware event-based sampling analysis on Intel® microarchitecture code named Silvermont by using Advanced hotspots from the Algorithm Analysis tree and General Exploration from Intel Atom Processor Analysis, or by creating a new custom Hardware Event-Based Sampling Analysis.

Support for Intel® Xeon® E5-2600 v2 & E5-1600 v2 processors based on the Intel microarchitecture code name IvyBridge-EP

With the VTune Amplifier XE 2013 Update 14 you may perform hardware event-based sampling analysis on Intel® microarchitecture code named IvyBridge-EP by using Advanced hotspots from the Algorithm Analysis tree or General Exploration and Bandwidth from the Sandy Bridge/Ivy Bridge/Haswell Analysis tree.

Simplified syntax for searching binary and symbol files with the -search-dir and -source-search-dir command line options

When finalizing the collected data and generating reports, the Intel® VTune™ Amplifier searches for supporting user files to display analysis information in relation to your source code. To resolve symbol information properly, use the -search-dir action-option to specify directories that should be searched for binary files (executables and dynamic libraries) and symbol files (typically .pdb files). To enable the source code view in the command line report, use the -source-search-dir option to specify directories with source files.
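A report command that uses both options might look like the following sketch. The result directory and the paths are hypothetical, and passing each directory as a separate argument after the option is an assumption about the simplified syntax; the “Specifying Search Directories” help topic remains the authoritative reference.

```shell
# Hypothetical result directory and search locations.
RESULT_DIR="r000hs"
BIN_DIR="/home/user/myApp/bin"      # binaries and symbol files
SRC_DIR="/home/user/myApp/src"      # source files

# Build the report command with both search-directory options.
CMD="amplxe-cl -report hotspots -r $RESULT_DIR -search-dir $BIN_DIR -source-search-dir $SRC_DIR"

# Print the command for review and run it only if amplxe-cl is on the PATH.
echo "$CMD"
if command -v amplxe-cl >/dev/null 2>&1; then
    $CMD
fi
```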

For more details please refer to the “Specifying Search Directories” topic in the product help.

Update 13

Linux* Release NotesWindows* Release Notes

New for Update 13!

  • Microsoft* Windows* 8.1 and integration with Microsoft* Visual Studio* 2013 IDE now supported! (Windows* only)
  • ITT pause/resume APIs supported on the Intel® Xeon Phi™ coprocessor
  • Display of externally collected data (CSV format with a predefined structure, only) with VTune Amplifier collected data
  • SSH-based remote collection via amplxe-cl
  • Debian* 7.1, SLES* 11 SP3 supported (Linux* only)
  • Bug fixes

Support for ITT pause/resume APIs on the Intel® Xeon Phi™ coprocessor

You can now use the pause/resume ITT API to control collection on the Intel® Xeon Phi™ coprocessor. Note that to profile applications with user APIs on the Intel Xeon Phi coprocessor, environment variables that control collection must be propagated from the host to the coprocessor card. See the “User API Collection on the Intel® Xeon Phi™ Coprocessor” help topic for more details.

SSH-based remote collection via amplxe-cl

Intel® VTune™ Amplifier enables you to collect data on a remote application from the host system (remote usage mode) via command line interface (amplxe-cl) and view the analysis result locally in the GUI. Remote data collection using the amplxe-cl command running on the host is very similar to the native collection on the target except that the -target ssh:user@target option is added to the command line.

As prerequisites, you need to install collectors on the remote target and enable password-less SSH access to the target.
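One common way to enable password-less access is key-based authentication with the standard OpenSSH tools. The function below is a sketch, not a product feature; its name, the default key path, and the user@target login are placeholders, and your site security policy may require a different setup.

```shell
# Sketch: set up key-based (password-less) SSH access to a target.
# Arguments: <user@target> [private key path]
setup_passwordless_ssh() {
    target="$1"
    key="${2:-$HOME/.ssh/id_rsa}"

    # Create a key pair without a passphrase if none exists yet
    [ -f "$key" ] || ssh-keygen -t rsa -N "" -f "$key"

    # Install the public key on the target (asks for the password once)
    ssh-copy-id -i "$key.pub" "$target"

    # Verify: BatchMode makes ssh fail instead of prompting for a password
    ssh -o BatchMode=yes "$target" true && echo "password-less SSH to $target works"
}

# Usage (not executed here): setup_passwordless_ssh user@target
```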

Example: to run event-based stack sampling collection for the application:

host>./amplxe-cl --target=ssh:user@target -collect advanced-hotspots -knob collection-detail=stack-sampling -- 

To control collection from the command line (pause, resume, or detach), you can use commands from the host as follows:

host>./amplxe-cl -r result@@@ -C pause

See the “Collecting Data Remotely from Command Line” help topic for details on setting up the collection and specifying search directories for proper symbol resolution.

Support for adding external collection data (in the CSV format with a predefined structure) to the VTune Amplifier analysis result collected in parallel with external statistics

VTune Amplifier provides an option to correlate interval or discrete data, provided by an external collector, with the regular data provided by the analyzer.

For example, you can see how the data captured from SoCs or peripheral devices (camera, touch screen, sensors, and so on) correlate with VTune Amplifier metrics collected for your analysis target.

You can extend standard VTune Amplifier performance analysis and launch a custom data collector directly from the VTune Amplifier. Your custom collector can be an application you analyze with the VTune Amplifier or a collector that can be launched with the VTune Amplifier. Learn more about configuring and launching a custom collector from GUI and command line from Using a Custom Collector help topic.

>amplxe-cl -collect hotspots -knob custom-collector="python.exe C:\work\custom_collector.py" -- notepad.exe

VTune Amplifier can process and integrate performance statistics collected externally with a custom collector in parallel with the native VTune Amplifier analysis. To achieve this, provide the collected custom data as a csv file with a predefined structure and save this file to the VTune Amplifier result directory.

VTune Amplifier can load and process the following data types:

  • Interval data with start time and end time
  • Samples with a set of counters

Data may be optionally bound to process and thread ID. VTune Amplifier represents data not bound to a particular process and thread (there are no PID and TID values in the csv file) as frames. Data bound to a process and a thread (there are PID and TID values in the csv file) is represented as tasks. Learn more about csv data format from Creating a CSV File with External Data help topic.

Example: Integrating Interval Data Not Bound to a Particular Process

You have a csv file with the following data types:

VTune Amplifier processes this data as frames (there are no TID and PID values specified) and displays the result as follows:

With the VTune Amplifier, you can easily correlate the frame data in the Timeline pane and grid view. You see that frame 4 took longer to process than subsequent frames 5 and 6 due to the poll_idle() call.

Update 12

Linux* Release NotesWindows* Release Notes

New for Update 12!

Tracing of OpenCL™ kernels execution on Intel® Processor Graphics (Windows* only)

If your application uses OpenCL™ on Intel® Processor Graphics, you can analyze GPU computing efficiency with VTune Amplifier XE by tracing OpenCL™ kernel execution on the GPU. To find out the execution time of OpenCL kernels, monitor the performance of each kernel per GPU metric, and identify hotspot kernels, select the Trace OpenCL kernels on Processor Graphics option while configuring a new analysis. When collection and post-processing are complete and the result is open, click the Graphics tab to see details of GPU activity correlated with CPU processes and threads. Use the “Computing tasks (GPU)” or “Source Computing Task (GPU)” grid groupings to see average values of GPU hardware metrics aggregated per kernel or per kernel instance. The Timeline shows kernel instances within the thread that submitted them. For more information please refer to the “GPU Analysis” and “Analyzing Applications Using Intel® HD Graphics” topics in the product help.

Graphical User Interface (GUI) install on Linux* (via special script)

You can now install the VTune Amplifier XE on Linux* through a graphical user interface by invoking the install_GUI.sh script. The flow is identical to the command line install, but makes the available install options easier to understand and configure.

Update 11

Linux* Release NotesWindows* Release Notes

New for Update 11!

Support for identifying function boundaries using static binary analysis methods for binaries without symbol information

To provide accurate performance data and enable source analysis, the Intel® VTune™ Amplifier requires debug information for the binary files it analyzes. Effective with Update 11, if it does not find debug information in the binaries, the VTune Amplifier statically identifies function boundaries and assigns hotspot addresses to generated pseudo names func@address for such functions. For more information please refer to the “Using Debug Information” topic in the product help.

NOTE: If debug information is absent, the Call Stack pane may not unwind the call stack correctly for user-mode sampling and tracing analysis types. Additionally in some cases, it can take significantly more time to finalize the results for modules that do not have debug information.

General Exploration metrics summary for hardware event-based sampling analysis results in the command line reports

Command line reports now provide General Exploration metrics summary for hardware event-based sampling analysis results providing a high-level overview of performance problems. The General Exploration Metrics section appears in a Summary report if events were collected during analysis. The set of metrics displayed in the summary depends on the profiled CPU type and list of events. For more information please refer to the “Viewing a Summary Report” topic in the product help.

Source Function Stack grouping level enabling more accurate result comparison in the Top-down Tree pane

Use Source Function Stack grouping level in the Top-down Tree pane for enabling more accurate result comparison for recompiled binary files when addresses of the same source function or same loop are different, like in these cases:

  • You slightly changed the source and recompiled
  • You changed compilation options and recompiled
  • You are comparing results compiled and collected for different microarchitectures.

By default, compared functions are grouped by the Function Stack granularity, which is based on function instances. VTune Amplifier treats the same functions with different addresses as separate instances and does not compare them:

When the data is aggregated by Source Function Stack, the VTune Amplifier ignores start addresses and compares functions by source file objects:

For more information please refer to the “About Viewing Comparison Data” topic in the product help.

Change Stack Layout option in the Top-down Tree and Bottom-up panes to switch between chain and tree types of stack layout

Use the Change Stack Layout option in the Top-down Tree and Bottom-up panes to manage stack data in the grids and switch between chain and tree types of stack layout. Click the Change Stack Layout button to switch between call stack layouts.

  • Chain layouts are typically more useful for the bottom-up view:
  • Tree layouts are more natural for the top-down view:

Support for scientific data representation in the grid

Bottom-up and Top-down Tree panes now support displaying performance values in scientific notation through the Show Data As context menu. Typically this format is recommended for analyzing values < 0.001. For more information please refer to the “Choosing Data Format” topic in the product help.
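To see why scientific notation helps for such values, compare how a small number looks in each format. The value below is a made-up metric, and printf is used only to illustrate the formatting; it is unrelated to how the product renders the grid.

```shell
# Format a small, hypothetical metric value in two ways.
VALUE="0.00042"
FIXED="$(printf '%.3f' "$VALUE")"    # fixed-point, three decimal places
SCI="$(printf '%.3e' "$VALUE")"      # scientific notation

echo "fixed-point: $FIXED"
echo "scientific:  $SCI"
```

With three decimal places the fixed-point form rounds the value to 0.000, while the scientific form keeps the significant digits (4.200e-04).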

Update 10

Linux* Release NotesWindows* Release Notes

  • Bug fixes.
Update 9

Linux* Release NotesWindows* Release Notes

New for Update 9!

Support for Hotspots, General Exploration and Bandwidth analysis types on the Intel® Xeon Phi™ coprocessor (except for the user API analysis) from Windows* host

With VTune Amplifier you can now tune on the Intel® Xeon Phi™ coprocessor from a Windows* host. Choose one of the predefined analysis types (Hotspots, General Exploration, or Bandwidth) or create a custom one. Follow the Finding Hotspots on the Intel® Xeon Phi™ Coprocessor tutorial and refer to the document “Optimization – Part 2: Hardware Events” for optimizing applications on the Intel Xeon Phi coprocessor using VTune™ Amplifier XE 2013 for Windows. To get more information about the Windows* early enabling program for the Intel® Xeon Phi™ Coprocessor please visit http://software.intel.com/en-us/mic-developer and http://software.intel.com/en-us/articles/windows-early-enabling-for-intelr-xeon-phitm-coprocessor.

NOTE: User API analysis is not yet supported by VTune Amplifier from Windows* host and will be enabled in future updates

Advanced Hotspots analysis (formerly, Lightweight Hotspots) introducing several collection levels

The former “Hotspots” and “Lightweight Hotspots” analysis types were renamed in the GUI to “Basic Hotspots” and “Advanced Hotspots” respectively, and Advanced Hotspots introduces several collection levels. “Basic Hotspots” provides a general performance profile at the user level. “Advanced Hotspots” performs hardware event-based sampling analysis by using PMU counters, and lets you choose a collection level with a different amount of detail and overhead:

  • “Hotspots” – no stacks, context switches and call counts - low overhead
  • “Hotspots, stacks and context switches” – medium overhead
  • “Hotspots, call counts, stacks and context switches” – the highest level of detail at the cost of more overhead

For more information on the interface changes please refer to the Intel® IDZ KB article

NOTE: Command line interface still supports former analysis format in deprecated mode to allow gradual migration to a new analysis

GPU analysis for Intel Processor Graphics based on hardware metrics such as Execution Units (EU) Array Active/EU Array Stalled/EU Array Idle, GPU Memory Bandwidth, GPU L3 Cache Misses, and others (Windows* only)

For applications using a Graphics Processing Unit (GPU) for rendering, video processing, and computations, VTune Amplifier can monitor, analyze, and correlate activities on both the CPU and GPU (Windows* only). To enable the GPU analysis, configure your predefined or custom configuration to Analyze Processor Graphics and DirectX* pipeline events. GPU analysis for Intel Processor Graphics is based on hardware metrics such as Execution Units (EU) Array Active/EU Array Stalled/EU Array Idle, GPU Memory Bandwidth, GPU L3 Cache Misses, and others. It helps to estimate how effectively the Intel Integrated Graphics is used. Analysis of DirectX* pipeline events is used to correlate CPU/GPU usage and helps to identify whether an application is CPU or GPU bound. For more information please refer to the “GPU Analysis” and “GPU Metrics” topics in the product help.

GPU analysis based on DirectX* pipeline events and used to correlate CPU/GPU usage and identify whether an application is CPU or GPU bound (Windows* only)

Explore Summary pane for GPU Usage and DirectX frame rate histogram:

Switch to “Graphics” tab to see distribution of the GPU metrics over time.

Top-Down performance analysis methodology in General Exploration analysis type for the 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell

Update 9 introduces the Top-Down performance analysis methodology for the 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell, integrated into the General Exploration analysis type. The hierarchical data display corresponds to how available execution slots in each core’s pipeline are utilized. Expand a column to see a breakdown of issues pertaining to its category of pipeline utilization: Retiring, Bad Speculation, Back-end Bound, or Front-end Bound Slots. For more details refer to the Haswell tuning guide.

Overhead and Spin time classification for GCC* and Microsoft* OpenMP* runtimes

VTune Amplifier can now classify Overhead and Spin time for GCC* and Microsoft* OpenMP* runtimes and show the metrics in the grid and Timeline pane. This helps you identify inefficient use of the threading runtimes: cases where a significant portion of time is spent inside the parallel runtime, wasting CPU time at high concurrency levels (overhead), or where a significant portion of CPU time is spent on spin (active) waits. For more information please refer to the “Overhead and Spin time” topic in the product help.

Overhead and Spin time for GCC* OpenMP*:

Overhead and Spin time for Microsoft* OpenMP*:

Source and assembly data available in the command line reports

Source and assembly data is available in all command line reports. Use the “-source-object” option to switch a report to source or assembly view mode, including associated performance data. Specify “-group-by address” to see the disassembly view. For more information please refer to the “Source-object” topic in the product help.

Example 1: $ amplxe-cl -report hotspots -source-object function=foo

Example 2: $ amplxe-cl -report hotspots -source-object function=foo -group-by basic-block,address

Total metric in the Source/Assembly panes

Analyze collected data in Source/Assembly pane per code line using the Self and Total types of performance metrics. For example, for the Basic Hotspots analysis, the CPU Time: Self column shows the amount of processor time (in seconds) taken to execute a code line while the CPU Time: Total column shows the processor time spent on the code line execution and calls from this line, if any.

Update 8

Linux* Release NotesWindows* Release Notes

  • Bug fixes for upcoming 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell.

NOTE: If you are interested in tuning on Haswell-based systems, we recommend upgrading to Update 8. Otherwise you may continue to use the previously released Update 7.

Update 7

Linux* Release NotesWindows* Release Notes

New for Update 7!

Support for the hardware event-based sampling analysis of Windows Store C# and JavaScript applications on Microsoft Windows 8* via the Attach to Process or Profile System modes

Windows Store C# and JavaScript applications can be profiled by using the event-based sampling analysis in “Attach to Process” and “Profile System” modes. Before analysis make sure you have administrative privileges to run the data collection. Mapping to the source file is supported for JavaScript modules. For more information and support limitations please refer to the “Windows Store Applications Analysis” topic in the product help.

Assembly grouping by RVA, basic blocks, and function ranges

Assembly view can be grouped by RVA, Basic Block, or Function Range. To change the hierarchy of the instructions - select the required granularity from the Assembly grouping drop-down menu on the Source/Assembly window toolbar. For more information on grouping capabilities please refer to the “Grouping Data” topic in the product help.

Support for applications generated by MinGW/Cygwin GCC*

VTune Amplifier XE now supports profiling of applications built with GCC* (MinGW and Cygwin) on Windows. The VTune Amplifier XE 2013 Update 7 release was qualified against Cygwin 1.7.17 with GCC* 4.5.3 and MinGW with GCC* 4.6.2. The pictures below show the analysis result view before and after the feature was introduced in Update 7:

Before Update 7:

Since Update 7:

Event summary for hardware event-based sampling analysis results in the command line reports

The command-line summary report now includes an “Event summary” section for hardware event-based sampling analysis results, which summarizes core and uncore PMU events.

Highlighting performance issues based on filtered-in data

Highlighting performance issues is now based on filtered-in data. See the example for CPI rate issues below.

  1. Observe data
  2. Filter in by selection
  3. Results:
    • Before Update 7:
    • Since Update 7:
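The effect of basing the highlighting on filtered-in data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event counts, the function names, and the `CPI_THRESHOLD` value are hypothetical and do not reflect the thresholds VTune Amplifier XE actually uses.

```python
# Hypothetical per-function event counts; not real collection data.
SAMPLES = [
    {"function": "parse", "cycles": 4_000, "instructions": 10_000},
    {"function": "solve", "cycles": 9_000, "instructions": 3_000},
    {"function": "report", "cycles": 2_000, "instructions": 4_000},
]

# Assumed threshold for illustration only.
CPI_THRESHOLD = 1.0


def cpi(rows):
    """Cycles per instruction over the given rows."""
    cycles = sum(r["cycles"] for r in rows)
    instructions = sum(r["instructions"] for r in rows)
    return cycles / instructions


def is_flagged(rows, threshold=CPI_THRESHOLD):
    """True when the CPI rate of the given rows exceeds the threshold."""
    return cpi(rows) > threshold


# Filter in by selection: keep only the rows for the selected function.
selection = [r for r in SAMPLES if r["function"] == "solve"]

print("whole result flagged:", is_flagged(SAMPLES))
print("filtered-in data flagged:", is_flagged(selection))
```

In this sketch the CPI rate computed over the whole result stays below the assumed threshold, while the CPI rate of the filtered-in selection exceeds it, so only a check based on the filtered-in data would flag the issue.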

Stitching stacks for Intel® OpenMP* applications

Starting with Update 7, during user-mode sampling and tracing analysis of an OpenMP application that uses the Intel runtime libraries, VTune Amplifier XE automatically enables the Stitch stacks option. This option restores a logical call tree by catching notifications from the runtime and attaching stacks to the point that introduces a parallel workload. To view the OpenMP objects hierarchy, explore the data provided in the Top-down Tree pane. To analyze a logically structured OpenMP call flow, make sure to compile and run your code with the Intel® Compiler 13.1 Update 3 or higher (part of the Intel Composer XE 2013 Update 3). For more information, refer to the “Stitching Stacks” topic in the product help.

Stitch stack option disabled:

Stitch stack option enabled (default behavior from Update 7 onwards):

Update 6

Linux* Release Notes | Windows* Release Notes

New for Update 6!

Details:

  • The Caller/Callee window is available in all viewpoints that provide call stack data. Use this window to analyze parent and child functions of the selected focus function and identify the most time-critical call paths. You can double-click a function of interest to go to the source view and explore the function performance by source line. Use the Filter In by Selection grid context menu option on a function of interest to display functions included in all sub-trees that contain the selected function at any level. For more information, refer to the “Window: Caller/Callee” topic in the product help.
  • Improved welcome page now provides quick access to the recently used analysis configurations and analysis results.
  • Separate configuration tabs for Binary/Symbol Search and Source Search. Use the tabs to configure the search directories for the binary/symbol and source files required to finalize collected data and work with the source/assembly view. For example, if the application to analyze and its source files were moved from the location where the application was compiled, specify the directories for the separate debug files and source files in these tabs so that symbols are resolved properly and the source/assembly view works.
  • To get context help on a particular hardware PMU event or performance metric, select the What’s This Column? option from the grid context menu.
  • Overhead and Spin time metrics are provided in the grid and Timeline pane of the Hotspots by CPU Usage, Hotspots by Thread Concurrency, and Lightweight Hotspots viewpoints. These metrics help identify inefficiencies in the use of threading runtimes (for example, Intel® Threading Building Blocks, Intel® Cilk™, OpenMP*), either when a significant portion of time is spent inside the parallel runtime wasting CPU time at high concurrency levels (overhead), or when a significant portion of CPU time is spent on spin (active) waits. For more information, refer to the “Overhead and Spin time” topic in the product help.
    NOTE: VTune Amplifier ignores the Overhead and Spin time when calculating the CPU Usage metric.
  • To change the measurement units on the time scale select the Show Time Scale As context menu option in Timeline, and choose from the following values:
    • Elapsed Time (default)
    • OS Timestamp
    • CPU Timestamp
    For all Timeline view control capabilities, refer to the “Managing Timeline View” topic in the product help.
  • On Fedora* 18 pango packages should be installed, including pangox-compat
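The Filter In by Selection behavior described for the Caller/Callee window above can be sketched in Python. This is one possible reading, not the product's implementation: the call stacks, the function names, and the `filter_in` helper are hypothetical, and each sub-tree is modeled simply as a root-to-leaf call path.

```python
# Hypothetical call paths, each listed from the root caller down to the leaf.
STACKS = [
    ["main", "load", "parse"],
    ["main", "solve", "kernel"],
    ["worker", "solve", "kernel"],
    ["worker", "log"],
]


def filter_in(stacks, selected):
    """Keep the call paths that contain `selected` at any level.

    Returns the kept paths and the sorted set of functions they include.
    """
    kept = [stack for stack in stacks if selected in stack]
    functions = sorted({func for stack in kept for func in stack})
    return kept, functions


kept, functions = filter_in(STACKS, "solve")
print(kept)
print(functions)
```

In this model, filtering in by `solve` keeps the two call paths that pass through it and drops the paths through `load` and `log`, so only callers and callees related to the selected function remain visible.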
Update 5

Linux* Release Notes | Windows* Release Notes

All Operating Systems

  • Support of Hotspots, General Exploration and Bandwidth viewpoints for upcoming 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell
  • Bug fixes

Linux* Only

  • User API support on the Intel® Xeon Phi™ coprocessor
Update 4

Linux* Release Notes | Windows* Release Notes

All Operating Systems

  • Support for upcoming 4th generation Intel® Core™ processors, codenamed Haswell, including Lightweight Hotspots, General Exploration and Bandwidth analysis
  • General Exploration viewpoint for the Intel microarchitecture code name Ivy Bridge
  • Frame analysis for OpenMP parallel regions
  • Attaching to Java* processes for hardware event-based sampling analysis types
  • CPU utilization data in the Hotspots viewpoint for event-based sampling analysis results
  • Usability improvements in the Timeline view, including sorting and separate band height set up
  • Bug fixes

Linux* Only

  • General Exploration analysis for the Intel® Xeon Phi™ coprocessor
  • Event-based sampling analysis for OpenCL* applications on the Intel® Xeon Phi™ coprocessor (JIT collection)
  • Ubuntu* 12.10 support
Update 3

Linux* Release Notes | Windows* Release Notes

All Operating Systems

  • Loop Mode switch in the filter bar enabling loop analysis
  • Support for multiple domains in __itt_frame_* API
  • Bug fixes

Linux* Only

  • Better stack quality for applications that use Java* or Intel® Math Kernel Library (Intel® MKL)
Update 2

Linux* Release Notes | Windows* Release Notes

All Operating Systems

  • Search functionality in all grid panes, including the Bottom-up, Top-down Tree, and Source/Assembly views
  • Expanded Analysis Type tree for the current CPU
  • Self-contained command line generation from GUI referencing low-level collection options

Windows* Only

  • Microsoft Windows* Server 2012 support
  • Improved integration with Microsoft Windows* 8 operating system and Microsoft Visual Studio* 2012 IDE

Linux* Only

  • Improved support for the Intel® Xeon Phi™ coprocessor (codename: Knights Corner), including automated install of the hardware event-based sampling collector on the coprocessor card(s), a predefined Memory Bandwidth analysis within the coprocessor card, and the getting started guide “Finding Hotspots” tutorial
Update 1

No release notes are available for this update. The only changes made were bug fixes.

Initial Release

Linux* Release Notes | Windows* Release Notes

  • Call counts
  • Hardware stack sampling
  • Better bandwidth analysis
  • Java* profiling
  • Tune Intel® Xeon Phi™ products
  • User tasks support
  • DirectX* frames
  • Power analysis on Linux*

...and more!

For more complete information about compiler optimizations, see our Optimization Notice.