What's New? - Intel® VTune™ Amplifier XE 2015 Update 3

Intel® VTune™ Amplifier XE 2015 performance profiler

A performance profiler for serial and parallel performance analysis.

New for the Update 3 release! (Optional update unless you need…)

Compared to the 2015 Update 2 release

All Operating Systems

Note: We are now labeling analysis tool updates as "Recommended for all users" or "Optional update unless you need…".  Recommended updates will be available about once a quarter for users who do not want to update frequently.  Optional updates may be released more frequently, providing access to new processor support, new features, and critical fixes.

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

 

File: vtune_amplifier_xe_2015_update3.tar.gz

Installer for Intel® VTune™ Amplifier XE 2015 for Linux* Update 3

File: VTune_Amplifier_XE_2015_update3_setup.exe

Installer for Intel® VTune™ Amplifier XE 2015 for Windows* Update 3

File: vtune_amplifier_xe_2015_update3.dmg

Installer for Intel® VTune™ Amplifier XE 2015 - OS X* host only Update 3

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.


OpenMP Enhancements

Potential Gain metric expanded by parallelization inefficiencies, showing their wall-time cost

A CPU time-based classification of Spin and Overhead time in the OpenMP runtime does not reveal the elapsed-time impact of a parallel region inefficiency, because that impact depends on the number of working threads. VTune Amplifier’s new per-region OpenMP metrics, which are based on CPU time, are now normalized by the number of threads in the region and shown as an expansion of the “Potential Gain” metric.

Besides reporting the potential gain metric in absolute elapsed time, the VTune Amplifier display breaks down the impact of various issues by percentage of total application elapsed time.
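The normalization described above can be illustrated with a small worked example. This is a minimal sketch, not VTune Amplifier's implementation: the function name, the category names, and the sample numbers are all hypothetical, and the only rule taken from the text is that CPU time is divided by the number of threads in the region and then related to the application elapsed time.

```python
def potential_gain(inefficiency_cpu_time, num_threads, app_elapsed_time):
    """Convert per-category CPU time (summed over threads) into wall-time cost.

    Returns {category: (gain_seconds, percent_of_app_elapsed_time)}.
    Hypothetical helper: it only models the normalization by thread count.
    """
    result = {}
    for category, cpu_seconds in inefficiency_cpu_time.items():
        gain_seconds = cpu_seconds / num_threads
        percent = 100.0 * gain_seconds / app_elapsed_time
        result[category] = (gain_seconds, percent)
    return result


# Illustrative numbers: CPU seconds wasted in one region, summed over 4 threads.
region_cpu_time = {"imbalance": 8.0, "lock_contention": 2.0}
gains = potential_gain(region_cpu_time, num_threads=4, app_elapsed_time=20.0)
for category, (seconds, percent) in gains.items():
    print(f"{category}: {seconds:.2f} s ({percent:.1f}% of elapsed time)")
```

With these made-up inputs, 8 CPU seconds of imbalance spread over 4 threads corresponds to 2 seconds of wall time, or 10% of a 20-second run.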

Precise trace-based imbalance calculation, especially useful for profiling small region instances

Imbalance of working threads on barriers is a major performance issue that prevents efficient CPU utilization by OpenMP applications. VTune Amplifier’s sampling method may miss imbalance in certain situations, for example when:

  • region instances are shorter than the sampling interval,
  • there are too few parallel region instances to produce statistically valid results, or
  • threads enter a passive wait on a barrier and don’t consume CPU time in a busy wait (e.g., with KMP_BLOCKTIME=0).

To cover these situations, the Intel OpenMP runtime library from Parallel Studio XE 2016 Beta reports the precise imbalance time to VTune Amplifier. This additional information from the OpenMP runtime does not add overhead since the reporting is done on a per-barrier basis. The precise imbalance metrics are displayed when the OpenMP Potential Gain metric is expanded.
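To make the idea of trace-based imbalance concrete, the sketch below computes it from per-thread barrier arrival timestamps rather than from samples. It is a simplified model under stated assumptions: the function name and the timestamps are hypothetical, and the formula (each thread waits until the last thread arrives, averaged over threads) is an illustration, not the formula used by the Intel OpenMP runtime.

```python
def barrier_imbalance(arrival_times):
    """Return (average_wait, per_thread_waits) for one barrier.

    arrival_times: time in seconds at which each thread reached the barrier.
    Assumption: a thread's wait is the gap between its arrival and the
    arrival of the slowest thread.
    """
    last_arrival = max(arrival_times)
    waits = [last_arrival - t for t in arrival_times]
    return sum(waits) / len(waits), waits


# Four threads reach the same barrier at different times (illustrative data).
average_wait, waits = barrier_imbalance([1.0, 1.5, 2.0, 3.0])
print(waits)         # per-thread wait on the barrier
print(average_wait)  # imbalance normalized by the number of threads
```

Because the calculation uses exact arrival times, it works even when the whole region is shorter than a sampling interval or when threads wait passively and consume no CPU time.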

Detailed analysis by barrier-to-barrier region segments to explore performance of OpenMP work-sharing constructs and barrier cost inside a region

When an OpenMP region contains multiple constructs with barriers (e.g., loops with implicit barriers, a ‘single’ construct, or a user barrier), it is useful to break down inefficiency metrics by barrier-to-barrier segment. For example, a region may consist of four barrier-to-barrier segments.

The Intel OpenMP runtime from Intel Composer XE 2015 Update 3 (or higher) instruments barriers for VTune Amplifier to enhance its inefficiency metrics.  The barrier type is added to the segment name – loop, single, reduction, etc.  The runtime also emits additional information for parallel loops with implicit barriers, such as loop scheduling and chunk size, that is useful in understanding imbalance or the nature of the scheduling overhead. Use the /Barrier-to-Barrier Segment grouping to view the statistical distribution by barrier-to-barrier segments.

Please note that the same lexical loop constructs with different schedule types or chunk sizes will be displayed separately in different rows.  For example, if one instance had a chunk size of 1000 and another had a chunk size of 1563, there would be two entries for the construct with the same name but different sizes in the OpenMP Loop Chunk column.
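The row-splitting behavior described above amounts to grouping segment instances by a composite key. The following sketch shows that idea only; the record fields, segment name, and timings are hypothetical and do not reflect VTune Amplifier's internal data model.

```python
from collections import defaultdict


def group_segments(instances):
    """Aggregate segment instances by (segment name, schedule, chunk size)."""
    rows = defaultdict(lambda: {"count": 0, "elapsed": 0.0})
    for inst in instances:
        key = (inst["segment"], inst["schedule"], inst["chunk"])
        rows[key]["count"] += 1
        rows[key]["elapsed"] += inst["elapsed"]
    return dict(rows)


# The same lexical loop executed with two different chunk sizes.
instances = [
    {"segment": "loop@solver.c:42", "schedule": "static", "chunk": 1000, "elapsed": 0.5},
    {"segment": "loop@solver.c:42", "schedule": "static", "chunk": 1000, "elapsed": 0.25},
    {"segment": "loop@solver.c:42", "schedule": "static", "chunk": 1563, "elapsed": 0.25},
]
rows = group_segments(instances)
for (segment, schedule, chunk), stats in rows.items():
    print(segment, schedule, chunk, stats["count"], stats["elapsed"])
```

Since the chunk size is part of the key, the single loop produces two rows: one for chunk size 1000 and one for chunk size 1563.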

Barrier-to-Barrier Segments are also available on the timeline.


Intel® MPI and OpenMP Multi-rank Analysis on a Compute Node

Per-rank detection of Intel MPI communication busy-wait time, with the metric shown in the summary, grid, and timeline views

For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. VTune Amplifier recognizes samples in Intel MPI communication busy wait functions and shows metrics based on that information. For multi-rank OpenMP results, VTune Amplifier’s Summary view is enriched with a table of Top MPI ranks with OpenMP metrics, sorted by MPI Communication Spin time from low to high values. The lower the Communication time, the more the rank was executing (vs. spinning) and the more impact OpenMP tuning will have on the application elapsed time.

Process names are hyperlinked to the Bottom-up view with ‘/Process /OpenMP Region/ …’ groupings to get details of the OpenMP metrics aggregated per-process, with the ability to expand the results by Regions and Barrier-to-Barrier Segments.

MPI Communication Spin time is highlighted on the timeline.

Intel MPI selective rank profiling configuration option, including EBS analysis for multiple ranks on a node

To simplify selective rank profiling configuration for VTune Amplifier analysis of MPI applications, Intel MPI introduced the ‘-gtool’ option in version 5.0.2. The option syntax is:

$ mpirun -genvall -gtool "amplxe-cl -r <my_result> -collect <analysis type>:<rank_set>[=exclusive]" -n <n> <my_app> [my_app_options]

where <rank_set> specifies the rank range to be included in the VTune Amplifier analysis. Separate ranks with a comma or use the “-” symbol for a set of contiguous ranks. Use the ‘all’ value to configure profiling on all the ranks. Exclusive launch mode helps prevent running more than one collection per node, which is a limitation of EBS profiling.
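As an illustration of the <rank_set> syntax described above, the sketch below expands a rank set into explicit rank numbers. It is a hypothetical helper written only from the description in the text (commas, “-” ranges, and ‘all’); it is not Intel MPI's parser and does no validation of malformed input.

```python
def expand_rank_set(rank_set, total_ranks):
    """Expand a rank set such as "0,2-4" or "all" into a sorted list of ranks."""
    if rank_set == "all":
        return list(range(total_ranks))
    ranks = set()
    for part in rank_set.split(","):
        if "-" in part:
            # A contiguous range such as "2-4" includes both end points.
            first, last = part.split("-")
            ranks.update(range(int(first), int(last) + 1))
        else:
            ranks.add(int(part))
    return sorted(ranks)


print(expand_rank_set("0,2-4", total_ranks=8))
print(expand_rank_set("all", total_ranks=4))
```

For a job with 8 ranks, the set "0,2-4" selects ranks 0, 2, 3, and 4 for profiling, while the remaining ranks run without the tool.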

Starting with Intel MPI version 5.0.3, the ‘node-wide’ clause can be used instead of ‘exclusive’ to collect data on all ranks of the nodes on which the <rank_set> resides, or on all nodes in the case of ‘all’ ranks. In this case, VTune Amplifier creates one result directory per node, with the host name appended as a suffix to the result directory name. This is particularly convenient for EBS collection, where there are limitations on simultaneous profiling by multiple VTune Amplifier command lines.

Below is an example of node level profiling:

$ mpirun -gtool "amplxe-cl -c advanced-hotspots -r my_dir:all=node-wide" -n 4 -ppn 2 my_mpi_app

VTune Amplifier XE command line generation for selective rank profiling through Intel® Trace Analyzer and Collector (ITAC) user interface

With Intel Trace Analyzer and Collector 9.0.2 and later, you can generate VTune Amplifier hotspot analysis command lines for ranks selected in the ITAC graphical user interface, either from the “Event Timeline” chart or from the ‘Function Profile/ Load Balance’ grid view. You can also copy the generated command line for the most CPU-bound process from the ITAC summary page (for more details, see https://software.intel.com/en-us/node/541057).

User Interface Enhancements

General Exploration analysis with confidence indication

Some metrics in VTune Amplifier may now be marked as unreliable: their values are greyed out in the Summary, Bottom-up, and Source views. This happens when too few event samples were collected to calculate the metric reliably.

This indication is currently used for EBS metrics in General Exploration analysis, and it may be extended to more metrics in the future if the feedback is favorable.
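The confidence indication can be pictured as a check on the number of samples behind each metric. The sketch below is purely illustrative: the threshold value, the function name, and the use of parentheses to stand in for greyed-out text are all assumptions, since the actual criterion VTune Amplifier applies is not stated here.

```python
MIN_SAMPLES = 100  # hypothetical threshold, not a documented VTune value


def format_metric(value, sample_count, min_samples=MIN_SAMPLES):
    """Render a metric value, marking it when too few samples back it.

    Parentheses stand in for the greyed-out rendering used in the GUI.
    """
    text = f"{value:.3f}"
    if sample_count < min_samples:
        return f"({text})"
    return text


print(format_metric(0.125, sample_count=5000))  # enough samples
print(format_metric(0.875, sample_count=12))    # too few samples
```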

Timeline “Super Tiny” bird’s-eye view

On modern servers and many-core coprocessor cards with a large number of ranks and threads, timeline analysis of core utilization benefits from a bird’s-eye view that makes application phases and behavioral patterns recognizable before you zoom and filter the data. VTune Amplifier’s “Super Tiny” view shows all application threads at once, using pixel color intensity to reflect the Efficient, Spin & Overhead, and MPI Communication Time metrics. With hierarchical timeline grouping, the “Super Tiny” view shows only the leaves, arranged according to the grouping hierarchy.

Access the new view from the timeline context menu.

New Filtering Mode for Command Line Reports

To display only particular columns providing metrics/event data, use the -column option and specify a full name of the required column(s) or its substring.

Examples:

  • Show grouping and data columns only for event columns with the *INST_RETIRED.* string in the title:
$ amplxe-cl -R hw-events -r r000ah --column=INST_RETIRED.
  • Show grouping and data columns only for columns with the Idle and Spin strings in the title:
$ amplxe-cl -R hotspots -r r001hs --column=Idle,Spin
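The substring matching used in the examples above can be sketched as follows. This is an illustration of the described behavior only, not amplxe-cl's code: the function name and the column titles are hypothetical, and case-sensitive matching is an assumption.

```python
def select_columns(grouping_column, data_columns, column_option):
    """Keep the grouping column plus data columns matching any substring.

    column_option mirrors the comma-separated value given to the column
    filter, for example "Idle,Spin".
    """
    patterns = [p.strip() for p in column_option.split(",") if p.strip()]
    kept = [name for name in data_columns
            if any(pattern in name for pattern in patterns)]
    return [grouping_column] + kept


# Hypothetical report columns.
columns = ["CPU Time:Idle", "CPU Time:Spin Time",
           "CPU Time:Effective Time", "Overhead Time"]
print(select_columns("Function", columns, "Idle,Spin"))
```

The grouping column is always kept, and a data column is shown if its title contains at least one of the requested strings.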