What's new? - Intel® VTune™ Amplifier XE 2015

Intel® VTune™ Amplifier XE 2015

A performance profiler for serial and parallel performance analysis.

New for the initial 2015 release! (Recommended for all users)

As compared to 2013 SP1

All Operating Systems

Windows* Operating Systems

Linux Operating Systems

Note: We are now labeling analysis tool updates as "Recommended for all users" or "Optional update unless you need…".  Recommended updates will be available about once a quarter for users who do not want to update frequently.  Optional updates may be released more frequently, providing access to new processor support, new features, and critical fixes.

Resources

Contents

 

File: vtune_amplifier_xe_2015.tar.gz

Installer for Intel® VTune™ Amplifier XE 2015 for Linux*

File: VTune_Amplifier_XE_2015_setup.exe

Installer for Intel® VTune™ Amplifier XE 2015 for Windows*

File: vtune_amplifier_xe_2015.dmg

Installer for Intel® VTune™ Amplifier XE 2015 - OS X* host only

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.


Enhanced OpenMP* region analysis on Intel® Xeon® and Intel® Xeon Phi™ systems

With enhanced OpenMP* region analysis, identify common performance bottlenecks, such as load imbalance, granularity issues or synchronization issues. See serial and parallel times for your application and potential tuning gains for parallel regions. For more details refer to the “OpenMP* Analysis” topic in the product help.

Example of new OpenMP* support

Second example of OpenMP* support
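To make the load-imbalance idea concrete, here is a minimal Python sketch. It is a simplified model, not the formula VTune Amplifier uses: it assumes you already have per-thread busy times for one parallel region (the `thread_busy_seconds` input and the function name are hypothetical) and estimates the gain as the gap between the slowest thread and the average.

```python
from statistics import mean


def imbalance_summary(thread_busy_seconds):
    """Summarize load imbalance for one parallel region.

    thread_busy_seconds is a hypothetical list of per-thread busy times;
    this is an illustrative model, not VTune Amplifier's own metric.
    """
    if not thread_busy_seconds:
        raise ValueError("need at least one thread time")
    longest = max(thread_busy_seconds)   # the region lasts as long as its slowest thread
    average = mean(thread_busy_seconds)  # region time if work were perfectly balanced
    potential_gain = longest - average
    return {
        "elapsed": longest,
        "average": average,
        "potential_gain": potential_gain,
        "imbalance_fraction": potential_gain / longest if longest else 0.0,
    }


summary = imbalance_summary([4.0, 2.0, 3.0, 3.0])
print(summary)
```

In this toy input one thread runs 4 seconds while the average is 3 seconds, so perfectly even scheduling could save about a quarter of the region's time.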


Easier data collection on Intel® Xeon Phi™ coprocessors

Collecting data on Intel® Xeon Phi™ coprocessors is easier than ever, with an improved analysis workflow driven by the new target system configuration options.  Call stack collection is also now supported for Intel Xeon Phi coprocessors.  ITT API collection (including OpenMP* analysis) now works out of the box on the Intel Xeon Phi coprocessor for both native and offload applications, without the need to set any environment variables. For more details, refer to the “Intel Xeon Phi Coprocessor Analysis Workflow” topic in the product help.


Easier to use General Exploration and Bandwidth Analysis

Stop worrying about which microarchitecture you’re profiling and use the new General Exploration and Bandwidth analysis types, enabling you to use the same command line on any supported system!  For more details, please refer to the “About Performance Analysis with VTune Amplifier” topic in the product help.

The hardware event-based sampling analysis tree has been re-structured to introduce cross-CPU basic configurations and separate advanced CPU-specific analysis configurations. General Exploration and Bandwidth analysis types are shared between all supported CPUs.  All tuning opportunities are covered by the General Exploration analysis type for newer processor families, e.g., Ivy Bridge and beyond.  Review the Tuning Guides to take full advantage of the General Exploration analysis type.  CPU specific analysis types, when available, are expanded automatically according to the detected processor type for older processor families (see note below).

NOTE: The Ivy Bridge family of processors no longer has separate advanced analysis types, only General Exploration and Bandwidth.  The Sandy Bridge advanced analysis types that used to be available for Ivy Bridge did not work on Ivy Bridge processors because of hardware incompatibilities and the metrics of interest are now included in the General Exploration analysis type.  Also, the Haswell processor family does not have separate advanced analysis types.  Again, use the General Exploration metrics and the Haswell tuning guide.
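As an illustration of "the same command line on any supported system", the following Python sketch assembles one collection command that does not depend on the CPU family. It assumes the command-line names of the two shared analysis types are `general-exploration` and `bandwidth`; confirm them with the product's command-line help. The helper name, the application path and its arguments are hypothetical.

```python
import shlex


def build_collect_command(analysis_type, app_args, result_dir=None):
    """Return an amplxe-cl argument list for the given analysis type."""
    command = ["amplxe-cl", "-collect", analysis_type]
    if result_dir:
        # Optional: store the result in a specific directory.
        command += ["-result-dir", result_dir]
    # Everything after "--" is the application to profile and its arguments.
    command += ["--"] + list(app_args)
    return command


# Assumed analysis type name; the same list works on any supported CPU.
cmd = build_collect_command("general-exploration", ["./my_app", "--size", "1024"])
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the collection on a system where the product is installed.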

Custom Groupings

There are many new ways to group and order the performance data, including custom groupings in grid views and new groupings in the timeline pane.
To see how to create a custom grouping, please refer to the "Grouping Data" and "Dialog Box: Custom Grouping" topics in the product help.

Use the Timeline grouping menu to group the data by program units. The available grouping levels depend on the analysis type. For more details, please refer to “Managing Timeline View” in the product help.

Enhanced navigation in the clickable Summary pane

Hyperlinks open the Bottom-up view sorted by the selected metric or directly to the selected function or OpenMP region.

Easier remote collection

Use the graphical interface running on a Windows* or Linux* host system to collect data on a remote Linux* system via SSH. Configure remote collection via the “remote Linux (SSH)” Target system configuration option in the Project Properties dialog.

NOTE:

  1. ssh/scp or plink/pscp tools must be available in the PATH
  2. When collecting data remotely, VTune Amplifier XE looks for the compatible collector on the remote system in the default install location: /opt/intel/vtune_amplifier_xe_<version>. It also temporarily stores performance results on the target system in the /tmp directory. If you installed the VTune Amplifier XE to a different location on the target, or need to specify another temporary directory, use the appropriate configuration options in the Project Properties/Target tab in the GUI, or the collection knobs -target-install-dir and -target-tmp-dir on the command line.
  3. If your target application requires a custom working directory or user-defined environment variables, you can specify them in a launching script and use the script as the application to launch.

For more details please refer to the "Collecting Data Remotely from the VTune Amplifier GUI" topic in the product help.
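The launching script mentioned in note 3 can be generated rather than written by hand. The Python sketch below is one possible approach, with all paths, variable values and the helper name chosen purely for illustration: it produces a small POSIX shell script that changes to the working directory, exports the environment variables and then starts the application.

```python
import shlex


def make_launch_script(app_path, workdir, env, app_args=()):
    """Return the text of a shell script that sets up and launches the app."""
    lines = ["#!/bin/sh", f"cd {shlex.quote(workdir)} || exit 1"]
    for name, value in env.items():
        # Quote values so spaces or special characters survive the shell.
        lines.append(f"export {name}={shlex.quote(value)}")
    command = " ".join(shlex.quote(part) for part in (app_path, *app_args))
    # exec replaces the shell with the application process.
    lines.append(f"exec {command}")
    return "\n".join(lines) + "\n"


script = make_launch_script(
    "/home/user/bin/my_app",        # hypothetical application
    "/home/user/run",               # hypothetical working directory
    {"OMP_NUM_THREADS": "8"},
    ["--input", "data.bin"],
)
print(script)
```

Save the output on the target system, mark it executable, and specify the script as the application to launch.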


Analyze Linux* or Windows* profiling data on your OS X* host

Use a Mac* computer as your main system?  Now you can host the VTune Amplifier GUI on Mac computers running OS X to view remotely collected results, including the ability to configure and launch remote collection to supported Linux systems.

Once you have registered your Windows or Linux product, an OS X viewer is available for download without additional cost (see below).  It will use your existing Windows or Linux license.  Note: performance profiling on Mac computers is not available.

The VTune Amplifier XE viewer for OS X is available as a separate download in the Intel Software Development Products Registration Center.

After clicking "Version 2015" in the right column, click on the .dmg file to download it, or use the download manager.

After downloading the vtune_amplifier_xe_2015.dmg file, follow these steps to install the software:

  • Install instructions
    • Open up permissions to "/Users/Shared/Library/Application Support" to allow the installation of the license file.
    • Start the 'Finder' application on your OS X* system.
    • Find the file 'vtune_amplifier_xe_2015.dmg'
    • Open/Click on the .dmg file to mount the disk-image.
    • In the newly opened window, double-click the 'vtune_amplifier_xe_2015.mpkg' item to start the installation.
    • Respond to the installation procedure/wizard specifying license/registration type.
    • All GUI applications use the 'Applications' folder as their destination. As a result of a successful installation, 'VTune Amplifier XE 2015' should be created in 'Applications' folder.
    • You may start VTune Amplifier XE 2015 by double-clicking on it in the 'Applications' folder.
  • Un-install instructions
    • Ensure that the 'VTune Amplifier XE 2015' application is closed.
    • Open the 'Finder' application.
    • Drag the 'VTune Amplifier XE 2015' application from the 'Applications' folder (or from wherever it was installed) and drop it in the 'Trash'.

Reduce overhead by limiting stack depth

Reduce collection overhead for custom event-based sampling analysis types using the new option to limit call stack depth (in system pages).  Use the '-stack-depth' collector knob on the command line, or the corresponding "Stack size" control in the Custom Analysis dialog for hardware event-based sampling.


Import externally collected data

Extend your analysis by importing externally collected data into existing results. VTune Amplifier provides the ability to correlate interval or discrete data, provided by an external collector, with the regular data collected by the profiler.  To learn more, refer to the “Adding External Data to the Intel® VTune™ Amplifier” topic in the product help.

You can extend standard VTune Amplifier performance analysis and launch a custom data collector directly from the VTune Amplifier. Your custom collector can be the application you analyze with the VTune Amplifier or a separate collector that is launched by the VTune Amplifier. To learn how to configure and launch a custom collector from the GUI and the command line, see the “Using a Custom Collector” help topic.

> amplxe-cl -collect hotspots -knob custom-collector="python.exe C:\work\custom_collector.py" -- notepad.exe 

VTune Amplifier can process and integrate performance statistics collected externally, either with a custom collector or by your target application, in parallel with the native VTune Amplifier analysis. To achieve this, provide the collected custom data as a CSV file with a predefined structure and save this file to the VTune Amplifier result directory.
VTune Amplifier can load and process the following data types:
•    Interval data with start time and end time
•    Samples with a set of counters
To make the VTune Amplifier interpret the custom statistics from the CSV file, make sure the file format meets the requirements specified in the “Creating a CSV File with External Data” help topic.
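A custom collector written in Python could emit interval data along the following lines. This is only a sketch: the column names, time units and sample intervals are placeholders, not the predefined structure VTune Amplifier expects, so take the real header and file naming rules from the “Creating a CSV File with External Data” help topic.

```python
import csv
import io


def write_interval_csv(stream, intervals):
    """Write (name, start, end) interval records as CSV rows."""
    writer = csv.writer(stream, lineterminator="\n")
    # Placeholder header: replace with the columns the help topic requires.
    writer.writerow(["name", "start_time", "end_time"])
    for name, start, end in intervals:
        if end < start:
            raise ValueError(f"interval {name!r} ends before it starts")
        writer.writerow([name, start, end])


# Hypothetical intervals; a real collector would write to a file
# in the VTune Amplifier result directory instead of a string buffer.
buffer = io.StringIO()
write_interval_csv(buffer, [("frame_1", 0.0, 0.016), ("frame_2", 0.016, 0.035)])
print(buffer.getvalue())
```

Validating that each interval ends after it starts catches the most common collector bug before the file is imported.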


Intel® Transactional Synchronization Extensions (Intel® TSX) Exploration analysis

Use the TSX Exploration analysis for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel® TSX, see Web resources about Intel® Transactional Synchronization Extensions.

NOTE: The analysis is supported only on Intel processors with the Intel® TSX feature enabled.  Due to recently published errata, systems may have this feature disabled by default.

The tuning process consists of 2 steps:

  1. Measuring transactional success
    The first step is to measure the transactional success in an application. 
    Select the 'TSX Exploration' analysis type and choose ‘1. Transactional success’ from the ‘Analysis Step’ combo box.

    Three metrics are collected:
    a)    Clockticks – total number of unhalted cycles collected
    b)    Transactional Cycles – number of cycles spent during transactions. If it is near zero then the application is either not using lock-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.
    c)    Abort Cycles - number of cycles spent during transactions which were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (but not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. In that case, identify the causes of the transactional aborts and reduce them, as described in the next step.
  2. Sampling transactional aborts
    Select the 'TSX Exploration' analysis type and choose the ‘2. Aborts’ option from the ‘Analysis Step’ combo box.

    As a result of this analysis, you’ll see where the transaction aborts are happening and for what reason. Possible reasons include:
    a)    Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort in the implementation.
    b)    Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.
    c)    Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.
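The decision logic of step 1 can be written down as a short Python sketch. The source gives only qualitative guidance ("near zero", "small relative to", "almost the same as"), so the numeric thresholds below are hypothetical defaults chosen for illustration, as are the function name and the returned labels.

```python
def assess_transactional_success(clockticks, transactional_cycles, abort_cycles,
                                 min_tx_share=0.01, low_abort_share=0.10,
                                 high_abort_share=0.90):
    """Classify step-1 results; all thresholds are illustrative assumptions."""
    if clockticks <= 0:
        raise ValueError("clockticks must be positive")
    # Near-zero transactional cycles: the lock elision path is not being used.
    if transactional_cycles / clockticks < min_tx_share:
        return "no-transactions"
    abort_share = abort_cycles / transactional_cycles
    if abort_share <= low_abort_share:
        return "high-success"          # no further tuning required
    if abort_share >= high_abort_share:
        return "mostly-aborting"       # lock elision is not paying off
    return "needs-abort-sampling"      # continue with step 2


print(assess_transactional_success(1_000_000, 400_000, 390_000))
```

Any result other than "high-success" or "no-transactions" is a reason to continue with the abort sampling step and look at the instruction, data conflict and capacity causes.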

OpenCL™ Software Technology Kernel Analysis

OpenCL software technology kernel analysis just got better with metrics for memory transfers and visualization of APIs, computing queues and SIMD widths.

If your application uses OpenCL software technology and is doing substantial computational work on the GPU, capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics by enabling the 'Trace OpenCL kernels on Processor Graphics' option during analysis configuration. To view information about all OpenCL kernels running on the GPU, in the Graphics tab of the analysis results switch the grouping to 'Computing Task Purpose / Computing Task (GPU) / Instance'. VTune Amplifier identifies the following computing task purposes: 
a)    Compute (kernels)
b)    Transfer (OpenCL routines responsible for transferring data from the host to a GPU)
c)    Synchronization (for example, clEnqueueBarrierWithWaitList)

The corresponding columns show the overall time a kernel ran on the GPU and the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), working group sizes, as well as averaged GPU hardware metrics collected for a kernel. The cell is highlighted (pink) when there is a potential tuning opportunity. Hover over the cell to read the issue description.

To view details of OpenCL kernel submission, in particular to distinguish the order of submission from the order of execution and to analyze the time spent in the queue, zoom in and explore the Computing Queue data in the Timeline pane. Click a kernel task to highlight its whole path from the queue to the execution displayed in the top layer.

Synchronization tasks are marked with vertical hatching. Data transfers are marked with cross-diagonal hatching. For more details, please refer to the “Analyzing Applications Using Intel® HD Graphics” and “Interpreting GPU OpenCL™ Application Analysis Data” topics in the product help.


Auto-driver rebuild

Did you update your Linux kernel and now the sampling driver won’t load?  No worries!  With the new auto-rebuild feature, the sampling driver detects the kernel update and automatically attempts to rebuild and load the driver.

Starting with this release, if the boot scripts have been installed so that the sampling drivers are automatically loaded during boot time, the boot scripts check for a change in the kernel and automatically rebuild the driver at boot time. If the rebuild succeeds, the new drivers are loaded so that samples can be collected with the updated kernel.  For this feature to work, make sure to update the kernel sources whenever you update the running kernel.


Driver-less Event-Based Sampling collection

Can’t install the Intel event-based sampling driver on Linux because IT won’t let you have root access? Advanced analysis is available even if you can’t install the Intel event-based sampling driver.

Driver-less event-based sampling is supported for the Advanced Hotspots, General Exploration and Custom analysis types on Linux* operating systems based on kernel 2.6.32 or higher, which exports CPU PMU programming details through the /sys/bus/event_source/devices/cpu/format file system. This driver-less sampling collection mode is based on the Linux perf* functionality. VTune Amplifier automatically enables driver-less collection if the Intel event-based sampling driver cannot be installed during product installation.

NOTE:  The Intel event-based sampling driver provides additional features not available in perf, such as:

  • Stacks
  • Uncore events
  • Multiple precise events
  • New events for the latest processors, even on older OSes
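To check in advance whether a Linux system meets the two stated prerequisites for driver-less collection (kernel 2.6.32 or higher, and the PMU format directory present), a short Python sketch such as the following can be used. It is not how VTune Amplifier itself performs the check, and the function names are hypothetical.

```python
import os
import platform
import re

PMU_FORMAT_DIR = "/sys/bus/event_source/devices/cpu/format"


def parse_kernel_version(release):
    """Turn a release string such as '2.6.32-431.el6' into (2, 6, 32)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (for example '4.4') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def driverless_sampling_possible(release, format_dir=PMU_FORMAT_DIR):
    """True if the kernel is new enough and exports the PMU format directory."""
    return parse_kernel_version(release) >= (2, 6, 32) and os.path.isdir(format_dir)


if platform.system() == "Linux":
    print(driverless_sampling_possible(platform.release()))
```

A `False` result on a recent kernel usually means the PMU is not exposed, which can happen inside some virtual machines.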

NMI Watchdog timer automatically disabled during EBS data collection

The Non-Maskable Interrupt (NMI) watchdog timer causes incorrect results in PMU event-based sampling (EBS) analysis.
Previously, VTune Amplifier XE refused to perform EBS collection while the nmi_watchdog was on, and the user had to disable it manually.
Now the nmi_watchdog timer is disabled automatically for the duration of EBS collection. No more hassles turning it on and off.  Profiling just works!
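If you want to confirm the watchdog state yourself, for example before and after a collection, the Linux kernel exposes it in /proc/sys/kernel/nmi_watchdog. The Python sketch below only reads that file; it does not change the setting, and the function names are hypothetical.

```python
from pathlib import Path

NMI_WATCHDOG_FILE = Path("/proc/sys/kernel/nmi_watchdog")


def nmi_watchdog_enabled(raw_value):
    """Interpret the file content: any non-zero integer means enabled."""
    return int(raw_value.strip()) != 0


def read_nmi_watchdog(path=NMI_WATCHDOG_FILE):
    """Return True or False, or None if the state cannot be read."""
    try:
        return nmi_watchdog_enabled(path.read_text())
    except (OSError, ValueError):
        # File missing (non-Linux system, restricted container) or unparsable.
        return None


print(read_nmi_watchdog())
```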


Perf data visualization

Are you collecting event-based sampling data with the Linux ‘perf’ tool?  Visualize it now in the VTune Amplifier GUI for enhanced analysis!

Run the perf collection with the predefined command line options:

  • For application analysis:
    > perf record -o <trace_file_name>.perf -e cpu-cycles,instructions <application_to_launch>
  • For process analysis:
    > perf record -o <trace_file_name>.perf -e cpu-cycles,instructions -p <PID> sleep 15

where the -e option is used to specify a list of events to collect as -e <list of events>.

Then import the *.perf file(s) into the VTune Amplifier project by using the Import option in the GUI or on the command line.
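When perf collections are scripted, the two command forms above can be produced by one small helper. The Python sketch below is an illustration only: the function name, the trace name and the application arguments are hypothetical, while the perf options are the ones shown above.

```python
def perf_record_command(trace_name, events, app_args=None, pid=None, seconds=15):
    """Build a 'perf record' argument list for an application or a running process."""
    if (app_args is None) == (pid is None):
        raise ValueError("specify exactly one of app_args or pid")
    command = ["perf", "record", "-o", f"{trace_name}.perf", "-e", ",".join(events)]
    if pid is not None:
        # Attach to the process and record for as long as 'sleep' runs.
        command += ["-p", str(pid), "sleep", str(seconds)]
    else:
        command += list(app_args)
    return command


events = ["cpu-cycles", "instructions"]
print(" ".join(perf_record_command("my_trace", events, app_args=["./my_app"])))
print(" ".join(perf_record_command("my_trace", events, pid=1234)))
```

Keeping the `.perf` extension in one place makes sure every trace file matches the pattern expected at import time.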


Linux build-id feature support

VTune Amplifier automatically resolves symbols for modules with build-id and separate files with debug information.

For more complete information about compiler optimizations, see our Optimization Notice.