Intel® Advisor Release Notes and New Features
By Vinutha S V,
Published: 12/04/2020  Last Updated: 03/18/2021
This page provides the current Release Notes for Intel® Advisor. The notes are categorized by year, from newest to oldest, with individual releases listed within each year.
Click a version to expand a summary of the new features and changes in that release since the previous one and to access the download buttons for the detailed release notes, which include important information such as prerequisites, software compatibility, installation instructions, and known issues.
All files are in PDF format - Adobe Reader* (or compatible) required.
To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.
2021.2 Release
Overview
- Intel® Advisor has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.
- Expanded hardware support: Vectorization and Code Insights and CPU / Memory Roofline perspectives now support 11th Gen Intel Core processors (code-named “Tiger Lake”), 3rd Gen Intel® Xeon® Scalable Processors (code-named “Ice Lake”), and 2nd Gen Intel® Xeon® Scalable Processors (code-named “Cooper Lake”).
- More accurate GPU data transfer estimation: Use the new data reuse analysis in the Offload Modeling perspective to identify kernels that can reuse the same memory objects and to optimize data transfer costs.
- Improved user interface:
- Explore application source code and related metrics using a new full-screen Source view with syntax highlighting introduced for Offload Modeling and GPU Roofline Insights perspective reports
- Gain insights into GPU kernel execution with a new Details pane introduced for the GPU Roofline Insights perspective report
NOTE: The 2021.2 release conflicts with the 2021.1 release if the latter was installed with the YUM, DNF, or Zypper package managers. Remove the 2021.1 release before installing the 2021.2 release. To install both releases side by side, use the web distribution. For details, see Installation Using Package Managers.
2021.1
Initial Release
Overview
- Offload Advisor: Get your code ready for efficient GPU offload even before you have the hardware. Identify offload opportunities, quantify potential speedup, locate bottlenecks, estimate data transfer costs, and get guidance on how to optimize.
- Automated Roofline Analysis for GPUs: Visualize actual performance of GPU kernels against hardware-imposed performance limitations and get recommendations for effective memory vs. compute optimization.
- Memory-level Roofline Analysis: Pinpoint exact memory hierarchy bottlenecks (L1, L2, L3 or DRAM).
- Flow Graph Analyzer support for DPC++: Visualize asynchronous task graphs, diagnose performance issues, and get recommendations to fix them.
- Intuitive User Interface: New interface workflows and toolbars incorporate Roofline Analysis for GPUs and Offload Advisor.
- Intel® Iris® Xe MAX graphics support: Roofline analysis and Offload Advisor now support Intel® Iris® Xe MAX graphics.
2020
Update 3
Overview
- Intel® Advisor has been updated to include more recent versions of 3rd party components, which include functional and security updates.
Update 2
Overview
- Intel® Advisor has been updated to include more recent versions of 3rd party components, which include functional and security updates.
Intel® Advisor 2020 Update 2 introduces the Memory-Level Roofline feature (previously known as Integrated Roofline, a tech preview feature). Memory-Level Roofline:
- Visualizes arithmetic intensity for a loop/function at each memory level.
- Supports the classical Roofline model by looking only at DRAM data transfers.
- Identifies primary bottlenecks for loops/functions based on cache simulation data.
- Provides single-kernel Roofline guidance with optimization steps.
- Calculates memory metrics on different cache levels (L1, L2, L3, and DRAM).
Update 1
Overview
- Improvements to Integrated Roofline (in technical preview): easily identify the memory level limiting the performance of a loop or function and get more accurate guidance and recommendations
- Python* API now supports Python 3 versions. Python 2 support is dropped.
Initial Release
Overview
- Intel Advisor has been updated to include more recent versions of 3rd party components, which include functional and security updates.
- Intel Advisor viewer for macOS* is now notarized to run on macOS* 10.15.
2019
Update 5
Overview
- Intel Advisor has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.
- Roofline Guidance in Code Analytics. This feature is intended to make the Roofline chart easier to understand and use.
- Integration support for Visual Studio 2019 Update 1.
- Improved Roofline configuration menu for easier chart customization.
- Updated command line help content.
Update 4
Overview
- Removed requirement for flock operation in file systems. Previously Advisor failed when flock was disabled, requiring projects to be kept locally. It should now run smoothly on file systems like Lustre*.
- Added per-loop and per-application recommendations to the Summary pane. This replaces and expands on the "top 5 recommendations" section which was removed in the previous update.
- Survey and Code Analytics now support and identify which loops/functions use CLX VNNI instructions, which are often used in neural network applications.
- Column configurator has been extended to the Top-Down pane. Previously this customization feature was limited to the Survey pane.
- FLOPS and Trip Counts collections now extend existing data rather than replacing it. For example, if Trip Counts are collected, and then a second analysis is run to collect FLOPS only, the FLOPS data will be added to the result without removing the Trip Counts (see the command-line sketch after this list).
- New Preview Feature: Roofline Guidance in Code Analytics. This feature is intended to make the Roofline chart easier to understand and use. Enable by setting the environment variable ADVIXE_EXPERIMENTAL=roofline_guidance
- Support for Microsoft* Visual Studio* 2019 added.
- Flow Graph Analyzer:
- Support for Red Hat Enterprise Linux* 7 restored.
- Bug fix to restore graph properties after interaction with hierarchical view.
- Added support for the use of edges across graphs. This use case is restricted by default, but can be enabled from the Preferences dialogue.
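Example of the incremental FLOPS / Trip Counts collection mentioned above (a sketch using options shown elsewhere in these notes; the project directory and application are placeholders):
advixe-cl -collect tripcounts -project-dir MyProject -- MyApplication
advixe-cl -collect tripcounts -no-trip-counts -flop -project-dir MyProject -- MyApplication
After the second run, the result contains both the Trip Counts data and the newly collected FLOPS data.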
Update 3
Overview
- Improved Summary pane with a sleeker look and better program-level issue highlighting, including a new block with memory bandwidth information.
- Survey columns can now be customized using the column configurator. Configurations can be saved for later use.
- Cache Simulator configuration dialog allows visual configuration of different cache levels for Memory Access Patterns analysis.
- Simplified installation and licensing (serial numbers and license files are no longer required for this product).
- Roofline improvements:
- Improved Roofline comparison feature visually distinguishes the compared result sets. Advisor automatically maps corresponding dots from compared results to each other with arrows and displays FLOPS delta. Filtering the roofline also takes compared results into account.
- Roof values can be adjusted to the number of sockets used on the system to achieve more accurate performance ceilings for applications using thread pinning (such as MPI applications pinning ranks to sockets).
- You can now export roofline snapshots in SVG format for high-quality pictures.
- Added zone highlighting to the Roofline to visually indicate what types of bottleneck apply to dots based on position.
- Added automatic smart roof selection that highlights relevant roofs based on the types of instruction present.
- Flow Graph Analyzer improvements:
- Flow Graph Analyzer can be launched from Advisor GUI on Windows and Linux.
- Can now display execution traces on CPU/core swimlanes in the timeline charts.
- Can now select different layout schemes for the tree-map view with the squarified layout being the default.
- Tree-map display performance improved by caching the computation in the metaxml file.
- Scalability analysis now abridges long-running computations. Users have control over how much they want a computation abridged.
- Support for TBB’s lightweight policy added to scalability analysis.
- Support for Fedora* 29 added.
Details
Roofline Enhancements
The Roofline Chart now has optional zone highlighting to visually indicate which boundary types apply to a given loop or function. The leftmost region contains loops that would be limited by the highest memory-related roof before reaching a compute-related roof, so these loops are memory bound. The rightmost region contains loops that would be limited by the highest compute-related roof before encountering a memory-related roof, so these are compute-bound. The center region is mixed, and contains loops that may be affected by either or both types of bottleneck.
The ability to compare Roofline Charts has been vastly improved. Data points from different sets now display as different shapes, and equivalent entries are linked with arrows. Previously, attempting to compare results from multiple steps of optimization created a cloud of circles and required manual identification of individual loops to see how they moved as a result of code changes. The visual enhancements in this update enable you to see at a glance how each loop was affected by each optimization.
Filtering has also been adjusted to account for equivalent loops in multiple data sets.
Improved Summary pane
The new Summary pane has a sleeker look and better program-level issue highlighting, including a new block with memory bandwidth information.
Customized column configurator
The survey view now allows you to select the columns you would like to display and save the configuration.
Update 2
Overview
- Intel® Advisor 2019 Update 2 includes functional and security updates. Users should update to the latest version.
Update 1
Overview:
- Ability to switch between “all integer operations” and “pure compute integer operations” in the Survey Grid column settings
- Integrated Roofline (preview)
- Ability to select mode of memory-related metrics by cache level and memory operations type (Loads, Stores, both operations) in the Survey Grid column settings
- Ability to export Roofline html report for different memory levels and operation types via command line interface
- Ability to export Integer and INT+FLOAT operations Roofline html report via command line interface
- Detecting usage of VNNI instructions in Traits column
- Introduced a 64-bit graphical user interface on Windows that allows working with very large results
- Added an “expand sub-tree” context menu option to the Top-down tree, simplifying exploration of long call chains
- Decreased overhead for Trip Counts, FLOPS, and Roofline analyses with stacks collection
- Recommendations enhancements:
- Added recommendation for Reciprocal instruction in AVX512
- Added recommendation to enable code alignment with “#pragma code_align” for efficient use of L1 instruction cache
- Added recommendation on vector length overriding in ICC 2019
- Updated recommendation "Use the Fortran 2008 CONTIGUOUS attribute" by adding new compiler options
- New OS support:
- Red Hat* Enterprise Linux* 6.10
- Ubuntu* 18.10
- macOS* 10.14
Initial Release
Overview:
New Since Last Update:
- Optimize integer calculations using Integer Roofline analysis
- Get a more accurate memory footprint and check multiple hardware configurations with cache simulation
- macOS* user interface for viewing and analyzing data collected on Linux* or Windows*
- Productively prototype graph algorithms with Flow Graph Analyzer
- New recommendation: optimize standard algorithms in C++ with Parallel STL
New Since 2018:
- Roofline enhancements
- Roofline with Callstacks is now available (more information)
- Share results with colleagues by exporting a dynamic HTML copy of the roofline analysis: advixe-cl -report roofline -report-output=path/to/output.html
- "Filter in" an arbitrary subset of dots on the Roofline chart to simplify analysis of complex charts and to save only the selected dots to bitmaps
- Ability to adjust the Roofline to a custom number of threads to see the practical performance limits for a given application.
- Roofline benchmarks are now synchronized for multi-rank MPI applications, so that the roof values will be the same on all ranks running on the same physical node
- Ability to compare several roofline results on the same chart
- Collect roofline data with a single CLI command: advixe-cl -collect roofline …
- Reduce overhead for faster analysis results using selective profiling
- Decrease overhead for Memory Access Patterns and Dependencies analyses by limiting loop call count and analysis duration in project properties
- Selective profiling for Roofline, FLOPS and Trip Counts collections to decrease analysis scope and decrease overhead
- Usability improvements
- Font size can now be customized in the Options menu. This may help adjust the GUI appearance in SSH X-forwarding sessions
- Ability to select loops on the command line by source file and line: advixe-cl --mark-up-loops --select main.cpp:12,other.cpp:198
- Easily generate HTML report with python: advixe-python to_html.py ProjectDir
- Recommendations tab improved:
- The Recommendations tab has a new, easier to use layout
- Recommendations now include parameters specific to your code, such as a suggested unroll factor or the name of a function to be inlined
- New recommendation: use non-temporal store (NTS) instructions to improve memory bound application performance
Preview Features:
- Integrated Roofline showing which exact memory layer is the bottleneck for each loop. Set the environment variable ADVIXE_EXPERIMENTAL=int_roofline to activate this feature. (NEW)
Operating System Support:
- Fedora* 28 (NEW)
- Red Hat* Enterprise Linux* 7.5 (NEW)
- SUSE* Linux Enterprise Server* 15 (NEW)
- macOS*: 10.11.x, 10.12.x, and 10.13.x (NEW)
- Microsoft* Windows* 10 build 17134
- Ubuntu* 18.04
- SUSE* Linux Enterprise Server* 12 SP3
Details:
Cache Simulator
Intel® Advisor now includes a Cache Simulation feature, allowing you to get accurate memory footprints and miss information for your application. You can enable this feature and set the specifications for your cache in the Memory Access Patterns tab of the project properties. The simulator can be set to model misses and either cache line utilization or footprint. On the command line, the following flags control the Cache Simulator during a MAP analysis:
- -enable-cache-simulation
- -cachesim-mode=<footprint/cache-misses/utilization>
- -cachesim-associativity=<number>
- -cachesim-sets=<number>
- -cachesim-cacheline-size=<number>
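For example, a Memory Access Patterns collection with cache simulation enabled might look like this (the project directory, application, and cache parameters are illustrative placeholders):
advixe-cl -collect map -enable-cache-simulation -cachesim-mode=cache-misses -cachesim-associativity=8 -cachesim-sets=64 -cachesim-cacheline-size=64 -project-dir MyProject -- MyApplication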
With the environment variable ADVIXE_EXPERIMENTAL=int_roofline set, the Cache Simulator functionality is expanded to enable the integrated roofline preview feature. This version of the Cache Simulator models multiple levels of cache for data such as counts of loaded or stored bytes for each loop. Optionally, it can be set to simulate a specific cache configuration. The format for a configuration is the specifications for each cache level, strung together with slashes, starting with level 1. Each level’s specification is formatted as count:ways:size.
For example, 4:8w:32k/4:4w:256k/16w:6m is the configuration for
- Four eight-way 32KB level 1 caches
- Four four-way 256KB level 2 caches
- (One) sixteen-way 6MB level 3 cache
The enhanced Cache Simulator can be enabled in the GUI by checking the appropriate checkbox in the Trip Counts section of the Advisor project properties, where an entry field is also provided for cache configurations. On the command line, add the -enable-cache-simulation flag during a trip counts analysis. Configurations can be specified using the -cache-config=yourconfighere flag.
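As a sketch, enabling the Integrated Roofline preview and collecting Trip Counts and FLOP data with the example cache configuration above might look like this on Linux* (the project directory and application are placeholders):
export ADVIXE_EXPERIMENTAL=int_roofline
advixe-cl -collect tripcounts -flop -enable-cache-simulation -cache-config=4:8w:32k/4:4w:256k/16w:6m -project-dir MyProject -- MyApplication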
Integer roofline
Integer operation support has been added to the Roofline feature. Previously, only floating point operations (FLOPs) were recorded. Additional roofs have been added for peak integer performance levels, and the Roofline can be set to count only FLOPs, only IntOPs, or all operations. Intel® Advisor will adjust the positions and presence of dots and roofs according to this setting automatically.
Integrated Roofline
The addition of the Cache Simulation feature has enabled the implementation of the Integrated Roofline model as an experimental feature enabled by setting the environment variable ADVIXE_EXPERIMENTAL=int_roofline. Until now, the Roofline model in Intel® Advisor has been exclusively Cache-Aware, a model that calculates Arithmetic Intensity based on all memory traffic. While this functionality is preserved in the L1/CARM setting, Integrated Roofline also includes charts which calculate Arithmetic Intensity based on traffic in specific levels of cache, as predicted by the Cache Simulator. Data points from any combination of these charts can be displayed together using a dropdown.
The data points from a specific level’s chart should be read in relation to that level’s roof, and cannot break through it. This is not as severe a restriction as it may sound, because Arithmetic Intensity is variable in non-L1 levels of the Integrated Roofline. In the CARM, all performance optimizations change the vertical position of a data point, but the AI remains constant (barring alterations to the algorithm by either programmer or compiler). In the other layers of Integrated Roofline, memory optimizations may, in addition to improving performance, increase the AI of the data point – improving the memory usage should move the data point to the right.
This behavior helps you identify memory bottlenecks. While the method of identifying compute bottlenecks remains the same as before (check the metrics associated with the roofs above), bandwidth limitations can now be identified by comparing each level’s data point to its roof. As no point can break its bandwidth roof, those that are pressed against their roofs are clear bottlenecks, and must be allowed to move upward by increasing their arithmetic intensity through memory optimization.
Flow Graph Analyzer
Flow Graph Analyzer (FGA) is available as a feature of Intel® Advisor. This new tool, found in the same installation directory as Advisor itself, provides a convenient GUI based approach to design and analysis of parallel applications that rely on the Intel® Threading Building Blocks (Intel® TBB) flow graph interface. The Intel® TBB library is a widely used C++ template library that provides features that enable developers to easily create parallel applications to take advantage of multicore architectures and heterogeneous systems. The flow graph interface was introduced to Intel TBB in 2011 to exploit parallelism at higher levels by providing efficient implementations of dependency graphs and data flow algorithms.
2018
Update 4
Overview:
- Optimize integer calculations using Integer Roofline analysis
- Get a more accurate memory footprint and check multiple hardware configurations with cache simulation
- New recommendation: optimize standard algorithms in C++ with Parallel STL
- Preview Feature: Integrated Roofline showing which exact memory layer is the bottleneck for each loop. Set the environment variable ADVIXE_EXPERIMENTAL=int_roofline to activate this feature.
- New OS Support:
- Fedora* 28
- Red Hat* Enterprise Linux* 7.5
- SUSE* Linux Enterprise Server* 15
Details:
Cache Simulator
Intel® Advisor now includes a Cache Simulation feature, allowing you to get accurate memory footprints and miss information for your application. You can enable this feature and set the specifications for your cache in the Memory Access Patterns tab of the project properties. The simulator can be set to model misses and either cache line utilization or footprint. On the command line, the following flags control the Cache Simulator during a MAP analysis:
- -enable-cache-simulation
- -cachesim-mode=<footprint/cache-misses/utilization>
- -cachesim-associativity=<number>
- -cachesim-sets=<number>
- -cachesim-cacheline-size=<number>
With the environment variable ADVIXE_EXPERIMENTAL=int_roofline set, the Cache Simulator functionality is expanded to enable the integrated roofline preview feature. This version of the Cache Simulator models multiple levels of cache for data such as counts of loaded or stored bytes for each loop. Optionally, it can be set to simulate a specific cache configuration. The format for a configuration is the specifications for each cache level, strung together with slashes, starting with level 1. Each level’s specification is formatted as count:ways:size.
For example, 4:8w:32k/4:4w:256k/16w:6m is the configuration for
- Four eight-way 32KB level 1 caches
- Four four-way 256KB level 2 caches
- (One) sixteen-way 6MB level 3 cache
The enhanced Cache Simulator can be enabled in the GUI by checking the appropriate checkbox in the Trip Counts section of the Advisor project properties, where an entry field is also provided for cache configurations. On the command line, add the -enable-cache-simulation flag during a trip counts analysis. Configurations can be specified using the -cache-config=yourconfighere flag.
Integer roofline
Integer operation support has been added to the Roofline feature. Previously, only floating point operations (FLOPs) were recorded. Additional roofs have been added for peak integer performance levels, and the Roofline can be set to count only FLOPs, only IntOPs, or all operations. Intel® Advisor will adjust the positions and presence of dots and roofs according to this setting automatically.
Integrated Roofline
The addition of the Cache Simulation feature has enabled the implementation of the Integrated Roofline model as an experimental feature enabled by setting the environment variable ADVIXE_EXPERIMENTAL=int_roofline. Until now, the Roofline model in Intel® Advisor has been exclusively Cache-Aware, a model that calculates Arithmetic Intensity based on all memory traffic. While this functionality is preserved in the L1/CARM setting, Integrated Roofline also includes charts which calculate Arithmetic Intensity based on traffic in specific levels of cache, as predicted by the Cache Simulator. Data points from any combination of these charts can be displayed together using a dropdown.
The data points from a specific level’s chart should be read in relation to that level’s roof, and cannot break through it. This is not as severe a restriction as it may sound, because Arithmetic Intensity is variable in non-L1 levels of the Integrated Roofline. In the CARM, all performance optimizations change the vertical position of a data point, but the AI remains constant (barring alterations to the algorithm by either programmer or compiler). In the other layers of Integrated Roofline, memory optimizations may, in addition to improving performance, increase the AI of the data point – improving the memory usage should move the data point to the right.
This behavior helps you identify memory bottlenecks. While the method of identifying compute bottlenecks remains the same as before (check the metrics associated with the roofs above), bandwidth limitations can now be identified by comparing each level’s data point to its roof. As no point can break its bandwidth roof, those that are pressed against their roofs are clear bottlenecks, and must be allowed to move upward by increasing their arithmetic intensity through memory optimization.
Update 3
Overview:
- Roofline Enhancements:
- Experimental Feature: IntOPS-based Roofline. Access this feature by setting the environment variable ADVIXE_EXPERIMENTAL=int_roofline
- Ability to export Roofline chart in HTML format from the command line.
- Roofs can now be scaled to a custom number of threads.
- Usability Improvements:
- MAP analysis can now be stopped by a set condition to reduce collection overhead.
- Batch mode can now be limited to a specified number of top hot innermost loops.
- Bug fixes
- Additional OS Support:
- Microsoft* Windows* 10 build 17134
- Ubuntu* 18.04
- SUSE* Linux Enterprise Server* 12 SP3
Details:
Several improvements have been made to Roofline in this update. Roofline charts can be exported in HTML format from the command line without having to configure a graphical user interface, which will improve ease of use in environments such as clusters. These exported charts can then be moved to another machine and opened in a web browser. The roofs on the chart can also now be scaled to arbitrary numbers of threads; previously, roofs were only automatically configured to the use of all the machine's resources or the use of only one thread, and roofs for applications using a number of threads between these two values previously had to be calculated and entered by hand.
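For reference, exporting the Roofline chart of an existing result to HTML from the command line might look like this (paths are placeholders):
advixe-cl -report roofline -report-output=./roofline.html -project-dir MyProject
The generated HTML file can then be copied to another machine and opened in a web browser.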
An experimental integer-based version of the Roofline has also been added. It can be enabled by setting the environment variable ADVIXE_EXPERIMENTAL=int_roofline before launching Intel Advisor. This feature also includes Total Roofline, which displays both integer and floating point operations on the same chart. Previously, the Roofline was strictly based on floating point operations, and integer operations were not recorded or displayed on the chart.
Update 2
Overview:
- "First Site Instance Footprint" metric added to Memory Access Patterns.
- Roofline improvements:
- Loop performance limits are visualized on mouseover.
- Loops can be filtered into or out of a Roofline chart.
- Multiple Roofline results can be compared on the same chart.
- Roofline benchmarks can be synchronized for multi-rank MPI applications with the "--benchmark-sync" option.
- Controls that aren't visible due to the Roofline window being too small are now accessible from a dedicated button.
- Recommendations improvements:
- The layout of the recommendations pane has been improved.
- Certain recommendations now reflect details of the analyzed code.
- New recommendation added: use non-temporal store (NTS) instructions to improve memory bound application performance.
- Usability improvements:
- Loops can be selected for advanced analysis on the command line using source file and line number.
- Font size can be customized in Options menu.
- Overhead for Memory Access Patterns and Dependencies analyses can be decreased with new project properties to limit loop call counts (Dependencies only) and analysis duration (both analyses).
- Filter state is now persistent on re-opening results.
- New operating system support:
- Fedora* 27
- Ubuntu* 17.10
- Microsoft* Windows* 10 RS3
Details:
A new metric called First Site Instance Footprint has been added to the Memory Access Patterns analysis. This metric reports the number of unique memory locations, in bytes, touched across all iterations of the first call of a loop.
Roofline Improvements
While the distance between a dot on a Roofline and the roofs above it can be used to estimate its performance limits and get an idea of whether there's "a lot" or "a little" room for improvement, Advisor now provides numerical estimates, giving you a much better idea of the limits of what you could get out of optimization.
These estimates display both the potential GFLOPS and times-speedup of a loop, in relation to both the first suitable roof above it and the last roof above it, when the loop is hovered over with the mouse cursor.
Loops can now be filtered out of or into a Roofline chart. Users of Intel® VTune™ Amplifier may be familiar with this functionality. Filtering in on a selection of loops hides all unselected loops, while filtering out hides the selected loops. Filters can be removed as well, restoring all loops to a visible state. These filters are taken into account when saving a chart as a bitmap.
Multiple results can now be compared on the same chart. The compare button can be used to import .advixeexp files from other results or snapshots. These imported results will be faded out unless selected as the active result in the dropdown, allowing you to easily distinguish different result sets. Note that only the dots are imported, not the roofs, so this comparison should only be used for results originating on systems with the same peaks.
Recommendations Improvements
The layout of the recommendations tab has been streamlined for ease of use. A table of contents is provided on the right, while individual recommendations are listed under each issue on the left. Each recommendation has an icon indicating the confidence level by the number of bars next to the light bulb, and can be expanded to display more details by clicking on it. Additionally, some recommendations have been modified to reference actual details specific to the analyzed loop, such as vector length or trip count, rather than generic statements.
Usability Improvements
In previous versions, selecting loops for advanced analysis on the command line required listing their loop IDs, obtained by viewing the survey report. It is now possible to select loops using the source file and line number instead of loop IDs.
advixe-cl --mark-up-loops --select main.cpp:42,foobar.cpp:183
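A subsequent advanced analysis then runs on the marked loops as usual, for example (a sketch assuming a Survey result already exists in MyProject):
advixe-cl -collect dependencies -project-dir MyProject -- MyApplication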
Usability of the advanced analysis types Memory Access Patterns and Dependencies has been further improved by allowing loop call count (for the latter only) and analysis duration (for both) to be limited in the project properties, decreasing the overhead of these analysis types.
Font size can be adjusted in the options menu. This may be helpful for users who wish to adjust the GUI's appearance in SSH X-forwarding sessions.
Update 1
Overview:
- Selective profiling for Roofline, FLOPS and Trip Counts collections to decrease analysis scope and decrease overhead
- General availability of Roofline with callstacks (a.k.a. Hierarchical Roofline)
- Improved finalization speed
- Usability Improvements
- Run Roofline in command line with single command: advixe-cl -collect roofline
- Improved UI responsiveness
- Progress bar for issues detection
- Single html report generated from Python to share data easily in one file
- Suppressed warnings on missing debug info from system modules
- Hints to configure search paths if symbols not resolved
Details:
Selective profiling for Roofline, FLOPS and Trip Counts collections to decrease analysis scope and decrease overhead
You can now control when to start and stop a Survey, Roofline, FLOPS, or Trip Counts collection. To profile your application selectively, modify it to call the ITT pause/resume API, as the following example shows. In the GUI, start the collection in the paused state by clicking the "start paused" button, the button to the right of the "Collect" button.
#include <ittnotify.h>  // declares __itt_resume() and __itt_pause()

int main(int argc, char* argv[])
{
    // Do initialization work here (not profiled when the collection starts paused)
    __itt_resume();   // resume data collection
    // Do profiling work here
    __itt_pause();    // pause data collection
    // Do finalization work here
    return 0;
}
To start paused from the command-line use the -start-paused option.
advixe-cl -collect tripcounts -flop -stacks -start-paused -project-dir MyProjectDirectory -- MyExecutable
General availability of Roofline with callstacks (a.k.a. Hierarchical Roofline)
To collect hierarchical data on the command line, add the -stacks option to the Trip Counts and FLOP collection:
advixe-cl -collect tripcounts -flop -stacks -project-dir MyProjectDirectory -- MyExecutable
Total FLOPS, calculated as Total GFLOP / Elapsed Time for the given level of the call stack, will be presented in the Survey report and in the Top Down tab. If you have enabled the Hierarchical Roofline feature, checking the "Show Roofline with Callstacks" checkbox at the top of the roofline will display dots representing the next level of the call stack, linked to their components, and allow you to collapse and expand the containing dots in the roofline.
Under the traditional cache-aware roofline model (more information here), introduced in previous updates, a large cluster of small, green dots would not immediately indicate where to focus your time, as none of them on their own would make good optimization candidates. If, however, most of these dots collapsed into one large, red dot, the path forward would be much clearer, as this would indicate that they are all called from the same place, and together they are a much clearer candidate for optimization.
Depending on the nature of the code, either applying optimizations to each of the smaller component dots, or perhaps applying an optimization within the calling dot that improved the efficiency of its components overall, may have a large impact on the performance, which would not have been indicated without the realization that the assortment of small dots came from one place.
Usability Improvements
Run Roofline in command line with single command
advixe-cl -collect roofline -project-dir MyProjectDirectory -- MyExecutable
Single html report generated from Python to share data easily in one file
advixe-python to_html.py MyProjectDirectory
Initial Release
Overview:
New Since Last Update:
- Total FLOPS metric for Hierarchical Roofline available in Top Down tab when hierarchical data is collected
- New recommendation: "Possible inefficient conflict-detection instructions present"
- Advisor now reports names of serialized/scalar functions preventing vectorization
- GUI Improvements
- Analysis Start/Toolbar button improvements
- Recommendations Tab navigation/design improvements
- Visibility of the MKL overview and Dynamic Instruction Mix output has been extended.
- General Roofline GUI improvements
New Since 2017 Initial Release:
- Cache-Aware Roofline modeling (more information)
- Filtering by module
- Added the "joined" command line report type that outputs both Survey and Refinement data.
The command format is: advixe-cl -report joined -project-dir MyResults
Preview Features:
- Cache Simulator: set ADVIXE_EXPERIMENTAL=cachesim before launching Advisor (NEW)
- Experimental support for analyzing Python code (NEW)
- Hierarchical Roofline (a modification of the Cache-Aware Roofline)
- Experimental support for viewing Advisor data through a Python API
Examples can be found in {INSTALL_DIR}/pythonapi/examples
Improved OS/Processor/IDE Support:
- Intel® Xeon® Scalable Processors (NEW)
- Fedora* 26 and SUSE* Linux Enterprise Server* 12 SP2 (NEW)
- Microsoft* Windows Server* 2016, Debian* 9, and Ubuntu* 17.04
- Microsoft Visual Studio* 2017 IDE
Details:
Total FLOPS in the Top Down tab
Total FLOPS or Hierarchical FLOPS are now available in the Top Down tab when Hierarchical Data is enabled. Hierarchical FLOPS are used by the Hierarchical Roofline preview feature, but are also available without enabling it.
To collect Hierarchical Data, check the "FLOP with callstacks" checkbox in the Trip Counts tab of the Advisor Project Properties.
Alternatively, if you have set ADVIXE_EXPERIMENTAL=roofline_ex before launching Advisor to enable the Hierarchical Roofline preview feature, simply collecting Hierarchical Roofline data using the checkbox under the roofline button will produce Hierarchical FLOPs.
To collect Hierarchical Data on the command line, use the -callstack-flops flag during the trip counts collection:
advixe-cl -collect tripcounts -flops-and-masks -callstack-flops -project-dir MyProjectDirectory -- MyExecutable
Total FLOPS, calculated as Total GFLOP / Elapsed Time for the given level of the call stack, will be presented in the Survey report and in the Top Down tab. If you have enabled the Hierarchical Roofline preview feature, checking the "Show Hierarchical Data" checkbox at the top of the roofline will display dots representing the next level of the call stack, linked to their components, and allow you to collapse and expand the containing dots in the roofline.
Under the traditional cache-aware roofline model (more information here), introduced in previous updates, a large cluster of small, green dots would not immediately indicate where to focus your time, as none of them on their own would make good optimization candidates. If, however, most of these dots collapsed into one large, red dot, the path forward would be much clearer, as this would indicate that they are all called from the same place, and together they are a much clearer candidate for optimization.
Depending on the nature of the code, either applying optimizations to each of the smaller component dots, or perhaps applying an optimization within the calling dot that improved the efficiency of its components overall, may have a large impact on the performance, which would not have been indicated without the realization that the assortment of small dots came from one place.
GUI Improvements
The standalone GUI in Intel Advisor 2018 has had its toolbar improved by replacing the analysis buttons with a single dropdown/start button.
The icons for the various analysis types have also been replaced with images that better represent their functionality (and also look quite a bit sleeker).
MKL data can now be viewed in the Survey report and on the Roofline graph as individual functions. This data's visibility can be toggled using the MKL toggle switch located at the top of the GUI, alongside existing toggles for displaying vectorized and unvectorized loops/functions. Previously, only the Summary pane breakdown of the overall proportions of MKL versus user code was available.
Dynamic Instruction Mix information can now be viewed on the Code Analytics tab in the GUI. Dynamic Instruction Mixes, added in a previous update as a then-CLI-only feature, are a variation of the Instruction Mixes feature. While the Static Instruction Mixes report how many instructions of a given type exist in the code, the Dynamic Instruction Mixes report how many instructions of a given type were actually executed. It should be noted that the Static Mix is counted per iteration while the Dynamic Mix is for the entire execution of the application.
The Recommendations tab has been overhauled for better readability and usability. A description of the issue(s) as a whole can be found on the right, along with any suggestions on how to proceed. Each recommendation, when clicked, will display a detailed explanation on the left.
Filtering by Module
Filter-by-module capability was added in 2017 update 2, allowing application analysis scope to be narrowed down before collection begins by restricting collection to specific modules. This cuts out extraneous data and reduces overhead for Survey and Trip Counts analysis types. Modules can be filtered by excluding unwanted files or by restricting collection to a specific list of desired modules, using either the command line or the GUI project properties.
Example Command Lines:
advixe-cl -collect survey -module-filter-mode=exclude -module-filter=DoNotAnalyze.so -project-dir MyProject -- MyApplication
advixe-cl -collect survey -module-filter-mode=include -module-filter=AnalyzeMyApp.exe,AnalyzeThisToo.dll -project-dir MyProject -- AnalyzeMyApp.exe
Preview Feature: Cache Simulator
Effective memory cache usage is critical to getting optimal performance out of an application. The Cache Simulator preview feature provides insight into how well your program is using memory by providing information on cache line utilization, cache misses, and memory loads and stores. The simulation can be customized for specific cache types and sizes.
To collect cache simulation data, first enable the preview feature by setting the environment variable ADVIXE_EXPERIMENTAL=cachesim before launching Advisor, then check the "Enable CPU cache simulation" box in the Memory Access Patterns tab of the Advisor Project Properties.
Run a Survey analysis as normal. In the Survey results, select the desired loops using the checkboxes in the inkdrop column. When you run a Memory Access Patterns analysis, the cache behavior observed during the data collection will be displayed in the Refinement Report tab. The Cache Line Utilization metric in particular describes the overall quality of the cache usage, which is very important for memory-bound applications.
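For example, on Linux* the preview can be enabled for a GUI session as follows (assuming the Advisor environment is already set up so that advixe-gui is on the PATH):
export ADVIXE_EXPERIMENTAL=cachesim
advixe-gui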
Preview Feature: View Data with Python API
The Python API preview feature originally introduced in 2017 Update 3 has been improved. It is now possible to run a collection from a Python script and to perform custom post-processing of the data. The Python API allows flexible interpretation of Advisor output data according to your selection of relevant information from over 500 metrics, in an easy-to-read HTML format, including Roofline output. This short example prints cache utilization data:
import sys

try:
    import advisor
except ImportError:
    sys.exit(1)

# Formatting helpers for the printed report
indent = '    '
width = 40

# Open the Advisor project passed on the command line and load the MAP result
project = advisor.open_project(sys.argv[1])
data = project.load(advisor.MAP)

for site in data.map:
    site_id = site['site_id']
    cachesim = data.get_cachesim_info(site_id)
    attrs = cachesim.attrs  # full set of cache simulation attributes, if needed
    print(indent + 'Evicted cache lines utilization:')
    print(indent * 2 + 'Average utilization'.ljust(width) + ' = {:.2f}%'.format(cachesim.utilization))
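Assuming the snippet above is saved as cache_util.py (a hypothetical file name), it can be run against a project directory with the bundled launcher, in the same way as the to_html.py example elsewhere in these notes:
advixe-python cache_util.py MyProjectDirectory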
Detailed information on how to use this feature can be found here.
2017
Update 5
Overview:
- Added support for Intel® Xeon® Scalable Processors
- Added support for SUSE* Linux Enterprise Server* 12 SP2
- Added support for Microsoft* Windows Server 2016
- Bug fixes
Update 4
Overview:
- Bug fixes
Update 3
Overview:
- Hierarchical Roofline preview feature. Set the environment variable ADVIXE_EXPERIMENTAL=roofline_ex before launching Advisor (data re-collection may be required).
- Experimental support for accessing Advisor data via a Python API. See the examples in {install_dir}/pythonapi/examples. The API is subject to change in future releases.
- New recommendations:
- Force scalar remainder for loops with low mask utilization on AVX512.
- Extended “Gather recommendation” with “Constant (non-unit)” pattern.
- Roofline enhancements:
- Customization of roof values and chart borders.
- Persistence for roofs visibility and custom values (note that auto save delay is 20 seconds).
- Improved scaling of histogram.
- Joined Survey and Refinement CLI report.
- MKL breakdown in Summary.
- Divided read/write access in Memory Access Patterns report.
- Added support for Microsoft Visual Studio* 2017.
Update 2
Overview
- Cache-Aware Roofline modeling
- Improved Trip Counts and FLOPS
- Filtering by Module
- Dynamic Instruction Mixes
- Re-finalization
- Updated OS Support:
- Support for Windows* Server 2016, Fedora* 25, and Ubuntu* 16.10
- All license types now support cross-OS analysis
Details:
Cache-Aware Roofline
Cache-Aware Roofline, introduced as a preview in the previous update, is now an official feature of Intel® Advisor, and no longer requires setting an environment variable to access. This visual performance model provides insight into the source of performance limitations and reveals how much potential performance your application is currently leaving on the table, allowing you to spend your time efficiently in optimizing where it counts most.
More information about the Roofline feature can be found in the Intel® Advisor Roofline main article.
Roofline modeling was first proposed in a paper by Berkeley researchers in 2009. The model has been improved since then, particularly in a 2013 paper from the Technical University of Lisbon that introduced the Cache-Aware model, which is now available in Advisor as an automatically-generated chart when both Survey and Trip Counts with FLOPS have been collected. The Roofline chart and Survey report can either be viewed on their own using the toggle bar on the left to switch between them, or side by side by clicking the white ribbon with four gray dots (initially located by the toggle bar).
Improved Trip Counts and FLOPS
The general quality and coverage of the Trip Counts collection has been improved. Collecting Trip Counts data also collects a new Call Count metric for functions.
Trip Counts and FLOPS can now be collected independently of each other. In the GUI, there are checkboxes in the work flow. On the command line, FLOPS can be enabled by adding the -flops-and-masks option to a tripcounts collection, while Trip Counts can be disabled by adding -no-trip-counts.
Example Command Lines:
### Collect only FLOPS ###
advixe-cl -collect tripcounts -no-trip-counts -flops-and-masks -project-dir MyProject -- MyApplication
### Collect only Trip Counts ###
advixe-cl -collect tripcounts -project-dir MyProject -- MyApplication
### Collect both ###
advixe-cl -collect tripcounts -flops-and-masks -project-dir MyProject -- MyApplication
Filtering By Module
Application analysis scope can now be narrowed down before collection begins to cut out extraneous data and reduce overhead for Survey and Trip Counts analysis types, by restricting collection to specific modules. Modules can be filtered by excluding unwanted files or by restricting collection to a specific list of desired modules, using either the command line or the GUI project properties.
Example Command Lines:
advixe-cl -collect survey -module-filter-mode=exclude -module-filter=DoNotAnalyze.so -project-dir MyProject -- MyApplication
advixe-cl -collect survey -module-filter-mode=include -module-filter=AnalyzeMyApp.exe,AnalyzeThisToo.dll -project-dir MyProject -- AnalyzeMyApp.exe
Updated Operating System Support
Intel® Advisor now supports Windows* Server 2016, Fedora* 25, and Ubuntu* 16.10.
All license types now have support for cross-operating system analysis. This allows for the use of a single license for both Linux* and Windows* systems, making it easier to collect data on one operating system and view it on the other. Installation packages for additional operating systems can be downloaded from the Registration Center. More information about this feature is available here.
Dynamic Instruction Mixes
Dynamic Instruction Mixes are a new variation of the Instruction Mixes feature. While the previously-existing Static Instruction Mixes report how many instructions of a given type exist in the code, the Dynamic Instruction Mixes report how many instructions of a given type were actually executed. It should be noted that the Static Mix is counted per iteration while the Dynamic Mix is for the entire execution of the application.
Dynamic Instruction Mixes are currently available only on the command line by adding the -mix option to a survey or trip counts report command, and only on a result that has had trip counts collected with the -flops-and-masks flag.
Example Command Line:
advixe-cl -report survey -mix -project-dir MyProject
Re-Finalization
Survey and Trip Counts analysis results can now be re-finalized from the GUI, allowing for easy correction or updating of binary and source search directories. This may be necessary if source files are moved after compilation, or remotely collected results are opened on a viewing machine before setting the search paths appropriately. Correcting these situations is now as simple as changing the search directories and re-finalizing.
Update 1
Overview:
- Cache-Aware Roofline preview feature
- Analysis workflow improvements
- Recommendations display in Summary and Refinement Reports
- New recommendations
- Ability to stop refinement analysis collection if every site has executed at least once.
Details:
Cache-aware roofline modeling
The Intel Advisor now offers a great step forward in visual performance optimization with a new Roofline model analysis feature. This new feature provides insights beyond vectorization, such as memory usage and the quality of algorithm implementation.
To enable this preview feature, set the environment variable ADVIXE_EXPERIMENTAL=roofline before launching the Intel Advisor.
Analysis workflow improvements:
Analysis workflow improvements include
- Intel® Math Kernel Library (Intel® MKL) support: Intel Advisor results now show Intel MKL function calls.
- Improved FLOPs analysis performance.
- Decreased Survey analysis overhead for the Intel® Xeon Phi™ processor.
- New category for instruction mix data: compute with memory operands.
- Finalize button for “no-auto-finalize” results, for when a result is finalized on a separate machine. When collecting, specify the -no-auto-finalize option; you can then copy the project to a different system, open it, specify new paths to your sources and binaries, and finalize it there (see the sketch after this list).
- MPI support in the command-line dialog. You can now generate an MPI command line: after clicking the command-line button, click "generate MPI command-line".
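A collection sketch for the no-auto-finalize workflow described above (project directory and application are placeholders):
advixe-cl -collect survey -no-auto-finalize -project-dir MyProject -- MyApplication
The project directory can then be copied to another system, opened in the GUI, and finalized there after the source and binary search paths are updated.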
Recommendations:
Recommendations are now cached in result snapshots to speed up display. Additionally, recommendations now display in the Summary and Refinement reports.
New recommendation: Vectorize call(s) to virtual method.
Memory analysis:
You can now track the progress of a refinement analysis and stop the collection once every site has executed at least once.
Initial Release
Overview:
- Workflow improvements
- Batch Mode
- Improved MPI workflow
- Memory Access Patterns, Survey and Loop Analytics improvements
Details:
Workflow
Batch mode lets you automate collecting multiple analysis types at once. You can collect Survey and Trip Counts in a single run: Advisor runs the application twice, but automatically, without user actions. For the Memory Access Patterns (MAP) and Dependencies analyses, there are pre-defined auto-selection criteria, for example, checking Dependencies only for loops with the “Assumed dependencies” issue.
The improved MPI workflow lets you create snapshots of MPI results, so you can collect data with the CLI and transfer a self-contained packed result to a workstation with a GUI for analysis. We also fixed some GUI and CLI interoperability issues.
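As a sketch of the snapshot step (the option names here follow the advixe-cl snapshot action as we understand it and are an assumption; paths and the snapshot name are placeholders):
advixe-cl -snapshot -project-dir MyProject -pack -cache-sources -cache-binaries -- MySnapshotName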
Memory Access Patterns
MAP analysis now detects Gather instruction usage, revealing more complex access patterns. A SIMD loop with Gather instructions runs faster than a scalar one, but slower than a SIMD loop without Gather operations. If a loop falls into the “Gather stride” category, check the new “Details” tab in the Refinement report for information about the strides and mask shape of the gather operation. One possible solution, for cases where gather instructions are not actually necessary, is to inform the compiler about your data access patterns via OpenMP 4.x options.
The MAP report is enriched with a Memory Footprint metric: the distance between the address ranges touched by a given instruction. The value represents the maximal footprint across all loop instances.
Variable names are now reported for memory accesses, in addition to the source line and assembly instruction, so you can determine the data structure of interest more accurately. Advisor can detect global, static, stack, and heap-allocated variables.
We added a new recommendation to use SDLT for loops with an “Ineffective memory access” issue.
Survey and Loop Analytics
The Loop Analytics tab now includes trip counts and an extended instruction mix, so you can see the compute vs. memory instruction distribution, scalar vs. vector breakdown, ISA details, and so on.
We have improved the usability of non-executed code path analysis, so you can see the ISA and traits of virtual loops and sort and find AVX512 code paths more easily.
Loops with vector intrinsics are now shown as vectorized in the Survey grid.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.