Identify Performance Bottlenecks Using Roofline

This section shows how to get started using all Vectorization Advisor analyses, starting with the Roofline analysis. The main advantage of using this multi-analysis Vectorization Advisor workflow is the potential to generate an ideal roadmap of optimization steps. The main disadvantage is high runtime overhead. For example:

  • Roofline analysis runtime overhead can be 3x - 8x greater than native target application runtime.

  • Memory Access Patterns (MAP) analysis runtime overhead can be 5x - 20x greater.

  • Dependencies analysis runtime overhead can be 5x - 100x greater.
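To get a rough feel for what these ranges mean for scheduling, the following minimal sketch converts a native runtime into estimated analysis runtimes. It assumes the quoted figures act as simple multipliers of the native runtime. The names OVERHEAD and estimate_runtime are hypothetical and are not part of the Intel Advisor.

```python
# Overhead ranges quoted above, as (low, high) multipliers of native runtime.
# Assumption: "3x - 8x" is treated as a plain multiplier of the native runtime.
OVERHEAD = {
    "Roofline": (3, 8),
    "Memory Access Patterns": (5, 20),
    "Dependencies": (5, 100),
}


def estimate_runtime(native_seconds):
    """Return {analysis: (low_seconds, high_seconds)} for a native runtime."""
    return {
        name: (native_seconds * low, native_seconds * high)
        for name, (low, high) in OVERHEAD.items()
    }


if __name__ == "__main__":
    # Example: a target application that normally runs for 60 seconds.
    for name, (low, high) in estimate_runtime(60).items():
        print(f"{name}: {low / 60:.0f} to {high / 60:.0f} minutes")
```

If the upper estimates are impractical, a smaller workload is usually the simplest way to shorten the Dependencies and MAP runs.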

Intel Advisor Typical Workflow: Identify Performance Bottlenecks Using Roofline

Roofline analysis - Helps visualize actual performance against hardware-imposed performance ceilings, as well as determine the main limiting factor (memory bandwidth or compute capacity). When you run a Roofline analysis, the Intel Advisor:

  • Measures the hardware limitations of your machine and collects loop/function timings using the Survey analysis.

  • Collects floating-point and integer operations data, and memory data using the Trip Counts and FLOP analysis.

Dependencies analysis - Checks for real data dependencies in loops the compiler did not vectorize because of assumed dependencies.

Memory Access Patterns (MAP) analysis - Checks for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses.
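The difference between unit stride and non-unit stride accesses can be illustrated with a short sketch. It classifies the byte addresses a loop touches on successive iterations. The function name classify_stride and its labels are hypothetical. This is not how the MAP analysis is implemented, only an illustration of the terms.

```python
def classify_stride(addresses, element_size):
    """Classify byte addresses touched by successive loop iterations.

    addresses    -- byte addresses in iteration order
    element_size -- size in bytes of one array element
    """
    if len(addresses) < 2:
        return "unknown"
    # Distance in bytes between each pair of consecutive accesses.
    steps = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(steps) > 1:
        return "irregular"  # distance changes from iteration to iteration
    step = steps.pop()
    if step == 0:
        return "zero stride"  # same address on every iteration
    if abs(step) == element_size:
        return "unit stride"  # contiguous elements
    return "constant non-unit stride"


if __name__ == "__main__":
    print(classify_stride([0, 8, 16, 24], 8))  # contiguous doubles
    print(classify_stride([0, 32, 64, 96], 8))  # every fourth double
    print(classify_stride([0, 8, 40, 48], 8))  # varying distance
```

Contiguous (unit stride) accesses are generally the friendliest to vector loads and stores, which is why the MAP analysis calls out the other cases.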

Learn More About Roofline Charts

The Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:

  • Arithmetic intensity (x axis) - measured in number of floating-point operations (FLOPs) and/or integer operations (INTOPs) per byte transferred between the CPU/VPU and memory, based on the loop/function algorithm

  • Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS)

In general:

  • The size and color of each Roofline chart dot represent relative execution time for each loop/function. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.

  • Roofline chart diagonal lines indicate memory bandwidth limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The L1 Bandwidth roofline represents the maximum amount of work that can get done at a given arithmetic intensity if the loop always hits L1 cache. A loop does not benefit from L1 cache speed if a dataset causes it to miss L1 cache too often, and instead is subject to the limitations of the lower-speed L2 cache it is hitting. So a dot representing a loop that misses L1 cache too often but hits L2 cache is positioned somewhere below the L2 Bandwidth roofline.

  • Roofline chart horizontal lines indicate compute capacity limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The Scalar Add Peak represents the peak number of add instructions that can be performed by the scalar loop under these circumstances. The Vector Add Peak represents the peak number of add instructions that can be performed by the vectorized loop under these circumstances. So a dot representing a loop that is not vectorized is positioned somewhere below the Scalar Add Peak roofline.

  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine; however, not all loops can utilize maximum machine capabilities.

  • The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
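The relationship between the diagonal and horizontal rooflines can be sketched with the textbook roofline formula: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The function names and the hardware numbers below are hypothetical. The Intel Advisor measures the real values for your machine.

```python
def attainable_gflops(ai, bandwidth_gb_s, compute_peak_gflops):
    """Roof height at arithmetic intensity `ai` (FLOP/byte), in GFLOPS."""
    return min(compute_peak_gflops, ai * bandwidth_gb_s)


def limiting_factor(ai, bandwidth_gb_s, compute_peak_gflops):
    """Say which roof a loop at arithmetic intensity `ai` sits under."""
    # The ridge point is where the diagonal and horizontal roofs meet.
    ridge = compute_peak_gflops / bandwidth_gb_s
    return "memory bandwidth" if ai < ridge else "compute capacity"


def headroom(achieved_gflops, ai, bandwidth_gb_s, compute_peak_gflops):
    """Ratio of roof height to achieved performance (1.0 means on the roof)."""
    return attainable_gflops(ai, bandwidth_gb_s, compute_peak_gflops) / achieved_gflops


if __name__ == "__main__":
    BANDWIDTH = 100.0  # GB/s, hypothetical
    PEAK = 400.0       # GFLOPS, hypothetical
    for ai, achieved in [(0.5, 10.0), (8.0, 100.0)]:
        print(
            ai,
            attainable_gflops(ai, BANDWIDTH, PEAK),
            limiting_factor(ai, BANDWIDTH, PEAK),
            headroom(achieved, ai, BANDWIDTH, PEAK),
        )
```

A large headroom value corresponds to a dot far below its roof, which is the situation the last bullet above describes.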

In the following Roofline chart representation, loops A and G (large red dots), and to a lesser extent B (yellow dot far below the roofs), are the best candidates for optimization. Loops C, D, and E (small green dots) and H (yellow dot) are poor candidates because they do not have much room to improve or are too small to have significant impact on performance.
This is a visual model, not an actual screenshot, of the Roofline chart.

The Intel Advisor basic roofline model, the Cache-Aware Roofline Model (CARM), offers self data capability. The Intel Advisor Roofline with Callstacks feature extends the basic model with total data capability:

  • Self data = Memory access, FLOPs, and duration related only to the loop/function itself, excluding data originating in other loops/functions called by it

  • Total data = Data from the loop/function itself and its inner loops/functions
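The distinction between self data and total data can be sketched as a simple roll-up over a call tree. The Node class, its field names, and the sample numbers are hypothetical. The sketch only illustrates the definitions above and is not the Intel Advisor data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A loop/function with its self data and the loops/functions it contains."""
    name: str
    self_flops: float
    self_bytes: float
    self_seconds: float
    children: list = field(default_factory=list)


def total(node):
    """Return (flops, bytes, seconds) for the node plus everything below it."""
    flops, nbytes, seconds = node.self_flops, node.self_bytes, node.self_seconds
    for child in node.children:
        child_flops, child_bytes, child_seconds = total(child)
        flops += child_flops
        nbytes += child_bytes
        seconds += child_seconds
    return flops, nbytes, seconds


if __name__ == "__main__":
    inner = Node("inner_loop", self_flops=8e9, self_bytes=4e9, self_seconds=2.0)
    outer = Node("outer_loop", 2e9, 4e9, 2.0, children=[inner])

    flops, nbytes, seconds = total(outer)
    print("self arithmetic intensity:", outer.self_flops / outer.self_bytes)
    print("total arithmetic intensity:", flops / nbytes)
    print("total GFLOPS:", flops / seconds / 1e9)
```

In this example the outer loop moves from an arithmetic intensity of 0.5 (self) to 1.25 (total), which is why a dot can change position when self and total views are compared.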

The total-data capability in the Roofline with Callstacks feature can help you:

  • Investigate the source of loops/functions instead of just the loops/functions themselves.

  • Get a more accurate view of loops/functions that behave differently when called under different circumstances.

  • Uncover design inefficiencies higher up the call chain that could be the root cause of poor performance by smaller loops/functions.

The following Roofline chart representation shows some of the added benefits of the Roofline with Callstacks feature, including:

  • A navigable, color-coded Callstack pane that shows the entire call chain for the selected loop/function, but excludes its callees

  • Visual indicators (caller and callee arrows) that show the relationship among loops and functions

  • The ability to simplify dot-heavy charts by collapsing several small loops into one overall representation

    Loops/functions with no self data are grayed out when expanded and in color when collapsed. Loops/functions with self data display at the coordinates, size, and color appropriate to the data when expanded, but have a gray halo of the size associated with their total time. When such loops/functions are collapsed, they change to the size and color appropriate to their total time and, if applicable, move to reflect the total performance and total arithmetic intensity.


Intel Advisor: Roofline with Callstacks

For more information on how to produce, display, and interpret the Roofline with Callstacks extension to the Roofline chart, see Roofline with Callstacks.

There are several controls to help you show/hide the Roofline chart:
Intel Advisor: Roofline Chart & Survey Report

1

Click to toggle between Roofline chart view and Survey Report view.

2

Click to toggle to and from side-by-side Roofline chart and Survey Report view.

3

Drag to adjust the dimensions of the Roofline chart and Survey Report.

There are several controls to help you focus on the Roofline chart data most important to you, including the following.
Intel Advisor: Roofline controls

1

  • Select one or more loops/functions by tracing a rectangle with your mouse.

  • Zoom in and out by tracing a rectangle with your mouse. You can also zoom in and out using your mouse wheel.

  • Move the chart left, right, up, and down.

  • Undo or redo the previous zoom action.

  • Reset to the default zoom level.

  • Export the chart as a dynamic and interactive HTML or SVG file that does not require the Intel Advisor viewer for display. Use the arrow to toggle between the options.

2

  • Adjust rooflines to see practical performance limits if an application uses fewer threads than available cores.

  • Build roofs for single-threaded applications (or for multi-threaded applications configured to run single threaded, such as one thread-per-rank for MPI applications). (You can use Intel Advisor filters to control the loops displayed in the Roofline chart; however, the Roofline chart does not support the Threads filter.)

3

  • Toggle the display among floating-point operations, integer operations, and mixed operations (floating-point and integer).

  • Enable the display of Roofline with Callstacks additions to the Roofline chart.

4

Display Roofline chart data from other Intel Advisor results or non-archived snapshots for comparison purposes.

Use the drop-down toolbar to:

  • Load a result/snapshot and display the corresponding filename in the Compared Results region.

  • Clear a selected result/snapshot and move the corresponding filename to the Ready for comparison region.

    Note: Click a filename in the Ready for comparison region to reload the result/snapshot.

  • Save the comparison itself to a file.

    Note: The arrowed lines showing the relationship among loops/functions do not reappear if you upload the comparison file.

Click a loop/function dot in the current result to show the relationship (arrowed lines) between it and the corresponding loop/function dots in loaded results/snapshots.

Intel Advisor: Roofline Comparison

5

  • Color Roofline chart zones to show if loops/functions are essentially:

    • Memory bound - If so, consider improving memory access patterns or using cache blocking.

    • Compute bound - If so, consider using a different instruction set architecture (ISA) or faster instructions, such as fused multiply-add (FMA) instructions.

    • Compute bound with memory roofs.

  • Adjust the default scale setting to show:

    • The optimal scale for each Roofline chart view

    • A scale that accommodates all Roofline chart views

  • Change the visibility and appearance of roofline representations (lines).

  • Change the appearance of loop/function weight representations (dots).

  • Manually fine-tune roof values to set hardware limits specific to your code.

6

Zoom in and out using numerical values.

7

Hover your mouse over an item to display metrics for it.

If you hover your mouse over a loop/function dot, the Roofline chart displays two blue projection dots with metrics that show potential performance if you optimize the loop/function to reach the next roofline and the maximum achievable roofline. (If the next roofline and maximum achievable roofline are the same, the Roofline chart displays only one blue projection dot.)
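One way to reason about the two projection dots is sketched below. It assumes a projection keeps the arithmetic intensity of the loop fixed and only raises performance to a roof. The function name projection_points and all roof values are hypothetical, and the Intel Advisor may compute its projections differently.

```python
def projection_points(ai, achieved, bandwidth_roofs, compute_roofs):
    """Return (next_roof, max_roof) in GFLOPS for a dot, or None if on the roof.

    ai              -- arithmetic intensity of the loop, FLOP/byte
    achieved        -- achieved performance of the loop, GFLOPS
    bandwidth_roofs -- {name: GB/s} for the diagonal roofs
    compute_roofs   -- {name: GFLOPS} for the horizontal roofs
    """
    # Height of every roof at this arithmetic intensity.
    roof_values = [ai * bw for bw in bandwidth_roofs.values()]
    roof_values += list(compute_roofs.values())
    # The topmost roofs cap what any dot can reach at this intensity.
    ceiling = min(ai * max(bandwidth_roofs.values()), max(compute_roofs.values()))
    reachable = [v for v in roof_values if achieved < v <= ceiling]
    if not reachable:
        return None
    return min(reachable), ceiling


if __name__ == "__main__":
    bandwidth = {"DRAM": 50.0, "L2": 200.0, "L1": 600.0}        # hypothetical
    compute = {"Scalar Add Peak": 40.0, "Vector Add Peak": 320.0}  # hypothetical
    print(projection_points(0.5, 20.0, bandwidth, compute))
    print(projection_points(0.5, 150.0, bandwidth, compute))
```

When both returned values are equal, the next roofline and the maximum achievable roofline coincide, which matches the single-dot case described above.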

Click a loop/function dot to:

  • Outline it in black.

  • Display metrics for it.

  • If Roofline with Callstacks is enabled, display the corresponding, navigable, color-coded callstack.

  • Display corresponding data in other window tabs.

You can also click an item in the Callstack pane to flash the corresponding loop/function dot in the Roofline chart.

If Roofline with Callstacks is enabled, click the collapse control on a loop/function dot to collapse descendant dots into the parent dot, or click the expand control on a loop/function dot to show descendant dots and their relationship via visual indicators to the parent dot.

Right-click a loop/function dot or a blank area in the Roofline chart to perform more functions, such as:

  • Further simplify the Roofline chart by filtering out (temporarily hiding a dot), filtering in (temporarily hiding all other dots), and clearing filters (showing all originally displayed dots).

  • Copy data to the clipboard.

8

If Roofline with Callstacks is enabled, show/hide the Callstack pane.

9

Display the number and percentage of loops in each loop weight representation category.

Set Up Environment

Environment

Set-Up Tasks

Intel® Parallel Studio XE/Linux* OS

  • Do one of the following:

    • Run one of the following source commands:

      • For csh/tcsh users: source <advisor-install-dir>/advixe-vars.csh

      • For bash users: source <advisor-install-dir>/advixe-vars.sh

      The default installation path, <advisor-install-dir>, is below:

      • /opt/intel/ for root users

      • $HOME/intel/ for non-root users

    • Add <advisor-install-dir>/bin32 or <advisor-install-dir>/bin64 to your path.

    • Run the <parallel-studio-install-dir>/psxevars.csh or <parallel-studio-install-dir>/psxevars.sh command. The default installation path, <parallel-studio-install-dir>, is below:

      • /opt/intel/ for root users

      • $HOME/intel/ for non-root users

  • Set the VISUAL or EDITOR environment variable to identify the external editor to launch when you double-click a line in an Intel Advisor source window. (VISUAL takes precedence over EDITOR.)

  • Set the BROWSER environment variable to identify the installed browser to display Intel Advisor documentation.

  • If you are using Intel® Threading Building Blocks (Intel® TBB), set the TBBROOT environment variable so your compiler can locate the installed Intel TBB include directory.

  • Make sure you run your application in the same Linux* OS environment as the Intel Advisor.

Intel Parallel Studio XE/Windows* OS

Note

Setting up the Windows* OS environment is necessary only if you plan to use the advixe-cl command to run the command line interface, or choose to use the advixe-gui command to launch the Intel Advisor standalone GUI instead of using available GUI or IDE launch options.

Do one of the following:

  • Run the <advisor-install-dir>\advixe-vars.bat command.

    The default installation path, <advisor-install-dir>, is below C:\Program Files (x86)\IntelSWTools\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).

  • Run the <parallel-studio-install-dir>\psxevars.bat command.

    The default installation path, <parallel-studio-install-dir>, is below C:\Program Files (x86)\IntelSWTools\.

Intel® System Studio

Note

Setting up the environment is necessary only if you plan to use the advixe-cl command to run the command line interface, or choose to use the advixe-gui command to launch the Intel Advisor standalone GUI instead of using available GUI or IDE launch options.

Run the <advisor-install-dir>\advixe-vars.bat command to set up your environment. The default installation path, <advisor-install-dir>, is below C:\Program Files (x86)\IntelSWTools\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).
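After running one of the environment scripts above, a quick check that the advixe-cl and advixe-gui commands are reachable can save a failed launch later. The following is a minimal sketch using the Python standard library. The helper name missing_tools is hypothetical, and the check only confirms that the commands are on your path, not that the rest of the environment is set correctly.

```python
import shutil


def missing_tools(tools=("advixe-cl", "advixe-gui")):
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        print("Run the environment script for your installation and retry.")
    else:
        print("Intel Advisor commands found on PATH.")
```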

Launch Intel Advisor and Create a Project

To launch the:

  • Intel Parallel Studio XE/Intel Advisor standalone GUI:

    • In the Linux* OS: Run the advixe-gui command.

    • In the Windows* OS: From the Microsoft Windows* All Apps screen, select Intel Parallel Studio XE 201n > Intel Advisor 201n.

  • Intel System Studio/Intel Advisor standalone GUI: Choose Tools > Intel Advisor > Launch Intel Advisor from the IDE menu.

  • Intel Advisor plug-in to the Visual Studio* IDE: Open your solution in the Visual Studio* IDE.

To create an Intel Advisor project:

  1. Do one of the following:

    • In the standalone GUI: Choose File > New > Project… to open the Create a Project dialog box. Supply a name and location for your project, then click the Create Project button to open the Project Properties dialog box.

    • In the Visual Studio* IDE: Choose Project > Intel Advisor 201n Project Properties... to open the Project Properties dialog box.

  2. On the left side of the Analysis Target tab, ensure the Survey Hotspots Analysis type is selected and set appropriate parameters.

  3. Set appropriate parameters for other analysis types and tabs. (Setting the binary/symbol search and source search directories is optional for the Vectorization Advisor.)

Tip

  • If possible, use the Inherit settings from Survey Hotspots Analysis Type checkbox for other analysis types.

  • The Trip Counts and FLOP Analysis type has similar parameters to the Survey Hotspots Analysis type.

  • The Dependencies Analysis and Memory Access Patterns Analysis types consume more resources than the Survey Hotspots Analysis type. If these Refinement analyses take too long, consider decreasing the workload.

  • Select Track stack variables in the Dependencies Analysis type to detect all possible dependencies.

Run Roofline Analysis

Intel Advisor Vectorization Workflow Tab: Run Roofline

Under Run Roofline in the Vectorization Workflow, click the Run analysis control to execute your target application. Upon completion, the Intel Advisor displays a Roofline chart.

To implement the Roofline with Callstacks feature:
Intel Advisor: Roofline with Callstacks

  1. Run the Roofline analysis with the With Callstacks checkbox enabled. Upon completion, the Intel Advisor displays a Roofline chart.

  2. Enable the With Callstacks checkbox in the Roofline chart.

Note

If the Workflow is not displayed in the Visual Studio IDE: Click the Intel Advisor icon on the Intel Advisor toolbar.

Investigate Loops

If all loops are vectorizing properly and performance is satisfactory, you are done! Congratulations!

If one or more loops are not vectorizing properly or performance is unsatisfactory:

  1. Check data in associated Intel Advisor views to support your Roofline chart interpretation. For example: Check the Vectorized Loops/Efficiency values in the Survey Report or the data in the Code Analytics tab.

  2. Improve application performance using various Intel Advisor features to guide your efforts, such as:

    • Information in the Performance Issues column and associated Recommendations tab
      Intel Advisor Recommendations

      Table of contents on right, showing recommendations for each issue relevant to the loop. Expandable/collapsible recommendations on left (some reference details specific to the analyzed loop, such as vector length or trip count). Number of bars on recommendation icon shows confidence this recommendation is the appropriate fix.

    • Information in the Why No Vectorization? column and associated Why No Vectorization? tab

    • Suggestions in Next Steps: After Running Survey Analysis in the Intel Advisor User Guide

If you need more information, continue your investigation by:

  1. Marking one or more loops/functions for deeper analysis in the column AND

  2. Running a Dependencies analysis to discover why the compiler assumed a dependency and did not vectorize a loop/function, and/or running a Memory Access Patterns (MAP) analysis to identify expensive memory instructions

Run Dependencies Analysis

To run a Dependencies analysis:

  1. Mark one or more un-vectorized loops for deeper analysis in the column in the Survey Report.

  2. Under Check Dependencies in the Vectorization Workflow, click the Run analysis control to collect Dependencies data while your application executes.

After the Intel Advisor collects the data, it displays a Dependencies-focused Refinement Report similar to the following:


Intel Advisor: Dependencies Report
There are many controls available to help you focus on the data most important to you, including the following:

1

To display more information in the Dependencies Report about a loop you selected for deeper analysis: Click the associated data row.

2

To display instruction addresses and code snippets for associated code locations in the Code Locations pane: Click a data row.

To choose a problem of interest to display in the Dependencies Source window: Right-click a data row, then choose View Source.

To open your default editor in another tab/window: Right-click a data row, then choose Edit Source to open an editor tab.

3

To choose a code location of interest to display in the Dependencies Source window: Right-click a data row, then choose View Source.

To open your default editor in another tab/window: Right-click a data row, then choose Edit Source to open an editor tab.

4

Use the Filter pane to:

  • Temporarily limit the items displayed in the Problems and Messages pane by clicking filter criteria in one or more filter categories.

  • Deselect filter criteria in one filter category, or deselect filter criteria in all filter categories.

  • Sort all filter criteria by name in ascending alphabetical order or by count in descending numerical order. (You cannot change the order in which filter categories are presented.)

5

To populate these columns and the Memory Access Patterns Report with data, run a Memory Access Patterns analysis.

If the Dependencies Report shows:

  • There is no real dependency in the loop for the given workload, follow Intel Advisor guidance to tell the compiler it is safe to vectorize.

  • There is an anti-dependency (often called a write-after-read, or WAR, dependency), follow Intel Advisor guidance to enable vectorization.
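The following sketch illustrates why an anti-dependency can usually be vectorized while a true (read-after-write) dependency cannot. It runs the same loop body two ways: one iteration at a time, and with all reads performed before any write, which loosely models a single wide vector operation. The function names are hypothetical, and real vector hardware works on fixed-width chunks rather than the whole array, so treat this as an illustration only.

```python
def run_sequential(values, offset):
    """Run a[i] = a[i + offset] + 1 one iteration at a time, in index order."""
    a = list(values)
    for i in range(len(a)):
        j = i + offset
        if 0 <= j < len(a):
            a[i] = a[j] + 1
    return a


def run_vector_style(values, offset):
    """Run the same loop, but read every input before writing any output."""
    old = list(values)  # snapshot taken before any write happens
    a = list(values)
    for i in range(len(a)):
        j = i + offset
        if 0 <= j < len(old):
            a[i] = old[j] + 1
    return a


if __name__ == "__main__":
    data = [1, 5, 2, 7]
    # offset = +1: each iteration reads an element that a later iteration
    # overwrites (write after read). Both orders give the same result.
    print(run_sequential(data, 1), run_vector_style(data, 1))
    # offset = -1: each iteration reads an element the previous iteration
    # just wrote (read after write). The results differ.
    print(run_sequential(data, -1), run_vector_style(data, -1))
```

The second case is the kind of real dependency that makes vectorizing the loop as written unsafe.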

Intel Advisor code improvement guidance is available in the Recommendations tab and in Next Steps: After Running Survey Analysis in the Intel Advisor User Guide. After you finish improving your code:

  1. Run a Memory Access Patterns (MAP) analysis if desired.

  2. Rebuild your modified code.

  3. Run another Roofline analysis to verify your application still runs correctly and all test cases pass, all loops are vectorizing properly, and performance is satisfactory.

Run Memory Access Patterns (MAP) Analysis

To run a Memory Access Patterns (MAP) analysis:

  1. Mark one or more un-vectorized loops for deeper analysis in the column in the Survey Report.

  2. Under Check Memory Access Patterns in the Vectorization Workflow, click the Run analysis control to collect MAP data while your application executes.

After the Intel Advisor collects the data, it displays a MAP-focused Refinement Report similar to the following:

Intel Advisor: Memory Access Patterns (MAP) Report

Intel Advisor code improvement guidance is available in the Recommendations tab and in Next Steps: After Running Survey Analysis in the Intel Advisor User Guide. After you finish improving your code:

  1. Rebuild your modified code.

  2. Run another Roofline analysis to verify your application still runs correctly and all test cases pass, all loops are vectorizing properly, and performance is satisfactory.

For more complete information about compiler optimizations, see our Optimization Notice.