User Guide

Roofline Analysis

Roofline Analysis Purpose

Roofline analysis helps you visualize actual performance against hardware-imposed performance ceilings, as well as determine the main limiting factor (memory bandwidth or compute capacity), thereby providing an ideal roadmap of potential optimization steps.
Use the Roofline chart to answer the following questions:
  • What is the maximum achievable performance with your current hardware resources?
  • Does your application work optimally on current hardware resources?
  • If not, what are the best candidates for optimization?
  • Is memory bandwidth or compute capacity limiting performance for each optimization candidate?
Intel® Advisor includes the following Roofline models that you can use to analyze your application:
  • Basic Cache-Aware Roofline (default), which represents self data and cumulative traffic-based arithmetic intensity
  • Roofline with Callstacks, which represents total data and allows you to investigate the source of loops/functions
  • Memory-Level Roofline, which collects metrics for all memory levels and allows you to observe each loop/function at different cache levels

Basic Roofline Analysis

In the Vectorization Workflow tab, click the Run analysis control under Run Roofline.
The Intel Advisor executes the target application twice to:
  • Measure the hardware limitations of your machine and collect loop/function timings using the Survey analysis.
  • Collect FLOP and integer operations data, and memory traffic data, using the Trip Counts and FLOP analysis - this collection can take three to four times longer than the Survey analysis.
After both analyses are complete, the Intel Advisor adds a Roofline chart to the Survey Report.
By default, Intel Advisor runs the Cache-Aware Roofline, which represents self data and cumulative traffic-based arithmetic intensity.
Roofline Chart Data
The Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:
  • Arithmetic intensity (x axis) - measured in number of floating-point operations (FLOPs) and/or integer operations (INTOPs) per byte transferred between the CPU/VPU and memory, based on the loop/function algorithm
  • Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS)
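The two axes can be illustrated with a minimal sketch. The function names and the sample counts below are hypothetical and are not part of Intel Advisor, which collects the operation counts, byte counts, and timings itself; the sketch only shows how the plotted coordinates relate to those three quantities.

```python
def arithmetic_intensity(flops: float, bytes_transferred: float) -> float:
    """x axis: FLOP per byte transferred between the CPU/VPU and memory."""
    if bytes_transferred <= 0:
        raise ValueError("bytes_transferred must be positive")
    return flops / bytes_transferred


def gflops(flops: float, seconds: float) -> float:
    """y axis: billions of floating-point operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1e9


# Hypothetical loop: 2e9 FLOP, 8e9 bytes of traffic, 0.5 s of self time.
ai = arithmetic_intensity(2e9, 8e9)  # 0.25 FLOP/byte
perf = gflops(2e9, 0.5)              # 4.0 GFLOPS
```

The same arithmetic applies to integer operations (INTOPs and GINTOPS) when the chart is switched to integer or mixed mode.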
Intel Advisor Cache-Aware Roofline Chart
In general:
  • The size and color of each Roofline chart dot represent relative execution time for each loop/function. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
  • Roofline chart diagonal lines indicate memory bandwidth limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The L1 Bandwidth roofline represents the maximum amount of work that can get done at a given arithmetic intensity if the loop always hits L1 cache. A loop does not benefit from L1 cache speed if a dataset causes it to miss L1 cache too often, and instead is subject to the limitations of the lower-speed L2 cache it is hitting. So a dot representing a loop that misses L1 cache too often but hits L2 cache is positioned somewhere below the L2 Bandwidth roofline.
  • Roofline chart horizontal lines indicate compute capacity limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The Scalar Add Peak represents the peak number of add instructions that can be performed by the scalar loop under these circumstances. The Vector Add Peak represents the peak number of add instructions that can be performed by the vectorized loop under these circumstances. So a dot representing a loop that is not vectorized is positioned somewhere below the Scalar Add Peak roofline.
  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine; however, not all loops can utilize maximum machine capabilities.
  • The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
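The relationship between the diagonal and horizontal roofs described above can be sketched numerically. This is the generic roofline formula (attainable performance is the lower of the compute peak and arithmetic intensity times bandwidth), not Intel Advisor code; the bandwidth and peak values are hypothetical, since Intel Advisor measures the real ones on your machine.

```python
def attainable_gflops(ai: float, bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Ceiling at a given arithmetic intensity: the lower of the diagonal
    memory roof (ai * bandwidth) and the horizontal compute roof."""
    return min(peak_gflops, ai * bandwidth_gb_s)


def limiting_factor(ai: float, bandwidth_gb_s: float, peak_gflops: float) -> str:
    """Report which roof is lower at this arithmetic intensity."""
    return "memory" if ai * bandwidth_gb_s < peak_gflops else "compute"


def headroom(achieved_gflops: float, ai: float,
             bandwidth_gb_s: float, peak_gflops: float) -> float:
    """How many times faster the loop could run before hitting its ceiling."""
    return attainable_gflops(ai, bandwidth_gb_s, peak_gflops) / achieved_gflops


# Hypothetical machine: 50 GB/s memory roof, 100 GFLOPS vector add peak.
BW, PEAK = 50.0, 100.0
low_ai_ceiling = attainable_gflops(0.25, BW, PEAK)   # 12.5, under the diagonal roof
high_ai_ceiling = attainable_gflops(4.0, BW, PEAK)   # 100.0, under the horizontal roof
room = headroom(4.0, 0.25, BW, PEAK)                 # 3.125x potential improvement
```

A dot with a large headroom value and a large share of execution time corresponds to the "large red dot far below the roofs" case.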
In the following Roofline chart representation, loops A and G (large red dots), and to a lesser extent B (yellow dot far below the roofs), are the best candidates for optimization. Loops C, D, and E (small green dots) and H (yellow dot) are poor candidates because they do not have much room to improve or are too small to have significant impact on performance.
This is a visual model, not an actual screenshot, of the Roofline Chart
Roofline Chart Controls
There are several controls to help you show/hide the Roofline chart:
Intel Advisor: Roofline Chart & Survey Report
1
Click to toggle between Roofline chart view and Survey Report view.
2
Click to toggle to and from side-by-side Roofline chart and Survey Report view.
3
Drag to adjust the dimensions of the Roofline chart and Survey Report.
There are several controls to help you focus on the Roofline chart data most important to you, including the following.
Intel Advisor Cache-Aware Roofline Chart
1
  • Select Loops by Mouse Rect: Select one or more loops/functions by tracing a rectangle with your mouse.
  • Zoom by Mouse Rect: Zoom in and out by tracing a rectangle with your mouse. You can also zoom in and out using your mouse wheel.
  • Move View By Mouse: Move the chart left, right, up, and down.
  • Undo or Redo: Undo or redo the previous zoom action.
  • Cancel Zoom: Reset to the default zoom level.
  • Export as x: Export the chart as a dynamic and interactive HTML or SVG file that does not require the Intel Advisor viewer for display. Use the arrow to toggle between the options.
2
Use the Cores drop-down toolbar to:
  • Adjust rooflines to see practical performance limits for your code on the host machine.
  • Build roofs for single-threaded applications (or for multi-threaded applications configured to run single-threaded, such as one thread per rank for MPI applications). You can use Intel Advisor filters to control the loops displayed in the Roofline chart; however, the Roofline chart does not support the Threads filter.
Choose the appropriate number of CPU cores to scale roof values up or down:
  • 1 – if your code is single-threaded
  • Number of cores equal or close to the number of threads – if your code has fewer threads than available CPU cores
  • Maximum number of cores – if your code has more threads than available CPU cores
By default, the number of cores is set to the number of threads used by the application (even values only).
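The selection rule above can be written as a small decision function. This is only a sketch of the guidance in this section, with a hypothetical function name; it does not model the even-values-only default or the socket-binding options, and the final choice remains a judgment call in the Cores drop-down.

```python
def suggested_core_count(threads: int, available_cores: int) -> int:
    """Pick a Cores setting following the guidance in this section."""
    if threads < 1 or available_cores < 1:
        raise ValueError("threads and available_cores must be at least 1")
    if threads == 1:
        return 1                   # single-threaded code
    if threads < available_cores:
        return threads             # fewer threads than cores: match the threads
    return available_cores         # as many or more threads than cores: use all cores


single = suggested_core_count(1, 8)            # 1
partial = suggested_core_count(4, 8)           # 4
oversubscribed = suggested_core_count(16, 8)   # 8
```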
You’ll see the following options if your code is running on a multisocket PC:
  • Choose Bind cores to 1 socket (default) if your application binds memory to one socket. For example, choose this option for MPI applications structured as one rank per socket.
    This option may be disabled if you choose a number of CPU cores exceeding the maximum number of cores available on one socket.
  • Choose Spread cores between all n sockets if your application binds memory to all sockets. For example, choose this option for non-MPI applications.
3
  • Toggle the display between floating-point (FLOP), integer (INT) operations, and mixed operations (floating-point and integer).
  • If you collected Roofline with Callstacks: Enable the display of Roofline with Callstacks additions to the Roofline chart.
4
Display Roofline chart data from other Intel Advisor results or non-archived snapshots for comparison purposes.
Use the drop-down toolbar to:
  • Load a result/snapshot and display the corresponding filename in the Compared Results region.
  • Clear a selected result/snapshot and move the corresponding filename to the Ready for comparison region.
    Note: Click a filename in the Ready for comparison region to reload the result/snapshot.
  • Save the comparison itself to a file.
    The arrowed lines showing the relationship among loops/functions do not reappear if you upload the comparison file.
Click a loop/function dot in the current result to show the relationship (arrowed lines) between it and the corresponding loop/function dots in loaded results/snapshots.
Intel Advisor: Roofline Comparison
5
Add visual indicators to the Roofline chart to make the interpretation of data easier, including performance limits and whether loops/functions are memory bound, compute bound, or both.
Use the drop-down toolbar to:
  • Show a vertical line from a loop/function to the nearest and topmost performance ceilings by enabling the Display roof rulers checkbox. To view the ruler, hover the cursor over a loop/function. Where the line intersects with each roof, labels display hardware performance limits for the loop/function.
  • If you collected Roofline for All Memory Levels: Visually emphasize the relationships among displayed memory levels and roofs for a selected loop/function dot by enabling the Show memory level relationships checkbox.
  • Color the roofline zones to make it easier to see if enclosed loops/functions are fundamentally memory bound, compute bound, or bound by compute and memory roofs by enabling the Show Roofline boundaries checkbox.
The preview picture is updated as you select guidance options, allowing you to see how changes will affect the Roofline chart’s appearance. Click Apply to apply your changes, or Default to return the Roofline chart to its original appearance.
Once you have a loop/function's dots highlighted, you can zoom and fit the Roofline chart to the dots for the selected loop/function by once again double-clicking the loop/function or pressing SPACE or ENTER with the loop/function selected. Repeat this action to return to the original Roofline chart view.
To hide the labeled dots, select another loop/function, or double-click an empty space in the Roofline chart.
6
  • Roofline View Settings: Adjust the default scale setting to show:
    • The optimal scale for each Roofline chart view
    • A scale that accommodates all Roofline chart views
  • Roofs Settings: Change the visibility and appearance of roofline representations (lines):
    • Enable calculating roof values based on single-threaded benchmark results instead of multi-threaded.
    • Click a Visible checkbox to show/hide a roofline.
    • Click a Selected checkbox to change roofline appearance: display a roofline as a solid or a dashed line.
    • Manually fine-tune roof values in the Value column to set hardware limits specific to your code.
  • Loop Weight Representation: Change the appearance of loop/function weight representations (dots):
    • Point Weight Calculation: Change the Base Value for a loop/function weight calculation.
    • Point Weight Ranges: Change the Size, Color, and weight Range (R) of a loop/function dot. Click the + button to split a loop weight range in two. Click the - button to merge a loop weight range with the range below.
    • Point Colorization: Color loop/function dots by weight ranges or by type (vectorized or scalar). You can also change the color of loops with no self time.
You can save your Roofs Settings or Point Weight Representation configuration to a JSON file or load a custom configuration.
7
Zoom in and out using numerical values.
8
Click a loop/function dot to:
  • Outline it in black.
  • Display metrics for it.
  • Display corresponding data in other window tabs.
Right-click a loop/function dot or a blank area in the Roofline chart to perform more functions, such as:
  • Further simplify the Roofline chart by filtering out (temporarily hiding a dot), filtering in (temporarily hiding all other dots), and clearing filters (showing all originally displayed dots).
  • Copy data to the clipboard.
9
Show/hide the metrics pane:
  • Review the basic performance metrics in the Point Info pane.
  • If you collected the Roofline for All Memory Levels: Review how efficiently the loop/function uses cache and what memory level bounds the loop/function in the Memory Metrics pane.
10
Display the number and percentage of loops in each loop weight representation category.
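As an illustration of how loops fall into weight categories and how the per-category count and percentage could be derived, here is a minimal sketch. The thresholds, labels, and the choice of total self time as the base value are all assumptions made for this example; the actual ranges are whatever you configure under Point Weight Ranges.

```python
from collections import Counter

# Hypothetical weight ranges: (lower bound in % of total self time, label),
# ordered from the heaviest range to the lightest.
RANGES = [(20.0, "large/red"), (1.0, "medium/yellow"), (0.0, "small/green")]


def weight_category(self_time: float, total_time: float) -> str:
    """Map a loop's share of total self time to a weight range label."""
    percent = 100.0 * self_time / total_time
    for lower_bound, label in RANGES:
        if percent >= lower_bound:
            return label
    return RANGES[-1][1]


def category_summary(self_times: dict) -> dict:
    """Return {label: (loop count, percentage of all loops)}."""
    total = sum(self_times.values())
    if total <= 0:
        raise ValueError("total self time must be positive")
    counts = Counter(weight_category(t, total) for t in self_times.values())
    n = len(self_times)
    return {label: (count, 100.0 * count / n) for label, count in counts.items()}


# Hypothetical self times in seconds for four loops.
loops = {"A": 6.0, "B": 3.0, "C": 0.05, "D": 0.95}
summary = category_summary(loops)
```

With these sample values, two loops land in the heaviest range and one in each of the other two.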

Roofline with Callstacks

The Intel Advisor basic Roofline model, the Cache-Aware Roofline Model (CARM), offers self data capability. The Intel Advisor Roofline with Callstacks feature extends the basic model with total data capability:
  • Self data = Memory access, FLOPs, and duration related only to the loop/function itself, excluding data originating in other loops/functions called by it
  • Total data = Data from the loop/function itself and its inner loops/functions
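The difference between self data and total data can be sketched with a small call tree. The `Node` class and the sample numbers are hypothetical, and the sketch assumes that callee time does not overlap with caller self time (as in a single-threaded run); it is an illustration of the definitions above, not of how Intel Advisor aggregates data internally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A loop/function with its self metrics and the loops/functions it calls."""
    name: str
    self_flops: float = 0.0
    self_bytes: float = 0.0
    self_seconds: float = 0.0
    children: list[Node] = field(default_factory=list)

    def total(self) -> tuple[float, float, float]:
        """Total data: self metrics plus the totals of all inner loops/functions."""
        flops, nbytes, seconds = self.self_flops, self.self_bytes, self.self_seconds
        for child in self.children:
            f, b, s = child.total()
            flops, nbytes, seconds = flops + f, nbytes + b, seconds + s
        return flops, nbytes, seconds


inner = Node("inner_loop", self_flops=8e9, self_bytes=4e9, self_seconds=1.0)
outer = Node("outer_function", self_flops=1e9, self_bytes=4e9,
             self_seconds=1.0, children=[inner])

self_ai = outer.self_flops / outer.self_bytes   # 0.25 FLOP/byte
flops, nbytes, seconds = outer.total()
total_ai = flops / nbytes                       # 1.125 FLOP/byte
total_gflops = flops / seconds / 1e9            # 4.5 GFLOPS
```

The outer function's dot sits at a different position depending on whether it is plotted from self or total data, which is why collapsed dots can move on the chart.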
The total-data capability in the Roofline with Callstacks feature can help you:
  • Investigate the source of loops/functions instead of just the loops/functions themselves.
  • Get a more accurate view of loops/functions that behave differently when called under different circumstances.
  • Uncover design inefficiencies higher up the call chain that could be the root cause of poor performance by smaller loops/functions.
To run Roofline with Callstacks:
  1. Enable the With Callstacks checkbox in the Vectorization Workflow tab under Run Roofline.
  2. Run the Roofline analysis. Upon completion, the Intel Advisor displays a Roofline chart.
  3. Enable the With Callstacks checkbox in the Roofline chart.
Roofline with Callstacks Chart Data
The following Roofline chart representation shows some of the added benefits of the Roofline with Callstacks feature, including:
  • A navigable, color-coded Callstack pane that shows the entire call chain for the selected loop/function, but excludes its callees
  • Visual indicators (caller and callee arrows) that show the relationship among loops and functions
  • The ability to simplify dot-heavy charts by collapsing several small loops into one overall representation
    Loops/functions with no self data are grayed out when expanded and in color when collapsed. Loops/functions with self data display at the coordinates, size, and color appropriate to the data when expanded, but have a gray halo of the size associated with their total time. When such loops/functions are collapsed, they change to the size and color appropriate to their total time and, if applicable, move to reflect the total performance and total arithmetic intensity.
Intel Advisor: Roofline with Callstacks
Roofline with Callstacks Chart Controls
Intel Advisor Roofline with Callstacks
1
Enable the display of Roofline with Callstacks additions to the Roofline chart.
2
Show/hide loop/function descendants:
  • Click a loop/function dot's collapse control to collapse descendant dots into the parent dot.
  • Click a loop/function dot's expand control to show descendant dots and their relationship via visual indicators to the parent dot.
You can also right-click a loop/function dot to open the context menu and expand/collapse the loop/function subtree.
3
Show/hide the Callstack and other panes.
4
  • Click an item in the Callstack pane to flash the corresponding loop/function dot in the Roofline chart.
  • Right-click an item in the Callstack pane to open the context menu and expand/collapse the item subtree.

Memory-Level Roofline

Using the cache simulation, Intel Advisor evaluates the data transactions between the different memory layers available on your system and generates a Memory-Level Roofline chart. You can choose which memory levels (L1, L2, L3, DRAM) to plot dots for and examine this data for a selected loop/function in greater detail, displaying labeled dots with arithmetic intensity for the loop/function at each memory level.
To run Memory-Level Roofline:
  1. Enable the For All Memory Levels checkbox in the Vectorization Workflow tab under Run Roofline.
  2. Run the Roofline analysis. Upon completion, the Intel Advisor displays a Roofline chart.
  3. In the Roofline chart, verify that the Show memory level relationships checkbox is enabled in the Guidance drop-down menu.
  4. In the filter drop-down menu, select which memory levels to show dots for from the Memory Level section.
By default, the Memory-Level Roofline chart is generated for the system cache configuration. You can also generate the chart for a custom cache configuration:
  1. Go to Project Properties > Trip Count and FLOP.
  2. In the Cache simulator configuration field, click Modify.
  3. Click Add and enter/select the desired cache configurations.
  4. Run the Roofline for all memory levels.
Memory-Level Roofline Data
The Memory-Level Roofline model allows you to observe each loop/function at different cache levels and compare arithmetic intensities to understand where performance decreases. The roofs represent the best possible bandwidths for each memory level.
Review the changes in the traffic from one memory level to another and compare them to the respective roofs to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.
  • The vertical distance between memory dots and their respective roofline shows how much you are limited by a given memory subsystem. If a dot is close to its roofline, it means that the kernel is limited by the performance of this memory level.
  • The horizontal distance between each dot indicates how efficiently the loop/function uses cache. For example, if L3 and DRAM dots are very close on the horizontal axis for a single loop, the loop/function uses L3 and DRAM similarly. This means that it does not use L3 and DRAM efficiently. Improve re-usage of data in the code to improve application performance.
  • Arithmetic intensity determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the L1 dot should be the largest and first plotted dot on the chart from left to right. However, memory access type, latency, or technical issues can change the order of the dots. Continue to run the Memory Access Pattern analysis to investigate this issue.
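The reading rules above can be sketched numerically. All traffic and bandwidth figures below are hypothetical, and the function names are not part of Intel Advisor; the sketch computes one arithmetic intensity per memory level from the traffic at that level, then treats the level whose dot reaches the largest fraction of its own roof as the bounding level.

```python
def per_level_intensity(flops: float, traffic_bytes: dict) -> dict:
    """Arithmetic intensity of one loop at each memory level (FLOP/byte)."""
    return {level: flops / nbytes for level, nbytes in traffic_bytes.items()}


def roof_utilization(flops: float, seconds: float,
                     traffic_bytes: dict, bandwidth_gb_s: dict) -> dict:
    """Fraction of each memory roof reached: achieved GFLOPS divided by the
    roof value (intensity * bandwidth) at that level's intensity."""
    achieved = flops / seconds / 1e9
    intensity = per_level_intensity(flops, traffic_bytes)
    return {level: achieved / (intensity[level] * bandwidth_gb_s[level])
            for level in traffic_bytes}


def bounding_level(utilization: dict) -> str:
    """The level whose dot sits closest to its own roofline."""
    return max(utilization, key=utilization.get)


# Hypothetical loop: 4e9 FLOP in 1 s, with traffic measured per memory level.
traffic = {"L1": 32e9, "L2": 16e9, "L3": 8e9, "DRAM": 8e9}      # bytes
roofs = {"L1": 400.0, "L2": 150.0, "L3": 80.0, "DRAM": 20.0}     # GB/s

ai = per_level_intensity(4e9, traffic)
util = roof_utilization(4e9, 1.0, traffic, roofs)
bound = bounding_level(util)
```

In this sample, the L3 and DRAM intensities are identical (no reuse in L3), and the DRAM dot is the closest to its roof, so DRAM bandwidth is the level to address first.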
Memory-Level Roofline Chart Controls
Intel Advisor Memory-Level Roofline Chart
1
Visually emphasize the relationships among displayed memory levels and roofs for a selected loop/function dot by enabling the Show memory level relationships checkbox.
This checkbox is enabled by default.
2
Use the drop-down toolbar to:
  • Select the Memory Level(s) to show for each loop/function in the chart (L1, L2, L3, DRAM).
  • Select which Memory Operation Type(s) to display data for in the Roofline chart: Loads, Stores, or Loads and Stores.
3
Double-click a dot or select a dot and press SPACE or ENTER to examine the relationships between displayed memory levels and roofs:
  • Labeled dots are displayed, representing memory levels for the selected loop/function. Lines connect the dots to indicate that they correspond to the selected loop/function.
    If you have chosen to display only some memory levels in the chart using the Memory Level option, unselected memory levels are displayed with X marks.
  • An arrowed line is displayed, pointing to the memory level roofline that bounds the selected loop. If the arrowed line cannot be displayed, a message will pop up with instructions on how to fix it.
4
Show/hide the Memory Metrics and other panes.
In the Memory Metrics pane:
  • Review the time spent processing requests for each memory level reported in the Impacts histogram. A large value indicates a memory level that bounds the selected loop.
  • Review the amount of data that passes through each memory level reported in the Shares histogram.

What Do I Do Next?

See the Intel® Advisor Cookbook recipes to learn how to use the Roofline for specific use cases:
  • Address memory bandwidth bottlenecks.
  • Address compute capacity bottlenecks.
  • Identify the real bottlenecks.
