Roofline Analysis Purpose
- What is the maximum achievable performance with your current hardware resources?
- Does your application work optimally on current hardware resources?
- If not, what are the best candidates for optimization?
- Is memory bandwidth or compute capacity limiting performance for each optimization candidate?
- Basic Cache-Aware Roofline (default), which represents self data and cumulative traffic-based arithmetic intensity
- Roofline with Callstacks, which represents total data and allows you to investigate the source of loops/functions
- Memory-Level Roofline, which collects metrics for all memory levels and allows you to observe each loop/function at different cache levels
Basic Roofline Analysis
- Measure the hardware limitations of your machine and collect loop/function timings using the Survey analysis.
- Collect FLOP and integer operations data, and memory traffic data, using the Trip Counts and FLOP analysis - this collection can take three to four times longer than the Survey analysis.
- Arithmetic intensity (x axis) - measured in floating-point operations (FLOPs) and/or integer operations (INTOPs) per byte transferred between the CPU/VPU and memory, based on the loop/function algorithm
- Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS)
- The size and color of each Roofline chart dot represent relative execution time for each loop/function. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
- Roofline chart diagonal lines indicate memory bandwidth limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The L1 Bandwidth roofline represents the maximum amount of work that can get done at a given arithmetic intensity if the loop always hits L1 cache. A loop does not benefit from L1 cache speed if a dataset causes it to miss L1 cache too often, and instead is subject to the limitations of the lower-speed L2 cache it is hitting. So a dot representing a loop that misses L1 cache too often but hits L2 cache is positioned somewhere below the L2 Bandwidth roofline.
- Roofline chart horizontal lines indicate compute capacity limitations preventing loops/functions from achieving better performance without some form of optimization. For example: The Scalar Add Peak represents the peak number of add instructions that can be performed by the scalar loop under these circumstances. The Vector Add Peak represents the peak number of add instructions that can be performed by the vectorized loop under these circumstances. So a dot representing a loop that is not vectorized is positioned somewhere below the Scalar Add Peak roofline.
- A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine; however, not all loops can utilize maximum machine capabilities.
- The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
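The relationships in the list above follow the standard roofline model, in which the attainable performance at a given arithmetic intensity is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The Python sketch below is illustrative only: the roof values and loop measurements are made-up numbers, and the function names are hypothetical helpers, not part of Intel Advisor.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """FLOPs per byte transferred between CPU/VPU and memory (x axis)."""
    return flops / bytes_transferred


def performance_gflops(flops, seconds):
    """Billions of floating-point operations per second (y axis)."""
    return flops / seconds / 1e9


def attainable_gflops(ai, bandwidth_gb_s, compute_peak_gflops):
    """Roofline bound: the lower of the diagonal memory roof and the horizontal compute roof."""
    return min(ai * bandwidth_gb_s, compute_peak_gflops)


# Hypothetical roofs (assumed values, not measured on any real machine).
L1_BANDWIDTH_GB_S = 400.0
VECTOR_ADD_PEAK_GFLOPS = 100.0

# Hypothetical loop: 2e9 FLOPs, 16e9 bytes of traffic, 0.5 s of execution time.
ai = arithmetic_intensity(2e9, 16e9)
perf = performance_gflops(2e9, 0.5)
bound = attainable_gflops(ai, L1_BANDWIDTH_GB_S, VECTOR_ADD_PEAK_GFLOPS)

# The loop is memory bound if the diagonal roof is the lower one at this intensity.
if ai * L1_BANDWIDTH_GB_S < VECTOR_ADD_PEAK_GFLOPS:
    limiting = "memory bandwidth"
else:
    limiting = "compute capacity"

# Distance between the dot and the highest achievable roofline, as a ratio.
headroom = bound / perf

print(f"AI={ai} FLOP/byte, perf={perf} GFLOPS, bound={bound} GFLOPS")
print(f"limited by {limiting}, headroom x{headroom}")
```

With these assumed numbers, the dot sits well below the diagonal roof, which corresponds to the case where a large distance to the highest achievable roofline signals room for optimization.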
Click to toggle between
Roofline chart view and
Click to toggle to and from side-by-side
Drag to adjust the dimensions of the
Cores drop-down toolbar to:
Choose the appropriate number of CPU cores to scale roof values up or down:
By default, the number of cores is set to the number of threads used by the application (even values only).
You’ll see the following options if your code is running on a multisocket PC:
Roofline chart data from other Intel Advisor results or non-archived snapshots for comparison purposes.
Use the drop-down toolbar to:
Click a loop/function dot in the current result to show the relationship (arrowed lines) between it and the corresponding loop/function dots in loaded results/snapshots.
Add visual indicators to the Roofline chart to make the interpretation of data easier, including performance limits and whether loops/functions are memory bound, compute bound, or both.
Use the drop-down toolbar to:
The preview picture is updated as you select guidance options, allowing you to see how changes will affect the Roofline chart's appearance. Click Apply to apply your changes, or Default to return the Roofline chart to its original appearance.
Once you have a loop/function's dots highlighted, you can zoom and fit the Roofline chart to the dots for the selected loop/function by once again double-clicking the loop/function or pressing ENTER with the loop/function selected. Repeat this action to return to the original Roofline chart view.
To hide the labeled dots, select another loop/function, or double-click an empty space in the Roofline chart.
You can save your Roofs Settings or Point Weight Representation configuration to a JSON file or load a custom configuration.
Zoom in and out using numerical values.
Click a loop/function dot to:
Right-click a loop/function dot or a blank area in the Roofline chart to perform more functions, such as:
Show/hide the metrics pane:
Display the number and percentage of loops in each loop weight representation category.
Roofline with Callstacks
- Self data = Memory access, FLOPs, and duration related only to the loop/function itself and excludes data originating in other loops/functions called by it
- Total data = Data from the loop/function itself and its inner loops/functions
- Investigate the source of loops/functions instead of just the loops/functions themselves.
- Get a more accurate view of loops/functions that behave differently when called under different circumstances.
- Uncover design inefficiencies higher up the call chain that could be the root cause of poor performance by smaller loops/functions.
- Enable the With Callstacks checkbox in the Vectorization Workflow tab under Run Roofline.
- Run the Roofline analysis. Upon completion, Intel Advisor displays a Roofline chart.
- Enable the With Callstacks checkbox in the Roofline chart.
- A navigable, color-coded Callstack pane that shows the entire call chain for the selected loop/function, but excludes its callees
- Visual indicators (caller and callee arrows) that show the relationship among loops and functions
- The ability to simplify dot-heavy charts by collapsing several small loops into one overall representation. Loops/functions with no self data are grayed out when expanded and in color when collapsed. Loops/functions with self data display at the coordinates, size, and color appropriate to the data when expanded, but have a gray halo of the size associated with their total time. When such loops/functions are collapsed, they change to the size and color appropriate to their total time and, if applicable, move to reflect the total performance and total arithmetic intensity.
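The difference between self data and total data, and the way a collapsed dot moves to its total performance and total arithmetic intensity, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Node` class, the helper functions, and all numbers are hypothetical, and total time is assumed to be the simple sum of self time and callee time. It does not reproduce how Intel Advisor aggregates data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One loop/function in a call tree, holding self data only."""
    name: str
    self_flops: float
    self_bytes: float
    self_seconds: float
    callees: list = field(default_factory=list)


def total_data(node):
    """Return (flops, bytes, seconds) for the node plus everything it calls."""
    flops, nbytes, seconds = node.self_flops, node.self_bytes, node.self_seconds
    for callee in node.callees:
        f, b, s = total_data(callee)
        flops += f
        nbytes += b
        seconds += s
    return flops, nbytes, seconds


def roofline_point(flops, nbytes, seconds):
    """Return (arithmetic intensity in FLOP/byte, performance in GFLOPS)."""
    return flops / nbytes, flops / seconds / 1e9


# Hypothetical call tree: an outer loop that calls one inner loop.
inner = Node("inner_loop", self_flops=8e9, self_bytes=4e9, self_seconds=1.0)
outer = Node("outer_loop", self_flops=2e9, self_bytes=4e9, self_seconds=1.0,
             callees=[inner])

# Expanded view: the outer dot is plotted from its self data.
self_point = roofline_point(outer.self_flops, outer.self_bytes, outer.self_seconds)

# Collapsed view: the outer dot is plotted from its total data.
total_point = roofline_point(*total_data(outer))

print("self  (AI, GFLOPS):", self_point)
print("total (AI, GFLOPS):", total_point)
```

In this made-up case the collapsed dot moves up and to the right, because the inner loop contributes most of the FLOPs while adding comparatively little memory traffic.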
Enable the display of Roofline with Callstacks additions to the Roofline chart.
Show/hide loop/function descendants:
You can also right-click a loop/function dot to open the context menu and expand/collapse the loop/function subtree.
Callstack and other panes.
You can also click an item in the Callstack pane to flash the corresponding loop/function dot in the
- Enable the For All Memory Levels checkbox in the Vectorization Workflow tab under Run Roofline.
- Run the Roofline analysis. Upon completion, Intel Advisor displays a Roofline chart.
- In the Roofline chart, verify that the Show memory level relationships checkbox is enabled in the Guidance drop-down menu.
- In the filter drop-down menu, select which memory levels to show dots for from the Memory Level section.
- Go to.
- In the Cache simulator configuration field, click Modify.
- Click Add and enter/select the desired cache configurations.
- Run the Roofline for all memory levels.
- The vertical distance between memory dots and their respective roofline shows how much you are limited by a given memory subsystem. If a dot is close to its roofline, it means that the kernel is limited by the performance of this memory level.
- The horizontal distance between each dot indicates how efficiently the loop/function uses cache. For example, if L3 and DRAM dots are very close on the horizontal axis for a single loop, the loop/function uses L3 and DRAM similarly. This means that it does not use L3 and DRAM efficiently. Improve data reuse in the code to improve application performance.
- Arithmetic intensity determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the L1 dot should be the largest and first plotted dot on the chart from left to right. However, memory access type, latency, or technical issues can change the order of the dots. Continue to run the Memory Access Pattern analysis to investigate this issue.
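The interpretation rules above can be made concrete with a small Python sketch. It assumes that the arithmetic intensity at each memory level is the loop's FLOPs divided by the bytes transferred at that level, and that each level's roof at that intensity is intensity times bandwidth. The traffic figures, bandwidths, and variable names are all hypothetical values chosen for illustration, not Intel Advisor output.

```python
FLOPS = 8e9      # hypothetical FLOP count for one loop
SECONDS = 1.0    # hypothetical execution time

# Hypothetical per-level traffic (bytes) and bandwidth roofs (GB/s).
traffic_bytes = {"L1": 64e9, "L2": 32e9, "L3": 20e9, "DRAM": 16e9}
bandwidth_gb_s = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 20.0}

perf = FLOPS / SECONDS / 1e9  # y coordinate, shared by all dots of this loop

# x coordinate of each dot: FLOPs per byte moved at that memory level.
ai = {level: FLOPS / nbytes for level, nbytes in traffic_bytes.items()}

# Vertical distance: fraction of the level's roof that the loop reaches.
# A value near 1.0 means the dot sits close to that roofline.
utilization = {level: perf / (ai[level] * bandwidth_gb_s[level]) for level in ai}
limiting = max(utilization, key=utilization.get)

# Horizontal distance: ratio of intensities between two levels.
# A ratio near 1.0 means the dots are close, so the faster level adds little reuse.
l3_to_dram_ratio = ai["DRAM"] / ai["L3"]

# Left-to-right plotting order of the dots.
order = sorted(ai, key=ai.get)

for level in order:
    print(f"{level}: AI={ai[level]:.3f} FLOP/byte, roof utilization={utilization[level]:.2f}")
print("closest to its roof:", limiting)
print("DRAM/L3 intensity ratio:", round(l3_to_dram_ratio, 2))
```

With these assumed numbers, the DRAM dot is closest to its roof and the L3 and DRAM dots are nearly adjacent on the horizontal axis, which matches the case where improving data reuse is the suggested next step.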
Visually emphasize the relationships among displayed memory levels and roofs for a selected loop/function dot by enabling the Show memory level relationships checkbox.
This checkbox is enabled by default.
Use the drop-down toolbar to:
Double-click a dot or select a dot and press ENTER to examine the relationships between displayed memory levels and roofs:
Memory Metrics and other panes.
What Do I Do Next?
- Address memory bandwidth bottlenecks.
- Address compute capacity bottlenecks.
- Identify the real bottlenecks.