Quick Analysis of Vectorization Using Intel® Advisor 2019

In this article, we discuss how to use the command-line interface in Intel® Advisor 2019 for a quick, initial analysis of loop performance that gives an overview of the hotspots in the code, as well as a roofline analysis chart, which helps visualize performance and optimize code for your current hardware. This initial analysis can then be followed by more in-depth analysis using the graphical user interface (GUI) in Intel Advisor 2019.

Introduction

Intel has developed several software products aimed at increasing productivity of software developers and helping them to make the best use of Intel® processors. One of these tools is Intel® Parallel Studio XE, which contains a set of compilers and analysis tools that let the user write, analyze, and optimize their application on Intel® hardware.

In this article we explore Intel Advisor 2019, which is one of the analysis tools in the Intel Parallel Studio XE suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.

How to Use Intel® Advisor

The most effective way to use Intel Advisor is via the GUI. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Detailed information, documentation, training materials, and code samples, as well as product support and access to the Intel Advisor community forum can be found at Intel Advisor Support.

Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and clusters, and that produces output in a form that makes analysis tasks easy to automate, for example, using scripts.

When working on Intel® Xeon® Scalable processors using the Linux* OS, we might need to use a combination of the Intel Advisor GUI and CLI for our specific analysis workflow. In some cases the CLI will be a good starting point for a quick view of a performance summary as well as in the initial phases of our workflow analysis. Detailed information can be found at the Intel® Advisor User Guide Command Line Interface Reference.

A comprehensive guide to the tool can be found at Intel® Advisor User Guide.

In the next sections, a procedure for a quick, initial performance analysis on Linux using the Intel Advisor CLI is described. This analysis gives us an idea of the performance bottlenecks in our application and where to focus initial optimization efforts. Also, for testing purposes, this procedure allows the user to automate testing and results reporting. This procedure might also be useful, in some cases, in the Windows* OS.

This analysis is intended as an initial step and provides access to only limited information. The full extent of the information and help offered by Intel Advisor is available using a combination of the Intel Advisor GUI and CLI.

Using Intel Advisor: Running a Quick Survey Analysis

To illustrate this procedure, we will show how to use the Intel Advisor Roofline chart using the Intel Advisor standalone GUI, using the code sample from the tutorial, Intel® Advisor Tutorial: Use Automated Roofline Chart to Make Optimization Decisions. Details of the source code are discussed in that tutorial. Intel Advisor can be downloaded from the Intel Advisor web site.

The first step for a quick analysis is to create an optimized executable that runs on the Intel® Xeon® processor. For this, we start by compiling our application with a set of options that direct the compiler to create this executable in a way from which Intel Advisor will be able to extract information.

The code sample includes a Makefile that can be used to build the executable. In this file, the line that defines the variable CXXFLAGS can be changed to specify the Intel® C++ Compiler options for the desired purpose.

NOTE: Although Intel Advisor works with any compiler, it is particularly effective when using the information from the reports generated by Intel compilers.

For example, to enable the use of all the subsets of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) that are supported by the Intel Xeon Scalable processors, using the Intel C++ Compiler, we can define the variable as follows:

CXXFLAGS=-c -O3 -xCORE-AVX512 -g -qopt-report=5 $(INCLUDES)

where the -g flag generates the debugging information and symbols that Intel Advisor uses to produce a complete analysis. The -qopt-report=5 option produces a compiler report with information about vectorization and other optimizations. The executable must be optimized, so the -O3 option is used here; either -O2 or -O3 works for this purpose.

Next, we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point because it provides information that lets us identify how our code is using vectorization and where the hotspots for analysis are.

advixe-cl -collect survey -project-dir ./AdvProj_RooflineExample -search-dir all:=./src -- ./release/roofline_demo

The above command runs the Intel Advisor tool and creates a project directory, AdvProj_RooflineExample, in the current directory. Inside this directory, Intel Advisor creates, among other things, a directory named e000 containing the results of the analysis. If we list the contents of the results directory, we see the following:

$ ls AdvProj_RooflineExample/e000/
callstacks.def  e000.advixeexp  hs000  loop_hashes.def
$

The directory hs000 contains results from the survey analysis just created.

The next step is to view the results of the survey analysis. Here we use the CLI to generate the report: we replace the -collect option with the -report option and point -project-dir at the same project directory in which the data was collected. The following command generates a survey report from the survey data stored in the results directory of our project:

$ advixe-cl -report survey -project-dir ./AdvProj_RooflineExample -format=text -report-output=./REPORTS/report_survey.txt

The above command creates a report named "report_survey.txt" in the subdirectory "REPORTS". The report is laid out in columns, and there are enough of them that it might be difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the -filter option (in the current version of Intel Advisor, this option is supported only for the survey report).

Another option is to create an XML-formatted report. We can do this if we change the value for the -format option from text to xml:

$ advixe-cl -report survey -project-dir ./AdvProj_RooflineExample -format=xml -report-output=./REPORTS/report_survey.xml

The XML-formatted report might be easier to read on a small screen, or by a script, because the information in the columns in the report file is condensed into one column. Here is a fragment of it:

(…)
  <function_call_site_or_loop ID="21"
            Function_Call_Sites_and_Loops="[loop in main at roofline.cpp:295]"
                              Self_Time="12.628s"
                              Total_Time="12.628s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="7.88x"
                              Average_Trip_Count=""
                              Min_Trip_Count=""
                              Max_Trip_Count=""
                              Call_Count=""
                              Transformations=""
                              Source_Location="roofline.cpp:295"
                              Module="roofline_demo">
  </function_call_site_or_loop>
  <function_call_site_or_loop ID="11"
            Function_Call_Sites_and_Loops="[loop in main at roofline.cpp:221]"
                              Self_Time="10.052s"
                              Total_Time="10.052s"
                              Type="Scalar"
                              Why_No_Vectorization="novector directive used"
                              Vector_ISA=""
                              Compiler_Estimated_Gain=""
                              Average_Trip_Count=""
                              Min_Trip_Count=""
                              Max_Trip_Count=""
                              Call_Count=""
                              Transformations="Unrolled"
                              Source_Location="roofline.cpp:221"
                              Module="roofline_demo">
  </function_call_site_or_loop>
(…)

Recall that the survey option in the Intel Advisor tool generates a performance overview of the loops in the application. The example above shows that the loop starting on line 295 in the source code has been vectorized using Intel AVX-512 ISA. It also shows an estimate of the improvement of the loop’s performance (compared to a scalar version) and timing information. The second block in the example above gives the performance overview for the loop at line 221 in the source code. It shows that this loop was not vectorized (it runs in scalar mode) because of a #pragma novector directive used in the source code.
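Because each attribute in the XML fragment above sits on its own line, a few lines of awk are enough to condense the report into a compact table. The following is a minimal sketch and not part of Intel Advisor: the function name summarize_survey is hypothetical, and the script assumes the one-attribute-per-line layout shown above, which may differ in other versions of the tool.

```shell
# Sketch (hypothetical helper): print one tab-separated line per loop from an
# XML survey report. Assumes each attribute is on its own line, as in the
# fragment shown in this article.
summarize_survey() {
  awk -F'"' '
    / ID="/                  { id = $2 }    # loop ID assigned by Intel Advisor
    /Self_Time="/            { self = $2 }  # e.g. 12.628s
    / Type="/                { type = $2 }  # e.g. Scalar or Vectorized (Body)
    /Why_No_Vectorization="/ { why = $2 }
    /Source_Location="/      { src = $2 }
    /<\/function_call_site_or_loop>/ {
      # End of one element: emit the collected fields and reset them.
      printf "%s\t%s\t%s\t%s\t%s\n", id, src, self, type, (why == "" ? "-" : why)
      id = self = type = why = src = ""
    }
  ' "$1"
}

# Example call (path taken from the report command above):
# summarize_survey ./REPORTS/report_survey.xml
```

Each output line holds the loop ID, the source location, the self time, the loop type, and the reason vectorization was not applied (or - when that field is empty).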

Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops, in order to keep track of them in future analysis. For example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line. An example of this procedure is given later in this tutorial.

The above is a quick way to run and visualize a vectorization analysis on the Intel Xeon processor. This procedure lets us view the basic vectorization information from our code with minimum effort. It also lets us create summaries of progressive optimization steps in the form of tables or plots (if we have run several of these analyses at different stages of the optimization process). However, if we need to access more advanced information from our analysis, like traits or the assembly code, we can use the Intel Advisor GUI, possibly from a different computer (either by copying the project folder to another computer or by accessing it over the network), and access the complete information that Intel Advisor offers.

For example, Figures 1 and 2 show what the Intel Advisor GUI looks like for the survey analysis shown above. We can see that, besides the information contained in the CLI report, the Intel Advisor GUI offers other information, such as traits, source code, and assembly code.

Figure 1. Fragment of Intel® Advisor GUI showing the information about the topmost time-consuming loops.

Figure 2. Fragment of the Intel® Advisor GUI showing the source and assembly code for the most time-consuming loop (in line 295).

Collecting More Detailed Information

Once we look at the performance summary reported by the Intel Advisor tool using the Survey option, we can use other options to add more specific information to the reports. One option is to run the tripcounts analysis to get information about the number of times loops are executed.

To add this information to our project, we can use the Intel Advisor tool to run a tripcounts analysis on the same project we used for the survey analysis:

$ advixe-cl -collect tripcounts -project-dir ./AdvProj_RooflineExample -search-dir all:=./src -- ./release/roofline_demo

And similarly, to generate a tripcounts report:

$ advixe-cl -report tripcounts -project-dir ./AdvProj_RooflineExample -format=xml -report-output=./REPORTS/report_survey.xml

Now the XML-formatted report contains information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the xml report will be populated, while the information from the survey report will be preserved. Next is a fragment of the enhanced report (only the first, most time-consuming loop is shown):

(…)
  <function_call_site_or_loop ID="21"
            Function_Call_Sites_and_Loops="[loop in main at roofline.cpp:295]"
                              Self_Time="12.628s"
                              Total_Time="12.628s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="7.88x"
                              Average_Trip_Count="250"
                              Min_Trip_Count="250"
                              Max_Trip_Count="250"
                              Call_Count="30000000"
                              Transformations=""
                              Source_Location="roofline.cpp:295"
                              Module="roofline_demo">
  </function_call_site_or_loop>
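
The trip count fields can be combined into a rough estimate of how many loop iterations were executed in total: the average trip count multiplied by the call count. A minimal sketch of that arithmetic, using the values from the fragment above, follows. The function name total_iterations is hypothetical, and for a vectorized loop the counts presumably refer to iterations of the vectorized loop body, each of which processes several elements.

```shell
# Sketch (hypothetical helper): estimate total loop iterations as
# Average_Trip_Count * Call_Count.
total_iterations() {
  # %.0f is used instead of %d because some awk implementations
  # overflow %d above 2^31.
  awk -v trips="$1" -v calls="$2" 'BEGIN { printf "%.0f\n", trips * calls }'
}

total_iterations 250 30000000   # prints 7500000000
```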

Lastly, if we want to generate a roofline report, we need to add the -flop option to the analysis. This option adds the floating point and integer operations data to the tripcounts analysis:

advixe-cl -collect tripcounts -flop -project-dir ./AdvProj_RooflineExample -search-dir all:=./src -- ./release/roofline_demo

And we can create an HTML formatted roofline chart:

advixe-cl -report roofline -project-dir ./AdvProj_RooflineExample  -report-output=./REPORTS/report_roofline.html

Intel Advisor 2019 (and 2018) lets us collect roofline data with a single command (replacing the three steps shown above). To create a project containing the survey, tripcounts, and flop information, it is sufficient to type the following command:

advixe-cl -collect roofline -project-dir ./AdvProj_RooflineExample2 -search-dir all:=./src -- ./release/roofline_demo

followed by the command to create the roofline chart:

advixe-cl -report roofline -project-dir ./AdvProj_RooflineExample2  -report-output=./REPORTS/report_roofline.html
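
For automated testing, the two commands above can be wrapped in a small script. The following is a sketch under stated assumptions rather than a recommended workflow: the names run_step, roofline_pipeline, and DRY_RUN are hypothetical, and the script creates the output directory first on the assumption that it may not exist yet. With DRY_RUN=1 the commands are only printed, which is handy for checking the command lines on a machine where Intel Advisor is not installed.

```shell
# Sketch (hypothetical helpers): collect roofline data and render the HTML
# chart, using only the advixe-cl options shown in this article.

# Run a command, or just print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# roofline_pipeline <project-dir> <source-dir> <executable> <report-dir>
roofline_pipeline() {
  rp_project=$1
  rp_src=$2
  rp_app=$3
  rp_out=$4
  # Stop at the first failing step.
  run_step mkdir -p "$rp_out" &&
  run_step advixe-cl -collect roofline -project-dir "$rp_project" \
           -search-dir "all:=$rp_src" -- "$rp_app" &&
  run_step advixe-cl -report roofline -project-dir "$rp_project" \
           "-report-output=$rp_out/report_roofline.html"
}

# Example call, matching the commands above:
# roofline_pipeline ./AdvProj_RooflineExample2 ./src ./release/roofline_demo ./REPORTS
```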

The HTML formatted roofline chart can be copied to other machines and opened with a web browser. This makes it easier to do a quick analysis without a GUI, as well as to share the roofline results with colleagues. Figure 3 shows a snapshot of the HTML formatted chart displayed in a web browser.

Figure 3. Roofline chart displayed in a web browser.

In Figure 3 (the upper-left section), notice that it is possible to choose the number of cores used for roof modeling. As this sample code is a single-threaded application, one core is selected. For multithreaded applications, the number of cores used for roof modeling can be changed to adequately represent peak performance on the host machine.

On top of the survey, tripcounts, and roofline reports, we can generate other types of reports that give us other useful information about our loops. The -help collect and -help report options of the Intel Advisor command-line tool show what types of collections and reports are available:

$ advixe-cl -help collect
(…)

For example, to obtain memory access pattern details in our source code, we can run a memory access patterns (MAP) analysis and report using the map option, and use the flag --mark-up-list to focus the analysis only on the most time-consuming loop (which is labeled as “21” in the previous XML-formatted reports):

$ advixe-cl -collect map -project-dir ./AdvProj_RooflineExample --mark-up-list=21 -search-dir all:=./src -- ./release/roofline_demo
$ advixe-cl -report map -format=xml -project-dir ./AdvProj_RooflineExample -search-dir all:=./src -- ./release/roofline_demo
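
In a script, the loop ID does not have to be copied by hand; it can be read from the XML survey report. The sketch below picks the loop with the largest self time. The function name hottest_loop_id is hypothetical, and the script assumes the report layout shown earlier (one attribute per line) and that all Self_Time values are given in seconds.

```shell
# Sketch (hypothetical helper): print the ID of the loop with the largest
# Self_Time in an XML survey report.
hottest_loop_id() {
  awk -F'"' '
    / ID="/       { id = $2 }
    /Self_Time="/ { t = $2 + 0 }   # "12.628s" converts to 12.628
    /<\/function_call_site_or_loop>/ {
      if (best == "" || t > max) { max = t; best = id }
      id = ""; t = 0
    }
    END { if (best != "") print best }
  ' "$1"
}

# Example use with the MAP collection shown above:
# advixe-cl -collect map -project-dir ./AdvProj_RooflineExample \
#   --mark-up-list="$(hottest_loop_id ./REPORTS/report_survey.xml)" \
#   -search-dir all:=./src -- ./release/roofline_demo
```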

In all the above cases, the project directory (in this example, AdvProj_RooflineExample) contains all the information necessary to perform a full analysis using the GUI. When we are ready to use the GUI, we can copy the project directory to a workstation/laptop (or access it over the file system) and run the GUI-based Intel Advisor from there.

An example of a situation where we should consider using the GUI might be if we have results from different analyses (performed on the same machine) and we want to consolidate and compare them on a single roofline chart. Specifically, let us generate results from running the sample code twice, first using an executable created using the Intel C++ Compiler option -qopt-zmm-usage=low, which tells the compiler that the compiled program is unlikely to benefit from zmm registers usage, and store the results in the directory AdvProj_RooflineExample. Then, we generate a second set of results using the option -qopt-zmm-usage=high, which tells the compiler to generate code using zmm registers, and store the results in directory AdvProj_RooflineExample2. We can run the Intel Advisor GUI to compare both results in the same roofline chart, as shown in Figure 4.

Notice that in the figure, the results that are faded out (those not selected in the drop-down menu) correspond to the results from using the low option, while those with bright colors (selected in the drop-down menu) correspond to the results from using the high option. There is no difference in the loops that were run in scalar mode. Only the vectorized loops show the actual performance gains when using zmm registers. More information about the -qopt-zmm-usage option can be obtained from the Intel C++ Compiler documentation.

One more feature that is new in Intel Advisor 2019 is support for integer operations in the roofline charts. If the integer operations are selected (see menu in upper-left section in Figure 4), peak integer performance roofs will be added to the chart.

To get more information about the new features described above and more, please go to the Intel Advisor web site. The What’s New section lists the latest features available in both the CLI and the GUI.

Figure 4. Using Intel® Advisor's GUI to compare several results in the same roofline chart.

Conclusion

Intel Advisor lets us identify issues in our code that might be preventing us from using our CPU resources efficiently. New features in Intel Advisor 2019 include, among others, new functionality and formats for roofline charts. These charts are useful for applications that are bound by hardware limits (L1/L2/DRAM, for example). Intel Advisor functionality can be accessed from both a GUI and a CLI.

This article described how to use the Intel Advisor CLI for a quick, preliminary analysis of vectorization performance on Intel Xeon Scalable processors: running the analyses and generating reports and roofline charts from a text window, with the idea of later obtaining more information about our code by using the Intel Advisor GUI.

This procedure is also useful for consolidating performance information after several iterations of source code optimization. A Unix* script (or similar) can be used to collect information from different reports and quickly consolidate it into tables or plots.
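
As an illustration of such a script, the sketch below collects the self time and type of one loop from several XML survey reports, for example one report per optimization step. The function name consolidate_loop and the report file names are hypothetical, and the script assumes the one-attribute-per-line report layout shown earlier.

```shell
# Sketch (hypothetical helper): print "report file, self time, type" for one
# loop, identified by its source location, across several XML survey reports.
# Usage: consolidate_loop <source-location> <report.xml>...
consolidate_loop() {
  cl_loc=$1
  shift
  awk -F'"' -v loc="$cl_loc" '
    /Self_Time="/       { self = $2 }
    / Type="/           { type = $2 }
    /Source_Location="/ { src = $2 }
    /<\/function_call_site_or_loop>/ {
      # Emit a row only for the requested loop, then reset the fields.
      if (src == loc) printf "%s\t%s\t%s\n", FILENAME, self, type
      self = type = src = ""
    }
  ' "$@"
}

# Example call (hypothetical report names, one per optimization step):
# consolidate_loop roofline.cpp:295 ./REPORTS/step1.xml ./REPORTS/step2.xml
```

The tab-separated output can be pasted into a spreadsheet or passed to a plotting tool.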

For more complete information about compiler optimizations, see our Optimization Notice.