Quick Analysis of Vectorization Using Intel® Advisor

In this article we continue our exploration of vectorization on an Intel® Xeon Phi™ processor, using the loop examples from Improve Vectorization Performance with Intel® Advanced Vector Extensions 512 (AVX512). We discuss how to use the command-line interface in Intel® Advisor 2017 for a quick, initial analysis of loop performance that gives an overview of the hotspots in the code. This initial analysis can then be followed by a more in-depth analysis using the graphical user interface (GUI) in Intel Advisor 2017.

Introduction

Intel has developed several software products aimed at increasing the productivity of software developers and helping them make the best use of Intel® processors. One of these products is Intel® Parallel Studio XE, a set of compilers and analysis tools that lets users write, analyze, and optimize their applications on Intel hardware.

In this article, we explore Intel® Advisor 2017, which is one of the analysis tools in the Intel Parallel Studio XE suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.

How does Intel® Advisor help with Vectorization?

Vector-level parallelism allows software to use special hardware such as vector registers and SIMD (Single Instruction Multiple Data) instructions. New Intel® processors, like the Intel® Xeon Phi™ processor, feature 512-bit wide vector registers which, in conjunction with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA, allow the use of two vector processing units in each core, each capable of processing 16 single-precision (32-bit) or 8 double-precision (64-bit) floating-point numbers at a time.
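The element counts quoted above follow directly from dividing the register width by the element width. The short Python sketch below only illustrates that arithmetic; the function name `lanes` is ours and is not part of any Intel tool.

```python
# Number of elements (lanes) that fit in one vector register.
REGISTER_BITS = 512  # AVX-512 register width, as stated in the text


def lanes(element_bits: int, register_bits: int = REGISTER_BITS) -> int:
    """Return how many elements of the given width fit in one register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


print("single precision (32-bit):", lanes(32))
print("double precision (64-bit):", lanes(64))
```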

To realize the full performance of modern processors, code must also be threaded to take advantage of multiple cores. Because the gains from vectorization and threading multiply, combining the two accelerates code more than either technique alone.

Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.

Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.

How to use Intel® Advisor

The most effective way to use Intel Advisor is via the GUI. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Documentation, training materials, and code samples are available at https://software.intel.com/en-us/intel-advisor-xe-support, which also provides product support and access to the Intel Advisor community forum.

Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and generate information in a form that makes analysis tasks easy to automate, for example with scripts.

When working on an Intel Xeon Phi processor, which runs the Linux* OS, we might need a combination of the Intel Advisor GUI and CLI for our specific analysis workflow. In some cases the CLI is a good starting point, both for a quick view of a performance summary and for the initial phases of the analysis. Detailed information about the Intel Advisor CLI for Linux can be found at https://software.intel.com/en-us/node/634769.

The next sections describe a procedure for a quick initial performance analysis on Linux using the Intel Advisor CLI. This quick analysis gives us an idea of the performance bottlenecks in our application and where to focus initial optimization efforts. For testing purposes, the procedure also lets the user automate testing and results reporting.

This analysis is intended as an initial step and will provide access to only limited information. The full extent of the information and help offered by Intel Advisor is available using a combination of the Intel Advisor GUI and CLI.

Using Intel Advisor on an Intel® Xeon Phi™ processor 

Running a quick survey analysis

To illustrate this procedure, we will use the code sample from a previous article, Improve Vectorization Performance with Intel® Advanced Vector Extensions 512 (AVX512), which shows vectorization improvements when using the Intel AVX-512 ISA. Details of the source code are discussed in that article. The sample code can be downloaded from here.

This example will be run on the following hardware:

Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272

The first step for a quick analysis is to create an optimized executable that will run on the Intel Xeon Phi processor. We compile the application with a set of options that direct the compiler to create the executable in a way that lets Intel Advisor extract information from it. The required options are -xMIC-AVX512, which enables all the subsets of Intel Advanced Vector Extensions 512 that are supported by the Intel® Xeon Phi™ processor (Zhang, 2016), and -g, which generates debugging information and symbols. The executable must also be optimized, so we add -O3; either the -O2 or the -O3 option can be used for this purpose.

$ icpc Histogram_Example.cpp -g -O3 -restrict -xMIC-AVX512 -o run512 -lopencv_highgui -lopencv_core -lopencv_imgproc

Notice that we have also used the -restrict option, which informs the compiler that the pointers used in this application are not aliased. Also notice that we are linking the application with the OpenCV* library (www.opencv.org), which we use in this application to read an image from disk. The downloadable sample code includes a Makefile that can be used to generate an executable for Intel Advisor.

Next, we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point for analysis, because it provides information that will let us identify how our code is using vectorization and where the hotspots for analysis are.

$ advixe-cl -collect survey -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

The above command runs the Intel Advisor tool and creates a project directory AdvProj-Example-AVX512. Inside this directory, Intel Advisor creates, among other things, a directory named e000 containing the results of the analysis. If we list the contents of the results directory, we see the following:

$ ls AdvProj-Example-AVX512/e000/
e000.advixeexp  hs000  loop_hashes.def
$

The directory hs000 contains results from the survey analysis just created.

The next step is to view the results of the survey analysis performed by the Intel Advisor tool. Here we will use the CLI to generate the report. To do this, we replace the -collect option with the -report one, making sure we refer to the same project directory where the data has been collected. We can use the following command to generate a survey report from the survey data that is contained in the results directory in our project directory:

$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=text -report-output=./REPORTS/survey-AVX512.txt

The above command will create a report named survey-AVX512.txt in the subdirectory REPORTS. This report is laid out in columns, and there are several of them, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the -filter option (in the current version of Intel Advisor, this is supported only for the survey report).

Another option is to create an XML-formatted report. We can do this by changing the value of the -format option from text to xml:

$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/survey-AVX512.xml

The XML-formatted report might be easier to read on a small screen, because the information spread across the columns of the text report is condensed into a single column. Here is a fragment of it:

(…)
</function_call_site_or_loop>
  <function_call_site_or_loop ID="4"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]"
                              Self_Time="0.060s"
                              Total_Time="0.120s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="3.37x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:107"
                              Module="run512">
  (…)
  </function_call_site_or_loop>
  <function_call_site_or_loop ID="8" name="[loop in main at Histogram_Example.cpp:87]"
                              Self_Time="0.030s"
                              Total_Time="0.030s"
                              Type="Vectorized (Body; [Remainder])"
                              Why_No_Vectorization="1 vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override "
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="20.53x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:87"
                              Module="run512">
  </function_call_site_or_loop>
  <function_call_site_or_loop ID="1"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:87]"
                              Self_Time="0.030s"
                              Total_Time="0.030s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="20.53x"
                              Trip_Counts_Average=""
                              Trip_Counts_Min=""
                              Trip_Counts_Max=""
                              Trip_Counts_Call_Count=""
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:87"
                              Module="run512">

Recall that the survey option in the Intel Advisor tool generates a performance overview of the loops in the application. For example, the fragment above shows that the loop starting on line 107 in the source code has been vectorized using the Intel AVX-512 ISA. It also shows an estimate of the improvement in the loop's performance (compared to a scalar version) and timing information. The second and third blocks in the fragment give a performance overview for the loop at line 87 in the source code. They show that the body of the loop has been vectorized, but the remainder of the loop has not.
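A report like the fragment above can be summarized with a few lines of scripting. The Python sketch below is illustrative only. It parses an embedded sample modeled on two entries from the fragment, and the `<report>` root element is an assumption, because the fragment does not show the real root of the file. The attribute names are the ones visible in the fragment; the function names are ours.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the fragment above. The <report> root element and the
# self-closing entries are assumptions made to keep the sample well-formed.
SAMPLE = """<report>
  <function_call_site_or_loop ID="4"
      Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]"
      Self_Time="0.060s" Total_Time="0.120s" Type="Vectorized (Body)"
      Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x"
      Source_Location="Histogram_Example.cpp:107" Module="run512"/>
  <function_call_site_or_loop ID="1"
      Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:87]"
      Self_Time="0.030s" Total_Time="0.030s" Type="Vectorized (Body)"
      Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x"
      Source_Location="Histogram_Example.cpp:87" Module="run512"/>
</report>"""


def seconds(text):
    """Convert a value such as "0.060s" to a float; empty fields become 0.0."""
    return float(text.rstrip("s")) if text else 0.0


def loops(xml_text):
    """Return one dict per loop entry, sorted by self time (largest first)."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("function_call_site_or_loop"):
        rows.append({
            "id": node.get("ID"),
            "location": node.get("Source_Location", ""),
            "self_time": seconds(node.get("Self_Time", "")),
            "type": node.get("Type", ""),
            "isa": node.get("Vector_ISA", ""),
            "gain": node.get("Compiler_Estimated_Gain", ""),
        })
    return sorted(rows, key=lambda row: row["self_time"], reverse=True)


for row in loops(SAMPLE):
    print(f'{row["location"]:<28} {row["self_time"]:.3f}s  '
          f'{row["type"]:<20} {row["isa"]:<7} {row["gain"]}')
```

To apply this to a real report, the text of the generated file would be passed to `loops` in place of `SAMPLE`, after checking how the actual file is structured.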

Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).

The above is a quick way to run and visualize a vectorization analysis on the Intel Xeon Phi processor. This procedure lets us quickly visualize the basic vectorization information from our codes with minimum effort. It also lets us create quick summaries of progressive optimization steps in the form of tables or plots (if we have run several of these analyses at different stages of the optimization process). However, if we need more advanced information from our analysis, like traits or the assembly code, we can use the Intel Advisor GUI, possibly from a different computer (either by copying the project folder to that computer or by accessing it over the network), and access the complete information that Intel Advisor offers.

For example, the next figure shows what the Intel Advisor GUI looks like for the survey analysis shown above. We can see that, besides the information contained in the CLI report, the Intel Advisor GUI offers other information, like traits and source and assembly code.

Intel Advisor GUI

Collecting more detailed information

Once we have looked at the performance summary reported by the Intel Advisor tool using the Survey option, we can use other options to add more specific information to the reports. One option is to run the Tripcounts analysis to get information about the number of times loops are executed.

To add this information to our project, we can use the Intel Advisor tool to run a tripcounts analysis on the same project we used for the survey analysis:

$ advixe-cl -collect tripcounts -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

Similarly, we can generate a tripcounts report:

$ advixe-cl -report tripcounts -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/tripcounts-AVX512.xml

Now the xml-formatted report will contain information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the xml report will be populated, while the information from the survey report will be preserved. Next is a fragment of the enhanced report (only the first, most time-consuming loop is shown):

(…)
  </function_call_site_or_loop>
  <function_call_site_or_loop ID="4"
   Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]"
                              Self_Time="0.070s"
                              Total_Time="0.120s"
                              Type="Vectorized (Body)"
                              Why_No_Vectorization=""
                              Vector_ISA="AVX512"
                              Compiler_Estimated_Gain="3.37x"
                              Trip_Counts_Average="761670"
                              Trip_Counts_Min="761670"
                              Trip_Counts_Max="761670"
                              Trip_Counts_Call_Count="1"
                              Transformations=""
                              Source_Location="Histogram_Example.cpp:107"
                              Module="run512">
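Once the trip count fields are populated, they can be combined with the timing fields. As a rough illustration, the Python sketch below estimates the average self time per iteration for the loop shown above. It rests on two assumptions that the report itself does not confirm: that the total number of iterations is the average trip count multiplied by the call count, and that the self-closing sample entry is an acceptable simplification of the real element. The report does not say whether the trip count refers to scalar or vector iterations, and the sampled self time is coarse, so the result is only an order-of-magnitude indication.

```python
import xml.etree.ElementTree as ET

# Values copied from the fragment above; the self-closing form is a simplification.
ENTRY = """<function_call_site_or_loop ID="4"
    Self_Time="0.070s"
    Trip_Counts_Average="761670"
    Trip_Counts_Call_Count="1"
    Source_Location="Histogram_Example.cpp:107"/>"""


def ns_per_iteration(node):
    """Average self time per iteration in nanoseconds, or None if unavailable."""
    average = node.get("Trip_Counts_Average", "")
    calls = node.get("Trip_Counts_Call_Count", "")
    if not average or not calls:
        return None  # the tripcounts analysis has not populated these fields
    # Assumption: total iterations = average trip count x number of calls.
    iterations = float(average) * int(calls)
    if iterations == 0:
        return None
    self_time = float((node.get("Self_Time") or "0s").rstrip("s"))
    return self_time / iterations * 1e9


node = ET.fromstring(ENTRY)
print(f'{node.get("Source_Location")}: '
      f'{ns_per_iteration(node):.1f} ns per iteration (rough estimate)')
```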

In a similar way, we can generate other types of reports that will give us other useful information about our loops. The -help collect and -help report options of the command-line Intel Advisor tool show which types of collections and reports are available:

$ advixe-cl -help collect
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-c, -collect=<string>         Collect specified data. Specifying --search-dir
                              when collecting data is strongly recommended.

Usage: advixe-cl -collect=<string> [-action-option] [-global-option] [--]
        <target> [<target options>]

        <string> is one of the following analysis types to perform on <target>:

            survey        - Explore where to add efficient vectorization and/or threading.
            dependencies  - Identify and explore loop-carried dependencies for marked loops.
            map           - Identify and explore complex memory accesses for marked loops.
            suitability   - Analyze the annotated program to check its predicted parallel performance.
            tripcounts    - Find how many iterations are executed.
$ advixe-cl -help report
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-R, -report=<string>          Report the results that were previously gathered.

Generates a formatted data report with the specified type and action options.

 Usage: advixe-cl -report=<string> [-action-option] [-global-option] [--]
        <target> [<target options>]

        <string> is the list of available reports:

            survey        - shows results of the survey analysis
            annotations   - lists the annotations in the sources
            dependencies  - shows possible dependencies
            hotspots      -
            issues        -
            map           - reports memory access patterns
            suitability   - shows possible performance gains
            summary       - shows the collection summary
            threads       - shows the list of threads
            top-down      - shows the report in a top-down view
            tripcounts    - shows survey report with tripcounts data added

For example, to obtain memory access pattern details in our source code, we can run a memory access patterns (MAP) analysis using the map option:

$ advixe-cl -collect map -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg

$ advixe-cl -report map -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/map-AVX512.xml
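The collect and report commands used so far all follow one pattern, which makes them easy to script. The Python sketch below builds the command lines from the options that appear in this article and prints them. The helper names `collect_cmd` and `report_cmd` are ours, and the sketch deliberately does not run anything; the commented `subprocess.run` line shows where execution would go on a machine where advixe-cl is installed.

```python
import shlex
import subprocess  # only needed if the commented run() call below is enabled

ADVISOR = "advixe-cl"  # CLI name as used in this article


def collect_cmd(analysis, project_dir, search_dir, target):
    """Build the collection command for one analysis type (e.g. "survey")."""
    return [ADVISOR, "-collect", analysis,
            "-project-dir", project_dir,
            "-search-dir", f"all:={search_dir}",
            "--", *target]


def report_cmd(analysis, project_dir, fmt, output):
    """Build the report command for data already collected in project_dir."""
    return [ADVISOR, "-report", analysis,
            "-project-dir", project_dir,
            f"-format={fmt}",
            f"-report-output={output}"]


PROJECT = "./AdvProj-Example-AVX512"
TARGET = ["./run512", "image01.jpg"]

commands = []
for analysis in ("survey", "tripcounts", "map"):
    commands.append(collect_cmd(analysis, PROJECT, "./src", TARGET))
    commands.append(report_cmd(analysis, PROJECT, "xml",
                               f"./REPORTS/{analysis}-AVX512.xml"))

for cmd in commands:
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to execute the command
```

Passing each command as a list avoids shell quoting problems if the target or its arguments contain spaces.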

In all the above cases, the project directory (in this example, AdvProj-Example-AVX512) contains all the information necessary to perform a full analysis using the GUI. When we are ready to use the GUI, we can copy the project directory to a workstation/laptop (or access it over the filesystem) and run the GUI-based Intel Advisor from there, as was shown in a previous section in this article.

Summary

This article showed a simple way to quickly explore vectorization performance using Intel Advisor 2017. We used the Intel Advisor CLI to perform a quick, preliminary analysis and generate a report on the Intel Xeon Phi processor from a text window, with the idea of later obtaining more information about our codes by using the Intel Advisor GUI.

This procedure will also be useful for consolidating performance information after several iterations of source code optimization. A Unix* script (or similar) can be used to collect information from different reports and quickly consolidate it into tables or plots.
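As one possible shape for such a script, the Python sketch below merges the self times from several XML reports into a single CSV table, with one row per loop and one column per report. It is a sketch under assumptions: the `<report>` root element in the sample is hypothetical, loops are matched by their Source_Location attribute, and the function names are ours. The two sample values are the self times reported for the loop at line 107 in the survey and tripcounts runs shown earlier.

```python
import csv
import io
import xml.etree.ElementTree as ET


def loop_times(xml_text):
    """Map Source_Location -> Self_Time in seconds for each loop in one report.

    If a location appears more than once (for example parent and child
    entries), the last entry wins.
    """
    times = {}
    for node in ET.fromstring(xml_text).iter("function_call_site_or_loop"):
        location = node.get("Source_Location")
        self_time = node.get("Self_Time", "")
        if location and self_time:
            times[location] = float(self_time.rstrip("s"))
    return times


def consolidate(reports):
    """reports: {label: xml_text}. Return CSV text, one row per loop."""
    labels = list(reports)
    per_label = {label: loop_times(text) for label, text in reports.items()}
    locations = sorted({loc for times in per_label.values() for loc in times})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["loop", *labels])
    for loc in locations:
        writer.writerow([loc, *(per_label[label].get(loc, "") for label in labels)])
    return out.getvalue()


def one_loop(self_time):
    """Hypothetical minimal report with a single loop entry."""
    return ('<report><function_call_site_or_loop ID="4" '
            'Source_Location="Histogram_Example.cpp:107" '
            f'Self_Time="{self_time}"/></report>')


print(consolidate({"survey": one_loop("0.060s"),
                   "tripcounts": one_loop("0.070s")}))
```

In practice the labels would name the optimization stages, and the XML text would be read from the report files generated at each stage.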

References

Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."

For more complete information about compiler optimizations, see our Optimization Notice.