In this article, we continue our exploration of vectorization on an Intel® Xeon Phi™ processor using the example loops from Improve Vectorization Performance with Intel® Advanced Vector Extensions 512 (AVX512). We will discuss how to use the command-line interface in Intel® Advisor 2017 for a quick, initial analysis of loop performance that gives an overview of the hotspots in the code. This initial analysis can then be followed by a more in-depth analysis using the graphical user interface (GUI) in Intel Advisor 2017.
Intel has developed several software products aimed at increasing productivity of software developers and helping them to make the best use of Intel® processors. One of these tools is Intel® Parallel Studio XE, which contains a set of compilers and analysis tools that let the user write, analyze and optimize their application on Intel hardware.
In this article, we explore Intel® Advisor 2017, which is one of the analysis tools in the Intel Parallel Studio XE suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.
How does Intel® Advisor help with Vectorization?
Vector-level parallelism allows the software to use special hardware like vector registers and SIMD (Single Instruction Multiple Data) instructions. New Intel® processors, such as the Intel® Xeon Phi™ processor, feature 512-bit wide vector registers which, in conjunction with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA, allow the use of two vector processing units in each core, each capable of processing 16 single-precision (32-bit) or 8 double-precision (64-bit) floating-point numbers.
To realize the full performance of modern processors, code must also be threaded to take advantage of multiple cores. The effects of vectorization and threading multiply, so combining them accelerates code more than either vectorization or threading alone.
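The lane counts quoted above follow directly from the register width, and the combined reach of vectorization and threading can be estimated the same way. The following Python sketch is only a back-of-the-envelope illustration: the function names are hypothetical, the core count is the one from the test system described later in this article, and the result is an ideal upper bound on elements processed at once, not a predicted speedup.

```python
REGISTER_BITS = 512   # width of an Intel AVX-512 vector register
VPUS_PER_CORE = 2     # vector processing units per core, as stated above


def lanes_per_register(element_bits):
    """Number of elements of the given size that fit in one vector register."""
    return REGISTER_BITS // element_bits


def peak_lanes(cores, element_bits):
    """Ideal count of elements processed at once across all cores.

    Assumes every vector unit on every core is busy at the same time,
    which real code rarely achieves.
    """
    return cores * VPUS_PER_CORE * lanes_per_register(element_bits)


if __name__ == "__main__":
    print("single-precision lanes per register:", lanes_per_register(32))
    print("double-precision lanes per register:", lanes_per_register(64))
    print("ideal single-precision lanes on 68 cores:", peak_lanes(68, 32))
```

The gap between this ideal figure and measured performance is what the analysis described in the rest of the article helps to explain.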
Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.
Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.
How to use Intel® Advisor
The most effective way to use Intel Advisor is via the GUI. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Documentation, training materials, code samples, product support, and the Intel Advisor community forum are available at https://software.intel.com/en-us/intel-advisor-xe-support.
Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and generate information in a form that makes analysis tasks easy to automate, for example with scripts.
When working on an Intel Xeon Phi processor running the Linux* OS, we might need a combination of the Intel Advisor GUI and CLI for our specific analysis workflow. In some cases the CLI is a good starting point for a quick performance summary and for the initial phases of the analysis. Detailed information about the Intel Advisor CLI for Linux can be found at https://software.intel.com/en-us/node/634769.
The next sections describe a procedure for a quick initial performance analysis on Linux using the Intel Advisor CLI. This quick analysis gives us an idea of the performance bottlenecks in our application and where to focus initial optimization efforts. The procedure also allows the user to automate testing and the reporting of results.
This analysis is intended as an initial step and will provide access to only limited information. The full extent of the information and help offered by Intel Advisor is available using a combination of the Intel Advisor GUI and CLI.
Using Intel Advisor on an Intel® Xeon Phi™ processor
Running a quick survey analysis
To illustrate this procedure, I will use the code sample from a previous article, Improve Vectorization Performance with Intel® Advanced Vector Extensions 512 (AVX512), which shows vectorization improvements when using the Intel AVX-512 ISA. Details of the source code are discussed in that article. The sample code can be downloaded from here.
This example will be run on the following hardware:
Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272
The first step for a quick analysis is to create an optimized executable that will run on the Intel Xeon Phi processor. For this, we compile our application with a set of options that direct the compiler to create the executable in a way that Intel Advisor can extract information from. The options that must be used are -xMIC-AVX512, which enables all the subsets of Intel Advanced Vector Extensions 512 that are supported by the Intel® Xeon Phi™ processor (Zhang, 2016), and -g, which generates debugging information and symbols. The executable must also be optimized, so the -O3 option is used here; either -O2 or -O3 can be used for this purpose.
$ icpc Histogram_Example.cpp -g -O3 -restrict -xMIC-AVX512 -o run512 -lopencv_highgui -lopencv_core -lopencv_imgproc
Notice that we have also used the -restrict option, which informs the compiler that the pointers used in this application are not aliased. Also notice that we are linking the application with the OpenCV* library (www.opencv.org), which we use in this application to read an image from disk. A Makefile is included with the sample code and can be used to generate an executable for Intel Advisor.
Next, we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point for analysis, because it provides information that will let us identify how our code is using vectorization and where the hotspots for analysis are.
$ advixe-cl -collect survey -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
The above command runs the Intel Advisor tool and creates a project directory, AdvProj-Example-AVX512. Inside this directory, Intel Advisor creates, among other things, a directory named e000 containing the results of the analysis. If we list the contents of the results directory, we see the following:
$ ls AdvProj-Example-AVX512/e000/
e000.advixeexp  hs000  loop_hashes.def
Here, hs000 contains the results from the survey analysis just created.
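When the collection step is scripted, it can be useful to confirm that survey results exist before asking for a report. The sketch below is one possible check, not part of Intel Advisor: the function name is hypothetical, and the names e000 and hs000 are simply the ones observed in this article's run, so other runs or Advisor versions may use different names.

```python
from pathlib import Path
from typing import List


def survey_results(project_dir: str, experiment: str = "e000") -> List[str]:
    """Return the sorted names of entries starting with 'hs' in an experiment directory.

    Assumption: survey results live in entries named hs000, hs001, ...
    as seen in the listing above. An empty list means nothing was found.
    """
    exp_dir = Path(project_dir) / experiment
    if not exp_dir.is_dir():
        return []
    return sorted(p.name for p in exp_dir.iterdir() if p.name.startswith("hs"))


if __name__ == "__main__":
    found = survey_results("./AdvProj-Example-AVX512")
    print("survey results:", ", ".join(found) if found else "none found")
```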
The next step is to view the results of the survey analysis performed by the Intel Advisor tool. Here we will use the CLI to generate the report. To do this, we replace the -collect option with the -report option, making sure we refer to the same project directory where the data has been collected. We can use the following command to generate a survey report from the survey data contained in the results directory of our project directory:
$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=text -report-output=./REPORTS/survey-AVX512.txt
The above command will create a report named survey-AVX512.txt in the subdirectory REPORTS. This report is laid out in columns and contains several of them, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the -filter option (in the current version of Intel Advisor, filtering is supported only for the survey report).
Another option is to create an xml-formatted report. We can do this by changing the value of the -format option from text to xml:
$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/survey-AVX512.xml
The xml-formatted report might be easier to read on a small screen, because the information in the columns in the report file is condensed into one column. Here is a fragment of it:
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="4" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]" Self_Time="0.060s" Total_Time="0.120s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:107" Module="run512">
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="8" name="[loop in main at Histogram_Example.cpp:87]" Self_Time="0.030s" Total_Time="0.030s" Type="Vectorized (Body; [Remainder])" Why_No_Vectorization="1 vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override " Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:87" Module="run512">
</function_call_site_or_loop>
<function_call_site_or_loop ID="1" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:87]" Self_Time="0.030s" Total_Time="0.030s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:87" Module="run512">
Recall that the survey analysis in the Intel Advisor tool generates a performance overview of the loops in the application. The fragment above shows that the loop starting on line 107 in the source code has been vectorized using the Intel AVX-512 ISA. It also shows the compiler's estimate of the improvement in the loop's performance (compared to a scalar version) and timing information. The second and third blocks give the performance overview for the loop at line 87 in the source code. They show that the body of the loop has been vectorized, but the remainder of the loop has not.
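Because the xml report stores each loop as an element with attributes, a short script can pull out the fields of interest. The following Python sketch uses the standard xml.etree.ElementTree module on a well-formed stand-in for the fragment above. The enclosing report element, the self-closing tags, and the function names are assumptions made for illustration, and only a subset of the attributes is reproduced, so the layout of a real report file should be checked before relying on it.

```python
import xml.etree.ElementTree as ET

# Stand-in for the report fragment shown above. The <report> root element is
# an assumption made so that the sample parses as a complete document.
SAMPLE = """<report>
  <function_call_site_or_loop ID="4" Self_Time="0.060s" Total_Time="0.120s"
      Type="Vectorized (Body)" Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x"
      Source_Location="Histogram_Example.cpp:107"/>
  <function_call_site_or_loop ID="8" Self_Time="0.030s" Total_Time="0.030s"
      Type="Vectorized (Body; [Remainder])" Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x"
      Source_Location="Histogram_Example.cpp:87"/>
  <function_call_site_or_loop ID="1" Self_Time="0.030s" Total_Time="0.030s"
      Type="Vectorized (Body)" Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x"
      Source_Location="Histogram_Example.cpp:87"/>
</report>"""


def seconds(text):
    """Convert a value such as '0.060s' to a float; empty values become 0.0."""
    text = (text or "").strip().rstrip("s")
    return float(text) if text else 0.0


def parse_loops(xml_text):
    """Return one dict per loop entry, sorted by self time (largest first)."""
    root = ET.fromstring(xml_text)
    loops = []
    # iter() also finds nested entries, in case loops are reported as children.
    for node in root.iter("function_call_site_or_loop"):
        loops.append({
            "id": node.get("ID"),
            "location": node.get("Source_Location"),
            "type": node.get("Type"),
            "isa": node.get("Vector_ISA"),
            "self_time_s": seconds(node.get("Self_Time")),
            "gain": node.get("Compiler_Estimated_Gain"),
        })
    return sorted(loops, key=lambda loop: loop["self_time_s"], reverse=True)


if __name__ == "__main__":
    for loop in parse_loops(SAMPLE):
        print(f'{loop["id"]:>3}  {loop["location"]:<28} '
              f'{loop["self_time_s"]:.3f}s  {loop["gain"]:>7}  {loop["type"]}')
```

Sorting by self time puts the loop at line 107 first, which matches the reading of the report given above.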
Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).
The above is a quick way to run and visualize a vectorization analysis on the Intel Xeon Phi processor. This procedure lets us quickly see the basic vectorization information for our code with minimum effort. It also lets us create quick summaries of progressive optimization steps in the form of tables or plots (if we have run several of these analyses at different stages of the optimization process). However, if we need more advanced information from our analysis, like traits or the assembly code, we can use the Intel Advisor GUI, possibly from a different computer (either by copying the project folder to that computer or by accessing it over the network), and access the complete information that Intel Advisor offers.
For example, the next figure shows what the Intel Advisor GUI looks like for the survey analysis shown above. We can see that, besides the information contained in the CLI report, the Intel Advisor GUI offers other information, like traits and source and assembly code.
Collecting more detailed information
Once we have looked at the performance summary reported by the Intel Advisor tool using the survey option, we can use other options to add more specific information to the reports. One option is to run the tripcounts analysis to get information about the number of times loops are executed.
To add this information to our project, we can use the Intel Advisor tool to run a tripcounts analysis on the same project we used for the survey analysis:
$ advixe-cl -collect tripcounts -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
Similarly, we can generate a tripcounts report:
$ advixe-cl -report tripcounts -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/tripcounts-AVX512.xml
Now the xml-formatted report will contain information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the xml report will be populated, while the information from the survey report will be preserved. Next is a fragment of the enhanced report (only the first, most time-consuming loop is shown):
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="4" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]" Self_Time="0.070s" Total_Time="0.120s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x" Trip_Counts_Average="761670" Trip_Counts_Min="761670" Trip_Counts_Max="761670" Trip_Counts_Call_Count="1" Transformations="" Source_Location="Histogram_Example.cpp:107" Module="run512">
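The populated Trip_Counts fields can be read with the same approach used for any other attribute. The Python sketch below, again on a well-formed stand-in with an assumed report root element and a hypothetical function name, skips loops whose trip-count fields are still empty and derives two simple figures: the total number of iterations (average trip count times call count) and whether the trip count was constant across calls. These derived figures are illustrations, not values that Intel Advisor reports.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tripcounts fragment shown above; the <report> root element
# is an assumption, and only the attributes used below are reproduced.
SAMPLE = """<report>
  <function_call_site_or_loop ID="4" Self_Time="0.070s"
      Trip_Counts_Average="761670" Trip_Counts_Min="761670"
      Trip_Counts_Max="761670" Trip_Counts_Call_Count="1"
      Source_Location="Histogram_Example.cpp:107"/>
</report>"""


def trip_count_summary(xml_text):
    """Return a list of dicts describing loops that have trip-count data."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("function_call_site_or_loop"):
        average = node.get("Trip_Counts_Average")
        calls = node.get("Trip_Counts_Call_Count")
        if not average or not calls:
            continue  # fields stay empty until a tripcounts collection has run
        rows.append({
            "location": node.get("Source_Location"),
            "average": float(average),
            "calls": int(calls),
            # Derived figure: average iterations per call times number of calls.
            "total_iterations": float(average) * int(calls),
            # True when the loop ran the same number of iterations every call.
            "constant": node.get("Trip_Counts_Min") == node.get("Trip_Counts_Max"),
        })
    return rows


if __name__ == "__main__":
    for row in trip_count_summary(SAMPLE):
        print(row["location"], int(row["total_iterations"]),
              "constant" if row["constant"] else "varies")
```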
In a similar way, we can generate other types of reports that will give us other useful information about our loops. The -help collect and -help report options in the command-line Intel Advisor tool will show what types of collections and reports are available:
$ advixe-cl -help collect
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-c, -collect=<string>    Collect specified data. Specifying --search-dir when collecting data is strongly recommended.

Usage: advixe-cl -collect=<string> [-action-option] [-global-option] [--] <target> [<target options>]

<string> is one of the following analysis types to perform on <target>:

    survey        - Explore where to add efficient vectorization and/or threading.
    dependencies  - Identify and explore loop-carried dependencies for marked loops.
    map           - Identify and explore complex memory accesses for marked loops.
    suitability   - Analyze the annotated program to check its predicted parallel performance.
    tripcounts    - Find how many iterations are executed.

$ advixe-cl -help report
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

-R, -report=<string>     Report the results that were previously gathered. Generates a formatted data report with the specified type and action options.

Usage: advixe-cl -report=<string> [-action-option] [-global-option] [--] <target> [<target options>]

<string> is the list of available reports:

    survey        - shows results of the survey analysis
    annotations   - lists the annotations in the sources
    dependencies  - shows possible dependencies
    hotspots      -
    issues        -
    map           - reports memory access patterns
    suitability   - shows possible performance gains
    summary       - shows the collection summary
    threads       - shows the list of threads
    top-down      - shows the report in a top-down view
    tripcounts    - shows survey report with tripcounts data added
For example, to obtain memory access pattern details in our source code, we can run a memory access patterns (MAP) analysis using the map collection type and then generate the corresponding report:
$ advixe-cl -collect map -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
$ advixe-cl -report map -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/map-AVX512.xml
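Since the collect and report commands differ only in the analysis type, they are easy to generate from a script. The Python sketch below assembles the command lines used in this article for the survey, tripcounts, and map analyses and prints them as a dry run. The function names are hypothetical, and only the options already shown in this article are used; nothing is executed.

```python
import shlex

# Values taken from the commands shown in this article.
PROJECT_DIR = "./AdvProj-Example-AVX512"
SEARCH_DIR = "all:=./src"
TARGET = ["./run512", "image01.jpg"]
EXTENSIONS = {"text": "txt", "xml": "xml"}  # report file suffix per format


def collect_command(analysis):
    """Argument list for collecting one analysis type into the project."""
    return ["advixe-cl", "-collect", analysis, "-project-dir", PROJECT_DIR,
            "-search-dir", SEARCH_DIR, "--", *TARGET]


def report_command(analysis, fmt="xml"):
    """Argument list for writing a report of one analysis type to ./REPORTS."""
    output = f"./REPORTS/{analysis}-AVX512.{EXTENSIONS[fmt]}"
    return ["advixe-cl", "-report", analysis, "-project-dir", PROJECT_DIR,
            f"-format={fmt}", f"-report-output={output}"]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for analysis in ("survey", "tripcounts", "map"):
        print(shlex.join(collect_command(analysis)))
        print(shlex.join(report_command(analysis)))
```

To actually run a command, each argument list could be passed to subprocess.run on a system where advixe-cl is installed.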
In all the above cases, the project directory (in this example, AdvProj-Example-AVX512) contains all the information necessary to perform a full analysis using the GUI. When we are ready to use the GUI, we can copy the project directory to a workstation/laptop (or access it over the filesystem) and run the GUI-based Intel Advisor from there, as was shown in a previous section in this article.
This article showed a simple way to quickly explore vectorization performance using Intel Advisor 2017. We used the Intel Advisor CLI to perform a quick, preliminary analysis and generate reports on the Intel Xeon Phi processor from a text window, with the idea of later obtaining more information about our code by using the Intel Advisor GUI.
This procedure will also be useful for consolidating performance information after several iterations of source code optimization. A Unix* script (or similar) can be used to collect information from different reports and quickly consolidate it into tables or plots.
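As a rough illustration of such a script, the Python sketch below merges the self time of each loop from several xml reports into one comma-separated table. The two embedded snippets reuse the values for the loop at line 107 from the survey and tripcounts fragments shown earlier. The report root element, the labels, and the function names are assumptions; in practice the labels would name optimization stages and the text would be read from the report files.

```python
import xml.etree.ElementTree as ET

# Stand-ins built from the two fragments shown earlier in this article.
REPORTS = {
    "survey": '<report><function_call_site_or_loop ID="4" Self_Time="0.060s" '
              'Source_Location="Histogram_Example.cpp:107"/></report>',
    "tripcounts": '<report><function_call_site_or_loop ID="4" Self_Time="0.070s" '
                  'Source_Location="Histogram_Example.cpp:107"/></report>',
}


def consolidate(reports):
    """Map each source location to {report label: self time in seconds}.

    Entries are keyed by source location, so if a report lists the same
    location more than once, the last entry wins.
    """
    table = {}
    for label, xml_text in reports.items():
        for node in ET.fromstring(xml_text).iter("function_call_site_or_loop"):
            time_text = (node.get("Self_Time") or "").rstrip("s")
            if not time_text:
                continue  # skip entries without timing data
            table.setdefault(node.get("Source_Location"), {})[label] = float(time_text)
    return table


def to_csv(table, labels):
    """Render the consolidated table as comma-separated text, one loop per row."""
    lines = ["loop," + ",".join(labels)]
    for location in sorted(table):
        cells = [f"{table[location][label]:.3f}" if label in table[location] else ""
                 for label in labels]
        lines.append(location + "," + ",".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_csv(consolidate(REPORTS), list(REPORTS)))
```

The resulting text can be loaded into a spreadsheet or plotting tool to follow how loop timings change between runs.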
Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."