Analyze Vectorization and Memory Aspects of an MPI Application
- For profiling MPI applications written under the Single Program Multiple Data (SPMD) framework, like WRF, it is enough to collect data on a single MPI rank only, since all ranks execute the same code on different subsets of data. This also decreases the collection overhead. You can use Application Performance Snapshot, which is part of Intel® VTune™ Profiler, to detect outlier ranks and selectively analyze them.
- For Multiple Program Multiple Data (MPMD) applications, you should analyze all MPI ranks.
- Performance analysis tools: Intel Advisor 2020 Gold. The latest version is available for download at https://software.intel.com/content/www/us/en/develop/tools/advisor/choose-download.html.
- Application: Weather Research and Forecasting (WRF) Model version 184.108.40.206. The WRF workload used is Conus12km. The application is available for download at https://www.mmm.ucar.edu/weather-research-and-forecasting-model. You also must install the following application dependencies: zlib-1.2.11, szip-2.1.1, hdf5-1.8.21, netcdf-c-4.6.3, and netcdf-fortran-4.4.5.
- Intel® C++ Compiler 2019 Update 5. The latest version is available for download at https://software.intel.com/en-us/c-compilers/choose-download.
- Intel® Fortran Compiler 2019 Update 5. The latest version is available for download at https://software.intel.com/content/www/us/en/develop/tools/compilers/fortran-compilers/choose-download.html.
- Other tools: Intel® MPI Library 2019 Update 6. The latest version is available for download at https://software.intel.com/content/www/us/en/develop/tools/mpi-library/choose-download.html.
- Operating system: CentOS* 7. Data was collected remotely on a CentOS 7 system using the Intel Advisor CLI, from a Windows* system connected over SSH. The collected results were moved to the Windows system and analyzed with the Intel Advisor GUI.
- CPU: Intel® Xeon® Platinum 8260L processor with the following configuration:

  ===== Processor composition =====
  Processor name    : Intel(R) Xeon(R) Platinum 8260L
  Packages(sockets) : 2
  Cores             : 48
  Processors(CPUs)  : 96
  Cores per package : 24
  Threads per core  : 2

  To view your processor configuration, source the mpivars.sh script of the Intel MPI Library and run the cpuinfo -g command.
- Set up the environment for the required software:

  source <compilers_installdir>/bin/compilervars.sh intel64
  source <mpi_library_installdir>/intel64/bin/mpivars.sh
  source <advisor_installdir>/advixe-vars.sh

  To verify that you successfully set up the tools, you can run the following commands. Each should print the product version:

  mpiicc -v
  mpiifort -v
  mpiexec -V
  advixe-cl --version
- Set the environment variables required for the WRF application:

  export LD_LIBRARY_PATH=/path_to_IO_libs/lib:$LD_LIBRARY_PATH
  ulimit -s unlimited
  export WRFIO_NCD_LARGE_FILE_SUPPORT=1
  export KMP_STACKSIZE=512M
  export OMP_NUM_THREADS=1
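For convenience, the settings above can be collected into a single sourceable script. This is only a sketch; setup_wrf_env.sh is a hypothetical name, and /path_to_IO_libs is the same placeholder used in the step above:

```shell
# setup_wrf_env.sh -- the WRF environment settings from the step above,
# collected into one reusable script (name and path are placeholders).
cat > setup_wrf_env.sh <<'EOF'
export LD_LIBRARY_PATH=/path_to_IO_libs/lib:$LD_LIBRARY_PATH
ulimit -s unlimited 2>/dev/null   # may be refused on restricted systems
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
export KMP_STACKSIZE=512M
export OMP_NUM_THREADS=1
EOF

# Source it in the shell that will launch mpiexec.
. ./setup_wrf_env.sh
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS KMP_STACKSIZE=$KMP_STACKSIZE"
```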
- Build the application in Release mode. The -g compile-time flag is recommended so that Intel Advisor can show source file names and locations.
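As a sketch of that recommendation: after WRF's ./configure generates a configure.wrf file, -g can be appended to the optimization flags before running the compile step. The fragment and variable names (FCOPTIM, CFLAGS_LOCAL) below are assumptions based on typical configure.wrf files and may differ on your system:

```shell
# Hypothetical configure.wrf fragment; the real file is generated by
# WRF's ./configure script, and the variable names may differ.
cat > configure.wrf <<'EOF'
FCOPTIM         =       -O3 -xHost
CFLAGS_LOCAL    =       -O3
EOF

# Append -g so Intel Advisor can map hot loops back to source files
# and line numbers without lowering the optimization level.
sed -i 's/^\(FCOPTIM[[:space:]]*=.*\)/\1 -g/' configure.wrf
sed -i 's/^\(CFLAGS_LOCAL[[:space:]]*=.*\)/\1 -g/' configure.wrf

grep -E '^(FCOPTIM|CFLAGS_LOCAL)' configure.wrf
```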
Survey Your Target Application
- Pass advixe-cl with its collection options as an argument to the mpiexec launcher.
- Use the -gtool flag to attach the analysis to the ranks specified after the name of the project directory.
mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=survey --project-dir=<project_dir>/project1:0" ./wrf.exe
mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=survey --project-dir=<project_dir>/project1:0,10-15,47" ./wrf.exe
Collect Trip Counts and FLOP Data and Review the Results
mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=tripcounts --flop --project-dir=<project_dir>/project1:0" ./wrf.exe
- Pack the result files, corresponding sources, and binaries in a single snapshot file with the .advixeexpz extension:

  advixe-cl --snapshot --project-dir=<project_dir>/project1 --pack --cache-sources --cache-binaries -- <snapshot_name>
- Move this snapshot to a local machine and open it with the Intel Advisor GUI.
- In the Summary tab of the survey report, review the program metrics, such as elapsed time, number of vectorized loops, vector instruction sets used, and GFLOPS.
- In the Survey & Roofline tab, review the details about the application performance with a list of loops/functions, the most time-consuming ones at the top. Use the messages in the Performance Issues and Why No Vectorization columns to identify the next steps for improving the application performance.
Review the Roofline Chart
- Based on the dot positions along the horizontal axis, we can see that WRF loops are bound by both compute and memory.
- Most loops take a similar amount of time, which is denoted by the size and color of the dots.
- In the Roofline report opened in the Intel Advisor GUI, click the Export button in the report toolbar and choose Export as HTML or Export as SVG.
- Run the CLI command with the --report=roofline option. For example:

  advixe-cl --report=roofline --project-dir=<project_dir>/project1 --report-output=./wrf_roofline.html

  For MPI applications, using the CLI command is recommended because you do not need an installation of the Intel Advisor GUI to use it.
Run the Dependencies Analysis (optional)
- Loop ID-based collection:
- Generate a Survey report to get loop IDs:

  advixe-cl --report=survey --project-dir=./<project_dir>/project1

  This command creates an advisor-survey.txt file with metrics for all loops in your application, sorted by self time. The loop IDs are in the first column of the table.
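Because the report is plain text, the IDs of the hottest loops can be pulled out with standard shell tools. The file contents below are a simplified stand-in; the real advisor-survey.txt layout can vary between Advisor versions, so adjust the field handling accordingly:

```shell
# A simplified stand-in for advisor-survey.txt; real reports have more
# columns, but the loop ID is the first field of each data row, and the
# rows are sorted by self time.
cat > advisor-survey.txt <<'EOF'
ID   Function Call Sites and Loops                       Self Time
235  [loop in advect_scalar at module_advect_em.f90]     12.40s
355  [loop in advance_uv at module_small_step_em.f90]     9.80s
12   [loop in wrf_init at init.f90]                       0.10s
EOF

# Take the IDs of the two hottest loops and join them with commas,
# ready to paste into --mark-up-list.
ids=$(awk 'NR > 1 { print $1 }' advisor-survey.txt | head -n 2 | paste -sd, -)
echo "$ids"   # prints: 235,355
```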
- Identify loops to run the deeper analysis on.
- Run the Dependencies analysis for the selected loops on rank 0 of the WRF application. In this case, we select loops 235 and 355:

  mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=dependencies --mark-up-list=235,355 --project-dir=<project_dir>/project1:0" ./wrf.exe
- Source location-based collection: Specify the source locations of the loops to analyze in the file1:line1 format and run the Dependencies analysis for the selected loops on rank 0 of the WRF application:

  mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=dependencies --mark-up-list=module_advect_em.f90:7637,module_big_step_utilities_em.f90:2542 --project-dir=<project_dir>/project1:0" ./wrf.exe
Run the Memory Access Patterns Analysis (optional)
- Identify loop IDs or source locations to run the deeper analysis on.
- Run the MAP analysis for the selected loops (155 and 200 in this case) on rank 0 of the WRF application:

  mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=map --mark-up-list=155,200 --project-dir=<project_dir>/project1:0" ./wrf.exe
- You can use Intel® Advisor to analyze your MPI applications on one, several, or all ranks. This recipe used the WRF Conus12km workload.
- To run the Survey, Trip Counts and FLOP, Roofline, Dependencies, or Memory Access Patterns analysis on an MPI application, you can use only Intel Advisor CLI commands, and then visualize the generated results in the Intel Advisor GUI.