Profiling MPI Applications
- Application: heart_demo sample application, available from GitHub* at https://github.com/CardiacDemo/Cardiac_demo.git
- Intel® C++ Compiler
- Intel® MPI Library 2019
- Intel VTune Amplifier 2019
- VTune Amplifier's Application Performance Snapshot
- All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
- Operating system: Linux*
- CPU: Intel® Xeon® Platinum 8168 Processor (formerly code named Skylake)
- Network Fabric: Intel® Omni-Path Architecture (Intel® OPA)
- Clone the application GitHub repository to your local system:
    git clone https://github.com/CardiacDemo/Cardiac_demo.git
- Set up the Intel C++ Compiler and Intel MPI Library environment:
    source <compiler_install_dir>/bin/compilervars.sh intel64
    source <mpi_install_dir>/bin/mpivars.sh
- In the root level of the sample package, create a build directory and change to that directory:
    mkdir build
    cd build
- Build the application using the following command:
    mpiicpc ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++11 -qopenmp -parallel-source-info=2
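Before building or profiling, it can help to confirm that the environment scripts actually put the expected tools on the PATH. The following is a minimal sketch only: the helper name check_tools is hypothetical, and aps and amplxe-cl will be found only after the VTune Amplifier environment has also been sourced.

```shell
#!/usr/bin/env bash
# check_tools (hypothetical helper): report every listed command that is
# not on PATH and return non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tool names are the ones used in this recipe.
check_tools mpiicpc mpirun aps amplxe-cl \
    || echo "Source the compiler, MPI, and VTune Amplifier environment scripts first."
```

The function only inspects the PATH; it does not verify tool versions.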
Establish Overall Performance Characteristics
    export OMP_NUM_THREADS=12
    mpirun -np 16 -ppn 4 aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
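The launch line fixes three numbers: the total rank count (-np), the ranks per node (-ppn), and the OpenMP threads per rank (OMP_NUM_THREADS). As a rough sketch, the node count and per-node thread count these imply can be worked out as follows; the function name mpi_layout is hypothetical and is not part of any Intel tool.

```shell
#!/usr/bin/env bash
# mpi_layout (hypothetical helper): derive the node count and the number
# of OpenMP threads per node from the launch parameters.
#   $1 = total MPI ranks (-np)
#   $2 = ranks per node (-ppn)
#   $3 = OpenMP threads per rank (OMP_NUM_THREADS)
mpi_layout() {
    local np="$1" ppn="$2" threads="$3"
    # Round up so a partially filled last node is still counted.
    local nodes=$(( (np + ppn - 1) / ppn ))
    echo "nodes=${nodes} threads_per_node=$(( ppn * threads ))"
}

mpi_layout 16 4 12   # prints: nodes=4 threads_per_node=48
```

Comparing threads_per_node against the number of physical cores on each node is a quick way to spot accidental oversubscription before collecting data.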
Configure and Run the HPC Performance Characterization Analysis
- The command can be used in an interactive session or included in a batch submission script.
- The results directory specification is required for MPI applications.
- Add -trace-mpi if you are not using the Intel MPI Library.
- Prepare your environment by sourcing the relevant VTune Amplifier files. For a default installation using the bash shell, use the following command:
    source /opt/intel/vtune_amplifier/amplxe-vars.sh
- Collect data for the heart_demo application using the hpc-performance analysis. The application uses both OpenMP and MPI and is executed with the configuration described earlier: 16 MPI ranks across 4 compute nodes using the Intel MPI Library. This example runs on Intel® Xeon® Platinum 8168 Processors and uses 12 OpenMP threads per MPI rank:
    export OMP_NUM_THREADS=12
    mpirun -np 16 -ppn 4 amplxe-cl -collect hpc-performance -r vtune_mpi -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
  The analysis begins and generates four output directories, one per node, using the following naming convention: vtune_mpi.<node host name>
Analyze Results using the VTune Amplifier GUI
- MPI Imbalance is the average MPI busy wait time across all ranks on the node. The value indicates how much time could be saved if the load balance were ideal.
- MPI Rank on the Critical Path is the rank with the minimal busy wait time.
- MPI Busy Wait Time and Top Serial Hotspots are shown for the rank on the critical path. This is a good way to identify severe deficiencies in scalability, since they typically correlate with high imbalance or busy wait metrics. Significant MPI Busy Wait Time for the rank on the critical path in a multi-node run might mean that the outlier rank is on a different node.
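To make the first two definitions concrete, the sketch below applies them to made-up per-rank busy wait times. It only illustrates the definitions as stated above and is not how VTune Amplifier computes its metrics internally; the input format (one "rank seconds" pair per line), the sample values, and the function name imbalance_summary are all assumptions.

```shell
#!/usr/bin/env bash
# imbalance_summary (hypothetical helper): read "rank busy_wait_seconds"
# pairs on stdin, then print the average busy wait (the imbalance as
# defined above) and the rank with the smallest busy wait (the rank on
# the critical path).
imbalance_summary() {
    awk '
        { busy = $2 + 0; sum += busy }
        NR == 1 || busy < min { min = busy; rank = $1 }
        END { if (NR > 0) printf "imbalance=%.2f critical_rank=%s\n", sum / NR, rank }
    '
}

# Hypothetical busy wait times for four ranks on one node.
printf '0 2.0\n1 0.5\n2 3.5\n3 2.0\n' | imbalance_summary
# prints: imbalance=2.00 critical_rank=1
```

In this made-up data, rank 1 waits the least, so the other ranks spend their busy wait time waiting for it.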
Generate a Command Line from the VTune Amplifier GUI (optional)
- Launch VTune Amplifier and click New Project or open an existing project.
- Click Configure Analysis.
- On the Where pane, select Arbitrary Host (not connected) and specify the hardware platform.
- On the What pane:
  - Specify the application and set the parameters and working directory.
  - Select the Use MPI launcher option and provide information related to the MPI run.
  - [Optional] Choose particular ranks to profile.
- On the How pane, change the default Hotspots analysis to HPC Performance Characterization and customize the available options.
- Click the Command Line button at the bottom of the window. A pop-up window opens with the exact command line for the customized analysis you just configured in the GUI. Additional MPI options can be added to complete the command.
Analyze Results with a Command Line Report (optional)
amplxe-cl -report summary -r ./results_dir
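Because an MPI collection produces one result directory per node, the report command usually has to be repeated for each of them. The sketch below prints one report command per directory rather than running it. It assumes the directories follow the prefix.hostname naming described earlier, and the function name report_commands is hypothetical.

```shell
#!/usr/bin/env bash
# report_commands (hypothetical helper): print one summary-report command
# for every result directory in the current working directory whose name
# starts with "<prefix>.".
#   $1 = result directory prefix passed to -r during collection
report_commands() {
    local prefix="$1" dir
    for dir in "${prefix}".*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$dir" ] || continue
        printf 'amplxe-cl -report summary -r %s\n' "${dir%/}"
    done
}

report_commands vtune_mpi
```

Piping the output to sh would run the reports one after another; printing them first makes it easy to check the directory names.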
Selective Code Area Profiling (optional)