Profiling MPI Applications
- Application: heart_demo sample application, available from GitHub* at https://github.com/CardiacDemo/Cardiac_demo.git
- Intel® C++ Compiler
- Intel® MPI Library 2019
- Intel VTune Amplifier 2019
- VTune Amplifier's Application Performance Snapshot
- To download the latest version of VTune Profiler, visit the download page.
- Operating system: Linux*
- CPU: Intel® Xeon® Platinum 8168 Processor (formerly code named Skylake)
- Network Fabric: Intel® Omni-Path Architecture (Intel® OPA)
- Clone the application GitHub repository to your local system:
    git clone https://github.com/CardiacDemo/Cardiac_demo.git
- Set up the Intel C++ Compiler and Intel MPI Library environment:
    source <compiler_install_dir>/bin/compilervars.sh intel64
    source <mpi_install_dir>/bin/mpivars.sh
- In the root level of the sample package, create a build directory and change to that directory:
    mkdir build
    cd build
- Build the application using the following command:
    mpiicpc ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++11 -qopenmp -parallel-source-info=2
Establish Overall Performance Characteristics
export OMP_NUM_THREADS=12
mpirun -np 16 -ppn 4 aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
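As a quick sanity check on the launch parameters above, the following sketch derives the node count and the thread count per node from the rank and thread settings. The variable names are illustrative only and are not part of the sample or of any Intel tool.

```shell
#!/bin/sh
# Launch parameters taken from the mpirun command above.
NP=16          # total MPI ranks (-np)
PPN=4          # MPI ranks per node (-ppn)
THREADS=12     # OpenMP threads per rank (OMP_NUM_THREADS)

# Derived layout.
NODES=$((NP / PPN))                  # compute nodes used
THREADS_PER_NODE=$((PPN * THREADS))  # OpenMP threads running on each node
TOTAL_THREADS=$((NP * THREADS))      # OpenMP threads across the whole job

echo "nodes=$NODES threads_per_node=$THREADS_PER_NODE total_threads=$TOTAL_THREADS"
```

With these values each node runs 48 OpenMP threads. Whether that matches the physical core count depends on the node configuration, which is not stated here; a dual-socket node with two 24-core processors would be one configuration where it does.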
Configure and Run the HPC Performance Characterization Analysis
- The command can be used in an interactive session or included in a batch submission script.
- The results directory specification is required for MPI applications.
- Add -trace-mpi if you are not using the Intel MPI Library.
- Prepare your environment by sourcing the relevant VTune Amplifier files. For a default installation using the bash shell, use the following command:
    source /opt/intel/vtune_amplifier/amplxe-vars.sh
- Collect data for the heart_demo application using the hpc-performance analysis. The application uses both OpenMP and MPI and is executed using the configuration described earlier: 16 MPI ranks across a total of 4 compute nodes using the Intel MPI Library. This example runs on Intel® Xeon® Platinum 8168 Processors and uses 12 OpenMP threads per MPI rank:
    export OMP_NUM_THREADS=12
    mpirun -np 16 -ppn 4 amplxe-cl -collect hpc-performance -r vtune_mpi -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
  The analysis begins and generates four output directories, one per node, using the following naming convention: vtune_mpi.<node host name>
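When the collection command goes into a batch submission script, it can help to keep the launch parameters in variables. The sketch below is a dry run: it assembles the command and prints it instead of launching anything. The variable names and the host names node01 to node04 are hypothetical placeholders, not values from the example system.

```shell
#!/bin/sh
# Dry-run sketch: assemble the collection command and print it instead of running it.
NP=16
PPN=4
RESULT_DIR=vtune_mpi
export OMP_NUM_THREADS=12

APP="./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"
CMD="mpirun -np $NP -ppn $PPN amplxe-cl -collect hpc-performance -r $RESULT_DIR -- $APP"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "$CMD"

# One result directory is expected per node, named after the node's host name.
# The host names below are hypothetical placeholders.
for host in node01 node02 node03 node04; do
    echo "expected result directory: $RESULT_DIR.$host"
done
```

To run the collection for real, execute the printed command in a shell where the compiler, MPI, and VTune Amplifier environments have been sourced.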
Analyze Results using the VTune Amplifier GUI
- MPI Imbalance is the average MPI busy wait time across all ranks on the node. The value indicates how much time could be saved if the balance were ideal.
- MPI Rank on the Critical Path is the rank with the minimal busy wait time.
- MPI Busy Wait Time and Top Serial Hotspots are shown for the rank on the critical path. This is a good way to identify severe deficiencies in scalability, since they typically correlate with high imbalance or busy wait metrics. Significant MPI Busy Wait Time for the rank on the critical path in a multi-node run might mean that the outlier rank is on a different node.
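The metric definitions above can be illustrated with a small arithmetic sketch. The busy wait times below are made-up numbers for four ranks on one node, and the script only mirrors the definitions as worded here; it does not reproduce how VTune Amplifier computes its metrics internally.

```shell
#!/bin/sh
# Hypothetical MPI busy wait times in milliseconds for four ranks on one node,
# listed in rank order starting at rank 0. These numbers are invented.
WAITS="400 100 300 200"

SUM=0
COUNT=0
MIN_RANK=0
MIN_WAIT=""

for w in $WAITS; do
    SUM=$((SUM + w))
    # Track the rank with the smallest busy wait time.
    if [ -z "$MIN_WAIT" ] || [ "$w" -lt "$MIN_WAIT" ]; then
        MIN_WAIT=$w
        MIN_RANK=$COUNT
    fi
    COUNT=$((COUNT + 1))
done

# MPI Imbalance as defined above: average busy wait time over the node's ranks.
IMBALANCE=$((SUM / COUNT))

echo "imbalance_ms=$IMBALANCE critical_path_rank=$MIN_RANK busy_wait_ms=$MIN_WAIT"
```

In this invented data set, rank 1 waits the least and is therefore the rank on the critical path, while the other ranks spend more of their time waiting for it.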
Generate a Command Line from the VTune Amplifier GUI (optional)
- Launch VTune Amplifier and click New Project or open an existing project.
- Click Configure Analysis.
- On the Where pane, select Arbitrary Host (not connected) and specify the hardware platform.
- On the What pane:
  - Specify the application and set the parameters and working directory.
  - Select the Use MPI launcher option and provide information related to the MPI run.
  - [Optional] Choose particular ranks to profile.
- On the How pane, change the default Hotspots analysis to HPC Performance Characterization and customize the available options.
- Click the Command Line button at the bottom of the window. A pop-up window opens with the exact command line to use for the customized analysis you just configured in the GUI. Additional MPI options can be added to complete the command.
Analyze Results with a Command Line Report (optional)
amplxe-cl -report summary -r ./results_dir
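Because an MPI collection produces one result directory per node, the report command usually has to be repeated for each directory. The sketch below defines a hypothetical helper, report_cmds, that prints one report command per matching directory; it assumes the directories share the prefix passed to -r during collection. It prints the commands rather than running them, so it works on a machine without VTune Amplifier installed.

```shell
#!/bin/sh
# report_cmds PREFIX
# Print one summary-report command for each directory named PREFIX.<something>
# in the current working directory. (Hypothetical helper, not part of VTune.)
report_cmds() {
    prefix=$1
    for dir in "$prefix".*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$dir" ] || continue
        # Print the command; drop "echo" to generate the report directly.
        echo amplxe-cl -report summary -r "./$dir"
    done
}

# Example: list report commands for results collected with "-r vtune_mpi".
report_cmds vtune_mpi
```

Run from the directory that holds the results, this prints nothing if no matching directories exist.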