To analyze the performance and correctness of an MPI application at the inter-process level, use the Intel® Trace Analyzer and Collector (located in the <installdir>/itac directory after installation). The Intel Trace Analyzer and Collector attaches to the application through linkage (static, dynamic, through LD_PRELOAD, or via the Intel Compiler -tcollect and -tcollect-filter options), or through the itcpin tool. It collects information about MPI-level events between processes, which lets you analyze the performance and correctness of MPI calls and detect deadlocks, data layout errors, and risky or incorrect MPI constructs. The collected data is correlated and aggregated across all processes and all nodes that participated in the run.
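As an illustration of the LD_PRELOAD attachment mentioned above, the following sketch builds a launch command that preloads the collector into every rank without relinking the application. The application name, the process count, and the library name libVT.so are assumptions here; check the Intel Trace Analyzer and Collector documentation for the names that apply to your installation. The script only prints the command so that it can be reviewed before use.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the launch command instead of running it.
APP=./my_app             # hypothetical application binary
NP=4                     # hypothetical number of MPI processes
COLLECTOR_LIB=libVT.so   # assumed name of the trace collector shared library

# -genv sets an environment variable for all ranks, so the collector
# library is preloaded into each process at startup.
trace_cmd="mpirun -genv LD_PRELOAD $COLLECTOR_LIB -n $NP $APP"
echo "$trace_cmd"
```

To run the collection, execute the printed command in an environment where the Intel Trace Analyzer and Collector libraries are on the library search path.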
Beyond the inter-process level of MPI parallelism, the processes that make up an application on a modern cluster often also use fork-join threading through OpenMP and Intel TBB. At this level, use the VTune Amplifier to analyze the performance and the Intel Inspector to analyze the correctness of an MPI application.
At a high level, the analysis workflow consists of three steps:
1. Use the amplxe-cl and inspxe-cl command-line tools to collect data about an application. By default, all processes are analyzed, but it is possible (and, because of certain collection technology limitations, sometimes required for the VTune Amplifier) to restrict data collection to a subset of processes. A separate result directory is created for each analyzed MPI application process, with the MPI rank of that process captured.
2. Post-process the result. This step is also called finalization or symbol resolution and runs automatically for each result directory once the collection has finished.
3. Open each result directory in the standalone GUI viewer to analyze the data for the specific process. The GUI viewers are independent: the VTune Amplifier and the Intel Inspector each have their own user interface.
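The collection step above can be sketched as follows. The application name, process count, and result directory names are hypothetical, and hotspots and mi1 are used only as example analysis types; substitute the types that your versions of the tools provide. The script prints the commands rather than running them, so it can be inspected on a machine without the tools installed.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the collection commands.
APP=./my_app   # hypothetical application binary
NP=4           # hypothetical number of MPI processes

# The MPI launcher starts one copy of the analysis tool per rank;
# everything after "--" is the application command line.
vtune_cmd="mpirun -n $NP amplxe-cl -result-dir vtune_result -quiet -collect hotspots -- $APP"
inspector_cmd="mpirun -n $NP inspxe-cl -result-dir inspector_result -collect mi1 -- $APP"

echo "$vtune_cmd"
echo "$inspector_cmd"
```

Because the tool is started under the launcher once per rank, each analyzed process ends up with its own result directory, which can then be opened in the corresponding GUI viewer.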
The file system contents should be the same on all nodes so that the modules referenced in the collected data are automatically available on the host where the collection was initiated. If they are not, you can work around this limitation by manually copying the modules from the nodes and adjusting the VTune Amplifier / Intel Inspector project search directories so that the modules can be found.
For the VTune Amplifier, the CPU model and stepping should be the same on all nodes so that hardware event-based sampling operates with the same Performance Monitoring Unit (PMU) type on all nodes.
MPI Analysis Limitations
There are certain limitations in the current MPI profiling support provided by the VTune Amplifier / Intel Inspector:
MPI dynamic processes are not supported by the VTune Amplifier / Intel Inspector. An example of a dynamic process API is MPI_Comm_spawn.
Only one data collection that uses the hardware event-based sampling collector can run on a system at a time. When you use the VTune Amplifier to profile an MPI application, you are responsible for making sure that only one SEP data collection session is launched on a given host. A common way to achieve this is to use the host syntax to distribute the ranks running under the tool over different hosts.
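One way to apply the host syntax is sketched below: each host runs exactly one rank under the tool, and the remaining ranks on that host run unprofiled. The host names, rank counts, result directory name, and the advanced-hotspots analysis type are assumptions; use any hardware event-based sampling analysis type that your VTune Amplifier version provides. The script prints the command instead of running it.

```shell
#!/bin/sh
# Dry-run sketch: one hardware event-based sampling collection per host.
APP=./my_app                 # hypothetical application binary
ANALYSIS=advanced-hotspots   # assumed hardware event-based analysis type
HOSTS="node1 node2"          # hypothetical host names
RANKS_PER_HOST=4             # hypothetical number of ranks per host

cmd="mpirun"
sep=""
for host in $HOSTS; do
  # Exactly one rank on this host runs under amplxe-cl.
  cmd="$cmd$sep -host $host -n 1 amplxe-cl -result-dir hw_result -quiet -collect $ANALYSIS -- $APP"
  # The other ranks on the same host run without the tool.
  cmd="$cmd : -host $host -n $(($RANKS_PER_HOST - 1)) $APP"
  sep=" :"
done
echo "$cmd"
```

The colon-separated argument sets keep the profiled ranks on different hosts, so no host ever runs more than one sampling session.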