Analyzing MPI Applications

As a part of the Intel® Cluster Studio XE, Intel® VTune™ Amplifier is used for performance analysis of MPI applications.

Intel Cluster Studio XE helps you exploit the scalable parallelism of a modern cluster at all levels of parallel and sequential computing for Fortran, C, or C++ MPI applications. These levels include message passing, threading, and SIMD data parallelism. The Intel MPI Library is used at the process messaging level. The Intel OpenMP* library, Intel Threading Building Blocks (Intel TBB), and Intel Cilk™ Plus extensions can be used for thread parallelism. The Intel Math Kernel Library (Intel MKL) can automatically exploit the threading, message passing (through ScaLAPACK), and SIMD data parallelism capabilities of Intel hardware.

Intel Cluster Studio XE provides the following tools to analyze the performance and correctness of an MPI application:

  • For inter-process analysis, use the Intel Trace Analyzer and Collector. By default, the Intel Trace Analyzer and Collector is installed to the <installdir>/itac directory. It attaches to the application through linkage (statically, dynamically, through LD_PRELOAD, or via the Intel Compiler -tcollect and -tcollect-filter options) or by using the itcpin tool. The tools collect information about events at the MPI level between processes and enable analyzing the performance and correctness of the MPI calls, deadlock detection, data layout errors, as well as risky or incorrect MPI constructs. Intel Trace Analyzer and Collector data is correlated and aggregated across all processes and all nodes that participate in the execution run. For more details, see the Intel Trace Analyzer and Collector documentation.

  • For analysis of MPI applications using fork-join threading through OpenMP* and Intel TBB, use the Intel VTune Amplifier and Intel Inspector.

Note

The version of the Intel MPI Library included with the Intel Cluster Studio XE makes an important switch: it uses the Hydra process manager by default for mpirun. This provides high scalability across a large number of nodes.

Follow these basic steps to analyze MPI applications with the VTune Amplifier:

  1. Configure installation for MPI analysis.

  2. Configure and run MPI analysis with the VTune Amplifier.

  3. Resolve symbols for MPI modules.

  4. View collected data.

The following sections provide additional information on MPI analysis.

Configuring Installation for MPI Analysis

For MPI application analysis on a Linux* cluster, you may enable the Per-user Hardware Event-based Sampling mode when installing the Intel Cluster Studio XE. This option ensures that the VTune Amplifier collects data only for the current user. Once enabled by the administrator during installation, this mode cannot be turned off by a regular user. This restriction is intentional: it prevents individual users from observing performance data for the whole node, including the activity of other users.

After installation, you can use the respective -vars.sh files to set up the appropriate environment (PATH, MANPATH) in the current terminal session.
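For example, in an interactive session the environment might be set up as follows. The installation prefix /opt/intel and the exact script locations are assumptions for illustration; adjust them for your cluster:

```shell
# Set up the VTune Amplifier and Intel Trace Analyzer and Collector
# environments in the current shell session. The paths below assume a
# default installation under /opt/intel; adjust them for your cluster.
source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
source /opt/intel/itac/bin/itacvars.sh
```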

Configuring MPI Analysis with the VTune Amplifier

To collect performance data for an MPI application with the VTune Amplifier, use the following command:

$ mpirun -n <n> -l amplxe-cl -result-dir my_result -quiet -collect <analysis type> my_app [my_app_options]

Note

To view a list of available analysis types, use the amplxe-cl -help collect command.

where:

  • <n> is the number of MPI processes to launch. The collector detects and captures the MPI process rank automatically. As a result, the VTune Amplifier creates a number of result directories in the current directory, named my_result.0, my_result.1, ... my_result.n, where the numeric suffix n corresponds to the MPI process rank. Using the suffix guarantees that multiple amplxe-cl instances launched in the same directory on different nodes do not overwrite each other's data and can work in parallel. So, the VTune Amplifier creates a separate result directory for each analyzed process in the job.

    Note

    For hardware event-based sampling analysis types, the VTune Amplifier can collect data for a single MPI rank only. For example:

    $ mpirun -n 1 amplxe-cl -c advanced-hotspots -r ah -- ./test.x : -n 3 ./test.x
    

    To run event-based sampling analysis for several processes on one node, consider using the -analyze-system option, for example:

    $ mpirun -host myhost -n 11 ./a.out : -host myhost -n 1 amplxe-cl -result-dir foo -c advanced-hotspots -analyze-system ./a.out 
    

    Note that this analysis configuration collects ITT API data (for example, Task and Frame analysis) and rank information only for the launched process.

  • The -l option of the mpiexec/mpirun tools marks each stdout line with the MPI rank of the process that produced it.

  • The -quiet / -q option suppresses diagnostic output such as progress messages.

To collect data for a subset of MPI processes in the workload, use the per-host syntax of mpirun/mpiexec* and specify different command lines to execute for different processes.

Examples:

  1. This example runs the Hotspots analysis type recommended as a starting point:

    $ mpirun -n 4 amplxe-cl -result-dir my_result -collect hotspots -- my_app [my_app_options]

  2. This example collects Hotspots data for two out of 16 processes in the job distributed across the hosts:

    $ mpirun -host myhost -n 14 ./a.out : -host myhost -n 2 amplxe-cl -result-dir foo -c hotspots ./a.out

    As a result, the VTune Amplifier creates two directories in the current directory: foo.14 and foo.15 (given that process ranks 14 and 15 were assigned to the last two processes in the job).

  3. As an alternative to the previous example, you can create a configuration file with the following content:

    # config.txt configuration file
    -host myhost -n 14 ./a.out
    -host myhost -n 2 amplxe-cl -quiet -collect hotspots -result-dir foo ./a.out
    

    and run the data collection as:

    $ mpirun -configfile ./config.txt

    to achieve the same result as in the previous example: foo.14 and foo.15 result directories are created.

  4. Similarly, this example uses specific host names to control where the analyzed processes are executed:

    # config.txt configuration file
    -host myhost1 -n 14 ./a.out
    -host myhost2 -n 2 amplxe-cl -quiet -collect hotspots -result-dir foo ./a.out
    

    When you specify the host names, consecutive MPI ranks are allocated to the specified hosts. In this example, ranks 0 to 13, inclusive, are assigned to myhost1, and the remaining ranks 14 and 15 to myhost2. You can also omit the host names; in that case, the processes are distributed between the hosts in a round-robin fashion: myhost1 gets MPI ranks 0, 2, and 4 through 15, while myhost2 gets MPI ranks 1 and 3. The latter behavior may change in the future.

Note

The examples above use the mpirun command as opposed to mpiexec and mpiexec.hydra, while real-world jobs might use the mpiexec* commands directly. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and the options passed. All the listed examples work for the mpiexec* commands as well as for mpirun.

Resolving Symbols for MPI Modules

After data collection, the VTune Amplifier automatically finalizes the data (resolves symbols and converts the data to a database). Finalization happens on the same compute node where the command-line collection ran, so the VTune Amplifier can automatically locate binary and symbol files. In cases where you need to point to symbol files stored elsewhere, adjust the search settings using the -search-dir option:

$ mpirun -np 128 amplxe-cl -q -collect hotspots -search-dir /home/foo/syms ./a.out

Viewing Collected Data

Once the results are collected, you can open them in the graphical or command line interface of the VTune Amplifier.

To view the results in the command line interface:

Use the -report option. To get the list of all available VTune Amplifier reports, enter amplxe-cl -help report.

To view the results in the graphical interface:

  1. Run the amplxe-gui <result path> command, or launch the amplxe-gui tool on its own.

  2. Click the menu button and select Open > Result... and browse to the required result file (*.amplxe).

Tip

You may copy a result to another system and view it there (for example, to open a result collected on a Linux* cluster on a Windows* workstation).

VTune Amplifier classifies MPI functions as system functions, similar to Intel Threading Building Blocks (Intel TBB) and OpenMP* functions. This approach helps you focus on your own code rather than on MPI internals. You can use the Call Stack Mode filter bar combo box in the VTune Amplifier GUI, or the call-stack-mode CLI option, to enable displaying the system functions and thus view and analyze the internals of the MPI implementation. The User functions+1 call stack mode is especially useful for finding the MPI functions that consumed the most CPU time (Hotspots analysis) or waited the most (Locks and Waits analysis). For example, in the call chain main() -> foo() -> MPI_Bar() -> MPI_Bar_Impl() -> ..., MPI_Bar() is the actual MPI API function you use, and the deeper functions are MPI implementation details. The call stack modes behave as follows:

  • The default Only user functions call stack mode attributes the time spent in the MPI calls to the user function foo() so that you can see which of your functions you can change to actually improve the performance.

  • The User functions+1 mode attributes the time spent in the MPI implementation to the top-level system function, MPI_Bar(), so that you can easily spot the heaviest MPI calls.

  • The User/system functions mode shows the call tree without any re-attribution so that you can see where exactly in the MPI library the time was spent.

VTune Amplifier provides Intel TBB and OpenMP support. Consider using these thread-level parallel solutions in addition to MPI-style parallelism to maximize CPU resource usage across the cluster, and use the VTune Amplifier to analyze the performance of that level of parallelism. The MPI, OpenMP, and Intel TBB features in the VTune Amplifier are functionally independent, so all the usual features of OpenMP and Intel TBB support apply when looking into a result collected for an MPI process.

Example:

This example displays the performance report for functions and modules analyzed for Hotspots. Note that this example opens individual analysis results, each of which was collected for a specific MPI process rank (foo.14 and foo.15):

$ amplxe-cl -R hotspots -q -format text -r foo.14
Function Module CPU Time
-------- ------ --------
f        a.out  6.070
main     a.out  2.990

$ amplxe-cl -R hotspots -q -format text -group-by module -r foo.14
Module CPU Time
------ --------
a.out  9.060

MPI Implementations Support

You can use the VTune Amplifier to analyze both the Intel MPI Library implementation and other MPI implementations. However, be aware of the following specifics:

  • VTune Amplifier extracts the MPI process rank from the environment variables PMI_RANK or PMI_ID (whichever is set) to make sure the process belongs to an MPI job and to capture the rank in the result directory name. If an alternative MPI implementation does not set these environment variables, the VTune Amplifier does not capture the rank in the result directory name, and the usual automatic naming scheme for result directories is used. The default value for the -result-dir option is r@@@{at}, which results in a sequence of result directories like r000hs, r001hs, and so on.
  • For the Intel MPI Library, the VTune Amplifier classifies MPI functions/modules as system functions/modules (the User functions+1 option) and attributes their time to system functions. This classification may not work for all modules and functions of non-Intel MPI implementations. In that case, the VTune Amplifier may display some internal MPI functions and modules by default.
  • You may need to adjust the command line examples in this help section to work for non-Intel MPI implementations. For example, you need to adjust command lines provided for different process ranks to limit the number of processes in the job.
  • An MPI implementation must be able to operate when the VTune Amplifier process (amplxe-cl) sits between the launcher process (mpirun/mpiexec) and the application process. This means the communication information should be passed using environment variables, as most MPI implementations do. The VTune Amplifier does not work with an MPI implementation that tries to pass communication information from its immediate parent process.
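The rank-capture behavior described in the first bullet can be sketched in shell terms. This is only an illustration of the documented logic, not the tool's actual implementation, and the base name my_result is a placeholder:

```shell
# Mimic the documented naming: prefer PMI_RANK, fall back to PMI_ID;
# when neither variable is set, the automatic r@@@{at} scheme applies.
base="my_result"                       # value passed to -result-dir
rank="${PMI_RANK:-${PMI_ID:-}}"
if [ -n "$rank" ]; then
    result_dir="${base}.${rank}"       # e.g. my_result.3 for rank 3
else
    result_dir=""                      # tool falls back to r000hs, r001hs, ...
fi
echo "${result_dir:-automatic naming}"
```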

MPI System Modules Recognized by the VTune Amplifier

VTune Amplifier uses the following Perl-syntax regular expressions to classify MPI implementation modules:

  • impi\.dll

  • impid\.dll

  • impidmt\.dll

  • impil\.dll

  • impilmt\.dll

  • impimt\.dll

  • libimalloc\.dll

  • libmpi_ilp64\.dll

Note

This list is provided for reference only. It may change from version to version without any additional notification.

Analysis Limitations

  • VTune Amplifier does not support MPI dynamic processes (for example, the MPI_Comm_spawn dynamic process API).

  • Data collections that use the hardware event-based sampling collector are limited to one such collection at a time on a system. When you profile an MPI application with the VTune Amplifier, make sure that only one event-based sampling data collection session is launched on a given host. A common way to achieve this is to use the per-host syntax and distribute the ranks running under the tool over different hosts.
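For example, the per-host syntax can be expressed in a configuration file so that each host runs at most one event-based sampling collection. The host names, rank counts, and result directory names below are hypothetical:

```
# config.txt: at most one event-based sampling collection per host;
# the remaining ranks on each host run without the tool.
-host myhost1 -n 1 amplxe-cl -c advanced-hotspots -r ebs1 ./a.out
-host myhost1 -n 7 ./a.out
-host myhost2 -n 1 amplxe-cl -c advanced-hotspots -r ebs2 ./a.out
-host myhost2 -n 7 ./a.out
```

Launched with mpirun -configfile ./config.txt, this starts 16 ranks while keeping a single event-based sampling session on each host.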

Additional Resources

For more details on analyzing MPI applications, see the Intel Cluster Studio XE and online MPI documentation at http://software.intel.com/en-US/articles/intel-mpi-library-documentation/.

Other resources available online also discuss using the VTune Amplifier with the Intel MPI Library.

See Also


Supplemental documentation specific to a particular Intel Studio may be available at <install-dir>/<studio>/documentation/.

For more complete information about compiler optimizations, see our Optimization Notice.