User Guide

Collecting MPI Performance/Correctness Data

To collect performance or correctness data for an MPI application with the Intel® VTune™ Profiler or Intel Inspector on a Windows or Linux OS, use the following command:
$ mpirun -n <N> <abbr>-cl -r my_result -collect <analysis type> my_app [my_app_options]
where <abbr> is amplxe for the VTune Profiler or inspxe for the Inspector, respectively. The list of available analysis types can be viewed with the amplxe-cl -help collect command, as shown below.
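For instance, to list the analysis types for each tool (the inspxe-cl variant is an assumption by analogy with the amplxe-cl command above):
$ amplxe-cl -help collect
$ inspxe-cl -help collect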
As a result of using the collection commands, a number of result directories are created in the current directory, named my_result.0 through my_result.3 for a four-process job. The numeric suffix is the corresponding MPI process rank, detected and captured by the collector automatically. The suffix ensures that multiple amplxe-cl / inspxe-cl instances launched in the same directory on different nodes do not overwrite each other's data and can work in parallel. A separate result directory is thus created for each analyzed process in the job.
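For example, after a four-process run of the command above, the current directory might look like this (a sketch; the exact listing depends on the job):
$ ls -d my_result.*
my_result.0  my_result.1  my_result.2  my_result.3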
Sometimes it is necessary to collect data for only a subset of the MPI processes in the workload. In this case, use the per-host syntax of mpirun / mpiexec* to specify different command lines for different processes, as demonstrated in the 16-process example below.
When launching the collection on Windows OS, we recommend passing the -genvall option to the mpiexec tool to make sure that the user environment variables are passed to all instances of the profiled process. Otherwise, by default, the processes are launched in the context of a system account, and some environment variables (USERPROFILE, APPDATA) do not point where the tools expect them to.
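A minimal sketch of such a Windows invocation, assuming a hypothetical my_app.exe and a four-process hotspots run:
> mpiexec -genvall -n 4 amplxe-cl -quiet -r my_result -collect hotspots -- my_app.exe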
There are also some peculiarities of stdout/stderr behavior in MPI jobs profiled with the tools:
  • It is recommended to pass the -quiet / -q option to amplxe-cl / inspxe-cl to avoid diagnostic output, such as progress messages, being spilled to the console by every tool process in the job.
  • The user may want to use the -l option for mpiexec / mpirun to get stdout lines marked with the MPI rank. Both options are combined in the sketch after this list.
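A minimal sketch combining both options, assuming a four-process hotspots run (my_app is a placeholder):
$ mpirun -l -n 4 amplxe-cl -quiet -r my_result -collect hotspots -- my_app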
Example
The most reasonable analysis type to start with for the Intel VTune Profiler is hotspots, so an example of a full command line for collection would be:
$ mpirun -n 4 amplxe-cl -r my_result -collect hotspots -- my_app [my_app_options]
A similar command line for the Intel Inspector and its ti1/mi1 analysis types (the lowest-overhead threading and memory correctness analysis types, respectively) would look like:
$ mpirun -n 4 inspxe-cl -r my_result -collect mi1 -- my_app [my_app_options]
$ mpirun -n 4 inspxe-cl -r my_result -collect ti1 -- my_app [my_app_options]
Here is an example where there are 16 processes in the job distributed across the hosts and hotspots data should be collected for only two of them:
$ mpirun -host myhost -n 14 ./a.out : -host myhost -n 2 amplxe-cl -r foo -c hotspots ./a.out
As a result, two directories will be created in the current directory: foo.14 and foo.15 (given that process ranks 14 and 15 were assigned to the last two processes in the job). As an alternative to specifying the command line above, it is possible to create a configuration file with the following content:
# config.txt configuration file
-host myhost -n 14 ./a.out
-host myhost -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out
and run the data collection as:
$ mpirun -configfile ./config.txt
to achieve the same result as above (the foo.14 and foo.15 result directories will be created). Similarly, you can use specific host names to control where the analyzed processes are executed:
# config.txt configuration file
-host myhost1 -n 14 ./a.out
-host myhost2 -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out
When the host names are specified, consecutive MPI ranks are allocated to the given hosts: in the case above, ranks 0 through 13 will be assigned to myhost1, and the remaining ranks 14 and 15 will be assigned to myhost2. On Linux, it is possible to omit the exact hosts, in which case the processes are distributed between the hosts in round-robin fashion. That is, myhost1 will get MPI ranks 0, 2, and 4 through 15, while myhost2 will get MPI ranks 1 and 3. The latter behavior may change in the future.
The examples in this guide use the mpirun command as opposed to mpiexec and mpiexec.hydra, while real-world jobs might use the mpiexec* commands. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and the options passed. All the examples listed here work with the mpiexec* commands as well as with the mpirun command.
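For instance, the hotspots example above could be launched through the lower-level command directly (a sketch, assuming the Hydra process manager is in use):
$ mpiexec.hydra -n 4 amplxe-cl -r my_result -collect hotspots -- my_app [my_app_options]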
