User Guide

MPI Code Analysis

Explore using the Intel® VTune™ Profiler command line interface (vtune) for profiling an MPI application.
Parallel High Performance Computing (HPC) applications often rely on multi-node architectures of modern clusters. Performance tuning of such applications must involve analysis of cross-node application behavior as well as single-node performance analysis. Intel® Parallel Studio Cluster Edition includes such performance analysis tools as Application Performance Snapshot, Intel Trace Analyzer and Collector, and
Intel VTune Profiler that can provide important insights to help in MPI application performance analysis. For example:
  • Application Performance Snapshot provides a quick MPI application performance overview.
  • Intel Trace Analyzer and Collector explores message passing interface (MPI) usage efficiency with communication hotspots, synchronization bottlenecks, load balancing, etc.
  • Intel VTune Profiler focuses on intra-node performance with threading, memory, and vectorization efficiency metrics.
The version of the Intel MPI library included with the Intel Parallel Studio Cluster Edition makes an important switch to use the Hydra process manager by default for mpirun. This provides high scalability across a large number of nodes.
This topic focuses on how to use the VTune Profiler command line tool to analyze an MPI application. Refer to the Additional Resources section below to learn more about other analysis tools.
Use the VTune Profiler for a single-node analysis including threading when you start analyzing hybrid codes that combine parallel MPI processes with threading for a more efficient exploitation of computing resources. HPC Performance Characterization analysis is a good starting point to understand CPU utilization, memory access, and vectorization efficiency aspects and define the tuning strategy to address performance gaps. The CPU Utilization section contains the MPI Imbalance metric, which is calculated for MPICH-based MPIs. Further steps might include Intel Trace Analyzer and Collector to look at MPI communication efficiency, Memory Access analysis to go deeper on memory issues, Microarchitecture Exploration analysis to explore microarchitecture issues, or Intel Advisor to dive into vectorization tuning specifics.
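For example, a minimal starting point (a sketch that uses the command syntax detailed later in this topic) is to run the HPC Performance Characterization analysis under the MPI launcher and then check the CPU Utilization section and the MPI Imbalance metric in the result summary:

mpirun -n 4 vtune -collect hpc-performance -trace-mpi -result-dir my_result -- ./my_app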
Use the basic steps described in this topic to analyze MPI applications for imbalance issues with the VTune Profiler, and explore the additional information on MPI analysis in the sections that follow.

Configure Installation for MPI Analysis on Linux* Host

For MPI application analysis on a Linux* cluster, you may enable the Per-user Hardware Event-based Sampling mode when installing the Intel Parallel Studio Cluster Edition. This option ensures that during collection the VTune Profiler gathers data only for the current user. Once enabled by the administrator during installation, this mode cannot be turned off by a regular user; this is intentional, to prevent individual users from observing performance data for the whole node, including the activities of other users.
After installation, you can use the respective vars.sh files to set up the appropriate environment (PATH, MANPATH) in the current terminal session.
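For example (a sketch; the exact script name and installation path depend on the product version and install location on your cluster, and older versions use amplxe-vars.sh instead):

source <vtune-install-dir>/vtune-vars.sh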

Configure MPI Analysis with the VTune Profiler

To collect performance data for an MPI application with the VTune Profiler, use the command line interface (vtune). The collection configuration can be completed with the help of the target configuration options in the VTune Profiler user interface. For more information, see Arbitrary Targets Configuration.
Usually, MPI jobs are started using an MPI launcher such as mpirun, mpiexec, srun, aprun, etc. The examples provided use mpirun. A typical MPI job uses the following syntax:
mpirun [options] <program> [<args>]
The VTune Profiler is launched in place of <program>, and your application is launched through the VTune Profiler command arguments. As a result, launching an MPI application with the VTune Profiler uses the following syntax:
mpirun [options] vtune [options] <program> [<args>]
There are several options for mpirun and vtune that must be specified or are highly recommended, while others can use the default settings. A typical command uses the following syntax:
mpirun -n <n> -l vtune -quiet -collect <analysis_type> -trace-mpi -result-dir <my_result> my_app [<my_app_options>]
The mpirun options include:
  • <n> is the number of MPI processes to be run.
  • The -l option of the mpiexec/mpirun tools marks stdout lines with an MPI rank. This option is recommended, but not required.
The vtune options include:
  • The -quiet / -q option suppresses diagnostic output such as progress messages. This option is recommended, but not required.
  • -collect <analysis type> specifies the analysis type you run with the VTune Profiler. To view a list of available analysis types, use the vtune -help collect command.
  • -trace-mpi adds a per-node suffix to the result directory name and adds a rank number to a process name in the result. This option is required for non-Intel MPI launchers.
  • -result-dir <my_result> specifies the path to a directory in which the analysis results are stored.
If an MPI application is launched on multiple nodes, the VTune Profiler creates a number of result directories per compute node in the current directory, named my_result.<hostname1>, my_result.<hostname2>, ... my_result.<hostnameN>, encapsulating the data for all the ranks running on the node in the same directory. For example, the Hotspots analysis (hardware event-based sampling mode) run on 4 nodes collects data on each compute node:


    
mpirun -n 16 -ppn 4 -l vtune -collect hotspots -k sampling-mode=hw -trace-mpi -result-dir my_result -- my_app.a
The data for each process is presented under the node it was running on:


    
my_result.host_name1 (rank 0-3)
my_result.host_name2 (rank 4-7)
my_result.host_name3 (rank 8-11)
my_result.host_name4 (rank 12-15)
If you want to profile particular ranks (for example, outlier ranks identified by Application Performance Snapshot), use selective rank profiling. Use a multi-binary MPI run and apply VTune Profiler profiling only to the ranks of interest. This significantly reduces the amount of data to process and analyze. The following example collects Memory Access data for 2 out of 16 processes, with 1 profiled rank per node:


    
export VTUNE_CL="vtune -collect memory-access -trace-mpi -result-dir my_result"
mpirun -host myhost1 -n 7 my_app.a : -host myhost1 -n 1 $VTUNE_CL -- my_app.a : -host myhost2 -n 7 my_app.a : -host myhost2 -n 1 $VTUNE_CL -- my_app.a
Alternatively, you can create a configuration file with the following content:


    
# config.txt configuration file
-host myhost1 -n 7 ./a.out
-host myhost1 -n 1 vtune -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out
-host myhost2 -n 7 ./a.out
-host myhost2 -n 1 vtune -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out
To run the collection using the configuration file, use the following command:


    
mpirun -configfile ./config.txt
If you use the Intel MPI library version 5.0.2 or later, you can use the -gtool option with the Intel MPI process launcher for easier selective rank profiling:
mpirun -n <n> -gtool "vtune -collect <analysis type> -r <my_result>:<rank_set>" <my_app> [my_app_options]
where <rank_set> specifies the range of ranks to be involved in the tool execution. Separate ranks with a comma or use the "-" symbol for a set of contiguous ranks.
For example:


    
mpirun -gtool "vtune -collect memory-access -result-dir my_result:7,5" my_app.a
Examples:
  1. This example runs the HPC Performance Characterization analysis type (based on the sampling driver), which is recommended as a starting point:
    
    
        
    mpirun -n 4 vtune -result-dir my_result -collect hpc-performance -- my_app [my_app_options]
  2. This example collects the Hotspots data (hardware event-based sampling mode) for two out of 16 processes run on myhost2 in the job distributed across the hosts:
    
    
        
    mpirun -host myhost1 -n 8 ./a.out : -host myhost2 -n 6 ./a.out : -host myhost2 -n 2 vtune -result-dir foo -c hotspots -k sampling-mode=hw ./a.out
    As a result, the VTune Profiler creates a result directory foo.myhost2 in the current directory (given that process ranks 14 and 15 were assigned to the second node in the job).
  3. As an alternative to the previous example, you can create a configuration file with the following content:
    
    
        
    # config.txt configuration file
    -host myhost1 -n 8 ./a.out
    -host myhost2 -n 6 ./a.out
    -host myhost2 -n 2 vtune -quiet -collect hotspots -k sampling-mode=hw -result-dir foo ./a.out
    and run the data collection as:
    
    
        
    mpirun -configfile ./config.txt
    to achieve the same result as in the previous example: the foo.myhost2 result directory is created.
  4. This example runs the Memory Access analysis with memory object profiling for all ranks on all nodes:
    
    
        
    mpirun -n 16 -ppn 4 vtune -r my_result -collect memory-access -knob analyze-mem-objects=true -- my_app [my_app_options]
  5. This example runs Hotspots analysis (hardware event-based sampling mode) on ranks 1, 4-6, 10:
    
    
        
    mpirun -gtool "vtune -r my_result -collect hotspots -k sampling-mode=hw : 1,4-6,10" -n 16 -ppn 4 my_app [my_app_options]
The examples above use the mpirun command as opposed to mpiexec and mpiexec.hydra, while real-world jobs might use the mpiexec* ones. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and the options passed. All the listed examples work for the mpiexec* commands as well as the mpirun command.

Control Collection with Standard MPI_Pcontrol Function

By default, the VTune Profiler collects statistics for the whole application run. In some cases, it is important to enable or disable the collection for a specific application phase. For example, you may want to focus on the most time-consuming section or disable collection for the initialization or finalization phases. This can be done with the VTune Profiler instrumentation and tracing technology (ITT). Starting with the Intel VTune Profiler 2019 Update 3 version, the VTune Profiler provides the ability to control data collection for an MPI application with the help of the standard MPI_Pcontrol function.
Common syntax:
  • Pause data collection:
    MPI_Pcontrol(0)
  • Resume data collection:
    MPI_Pcontrol(1)
  • Exclude the initialization phase: use the VTune Profiler -start-paused option and add the MPI_Pcontrol(1) call right after the initialization code completes, as shown in the sketch after this list. Unlike ITT API calls, using the MPI_Pcontrol function to control data collection does not require linking the profiled application with a static ITT API library and, therefore, does not require changes in the build configuration of the application.
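For example, a minimal sketch of this pattern (the initialize_data() and compute() routines are hypothetical placeholders for your application phases):

#include <mpi.h>

/* Hypothetical placeholders for the application phases. */
static void initialize_data(void) { /* expensive setup, not of interest */ }
static void compute(void)         { /* the phase you want to profile */ }

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    initialize_data();   /* not collected when run under vtune -start-paused */

    MPI_Pcontrol(1);     /* resume data collection */
    compute();
    MPI_Pcontrol(0);     /* pause data collection before finalization */

    MPI_Finalize();
    return 0;
}

Run the collection with the -start-paused option so that profiling begins only at the MPI_Pcontrol(1) call, for example:

mpirun -n 4 vtune -start-paused -collect hotspots -trace-mpi -result-dir my_result -- ./my_app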

Resolve Symbols for MPI Modules

After data collection, the VTune Profiler automatically finalizes the data (resolves symbols and converts the data to a database). Finalization happens on the same compute node where the command line collection was executed, so the VTune Profiler automatically locates binary and symbol files. In cases where you need to point to symbol files stored elsewhere, adjust the search settings using the -search-dir option:


    
mpirun -np 128 vtune -q -collect hotspots -search-dir /home/foo/syms ./a.out

View Collected Data

Once the result is collected, you can open it in the graphical or command line interface of the VTune Profiler.
To view the results in the command line interface:
Use the -report option. To get the list of all available VTune Profiler reports, enter the vtune -help report command.
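For example, assuming the per-node result directory name from the earlier Hotspots example, a quick summary report can be generated with:

vtune -report summary -result-dir my_result.host_name1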
To view the results in the graphical interface:
Click the menu button and select Open > Result..., then browse to the required result file (*.vtune).
You may copy a result to another system and view it there (for example, to open a result collected on a Linux* cluster on a Windows* workstation).
The VTune Profiler classifies MPI functions as system functions, similar to Intel Threading Building Blocks (Intel TBB) and OpenMP* functions. This approach helps you focus on your code rather than MPI internals. You can use the VTune Profiler GUI Call Stack Mode filter bar combo box and the CLI call-stack-mode option to enable displaying the system functions and thus view and analyze the internals of the MPI implementation. The User functions+1 call stack mode is especially useful for finding the MPI functions that consumed most of the CPU time (Hotspots analysis) or waited the most (Threading analysis). For example, in the call chain main() -> foo() -> MPI_Bar() -> MPI_Bar_Impl() -> ..., MPI_Bar() is the actual MPI API function you use and the deeper functions are MPI implementation details. The call stack modes behave as follows:
  • The Only user functions call stack mode attributes the time spent in the MPI calls to the user function foo() so that you can see which of your functions you can change to actually improve the performance.
  • The default User functions+1 mode attributes the time spent in the MPI implementation to the top-level system function, MPI_Bar(), so that you can easily see particularly heavy MPI calls.
  • The User/system functions mode shows the call tree without any re-attribution so that you can see where exactly in the MPI library the time was spent.
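For example, a report command along these lines shows MPI implementation internals in the stacks (the value name all is an assumption here; check vtune -help report for the exact call-stack-mode values supported by your version):

vtune -report hotspots -call-stack-mode all -result-dir my_result.host_name1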
VTune Profiler prefixes the profile version of MPI functions with P, for example: PMPI_Init.
The VTune Profiler provides Intel TBB and OpenMP support. It is recommended to use these thread-level parallel solutions in addition to MPI-style parallelism to maximize the CPU resource usage across the cluster, and to use the VTune Profiler to analyze the performance of that level of parallelism. The MPI, OpenMP, and Intel TBB features in the VTune Profiler are functionally independent, so all usual features of OpenMP and Intel TBB support are applicable when looking into a result collected for an MPI process. For hybrid OpenMP and MPI applications, the VTune Profiler displays a summary table listing top MPI ranks with OpenMP metrics sorted by MPI Busy Wait from low to high values. The lower the Communication time is, the longer a process was on the critical path of MPI application execution. For deeper analysis, explore the OpenMP analysis for the MPI processes lying on the critical path.
Example:
This example displays the performance report for functions and modules analyzed for any analysis type. Note that this example opens per-node result directories (result_dir.host1, result_dir.host2) and groups data by processes (MPI ranks) encapsulated in the per-node result:


    
vtune -R hotspots -group-by process,function -r result_dir.host1


    
vtune -R hotspots -group-by process,module -r result_dir.host2

MPI Implementations Support

You can use the VTune Profiler to analyze both the Intel MPI library implementation and other MPI implementations, but be aware of the following specifics:
  • Linux* only: Based on the PMI_RANK or PMI_ID environment variable (whichever is set), the VTune Profiler extends a process name with the captured rank number, which is helpful for differentiating ranks in a VTune Profiler result with multiple ranks. The process naming schema in this case is <process_name> (rank <N>). To enable detecting an MPI rank ID for MPI implementations that do not provide the environment variable, use the -trace-mpi option.
  • For the Intel MPI library, the VTune Profiler classifies MPI functions/modules as system functions/modules (the User functions+1 option) and attributes their time to system functions. This option may not work for all modules and functions of non-Intel MPI implementations. In this case, the VTune Profiler may display some internal MPI functions and modules by default.
  • You may need to adjust the command line examples in this help section to work for non-Intel MPI implementations. For example, you need to adjust command lines provided for different process ranks to limit the number of processes in the job.
  • An MPI implementation must be able to operate when the VTune Profiler process (vtune) sits between the launcher process (mpirun/mpiexec) and the application process. This means that the communication information should be passed using environment variables, as most MPI implementations do. The VTune Profiler does not work with an MPI implementation that tries to pass communication information from its immediate parent process.

MPI System Modules Recognized by the VTune Profiler

VTune Profiler uses the following regular expressions in the Perl syntax to classify MPI implementation modules:
  • impi\.dll
  • impid\.dll
  • impidmt\.dll
  • impil\.dll
  • impilmt\.dll
  • impimt\.dll
  • libimalloc\.dll
  • libmpi_ilp64\.dll
This list is provided for reference only. It may change from version to version without any additional notification.

Analysis Limitations

  • The VTune Profiler does not support MPI dynamic processes (for example, the MPI_Comm_spawn dynamic process API).

Additional Resources

For more details on analyzing MPI applications, see the Intel Parallel Studio Cluster Edition and online MPI documentation at http://software.intel.com/en-US/articles/intel-mpi-library-documentation/. For information on installing VTune Profiler in a cluster environment, see the Intel VTune Profiler Installation Guide for Linux.
There are also other resources available online that discuss usage of the VTune Profiler with other Parallel Studio Cluster Edition tools.
