User Guide

  • 2020
  • 09/09/2020
  • Public Content

MPI Analysis Workflow

To analyze the performance and correctness of an MPI application at the inter-process level, use the Intel® Trace Analyzer and Collector tool (located in the <installdir>/itac directory after installation). The Intel Trace Analyzer and Collector attaches to the application through linkage (static, dynamic, through LD_PRELOAD, or via the Intel Compiler -tcollect and -tcollect-filter options), or by using the itcpin tool. The tools collect information about MPI-level events between processes so that you can analyze the performance and correctness of MPI calls and detect deadlocks, data layout errors, and risky or incorrect MPI constructs. The Intel Trace Analyzer and Collector data is correlated and aggregated across all processes and all nodes that participated in the execution run.
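As an illustration of the attachment methods described above, the following dry-run sketch prints (rather than executes) two possible launch lines. It assumes the Intel MPI Library launcher (mpirun with its -genv and -trace options) and the collector library name libVT.so; the application name ./my_app and the helper function names are hypothetical. Check the options against your installed version before running the printed commands.

```shell
#!/bin/sh
# Dry-run sketch: print two ways to attach the Intel Trace Analyzer and
# Collector to an MPI run. Nothing is launched by this script.
# Assumptions: Intel MPI mpirun, collector library libVT.so,
# hypothetical application ./my_app.

# Attach by preloading the collector library into every rank.
itac_preload_cmd() {
    # $1 = number of ranks, $2 = application
    printf 'mpirun -genv LD_PRELOAD libVT.so -n %s %s\n' "$1" "$2"
}

# Attach through the launcher's -trace option (assumed Intel MPI option,
# not named in the text above).
itac_trace_cmd() {
    # $1 = number of ranks, $2 = application
    printf 'mpirun -trace -n %s %s\n' "$1" "$2"
}

itac_preload_cmd 4 ./my_app
itac_trace_cmd 4 ./my_app
```

Printing the command first makes it easy to review the launch line before committing cluster time to a traced run.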
Beyond the inter-process level of MPI parallelism, the processes that make up the applications on a modern cluster often also use fork-join threading through OpenMP* and Intel® Threading Building Blocks (Intel® TBB). This is where the Intel® VTune™ Profiler and the Intel Inspector should be used to analyze the performance and the correctness of an MPI application, respectively.
At a high level, the analysis workflow consists of three steps:
  1. Use the amplxe-cl and inspxe-cl command-line tools to collect data about an application. By default, all processes are analyzed, but it is possible (and sometimes required for Intel VTune Profiler because of certain collection technology limitations) to limit data collection to a subset of processes. An individual result directory is created for each spawned MPI application process that was analyzed, with the MPI process rank value captured.
  2. Post-process the result, which is also called finalization or symbol resolution. This is done automatically for each result directory once the collection has finished.
  3. Open the content of each result directory through the GUI standalone viewer to analyze the data for the specific process. The GUI viewers are independent: Intel VTune Profiler and Intel Inspector have their own user interfaces.
  • The file system contents should be the same on all nodes so that the modules referenced in the collected data are available automatically on the host where the collection was initiated. You can work around this limitation by manually copying the modules for analysis from the nodes and adjusting the Intel VTune Profiler / Intel Inspector project search directories so that the modules can be found.
  • For Intel VTune Profiler, the CPU model and stepping should be the same on all nodes so that hardware event-based sampling operates with the same Performance Monitoring Unit (PMU) type on all nodes.
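The collection step of the workflow above can be sketched as follows. This is a dry-run sketch that only prints candidate command lines. It assumes an mpirun launcher, the hotspots analysis type for amplxe-cl, and the mi1 analysis type for inspxe-cl; the names my_result and ./my_app and the helper functions are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print collection commands for step 1 of the workflow.
# Nothing is launched by this script.
# Assumptions: mpirun launcher, "hotspots" (VTune Profiler) and "mi1"
# (Inspector) analysis types, hypothetical names my_result and ./my_app.

# Profile every rank with Intel VTune Profiler.
vtune_all_ranks_cmd() {
    # $1 = rank count, $2 = result directory name, $3 = application
    printf 'mpirun -n %s amplxe-cl -result-dir %s -collect hotspots -- %s\n' "$1" "$2" "$3"
}

# Check every rank with Intel Inspector.
inspector_all_ranks_cmd() {
    # $1 = rank count, $2 = result directory name, $3 = application
    printf 'mpirun -n %s inspxe-cl -result-dir %s -collect mi1 -- %s\n' "$1" "$2" "$3"
}

# Profile a subset: all ranks but the last run without the tool, and the
# last rank runs under amplxe-cl (':' separates the two launch groups).
vtune_one_rank_cmd() {
    # $1 = total rank count (2 or more), $2 = result directory, $3 = application
    plain=$(( $1 - 1 ))
    printf 'mpirun -n %s %s : -n 1 amplxe-cl -result-dir %s -collect hotspots -- %s\n' "$plain" "$3" "$2" "$3"
}

vtune_all_ranks_cmd 4 my_result ./my_app
inspector_all_ranks_cmd 4 my_result ./my_app
vtune_one_rank_cmd 4 my_result ./my_app
```

Because the tool is started once per rank by the launcher, each analyzed rank produces its own result directory, which is then opened separately in the corresponding GUI viewer.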

MPI Analysis Limitations

There are certain limitations in the current MPI profiling support provided by the Intel VTune Profiler / Intel Inspector:
  • MPI dynamic processes are not supported by the Intel VTune Profiler / Intel Inspector. An example of a dynamic process API is MPI_Comm_spawn.
  • Only one data collection that uses the hardware event-based sampling collector is allowed at a time on a system. When the Intel VTune Profiler is used to profile an MPI application, it is the responsibility of the user to make sure that only one SEP data collection session is launched on a given host. A common way to achieve this is to use the host syntax and distribute the ranks running under the tool over different hosts.
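One way to keep a single hardware event-based sampling session per host is to generate the launch line from a list of distinct hosts, placing exactly one profiled rank on each. The dry-run sketch below prints such a command. It assumes an mpirun launcher that accepts -host and ':'-separated launch groups; the helper name, the host names node1 and node2, the analysis type uarch-exploration, and the names my_result and ./my_app are illustrative assumptions.

```shell
#!/bin/sh
# Dry-run sketch: print an mpirun line that places exactly one rank
# running under amplxe-cl on each listed host. Nothing is launched.
# Assumptions: mpirun with -host and ':' launch groups; the analysis
# type passed in is one that uses hardware event-based sampling.

sep_per_host_cmd() {
    # $1 = result directory, $2 = analysis type, $3 = application,
    # remaining arguments = distinct host names (one profiled rank each)
    result=$1
    analysis=$2
    app=$3
    shift 3
    cmd='mpirun'
    sep=''
    for host in "$@"; do
        cmd="$cmd$sep -host $host -n 1 amplxe-cl -result-dir $result -collect $analysis -- $app"
        sep=' :'
    done
    printf '%s\n' "$cmd"
}

sep_per_host_cmd my_result uarch-exploration ./my_app node1 node2
```

The sketch does not check the host list for duplicates; passing the same host twice would request two sampling sessions on one system, which is the situation the limitation above rules out.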
