The cluster tuning workflow differs from the approach for non-distributed applications. In most cases, you tune on a “general-to-specific” basis: start by optimizing at the communication level, and then move on to optimizing at the node level.
The cluster tuning workflow discussed in this section is not universal and might not apply in certain cases. You can perform steps in a different order than presented below or skip them entirely. A common alternative to the recommended workflow is to analyze the node-level performance first (step 6, below, and other optimizations) and then proceed to the cluster-level tuning (steps 1-5, below).
The following recommendations assume you have a working cluster environment and an MPI application source code or executable.
1. Check Cluster Environment (Linux* only)
To start, check the cluster “health”, using the Intel® Cluster Checker. Intel® Cluster Checker verifies the configuration and performance of the cluster by performing several tests. If the tool identifies some issues, it provides diagnostic messages and, in some cases, recommendations to help repair the cluster.
For example, the Intel Cluster Checker can check the availability of the nodes, the performance of MPI and disk I/O operations, the uniformity and health of the interconnects, and more. For the full list of checks the tool can perform, refer to the Intel Cluster Checker User’s Guide.
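As a rough sketch of such a check, the commands below write a node file and assemble the command line that passes it to the tool. The `clck -f` form is an assumption based on recent Intel Cluster Checker releases, so confirm it against the User’s Guide for your version. The host names and file name are hypothetical.

```shell
# Hypothetical node list: one host name per line.
NODEFILE=./nodefile
printf '%s\n' node01 node02 node03 node04 > "$NODEFILE"

# Assumed invocation (verify for your Cluster Checker version):
# -f names the file that lists the nodes to check.
CLCK_CMD="clck -f $NODEFILE"
echo "$CLCK_CMD"

# Uncomment on a cluster where Intel Cluster Checker is installed:
# $CLCK_CMD
```

The snippet only prints the command, so it is safe to try on a machine without the tool.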
2. Start Using Intel® MPI Library
Use the Intel® MPI Library to get the most from Intel® microarchitecture processors and interconnects, like Intel® Omni-Path architecture. The library is well optimized for this hardware out-of-the-box and can adapt to a given cluster environment.
The Intel MPI Library is ABI-compatible with MPICH, which means you don’t have to recompile your MPICH-based code when switching to the Intel MPI Library. For applications based on other MPI implementations, the Intel MPI Library provides compiler wrapper scripts that simplify the recompilation procedure. These scripts are available for Intel® compilers (mpiicc and mpiifort) and for GNU compilers (mpicc, mpifort) for C/C++ and Fortran languages.
To run an MPI application, use the launcher scripts provided: mpirun on Linux* or mpiexec on Windows*.
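A minimal sketch of the build-and-launch cycle on Linux might look as follows. The source file, binary name, and rank count are hypothetical; the wrapper and launcher names are the ones listed above.

```shell
# Hypothetical source file, binary name, and rank count.
SRC=myapp.c
APP=./myapp
NP=4

# Compile with the Intel compiler wrapper (use mpicc for the GNU compiler).
BUILD_CMD="mpiicc -o $APP $SRC"

# Launch on Linux; on Windows the launcher is mpiexec instead of mpirun.
RUN_CMD="mpirun -n $NP $APP"

echo "$BUILD_CMD"
echo "$RUN_CMD"

# Uncomment where the Intel MPI Library environment is set up:
# $BUILD_CMD && $RUN_CMD
```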
By using the Intel MPI Library, you can also benefit from built-in analysis capabilities and from tight integration with other analysis tools. As described below, there are several runtime options that can facilitate your analysis with MPI Performance Snapshot, Intel® Trace Analyzer and Collector, Intel® VTune™ Amplifier XE and others.
3. Perform High-Level Performance Analysis (Linux* only)
Use the MPI Performance Snapshot tool (available within the Intel® Trace Analyzer and Collector package) as a starting point for analyzing your MPI application. The tool provides some basic information about your application, enabling you to get a general idea about its performance. The tool also provides recommendations on where to continue with a more thorough analysis.
MPI Performance Snapshot provides important metrics on the time spent in computation, MPI, and OpenMP* and whether these loads are balanced. It also tells you the top five MPI functions, memory usage, and much more.
To use the MPI Performance Snapshot, run the application with the -mps option of mpirun to collect statistics. Then analyze the statistics with the mps utility.
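The collection step could be sketched like this, assuming a hypothetical binary `./myapp` run on four ranks. The arguments of the mps utility differ between releases, so they are deliberately left out.

```shell
# Hypothetical rank count and binary name.
NP=4
APP=./myapp

# -mps asks mpirun to collect statistics for MPI Performance Snapshot.
MPS_RUN_CMD="mpirun -mps -n $NP $APP"
echo "$MPS_RUN_CMD"

# Uncomment where the Intel MPI Library is installed:
# $MPS_RUN_CMD

# After the run, pass the collected statistics to the mps utility.
# Its arguments (statistics file or directory names) depend on the
# release, so check the tool's help output before running it.
```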
Based on the analysis results for your particular application, the MPI Performance Snapshot recommends which tool is best to use next. In general, use Intel Trace Analyzer and Collector as the next step.
Most of the data is presented in a summary HTML report, although you can get more detail from the command-line diagrams.
4. Analyze MPI Communication
As the next practical step, use the Intel® Trace Analyzer and Collector to analyze the MPI communication in the application. Intel Trace Analyzer and Collector offers a rich variety of charts and diagrams that help you understand the application flow, identify the hotspots, and understand the load balancing across MPI ranks.
The first step in using Intel Trace Analyzer and Collector is to “trace” the application. On Linux* OS, use the -trace option of mpirun to generate a trace file; on Windows*, recompile the application with the -trace option. To analyze the application, open the trace file in the Intel Trace Analyzer GUI (traceanalyzer tracefile.stf).
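On Linux, the two steps could be sketched as below. The binary name and rank count are hypothetical, and the trace file name assumes the collector names the file after the executable with an .stf suffix; check your working directory for the actual name.

```shell
# Hypothetical rank count and binary name.
NP=4
APP=./myapp

# Step 1: run with -trace so that a trace file is generated.
TRACE_RUN_CMD="mpirun -trace -n $NP $APP"

# Assumption: the trace file is named after the executable.
TRACE_FILE="$APP.stf"

# Step 2: open the trace file in the Intel Trace Analyzer GUI.
VIEW_CMD="traceanalyzer $TRACE_FILE"

echo "$TRACE_RUN_CMD"
echo "$VIEW_CMD"

# Uncomment where Intel Trace Analyzer and Collector is installed:
# $TRACE_RUN_CMD && $VIEW_CMD
```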
The most common chart in the Intel Trace Analyzer is the Event Timeline, showing the application activity over time and helping you understand the MPI communication pattern.
Other important charts include:
- Quantitative Timeline – allows you to view the work distribution between MPI ranks at a given point in time
- Function Profile – presents the function summary and load balance
- Message and Collective Operation Profiles – show how point-to-point and collective operations load the MPI ranks
These charts let you visualize how effectively you’ve used MPI in your cluster program. While there are no one-size-fits-all solutions to the issues you might see in your program, this visualization tool, together with insight into your program, can help you find fixes. Once you’ve made edits, you can use comparison mode to show the overall effects of the changes you’ve made.
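One way to put a number on the load balance that these charts visualize is to compare the slowest rank with the average. The sketch below does this for a handful of per-rank times. The values are made up for illustration, and the formula (maximum minus mean, divided by maximum) is a common convention rather than the exact metric that Intel Trace Analyzer reports.

```shell
# Hypothetical per-rank compute times in seconds (ranks 0..3);
# these are illustrative values, not measured data.
TIMES="10 12 11 15"

# Imbalance = (max - mean) / max; 0 means perfectly balanced ranks.
IMBALANCE=$(echo "$TIMES" | awk '{
    max = $1; sum = 0
    for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
    printf "%.2f", (max - sum / NF) / max
}')

echo "imbalance: $IMBALANCE"
```

With these inputs the mean is 12 and the maximum is 15, so the script prints `imbalance: 0.20`: on average, a rank is idle for a fifth of the slowest rank’s run time. Recomputing the figure before and after a change gives a simple numeric counterpart to comparison mode.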
5. Automatic MPI Tuning
Intel® MPI Library provides an automatic tuning utility called mpitune, which can adjust the library settings for a specific application. After you have fixed all major communication issues, consider using this utility to refine the library settings. For example, if your application uses collective MPI operations heavily, you can use the fast tuning mode, mpitune --fast, to choose the best collective algorithms for your application.
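A tuning session might be assembled roughly as follows. Only `--fast` comes from the text above. The `--application` and `--output-file` options of mpitune and the `-tune` option of mpirun are assumptions based on older Intel MPI Library releases, and the mpitune interface has changed between versions, so verify every option against the reference for your release. The binary name, rank count, and configuration file name are hypothetical.

```shell
# Hypothetical rank count, binary name, and configuration file name.
NP=32
APP=./myapp
CONF=./myapp.conf

# Assumed options: --application gives the command line to tune,
# --output-file names the file that receives the tuned settings.
TUNE_CMD="mpitune --fast --application \"mpirun -n $NP $APP\" --output-file $CONF"

# Assumed follow-up: rerun the application with the tuned settings.
TUNED_RUN_CMD="mpirun -tune $CONF -n $NP $APP"

echo "$TUNE_CMD"
echo "$TUNED_RUN_CMD"
```

The snippet only prints the command lines; copy them to a terminal once you have confirmed the options for your installation.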
6. Perform Node-Level Analysis
Once the MPI part of the application works as expected, use Intel® VTune™ Amplifier XE to analyze the node-level application performance. The Intel® MPI Library offers facilities for easier interoperability with the tool.
For analyzing MPI applications, Intel VTune Amplifier provides a dedicated analysis type, HPC Performance Characterization. This analysis can help you identify how effectively the CPU and memory resources are used, and how well the application is parallelized with OpenMP* threads.
To run your application under Intel VTune Amplifier, on Linux* use the -gtool option of mpirun to specify the ranks to be analyzed. On Windows*, you can specify the Intel VTune Amplifier command for a particular MPI process when running the application.
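On Linux, a -gtool launch could be sketched as below. The argument pairs a tool command with the set of ranks it applies to, separated by a colon. The `amplxe-cl` command name and the `hpc-performance` analysis type are assumptions based on the VTune Amplifier XE command-line interface and should be checked against your version. The binary name, result directory, and rank selection are hypothetical.

```shell
# Hypothetical rank count, binary name, result directory, and rank set.
NP=4
APP=./myapp
RESULT_DIR=vtune_result
RANKS=0          # analyze rank 0 only; the other ranks run unprofiled

# -gtool expects "<tool command>:<rank set>".
GTOOL_ARG="amplxe-cl -collect hpc-performance -r $RESULT_DIR:$RANKS"
VTUNE_RUN_CMD="mpirun -gtool \"$GTOOL_ARG\" -n $NP $APP"

echo "$VTUNE_RUN_CMD"
```

Profiling a single rank, or a small range of ranks, keeps the collection overhead and the amount of result data manageable on larger runs.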
On the node level, apart from using the VTune Amplifier, you can use any of the techniques discussed in the General Step-by-Step Approach in the last section to boost the application intra-node performance.