Reducing the Runtime of mpitune

The Intel® MPI Library includes a tool - mpitune - that can help optimize the execution parameters of the Intel MPI Library itself. While the mpitune utility is a great help in finding good parameters, it may also take a long time to execute since it explores a very wide search space.

Therefore, limiting the mpitune search space to a reasonable size, involving only the parameters that matter from the user's perspective, is key to successful usage of the tool. This is especially important for the application-specific mode of mpitune.

In order to better understand the current search space of mpitune, you may run your application in the “scheduler only” mode (-so) of mpitune.

$ mpitune -so -a \"mpirun …\"

Example output:

 # |                   Option                    |           Test          | S |
  1|I_MPI_RDMA_TRANSLATION_CACHE                 |...the best range options| - |
  2|I_MPI_WAIT_MODE                              |...the best range options| - |
  3|I_MPI_RDMA_SCALABLE_PROGRESS                 |...the best range options| - |
  4|I_MPI_EAGER_THRESHOLD                        |Tuning thresholds        | - |
  5|I_MPI_INTRANODE_EAGER_THRESHOLD              |Tuning thresholds        | - |
  6|I_MPI_RDMA_EAGER_THRESHOLD                   |Tuning thresholds        | - |
  7|I_MPI_DYNAMIC_CONNECTION                     |...the best range options| - |
  8|I_MPI_ADJUST_ALLGATHER                       |... I_MPI_ADJUST_* family| - |
  9|I_MPI_ADJUST_BARRIER                         |... I_MPI_ADJUST_* family| - |
11'Dec'15 07:13:32 INF | Tuner has finished its work. It was provided by scheduler only limitation.


The overall runtime of mpitune can be estimated by:

            #options x #option parameters x #iterations (default 3) x application runtime

Since the number of parameters per option is specific to each option, it is not easy to precisely estimate the overall runtime. For example, the I_MPI_ADJUST_ALLGATHER option has 5 parameters - 5 different algorithms that are implemented for the Allgather routine. However, assuming an average of 5 parameters per option as an approximation helps our estimation.

In this example with only 9 options to tune, the runtime will be about 135 times a single execution of the target application (9 x 5 x 3). Without limiting the search space, however, this number can be much larger.
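The estimate above can be reproduced with a few lines of Python. This is a sketch only; the average of 5 parameters per option is the approximation used in the text, and the function name is made up for illustration:

```python
# Rough mpitune runtime estimate, following the formula above:
#   #options x #option parameters x #iterations x application runtime
# Assumes an average of 5 parameters per option (an approximation).

def mpitune_runtime_estimate(num_options, avg_params=5, iterations=3,
                             app_runtime=1.0):
    """Estimated total runtime in units of one application run."""
    return num_options * avg_params * iterations * app_runtime

# 9 options from the scheduler-only output above:
print(mpitune_runtime_estimate(9))  # -> 135.0 application runs
```

With the default application runtime of 1.0, the result is expressed in multiples of a single application execution, matching the factor of 135 derived above.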

Therefore, the following mpitune parameters can help you reduce the runtime by limiting the search space.



--fabric-list / -fl

Limit the number of fabrics to tune for. Since most HPC applications run on one fabric only, select the fastest fabric available and ignore all others. Example: -fl shm:dapl - tune only for shared memory and DAPL, while ignoring other fabrics such as TCP.

--host-range / -hr

Define the range of compute nodes to be used, so that mpitune does not execute at any smaller node count. Example: -hr 4:4 - run only on 4 nodes instead of starting with a subset of nodes.

--perhost-range / -pr

Define the range of MPI ranks per node to be used, so that mpitune does not execute at any smaller rank count. Example: -pr 28:28 - run only with 28 ranks per node instead of starting with a smaller number.

--collectives-only / -co

Tune only the collective operations used inside the application.

--iterations / -i

Define the number of iterations for each step. The default is 3 iterations per configuration; reducing this number also reduces the overall mpitune execution time. The side effect, however, is that fewer iterations increase the influence of noise on your final results.

--options-set / -os

Define your own set of options to be tuned by mpitune. Example: -os I_MPI_ADJUST_ALLGATHER,I_MPI_WAIT_MODE
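Combining several of these options, a restricted tuning run might look like the following sketch. The rank count and application name (./my_app) are placeholders you would adapt to your own cluster:

```shell
# Hypothetical example: tune collectives only, on exactly 4 nodes with
# 28 ranks per node, over shm:dapl, with 2 iterations per configuration.
mpitune -fl shm:dapl -hr 4:4 -pr 28:28 -i 2 -co \
        -a \"mpirun -n 112 ./my_app\"
```

Each added restriction multiplies down the search space, so combining them can reduce the runtime estimate above substantially.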


Fast Tuning

Alternatively, one can use a completely different approach to tune the execution parameters of the Intel MPI Library. This fast tuning approach is available starting with Intel MPI 5.1.1 and leverages a different implementation of mpitune. Be aware that since the implementation is different, the command line parameter set also differs from the default mpitune.

Fast tuning focuses on tuning the collective operations only and, by default, requires that a minimum of 10% of the application time is spent within collective operations. The workflow of the fast tuning approach is as follows.

  1. Executes the target application once with statistics enabled
  2. Studies the statistics output
  3. Runs the Intel® MPI Benchmarks to tune the collective operations used inside the target application

The third step applies only the number of MPI ranks and the message sizes used within the target application, and therefore reproduces the behavior of the collective operations used. However, since it runs only the Intel MPI Benchmarks for that purpose, it further reduces the time needed to tune these parameters.
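The statistics collection of step 1 can also be done manually with the library's native statistics facility. This is a sketch; the statistics level and the application name are assumptions, not prescribed by the fast tuning workflow:

```shell
# Hypothetical manual equivalent of step 1: run the application once with
# native Intel MPI statistics enabled, so collective usage can be inspected.
export I_MPI_STATS=4          # native statistics level (an example value)
mpirun -n 112 ./my_app        # writes a stats output file for inspection
```

Inspecting this output yourself shows which collectives dominate and whether the 10% threshold mentioned above is likely to be met.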

In order to make use of the fast tuning, one can use the "--fast" flag as follows.

$ mpitune --fast on -a \"mpirun …\"

For more information on the different parameters of mpitune as well as the usage of the mpitune utility, please have a look at the reference manual inside the Intel MPI Library installation, or visit our online documentation page, which includes an additional MPI Tuner Tutorial. Also, the command line help of the mpitune utility, especially in combination with the fast tuning option, can provide further help.

For more complete information about compiler optimizations, see our Optimization Notice.