One of the benefits of Intel Cluster Checker is that it acts as an application proxy: if the tool passes, there is a high probability that an MPI application will run properly.
To ensure this, the Intel Cluster Checker test modules perform the following steps:
- Check that the base libraries are present and uniform across nodes (base_libraries)
- Check that MPI tools have consistent paths (mpi_consistency)
- Check that each node can run an MPI Hello World job independently (intel_mpi_rt)
- Check that a global Hello World executes successfully across compute nodes (intel_mpi_rt_internode)
- Run Intel MPI Benchmarks such as Ping Pong to check available latency and bandwidth (imb_pingpong_intel_mpi)
- Stress the communication system by running the HPCC benchmark (hpcc)
If the tool reports a problem in any of these checks, an MPI application is likely to have trouble completing its work.
These steps will even catch potential timeouts due to a wrong network stack configuration and, most importantly, bad cabling or hardware interfaces that are down. However, if the cluster uses InfiniBand adapters, there is a known issue to be aware of: the global MPI check can hang, just as any other MPI application would, if InfiniBand is not correctly configured and online. In that case the check stalls until it is interrupted, as in the following output:
Intel® MPI Library Runtime Environment (All nodes), (intel_mpi_rt_internode, 1.8.....................................................^C
Caught signal INT, cleaning before termination.
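Because a hung check otherwise has to be interrupted by hand, a launcher can be wrapped in a timeout. The following is a minimal sketch in Python, not part of Intel Cluster Checker; the helper name `run_with_timeout` and the status strings are assumptions made for illustration. Note that the timeout only kills the directly launched process, so ranks already started on remote nodes may still need separate cleanup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome.

    Returns (status, returncode) where status is "ok", "failed" or
    "timeout". This is a hypothetical helper, not a Cluster Checker API.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # The command neither finished nor failed: treat it as a hang.
        return ("timeout", None)
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.returncode)


if __name__ == "__main__":
    # Stand-in for a hanging MPI job: a child process that just sleeps.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, timeout_s=1))
```

In practice the stand-in command would be replaced by the site's own MPI launch line, and the timeout chosen well above the expected run time of a Hello World job.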
With InfiniBand setups, the Intel Cluster Checker configuration must define openib and dat_conf as dependencies of intel_mpi_rt_internode. This ensures that the InfiniBand devices are properly detected and healthy: openib checks the hardware devices, and dat_conf checks the DAPL software interface.
This decision cannot be made automatically, because whether or not to use the low-latency, high-bandwidth capabilities of InfiniBand during the check is at the discretion of the user. For instance, the administrator may want to double-check that an Ethernet fabric can be properly used to run MPI applications.
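The effect of declaring openib and dat_conf as dependencies can be illustrated with a small ordering sketch. This is not the Intel Cluster Checker configuration syntax or its implementation; the `DEPS` mapping, the `plan` function, and the assumption that a module is skipped when one of its dependencies does not pass are all hypothetical and only show the idea. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: module -> modules that must pass first.
DEPS = {
    "intel_mpi_rt_internode": {"openib", "dat_conf"},
}


def plan(deps, check):
    """Run modules in dependency order.

    check(module) returns True when the module passes. A module whose
    dependencies did not all pass is marked "skipped" instead of being
    run (assumed behavior, so a hanging check is never started).
    """
    results = {}
    # static_order() yields every dependency before its dependents.
    for module in TopologicalSorter(deps).static_order():
        blocked = [d for d in sorted(deps.get(module, ()))
                   if results[d] != "passed"]
        if blocked:
            results[module] = "skipped"
        else:
            results[module] = "passed" if check(module) else "failed"
    return results


if __name__ == "__main__":
    # Simulate an InfiniBand hardware problem: openib fails.
    print(plan(DEPS, lambda module: module != "openib"))
```

With the simulated openib failure, the global MPI check is never launched, which is the outcome the dependency declaration is meant to achieve.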
Be aware that this manual requirement may be lifted in the near future.