Running Intel® Cluster Checker on Large Clusters with Hundreds of Nodes

On clusters with several hundred nodes, it is especially useful to run both the Intel® Cluster Ready certification mode and the default wellness mode. Together they combine more than 100 test modules with a total of 5000 individual checks on configuration, uniformity, and performance.

Consider the following tips before running the tool.

1.     Be sure to run the latest version of Intel Cluster Checker: scalability and execution time are improved continually as part of the development program.
2.     Use the usual nodefile environment variable to define the list of nodes to exercise. The system's job scheduler should pick up this setting when running Intel Cluster Checker (*) (**).
(*)  We highly recommend using a small subset of compute nodes while you set up the Intel Cluster Checker configuration. Once the findings are resolved, it should be easy to request more nodes and run the same settings over the whole cluster.
(**) You can raise the process limit to increase the number of processes working in parallel, although this also increases the load on the system running Intel Cluster Checker. For example:
<process-limit> 256 </process-limit>
3.     Check the running time of each test module before tuning module configurations; focusing on the longest-running modules first is likely to yield the greatest benefit for your effort. You can also explicitly exclude a test module with the --exclude option or the <exclude> tag. The elapsed time of each test module is recorded in the output logs and can be found with the following command:
$ grep elapsed *xml
log.xml:  <module name="imb_pingpong_intel_mpi" description="Network Performance" elapsed_time="304">
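When a log contains many modules, it can help to rank them by elapsed time instead of reading the grep output by eye. The following is a minimal Python sketch, not part of Intel Cluster Checker. It assumes that every module appears in the log as an opening module tag with name and elapsed_time attributes, as in the sample line above, and that elapsed_time is an integer. The function name rank_modules is hypothetical.

```python
import re

# Matches an opening <module ...> tag; attribute order is not assumed.
MODULE_TAG = re.compile(r"<module\b[^>]*>")
NAME_ATTR = re.compile(r'\bname="([^"]*)"')
ELAPSED_ATTR = re.compile(r'\belapsed_time="(\d+)"')


def rank_modules(log_text):
    """Return (name, elapsed_time) pairs, longest-running module first.

    Assumption: the log format matches the sample line shown in this
    article; tags lacking either attribute are skipped.
    """
    results = []
    for tag in MODULE_TAG.findall(log_text):
        name = NAME_ATTR.search(tag)
        elapsed = ELAPSED_ATTR.search(tag)
        if name and elapsed:
            results.append((name.group(1), int(elapsed.group(1))))
    # Sort by elapsed time, descending; ties keep their order in the log.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Example using the sample line from the log excerpt above.
sample = (
    '<module name="imb_pingpong_intel_mpi" '
    'description="Network Performance" elapsed_time="304">'
)
for module_name, elapsed in rank_modules(sample):
    print(elapsed, module_name)
```

To apply it to a real log, read the file contents into a string and pass that string to rank_modules; the modules at the top of the result are the first candidates for tuning or exclusion.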
Several well-known benchmarks are included with Intel Cluster Checker. All of them are optimized to provide useful insights in the shortest amount of time. However, you may want to tune the input problem size further to fit your particular needs. For instance, the configuration details of the HPCC* benchmark test module are described in its man page:
$ man clck-hpcc
Beyond several hundred nodes, additional measures are worth considering, such as running the tool multiple times in parallel over different racks of the system, intersecting as many nodes as possible. Contact your Intel Cluster Ready Technical Support Engineer for specific details about running Intel Cluster Checker on systems with a thousand nodes or more.
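As a rough illustration of splitting a large node list into groups for separate runs, the sketch below divides a list of hostnames into consecutive groups, optionally sharing a few nodes between neighboring groups. This is only one possible reading of the advice above and is not an Intel Cluster Checker feature: the function name split_nodes, the group size, and the overlap are all hypothetical, and real rack membership should come from your site inventory rather than from list order.

```python
def split_nodes(nodes, group_size, overlap=0):
    """Split a node list into consecutive groups for separate runs.

    Each group holds up to group_size nodes, and consecutive groups
    share `overlap` nodes. Assumption: list order reflects rack order.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must be >= 0 and smaller than group_size")
    step = group_size - overlap
    groups = []
    for start in range(0, len(nodes), step):
        groups.append(nodes[start:start + group_size])
        # Stop once the current group reaches the end of the list.
        if start + group_size >= len(nodes):
            break
    return groups


# Example: ten hypothetical hostnames, groups of four sharing one node.
hosts = [f"node{i:03d}" for i in range(1, 11)]
for group in split_nodes(hosts, group_size=4, overlap=1):
    print(" ".join(group))
```

Each resulting group can then be used as the node list for one run, following the same nodefile approach described in tip 2.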