Intel Cluster Checker: Reverse Execution Mode

General Description


Reverse Execution is a new Intel Cluster Checker execution mode. The name reflects how the tool runs in this mode: the usual order is reversed, so modules with more dependencies run first, and their dependencies are tested only if a test module fails.

This mode takes an optimistic view of the cluster as a whole: it assumes that the cluster under test is healthy and that all functionality works properly. Under this assumption, the list of test modules to run is built without adding their dependencies and is sorted so that modules with more dependencies come first. If a test module fails during execution, root causing is triggered: the list of modules to run is replaced by the failed module's dependencies, sorted in the same way.
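The scheduling described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the dependency map `DEPS`, the helper names, and the `run` callback are hypothetical, and the sketch assumes that "dependencies" means transitive dependencies and that, once root causing starts, every dependency in the new list is run.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical dependency map, loosely modelled on module names used in this article.
DEPS: Dict[str, List[str]] = {
    "hpcc": ["intel_mpi_rt_internode"],
    "intel_mpi_rt_internode": ["intel_mpi_rt", "ssh"],
    "intel_mpi_rt": ["sh"],
    "ssh": ["ping"],
    "sh": [],
    "ping": [],
}


def all_deps(module: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return the transitive dependencies of a module."""
    seen: Set[str] = set()
    stack = list(deps.get(module, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, []))
    return seen


def most_deps_first(modules: Iterable[str], deps: Dict[str, List[str]]) -> List[str]:
    """Sort modules so that those with more dependencies come first (ties by name)."""
    return sorted(modules, key=lambda m: (-len(all_deps(m, deps)), m))


def reverse_execute(requested: Iterable[str],
                    deps: Dict[str, List[str]],
                    run: Callable[[str], bool]) -> Dict[str, bool]:
    """Run the requested modules only; on the first failure, switch to root causing."""
    results: Dict[str, bool] = {}
    queue = most_deps_first(requested, deps)  # dependencies are NOT added here
    root_causing = False
    while queue:
        module = queue.pop(0)
        if module in results:
            continue
        results[module] = run(module)
        if not results[module] and not root_causing:
            # Root causing: replace the list with the failed module's dependencies.
            root_causing = True
            queue = most_deps_first(all_deps(module, deps), deps)
    return results


# Usage: simulate a cluster where 'ssh' is broken, so everything built on it fails.
broken = {"hpcc", "intel_mpi_rt_internode", "ssh"}
for name, ok in reverse_execute(["hpcc"], DEPS, lambda m: m not in broken).items():
    print(name, "Succeeded" if ok else "FAILED")
```

On a healthy cluster the loop runs only the requested modules, which is where the time saving comes from; the dependency list is consulted only after a failure.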

When the execution ends, we can tell whether the module failed on its own or because of one of its dependencies.
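As a rough illustration of that final check, the sketch below classifies a failure from a table of results. The function name `classify_failure` and the result format (module name mapped to pass/fail) are assumptions made for this example, not part of the tool.

```python
from typing import Dict, Iterable


def classify_failure(module: str,
                     results: Dict[str, bool],
                     dependencies: Iterable[str]) -> str:
    """Say whether a module passed, failed on its own, or failed with failing dependencies."""
    if results.get(module, True):
        return "passed"
    # Only dependencies that actually ran and failed count as suspects.
    failed = [dep for dep in dependencies if results.get(dep) is False]
    if failed:
        return "failed due to dependencies: " + ", ".join(failed)
    return "failed on its own"


# Usage with hypothetical results: 'hpcc' failed and so did its dependency 'ssh'.
print(classify_failure("hpcc", {"hpcc": False, "ssh": False, "ping": True}, ["ssh", "ping"]))
```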

Why use the Reverse Execution Mode?



Running the whole set of tests is time consuming, more so if it is done on a regular basis. So, what about assuming that our cluster is functioning properly? Following that line of thought led to the creation of this execution mode.

If you run Intel Cluster Checker in Reverse Execution Mode, you run a reduced set of tests, namely the tests with the most dependencies, which reduces the execution time. These tests exercise the cluster as a whole (memory, CPU, network, etc.). If they pass, we assume that the cluster is in good shape.

Execution Example


Example 1 is a simple successful execution of the 'hpcc' test module in reverse mode.

Example 1

[icr@pepinillo ~]$ /opt/intel/clck/1.7/cluster-check --include_only hpcc --reverse

Warning: running in reverse mode...
   This feature is experimental and included here for feedback
   Output logs generated with this option cannot be used for
   certification purposes

Intel® Cluster Checker, Version s4-rev33425-b890
Commandline: '/opt/intel/clck/1.7/cluster-check --include_only hpcc --reverse'
Running as 'icr' on 2011-03-22 15:21:00

Configuration: /etc/intel/clck/config.xml

<cluster>
  <global_configuration>
    <cc-path>/opt/intel/cce/11.1.069/</cc-path>
    <fc-path>/opt/intel/fce/11.1.069/</fc-path>
    <ibstat-path>/usr/sbin/</ibstat-path>
    <mkl-path>/opt/intel/mkl/10.2.4.032/</mkl-path>
    <mpi-path>/opt/intel/impi/4.0.0.025/</mpi-path>
  </global_configuration>
  <nodefile>/etc/intel/clck/nodelist</nodefile>
  <test>
  </test>
  <user>icr</user>
</cluster>

Checking 5 nodes:
  compute-03-00, compute-03-01, compute-03-02, compute-03-03, pepinillo

Log files saved in directory: /var/log/intel/clck

Exclusively including modules (at user request):
  'hpcc'

Including modules (at user request):
  'ping'

Test                                                                   Result
--------------------------------------------------------------------------------
Basic Network Connectivity, (ping).....................................Succeeded
HPC Challenge Benchmark (Intel® C++ Compiler, Intel® MPI
Library, Intel® Math Kernel Library), (hpcc)
Attention: this check may take a long time to complete.................Succeeded

Check has Succeeded.

[icr@pepinillo ~]$ cat /var/log/intel/clck/config-20110322.152100.xml | grep 'name="hpcc"'
    <module version="s4-rev33425-b890" name="hpcc" severity="1"
description="HPC Challenge Benchmark (Intel® C++ Compiler, Intel® MPI
Library, Intel® Math Kernel Library)" elapsed_time="88">



What would happen if the 'hpcc' test module failed in reverse mode?

As Example 2 shows, the dependencies of the 'hpcc' test module are run after that module fails. The last dependency that failed is 'ssh', so we can assume that there is a problem with ssh connectivity in the cluster.

Example 2

[icr@pepinillo logs]$ /opt/intel/clck/1.7/cluster-check --include_only hpcc --reverse
Warning: running in reverse mode...
   This feature is experimental and included here for feedback
   Output logs generated with this option cannot be used for
   certification purposes

Intel® Cluster Checker, Version s4-rev33425-b890
Commandline: '/opt/intel/clck/1.7/cluster-check --include_only hpcc --reverse'
Running as 'icr' on 2011-03-22 17:01:26

Configuration: /etc/intel/clck/config.xml

<cluster>
  <global_configuration>
    <cc-path>/opt/intel/cce/11.1.069/</cc-path>
    <fc-path>/opt/intel/fce/11.1.069/</fc-path>
    <ibstat-path>/usr/sbin/</ibstat-path>
    <mkl-path>/opt/intel/mkl/10.2.4.032/</mkl-path>
    <mpi-path>/opt/intel/impi/4.0.0.025/</mpi-path>
  </global_configuration>
  <nodefile>/etc/intel/clck/nodelist</nodefile>
  <test>
  </test>
  <user>icr</user>
</cluster>

Checking 5 nodes:
  compute-03-00, compute-03-01, compute-03-02, compute-03-03, pepinillo

Log files and debug files saved in directory: /home/icr/logs

Exclusively including modules (at user request):
  'hpcc'

Including modules (at user request):
  'ping'

Test                                                                   Result
--------------------------------------------------------------------------------
Basic Network Connectivity, (ping).....................................Succeeded
HPC Challenge Benchmark (Intel® C++ Compiler, Intel® MPI
Library, Intel® Math Kernel Library), (hpcc)
Attention: this check may take a long time to complete.................FAILED
 [ERROR]
subtest 'pre-runtime error' failed
  - failing All hosts returned: 'error building or distributing hpcc'
Intel® MPI Library Runtime Environment (All nodes),
(intel_mpi_rt_internode)...............................................FAILED
 [ERROR]
  - failing All hosts returned: 'hello world compilation error'
Intel® MPI Library Runtime Environment (Single-node),
(intel_mpi_rt).........................................................FAILED
 [NOTICE]
subtest 'mpd shutdown' indeterminate
  - indeterminate All hosts returned: 'skipped due to earlier failure'
subtest 'mpd startup' indeterminate
  - indeterminate All hosts returned: 'skipped due to earlier failure'
 [ERROR]
subtest 'Permissions on $HOME/.mpd.conf' failed
  - failing All hosts returned: 'file does not exist'
Bourne Shell, (sh).....................................................FAILED
 [ERROR]
subtest 'Hello World!' failed
  - failing All hosts returned: 'unspecified runtime error'
subtest 'executable shell interpreter' failed
  - failing All hosts returned: '/bin/sh not found'
GenuineIntel processors, (genuine_intel)...............................FAILED
 [CRITICAL]
subtest 'default' failed
  - failing All hosts
Node SSH Connectivity, (ssh)...........................................FAILED
 [CRITICAL]
subtest 'default' failed
  - failing All hosts returned: 'expected "test string", got "unknown
    failure"'
Basic Network Connectivity, (ping).....................................Succeeded

Check has FAILED.
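
Reading the result table by eye works for a single run; for repeated runs, the console output of Example 2 could be scanned with a small script. The sketch below is a hedged illustration: it assumes the layout shown in the examples (a module name in parentheses, followed on the same or a later line by a run of dots and `Succeeded` or `FAILED`), and the function name `parse_results` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

NAME = re.compile(r"\((\w+)\)")                          # e.g. "(hpcc)", but not "(All nodes)"
RESULT = re.compile(r"\.{3,}(Succeeded|FAILED)\s*$")     # dotted leader ending in a verdict


def parse_results(output: str) -> List[Tuple[str, str]]:
    """Return (module, verdict) pairs in the order they appear in the output."""
    results: List[Tuple[str, str]] = []
    current: Optional[str] = None
    for line in output.splitlines():
        names = NAME.findall(line)
        if names:
            current = names[-1]          # remember the most recent module name
        match = RESULT.search(line)
        if match and current is not None:
            results.append((current, match.group(1)))
            current = None
    return results


# Shortened excerpt of the output shown in Example 2.
SAMPLE = """\
Basic Network Connectivity, (ping).....................................Succeeded
HPC Challenge Benchmark (Intel® C++ Compiler, Intel® MPI
Library, Intel® Math Kernel Library), (hpcc)
Attention: this check may take a long time to complete.................FAILED
Bourne Shell, (sh).....................................................FAILED
Node SSH Connectivity, (ssh)...........................................FAILED
Basic Network Connectivity, (ping).....................................Succeeded
"""

failed = [name for name, verdict in parse_results(SAMPLE) if verdict == "FAILED"]
print("failed modules:", failed)
print("last failed dependency:", failed[-1] if failed else None)
```

Because modules with fewer dependencies run later, the last failed entry is the natural first suspect, which is how 'ssh' was singled out above.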

The Grass is Always Greener on the Other Side

We can agree that the cluster is not being thoroughly tested in this mode; as usually happens, we are trading precision for time.
The main focus of this execution mode is to provide fast and comprehensive results on which to base decisions. For example, instead of running the whole set of tests every day to ensure the cluster's health, you would save a lot of time by running in Reverse Execution Mode and triggering a regular execution only if something went wrong.
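
That daily routine might be scripted along the following lines. This is a sketch under stated assumptions: the install path and the `--include_only hpcc --reverse` arguments are copied from the examples in this article, success is detected by looking for the `Check has Succeeded.` line shown in Example 1, and running `cluster-check` with no arguments is assumed to start a regular execution. The function names are hypothetical.

```python
import subprocess
from typing import Callable, List

# Path taken from the examples in this article; adjust for your installation.
CLCK = "/opt/intel/clck/1.7/cluster-check"


def run_clck(args: List[str]) -> str:
    """Run cluster-check with the given arguments and return its console output."""
    proc = subprocess.run([CLCK] + args,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.stdout


def daily_check(run: Callable[[List[str]], str] = run_clck) -> bool:
    """Run the quick reverse-mode check; return True if a regular run was triggered."""
    quick = run(["--include_only", "hpcc", "--reverse"])
    if "Check has Succeeded." in quick:
        return False
    # Assumption: invoking the tool without arguments performs a regular execution.
    run([])
    return True
```

Passing the runner as a parameter keeps the decision logic testable without a cluster, for example `daily_check(lambda args: "Check has Succeeded.")`.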

Future Applications: Scaling


Scaling to hundreds or thousands of nodes is one of the tool's future challenges. At that scale, executions could take several hours per day, making the tool impractical in heavy-duty environments that require high availability. This execution mode could be one way to tackle the problem.
