Imagine a cluster that runs Intel® Cluster Checker on a fixed schedule and reports failures automatically. With a few extra tools, some of which may already exist on your system, one can have a self-checking cluster in just a few steps. The key to automating Intel Cluster Checker is a script that ties these tools together.
Within the script, one could update a file on the system that Ganglia (or Nagios) checks for monitoring. The script could also be set up to email a local tracker upon completion, so that all pertinent log information reaches the correct people. Once the script is set up, it can be run through a batch system, such as Slurm (or PBS), and scheduled on a daily, weekly, or monthly basis using cron.
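The reporting ideas above (a status file for the monitoring system plus an email with the log) could be sketched roughly as follows. This is only an illustration: the function name `run_check`, the variables `STATUS_FILE`, `LOG_FILE`, and `MAIL_TO`, and the default paths are all assumptions, not part of Intel Cluster Checker, and the status line format would have to match whatever the site's Ganglia or Nagios plugin expects.

```shell
#!/bin/bash
# Hypothetical helper: run a check command, record the result where a
# monitoring agent can read it, and optionally mail the log.
# All names and paths below are illustrative assumptions.

STATUS_FILE="${STATUS_FILE:-/var/tmp/clck.status}"
LOG_FILE="${LOG_FILE:-/var/tmp/clck.log}"
MAIL_TO="${MAIL_TO:-}"   # leave empty to skip the email step

run_check() {
    local rc=0 state
    # Run the supplied command, capturing stdout and stderr in the log.
    "$@" >"$LOG_FILE" 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then state="OK"; else state="FAIL"; fi
    # Write a one-line status that a monitoring plugin could parse.
    printf '%s %s rc=%d\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$state" "$rc" >"$STATUS_FILE"
    # Mail the log only if a recipient is set and a mail client is installed.
    if [ -n "$MAIL_TO" ] && command -v mail >/dev/null 2>&1; then
        mail -s "Cluster check: $state" "$MAIL_TO" <"$LOG_FILE"
    fi
    return "$rc"
}
```

Inside the batch script, one would call `run_check` followed by the site's Intel Cluster Checker command line. The function returns the command's exit code, so the batch system also sees a failed job when the check fails.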
Slurm Script and Setup
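Batch script (/usr/local/clck.sh):
#!/bin/bash
#SBATCH -F /etc/intel/clck/nodelist # Use the Cluster Checker nodelist
#SBATCH -t 10 # Max time in minutes; increase for larger clusters
#SBATCH -p clck # Dedicated Cluster Checker partition
#SBATCH --error=auto-clck.err --output=auto-clck.out
# The Intel Cluster Checker invocation and any reporting steps go here.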
slurm.conf entries (nodes and partitions):
# COMPUTE NODES
NodeName=headnode Procs=24 State=UNKNOWN
NodeName=compute-0-[0-15] Procs=24 State=UNKNOWN
PartitionName=batch Nodes=compute-0-[0-15] Default=YES MaxTime=INFINITE State=UP
PartitionName=clck Nodes=headnode,compute-0-[0-15] Priority=65535 Hidden=Yes Default=NO AllowGroups=root MaxTime=30 State=UP
crontab entry:
# Run every Sunday at 01:00
0 1 * * 7 /usr/bin/sbatch /usr/local/clck.sh
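When the cron entry is installed from a provisioning script rather than by hand, it helps to make that step idempotent, so that running it again does not schedule the job twice. A minimal sketch, assuming the entry shown above; the function name `add_cron_entry` and the variable `CLCK_CRON_ENTRY` are hypothetical:

```shell
#!/bin/bash
# Entry taken from the example above; adjust the schedule and path per site.
CLCK_CRON_ENTRY='0 1 * * 7 /usr/bin/sbatch /usr/local/clck.sh'

# Read a crontab on stdin and write it to stdout, appending the given
# entry only if an identical line is not already present.
add_cron_entry() {
    local entry="$1" current
    current="$(cat)"
    if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        # Entry already scheduled: pass the crontab through unchanged.
        printf '%s\n' "$current"
    elif [ -z "$current" ]; then
        # Empty crontab: the entry becomes the only line.
        printf '%s\n' "$entry"
    else
        printf '%s\n%s\n' "$current" "$entry"
    fi
}
```

A typical use would be `crontab -l 2>/dev/null | add_cron_entry "$CLCK_CRON_ENTRY" | crontab -`, run as the user that owns the schedule. Because the function only filters text, it can be tried on a saved copy of the crontab before touching the live one.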
By taking the examples above and adding site-specific details, Intel Cluster Checker execution can be customized to detect and report a cluster failure without user intervention. One can run jobs with confidence, knowing that the cluster is checked regularly and that the right people are informed when something goes wrong.
See the Intel Cluster Checker product documentation for more details.
To download the latest release, log into the Intel® Registration Center and click on the Intel® Cluster Checker product.