Imagine a cluster that runs Intel® Cluster Checker on a fixed schedule and reports failures automatically. With a few extra tools, some of which may already be installed on your system, you can have a fully self-checking cluster in just a few steps. The key to an automatic Cluster Checker is a script that ties these tools together.
Within the script, you can update a file on the system that Nagios checks for monitoring. The script can also be set up to email a local tracker upon completion, so that all pertinent log information reaches the correct people. Once the script is set up, it can be submitted to a batch system, such as SLURM (Simple Linux Utility for Resource Management), and scheduled daily, weekly, or monthly using cron.
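As a rough illustration of the status-file idea, the sketch below wraps an arbitrary check command, saves its output to a log, and writes a one-line status file that a monitoring plugin could read. The function names, the status-file format, and the file locations are all hypothetical; the only outside convention relied on is the Nagios plugin exit codes (0 for OK, 2 for CRITICAL, 3 for UNKNOWN).

```shell
#!/bin/bash
# Sketch only: run_and_record, nagios_check, and the status-file format
# are hypothetical names chosen for illustration.

# Run the command given after the first two arguments, capture its output
# in a log file, and record the result in a one-line status file.
run_and_record() {
    local status_file="$1" log_file="$2"
    shift 2
    local rc=0
    "$@" >"$log_file" 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        printf 'OK %s\n' "$(date +%s)" >"$status_file"
    else
        printf 'FAIL %s rc=%s\n' "$(date +%s)" "$rc" >"$status_file"
    fi
    return "$rc"
}

# Translate the status file into Nagios plugin exit codes:
# 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
nagios_check() {
    local status_file="$1" state=""
    if [ ! -r "$status_file" ]; then
        echo "UNKNOWN - no status file at $status_file"
        return 3
    fi
    read -r state _ <"$status_file" || true
    case "$state" in
        OK)   echo "OK - last cluster check passed"; return 0 ;;
        FAIL) echo "CRITICAL - last cluster check failed"; return 2 ;;
        *)    echo "UNKNOWN - unrecognized status '$state'"; return 3 ;;
    esac
}
```

The batch script would call run_and_record with the site's Cluster Checker command line as the trailing arguments, and the monitoring side would call nagios_check on the same status file. The captured log could then be sent to the tracker with whatever mail command the site already uses.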
Script setup for SLURM:
#!/bin/bash
#SBATCH -J ClusterChecker
#SBATCH -F /etc/intel/clck/nodelist #Use Cluster Checker Nodelist
#SBATCH -t 10 #Max time in minutes, adjust for larger clusters
#SBATCH -p clck
#SBATCH --error=auto-clck.err --output=auto-clck.out
Cron setup:
0 1 * * 7 /usr/bin/sbatch /usr/local/clck.sh #Run CLCK every Sunday
SLURM setup in slurm.conf:
# COMPUTE NODES
NodeName=headnode Procs=24 State=UNKNOWN
NodeName=compute-0-[0-15] Procs=24 State=UNKNOWN
PartitionName=batch Nodes=compute-0-[0-15] Default=YES MaxTime=INFINITE State=UP
PartitionName=clck Nodes=headnode,compute-0-[0-15] Priority=65535 Hidden=YES Default=NO AllowGroups=root MaxTime=30 State=UP
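When the cron entry above is installed by a setup script rather than by hand, it helps to avoid adding it twice. The following is a minimal sketch of one way to do that; add_cron_entry and CLCK_ENTRY are hypothetical names. The function reads an existing crontab on standard input and prints it with the entry appended only if an identical line is not already there.

```shell
#!/bin/bash
# Hypothetical helper: read an existing crontab on stdin and print it with
# the given entry appended, unless an identical line is already present.
add_cron_entry() {
    local entry="$1" existing
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -qxF -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Entry from the example above: 01:00 every Sunday (day-of-week 7).
CLCK_ENTRY='0 1 * * 7 /usr/bin/sbatch /usr/local/clck.sh'
```

A typical use, assuming a cron implementation whose crontab command accepts "-" to read from standard input, would be: crontab -l 2>/dev/null | add_cron_entry "$CLCK_ENTRY" | crontab -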
By taking the examples above and adding some site-specific customization, an automatic Cluster Checker setup can detect and report a cluster failure without user intervention. Jobs can then be run with confidence, knowing that the cluster is checked regularly and that the right people are informed when something goes wrong.
The Intel® Cluster Ready team has begun to implement FDR InfiniBand solutions from Mellanox in our lab, along with recommendations for our Intel® Xeon® Processor E5-2600 Family Cluster Solutions Guide. The features and performance offered by Mellanox's latest evolution of this technology are impressive. You may already have experienced FDR performance: more than 12 GB/s of bandwidth and less than 0.7 µs of latency, with PCI Express 3.0 support. Here are some additional implementation tips that you might not know.
First, to get everything up and running at FDR speeds, you need to upgrade your existing cables to FDR-capable ones. Existing cables will only reach QDR speeds. In addition, Intel® Server System H2000 and Intel® Server System R2000 have FDR options.
Be sure to update the firmware to the latest version. OFED version 1.5.4 or later is required for proper support. Older OFED versions may work, but they will not correctly recognize the card. They may also show performance issues and may limit the link to QDR speeds.
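Because an old cable or an older OFED stack can silently leave a link at QDR speed, it can be worth checking the negotiated rate after an upgrade. The sketch below is one way to do that, assuming the input looks like typical ibstat output, where each port reports a line of the form "Rate: 56" for a 4x FDR link or "Rate: 40" for 4x QDR. The function name check_ib_rate is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read ibstat-style text on stdin and report every port
# whose rate is below the expected value (default 56 Gb/s for 4x FDR).
# Exit status: 0 = all ports at or above the expected rate,
#              1 = at least one slow port, 2 = no "Rate:" lines found.
check_ib_rate() {
    local expected="${1:-56}"
    awk -v want="$expected" '
        $1 == "Rate:" {
            seen++
            if ($2 + 0 < want + 0) {
                printf "slow port: %s Gb/s (expected %s)\n", $2, want
                bad++
            }
        }
        END {
            if (seen == 0) { print "no Rate lines found"; exit 2 }
            exit (bad > 0 ? 1 : 0)
        }'
}
```

On a node with the InfiniBand diagnostic tools installed, this could be used as: ibstat | check_ib_rate 56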
Mellanox's FDR InfiniBand has been certified as Intel® Cluster Ready in several vendor reference design implementations. FDR InfiniBand is also fully supported by Intel® Cluster Checker and the entire Intel® Cluster Toolkit.
FDR, 56Gb/s, Passive Copper Cables Data Sheet
Look For Intel® Cluster Ready
When you see the Intel Cluster Ready name, you can be assured the cluster solution complies with the Intel Cluster Ready specification and has passed the tests of the Intel® Cluster Checker.
For more information read the Intel® Cluster Ready Usage Guidelines.
Look for Intel Cluster Ready solution vendors.