Clusters are complex systems, and it can be difficult to identify what went wrong when a problem occurs. Intel® Cluster Checker aims to lower this barrier and make debugging easier. It collects data from the cluster, analyzes that data, and produces a clear list of the issues it finds. Using Intel® Cluster Checker, you can resolve issues quickly and move on to actually using your cluster.

Intel® Cluster Checker verifies the configuration and performance of Linux*-based clusters by analyzing cluster uniformity, performance, and functionality. It can infer broader problems from the issues it finds and suggest actionable remedies for them. It can also verify compliance with Intel® specifications. This tool is intended for application developers, cluster architects, system administrators, and any other cluster user who wants to identify issues with a cluster easily.

This guide provides step-by-step instructions for using the tool.

Key Concepts

Intel® Cluster Checker identifies issues on clusters and can give recommendations for fixing them. Diagnosing a cluster happens in two phases: the collect phase and the analyze phase. The collect phase gathers the required data from the cluster, and the analyze phase examines the collected data to produce observations and diagnoses, the two key concepts of the tool.

Observations are objective indicators based on data from the cluster. For example, an observation is created when the performance of a node is lower than expected or when a configuration setting is incorrect.
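The two-phase flow can be pictured with a small Python sketch. This is only an illustration of the collect and analyze idea, not the tool's implementation: the function names, the node names, the benchmark scores, and the expected threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    node: str
    message: str


def collect(nodes):
    """Collect phase: gather raw data for each node.

    A real collector would query the nodes; this sketch returns
    hard-coded, hypothetical benchmark scores instead.
    """
    sample_scores = {"node01": 95.0, "node02": 61.0}
    return {node: sample_scores[node] for node in nodes}


def analyze(data, expected=90.0):
    """Analyze phase: turn the collected data into observations."""
    return [
        Observation(node, f"performance {score} is below expected {expected}")
        for node, score in sorted(data.items())
        if score < expected
    ]


observations = analyze(collect(["node01", "node02"]))
for obs in observations:
    print(f"{obs.node}: {obs.message}")
```

Keeping the phases separate means that data collected once can be analyzed again later, for example with a different expected value, without touching the cluster.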

A diagnosis combines one or more observations to identify the root cause of an issue. For example, a diagnosis is created when high MPI latency measurements are associated with Ethernet* configuration settings that enable interrupt coalescing. This association results in a diagnosis indicating that the enabled interrupt coalescing is the cause of the high MPI latency.
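As a rough sketch of how a diagnosis might combine observations, the Python below groups observations by node and reports a cause only when both related observations are present on the same node. The observation labels, the rule, and the data structures are assumptions made for illustration; they do not reflect the rules shipped with Intel® Cluster Checker.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    node: str
    kind: str  # hypothetical label, not a real tool identifier


class Diagnosis(NamedTuple):
    node: str
    cause: str


# Hypothetical rule: both kinds on the same node point to one root cause.
REQUIRED_KINDS = frozenset({"high_mpi_latency", "interrupt_coalescing_enabled"})


def diagnose(observations):
    """Group observations by node and apply the rule to each node."""
    kinds_by_node = {}
    for obs in observations:
        kinds_by_node.setdefault(obs.node, set()).add(obs.kind)
    return [
        Diagnosis(node, "interrupt coalescing is causing high MPI latency")
        for node, kinds in sorted(kinds_by_node.items())
        if REQUIRED_KINDS <= kinds
    ]


observations = [
    Observation("node01", "high_mpi_latency"),
    Observation("node01", "interrupt_coalescing_enabled"),
    Observation("node02", "high_mpi_latency"),
]
diagnoses = diagnose(observations)
for diagnosis in diagnoses:
    print(f"{diagnosis.node}: {diagnosis.cause}")
```

In this sketch, node02 shows high latency but receives no diagnosis, because the second observation that would explain the latency is missing.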

Each observation or diagnosis has a corresponding severity level. The possible severity levels are, in increasing order of severity: informational, warning, and critical. When the analysis runs, a brief summary of the results appears on the screen, and a log file contains the details of the analysis.
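The ordering of severity levels can be modeled as an ordered enumeration, as in the Python sketch below. The numeric values, the sample findings, and the summary format are assumptions for illustration; the actual on-screen summary and log file format of the tool are not shown here.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels in increasing order; the numbers are arbitrary."""
    INFORMATIONAL = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical findings: (node, severity, message).
findings = [
    ("node01", Severity.INFORMATIONAL, "kernel version recorded"),
    ("node02", Severity.WARNING, "performance below expected"),
    ("node03", Severity.CRITICAL, "node is unreachable"),
    ("node04", Severity.WARNING, "configuration setting is incorrect"),
]

# Count findings per level and find the most severe level present.
counts = Counter(level for _, level, _ in findings)
worst = max(level for _, level, _ in findings)

# Brief summary, most severe level first.
for level in sorted(Severity, reverse=True):
    print(f"{level.name.lower()}: {counts[level]}")
print(f"most severe: {worst.name.lower()}")
```

Because the levels are comparable, filtering out everything below a chosen level reduces to a simple comparison such as `level >= Severity.WARNING`.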

Intel® Cluster Checker uses XML files called framework definitions to specify what data to collect and what kinds of analysis to perform. By default, Intel® Cluster Checker uses the health framework definition, which provides a complete analysis of the cluster. For more information about framework definitions, see the section Framework Definitions - Intel® Cluster Checker Plugins. The Appendix contains additional useful information, including terminology, the included framework definitions, and the included rules.
