Intel® Cluster Checker executes in two phases. In the data collection phase, Intel® Cluster Checker collects data from the cluster for use in analysis. In the analysis phase, Intel® Cluster Checker analyzes the data in the database and produces the results of analysis. It is possible to invoke these phases together or separately and to customize their scope. By default, Intel® Cluster Checker verifies the overall health of the cluster using the health framework definition.
Invoking Intel® Cluster Checker
Intel® Cluster Checker is a Linux* command-line tool and can be executed using three different commands. The clck command executes data collection followed immediately by analysis and displays the results of analysis. A typical invocation of this command is:
clck -f nodefile
This command will run data collection and analysis using the specified nodefile to determine which nodes to examine and their roles.
It is also possible to run data collection and analysis separately. The clck-collect command executes only data collection without analyzing the data. Intel® Cluster Checker stores collected data in the shared directory (typically the home directory) in the database file .clck/201n/clck.db. A typical invocation of the data collection command is:
clck-collect -f nodefile
The clck-analyze command executes analysis using the most recent data available in the database. A typical invocation of the analysis command is:
clck-analyze -f nodefile
With these three command-line tools it is possible to execute data collection and analysis together (using the clck command) or separately (using the clck-collect and clck-analyze commands).
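As a minimal sketch of the separate workflow, the two commands above can be wrapped so that analysis runs only when data collection succeeds. The function name collect_then_analyze is hypothetical and not part of Intel® Cluster Checker; only the clck-collect and clck-analyze invocations come from this guide.

```shell
# Hypothetical helper: run data collection, then analysis, on the same nodefile.
# Analysis is skipped if data collection exits with a non-zero status.
collect_then_analyze() {
    nodefile="$1"
    clck-collect -f "$nodefile" || return 1
    clck-analyze -f "$nodefile"
}
```

With the function defined, calling collect_then_analyze nodefile would run both phases in order. Keeping the phases separate also makes it possible to repeat only the analysis later against the data that was already collected.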
Additionally, Intel® Cluster Checker includes a database retrieval tool that displays data from the database in a readable format. To display the available data, use the command:
Use the --help option for more information about this command.
Note that Intel® Cluster Checker requires a shared directory to run data collection. The shared directory defaults to $HOME, but in some cases (such as running as root) $HOME is not shared across nodes. To use a different shared directory, set the environment variable CLCK_SHARED_TEMP_DIR to the desired path.
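The following sketch shows one way to set CLCK_SHARED_TEMP_DIR before running data collection. The function name select_shared_dir and the default path /shared/clck are hypothetical. The check only confirms that the directory exists and is writable on the local node; it cannot confirm that the directory is actually shared across the cluster.

```shell
# Hypothetical helper: export CLCK_SHARED_TEMP_DIR if the candidate directory
# exists and is writable on this node.
# Assumption: /shared/clck is a placeholder for a path visible on every node.
select_shared_dir() {
    candidate="${1:-/shared/clck}"
    if [ -d "$candidate" ] && [ -w "$candidate" ]; then
        CLCK_SHARED_TEMP_DIR="$candidate"
        export CLCK_SHARED_TEMP_DIR
        echo "CLCK_SHARED_TEMP_DIR=$CLCK_SHARED_TEMP_DIR"
    else
        echo "not a writable directory: $candidate" >&2
        return 1
    fi
}
```

Call the function in the same shell that later invokes clck or clck-collect so that the exported variable is visible to the tool.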
In some cases, a message may appear indicating that root access is required to obtain more information. Except in these cases, avoid running Intel® Cluster Checker with root access.
Using a Nodefile
The three commands are typically given a nodefile with the -f option, as shown above. A custom nodefile specifies which nodes to include and, if applicable, their roles; Intel® Cluster Checker contains a set of pre-defined roles. Each hostname appears on its own line. A role can be specified after the hostname using an annotation such as # role: compute. If no role is specified for a node, that node is considered a compute node. The following example includes four nodes: one head node and three compute nodes.
node1 # role: head role: compute
node2 # role: compute
node3 # role: compute
node4
A cluster with a single node would only include one hostname in the nodefile.
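To make the nodefile format concrete, the sketch below writes a four-node nodefile to a temporary file and then lists each hostname with its roles, applying the rule that a node without a role annotation is treated as a compute node. The list_roles function is hypothetical and only mirrors the rule described above; it is not how Intel® Cluster Checker itself parses nodefiles.

```shell
# Write a four-node example nodefile to a temporary file.
nodefile="$(mktemp)"
cat > "$nodefile" <<'EOF'
node1 # role: head role: compute
node2 # role: compute
node3 # role: compute
node4
EOF

# Hypothetical helper: print "<hostname> <roles>" for every line of a nodefile.
# A host with no role annotation is reported as compute.
list_roles() {
    awk '
        NF == 0 { next }    # skip blank lines
        {
            host = $1
            roles = ""
            # A role name is the field that follows each "role:" token.
            for (i = 2; i < NF; i++) {
                if ($i ~ /role:$/) {
                    roles = roles (roles == "" ? "" : ",") $(i + 1)
                }
            }
            if (roles == "") roles = "compute"
            print host, roles
        }
    ' "$1"
}

list_roles "$nodefile"
```

For the nodefile above, this prints node1 with the roles head,compute and the other three nodes with the role compute.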
Intel® Cluster Checker also provides automatic node detection for data collection using Slurm. To use this functionality, Intel® MPI Library or password-less SSH must be configured correctly. When collection runs within a Slurm job, Intel® Cluster Checker automatically gathers the allocated hostnames using a Slurm query. While Slurm knows about node types, Intel® Cluster Checker only reads the hostnames. The user login environment must be configured, as the environment will not be propagated over SSH or MPI. To use this functionality, invoke either the clck command or the clck-collect command without the -f command-line option. For example, the command:
clck
will automatically detect allocated hostnames, collect data on those nodes, and analyze the collected data. Calling clck-analyze without a nodefile will cause Intel® Cluster Checker to analyze recent data from every node available in the database.
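A job script may need to work both inside and outside a Slurm allocation. The sketch below omits the -f option when a Slurm job is detected and falls back to a nodefile otherwise. It assumes that a non-empty SLURM_JOB_ID environment variable, which Slurm sets for running jobs, is a sufficient signal; the function name run_clck is hypothetical.

```shell
# Hypothetical helper: choose between automatic node detection and a nodefile.
# Assumption: a non-empty SLURM_JOB_ID means we are inside a Slurm job.
run_clck() {
    nodefile="${1:-nodefile}"
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside a Slurm job: no -f, so the allocated hostnames are detected.
        clck
    else
        # Outside Slurm: name the nodes explicitly.
        clck -f "$nodefile"
    fi
}
```

Calling run_clck nodefile from a batch script then behaves the same way whether or not the script was launched through Slurm.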
For more information about writing nodefiles, see the Selecting Nodes section.
Framework definitions are XML files that define the behavior of Intel® Cluster Checker. They can specify what data is collected, how data is analyzed, and how that information is displayed. By default, Intel® Cluster Checker runs the health framework definition, which provides an overall examination of the health of the cluster. Intel® Cluster Checker provides a wide variety of framework definitions to customize your results, and all framework definitions are located in the Intel® Cluster Checker install directory in the path etc/fwd.
For example, to verify Intel® Omni-Path Architecture (Intel® OPA) Interface functionality, one could run the Intel® OPA framework definition (located at etc/fwd/opa.xml) using the following command:
clck -f nodefile -F opa
A full list of framework definitions and their descriptions is located in the Appendix. Additionally, the command
clck -X list
provides a full list of available framework definitions. To see a description of a specific framework definition (for example, opa.xml), run the following command:
clck -X opa
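To run more than one framework definition against the same nodefile, one option is a simple loop over the -F invocation shown above. This is a sketch only: the function name run_framework_definitions is hypothetical, and each iteration is an independent clck run.

```shell
# Hypothetical helper: run clck once per framework definition name.
# Usage: run_framework_definitions <nodefile> <fwd> [<fwd> ...]
run_framework_definitions() {
    nodefile="$1"
    shift
    for fwd in "$@"; do
        echo "== $fwd =="
        # Continue with the next framework definition even if this run fails.
        clck -f "$nodefile" -F "$fwd" || echo "clck reported a failure for $fwd" >&2
    done
}
```

For example, run_framework_definitions nodefile health opa would run the default health check followed by the Intel® OPA check.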
The following framework definitions are recommended for new users:
- basic_internode_connectivity validates inter-node accessibility by confirming the consistency of node IP addresses.
- dgemm_cpu_performance runs a double-precision matrix multiplication routine to verify CPU performance. It reports nodes whose double-precision performance is substandard relative to a hardware-based threshold, as well as performance outliers outside the range defined by the median absolute deviation.
- environment_variables_uniformity verifies the uniformity of all environment variables.
- health provides a complete analysis of the cluster, excluding analysis related to specific specs. Health is the default framework definition.
- imb_pingpong_fabric_performance confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. It also reports network bandwidth and latency outliers relative to the other measured values in the same grouping, and reports whether latency or network bandwidth falls below a certain threshold.
- memory_uniformity determines if the amount of physical memory is uniform across the cluster.
- rpm_uniformity verifies the uniformity of the RPMs installed across the cluster and reports absent and superfluous RPMs.
- stream_memory_bandwidth_performance identifies nodes with memory bandwidth outliers (as reported by the STREAM benchmark) outside the range defined by the median absolute deviation.
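Several of the framework definitions above identify outliers using the median absolute deviation (MAD). As a reminder of the general statistic, and not a statement of the exact rule Intel® Cluster Checker applies, the MAD of measurements $x_1, \dots, x_n$ can be written as follows. The multiplier $k$ is a hypothetical cutoff introduced here for illustration.

```latex
\tilde{x} = \operatorname{median}(x_1, \dots, x_n), \qquad
\mathrm{MAD} = \operatorname{median}_{i} \, \lvert x_i - \tilde{x} \rvert, \qquad
x_i \text{ is flagged as an outlier if } \lvert x_i - \tilde{x} \rvert > k \cdot \mathrm{MAD}.
```

For example, for the five values 10, 10, 11, 12, and 30, the median is 11, the absolute deviations are 1, 1, 0, 1, and 19, and the MAD is 1. With an assumed $k = 3$, only the value 30 would fall outside the range.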
It is possible to create custom framework definitions to further configure desired results. For more information about the contents of framework definitions, see the Framework Definitions chapter.