User Guide

  • 2021.1
  • 01/08/2021

Getting Started

Prerequisites
  • Intel® Cluster Checker must be accessible by the same path on all nodes.
  • A readable, writable shared directory must be available from the same path on all nodes for temporary file creation.
    • $HOME is used as the shared directory by default, but you can change this by setting the environment variable $CLCK_SHARED_TEMP_DIR to the shared directory.
    • For admin privileged users, such as root, the environment variable $CLCK_SHARED_TEMP_DIR must be explicitly set.
  • Determine whether passwordless ssh access to all nodes is set up (e.g., test whether the command ssh <nodename> hostname responds with a valid hostname without asking for ‘Password:’).
    • If passwordless ssh to all nodes is available, go ahead with Environment Setup and Running using Slurm below. By default, Intel® Cluster Checker is configured to use passwordless ssh (through the pdsh command) to launch remotely on nodes of the cluster. Note: you may need to enable passwordless access in your local ssh configuration.
    • If passwordless ssh is not available, use these steps to configure Intel® Cluster Checker to instead use the mpirun command from the Intel® MPI Library to launch remotely on nodes (see the sketch after this list). This requires the Intel MPI Library to be set up; then either:
      • locate the <installdir>/clck/<version>/etc/clck.xml file, copy it locally, and uncomment the <extension>mpi.so</extension> line by removing the commenting statements <!-- before it and --> after it. Then add the following option when running the clck command: -c <path/to/local/copy/of/clck.xml>
      • or locate and edit the <installdir>/clck/<version>/etc/clck.xml file in place to uncomment the line containing <extension>mpi.so</extension>, which changes the default to use mpirun instead of pdsh/ssh for remote launch.
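For example, here is a minimal sketch of the local-copy approach, assuming the default oneAPI install path; adjust <version> and the nodefile to your setup:
cp /opt/intel/oneapi/clck/<version>/etc/clck.xml ~/clck.xml
# in ~/clck.xml, change:  <!-- <extension>mpi.so</extension> -->
# to:                     <extension>mpi.so</extension>
clck -c ~/clck.xml -f <nodefile>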
Environment Setup
Before you start using any of the Intel® Cluster Checker functionality, make sure to establish the proper environment. If you are new to Linux, this means making sure the command line is set up to find the applications you just installed. Fortunately, Intel provides helper scripts to accomplish this.
  • To use the scripts, follow these steps to set up the shell environment. (By default these scripts are found with the packages they are installed with; the default install location is /opt/intel/<package-name>/bin/ or <installdir>/<package-name>/bin/.)
    • If using the Intel® oneAPI HPC Toolkit, source setvars.sh; by default this is
      source /opt/intel/oneapi/setvars.sh
      which will analyze all software installed from oneAPI and add it to your path. If you would rather individually choose specific software packages, you can still do so, e.g.:
      source /opt/intel/oneapi/clck/<version>/bin/clckvars.sh
    • or, if you are using individual package versions:
      source mpivars.[sh | csh] from the Intel® MPI Library
      source mklvars.[sh | csh] from the Intel® Math Kernel Library (Intel® MKL)
      source compilervars.[sh | csh] from Intel® Parallel Studio XE Cluster Edition
      source <installdir>/clck/<version>/bin/clckvars.[sh | csh] from Intel® Cluster Checker
    • or source psxevars.[sh | csh] from Intel® Parallel Studio XE Cluster Edition, which includes all of the above components.
  • An alternative to these scripts is to use modulefiles to set up your runtime environment.
    • Versioned modulefiles for all of the above components can be installed and loaded with Intel® oneAPI.
    • Alternatively, the Intel® Cluster Checker modulefile is available using the module commands
      module use <installdir>/clck/<version>/modulefiles
      module load clck
If the syscfg system configuration utility or the ‘OSU micro-benchmarks’ are installed, make sure these are also added to the environment path variable $PATH.
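For example, a minimal sketch; the install locations below are hypothetical, so adjust them to wherever these tools are installed on your system:
export PATH=/opt/syscfg:/opt/osu-micro-benchmarks/bin:$PATH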
Running using an Individual Nodefile
The command line for Intel® Cluster Checker is clck. If you type clck at the Linux command line, hit Enter, and it returns ‘command not found’, then the environment setup is not correct.
A nodefile specifies which nodes to include and, if applicable, their roles. Intel® Cluster Checker contains a set of pre-defined roles. A separate hostname appears on each line. If no role is specified for a node, that node is considered a compute node. The following example includes four compute nodes.
node1
node2
node3
node4
A cluster with a single node would include only one hostname in the nodefile. localhost is not a recommended hostname; use the value returned by the hostname command on the server itself, and make sure that name is network resolvable.
You can then do your first run of Intel® Cluster Checker by running
clck -f <nodefile>
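Putting it together, a minimal sketch of a first run (the hostnames are illustrative; use the values returned by hostname on each node):
cat > nodefile <<EOF
node1
node2
node3
node4
EOF
clck -f nodefile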
Running using Slurm
Regardless of whether you are using a batch script (sbatch) or allocating nodes (salloc), Intel® Cluster Checker automatically uses the list of nodes allocated through Slurm, unless you override it with the individual nodefile option -f <nodefile>.
Do not use the srun command to start Intel® Cluster Checker. Only use the clck command (or clck-collect, clck-analyze, etc.), as the parallel job for remote data collection is already built in.
If running on the command line with a salloc Slurm resource allocation, remember to have set up the environment. You can then launch Intel® Cluster Checker by running the command:
clck
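For example, a minimal sketch of an interactive run (the node count is illustrative):
salloc -N 4
source /opt/intel/oneapi/setvars.sh
clck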
If running with sbatch, you should be able to run Intel® Cluster Checker by using a Slurm script that includes the environment setup above, through your choice of environment setup script(s) or module commands:
source /opt/intel/oneapi/setvars.sh
clck
or for specific components:
source mpivars.[sh | csh]
source mklvars.[sh | csh]
source compilervars.[sh | csh]
source clckvars.[sh | csh]
# alternatively use psxevars.[sh | csh] or setvars.sh (Intel oneAPI), or modulefiles to set up the environment
clck
You can then run
sbatch <script_name>
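For example, a minimal sketch of a complete batch script (the job name and node count are illustrative):
#!/bin/bash
#SBATCH --job-name=clck-health
#SBATCH --nodes=4
source /opt/intel/oneapi/setvars.sh
clck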
In both of the above cases, Intel® Cluster Checker will generate a summary output, an in-depth clck_results.log, and a separate clck_execution_warnings.log file.
User-Specific Workflows
Intel® Cluster Checker uses what we call a ‘Framework Definition’ to specify what data is collected, how data is analyzed, and how that information is displayed. By default, Intel® Cluster Checker runs the ‘health_base’ Framework Definition, which provides a quick overall examination of the health of the cluster. Intel® Cluster Checker provides a wide variety of Framework Definitions. We describe here the highest level Framework Definitions for particular types of users; however, you can get a full list of available Framework Definitions by running
clck -X list
You can get further details on a Framework Definition with the option -X and the name of the specific Framework Definition, e.g.
clck -X cpu_base
or
clck -X clock
or
clck -X health_base | more
The rest of this page includes some of the more commonly used Framework Definitions that can be helpful depending on your role. You can also find a full list of Framework Definitions in the
Reference
section.
Admin:
For the privileged user, there are four different common-use Framework Definitions for cluster analysis. When first running as an administrator, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. These are preliminary checks that would work for either user or administrator. For a more comprehensive, administrator-specific run, next run
clck <options> -F health_admin
If you want to extend to further in-depth checking of your cluster’s uniformity, you can also include the Framework Definitions ‘lshw_hardware_uniformity’, which will find discrepancies in hardware or firmware between nodes, and ‘kernel_parameter_uniformity’, which will analyze the uniformity of the kernel setup, by using
clck <options> -F health_extended_admin
If the optional ‘syscfg’ system configuration utility has been installed, you can check that the system is configured uniformly across nodes by running
clck <options> -F syscfg_settings_uniformity
You can run all of the above in a single run by specifying multiple framework definitions at once:
clck <options> -F health_extended_admin -F syscfg_settings_uniformity
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that some of the user-level Framework Definitions may not run well as root, since they include running an MPI parallel application.
Here is an overview of all the embedded tests the health_extended_admin framework definition contains. As you can see, health_extended_admin is a superset of health_admin, kernel_parameter_uniformity, and lshw_hardware_uniformity; these framework definitions may in turn have additional tests they perform:
health_extended_admin
|-- health_admin
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_shells
|   |-- cpu_admin
|   |-- dgemm_cpu_performance
|   |-- mpi_bios
|   |-- infiniband_admin
|   |-- kernel_version_uniformity
|   |-- local_disk_storage
|   |-- memory_uniformity_admin
|   |-- mpi_libfabric
|   |-- opa_admin
|   |-- perl_functionality
|   |-- privileged_user
|   |-- python_functionality
|   |-- rpm_uniformity
|   |-- services_status
|   `-- stream_memory_bandwidth_performance
|-- kernel_parameter_uniformity
`-- lshw_hardware_uniformity
Note:
Administrators and privileged users must be aware that the data they collect with privileges may contain information about the servers that should be protected, such as system MSR settings. It is highly recommended that the database a privileged user creates be protected, and that it not be shared with users who should not have access to that type of information.
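For example, a minimal sketch of restricting the database file to its owner; the filename clck.db is hypothetical, so substitute the database file your run actually produced:
chmod 600 clck.db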
User:
For the non-privileged cluster user, there are three common-use Framework Definitions for cluster analysis. When first running, run
clck <options> -F health_base
You can then look in the file clck_results.log to read the in-depth results of the analysis. In the event that you desire more extended checking, including several lightweight performance checks (IMB, SGEMM, STREAM), you can next run
clck <options> -F health_user
To add more extensive performance checking (DGEMM, HPL) to the above, you can next run
clck <options> -F health_extended_user
These commands will provide preliminary analysis on the screen, with more details available by default in the file clck_results.log. At this point you can explore other framework options to find what serves your needs best. Be aware that not all tools are user-accessible, so some checks may report missing data.
Here is an overview showing how the health_extended_user framework definition is a package containing many different sets of tests, including other framework definitions that contain even more checks and tests, such as health_user and health_base:
health_extended_user
|-- health_user
|   |-- health_base
|   |   |-- cpu_user
|   |   |-- environment_variables_uniformity
|   |   |-- ethernet
|   |   |-- infiniband_user
|   |   |-- network_time_uniformity
|   |   |-- node_process_status
|   |   `-- opa_user
|   |-- basic_internode_connectivity
|   |-- basic_shells
|   |-- file_system_uniformity
|   |-- imb_pingpong_fabric_performance
|   |-- kernel_version_uniformity
|   |-- memory_uniformity_user
|   |-- mpi_local_functionality
|   |-- mpi_multinode_functionality
|   |-- perl_functionality
|   |-- python_functionality
|   |-- sgemm_cpu_performance
|   `-- stream_memory_bandwidth_performance
|-- dgemm_cpu_performance
`-- hpl_cluster_performance
Intel® MPI Library Troubleshooting
Admin:
For the privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_admin
This Framework Definition helps debug BIOS, software, environment, and hardware issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.
User:
For the non-privileged user wanting to make sure their cluster is set up to work with the Intel® MPI Library, run
clck <options> -F mpi_prereq_user
This Framework Definition helps debug environment and software issues that could be causing sub-optimal performance or problems using the Intel® MPI Library.
