Appendix

Terminology

  • Analyzer Extension: An analyzer extension takes raw data and converts it into a form that knowledge base modules can use.
  • Configuration File: The configuration file is an XML file that provides additional configuration options for Intel® Cluster Checker.
  • Data Provider: A data provider defines what data to collect from the cluster.
  • Diagnosis: A diagnosis is a broader inference based on one or more observations. For example, non-uniform memory would lead to a broader non-uniform hardware diagnosis.
  • Framework Definition: A framework definition is an XML file that defines the scope of data collection and analysis.
  • Issue: An issue is an observation about the cluster. It may indicate a problem or provide additional information. An issue can be either an observation or a diagnosis.
  • Knowledge Base Module: A knowledge base module contains a group of rules.
  • Message Catalog: The message catalog contains messages for display. Each issue has a message ID that maps to a message in the message catalog.
  • Nodefile: A nodefile is a file containing a list of nodes and their roles. The nodefile tells Intel® Cluster Checker which nodes to examine.
  • Observation: An observation provides objective information about the cluster. It may indicate a problem or provide additional information. Observations may indicate a broader problem, in which case they would lead to a diagnosis. For example, a cluster with different amounts of memory per node would produce a "memory not uniform" observation.
  • Remedy: A potential actionable solution to the issue.
  • Rule: A rule takes data and, if the data meets certain conditions, triggers an observation or diagnosis. Rules are implemented in the CLIPS language.

 

Additional Configuration Options

Configuring the Database

You can specify a datastore configuration file in the main configuration file using the tags:

<datastore_extensions>
    <group path="datastore/intel64/">
        <entry config_file="default_sqlite.xml">libsqlite.so</entry>
    </group>
</datastore_extensions>

To use odbc instead of sqlite3, enter libodbc.so instead of libsqlite.so. Multiple entry tags will allow you to specify multiple databases through multiple datastore configuration files.

The datastore configuration file, by default, is located at /opt/intel/clck/201n/etc/datastore/default_sqlite.xml and takes the following format:

<configuration>
    <instance_name>clck_default</instance_name>
    <source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
    <type>sqlite3</type>
    <source_types>data</source_types>
</configuration>

The instance_name tag defines a database source name. This value must be unique.

The source_parameters tag determines whether or not to open the database in read-only mode and indicates which database to use.

The type tag specifies what type of database to use. Currently, the only accepted value is sqlite3.

The source_types tag specifies what source type to use. Currently, the only accepted value is data.
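
As an illustration only, the following Python sketch reads a datastore configuration of the format shown above and splits the source_parameters value into its parts. It assumes that source_parameters is a list of key=value pairs separated by "|", which is inferred from the sample rather than stated by the product documentation. The helper name parse_datastore_config is hypothetical and is not part of Intel® Cluster Checker.

```python
import xml.etree.ElementTree as ET

# Sample copied from the datastore configuration file format shown above.
SAMPLE = """<configuration>
    <instance_name>clck_default</instance_name>
    <source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
    <type>sqlite3</type>
    <source_types>data</source_types>
</configuration>"""


def parse_datastore_config(xml_text):
    """Return the documented tags as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    config = {child.tag: (child.text or "").strip() for child in root}
    # Assumption: source_parameters is a '|'-separated list of key=value pairs.
    params = {}
    for pair in config.get("source_parameters", "").split("|"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            params[key.strip()] = value.strip()
    config["source_parameters"] = params
    return config


if __name__ == "__main__":
    print(parse_datastore_config(SAMPLE))
```

The sketch leaves $HOME unexpanded; it only shows how the four tags relate to one another.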

 

Database Schema

The database consists of a single SQL view named clck_1. The Intel® Cluster Checker database is a standard SQLite* database and any SQLite* compatible tool may be used to browse the database contents. In addition, the clckdb utility is provided with Intel® Cluster Checker (see clckdb -h for more information).

rowid (INTEGER)

  • Unique row ID

 

Provider (TEXT)

  • Data provider name

 

Hostname (TEXT)

  • Hostname of the node where the data provider ran

 

num_nodes (INTEGER)

  • Number of nodes used by the data provider

 

node_names (TEXT)

  • Comma-separated list of nodes used by the data provider (empty if num_nodes = 1)

 

Exit_status (INTEGER)

  • Exit status of the data provider

 

Timestamp (INTEGER)

  • Timestamp when the data provider started (seconds since the UNIX epoch)

 

Duration (REAL)

  • Data provider walltime (seconds)

 

Encoding (INTEGER)

  • Encoding format of the STDOUT and STDERR columns (0 = no encoding, 1 = base64 encoding)

 

STDOUT (TEXT)

  • Data provider standard output

 

STDERR (TEXT)

  • Data provider standard error

 

OptionID (TEXT)

  • The ID of the option set with which the provider was run

 

Version (INTEGER)

  • Output format version of the data provider

 

Username (TEXT)

  • Username of the user who ran the data provider

 

Unique_timestamp (INTEGER)

  • Unique timestamp when the data was collected (seconds since the UNIX epoch)
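
Because the database is a standard SQLite* database, the columns above can be read with any SQLite* client. The following Python sketch is one possible way to fetch the standard output of a data provider and decode it according to the Encoding column. The function name read_provider_output is hypothetical, and the in-memory table in the demonstration is only a stand-in for the clck_1 view so that the example runs without a real database file.

```python
import base64
import sqlite3


def read_provider_output(conn, provider):
    """Yield (hostname, exit_status, stdout_text) for one data provider."""
    query = (
        "SELECT Hostname, Exit_status, Encoding, STDOUT "
        "FROM clck_1 WHERE Provider = ? ORDER BY Timestamp"
    )
    for hostname, exit_status, encoding, stdout in conn.execute(query, (provider,)):
        text = stdout or ""
        if encoding == 1:  # 1 = base64 encoding, 0 = no encoding
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        yield hostname, exit_status, text


if __name__ == "__main__":
    # Stand-in database: a table named clck_1 plays the role of the view.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clck_1 (Provider TEXT, Hostname TEXT, Exit_status INTEGER, "
        "Timestamp INTEGER, Encoding INTEGER, STDOUT TEXT)"
    )
    conn.execute(
        "INSERT INTO clck_1 VALUES ('uname', 'node01', 0, 1, 1, ?)",
        (base64.b64encode(b"Linux node01").decode("ascii"),),
    )
    for row in read_provider_output(conn, "uname"):
        print(row)
```

To try this against a real database, replace the in-memory connection with sqlite3.connect() on the database file named in the datastore configuration file.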

 

List of Analyzer Extensions

all_to_all

  • IP address consistency

 

cpu

  • CPU compliance and uniformity

 

datconf

  • InfiniBand* DAPL configuration

 

devices

  • Intel® Select Solutions for Simulation and Modeling devices compliance

 

dgemm

  • Floating point performance by double precision matrix multiplication

 

environment

  • Environment variables

 

ethernet

  • Ethernet driver uniformity and wellness

 

files

  • Configuration files

 

hardware

  • Hardware location

 

hpcg_cluster

  • High Performance Conjugate Gradients (HPCG) benchmark four node

 

hpcg_single

  • High Performance Conjugate Gradients (HPCG) benchmark single node

 

hpl

  • High Performance Linpack

 

imb_pingpong

  • MPI performance

 

infiniband

  • InfiniBand* uniformity and wellness

 

iozone

  • Disk I/O performance

 

kernel

  • Linux* kernel

 

kernel_param

  • Kernel parameter uniformity

 

libraries

  • Intel® Scalable System Framework runtime library compliance

 

lsb_tools

  • LSB tool compliance

 

lshw

  • Hardware uniformity

 

lustre

  • Lustre* storage cluster functionality

 

memory

  • Memory compliance

 

mount

  • Mount point compliance and uniformity

 

mpi_internode

  • Multi-node Intel® MPI Library functionality

 

mpi_local

  • Single-node Intel® MPI Library functionality

 

ntp

  • Clock synchronization

 

opa

  • Intel® Omni-Path Host Fabric Interface uniformity and wellness

 

perl

  • Perl* compliance, uniformity, and functionality

 

process

  • Process table

 

python

  • Python* compliance, uniformity, and functionality

 

rpm

  • RPM uniformity

 

rpm_baseline

  • RPM changes over time

 

sgemm

  • Floating point performance by single precision matrix multiplication

 

shells

  • Shell compliance

 

ssf_version

  • Intel® Scalable System Framework version compliance

 

storage

  • Disk capacity

 

stream

  • Memory bandwidth performance

 

tcl

  • Tcl compliance, uniformity, and functionality

 

Blacklists

Kernel Parameters Blacklist

The following is a comprehensive list of blacklisted kernel parameters. The uniformity of these kernel parameters is checked in the kernel_parameter_uniformity Framework Definition. This list is located in the kernel_param analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.

  • dev.cdrom.autoclose

  • dev.cdrom.autoeject

  • dev.cdrom.check_media

  • dev.cdrom.debug

  • dev.cdrom.info

  • dev.cdrom.lock

  • fs.binfmt_misc.jexec

  • fs.dentry-state

  • fs.epoll.max_user_watches

  • fs.file-max

  • fs.file-nr

  • fs.inode-nr

  • fs.inode-state

  • fs.nfs.

  • fs.quota.syncs

  • kernel.domainname

  • kernel.host-name

  • kernel.hostname

  • kernel.hung_task_warnings

  • kernel.ns_last_pid

  • kernel.perf_event_max_sample_rate

  • kernel.pty.nr

  • kernel.random.

  • kernel.sched_domain.

  • kernel.shmmax

  • kernel.threads-max

  • lnet.buffers

  • lnet.fefslog_daemon_pid

  • lnet.lnet_memused

  • lnet.memused

  • lnet.net_status

  • lnet.nis

  • lnet.peers

  • lnet.routes

  • lnet.stats

  • lustre.memused

  • net.bridge.bridge-n

  • net.core.netdev_rss_key

  • net.ipv4.conf.

  • net.ipv4.neigh.

  • net.ipv4.net-filter.

  • net.ipv4.netfilter.ip_conntrack_count

  • net.ipv4.rt_cache_rebuild_count

  • net.ipv4.tcp_mem

  • net.ipv4.udp_mem

  • net.ipv6

  • net.netfilter.nf_conntrack_count

  • sunrpc.transports

Lshw Blacklist

The following is a comprehensive list of items blacklisted by the lshw check through the regex function. This blacklist is located in the lshw analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.

 

  • regex(".*bank.*clock")

  • regex(".*bank.*product")

  • regex(".*bank.*vendor")

  • regex(".*cache.*instruction")

  • regex(".*cache.*unified")

  • regex(".*cdrom.*")

  • regex(".*generic.*")

  • regex(".*irq")

  • regex(".*isa.*")

  • regex(".*network.*size")

  • regex(".*physid")

  • regex(".*signature.*")

  • regex(".*sku.*")

  • regex(".*usb.*")

  • regex(".*volume.*")

  • regex("^pci.*businfo.*$")

  • regex("^pci.*cap_list.*$")

  • regex("^pci.*ioport.*$")

  • regex("^pci.*memory.*")

  • regex("^pci.*width.*$")

  • regex("^cpu:.*-size$")

  • regex("^cpu:.*-capacity$")

  • regex(".*scsi:*[0-9]*-driver")

  • regex(".*scsi:*[0-9]*-businfo")

  • regex(".*scsi:*[0-9]*-logicalname")

  • regex(".*scsi:*[0-9]*-scsi-host")
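
To see how patterns like these behave, the following Python sketch tests a few item names against a subset of the list above. It is an assumption that the patterns are applied with "search" semantics (a match anywhere in the name unless the pattern is anchored); the matching mode of the lshw analyzer extension is not documented here. The item names in the demonstration are hypothetical and only illustrate the pattern shapes.

```python
import re

# A few patterns copied from the blacklist above.
BLACKLIST = [
    r".*cdrom.*",
    r".*irq",
    r"^pci.*businfo.*$",
    r"^cpu:.*-size$",
]
COMPILED = [re.compile(pattern) for pattern in BLACKLIST]


def is_blacklisted(name):
    """Return True if any pattern matches the name (assumed search semantics)."""
    return any(pattern.search(name) for pattern in COMPILED)


if __name__ == "__main__":
    # Hypothetical item names, not taken from real lshw output.
    for name in ("cpu:0-size", "pci:1-businfo", "memory-size"):
        print(name, is_blacklisted(name))
```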

Included Framework Definitions

All included Framework Definitions are located at /opt/intel/clck/201n/etc/fwd.

basic_internode_connectivity.xml

Validates internode accessibility by confirming the consistency of node IP addresses. Includes the providers:

  • all_to_all
  • uname

Includes the analyzer extension:

  • all_to_all

Includes the knowledge base module:

  • basic_internode_connectivity.clp

basic_shells.xml

Identifies missing and failing bash and sh shells. Includes the providers:

  • shells
  • uname

Includes the analyzer extension:

  • shells

Includes the knowledge base module:

  • basic_shells.clp

benchmarks.xml

Runs all benchmarks and their dependencies. These benchmarks evaluate CPU performance, floating point computation, network bandwidth and latency, I/O bandwidth, and memory bandwidth. Includes the Framework Definitions:

  • dgemm_cpu_performance.xml
  • ethernet.xml
  • hpl_cluster_performance.xml
  • imb_pingpong_fabric_performance.xml
  • iozone_disk_bandwidth_performance.xml
  • sgemm_cpu_performance.xml
  • stream_memory_bandwidth_performance.xml

clock.xml

Verifies that the clock offset is not above the threshold, the NTP client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the Framework Definition:

  • network_time_uniformity.xml

cluster.xml

Ensures that all nodes in the cluster are able to communicate with one another by confirming the consistency of node IP addresses, verifying Ethernet consistency, executing the HPL benchmark and the Intel® MPI Benchmarks PingPong benchmark, and ensuring that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the Framework Definitions:

  • basic_internode_connectivity.xml
  • ethernet.xml
  • hpl_cluster_performance.xml
  • imb_pingpong_fabric_performance.xml
  • mpi_multinode_functionality.xml

cpu.xml

Verifies the uniformity of CPU model names, the Intel® Turbo Boost Technology status, the number of logical cores, the number of threads per core, and the presence of kernel flags. Confirms that the CPU is a 64-bit Intel® processor. For Intel® Xeon Phi™ processors, verifies the uniformity of cluster/memory modes; verifies the nohz_full, isolcpus, and rcu_nocbs kernel configuration parameters; and confirms that the memory-side cache file is the latest version. Includes the providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • hwloc_dump_hwdata
  • intel_pstate_status
  • kernel_tools
  • lscpu
  • numactl
  • uname

Includes the analyzer extension:

  • cpu

Includes the knowledge base module:

  • cpu.clp

dapl_fabric_providers_present.xml

Verifies that DAPL (Direct Access Programming Libraries) providers are present. Includes the providers:

  • datconf
  • ibstat
  • ipaddr
  • uname

Includes the analyzer extension:

  • datconf

Includes the knowledge base module:

  • dapl_fabric_providers_present.clp

dgemm_cpu_performance.xml

Verifies CPU performance using a double precision matrix multiplication routine. Reports nodes with substandard FLOPS relative to a threshold based on the hardware, and reports performance outliers outside the range defined by the median absolute deviation. Includes the providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dgemm
  • dmesg
  • dmidecode
  • hwloc_dump_hwdata
  • intel_pstate_status
  • kernel_tools
  • lscpu
  • meminfo
  • numactl
  • uname

Includes the analyzer extensions:

  • cpu
  • dgemm
  • memory

Includes the knowledge base module:

  • dgemm_cpu_performance.clp
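
Several framework definitions in this list, including dgemm_cpu_performance.xml, report outliers outside a range defined by the median absolute deviation (MAD). The following Python sketch shows the general technique, not the product's implementation. The multiplier k, its default of 3.0, the function name mad_outliers, and the sample values are all hypothetical.

```python
import statistics


def mad_outliers(values, k=3.0):
    """Return node names whose value lies more than k MADs from the median.

    values: mapping of node name -> measured value (for example, GFLOPS).
    k: hypothetical multiplier; the range used by the product is not stated here.
    """
    data = list(values.values())
    median = statistics.median(data)
    # MAD is the median of the absolute deviations from the median.
    mad = statistics.median([abs(v - median) for v in data])
    low, high = median - k * mad, median + k * mad
    return sorted(name for name, v in values.items() if v < low or v > high)


if __name__ == "__main__":
    # Hypothetical per-node results.
    sample = {"n1": 100, "n2": 102, "n3": 98, "n4": 101, "n5": 60}
    print(mad_outliers(sample))
```

The MAD is used instead of the standard deviation because a single very slow node shifts it far less, so the acceptable range stays anchored to the typical nodes.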

environment_variables_uniformity.xml

Verifies the uniformity of all environment variables. Includes the providers:

  • printenv
  • uname

Includes the analyzer extension:

  • environment

Includes the knowledge base module:

  • environment_variables_uniformity.clp

ethernet.xml

Verifies the consistency of Ethernet drivers, driver versions, and MTU (maximum transmission unit) across the cluster. Verifies that Ethernet interrupt coalescing is enabled.

Includes the providers:

  • ethtool
  • ethtool_show_coalesce
  • ipaddr
  • uname

Includes the analyzer extension:

  • ethernet

Includes the knowledge base module:

  • ethernet.clp

exclude_hpl.xml

Provides a complete analysis of the cluster, excluding the hpl_cluster_performance framework definition and analysis related to specific specs. Includes the framework definitions:

  • basic_internode_connectivity.xml
  • cpu.xml
  • dapl_fabric_providers_present.xml
  • dgemm_cpu_performance.xml
  • environment_variables_uniformity.xml
  • ethernet.xml
  • file_system_uniformity.xml
  • imb_pingpong_fabric_performance.xml
  • infiniband.xml
  • iozone_disk_bandwidth_performance.xml
  • kernel_version_uniformity.xml
  • kernel_parameter_uniformity.xml
  • local_disk_storage.xml
  • lshw_hardware_uniformity.xml
  • lustre_mounted.xml
  • memory_uniformity.xml
  • mpi_local_functionality.xml
  • mpi_multinode_functionality.xml
  • network_time_uniformity.xml
  • node_process_status.xml
  • opa.xml
  • perl_functionality.xml
  • python_functionality.xml
  • rpm_uniformity.xml
  • sgemm_cpu_performance.xml
  • shell_functionality.xml
  • stream_memory_bandwidth_performance.xml
  • tcl_functionality.xml

Includes the providers:

  • chkconfig
  • checksums
  • loadavg
  • mtab
  • ulimit
  • who

files_snapshot.xml

Looks for configuration file changes between snapshot_x and snapshot_y. Includes the providers:

  • files_head
  • files_compute
  • uname

Includes the analyzer extension:

  • files

Includes the knowledge base module:

  • files_snapshot.clp

file_system_uniformity.xml

Confirms that the /tmp directory has appropriate permissions, /dev/shm and /proc are properly mounted, and the home path is uniform and shared across the cluster. Includes the providers:

  • mount
  • stat_home
  • stat_tmp
  • uname

Includes the analyzer extension:

  • mount

Includes the knowledge base module:

  • file_system_uniformity.clp

hardware.xml

Verifies CPU configuration, InfiniBand functionality, hardware uniformity, and Intel® Omni-Path Host Fabric Interface functionality. Includes the Framework Definitions:

  • cpu.xml
  • infiniband.xml
  • lshw_hardware_uniformity.xml
  • opa.xml

hardware_snapshot.xml

Looks for hardware location changes between snapshot_x and snapshot_y. Includes the providers:

  • hw_head
  • hw_compute
  • uname

Includes the analyzer extension:

  • hardware

Includes the knowledge base module:

  • hardware_snapshot.clp

health.xml

Provides a complete analysis of the cluster, excluding analysis related to specific specs. Includes the Framework Definitions:

  • basic_internode_connectivity.xml
  • basic_shells.xml
  • cpu.xml
  • dapl_fabric_providers_present.xml
  • dgemm_cpu_performance.xml
  • environment_variables_uniformity.xml
  • ethernet.xml
  • file_system_uniformity.xml
  • hpl_cluster_performance.xml
  • imb_pingpong_fabric_performance.xml
  • infiniband.xml
  • kernel_version_uniformity.xml
  • kernel_parameter_uniformity.xml
  • local_disk_storage.xml
  • lshw_hardware_uniformity.xml
  • lustre_mounted.xml
  • memory_uniformity.xml
  • mpi_local_functionality.xml
  • mpi_multinode_functionality.xml
  • network_time_uniformity.xml
  • node_process_status.xml
  • opa.xml
  • perl_functionality.xml
  • python_functionality.xml
  • rpm_uniformity.xml
  • services_status.xml
  • sgemm_cpu_performance.xml
  • shell_functionality.xml
  • stream_memory_bandwidth_performance.xml
  • tcl_functionality.xml

hpcg_cluster

The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on four-node sub-clusters as an Intel® MPI Library based benchmark. Includes the providers:

  • hpl_cluster
  • uname

Includes the analyzer extension:

  • hpcg_cluster

Includes the knowledge base module:

  • hpcg_cluster.clp

hpcg_single

The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on each individual node as an Intel® MPI Library based benchmark. Includes the providers:

  • hpl_single
  • uname

Includes the analyzer extension:

  • hpcg_single

Includes the knowledge base module:

  • hpcg_single.clp

hpl_cluster_performance.xml

Reports if the HPL benchmark ran successfully on the cluster and each pair of nodes within the cluster. Reports performance outliers for the pairwise execution outside the range defined by the median absolute deviation. Includes the providers:

  • hpl_cluster
  • hpl_pairwise
  • uname

Includes the analyzer extension:

  • hpl

Includes the knowledge base module:

  • hpl_cluster_performance.clp

imb_pingpong_fabric_performance.xml

Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Also reports network bandwidth and latency outliers relative to other measured values in the same grouping, and reports when latency or network bandwidth does not meet a certain threshold. Includes the providers:

  • datconf
  • ethtool
  • ethtool_show_coalesce
  • ibstat
  • imb_pingpong
  • ipaddr
  • lspci
  • mpi_internode
  • mpi_local
  • ofedinfo
  • tmiconf
  • udevadm-net
  • uname

Includes the analyzer extension:

  • imb_pingpong

Includes the knowledge base module:

  • imb_pingpong_fabric_performance.clp

imb_pingpong.xml

Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Includes additional framework definitions that identify problems that could cause this benchmark to fail to run. Includes the Framework Definitions:

  • imb_pingpong_fabric_performance.xml
  • infiniband.xml
  • mpi_multinode_functionality.xml
  • mpi_local_functionality.xml
  • opa.xml

infiniband.xml

Verifies InfiniBand functionality by confirming the consistency of InfiniBand hardware and firmware, confirming that memlock size is sufficient and consistent across the cluster, verifying that InfiniBand HCA ports are in the Active state and the LinkUp physical state, verifying that HCA states are consistent, confirming that the InfiniBand HCA rate is consistent, and verifying InfiniBand card presence and functionality. Includes the framework definition:

  • dapl_fabric_providers_present.xml

Includes the data providers:

  • datconf
  • ibstat
  • ibv_devinfo
  • lspci
  • ofedinfo
  • ulimit
  • uname

Includes the analyzer extension:

  • infiniband

Includes the knowledge base module:

  • infiniband.clp

iozone_disk_bandwidth_performance.xml

Verifies the I/O performance of a storage device by searching for I/O bandwidth outliers outside the range defined by the median absolute deviation. Includes the data providers:

  • iozone
  • uname

Includes the analyzer extension:

  • iozone

Includes the knowledge base module:

  • iozone_disk_bandwidth_performance.clp

kernel_parameter_preferred

Verifies that each specified kernel parameter has its preferred value across the cluster. Includes the data providers:

  • sysctl
  • uname

Includes the analyzer extension:

  • kernel_param

Includes the knowledge base module:

  • kernel_parameter_preferred.clp

To use this framework definition, specify any preferred kernel parameter values in the Intel® Cluster Checker configuration file using the following format:

<analyzer>
    <config>
        <kernel-param-preferred>
            <entry>kernel.parameter|node_role|value</entry>
        </kernel-param-preferred>
    </config>
</analyzer>

In this format, the first value is the kernel parameter, the second value is the node role, and the third value is the preferred value for the given kernel parameter.
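
As a rough sketch of how such entries break down, the following Python example extracts the three fields from each entry element. The parameter names, node roles, and values in the sample are examples only, and the function name preferred_kernel_params is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries; names, roles, and values are illustrative only.
SAMPLE = """<analyzer>
    <config>
        <kernel-param-preferred>
            <entry>vm.swappiness|compute|10</entry>
            <entry>vm.overcommit_memory|head|0</entry>
        </kernel-param-preferred>
    </config>
</analyzer>"""


def preferred_kernel_params(xml_text):
    """Return a list of (kernel_parameter, node_role, preferred_value) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for entry in root.iter("entry"):
        # Split on the first two '|' only, so a value may itself contain '|'.
        parameter, role, value = (entry.text or "").strip().split("|", 2)
        result.append((parameter, role, value))
    return result


if __name__ == "__main__":
    for item in preferred_kernel_params(SAMPLE):
        print(item)
```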

kernel_parameter_uniformity.xml

Verifies that kernel parameter data is uniform across the cluster. Includes the data providers:

  • sysctl
  • uname

Includes the analyzer extension:

  • kernel_param

Includes the knowledge base module:

  • kernel_parameter_uniformity.clp

kernel_version_uniformity.xml

For each node, verifies that the kernel version is the same as at least 90% of the other nodes. Includes the data providers:

  • uname

Includes the analyzer extension:

  • kernel

Includes the knowledge base module:

  • kernel_version_uniformity.clp
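
The 90% rule can be pictured with the following Python sketch. It is one interpretation of the description above, not the rule as implemented in the knowledge base module: it assumes that "the other nodes" excludes the node being examined, and the function name, node names, and kernel version strings are hypothetical.

```python
from collections import Counter


def nonuniform_kernel_nodes(versions, threshold=0.9):
    """Return nodes whose kernel version matches fewer than `threshold`
    of the other nodes.

    versions: mapping of node name -> kernel version string.
    """
    counts = Counter(versions.values())
    others = len(versions) - 1
    flagged = []
    for node, version in versions.items():
        if others == 0:
            continue  # a single node has nothing to compare against
        same = counts[version] - 1  # other nodes reporting the same version
        if same / others < threshold:
            flagged.append(node)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical 20-node cluster with one node on a different kernel.
    cluster = {f"node{i:02d}": "3.10.0-957" for i in range(1, 20)}
    cluster["node20"] = "3.10.0-862"
    print(nonuniform_kernel_nodes(cluster))
```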

local_disk_storage.xml

Verifies that there is enough free disk space on each node. Includes the data providers:

  • df
  • mount
  • uname

Includes the analyzer extension:

  • storage

Includes the knowledge base module:

  • local_disk_storage.clp

lshw_hardware_uniformity.xml

Verifies the uniformity of hardware installed across the cluster. Determines missing hardware parameters. Includes the data providers:

  • lshw
  • uname

Includes the analyzer extension:

  • lshw

Includes the knowledge base module:

  • lshw_hardware_uniformity.clp

lustre_mounted.xml

Verifies that the Lustre kernel modules are loaded and the object storage targets are active, mounted, uniform and writable across the cluster. Includes the data providers:

  • lsmod
  • lustre_check_servers
  • lustre_logs
  • lustre_df
  • lustre_stripe
  • uname

Includes the analyzer extension:

  • lustre

Includes the knowledge base module:

  • lustre_mounted.clp

memory_uniformity.xml

Determines if the amount of physical memory is uniform across the cluster. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • dmidecode
  • hwloc_dump_hwdata
  • kernel_tools
  • lscpu
  • meminfo
  • numactl
  • uname

Includes the analyzer extension:

  • memory

Includes the knowledge base module:

  • memory_uniformity.clp

mpi_local_functionality.xml

Determines whether MPI is present and whether its path is uniform across all nodes. Includes the data providers:

  • mpi_local
  • uname

Includes the analyzer extension:

  • mpi_local

Includes the knowledge base module:

  • mpi_local_functionality.clp

mpi_multinode_functionality.xml

Verifies that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the data providers:

  • mpi_internode
  • uname

Includes the analyzer extension:

  • mpi_internode

Includes the knowledge base module:

  • mpi_multinode_functionality.clp

mpi.xml

Verifies that MPI is present, that the path is uniform across nodes, and that MPI successfully runs across the cluster. Runs benchmarks related to MPI performance. Includes the framework definitions:

  • hpl_cluster_performance.xml
  • imb_pingpong_fabric_performance.xml
  • mpi_local_functionality.xml
  • mpi_multinode_functionality.xml

network_time_uniformity.xml

Verifies that the clock offset is not above the threshold, the Network Time Protocol (NTP) client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the data providers:

  • chronyc
  • ntpq
  • uname

Includes the analyzer extension:

  • ntp

Includes the knowledge base module:

  • network_time_uniformity.clp

node_process_status.xml

Identifies nodes with zombie processes and nodes with processes that have high CPU and memory requirements. Includes the data providers:

  • ps
  • uname

Includes the analyzer extension:

  • process

Includes the knowledge base module:

  • node_process_status.clp

opa.xml

Verifies Intel® Omni-Path Architecture (Intel® OPA) Interface functionality by confirming the consistency of Intel® OPA hardware and firmware, by verifying that Intel® OPA HCA ports are in the Active state and the LinkUp physical state, by verifying that HCA states are consistent, by confirming that the Intel® OPA HCA rate is consistent, by verifying that an Intel® OPA subnet manager is running, and by confirming that memlock size is sufficient and consistent across the cluster. Includes the data providers:

  • lspci
  • opahfirev
  • opatools
  • opasmaquery
  • saquery
  • ulimit
  • uname

Includes the analyzer extension:

  • opa

Includes the knowledge base module:

  • opa.clp

perl_functionality.xml

Verifies the presence, functionality, and consistency of the Perl version. Includes the data providers:

  • perl
  • uname

Includes the analyzer extension:

  • perl

Includes the knowledge base module:

  • perl_functionality.clp

python_functionality.xml

Verifies the presence, functionality, and consistency of the Python version. Includes the data providers:

  • python
  • uname

Includes the analyzer extension:

  • python

Includes the knowledge base module:

  • python_functionality.clp

rpm_snapshot.xml

Checks for RPMs installed across the cluster and compares the data from snapshot_x with the data from snapshot_y. Includes the providers:

  • rpm_list
  • uname

Includes the analyzer extension:

  • rpm_baseline

Includes the knowledge base module:

  • rpm_snapshot.clp

rpm_uniformity.xml

Verifies the uniformity of the RPMs installed across the cluster and reports absent and superfluous RPMs. Includes the data providers:

  • rpm_list
  • uname

Includes the analyzer extension:

  • rpm

Includes the knowledge base module:

  • rpm_uniformity.clp

select_solutions_sim_mod_benchmarks

Checks benchmark performance against thresholds required by Intel® Select Solutions for Simulation and Modeling. These benchmarks evaluate CPU performance for double precision floating point operations on a single node and a four node cluster, network bandwidth and latency, and memory bandwidth. Includes the data providers:

  • dgemm
  • hpcg_cluster
  • hpcg_single
  • hpl_cluster
  • imb_pingpong
  • stream
  • uname

Includes the analyzer extensions:

  • dgemm
  • hpl
  • hpcg_cluster
  • hpcg_single
  • imb_pingpong
  • stream

Includes the knowledge base module:

  • select_solutions_sim_mod_benchmarks.clp

select_solutions_sim_mod_priv

Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that must be checked as a privileged user. It checks the system requirements for processor, memory, and fabric, and must be run as a privileged user. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_user.xml (run as a normal user) verifies compliance with Intel® Select Solutions for Simulation and Modeling. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • hwloc_dump_hwdata
  • intel_pstate_status
  • kernel_tools
  • lspci_verbose
  • lscpu
  • numactl
  • dmidecode
  • meminfo
  • uname

Includes the analyzer extensions:

  • cpu
  • devices
  • memory

Includes the knowledge base module:

  • select_solutions_sim_mod_system_requirements.clp

select_solutions_sim_mod_user

Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that has to be checked as a non-privileged user. It checks benchmark performance and compliance with Intel® Scalable System Framework. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_priv.xml (run as a privileged user) will verify compliance with Intel® Select Solutions for Simulation and Modeling. Includes the framework definitions:

  • select_solutions_sim_mod_benchmarks.xml
  • ssf_compat-hpc-2016.0.xml

services_status

Verifies that the status of each service matches the status required by the provided configuration file. Includes the data providers:

  • systemctl_status
  • uname

Includes the analyzer extension:

  • services_status

Includes the knowledge base module:

  • services_status.clp

To use this framework definition, specify the preferred service status in the Intel® Cluster Checker configuration file using the following format:

<analyzer>
    <config>
        <preferred-services-status>
            <entry>service_name|compute|loaded|active|running</entry>
        </preferred-services-status>
    </config>
</analyzer>

This format takes five values:

  1. Service name
  2. Node role
  3. LOAD status - whether the unit definition was properly loaded
  4. ACTIVE status - the high level unit activation state (i.e., a generalization of SUB)
  5. SUB status - the low level unit activation state (values depend on unit type)
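
The five values listed above can be read out of an entry as in the following Python sketch. The ServiceStatus type and the parse_service_entry function are hypothetical names introduced for illustration, and sshd is only an example service name.

```python
from typing import NamedTuple


class ServiceStatus(NamedTuple):
    """The five '|'-separated values of a preferred-services-status entry."""
    service: str
    role: str
    load: str
    active: str
    sub: str


def parse_service_entry(text):
    """Split one entry into its five fields, rejecting malformed input."""
    fields = [field.strip() for field in text.split("|")]
    if len(fields) != 5:
        raise ValueError(
            f"expected 5 '|'-separated values, got {len(fields)}: {text!r}"
        )
    return ServiceStatus(*fields)


if __name__ == "__main__":
    # Hypothetical entry: sshd on compute nodes should be loaded, active, running.
    print(parse_service_entry("sshd|compute|loaded|active|running"))
```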

sgemm_cpu_performance.xml

Verifies CPU performance using a single precision matrix multiplication routine and reports node outliers outside the range defined by the median absolute deviation. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • sgemm
  • dmesg
  • hwloc_dump_hwdata
  • intel_pstate_status
  • kernel_tools
  • lscpu
  • numactl
  • uname

Includes the analyzer extensions:

  • cpu
  • sgemm

Includes the knowledge base module:

  • sgemm_cpu_performance.clp

shell_functionality.xml

Identifies missing and failing bash, csh, sh and tcsh shells. Includes the framework definition:

  • basic_shells.xml

Includes the data providers:

  • shells
  • uname

Includes the analyzer extension:

  • shells

Includes the knowledge base module:

  • shell_functionality.clp

single.xml

Runs all framework definitions relevant to a single node. Evaluates CPU functionality, network connectivity, file systems, shell functionality, environment variables, and Perl and Python versions, and verifies clock offset and Intel® MPI Library functionality. Includes the framework definitions:

  • cpu.xml
  • ethernet.xml
  • environment_variables_uniformity.xml
  • file_system_uniformity.xml
  • lustre_mounted.xml
  • mpi_local_functionality.xml
  • network_time_uniformity.xml
  • opa.xml
  • perl_functionality.xml
  • python_functionality.xml
  • shell_functionality.xml

Includes the data providers:

  • checksums
  • chkconfig
  • datconf
  • df
  • dgemm
  • ibstat
  • ibv_devinfo
  • ifconfig
  • iozone
  • issue
  • kernel_tools
  • ldconfig
  • loadavg
  • lsb
  • lsb_tools
  • lscpu
  • lshw
  • meminfo
  • modinfo
  • mtab
  • numactl
  • ofedinfo
  • printenv
  • ps
  • resolvconf
  • rpm_list
  • ssf_version
  • sshdconf
  • stat_home
  • stat_tmp
  • stream
  • sysctl
  • tcl
  • tmiconf
  • tmp
  • udevadm-net
  • uptime
  • who

ssf_compat-base-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework base application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definition:

  • ssf_core-2016.0.xml

Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • df
  • dmesg
  • dmidecode
  • hwloc_dump_hwdata
  • kernel_tools
  • libraries
  • lsb_tools
  • lscpu
  • meminfo
  • mount
  • numactl
  • perl
  • python
  • shells
  • stat_home
  • stat_tmp
  • tcl
  • uname

Includes the analyzer extensions:

  • libraries
  • lsb_tools
  • memory
  • mount
  • perl
  • python
  • shells
  • storage
  • tcl

Includes the knowledge base module:

  • ssf_compat-base-2016.0.clp

ssf_compat-hpc-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework high performance computer cluster application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definitions:

  • ssf_hpc-cluster-2016.0.xml
  • ssf_compat-base-2016.0.xml

Includes the data providers:

  • all_to_all
  • mpi_local
  • uname

Includes the analyzer extensions:

  • all_to_all
  • mpi_local

Includes the knowledge base module:

  • ssf_compat-hpc-2016.0.clp

ssf_compliance_perl_version.xml

Determines if the Perl version is 5.10 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • perl
  • uname

Includes the analyzer extension:

  • perl

Includes the knowledge base module:

  • ssf_compliance_perl_version.clp

ssf_compliance_python_version.xml

Determines if the Python version is 2.6 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • python
  • uname

Includes the analyzer extension:

  • python

Includes the knowledge base module:

  • ssf_compliance_python_version.clp

ssf_compliance_shell.xml

Determines if shells meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • shells
  • uname

Includes the analyzer extension:

  • shells

Includes the knowledge base module:

  • ssf_compliance_shell.clp

ssf_compliance_tcl_version.xml

Determines if the Tcl version is 8.5 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • tcl
  • uname

Includes the analyzer extension:

  • tcl

Includes the knowledge base module:

  • ssf_compliance_tcl_version.clp

ssf_core-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework core requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • hwloc_dump_hwdata
  • intel_pstate_status
  • kernel_tools
  • lscpu
  • mount
  • numactl
  • printenv
  • ssf_version
  • stat_home
  • stat_tmp
  • uname

Includes the analyzer extensions:

  • cpu
  • environment
  • kernel
  • mount
  • ssf_version

Includes the knowledge base module:

  • ssf_core-2016.0.clp

ssf_environment_variables_mounted.xml

Verifies that TMPDIR and HOME environment variables meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • mount
  • stat_home
  • stat_tmp
  • uname

Includes the analyzer extension:

  • mount

Includes the knowledge base module:

  • ssf_environment_variables_mounted.clp

ssf_hpc-cluster-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework requirements for a classic high performance compute cluster. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definition:

  • ssf_core-2016.0.xml

Includes the data providers:

  • all_to_all
  • uname

Includes the analyzer extension:

  • all_to_all

Includes the knowledge base module:

  • ssf_hpc-cluster-2016.0.

ssf_kernel_version.xml

Verifies that the kernel is version 2.6.32 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data provider:

  • uname

Includes the analyzer extension:

  • kernel

Includes the knowledge base module:

  • ssf_kernel_version.clp
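
A minimum-version check like this one has to compare version components numerically rather than as strings or floats (for example, 2.6.9 is older than 2.6.32). The following Python sketch illustrates one way to do that. It assumes the release string has the form reported by `uname -r` and, following the kernel-not-ssf rule description in the Rules section, treats everything before the first hyphen as the base version. The function name and return convention are hypothetical and are not taken from the product.

```python
import re

MIN_KERNEL = (2, 6, 32)  # minimum version stated in the description above


def kernel_meets_minimum(release, minimum=MIN_KERNEL):
    """Return True or False, or None when the base version is not numeric."""
    # Base version: everything before the first hyphen,
    # e.g. "3.10.0" from "3.10.0-957.el7.x86_64".
    base = release.split("-", 1)[0]
    if not re.fullmatch(r"\d+(\.\d+)*", base):
        return None  # letters or other unexpected characters in the base
    parts = tuple(int(p) for p in base.split("."))
    # Pad both tuples to the same length so (3, 10) compares like (3, 10, 0).
    width = max(len(parts), len(minimum))
    parts += (0,) * (width - len(parts))
    required = tuple(minimum) + (0,) * (width - len(minimum))
    return parts >= required
```

Returning None for a non-numeric base keeps the "cannot compare" case separate from a genuine failure, which loosely mirrors the flag described for the kernel-not-ssf rule.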

ssf_libraries.xml

Verifies that the Intel® Scalable System Framework libraries are present. Includes the data providers:

  • libraries
  • uname

Includes the analyzer extension:

  • libraries

Includes the knowledge base module:

  • ssf_libraries.clp

ssf_linux_based_tools_present.xml

Verifies that the Intel® Scalable System Framework (Intel® SSF) required Linux*-based tools are present. Includes the data providers:

  • lsb_tools
  • uname

Includes the analyzer extension:

  • lsb_tools

Includes the knowledge base module:

  • ssf_linux_based_tools_present.clp

ssf_minimum_memory_requirements_base.xml

Verifies that the amount of physical memory per core is above 16 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • hwloc_dump_hwdata
  • kernel_tools
  • lscpu
  • meminfo
  • numactl
  • uname

Includes the analyzer extension:

  • memory

Includes the knowledge base module:

  • ssf_minimum_memory_requirements_base.clp

ssf_minimum_memory_requirements_hpc.xml

Verifies that the amount of physical memory per core is above 32 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • cpuid
  • cpuinfo
  • cpupower
  • dmesg
  • hwloc_dump_hwdata
  • kernel_tools
  • lscpu
  • meminfo
  • numactl
  • uname

Includes the analyzer extension:

  • memory

Includes the knowledge base module:

  • ssf_minimum_memory_requirements_hpc.clp

ssf_minimum_storage.xml

Verifies that the head node has at least 200 GiB of direct access storage and that all compute nodes have access to at least 80 GiB of persistent storage, per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

  • df
  • mount
  • uname

Includes the analyzer extension:

  • storage

Includes the knowledge base module:

  • ssf_minimum_storage.clp
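
As a rough illustration of the thresholds in the description above, the Python sketch below checks a list of nodes against the 200 GiB (head node) and 80 GiB (compute node) minimums. The tuple layout, the role strings, and the `storage_shortfalls` helper are assumptions made for this example; the product derives the values from the df and mount data providers and evaluates them in CLIPS.

```python
GIB = 1024 ** 3          # bytes per GiB
HEAD_MIN_GIB = 200       # head node minimum from the description above
COMPUTE_MIN_GIB = 80     # compute node minimum from the description above


def storage_shortfalls(nodes):
    """Return (name, required_gib) for each node below its minimum.

    nodes: iterable of (name, role, bytes_available) tuples, where role is
    "head" or "compute" (a hypothetical layout for this sketch).
    """
    failures = []
    for name, role, nbytes in nodes:
        required = HEAD_MIN_GIB if role == "head" else COMPUTE_MIN_GIB
        if nbytes < required * GIB:
            failures.append((name, required))
    return failures


# Hypothetical cluster: one compute node and one head node fall short.
cluster = [
    ("head1", "head", 250 * GIB),
    ("c1", "compute", 80 * GIB),
    ("c2", "compute", 64 * GIB),
    ("head2", "head", 150 * GIB),
]
short = storage_shortfalls(cluster)
```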

ssf_version.xml

Verifies that the Intel® Scalable System Framework (Intel® SSF) file is present and the file /etc/ssf-release contains the correct version and layers. Includes the data providers:

  • ssf_version
  • uname

Includes the analyzer extension:

  • ssf_version

Includes the knowledge base module:

  • ssf_version.clp

stream_memory_bandwidth_performance.xml

Identifies nodes with memory bandwidth outliers (as reported by the STREAM benchmark) outside the range defined by the median absolute deviation. Includes the data providers:

  • stream
  • uname

Includes the analyzer extension:

  • stream

Includes the knowledge base module:

  • stream_memory_bandwidth_performance.clp

tcl_functionality.xml

Verifies that Tcl is installed, functional and uniform across all nodes. Includes the data providers:

  • tcl
  • uname

Includes the analyzer extension:

  • tcl

Includes the knowledge base module:

  • tcl_functionality

tools.xml

Verifies that Tcl, Python, and Perl are installed, functional, and uniform. Includes the framework definitions:

  • perl_functionality.xml
  • python_functionality.xml
  • tcl_functionality.xml

 

Rules

The C Language Integrated Production System (CLIPS) is an expert system shell that combines an inference engine with a language for representing knowledge. Intel® Cluster Checker uses CLIPS to implement its knowledge base component and to define CLIPS classes and rules. Each CLIPS class has one or more associated CLIPS rules. Each rule is identified by a unique ID. An example is all-to-all-data-is-too-old, which is associated with the all_to_all analyzer extension.

The remainder of this section contains short descriptions of the rules integrated into the knowledge base. Most rule names are composed of the class name plus a very short description of the rule. For instance, the cpu-data-is-too-old rule checks that the collected CPU data is recent.
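
Many of the rules in this section share the same data-age test: data is treated as too old when nothing has been collected in the last 7 days (604800 seconds). The Python sketch below shows that comparison under stated assumptions. The timestamp dictionary and the `stale_nodes` helper are hypothetical, and treating data that is exactly 604800 seconds old as still valid is a choice made for this example, not documented behavior.

```python
import time

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # 7 days = 604800 seconds


def stale_nodes(last_collected, now=None, max_age=MAX_AGE_SECONDS):
    """Return the nodes whose most recent data is older than max_age.

    last_collected: dict mapping a node name to the Unix timestamp of its
    most recent data point (hypothetical layout).
    """
    if now is None:
        now = time.time()
    # Assumption: data aged exactly max_age seconds is not yet too old.
    return sorted(n for n, ts in last_collected.items() if now - ts > max_age)


# Example with a fixed reference time so the result is reproducible.
reference = 1_000_000_000
stamps = {
    "n1": reference - 3600,
    "n2": reference - 604800,
    "n3": reference - 604801,
}
too_old = stale_nodes(stamps, now=reference)
```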

  • all-logical-cores-not-available: 
    • Check for offline cores.
  • all-to-all-data-is-too-old:
    • Identify nodes where the most recent ALL_TO_ALL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • all-to-all-data-missing:
    • Check that all-to-all data is available.
  • approx-dimms-per-socket-not-balanced:
    • Check that DIMMs are installed in a balanced manner.
  • cpu-data-is-too-old:
    • Identify nodes where the most recent CPU data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • cpu-data-missing:
    • Check that CPU data is available.
  • cpu-min-processor-model:
    • Check that the minimum processor model is met.
  • cpu-min-sockets:
    • Check that the minimum number of sockets is met.
  • cpu-missing-kernel-flag:
    • Check for missing CPU kernel flag.
  • cpu-model-name-not-uniform:
    • Check that the CPU model name is uniform.
  • cpu-not-intel64:
    • Check that the CPU is a 64-bit Intel® processor.
  • cpu-tickless-error:
    • Check if an error occurred while applying the nohz-full parameter when booting the Intel® Xeon Phi™ processor.
  • cpu-tickless-isolcpus:
    • Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is a subset of the isolcpus parameter (if present).
  • cpu-tickless-kernel:
    • Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is the same as the one applied by the kernel.
  • cpu-tickless-list-not-uniform:
    • Check that the nohz-full parameter is uniform for the Intel® Xeon Phi™ processor.
  • cpu-tickless-preferred:
    • Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is in the preferred CPU list provided.
  • cpu-tickless-rcu-nocbs:
    • Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is a subset of the rcu-nocbs parameter (if present).
  • cpu-turbo-status-not-preferred:
    • Check if the Intel® Turbo Boost Technology status across nodes is the same as preferred by the user.
  • cpu-turbo-status-not-uniform:
    • Check for the consistency of Intel® Turbo Boost Technology status across a subcluster.
  • data-is-too-old-initial:
    • If there are any signs for out of date data, create a data-is-too-old diagnosis and mark the sign as diagnosed. This rule only fires for the first data-is-too-old sign per node; that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, data-is-too-old-subsequent, for the case where there are multiple signs leading to this diagnosis.
  • datconf-data-is-too-old:
    • Identify nodes where the most recent datconf data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • datconf-data-missing:
    • Check that datconf data is available.
  • datconf-no-dapl-providers:
    • Check that the datconf data lists at least one DAPL provider.
  • dgemm-data-is-substandard:
    • For the most recent DGEMM data point, identify nodes with substandard FLOPS relative to a threshold based on the hardware. The severity depends on the amount of deviation from the threshold value; the larger the deviation, the higher the severity.
  • dgemm-data-is-too-old:
    • Identify nodes where the most recent DGEMM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • dgemm-data-missing:
    • Detect cases where there is no DGEMM data.
  • dgemm-outlier:
    • Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the DGEMM statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
  • dgemm-perf-pass:
    • Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
  • dimms-per-socket-not-balanced:
    • Checks the uniformity of the DIMMs installed per socket.
  • dimms-per-socket-not-uniform:
    • Checks the uniformity of the DIMMs installed per socket.
  • dmidecode-command-not-found:
    • Check that dmidecode exists on a node.
  • dmidecode-data-error:
    • Check that dmidecode data is available and parsable.
  • dmidecode-data-missing:
    • Check if dmidecode data is missing.
  • environment-data-is-too-old:
    • Identify nodes where the most recent ENVIRONMENT data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • environment-data-missing:
    • Check that environment data is available.
  • environment-variable-not-uniform:
    • Check that an environment variable is uniform.
  • ethernet-data-is-too-old:
    • Identify nodes where the most recent ETHERNET data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • ethernet-data-missing:
    • Check that ethernet data is available.
  • ethernet-driver-is-not-consistent:
    • Identify inconsistent Ethernet drivers.
  • ethernet-driver-version-is-not-consistent:
    • Identify inconsistent Ethernet driver versions.
  • ethernet-firmware-version-is-not-consistent:
    • Identify inconsistent Ethernet firmware versions.
  • ethernet-interrupt-coalescing-is-enabled:
    • Identify nodes where Ethernet interrupt coalescing is not disabled, that is, rx-usecs is not 0 or 1. This only matters when using Ethernet as the MPI message fabric. Since the same node may be in multiple IMB pingpong pairs, check to see if the sign has already been created to avoid duplicates.
  • ethernet-mtu-is-not-consistent:
    • Identify inconsistent Ethernet MTU values.
  • failing-bash:
    • Check if bash is failing.
  • failing-csh:
    • Check if csh is failing.
  • failing-sh:
    • Check if sh is failing.
  • failing-tcsh:
    • Check if tcsh is failing.
  • files-added:
    • Check if files have been added between snapshots.
  • files-group:
    • Compare the file group between snapshots.
  • files-md5sum:
    • Compare the file md5sum between snapshots.
  • files-owner:
    • Compare the file owner between snapshots.
  • files-perms:
    • Compare the file permissions between snapshots.
  • files-removed:
    • Check if files have been removed between snapshots.
  • hfi-width-permission-err:
    • Identify if lspci was run as a non-privileged user and the width could not be determined.
  • hfi_x16_missing:
    • Identify whether each compute node has at least one x16 bus HFI (100 Gbps).
  • hpcg-4node-data-missing:
    • Check that HPCG data for a four-node cluster is available.
  • hpcg-4node-perf-pass:
    • Identify nodes that do not meet the HPCG cluster minimum performance requirements for Intel® Select Solutions for Simulation and Modeling.
  • hpcg-cluster-data-missing:
    • Check that HPCG cluster data is available.
  • hpcg-cluster-error:
    • Detect cases where the HPCG_CLUSTER data is invalid; that is, data provider output exists in the database, but the analyzer extension could not parse it.
  • hpl-cluster-failed:
    • Look for cases where HPL cluster ran but there was no success in the output.
  • hpcg-single-data-missing:
    • Check that HPCG single data is available.
  • hpcg-single-error:
    • Detect cases where the HPCG_SINGLE data is invalid; that is, data provider output exists in the database, but the analyzer extension could not parse it.
  • hpcg-single-perf-pass:
    • Identify nodes that do not meet the HPCG single-node minimum performance requirements for Intel® Select Solutions for Simulation and Modeling.
  • hpl-4node-perf-pass:
    • Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
  • hpl-data-is-too-old:
    • Identify nodes where the most recent HPL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • hpl-data-missing:
    • Check that HPL data is available.
  • hpl-pairwise-failed:
    • Look for cases where HPL pairwise ran but there was no success in the output.
  • hpl-pairwise-outlier:
    • Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviations. The statistics are computed using all samples on nodes in the same grouping (that is, have the same HPL statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
  • hw-added:
    • Check if hardware has been added between snapshots.
  • hw-modified:
    • Compare the output line between snapshots.
  • hw-removed:
    • Check if hardware has been removed between snapshots.
  • imb-pingpong-bandwidth-outlier:
    • Check that the measured Intel® MPI Benchmarks PingPong benchmark bandwidth is within the statistical range defined by other measured values in the same grouping.
  • imb-pingpong-bandwidth-perf-pass:
    • Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
  • imb-pingpong-bandwidth-threshold:
    • Check that the measured Intel® MPI Benchmarks PingPong benchmark bandwidth is greater than or equal to the expected bandwidth.
  • imb-pingpong-data-is-too-old:
    • Identify nodes where the most recent IMB-PINGPONG data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • imb-pingpong-latency-outlier:
    • Check that the measured Intel® MPI Benchmarks PingPong benchmark latency is within the statistical range defined by other measured values in the same grouping.
  • imb-pingpong-latency-perf-pass:
    • Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
  • imb-pingpong-latency-threshold:
    • Check that the measured Intel® MPI Benchmarks PingPong benchmark latency is less than or equal to the expected latency.
  • imb-pingpong-data-missing:
    • Check that Intel® MPI Benchmarks PingPong benchmark data is available.
  • infiniband-ca-type-is-not-consistent:
    • Identify inconsistent InfiniBand HCA types.
  • infiniband-data-is-too-old:
    • Identify nodes where the most recent INFINIBAND data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • infiniband-data-missing:
    • Identify instances of missing InfiniBand information.
  • infiniband-device-is-not-consistent:
    • Identify inconsistent InfiniBand PCI devices.
  • infiniband-driver-is-not-consistent:
    • Identify inconsistent InfiniBand PCI drivers.
  • infiniband-firmware-version-is-not-consistent:
    • Identify inconsistent InfiniBand HCA firmware versions.
  • infiniband-hardware-version-is-not-consistent:
    • Identify inconsistent InfiniBand HCA hardware versions.
  • infiniband-memlock-is-not-consistent:
    • Identify inconsistent memlock limits.
  • infiniband-memlock-too-small:
    • Identify too low memlock limits.
  • infiniband-ofed-version-is-not-consistent:
    • Identify inconsistent OFED versions.
  • infiniband-physical-state-is-not-consistent:
    • Identify inconsistent InfiniBand HCA physical states.
  • infiniband-physlot-is-not-consistent:
    • Identify inconsistent InfiniBand PCI card physical slots.
  • infiniband-port-physical-state-not-linkup:
    • Identify InfiniBand HCA ports not in the LinkUp physical state.
  • infiniband-port-state-not-active:
    • Identify InfiniBand HCA ports not in the Active state.
  • infiniband-rate-is-not-consistent:
    • Identify inconsistent InfiniBand HCA rate.
  • infiniband-rev-is-not-consistent:
    • Identify inconsistent InfiniBand PCI card revision.
  • infiniband-state-is-not-consistent:
    • Identify inconsistent InfiniBand HCA states.
  • intel-pstate-data-error:
    • Check that intel-pstate data is available and parsable.
  • intel-pstate-data-missing:
    • Check if intel-pstate data is missing.
  • invalid-dgemm-data:
    • Detect cases where the DGEMM data is invalid; that is, data provider output exists in the database, but the analyzer extension could not parse it.
  • invalid-services-data:
    • Identify the nodes where the provider failed to report the right services data.
  • invalid-services-specification:
    • Identify whether the preferred services specifications are given in the right format.
  • invalid-sgemm-data:
    • Detect cases where the SGEMM data is invalid; that is, data provider output exists in the database, but the analyzer extension could not parse it.
  • iozone-data-missing:
    • Check that IOzone data is available.
  • iozone-data-is-too-old:
    • Identify nodes where the most recent IOZONE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • iozone-outlier:
    • Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the IOZONE statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
  • iozone-ran-no-bandwidth:
    • This rule fires on nodes that report a bandwidth of 0.0. This is the default value; if it is found, the analyzer extension did not find a regular expression match for the bandwidth.
  • iozone-ran-not-complete:
    • This rule fires on nodes where the bandwidth is greater than 0.0 (which means the test finished and the analyzer extension found a value) but the string 'iozone test complete' is missing from the output.
  • ip-address-not-consistent:
    • This rule fires if the IP address of a node differs from the perspective of different nodes. The IP address of a particular node must be the same on all cluster nodes.
  • kernel-data-is-too-old:
    • Identify nodes where the most recent KERNEL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • kernel-data-missing:
    • Check that kernel data is available.
  • kernel-not-ssf:
    • Check whether the kernel version is less than 2.6.32, in which case the kernel is not Intel® Scalable System Framework compliant. If the base version (everything before the hyphen) contains letters, the analyzer extension passes a flag to CLIPS instead of the actual base version.
  • kernel-not-uniform:
    • If the kernel version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same kernel version, the higher the confidence that the node with the different version is incorrect.
  • kernel-param-data-is-too-old:
    • Identify nodes where the most recent KERNEL-PARAM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • kernel-param-data-missing:
    • Check that kernel parameter data is available.
  • kernel-param-not-uniform:
    • Checks that kernel parameters are uniform.
  • kernel-param-not-preferred:
    • Checks that a specified kernel parameter is in the preferred state as defined in the configuration file.
  • latest-ssf-version:
    • Determine whether the self-identified Intel® Scalable System Framework version contains the latest version (2016.0).
  • latest-xp-hwloc-memoryside-cache-file:
    • Check that the memoryside cache file for the Intel® Xeon Phi™ processor is the latest version.
  • libraries-data-missing:
    • Check that libraries data is available.
  • logical-cores-not-uniform:
    • Check for uniformity of logical core(s) among nodes having equivalent CPU(s).
  • lsb-tools-data-is-too-old:
    • Identify nodes where the most recent LSB tools data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • lsb-tools-data-missing:
    • Check that required LSB tool data is available.
  • lscpu-data-error:
    • Check that lscpu data is available and parsable.
  • lscpu-data-missing:
    • Check if lscpu data is missing.
  • lshw-data-is-too-old:
    • Identify nodes where the most recent LSHW data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • lshw-data-missing:
    • Check that lshw data is available.
  • lshw-key-missing:
    • Check if lshw key is missing.
  • lshw-not-uniform:
    • Check if lshw is uniform.
  • lspci_verbose_data_missing:
    • Identify if data is missing for devices that use the lspci_verbose provider.
  • lustre-data-is-too-old:
    • Identify nodes where the most recent LUSTRE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • lustre-data-missing:
    • Emit a sign if there is no lustre data.
  • lustre-kernel-modules-loaded-error:
    • Ensure the lustre kernel modules are loaded.
  • lustre-kernel-modules-loaded-no-data:
    • Emit a sign if there is no data from lsmod.
  • lustre-mount-point-not-mounted:
    • Check uniformity of mount points.
  • lustre-target-inactive:
    • Check if a target that is active on other nodes in the cluster is inactive on this node.
  • lustre-write-targets-uniform:
    • Checks uniformity of object targets that are written to by the stripe test.
  • lustre-no-write-targets:
    • Ensure that object targets are available for the stripe test. 
  • lustre-write-no-mount-points:
    • Ensure that at least one filesystem is mounted.
  • lustre-write-targets-mismatch:
    • Emit a sign if the number of available object targets is not equal to the number of object targets written to.
  • memory-data-is-too-old:
    • Identify nodes where the most recent MEMORY data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • memory-data-missing:
    • Check that memory data is available.
  • memory-minimum-required-compat-base:
    • Check that the amount of physical memory per core is >= 16 GiB.
  • memory-minimum-required-compat-hpc:
    • Check that the amount of physical memory per core is >= 32 GiB.
  • memory-not-uniform:
    • Check that the amount of physical memory is uniform.
  • memory-sizes-not-uniform:
    • Check if the installed DIMMs have uniform sizes.
  • memory-speeds-not-uniform:
    • Check if the installed DIMMs have uniform speeds.
  • min-mem-per-core:
    • Check that the amount of physical memory per core is >= 2 x the number of physical cores.
  • min-mem-per-core-expected:
    • Check that the amount of physical memory per core is greater than the expected memory.
  • min-mem-per-node:
    • Check that the amount of physical memory per node is >= 96 GiB.
  • min-mem-per-node-expected:
    • Check that the amount of physical memory per node is greater than the expected memory.
  • missing-bash:
    • Check if bash is missing.
  • missing-csh:
    • Check if csh is missing.
  • missing-libutil-x86-64:
    • Advisory Intel® Scalable System Framework compat-base. See the ssf_libraries rules directory for a list of all missing library rules.
  • missing-lsb-tools:
    • Check for required tools that are missing.
  • missing-opa-tools:
    • Check if the Intel® Omni-Path Architecture tools used for various checks are missing.
  • missing-saquery-tool:
    • Check if saquery is missing.
  • missing-sh:
    • Check if sh is missing.
  • missing-sh-ssf:
    • Check if sh is missing per Intel® Scalable System Framework requirements.
  • missing-tcsh:
    • Check if tcsh is missing.
  • mount-bad-tmp-perms:
    • Check that /tmp has the permissions 777.
  • mount-data-is-too-old:
    • Identify nodes where the most recent MOUNT data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • mount-data-missing:
    • Check that mount data is available.
  • mount-dev-shm-not-mounted:
    • Check that /dev/shm is properly mounted.
  • mount-home-not-defined:
    • HOME environment variable is not defined as per Intel® Scalable System Framework Architecture Specification.
  • mount-not-uniform-home-inode:
    • Check that the home path is shared on the cluster by checking the uniformity of the inodes of the home directory.
  • mount-not-uniform-home-path:
    • Check that the home path is uniform on the cluster.
  • mount-proc-not-mounted:
    • Check that /proc is properly mounted.
  • mount-tmpdir-not-defined:
    • TMPDIR environment variable is not defined as per Intel® Scalable System Framework Architecture Specification.
  • mpi-internode-broken:
    • Check whether MPI inter-node Hello World is functional.
  • mpi-internode-data-is-too-old:
    • Identify nodes where the most recent MPI-INTERNODE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • mpi-local-broken:
    • Identify cases where there are fewer than 4 lines of valid output in the parsed output, but an mpirun binary executable was found.
  • mpi-local-data-is-too-old:
    • Identify nodes where the most recent MPI-LOCAL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • mpi-local-not-found:
    • Identify cases where an mpirun binary executable itself was not found.
  • mpi-local-path-not-uniform:
    • If the mpi-local-path found on a node is not the same as on at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same mpi-local-path, the greater the confidence that the node with the different path is incorrect.
  • mpi-internode-data-missing:
    • Check that MPI internode data is available.
  • mpi-local-data-missing:
    • Check that MPI local data is available.
  • no-data-initial:
    • If there are any signs for missing data, create a no data diagnosis and mark the sign as diagnosed. This rule only fires for the first no data sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, no-data-subsequent, for the case where there are multiple signs leading to this diagnosis.
  • node-extra:
    • Check if RPM information has changed (extra node) between the snapshots.
  • node-removed:
    • Check if RPM information has changed (node removed) between the snapshots.
  • no-data-subsequent:
    • This rule is related to no-data-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
  • no_hfi_detected:
    • Check whether no HFI was found on the node.
  • non-uniform-hardware-initial:
    • If there are any signs for non-uniform hardware, create a non-uniform hardware diagnosis and mark the sign as diagnosed. This rule only fires for the first non-uniform hardware sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, non-uniform-hardware-subsequent, for the case where there are multiple signs leading to this diagnosis.
  • non-uniform-hardware-subsequent:
    • This rule is related to non-uniform-hardware-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
  • non-uniform-software-initial:
    • If there are any signs for non-uniform software, create a non-uniform software diagnosis and mark the sign as diagnosed. This rule only fires for the first non-uniform software sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, non-uniform-software-subsequent, for the case where there are multiple signs leading to this diagnosis.
  • non-uniform-software-subsequent:
    • This rule is related to non-uniform-software-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
  • not-intel-ssf-compliant-initial-2016.0:
    • If there are any signs for Intel® Scalable System Framework 2016.0 non-compliance, create a not Intel® SSF compliant diagnosis and mark the sign as diagnosed. This rule only fires for the first non-compliance sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, not-ssf-compliant-subsequent-2016.0, for the case where there are multiple signs leading to this diagnosis.
  • ntp-data-is-too-old:
    • Identify nodes where the most recent NTP data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • ntp-data-missing:
    • Check that NTP data is available.
  • ntp-not-connected:
    • Check if the NTP client is not connected to an NTP server. This is true if the remote slot is set to the default.
  • ntp-offset-above-threshold:
    • Check if reported time offset is larger than a threshold. Increase severity based on the size of the difference between the offset and threshold.
  • opa-ca-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface CA types.
  • opa-data-is-too-old:
    • Identify nodes where the most recent Intel® Omni-Path Host Fabric Interface data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • opa-data-missing:
    • Identify instances of missing Intel® Omni-Path Host Fabric Interface information.
  • opa-device-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface PCI devices.
  • opa-driver-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path drivers.
  • opa-firmware-version-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface firmware versions.
  • opa-hardware-version-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface hardware versions.
  • opa-memlock-is-not-consistent:
    • Identify inconsistent memlock limits.
  • opa-memlock-too-small:
    • Identify memlock limits that are deemed too low for the Intel® Omni-Path Fabric.
  • opa-physical-state-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface physical states.
  • opa-physlot-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface physical slots.
  • opa-port-physical-state-not-linkup:
    • Identify Intel® Omni-Path Host Fabric Interface ports not in the LinkUp physical state.
  • opa-port-state-not-active:
    • Identify Intel® Omni-Path Host Fabric Interface ports not in the Active state.
  • opa-rate-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface rates.
  • opa-regex-error:
    • If the connector regular expression fails to parse any of the Intel® Omni-Path Host Fabric Interface commands, this error should fire notifying the user of the issue.
  • opa-state-is-not-consistent:
    • Identify inconsistent Intel® Omni-Path Host Fabric Interface states.
  • opa-subnet-manager-not-running:
    • Check that an Intel® OPA subnet manager is running for Intel® Omni-Path Fabric.
  • outlier-imb-pingpong-latency-due-to-ethernet-coalescing:
    • Diagnose Intel® MPI Benchmarks PingPong latency performance outlier issues due to Ethernet interrupt coalescing not being disabled. If the imb-pingpong-latency-outlier sign is TRUE, the Intel® MPI Library settings are configured to use Ethernet, and the ethernet-interrupt-coalescing-is-enabled sign is TRUE, then conclude the inconsistent performance is due to Ethernet interrupt coalescing not being disabled. Note that the Ethernet interrupt coalescing only affects PingPong latency, not bandwidth, so there is no corresponding rule for bandwidth.
  • perl-data-is-too-old:
    • Identify nodes where the most recent Perl data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • perl-data-missing:
    • Check that Perl data is available.
  • perl-not-found:
    • If no Perl version is found and stderr contains the string 'command not found', then Perl is not installed or incorrectly installed.
  • perl-not-functional:
    • If no Perl version is present or stderr is not empty, then Perl may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Perl is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
  • perl-not-ssf:
    • If the Perl version is less than 5.10, then Perl is not Intel® Scalable System Framework compliant.
  • perl-not-uniform:
    • If the Perl version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Perl version, the greater the confidence that the node with the different version is incorrect.
  • process-data-is-too-old:
    • Identify nodes where the most recent PROCESS data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • process-data-missing:
    • Check that process data is available.
  • process-is-a-zombie:
    • For the most recent PROCESS data point, identify nodes with zombie processes, that is, processes with a Z state.
  • process-is-high-cpu:
    • For the most recent PROCESS data point, identify nodes with high CPU processes, that is, processes using more than 20% of a CPU core.
  • process-is-high-memory:
    • For the most recent PROCESS data point, identify nodes with high memory processes, that is, processes using more than 50% of memory.
  • python-data-is-too-old:
    • Identify nodes where the most recent Python data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • python-data-missing:
    • Check that Python data is available.
  • python-not-found:
    • If no Python version is found and stderr contains the string 'command not found', then Python is not installed or incorrectly installed.
  • python-not-functional:
    • If no Python version is present or stderr is not empty, then Python may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Python is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
  • python-not-ssf:
    • If the Python version is less than 2.6, then Python is not Intel® Scalable System Framework compliant.
  • python-not-uniform:
    • If the Python version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Python version, the greater the confidence that the node with the different version is incorrect.
  • rpm-added:
    • Check if RPM information has changed (extra RPM) between snapshots.
  • rpm-data-is-too-old:
    • Identify nodes where the most recent RPM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • rpm-data-missing:
    • Check that RPM data is available.
  • rpm-is-extra:
    • Check whether an RPM is present on this node, but missing on other nodes.
  • rpm-is-missing:
    • Check whether an RPM is present on other nodes, but missing on this one.
  • rpm-missing:
    • Check if RPM information has changed (RPM missing) between snapshots.
  • rpm-modified:
    • Check if RPM attributes (version, release, architecture) have been modified between snapshots.
  • service-not-available:
    • Identifies if the required services are available on the node.
  • services-data-is-too-old:
    • Identifies nodes where the most recent services data is considered too old. Too old is defined (by default) as no data from the last 7 days (604800 seconds).
  • services-data-missing:
    • Identifies the nodes missing services data.
  • services-preferred-status:
    • Identifies if the services status matches the given preferred specification.
  • sgemm-data-is-substandard:
    • For the most recent SGEMM data point, identify nodes with substandard FLOPS relative to a threshold based on the hardware. The severity depends on the amount of deviation from the threshold value; the larger the deviation, the higher the severity.
  • sgemm-data-is-too-old:
    • Identify nodes where the most recent SGEMM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • sgemm-data-missing:
    • Detect cases where there is no SGEMM data.
  • sgemm-numactl-missing:
    • Checks if the numactl binary was not found. If this binary is not installed, then SGEMM performance may be affected.
  • sgemm-outlier:
    • Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the SGEMM statistics key).
  • sgemm-taskset-missing:
    • Checks if the taskset binary was not found. If this binary is not installed, then SGEMM performance may be affected.
  • shells-data-is-too-old:
    • Identify nodes where the most recent SHELL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • shells-data-missing:
    • Check that shell data is available.
  • ssf-file-not-found:
    • If no Intel® Scalable System Framework (Intel® SSF) versions are found and stderr contains the string 'No such file or directory', then the file is missing.
  • ssf-file-other-error:
    • If no Intel® Scalable System Framework (Intel® SSF) versions are found or stderr is not empty, then the file may not be readable. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then the file is definitely not readable, so use high confidence and severity values. Avoid matching the 'No such file or directory' case that is handled separately.
  • ssf-layer-dependency-compat-hpc:
    • Determine whether layer self is also in /etc/ssf-release.
  • ssf-layer-dependency-hpc-cluster-compat-base:
    • Determine whether all contained layers are also in /etc/ssf-release.
  • ssf-layer-dependency-self:
    • Determine whether all contained layers are also in /etc/ssf-release.
  • ssf-libraries-data-is-too-old:
    • Identify nodes where the most recent LIBRARIES data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • ssf-version-data-is-too-old:
    • Identify nodes where the most recent Intel® SSF data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • ssf-version-data-missing:
    • Check that Intel® Scalable System Framework (Intel® SSF) version data is available.
  • storage-data-is-too-old:
    • Identify nodes where the most recent STORAGE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • storage-data-missing:
    • Check that storage data is available.
  • storage-ssf-compute:
    • Checks the Intel® Scalable System Framework (Intel® SSF) required minimum for compute node storage. The compute node must have at least 16 GiB of RAM and access to at least 80 GiB of persistent storage. Login nodes should have at least 200 GiB of persistent storage.
  • storage-ssf-head:
    • Checks the Intel® Scalable System Framework (Intel® SSF) required minimum for head node storage. The head node must be attached to 200 GiB of direct access storage.
  • stream-data-error:
    • Look for cases where STREAM failed for any reason other than libiomp5 not being found.
  • stream-data-is-too-old:
    • Identify nodes where the most recent STREAM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • stream-data-missing:
    • Check that STREAM data is available.
  • stream-failed-validation:
    • Identifies cases where the string "Failed validation" is found in stdout. In these cases, the triad value is still populated, so the existence of the triad value cannot be relied on.
  • stream-no-runtimes:
    • Look for cases where STREAM failed because libiomp5 could not be found.
  • stream-outlier:
    • Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the STREAM statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
  • stream-perf-pass:
    • Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
  • substandard-dgemm-due-to-dimms:
    • Diagnose substandard DGEMM performance issues due to insufficient DIMMs. If the dgemm-performance sign is substandard and the number of DIMMs per socket is insufficient, then conclude the substandard performance is due to the insufficient DIMMs.
  • substandard-dgemm-due-to-high-cpu-process:
    • Diagnose substandard DGEMM performance issues due to a conflicting process that is consuming a high amount of CPU. If the dgemm-performance sign is substandard and the high-cpu-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high CPU process.
  • substandard-dgemm-due-to-high-memory-process:
    • Diagnose substandard DGEMM performance issues due to a conflicting process that is consuming a large amount of memory. If the dgemm-performance sign is substandard and the high-memory-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high memory process.
  • substandard-dgemm-due-to-offline-cores:
    • Diagnose substandard DGEMM performance issues due to detected offline cores. If the dgemm-performance sign is substandard and the all-logical-cores-not-available sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance may be due to the offline cores.
  • substandard-imb-pingpong-latency-due-to-ethernet-coalescing:
    • Diagnose substandard IMB pingpong latency performance issues due to Ethernet interrupt coalescing not being disabled. If the imb-pingpong-latency-threshold sign is TRUE (substandard), the Intel® MPI Library settings are configured to use Ethernet, and the ethernet-interrupt-coalescing-is-enabled sign is TRUE, then conclude the substandard performance is due to Ethernet interrupt coalescing not being disabled. Note that the Ethernet interrupt coalescing only affects IMB pingpong latency, not bandwidth, so there is no corresponding rule for bandwidth.
  • substandard-sgemm-due-to-high-cpu-process:
    • Diagnose substandard SGEMM performance issues due to a conflicting process that is consuming a high amount of CPU. If the sgemm-performance sign is substandard and the high-cpu-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high CPU process.
  • substandard-sgemm-due-to-high-memory-process:
    • Diagnose substandard SGEMM performance issues due to a conflicting process that is consuming a large amount of memory. If the sgemm-performance sign is substandard and the high-memory-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high memory process.
  • substandard-sgemm-due-to-offline-cores:
    • Diagnose substandard SGEMM performance issues due to detected offline cores. If the sgemm-performance sign is substandard and the all-logical-cores-not-available sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the offline cores.
  • tcl-data-is-too-old:
    • Identify nodes where the most recent Tcl data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
  • tcl-data-missing:
    • Check that Tcl data is available.
  • tcl-not-found:
    • If no Tcl version is found and stderr contains the string 'command not found', then Tcl is not installed or incorrectly installed.
  • tcl-not-functional:
    • If no Tcl version is present or stderr is not empty, then Tcl may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Tcl is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
  • tcl-not-ssf:
    • If the Tcl version is less than 8.5, then Tcl is not Intel® Scalable System Framework (Intel® SSF) compliant.
  • tcl-not-uniform:
    • If the Tcl version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Tcl version, the greater the confidence that the node with the different version is incorrect.
  • threads-per-core-not-uniform:
    • Check for uniformity of threads per core among nodes having equivalent CPU(s) (for valid thread count per core).
  • threads-per-core-unusual:
    • Check to see if there is an unusual number of threads.
  • unable-to-obtain-ip-address:
    • If hostname -i does not return a valid IP address, the connector passes an empty string to the CLIPS slot for the IP address and this rule fires.
  • xp-cluster-mode-ambiguous:
    • Check if cluster mode for the Intel® Xeon Phi™ processor is undetermined.
  • xp-cluster-mode-not-uniform:
    • Check that the cluster mode for the Intel® Xeon Phi™ processor is uniform.
  • xp-cluster-mode-preferred:
    • Check that the cluster mode for the Intel® Xeon Phi™ processor is in preferred mode.
  • xp-data-source-numactl:
    • Check if cluster/memory mode for the Intel® Xeon Phi™ processor is undetermined.
  • xp-memory-mode-ambiguous:
    • Check if memory mode for the Intel® Xeon Phi™ processor is undetermined.
  • xp-memory-mode-not-uniform:
    • Check that the memory mode for the Intel® Xeon Phi™ processor is uniform.
  • xp-memory-mode-preferred:
    • Check that the memory mode for the Intel® Xeon Phi™ processor is in preferred mode.
  • xp-modes-data-is-too-old:
    • Identify nodes where the most recent Intel® Xeon Phi™ processor modes data is too old. Data is considered too old when there is no data from the last 7 days (604800 seconds).
  • xp-modes-data-missing:
    • Check if the modes data for the Intel® Xeon Phi™ processor is available.
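
Three conditions recur throughout the rules above: data that is too old (no data from the last 7 days), outliers (values outside the median +/- 6 * median absolute deviation), and non-uniform versions (a node whose version does not match at least 90% of the other nodes). The following is a minimal Python sketch of those three conditions, for illustration only. The actual rules are implemented in CLIPS, and every function name, parameter, and data layout here is hypothetical. The treatment of boundary values (for example, whether exactly 7 days counts as too old) is also an assumption.

```python
from collections import Counter
from statistics import median

# 7 days expressed in seconds, matching the 604800 figure quoted in the rules.
STALE_AFTER_SECONDS = 7 * 24 * 60 * 60


def is_too_old(latest_timestamp, now):
    """Return True if the newest data point is more than 7 days old.

    Both arguments are seconds since the epoch. Using a strict '>' at the
    boundary is an assumption, not documented behavior.
    """
    return (now - latest_timestamp) > STALE_AFTER_SECONDS


def find_outliers(samples, k=6):
    """Return the {node: value} pairs outside median +/- k * MAD.

    'samples' maps a (hypothetical) node name to one measured value.
    If the MAD is 0, any value that differs from the median is returned.
    """
    values = list(samples.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return {node: v for node, v in samples.items() if abs(v - med) > k * mad}


def non_uniform_nodes(versions, threshold=0.9):
    """Return nodes whose version matches fewer than 90% of the other nodes.

    This is a literal reading of the uniformity rules; 'versions' maps a
    (hypothetical) node name to a version string.
    """
    others = len(versions) - 1
    if others < 1:
        return []  # a single node has nothing to be compared against
    counts = Counter(versions.values())
    flagged = []
    for node, version in versions.items():
        same_as_me = counts[version] - 1  # exclude the node itself
        if same_as_me / others < threshold:
            flagged.append(node)
    return sorted(flagged)
```

For example, with samples of 100, 101, 99, 100, and 50 the median is 100 and the median absolute deviation is 1, so only the value 50 falls outside the range 94 to 106 and is reported as an outlier.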