Troubleshooting the kernel_parameters check


August 19, 2009 10:00 PM PDT


When a cluster consists of homogeneous compute nodes, Linux kernel parameters should be consistent across the cluster. However, when the hardware characteristics of compute nodes differ, the kernel parameters may not be consistent. In some cases, the inconsistency results from a failure or issue that must be addressed to ensure proper operation of the cluster. For example, a bad block of memory or an inconsistent BIOS configuration may alter the kernel parameters of a compute node. These issues could affect the overall performance of the cluster. Intel® Cluster Checker detects differences in the kernel parameters and reports which nodes in the cluster disagree.
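The kind of comparison described above can be illustrated with a short Python sketch. This is not how Intel® Cluster Checker is implemented; it only shows the idea of grouping hosts by the value each one reports for a parameter. The function name `find_inconsistent`, the host names, and the sample values are all hypothetical, and the "undefined value" label for a missing parameter is simply borrowed from the style of the tool's report.

```python
from collections import defaultdict

# Placeholder for a parameter that a host does not define at all (assumption).
MISSING = "undefined value"

def find_inconsistent(params_by_host):
    """Return {parameter: {value: [hosts]}} for parameters that differ between hosts."""
    # Collect every parameter name seen on any host.
    all_params = set()
    for params in params_by_host.values():
        all_params.update(params)

    report = {}
    for name in sorted(all_params):
        hosts_by_value = defaultdict(list)
        for host in sorted(params_by_host):
            value = params_by_host[host].get(name, MISSING)
            hosts_by_value[value].append(host)
        # More than one distinct value means the hosts disagree.
        if len(hosts_by_value) > 1:
            report[name] = dict(hosts_by_value)
    return report

# Hypothetical data: two nodes that agree on one parameter and differ on another.
sample = {
    "node-a": {"vm.swappiness": "60", "kernel.shmmax": "68719476736"},
    "node-b": {"vm.swappiness": "60", "kernel.shmmax": "33554432"},
}
print(find_inconsistent(sample))
```

Only parameters with more than one distinct value appear in the result, so a fully homogeneous cluster would produce an empty report.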

Variability in the characteristics of compute nodes may occur for many reasons. Some clusters are designed with a few nodes that contain larger amounts of memory or storage. Clusters in which nodes were replaced or added after the initial deployment may include servers with subtle hardware differences. Such variances may be acceptable, but they can still show up as inconsistent kernel parameters.

The kernel_parameters module can be configured to ignore these differences. Add one <exclude> tag for each kernel parameter that is allowed to vary between nodes; the content of the tag is the name of that kernel parameter. When a parameter is excluded, the module does not report differences in that parameter across the cluster.

For example, suppose a compute node has been replaced. The original compute nodes each contain a DVD drive, but the replacement node does not. The kernel_parameters module detects the differences that result from this variance and reports them as failures:

Linux Kernel Runtime Parameters, (kernel_parameters)...................FAILED
subtest 'dev.cdrom.autoclose' failed
- failing hosts compute-00-00 - compute-00-06 returned: '1'
- failing host compute-00-07 returned: 'undefined value'
subtest 'dev.cdrom.autoeject' failed
- failing hosts compute-00-00 - compute-00-06 returned: '0'
- failing host compute-00-07 returned: 'undefined value'
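To confirm a report like this by hand, one option is to capture the output of `sysctl -a` on a node from each group and compare the results. The sketch below assumes the usual `name = value` line format printed by `sysctl -a` on Linux; the function name `parse_sysctl` and the sample text are hypothetical and are not actual output from these nodes.

```python
def parse_sysctl(text):
    """Parse 'name = value' lines, as printed by `sysctl -a`, into a dict."""
    params = {}
    for line in text.splitlines():
        # Split on the first '=' only, since a value may itself contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines or anything that is not name = value
        params[name.strip()] = value.strip()
    return params

# Hypothetical captures: one node with an optical drive, one without.
with_drive = parse_sysctl("dev.cdrom.autoclose = 1\ndev.cdrom.autoeject = 0\n")
without_drive = parse_sysctl("vm.swappiness = 60\n")

# Parameters defined on the first node but absent on the second.
print(sorted(set(with_drive) - set(without_drive)))
```

A parameter that exists on one node but is absent on another, as with the dev.cdrom entries here, corresponds to the 'undefined value' result in the report.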

If this difference is determined to be acceptable, the following configuration block will tell this check to ignore the inconsistencies reported above.

<kernel_parameters>
  <exclude>dev.cdrom.autoclose</exclude>
  <exclude>dev.cdrom.autoeject</exclude>
</kernel_parameters>
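The effect of the exclusion list can be sketched in Python as follows. This is only an illustration of the filtering idea, not the tool's own code: the helper names `excluded_parameters` and `apply_excludes` are hypothetical, and the `vm.swappiness` difference is invented to show that parameters not listed in the block are still reported.

```python
import xml.etree.ElementTree as ET

# The configuration block from the article, embedded as a string.
CONFIG = """
<kernel_parameters>
  <exclude>dev.cdrom.autoclose</exclude>
  <exclude>dev.cdrom.autoeject</exclude>
</kernel_parameters>
"""

def excluded_parameters(config_xml):
    """Collect the parameter names listed in <exclude> tags."""
    root = ET.fromstring(config_xml.strip())
    return {el.text.strip() for el in root.iter("exclude") if el.text}

def apply_excludes(differences, excluded):
    """Drop differences whose parameter name is on the exclusion list."""
    return {name: detail for name, detail in differences.items()
            if name not in excluded}

# Hypothetical differences: one excluded parameter, one that is not excluded.
differences = {
    "dev.cdrom.autoclose": {"1": ["compute-00-00"], "undefined value": ["compute-00-07"]},
    "vm.swappiness": {"60": ["compute-00-00"], "10": ["compute-00-07"]},
}
remaining = apply_excludes(differences, excluded_parameters(CONFIG))
print(sorted(remaining))
```

Matching on the exact parameter name keeps the exclusion narrow, so an unrelated inconsistency is not hidden along with the accepted one.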

Use exclusions with discretion, however, because an excluded parameter can mask errors or unwanted differences. The best option is to resolve the inconsistencies; exclude specific kernel parameters only as a last resort.




This article applies to: Intel® Cluster Checker Knowledge Base, Intel® Cluster Ready Knowledge Base