Troubleshooting the kernel_modules check

When running the Intel® Cluster Checker kernel_modules check, you may encounter a diagnostic message like the following:

Kernel Module Correctness and Uniformity, (kernel_modules)...................FAILED
subtest 'joydev' failed
- failing hosts compute-0-2 - compute-0-7 returned: 'joydev 11841 0 '
- failing hosts compute-0-1, compute-0-8 returned: 'not loaded'

The kernel_modules check verifies that the set of loaded Linux* kernel modules is uniform on all the cluster nodes. A kernel module loaded on some nodes but not others will cause this check to fail.  
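The uniformity idea can be illustrated with a short Python sketch. This is not how Intel® Cluster Checker implements the check; it only shows the comparison in principle. The function name find_nonuniform_modules and the sample data are hypothetical, and the sketch assumes you have already collected the set of loaded module names for each node.

```python
from collections import defaultdict


def find_nonuniform_modules(loaded_by_node):
    """Return {module: (nodes_with, nodes_without)} for every module that is
    loaded on at least one node but not on all of them.

    loaded_by_node maps a node name to the set of module names loaded there.
    (Hypothetical helper for illustration only.)
    """
    all_nodes = sorted(loaded_by_node)
    nodes_with = defaultdict(list)
    for node in all_nodes:
        for module in loaded_by_node[node]:
            nodes_with[module].append(node)

    result = {}
    for module, present in nodes_with.items():
        if len(present) != len(all_nodes):
            missing = [n for n in all_nodes if n not in present]
            result[module] = (present, missing)
    return result


if __name__ == "__main__":
    # Made-up sample data: joydev is loaded on only one of three nodes.
    sample = {
        "compute-0-1": {"ext3", "e1000"},
        "compute-0-2": {"ext3", "e1000", "joydev"},
        "compute-0-8": {"ext3", "e1000"},
    }
    for module, (present, missing) in sorted(find_nonuniform_modules(sample).items()):
        print(f"{module}: loaded on {present}, not loaded on {missing}")
```

A module that appears in the result corresponds to a failing subtest in the diagnostic output shown earlier.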

The name of the Linux kernel module is displayed in the subtest description: subtest 'joydev' failed.  The output under the subtest heading identifies the nodes on which the kernel module is loaded (nodes compute-0-2 through compute-0-7, inclusive) and the nodes on which it is not loaded (nodes compute-0-1 and compute-0-8).  For the nodes that have loaded the kernel module, the diagnostic information also includes the corresponding output from the lsmod utility: joydev 11841 0
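To read the lsmod portion of the message, recall that lsmod prints three columns: the module name, its size, and the number of users, optionally followed by a comma-separated list of the modules that use it. The following Python sketch splits such a line into its fields. The names LsmodEntry and parse_lsmod_line are hypothetical and are not part of Intel® Cluster Checker.

```python
from typing import NamedTuple, Optional


class LsmodEntry(NamedTuple):
    name: str        # module name, e.g. "joydev"
    size: int        # module size as reported by lsmod
    use_count: int   # number of users of the module
    used_by: tuple   # names of modules that depend on this one (may be empty)


def parse_lsmod_line(line: str) -> Optional[LsmodEntry]:
    """Parse one line of lsmod output.

    Returns None for anything that is not a module row, such as the
    lsmod header line or the 'not loaded' text in the diagnostic message.
    (Hypothetical helper for illustration only.)
    """
    fields = line.split()
    if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
        return None
    users = tuple(fields[3].split(",")) if len(fields) > 3 else ()
    return LsmodEntry(fields[0], int(fields[1]), int(fields[2]), users)


if __name__ == "__main__":
    print(parse_lsmod_line("joydev 11841 0 "))
    print(parse_lsmod_line("not loaded"))
```

In the example message, the use count of 0 indicates that nothing was using joydev at the time the check ran.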

Now that you understand the diagnostic message, how do you resolve it?  First, you must understand why the kernel module was loaded on some, but not all, of the nodes. 

Software configuration is one reason why a Linux kernel module may be loaded on some nodes but not others. Verify that the file /etc/modprobe.conf is identical on all the nodes. If you believe that this or other files may differ on your cluster, try running the file_tree check.
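If you want a quick manual comparison, one approach is to gather the contents of /etc/modprobe.conf from each node and group the nodes by a digest of those contents. The Python sketch below assumes the file contents have already been collected by whatever means you normally use; the function name group_nodes_by_content and the sample data are hypothetical. It is not a substitute for the file_tree check.

```python
import hashlib
from collections import defaultdict


def group_nodes_by_content(content_by_node):
    """Group node names by the SHA-256 digest of a file's contents.

    content_by_node maps a node name to the file contents as bytes.
    Returns a list of node groups, largest group first; more than one
    group means the file is not identical on all nodes.
    (Hypothetical helper for illustration only.)
    """
    groups = defaultdict(list)
    for node in sorted(content_by_node):
        digest = hashlib.sha256(content_by_node[node]).hexdigest()
        groups[digest].append(node)
    return sorted(groups.values(), key=len, reverse=True)


if __name__ == "__main__":
    # Made-up sample contents: one node has an extra line.
    sample = {
        "compute-0-1": b"alias eth0 e1000\n",
        "compute-0-2": b"alias eth0 e1000\n",
        "compute-0-3": b"alias eth0 e1000\nblacklist joydev\n",
    }
    for group in group_nodes_by_content(sample):
        print(group)
```

Nodes that fall outside the largest group are the first place to look for a configuration difference.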

Another very common reason is that some of the nodes have a different hardware configuration; many kernel modules will be loaded only if the Linux kernel detects specific hardware.  Hardware configuration differences are typically flagged by the pci and/or dmidecode checks.  If issues are also detected by these checks, try resolving them first to see if that also takes care of the kernel_modules issue. 

Sometimes hardware configuration differences are not readily apparent.  For example, plugging in a keyboard, mouse, and monitor to work on a node triggers the loading of some kernel modules.  A USB flash drive has a similar effect.  The kernel modules remain loaded even after you remove the device(s); they are unloaded only when the node is rebooted.  This can lead to what appear to be intermittent issues on a seemingly random set of nodes.  Kernel modules that may be loaded as a result of using a keyboard, USB flash drive, etc. include joydev and usb_storage.  To resolve issues related to these particular modules, you can configure the kernel_modules check to exclude them from its verification.


You should not exclude a kernel module from verification unless you first understand why it's loaded on only some of the nodes.  Otherwise you may be ignoring a real issue that could significantly impact the operation of your cluster.

Refer to the Module Reference Guide included with Intel® Cluster Checker for more information about the kernel_modules check and how to configure it.