Intel® Cluster Ready Partner Newsletter Q3 2012 - Tips & Tricks


Best Practices for ECC Memory Monitoring

Keep Your Cluster Up and Running. Monitoring the many different aspects of a cluster is important in the datacenter, and becomes more so as the cluster grows. Many utilities and products monitor a broad range of hardware and software events on a cluster, but a check for memory errors is often overlooked. Here is our recommended method for finding and monitoring ECC memory errors.

First, you must be on relatively recent hardware. A 4-node cluster was used in the test below. The machine was configured with:

  • S5520UR motherboard with Intel® Xeon® X5680 CPUs @ 3.33GHz
  • Samsung 2GB PC3-10600 DDR3-1333MHz ECC Registered memory DIMMs
  • Red Hat Enterprise Linux Server release 5.4

Confirm that ipmitool is installed (it is part of a standard RHEL install).
Then verify that the correct kernel modules are loaded:
[root@lucio:~]> lsmod |grep ipmi
ipmi_devintf           44753  0
ipmi_si                77965  0
ipmi_msghandler        72985  2 ipmi_devintf,ipmi_si

If ipmi_devintf and ipmi_si are not loaded, run:
modprobe ipmi_si
modprobe ipmi_devintf
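
The check-then-load step above can be wrapped in a small helper so it is safe to run repeatedly. This is a minimal sketch, not part of the original procedure: the function names module_listed and ensure_ipmi_modules are hypothetical, and it assumes the script runs as root so that modprobe can succeed.

```shell
#!/bin/sh
# Sketch: load the IPMI kernel modules only when they are missing.
# module_listed and ensure_ipmi_modules are hypothetical helper names.

# Succeeds if the module named in $1 appears in the first column of
# lsmod-style output read from stdin.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Loads ipmi_si and ipmi_devintf if lsmod does not already list them.
ensure_ipmi_modules() {
    for mod in ipmi_si ipmi_devintf; do
        if ! lsmod | module_listed "$mod"; then
            modprobe "$mod" || return 1
        fi
    done
}
```

Calling ensure_ipmi_modules before any ipmitool command leaves already-loaded modules alone and returns a non-zero status if a module cannot be loaded.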

To ensure everything is working, run (this will generate quite a bit of output if working correctly):
ipmitool sel list
<lots of output>


The last step is to count the uncorrectable ECC errors recorded in the event log:
[root@lucio:~]> ipmitool sel list | grep ECC | grep -ci uncorrectable
0

A zero indicates this machine has no uncorrectable errors present – that’s good!

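Now, run it on the compute nodes. Because the pipe is interpreted by the local shell, the grep commands run on the head node against the combined output, so the result is a single total across all nodes.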
[root@lucio:~]> tentakel ipmitool sel list | grep ECC | grep -ci uncorrectable
0

Again, a zero indicates no uncorrectable errors are present, which means the hardware is healthy.
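
The tentakel pipeline above produces one number for the whole cluster, because the grep stages run locally on the combined output. When that number is not zero, a per-node count shows which machine is affected. The following is a rough sketch under stated assumptions: it uses ssh rather than tentakel, assumes passwordless root ssh to every node, and the names per_node_counts and SEL_FETCH are hypothetical. SEL_FETCH only exists so that a different remote-execution command can be substituted for ssh.

```shell
#!/bin/sh
# Sketch: print "<node> <count>" for every node named on the command line.
# per_node_counts and SEL_FETCH are hypothetical names; ssh is an assumption.

per_node_counts() {
    for node in "$@"; do
        # Fetch the node's event log, then filter and count on the head node.
        count=$(${SEL_FETCH:-ssh} "$node" ipmitool sel list |
                grep ECC | grep -ci uncorrectable)
        printf '%s %s\n' "$node" "$count"
    done
}
```

For example, per_node_counts node01 node02 would print one line per node, where node01 and node02 stand in for your own host names.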

Here is an example of hardware that is seeing more errors.
[root@iago ~]# ipmitool sel list | grep ECC | grep -ci uncorrectable
12

Let’s take a look at those 12 issues.
[root@iago ~]# ipmitool sel list |grep ECC |grep -i uncorrectable
2e6c | 08/27/2008 | 11:02:40 | Memory #0x08 | Uncorrectable ECC | Asserted
2e80 | 08/27/2008 | 11:02:40 | Memory #0x08 | Uncorrectable ECC | Asserted
2e94 | 08/27/2008 | 11:02:41 | Memory #0x08 | Uncorrectable ECC | Asserted
2ea8 | 08/27/2008 | 11:02:41 | Memory #0x08 | Uncorrectable ECC | Asserted
2ebc | 08/27/2008 | 11:02:41 | Memory #0x08 | Uncorrectable ECC | Asserted
2ed0 | 08/27/2008 | 11:02:41 | Memory #0x08 | Uncorrectable ECC | Asserted
309c | 08/27/2008 | 11:28:07 | Memory #0x08 | Uncorrectable ECC | Asserted
30b0 | 08/27/2008 | 11:28:07 | Memory #0x08 | Uncorrectable ECC | Asserted
30c4 | 08/27/2008 | 11:28:08 | Memory #0x08 | Uncorrectable ECC | Asserted
30d8 | 08/27/2008 | 11:28:08 | Memory #0x08 | Uncorrectable ECC | Asserted
30ec | 08/27/2008 | 11:28:08 | Memory #0x08 | Uncorrectable ECC | Asserted
3100 | 08/27/2008 | 11:28:08 | Memory #0x08 | Uncorrectable ECC | Asserted
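
With a long event log, it can be easier to read a summary than the raw entries. The sketch below groups uncorrectable ECC entries by date and sensor, assuming the pipe-separated layout shown in the listing above (record ID, date, time, sensor, event, state). The function name summarize_uncorrectable is hypothetical.

```shell
#!/bin/sh
# Sketch: summarize uncorrectable ECC entries as "<count> <date> <sensor>".
# Reads "ipmitool sel list" output on stdin; assumes the field layout
# id | date | time | sensor | event | state.

summarize_uncorrectable() {
    grep ECC | grep -i uncorrectable |
        awk -F'|' '{
            # Trim the spaces around the date and sensor fields.
            gsub(/^ +| +$/, "", $2)
            gsub(/^ +| +$/, "", $4)
            n[$2 " " $4]++
        }
        END { for (k in n) print n[k], k }' |
        sort -k2
}
```

Usage: ipmitool sel list | summarize_uncorrectable. Entries that share a date and sensor collapse into a single line with their count.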

Don’t worry too much about a few errors. Most systems will have some, depending on factors such as the hardware and its utilization. What matters is running these tests at regular intervals to track the number of errors. An increase could indicate a developing problem, such as a memory DIMM going bad. Adding this test to a resource manager or monitor could help catch such problems early, saving troubleshooting time and possible node downtime.
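
One way to track the count over time is to append it to a log on every run and warn when it rises. This is only a sketch of the idea: the function name record_count, the log format, and the log path in the usage note are all hypothetical, and a real deployment would more likely feed the number into whatever monitoring system the cluster already uses.

```shell
#!/bin/sh
# Sketch: record a timestamped error count and warn when it has increased.
# record_count is a hypothetical helper; usage: record_count COUNT LOGFILE

record_count() {
    count=$1
    log=$2
    prev=0
    # The previous count is the second field of the last log line, if any.
    if [ -s "$log" ]; then
        prev=$(tail -n 1 "$log" | awk '{ print $2 }')
    fi
    printf '%s %s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$count" >> "$log"
    if [ "$count" -gt "$prev" ]; then
        echo "WARNING: uncorrectable ECC errors rose from $prev to $count"
    fi
}
```

Run from cron, a call such as record_count "$(ipmitool sel list | grep ECC | grep -ci uncorrectable)" /var/log/ecc_uncorrectable.log (the path is an assumption) builds a history of counts and prints a warning only when the count goes up.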



Certify Your Linux Installation – with Ease

Do you need a simple way to make a Linux installation Intel® Cluster Ready? The High Performance Computing Center at Stanford University has made available RPM packages that do exactly that. These meta-packages identify all Linux binary and library requirements to meet Intel Cluster Ready specification version 1.2.

With any installer that supports automatic dependency resolution, such as YUM or Zypper, a single RPM adds all required Linux distribution packages to the installation. A second package is provided for additional head node requirements. These RPMs do not include Intel® Cluster Checker or Intel runtime libraries, which are added separately.

RPMs are available for both Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). They have been tested on RHEL 5 and RHEL 6 clones, as well as openSUSE 11 variants.

To get the new RPMs, click here.

Look For Intel® Cluster Ready

When you see the Intel Cluster Ready name, you can be assured the cluster solution complies with the Intel Cluster Ready specification and has passed the tests of the Intel® Cluster Checker.

For more information read the Intel® Cluster Ready Usage Guidelines.

Look for Intel Cluster Ready solution vendors.

For details on compiler optimization, see our Optimization Notice.