<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Fri, 25 May 2012 13:30:26 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-cluster-ready-kb/type/errors-diagnostics/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/intel-cluster-ready-kb/type/errors-diagnostics/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Using Cluster SSH with Intel® Cluster Checker</title>
      <description><![CDATA[ <span class="sectionHeading">Symptom<br /></span><br />The <span >process_check</span> module of Intel Cluster Checker sometimes reports an issue related to <a href="http://sourceforge.net/projects/clusterssh/">Cluster SSH</a>:<br /><br />
<blockquote>Stale Process Check, (process_check)................................................................FAILED<br />subtest 'Percent cpu usage is greater than 5%' failed<br />- failing host compute-00-14 returned: 'pid=8736 (cssh)'</blockquote>
<br /><span class="sectionHeading">Resolution</span><br /><br />Configure Intel® Cluster Checker to ignore the CPU usage of Cluster SSH:<br /><br />
<blockquote>&lt;process_check&gt;<br />  &lt;exclude&gt;cssh&lt;/exclude&gt;<br />&lt;/process_check&gt;</blockquote>
<br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-cluster-ssh-with-intel-cluster-checker/</link>
      <pubDate>Wed, 03 Feb 2010 22:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-cluster-ssh-with-intel-cluster-checker/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-cluster-ssh-with-intel-cluster-checker/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>How to resolve dat_conf issues affecting Platform MPI</title>
      <description><![CDATA[ <span class="sectionHeadingText"><span class="sectionHeading">Symptom</span><br /></span><br />The Intel® Cluster Checker test <span >dat_conf</span> verifies that the file /etc/dat.conf does not contain any entries known to affect some versions of Platform MPI* (formerly HP MPI*).  The error may appear similar to the following example:<br /><br />
<blockquote>Valid dat.conf entries, (dat_conf).....................................FAILED<br />subtest 'Device: ehca0' failed<br />- failing hosts compute-00-00 - compute-00-05 returned: '/etc/dat.conf contains Interface Adapter entries known to trigger a fault in HP MPI version 2.2.5 or earlier. Please contact Hewlett-Packard for more information.'</blockquote>
<br /><span class="sectionHeading">Cause<br /></span><br />Some versions of the Open Fabrics Enterprise Distribution (OFED) insert example entries into /etc/dat.conf.  However, some older versions of HP MPI / Platform MPI assume that all /etc/dat.conf entries are valid and active.  <br /><br /><span class="sectionHeading">Resolution</span><br /><br />Remove or comment out all entries in /etc/dat.conf that are not valid. <br /><br />For example, the following two entries have been commented out by inserting the '#' character at the beginning of each line: <br /><br />
<blockquote># OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""<br /># OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""</blockquote>
<br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-resolve-dat_conf-issues-affecting-platform-mpi/</link>
      <pubDate>Wed, 27 Jan 2010 22:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-resolve-dat_conf-issues-affecting-platform-mpi/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-resolve-dat_conf-issues-affecting-platform-mpi/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Troubleshooting InfiniBand connection issues using OFED tools</title>
      <description><![CDATA[ The Open Fabrics Enterprise Distribution (OFED) package has many debugging tools available as part of the standard release. This article describes the use of those tools to troubleshoot the hardware and firmware of an InfiniBand fabric deployment.<br /><br />First, the <span >/sys/class</span> sub-system should be checked to verify that the hardware is up and connected to the InfiniBand fabric.  The following command will show the InfiniBand hardware modules recognized by the system:<br /><br />
<blockquote>ls /sys/class/infiniband</blockquote>
<br />This example will use the module <span >mlx4_0</span>, which is typical for the Mellanox ConnectX* series of adapters. If this, or a similar module, is not found, refer to the documentation that came with the OFED package on starting the OpenIB drivers.<br /><br />Next, check the state of the InfiniBand port: <br /><br />
<blockquote>cat /sys/class/infiniband/mlx4_0/ports/1/state</blockquote>
<br />This command should return “ACTIVE” if the hardware is initialized and the subnet manager has found the port and added it to the InfiniBand fabric. If this command returns “INIT”, the hardware is initialized, but the subnet manager has not yet added the port to the fabric.  <br /><br />If necessary, start the subnet manager:<br /><br />
<blockquote>/etc/init.d/opensmd start</blockquote>
<br />Once the port on the head node is in the “ACTIVE” state, check the state of the InfiniBand port on all the compute nodes to ensure that all of the InfiniBand hardware on the compute nodes has been initialized and that the subnet manager has added all of the compute nodes' ports to the fabric. This article will use the pdsh tool to run the command on all nodes: <br /><br />
<blockquote>pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/state</blockquote>
<br />All nodes should report “ACTIVE”. If a node reports it cannot find the file, ensure the OpenIB drivers are loaded on that node. Refer to the documentation that came with the OFED package on starting the OpenIB drivers.<br /><br />Once all of the compute nodes report that port 1 is “ACTIVE”, verify the speed on each port using the following commands: <br /><br />
<blockquote>cat /sys/class/infiniband/mlx4_0/ports/1/rate<br />pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/rate</blockquote>
<br />This is a good first check for a bad cable or connection.  Each port should report the same speed. For example, the output for double data rate (DDR) InfiniBand cards will be similar to “20 Gb/sec (4X DDR)”.<br /><br />Once the above basic checks are complete, more in-depth troubleshooting can be performed. The main OFED tool for troubleshooting performance and connection problems is <span >ibdiagnet</span>. This tool runs multiple tests, selected by command line options, to detect errors related to the subnet, bad packets, and bad states. These errors are among the more common ones seen during initial setup of InfiniBand fabrics. <br /><br />Run <span >ibdiagnet</span> with the following command line options:<br /><br />
<blockquote>ibdiagnet -pc -c 1000</blockquote>
<br />The output will be similar to this:<br />
<blockquote>Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2<br />-W- Topology file is not specified.<br />Reports regarding cluster links will use direct routes.<br />Loading IBDM from: /usr/lib64/ibdm1.2<br />-W- A few ports of local device are up.<br />Since port-num was not specified (-p option), port 1 of device 1 will be<br />used as the local port.<br />-I- Discovering ... 17 nodes (1 Switches &amp; 16 CA-s) discovered.<br /><br /><br />-I---------------------------------------------------<br />-I- Bad Guids/LIDs Info<br />-I---------------------------------------------------<br />-I- No bad Guids were found<br /><br />-I---------------------------------------------------<br />-I- Links With Logical State = INIT<br />-I---------------------------------------------------<br />-I- No bad Links (with logical state = INIT) were found<br /><br />-I---------------------------------------------------<br />-I- PM Counters Info<br />-I---------------------------------------------------<br />-I- No illegal PM counters values were found<br /><br />-I---------------------------------------------------<br />-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)<br />-I---------------------------------------------------<br /><br />-I---------------------------------------------------<br />-I- IPoIB Subnets Check<br />-I---------------------------------------------------<br />-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00<br />-W- No members found for group<br /><br />-I---------------------------------------------------<br />-I- Bad Links Info<br />-I- Errors have occurred on the following links<br />(for errors details, look in log file /tmp/ibdiagnet.log):<br />-I---------------------------------------------------<br />Link at the end of direct route "1,5"<br />----------------------------------------------------------------<br />-I- Stages Status Report:<br />STAGE Errors Warnings<br />Bad GUIDs/LIDs Check 0 0 <br />Link 
State Active Check 0 0 <br />Performance Counters Report 0 0 <br />Partitions Check 0 0 <br />IPoIB Subnets Check 0 1 <br />Link Errors Check 0 0 <br /><br />Please see /tmp/ibdiagnet.log for complete log<br />----------------------------------------------------------------<br />-I- Done. Run time was 9 seconds.</blockquote>
<br />The warning “No members found for group” can safely be ignored. <br /><br />In this example, a bad link was found: “Link at the end of direct route "1,5"”.  "1,5" refers to the LID numbers associated with the individual ports. The following commands can be used to identify the LID numbers associated with each port:<br />
<blockquote>cat /sys/class/infiniband/mlx4_0/ports/1/lid<br />pdsh -a cat /sys/class/infiniband/mlx4_0/ports/1/lid</blockquote>
<br />These commands generate a list of the LIDs associated with the nodes. In the output, locate the entries for 0x1 and 0x5.  0x1 is likely the head node.   For errors of this type, reseat or replace the InfiniBand cable connecting the node corresponding to LID 0x5.<br /><br />Finally, run <span >ibdiagnet</span> one more time to verify there are no errors, and then run <span >ibcheckerrors</span> to check the error state of each port. Each test should pass. <br /><br />
<blockquote>ibdiagnet -pc -c 1000<br />ibcheckerrors</blockquote>
<br />
			 ]]></description>
      <link>http://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-ofed-tools/</link>
      <pubDate>Wed, 20 Jan 2010 22:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-ofed-tools/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/troubleshooting-infiniband-connection-issues-using-ofed-tools/</guid>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Configuration of threshold values in the HPCC test module</title>
      <description><![CDATA[ <p>The HPC Challenge (HPCC) benchmark suite is a common method to gauge the performance of a cluster.  HPCC consists of seven benchmarks that measure a spectrum of system characteristics.  The hpcc module for Intel® Cluster Checker runs the HPCC benchmark suite on the cluster and reports 'Succeeded' or 'Failed' based on the outcome of the tests.</p>
<p>This article does not cover descriptions or definitions of the individual HPCC benchmarks.  For more information about the HPCC benchmarks, see <a href="http://icl.cs.utk.edu/hpcc/">http://icl.cs.utk.edu/hpcc/</a>.</p>
<p class="sectionHeading">Module configuration affects the results of the hpcc tests</p>
<p>Whether the hpcc module succeeds or fails depends on the configuration of the module in the Intel® Cluster Checker configuration file.  The module will execute the HPCC benchmark over each network fabric that is configured in the hpcc module block in the input configuration file.  For each network fabric configured, a performance threshold that must be achieved for a successful result can optionally be set for each individual HPCC benchmark.  If a performance threshold is not set, then success of a test is based solely on the benchmark running to completion.</p>
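<p>As a minimal sketch of the no-threshold case, and assuming that a fabric block may list only the device (this article does not show such a block, so treat it as illustrative), the fragment below reuses the tag names from the example configuration later in this article and omits every threshold tag.  Under the rule above, each benchmark run over the sock device would then be judged only on whether it runs to completion.  Other hpcc settings, such as the compiler and library paths, are left out here:</p>
<blockquote>
<p>&lt;hpcc&gt;</p>
<p>        &lt;fabric&gt;</p>
<p>              &lt;device&gt;sock&lt;/device&gt;</p>
<p>        &lt;/fabric&gt;</p>
<p>&lt;/hpcc&gt;</p>
</blockquote>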
<p class="sectionHeading">Results when threshold values are configured</p>
<p>When threshold values are set, a benchmark must meet or exceed the configured performance value.  Depending on the benchmark, that may mean a result that is equal to or greater than the configured threshold OR a result that is equal to or less than the configured threshold.</p>
<table border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td width="139" valign="top">
<p><b>hpcc module configuration tag</b></p>
</td>
<td width="114" valign="top">
<p><b>Measurement unit</b></p>
</td>
<td width="192" valign="top">
<p><b>Output characteristics</b></p>
</td>
<td width="145" valign="top">
<p><b>Passing result</b></p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>bandwidth</p>
</td>
<td width="114" valign="top">
<p>GB/s</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>dgemm</p>
</td>
<td width="114" valign="top">
<p>GFLOPS</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>fft</p>
</td>
<td width="114" valign="top">
<p>GFLOPS</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>hpl</p>
</td>
<td width="114" valign="top">
<p>TFLOPS</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>latency</p>
</td>
<td width="114" valign="top">
<p>µs</p>
</td>
<td width="192" valign="top">
<p>Lower is better</p>
</td>
<td width="145" valign="top">
<p>Equal or less</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>ptrans</p>
</td>
<td width="114" valign="top">
<p>GB/s</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>randomaccess</p>
</td>
<td width="114" valign="top">
<p>GUPs</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
<tr>
<td width="139" valign="top">
<p>stream</p>
</td>
<td width="114" valign="top">
<p>GB/s</p>
</td>
<td width="192" valign="top">
<p>Higher is better</p>
</td>
<td width="145" valign="top">
<p>Equal or greater</p>
</td>
</tr>
</tbody>
</table>
<p><br />If one of the benchmarks does not meet the configured threshold value, the module will report a failing result identifying the network fabric and the individual failing benchmark(s).  For example, using the following configuration, the hpcc module reported the following failure:</p>
<blockquote>
<p>&lt;hpcc&gt;</p>
<p>        &lt;cc-path&gt;/opt/intel/cce/11.0.069/&lt;/cc-path&gt;</p>
<p>        &lt;fabric&gt;</p>
<p>              &lt;bandwidth&gt;0.003&lt;/bandwidth&gt;</p>
<p>              &lt;device&gt;sock&lt;/device&gt;</p>
<p>              &lt;dgemm&gt;5.76&lt;/dgemm&gt;</p>
<p>              &lt;fft&gt;0.4&lt;/fft&gt;</p>
<p>              &lt;hpl&gt;0.04&lt;/hpl&gt;</p>
<p>              &lt;latency&gt;40&lt;/latency&gt;</p>
<p>              &lt;ptrans&gt;0.10&lt;/ptrans&gt;</p>
<p>              &lt;randomaccess&gt;0.008&lt;/randomaccess&gt;</p>
<p>              &lt;stream&gt;1.4&lt;/stream&gt;</p>
<p>        &lt;/fabric&gt;</p>
<p> </p>
<p>        &lt;mkl-path&gt;/opt/intel/cmkl/10.1.0.015/&lt;/mkl-path&gt;</p>
<p>        &lt;mpi-path&gt;/opt/intel/impi/3.2/&lt;/mpi-path&gt;</p>
<p>        &lt;process-number&gt;8&lt;/process-number&gt;</p>
<p>        &lt;thread-number&gt;1&lt;/thread-number&gt;</p>
<p>&lt;/hpcc&gt;</p>
</blockquote>
<p> </p>
<blockquote>
<p>HPC Challenge Benchmark (Intel(R) C++ Compiler, Intel(R) MPI</p>
<p>Library, Intel(R) Math Kernel Library), (hpcc)</p>
<p>Attention: this check may take a long time to complete......FAILED</p>
<p>subtest 'PTRANS, GB/s (device = sock)' failed</p>
<p>  - failing All hosts returned: '0.0817186'</p>
</blockquote>
<p> <br />The module reported a failure because the PTRANS result was 0.0817186 GB/s, which did not meet or exceed the configured value of 0.10 GB/s.</p>
<p class="sectionHeading">What do failures to meet thresholds mean?</p>
<p>Many system characteristics affect the results of the HPCC benchmark suite, and a reported test failure does not necessarily indicate an under-performing or malfunctioning cluster.  Processor speeds, network characteristics, and memory architecture, for instance, all factor into the measured results.  Changes in the characteristics of any of those components or sub-systems can affect the outcome of the tests.  Therefore, a failure to meet a threshold may be the result of a value configured too high for the characteristics of a particular cluster.  The thresholds can be reset to levels that are more appropriate for the specific system to resolve the issue.</p>
<p>If a cluster that has historically passed the hpcc module with threshold values configured begins to fail the test consistently, this may indicate a problem with one or more components in the system.  Make sure that Intel® Cluster Checker is the only application running on the system; other applications running concurrently are likely to impact the measured results of the benchmarks.  If failures to meet thresholds persist and there have been no changes to the hardware characteristics of the cluster, then there may be an issue causing the system to exhibit degraded performance that should be resolved.</p>
<p>Intermittent failures to meet threshold values may be the result of threshold levels that are set too high to account for the natural fluctuations in performance of the system.  For example, with the PTRANS configuration above, the threshold is set to 0.10.  A given cluster may exhibit performance that routinely yields 0.11 GB/s but has fluctuations ranging from 0.095 to 0.12 GB/s.  Any fluctuations that dip below the 0.10 threshold will be flagged as a failure.  Threshold values should be configured to account for some fluctuations in results, so a better threshold for this example may be 0.09 GB/s.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/configuration-of-threshold-values-in-the-hpcc-test-module/</link>
      <pubDate>Sun, 22 Nov 2009 22:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/configuration-of-threshold-values-in-the-hpcc-test-module/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/configuration-of-threshold-values-in-the-hpcc-test-module/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>How to Deal with file_tree issues in Intel® Cluster Checker 1.3</title>
      <description><![CDATA[ <p><em>Note: this article describes behavior in Intel® Cluster Checker version 1.3 Update 2 or earlier.</em><br /><br />The file_tree Intel® Cluster Checker test module verifies that a consistent set of files is present on all cluster nodes.  This consistency is a requirement of the Intel® Cluster Ready Specification.  However, some files are expected to contain data unique to each node, such as hostname and IP address.  While the file_tree test automatically excludes many common examples of such files from its consistency check, special cases exist that are not automatically handled (see the end of this article for a list).  <br /><br />If the file_tree test identifies files that are not consistent across the cluster, you should manually determine whether the reported difference must be resolved by applying the following rules:</p>
<ul>
<li>Does the reported file contain information unique to each node, such as a time stamp or network configuration?  If so, manually verify that, other than the node-unique data, the file is identical on all nodes.  </li>
<li>Does the file name itself or the path to the file contain node-unique information?  If so, are similar files present on all the nodes?</li>
<li>Is the file dynamically generated or modified on each node, e.g., a log file, a file compiled from source, or a file modified by the prelink utility?  If so, a time stamp or other node-unique information may be embedded.</li>
</ul>
<p>If you can determine why the file is different on each node and verify that the differences are not material, then you may ignore the reported errors.  If you are certifying a cluster design as Intel® Cluster Ready, you should include a brief description of why you believe the reported errors are immaterial with your submitted Intel® Cluster Checker output logs.  You should document the files you manually resolved with your cluster design so that other engineers and technicians at your company understand that they can ignore file_tree errors with these files, but should not ignore errors reported for other files.  <br /><br />Forthcoming versions of Intel® Cluster Checker will have the capability to configure which files should be automatically excluded.<br /><br /><span class="sectionHeading">Files that are known to differ from node to node and are not automatically handled by Intel® Cluster Checker 1.3</span></p>
<ul>
<li>/opt/mlnx-ofed/src/* </li>
<li>/opt/rocks/lib/graphviz/config </li>
<li>/opt/torque/lib64/xpbs* </li>
<li>/opt/torque/mom_logs/* </li>
<li>/usr/java/jdk1.6.0_14/jre/lib/servicetag/registration.xml </li>
<li>/usr/java/jdk1.6.0_14/register.html </li>
<li>/usr/java/jdk1.6.0_14/register_ja.html </li>
<li>/usr/java/jdk1.6.0_14/register_zh_CN.html </li>
<li>/usr/java/i386/jre1.6.0_12/lib/i386/client/classes.jsa </li>
</ul> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-to-deal-with-file_tree-issues-in-intel-cluster-checker-13/</link>
      <pubDate>Mon, 28 Sep 2009 22:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-deal-with-file_tree-issues-in-intel-cluster-checker-13/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-to-deal-with-file_tree-issues-in-intel-cluster-checker-13/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Troubleshooting the kernel_parameters check</title>
      <description><![CDATA[ When clusters consist of homogeneous compute nodes, Linux kernel parameters should be consistent across the cluster. However, when there are differences in the hardware characteristics of compute nodes, the kernel parameters may not be consistent. In some cases, this may be the result of a failure or issue that needs to be addressed to ensure proper operation of the cluster. A bad block of memory or inconsistent BIOS configuration may alter the kernel parameters of a compute node. These issues could potentially affect the overall performance of the cluster. Intel® Cluster Checker detects differences in the kernel parameters and reports differences between the nodes in the cluster.<br /><br />Variability in the characteristics of compute nodes may occur for many reasons. Some clusters are designed with a few nodes that contain larger amounts of memory and/or storage. Clusters that have nodes replaced or added to the system after the initial deployment may include servers that have subtle hardware differences. These variances may be acceptable differences but can manifest in inconsistent kernel parameters. <br /><br />The <span >kernel_parameters</span> module can be configured to ignore these differences. An <span >&lt;exclude&gt;</span> tag can be used for each kernel parameter that varies between the nodes.  The value of the tag is the name of the kernel parameter that is allowed to vary between nodes.  When this tag is used, the module will not report differences in the specified parameter across the cluster. <br /><br />For example, a compute node has been replaced. The original compute nodes each contain a DVD drive, but the replacement node does not. The <span >kernel_parameters</span> module detects differences related to this variance in compute nodes.<br /><br />
<blockquote>Linux Kernel Runtime Parameters, (kernel_parameters)...................FAILED<br />subtest 'dev.cdrom.autoclose' failed<br />- failing hosts compute-00-00 - compute-00-06 returned: '1'<br />- failing host compute-00-07 returned: 'undefined value'<br />subtest 'dev.cdrom.autoeject' failed<br />- failing hosts compute-00-00 - compute-00-06 returned: '0'<br />- failing host compute-00-07 returned: 'undefined value'</blockquote>
<br />If this difference is determined to be acceptable, the following configuration block will tell this check to ignore the inconsistencies reported above.<br /><br />
<blockquote>&lt;kernel_parameters&gt;<br />  &lt;exclude&gt;dev.cdrom.autoclose&lt;/exclude&gt;<br />  &lt;exclude&gt;dev.cdrom.autoeject&lt;/exclude&gt;<br />&lt;/kernel_parameters&gt;</blockquote>
<br />Excluding kernel parameters should be used with some discretion, however, as this can mask errors or differences that may not be desired.  The best option is to resolve inconsistencies and only exclude specific kernel parameters as a last resort.<br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/troubleshooting-the-kernel_parameters-check/</link>
      <pubDate>Wed, 19 Aug 2009 22:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/troubleshooting-the-kernel_parameters-check/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/troubleshooting-the-kernel_parameters-check/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Troubleshooting STREAM bandwidth issues</title>
      <description><![CDATA[ <p>Intel® Cluster Checker uses the <a href="http://www.cs.virginia.edu/stream/">STREAM</a> benchmark to verify the memory performance of each cluster node.  STREAM consists of several individual benchmark tests; the <span >memory_bandwidth_stream</span> check only uses the 'Triad' benchmark.  When running the memory_bandwidth_stream check, you may encounter a diagnostic message similar to the following:<br /><br />
<blockquote>Single-node Memory Bandwidth (STREAM), (memory_bandwidth_stream).......FAILED<br />subtest 'Triad' failed<br />- failing host compute-00-03 returned: '15977.7 MB/s'<br />- failing host compute-00-01 returned: '15997.1 MB/s'<br />- failing host compute-00-00 returned: '16004.3 MB/s'<br />- failing host compute-00-02 returned: '16126.5 MB/s'</blockquote>
<br />The failure reported above occurs because the measured bandwidth of the STREAM Triad benchmark on the nodes is less than the threshold value configured by the <span >&lt;bandwidth&gt;</span> tag in the configuration file: <br /><br />
<blockquote>&lt;memory_bandwidth_stream&gt;<br />  &lt;bandwidth&gt;27000&lt;/bandwidth&gt; <br />&lt;/memory_bandwidth_stream&gt;</blockquote>
<br />The performance for the STREAM benchmark is sensitive to the characteristics of the processor(s), motherboard, and memory used in the system. Failures to achieve the configured performance threshold may not actually be a system fault. It’s possible that the threshold value is set or tuned for higher performing processors and memory. <br /><br />If all of the following statements are true, then it likely indicates the <span >&lt;bandwidth&gt;</span> performance threshold needs to be tuned for the performance of the individual cluster nodes:<br />
<ul>
<li>The <span >memory_bandwidth_stream</span> check has always reported a failure to achieve the configured performance threshold on the system (no previous runs ever met the specified performance threshold)</li>
<li>All the nodes in a cluster fail to achieve the configured performance threshold</li>
<li>All the nodes achieve relatively similar performance levels</li>
</ul>
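<br />As an illustration of such tuning, and assuming the four results in the sample output above are representative of the cluster, the lowest measured value is 15977.7 MB/s.  Leaving about 10% of headroom below that result gives 0.9 × 15977.7 ≈ 14380 MB/s, so a configuration along these lines (the value is an assumption for this example, not a recommended setting) would let the sample results pass:<br /><br />
<blockquote>&lt;memory_bandwidth_stream&gt;<br />  &lt;bandwidth&gt;14380&lt;/bandwidth&gt;<br />&lt;/memory_bandwidth_stream&gt;</blockquote>
<br />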
When setting the performance threshold, use a value that is about 90% of the lowest measured performance. This allows for some normal fluctuation in the results.<br /><br />Memory performance may also be sensitive to where memory is located on the motherboard and to the BIOS settings.  If all the nodes have identical hardware but one node consistently reports lower performance, verify that the same memory slots are populated and that the BIOS memory options are set consistently.<br /><br />If the memory performance is inconsistent from run to run, there may be other processes on the node consuming resources.  Verify that no other programs are running on the nodes before starting this check.<br /><br />If the cluster has heterogeneous hardware, a single performance threshold may not be appropriate for all the nodes.  This <a href="http://software.intel.com/en-us/articles/how-to-use-a-single-intel-cluster-checker-configuration-file-for-different-configurations/">knowledge base article</a> describes how to configure Intel® Cluster Checker for heterogeneous clusters. </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/troubleshooting-stream-bandwidth-issues/</link>
      <pubDate>Wed, 19 Aug 2009 22:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/troubleshooting-stream-bandwidth-issues/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/troubleshooting-stream-bandwidth-issues/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Troubleshooting the dmidecode check</title>
      <description><![CDATA[ <p>The <span >dmidecode</span> check uses the <a href="http://www.nongnu.org/dmidecode/">dmidecode utility</a> to retrieve system information from the SMBIOS tables and checks for consistency across the cluster. Each subtest examines a particular SMBIOS field. If any of the fields differ across the cluster nodes, then the subtest reports a failing result. There are many fields that are compared by the <span >dmidecode</span> check. Here are some examples of how the check reports differences found between nodes.</p>
<blockquote>subtest 'BIOS Information (0x0000): Release Date' failed<br />- failing hosts compute-00-00 - compute-00-05, compute-00-07 returned: '03/09/2009'<br />- failing host compute-00-06 returned: '03/12/2009'</blockquote>
<br />In the example above, the BIOS release date is reported to be different on one of the compute nodes. This implies the BIOS versions are not the same in all the compute nodes.  Note that the <span >dmidecode</span> check does not assume that either of the BIOS release dates is correct, only that the field is not consistent across the cluster.<br /><br />
<blockquote>subtest 'Base Board Information (0x0200): Product Name' failed<br />- failing hosts compute-00-00 - compute-00-02, compute-00-07 returned: '0AJ123A'<br />- failing hosts compute-00-03 - compute-00-06 returned: '0AK456A'</blockquote>
<br />The second example indicates that two types of motherboards are used in the cluster.  This can result from the compute nodes using different motherboard models or simply different generations of the same motherboard.<br /><br />
<blockquote>subtest 'Memory Device (0x1100): Asset Tag' failed<br />- failing hosts compute-00-00, compute-00-02 - compute-00-06 returned: '0108001D'<br />- failing hosts compute-00-01, compute-00-07 returned: '01082803'</blockquote>
<br />The third example shows that the memory used in the nodes is not uniform: two of the nodes have a different memory configuration than the rest of the cluster. Using a mix of memory sizes, performance characteristics, etc. can adversely and unexpectedly impact overall cluster performance.<br /><br />Issues reported by the <span >dmidecode</span> check do not necessarily mean that a cluster will exhibit any problems or failures; however, consistency across cluster nodes is highly desirable.  If there are legitimate or desired differences between compute nodes, the Intel® Cluster Checker group capability and/or individual subtest exclusions can be configured to instruct the <span >dmidecode</span> check to ignore specific differences between nodes.<br /><br />Specifying a group in the <span >dmidecode</span> configuration restricts the consistency check to defined subsets of nodes, and only differences between nodes within a group are reported as issues. For instance, the following configuration option causes the <span >dmidecode</span> check to compare all nodes defined in the <span >mthrbd2</span> group to each other, but a node in the <span >mthrbd2</span> group is not compared to a node outside this group. <br /><br />
<blockquote>&lt;dmidecode&gt;<br />  &lt;group name="mthrbd2"/&gt;<br />&lt;/dmidecode&gt;</blockquote>
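<br />The effect of restricting the comparison to a group can be illustrated with a short Python sketch. This is not how Intel® Cluster Checker is implemented; the function name <span >inconsistent_groups</span>, the <span >default</span> group label, and the assignment of hosts to groups are assumptions made only for illustration. The product names are taken from the second example above.<br /><br />

```python
from collections import defaultdict


def inconsistent_groups(values, groups):
    """Return, per group, the distinct values seen when a group disagrees.

    values: dict mapping host name to the value reported for one field.
    groups: dict mapping host name to a group label (hypothetical labels).
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, value in values.items():
        seen[groups[host]][value].append(host)
    # A group is inconsistent only if its own members report different values.
    return {
        group: {value: sorted(hosts) for value, hosts in by_value.items()}
        for group, by_value in seen.items()
        if len(by_value) != 1
    }


# Product names taken from the second example above (four hosts only).
product_name = {
    "compute-00-00": "0AJ123A",
    "compute-00-02": "0AJ123A",
    "compute-00-03": "0AK456A",
    "compute-00-06": "0AK456A",
}
# Hypothetical assignment: nodes with the second board type are in mthrbd2.
group_of = {
    "compute-00-00": "default",
    "compute-00-02": "default",
    "compute-00-03": "mthrbd2",
    "compute-00-06": "mthrbd2",
}

# No group contains conflicting values, so nothing is reported.
print(inconsistent_groups(product_name, group_of))
```

<br />If every host were placed in a single group instead, the same function would report both product names for that group, which corresponds to the failing subtest shown earlier.<br />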
<br />For more information on the groups capability, see <a href="http://software.intel.com/en-us/articles/how-to-use-a-single-intel-cluster-checker-configuration-file-for-different-configurations/">this article</a> and the <a href="http://software.intel.com/file/9817">Intel® Cluster Checker User's Guide</a>.<br /><br />The other way to instruct the <span >dmidecode</span> check to ignore specific differences between nodes is to use the exclude configuration option.  This option ignores a specific field for all nodes but still checks all the remaining fields for consistency across the whole cluster.<br /><br />
<blockquote>&lt;dmidecode&gt;<br />  &lt;exclude&gt;Memory Device (0x1100): Asset Tag&lt;/exclude&gt;<br />&lt;/dmidecode&gt;</blockquote>
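<br />The behavior of the exclude option described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the function name <span >failing_subtests</span> and the data layout are hypothetical, and only two hosts from the example output are included.<br /><br />

```python
def failing_subtests(reports, excluded=()):
    """List the fields whose values differ between hosts, skipping exclusions.

    reports: dict mapping host name to a dict of {field name: value}.
    excluded: field names to ignore, mirroring the exclude option.
    """
    fields = set()
    for host_fields in reports.values():
        fields.update(host_fields)
    failing = []
    for field in sorted(fields):
        if field in excluded:
            continue
        # Collect the distinct values reported for this field.
        distinct = {host_fields.get(field) for host_fields in reports.values()}
        if len(distinct) != 1:
            failing.append(field)
    return failing


ASSET_TAG = "Memory Device (0x1100): Asset Tag"
RELEASE_DATE = "BIOS Information (0x0000): Release Date"

# Values taken from the example output above, for two hosts only.
reports = {
    "compute-00-00": {ASSET_TAG: "0108001D", RELEASE_DATE: "03/09/2009"},
    "compute-00-01": {ASSET_TAG: "01082803", RELEASE_DATE: "03/09/2009"},
}

print(failing_subtests(reports))
print(failing_subtests(reports, excluded=[ASSET_TAG]))
```

<br />Without the exclusion, the asset tag field is listed as failing; with it, that field is skipped while the release date is still compared.<br />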
<p> <br />For more information on the exclude option, see the <span >dmidecode</span> section of the <a href="http://software.intel.com/file/14912">Intel® Cluster Checker Module Reference</a>.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/troubleshooting-the-dmidecode-check/</link>
      <pubDate>Wed, 19 Aug 2009 22:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/troubleshooting-the-dmidecode-check/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/troubleshooting-the-dmidecode-check/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
    <item>
      <title>Troubleshooting the uid_sync check</title>
      <description><![CDATA[ The <span >uid_sync</span> check verifies that the user and group data is consistent across the cluster.  A user or group that is present on some nodes but not on others causes this check to fail.  Here is an example of how the check reports differences found between nodes:<br /><br />
<blockquote>User and Group Uniformity, (uid_sync)...............................................................FAILED<br />subtest 'gid = 1313' failed<br />- failing hosts node00001 - node00128 returned: 'clck:x:1313:'<br />- failing host node00129 returned: 'group does not exist'<br />subtest 'uid = 1313' failed<br />- failing hosts node00001 - node00128 returned: 'clck:x:1313:1313:::cluster checker:/home/clck:/bin/bash'<br />- failing host node00129 returned: 'user does not exist'</blockquote>
<br />The ID of the user or group is displayed as the subtest description: <span >subtest 'gid = <strong>1313</strong>' failed</span>.  The output under the heading identifies the nodes on which the user or group exists and provides the corresponding entry from the user or group database (e.g., <span >/etc/group</span> or <span >/etc/passwd</span>).  <br /><br />This check most often fails when there are mismatches or conflicts with NIS or the local group or user databases.  Depending on the method you employ, ensure that NIS is properly configured on the inconsistent node(s) or that the <span >/etc/group</span> and <span >/etc/passwd</span> files are correctly propagated to the inconsistent node(s).  <br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/troubleshooting-the-uid_sync-check/</link>
      <pubDate>Thu, 30 Jul 2009 22:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/troubleshooting-the-uid_sync-check/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/troubleshooting-the-uid_sync-check/</guid>
      <category>Intel® Cluster Checker Knowledge Base</category>
      <category>Intel® Cluster Ready Knowledge Base</category>
    </item>
  </channel></rss>
