Intel® Cluster Checker uses the STREAM benchmark to verify the memory performance of each cluster node. STREAM consists of several individual benchmark tests; the memory_bandwidth_stream check only uses the 'Triad' benchmark. When running the memory_bandwidth_stream check, you may encounter a diagnostic message similar to the following:
Single-node Memory Bandwidth (STREAM), (memory_bandwidth_stream).......FAILED
subtest 'Triad' failed
- failing host compute-00-03 returned: '15977.7 MB/s'
- failing host compute-00-01 returned: '15997.1 MB/s'
- failing host compute-00-00 returned: '16004.3 MB/s'
- failing host compute-00-02 returned: '16126.5 MB/s'
The failure reported above occurs because the bandwidth measured by the STREAM Triad benchmark on these nodes is lower than the threshold value set by the <bandwidth> tag in the configuration file.
STREAM benchmark performance is sensitive to the characteristics of the processor(s), motherboard, and memory used in the system. A failure to reach the configured performance threshold is not necessarily a system fault; the threshold value may have been set or tuned for higher-performing processors and memory.
If all of the following statements are true, the <bandwidth> performance threshold likely needs to be tuned to match the performance of the individual cluster nodes:
- The memory_bandwidth_stream check has always reported a failure to achieve the configured performance threshold on the system (no previous runs ever met the specified performance threshold)
- All the nodes in a cluster fail to achieve the configured performance threshold
- All the nodes achieve relatively similar performance levels
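The second and third conditions above can be checked roughly with a short script. The following is a minimal sketch, assuming the diagnostic output has been saved as text in the format shown earlier. The function names and the 5% spread tolerance are illustrative assumptions, not part of Intel® Cluster Checker. The first condition (no previous run ever passed) depends on run history and cannot be determined from a single report.

```python
import re

# Diagnostic text in the format shown earlier in this article.
REPORT = """\
- failing host compute-00-03 returned: '15977.7 MB/s'
- failing host compute-00-01 returned: '15997.1 MB/s'
- failing host compute-00-00 returned: '16004.3 MB/s'
- failing host compute-00-02 returned: '16126.5 MB/s'
"""

# Captures the host name and the reported bandwidth from one failure line.
LINE = re.compile(r"failing host (\S+) returned: '([\d.]+) MB/s'")


def parse_failures(text):
    """Return {hostname: bandwidth in MB/s} for every failing-host line."""
    return {host: float(bw) for host, bw in LINE.findall(text)}


def relative_spread(values):
    """(max - min) / min: a crude measure of how similar the results are."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest


def looks_like_tuning_issue(failures, all_hosts, max_spread=0.05):
    """True if every node failed and all results lie within max_spread.

    max_spread=0.05 is an assumed tolerance chosen for illustration.
    """
    every_node_failed = set(failures) == set(all_hosts)
    similar = relative_spread(failures.values()) <= max_spread
    return every_node_failed and similar


failures = parse_failures(REPORT)
hosts = ["compute-00-00", "compute-00-01", "compute-00-02", "compute-00-03"]
print(looks_like_tuning_issue(failures, hosts))
```

For the four results above, the spread between the fastest and slowest node is under 1%, so on a four-node cluster the report is consistent with a threshold that needs tuning rather than a faulty node.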
When setting the performance threshold, a suggested value is 90% of the lowest measured performance. This margin allows for normal fluctuation in the results.
Memory performance may also be sensitive to where memory is installed on the motherboard and to the BIOS settings. If all the nodes have identical hardware but one node consistently reports lower performance, verify that the same memory slots are populated and that the BIOS memory options are set consistently.
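As a worked example of the 90% guideline, the sketch below applies it to the four bandwidth values from the diagnostic output above. The function name `suggested_threshold` and its `margin` parameter are illustrative assumptions; check which units the configuration file expects before entering the result in the <bandwidth> tag.

```python
def suggested_threshold(bandwidths_mb_s, margin=0.90):
    """Return margin times the lowest measured bandwidth, in MB/s."""
    if not bandwidths_mb_s:
        raise ValueError("at least one measurement is required")
    return margin * min(bandwidths_mb_s)


# Values taken from the diagnostic output shown earlier (MB/s).
measured = [15977.7, 15997.1, 16004.3, 16126.5]

# Lowest result is 15977.7 MB/s, so the suggestion is 0.9 * 15977.7.
print(f"{suggested_threshold(measured):.1f} MB/s")  # prints 14379.9 MB/s
```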
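One way to spot a node that lags behind otherwise identical peers is to compare each result against the cluster median. This is a minimal sketch: the `slow_nodes` helper and the 10% tolerance are assumptions, and the value for compute-00-02 is invented purely to illustrate a slow node.

```python
from statistics import median


def slow_nodes(results, tolerance=0.10):
    """Return hosts whose bandwidth is more than `tolerance` below the median."""
    mid = median(results.values())
    return sorted(h for h, bw in results.items() if bw < (1 - tolerance) * mid)


# Hypothetical data: the compute-00-02 value is made up for illustration.
results = {
    "compute-00-00": 16004.3,
    "compute-00-01": 15997.1,
    "compute-00-02": 12100.0,
    "compute-00-03": 15977.7,
}

print(slow_nodes(results))  # prints ['compute-00-02']
```

A node flagged this way across several runs is a candidate for checking memory slot population and BIOS memory options.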
If the memory performance is inconsistent from run to run, there may be other processes on the node consuming resources. Verify that no other programs are running on the nodes before starting this check.
If the cluster has heterogeneous hardware, a single performance threshold may not be appropriate for all the nodes. A separate knowledge base article describes how to configure Intel® Cluster Checker for heterogeneous clusters.