Configuration of threshold values in the HPCC test module

The HPC Challenge (HPCC) benchmark suite is a common way to gauge the performance of a cluster.  HPCC consists of seven benchmarks that measure a spectrum of system characteristics.  The hpcc module for Intel® Cluster Checker runs the HPCC benchmark suite on the cluster and reports ‘Succeeded’ or ‘Failed’ based on the outcome of the tests.

This article does not cover descriptions or definitions of the individual HPCC benchmarks.  For more information about the HPCC benchmarks, see http://icl.cs.utk.edu/hpcc/.

Module configuration affects the results of the hpcc tests

Whether the hpcc module succeeds or fails depends on how the module is configured in the Intel® Cluster Checker configuration file.  The module executes the HPCC benchmark over each network fabric that is configured in the hpcc module block of the input configuration file.  For each configured fabric, each individual HPCC benchmark can optionally be given a performance threshold that must be achieved for a successful result.  If no performance threshold is set, the success of a test is based solely on the benchmark running to completion.

Results when threshold values are configured

When threshold values are set, a benchmark must perform at least as well as the configured value.  Depending on the benchmark, that means either a result equal to or greater than the configured threshold (higher is better) or a result equal to or less than the configured threshold (lower is better).

hpcc module configuration tag   Measurement unit   Output characteristics   Passing result
bandwidth                       GB/s               Higher is better         Equal or greater
dgemm                           GFLOPS             Higher is better         Equal or greater
fft                             GFLOPS             Higher is better         Equal or greater
hpl                             TFLOPS             Higher is better         Equal or greater
latency                         µs                 Lower is better          Equal or less
ptrans                          GB/s               Higher is better         Equal or greater
randomaccess                    GUPs               Higher is better         Equal or greater
stream                          GB/s               Higher is better         Equal or greater
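The comparison rule in the table can be sketched in a few lines of Python.  This is only an illustration of the rule, not code from Intel® Cluster Checker; the names LOWER_IS_BETTER and passes_threshold are hypothetical, and the latency measurement in the usage lines is a made-up value.

```python
# Hypothetical helper illustrating the pass/fail rule from the table.
# Tags for which a lower measured value is better; every other tag in the
# table is treated as higher-is-better.
LOWER_IS_BETTER = {"latency"}


def passes_threshold(tag, measured, threshold=None):
    """Return True if a measured result satisfies the configured threshold."""
    if threshold is None:
        # No threshold configured: success depends only on the benchmark
        # running to completion, which is assumed here.
        return True
    if tag in LOWER_IS_BETTER:
        return measured <= threshold
    return measured >= threshold


# PTRANS values taken from the example in this article.
print(passes_threshold("ptrans", 0.0817186, 0.10))  # False
# Made-up latency measurement of 35 µs against a 40 µs threshold.
print(passes_threshold("latency", 35.0, 40))        # True
```

A result exactly equal to the threshold passes in both directions, which matches the "Equal or greater" and "Equal or less" entries.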


If one of the benchmarks does not meet the configured threshold value, the module will report a failing result identifying the network fabric and the individual failing benchmark(s).  For example, using the following configuration, the hpcc module reported the following failure:

<hpcc>
        <cc-path>/opt/intel/cce/11.0.069/</cc-path>
        <fabric>
              <bandwidth>0.003</bandwidth>
              <device>sock</device>
              <dgemm>5.76</dgemm>
              <fft>0.4</fft>
              <hpl>0.04</hpl>
              <latency>40</latency>
              <ptrans>0.10</ptrans>
              <randomaccess>0.008</randomaccess>
              <stream>1.4</stream>
        </fabric>
        <mkl-path>/opt/intel/cmkl/10.1.0.015/</mkl-path>
        <mpi-path>/opt/intel/impi/3.2/</mpi-path>
        <process-number>8</process-number>
        <thread-number>1</thread-number>
</hpcc>

 

HPC Challenge Benchmark (Intel® C++ Compiler, Intel® MPI
Library, Intel® Math Kernel Library), (hpcc)
Attention: this check may take a long time to complete......FAILED
subtest 'PTRANS, GB/s (device = sock)' failed
  - failing All hosts returned: '0.0817186'

 
The module reported a failure because the PTRANS test result was 0.0817186 GB/s, which did not meet or exceed the configured value of 0.10 GB/s.
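To see how far a result falls short, the thresholds can be read from the fabric block and compared with the reported value.  The sketch below uses Python's standard xml.etree.ElementTree module; the function name load_thresholds is hypothetical, and the code is not part of Intel® Cluster Checker.

```python
import xml.etree.ElementTree as ET

# Fabric block copied from the example configuration in this article.
FABRIC_XML = """
<fabric>
  <bandwidth>0.003</bandwidth>
  <device>sock</device>
  <dgemm>5.76</dgemm>
  <fft>0.4</fft>
  <hpl>0.04</hpl>
  <latency>40</latency>
  <ptrans>0.10</ptrans>
  <randomaccess>0.008</randomaccess>
  <stream>1.4</stream>
</fabric>
"""


def load_thresholds(fabric_xml):
    """Return (device, {tag: threshold}) for a single <fabric> block."""
    fabric = ET.fromstring(fabric_xml)
    device = fabric.findtext("device")
    # Every child other than <device> is treated as a numeric threshold.
    thresholds = {
        child.tag: float(child.text)
        for child in fabric
        if child.tag != "device"
    }
    return device, thresholds


device, thresholds = load_thresholds(FABRIC_XML)
measured_ptrans = 0.0817186  # value reported in the failure output
shortfall = thresholds["ptrans"] - measured_ptrans
print(f"device {device}: PTRANS is {shortfall:.4f} GB/s below the threshold")
```

For the example above, the shortfall is roughly 0.018 GB/s, or about 18% of the configured threshold.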

What do failures to meet thresholds mean?

Many system characteristics affect the results of the HPCC benchmark suite, and a reported test failure does not necessarily indicate an under-performing or malfunctioning cluster.  Processor speeds, network characteristics, and memory architecture, for instance, all factor into the measured results.  Changes in the characteristics of any of those components or sub-systems can affect the outcome of the tests.  Therefore, a failure to meet a threshold may be the result of a value configured too high for the characteristics of a particular cluster.  The thresholds can be reset to levels that are more appropriate for the specific system to resolve the issue.

If a cluster has historically passed the hpcc module with threshold values configured but begins to fail consistently, there may be a problem with one or more components in the system.  First make sure that Intel® Cluster Checker is the only application running on the system; other applications running concurrently are likely to affect the measured results of the benchmarks.  If failures to meet thresholds persist and the hardware characteristics of the cluster have not changed, then an issue may be degrading system performance and should be resolved.

Intermittent failures to meet threshold values may be the result of threshold levels that are set too high to account for the natural fluctuations in performance of the system.  For example, with the PTRANS configuration above, the threshold is set to 0.10.  A given cluster may exhibit performance that routinely yields 0.11 GB/s but has fluctuations ranging from 0.095 to 0.12 GB/s.  Any fluctuations that dip below the 0.10 threshold will be flagged as a failure.  Threshold values should be configured to account for some fluctuations in results, so a better threshold for this example may be 0.09 GB/s.
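One way to pick such a threshold is to start from the worst result seen in a set of known-good runs and leave a small margin beyond it.  The sketch below is an assumption about how that might be done, not a procedure from Intel® Cluster Checker; the function name suggest_threshold, the 5% margin, and the sample values are all hypothetical.

```python
def suggest_threshold(samples, margin=0.05, lower_is_better=False):
    """Suggest a threshold just outside the range of known-good results.

    margin is the fraction of headroom left beyond the worst observed
    result (0.05 means 5%); it is an arbitrary choice for this sketch.
    """
    if not samples:
        raise ValueError("need at least one historical result")
    if lower_is_better:
        # For latency-style results the worst value is the largest one.
        return max(samples) * (1 + margin)
    # For higher-is-better results the worst value is the smallest one.
    return min(samples) * (1 - margin)


# Made-up PTRANS results (GB/s) spanning the 0.095 to 0.12 range
# described in the example.
ptrans_history = [0.110, 0.095, 0.120, 0.112, 0.108]
print(round(suggest_threshold(ptrans_history), 3))  # 0.09
```

With these sample values, the suggestion lands close to the 0.09 GB/s proposed in the example.  A larger margin tolerates more fluctuation but makes the check less sensitive to real degradation.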
