Intel® Optimized Technology Preview for High Performance Conjugate Gradient Benchmark

The Intel® Optimized High Performance Conjugate Gradient Benchmark provides an early implementation of the HPCG benchmark (http://hpcg-benchmark.org) optimized for Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) enabled Intel® Xeon® and Intel® Xeon Phi™ processors. The HPCG Benchmark is intended to complement the High Performance LINPACK benchmark used in the TOP500 (http://www.top500.org) system ranking by providing a metric that better aligns with a broader set of important cluster applications.

The HPCG benchmark implementation is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient method (CG) is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel preconditioning step that requires a triangular forward solve and a backward solve. A synthetic multigrid V-cycle is used on each preconditioning step to make the benchmark more similar to real world applications. The multiplication of matrices is implemented locally with an initial halo exchange between neighboring processes. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations that dominate many scientific workloads (http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf).
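For reference, a single iteration of the preconditioned CG method that the benchmark times can be written as follows (standard formulation; here $M^{-1}$ denotes the symmetric Gauss-Seidel/multigrid preconditioner, and the SpMV ($Ap_i$), preconditioner, and dot-product operations visible in the recurrences are exactly the kernels the optimizations target):

$$z_i = M^{-1} r_i, \qquad \beta_i = \frac{r_i^T z_i}{r_{i-1}^T z_{i-1}}, \qquad p_i = z_i + \beta_i\, p_{i-1},$$

$$\alpha_i = \frac{r_i^T z_i}{p_i^T A p_i}, \qquad x_{i+1} = x_i + \alpha_i\, p_i, \qquad r_{i+1} = r_i - \alpha_i\, A p_i.$$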

The Intel Optimized High Performance Conjugate Gradient Benchmark contains the HPCG v3.0 reference implementation source code, with the modifications necessary to include Intel® architecture optimizations, along with pre-built benchmark executables. These executables link to Intel® Math Kernel Library (Intel® MKL) Inspector-executor Sparse BLAS kernels for sparse matrix-vector multiplication (SpMV), sparse triangular solve (TRSV), and the symmetric Gauss-Seidel smoother (SYMGS), optimized for the Intel AVX, Intel AVX2, and Intel AVX-512 instruction sets. For the Intel AVX-512 instruction set, separate versions targeting Intel Xeon Scalable processors and Intel Xeon Phi processors are included. This package can be used to evaluate the performance of distributed-memory systems based on any generation of the Intel® Xeon® processor E3 family, Intel® Xeon® processor E5 family, Intel® Xeon® processor E7 family, Intel® Xeon® Scalable processor family, and Intel® Xeon Phi™ processor family.

The Intel MKL Inspector-executor Sparse BLAS kernels SpMV, TRSV, and SYMGS are implemented using an inspector-executor model. During the inspection step the best algorithm for the input matrix is chosen and the matrix is converted to a special internal representation. During the execution step the operation itself is executed with high performance.

Getting started

To start working with the benchmark, unpack the Intel Optimized High Performance Conjugate Gradient Benchmark package to a cluster file system directory accessible by all nodes. Read and accept the license as indicated in the readme.txt file included in the package.

The package includes pre-built HPCG benchmark executables for Intel MPI 5.1 and later:

    hpcg/bin/xhpcg_avx -- The Intel AVX optimized version (xhpcg_avx) is optimized for systems based on the first and second generations of the Intel Xeon processor E3 family, Intel Xeon processor E5 family, or Intel Xeon processor E7 family.

    hpcg/bin/xhpcg_avx2 -- The Intel AVX2 optimized version (xhpcg_avx2) is optimized for systems based on the third and later generations of the Intel Xeon processor E3 family, Intel Xeon processor E5 family, Intel Xeon processor E7 family, and future Intel processors with Intel AVX2 support. Running the Intel AVX optimized version of the benchmark on an Intel AVX2 enabled system produces non-optimal performance. The Intel AVX2 optimized version of the benchmark will not run on systems that do not have Intel AVX2 support.

    hpcg/bin/xhpcg_knl -- The Intel Xeon Phi processor (products formerly Knights Landing) optimized version of the benchmark is designed for systems based on Intel Xeon Phi processors with Intel AVX-512 support. Running the Intel AVX or AVX2 optimized versions of the benchmark on an Intel AVX-512 enabled system produces non-optimal performance. The Intel Xeon Phi processor optimized version of the benchmark will not run on systems that do not have Intel AVX-512 support.

    hpcg/bin/xhpcg_skx -- The Intel Xeon Scalable processor (products formerly Skylake) optimized version of the benchmark is designed for systems based on Intel Xeon Scalable processors and future Intel processors with Intel AVX-512 support. Running the Intel AVX or AVX2 optimized versions of the benchmark on an Intel AVX-512 enabled system produces non-optimal performance. The Intel Xeon Scalable processor optimized version of the benchmark will not run on systems that do not have Intel AVX-512 support.

The package also includes the source code necessary to build the Intel AVX, Intel AVX2, Intel Xeon Scalable processor (Intel AVX-512), and Intel Xeon Phi processor (Intel AVX-512) optimized versions for other MPI implementations, such as SGI MPT*, MPICH2, or OpenMPI. Instructions are available in the QUICKSTART file included with the package.
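To determine which instruction sets a host supports, and therefore which pre-built binary to choose, you can inspect the CPU flags. A minimal check on a Linux host (assumes GNU grep; the flag-to-binary mapping in the comments follows the descriptions above):

# avx512f -> xhpcg_skx (Intel Xeon) or xhpcg_knl (Intel Xeon Phi)
# avx2    -> xhpcg_avx2
# avx     -> xhpcg_avx
#> grep -o -w 'avx\|avx2\|avx512f' /proc/cpuinfo | sort -u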

Once the package is unpacked and stored:

  1. Change to the hpcg directory.
  2. Determine which pre-built version of the benchmark is best for your system, or follow the QUICKSTART instructions to build a version of the benchmark for your MPI implementation. Make sure that the Intel MKL, Intel® C/C++ Compiler 16.0, and MPI runtime environments are properly set up (this can be done using the scripts mklvars.sh, compilervars.sh, and mpivars.sh included in those distributions); a complete example session is shown after these steps.
  • Intel AVX-512 (Intel Xeon processors), Intel AVX2, and Intel AVX optimized versions perform best with one MPI process per socket and one OpenMP thread per core, skipping simultaneous multithreading (SMT) threads (with affinity set as KMP_AFFINITY=granularity=fine,compact,1,0). For a 128-node cluster with two Intel Xeon E5-2697 v4 processors per node, run the executable in the following way:

#> mpiexec.hydra -n 256 -ppn 2 env OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 -n192

  • The Intel AVX-512 (Intel Xeon Phi processors) optimized version performs best with four MPI processes per processor and two threads for each processor core, with SMT turned on. For a 128-node cluster with one Intel® Xeon Phi™ Processor 7250 per node, run the executable in the following way:

#> mpiexec.hydra -n 512 -ppn 4 env OMP_NUM_THREADS=34 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_knl -n160

  3. The benchmark completes execution in a few minutes and produces an official YAML results file in the current directory. The performance rating of the benchmarked system is given in the last section of the file:

           HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
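For example, a complete session on an Intel AVX2 cluster might look like the following sketch. The installation paths of the environment scripts are typical defaults and are assumptions here; adjust them to match your installation:

#> source /opt/intel/mkl/bin/mklvars.sh intel64
#> source /opt/intel/bin/compilervars.sh intel64
#> source /opt/intel/impi/5.1.3/bin64/mpivars.sh
#> mpiexec.hydra -n 256 -ppn 2 env OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 -n192
#> grep "GFLOP/s rating" ./*.yaml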

Choosing the best configuration and problem sizes

HPCG benchmark performance depends on many parameters of the system including, but not limited to, the host hardware configuration and the MPI implementation used. Depending on the configuration, adjust the following parameters to achieve the best performance:

  • number of MPI processes per host and OpenMP threads per process,
  • local problem size.

For Intel Xeon processor-based clusters, use the Intel AVX, Intel AVX2, or Intel AVX-512 optimized version of the benchmark, depending on the supported instruction set, and run one MPI process per CPU socket and one OpenMP thread per physical CPU core, skipping SMT threads.

For systems based on Intel Xeon Phi processors, use the Intel AVX-512 optimized version of the benchmark. This version works best with four MPI processes per Intel Xeon Phi processor. The number of OpenMP threads should be set to two for each processor core, with SMT turned on. For instance, on the Intel Xeon Phi Processor 7250, which has 68 cores, each MPI process should run 34 OpenMP threads.
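The per-rank thread count follows directly from this rule; a quick shell calculation (the core count shown is for the 7250, an assumption to adjust for other processor SKUs):

#> CORES=68; RANKS_PER_NODE=4; THREADS_PER_CORE=2
#> echo $(( CORES / RANKS_PER_NODE * THREADS_PER_CORE ))   # prints 34, the OMP_NUM_THREADS value per rank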


Finally, choose the problem size for the best performance: it should be large enough to utilize all available cores efficiently, yet small enough that all MPI processes fit in the available memory.
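As a rough sanity check before launching a long run, the per-process memory footprint can be estimated from the local problem size n (assuming, as in the commands above, that -n sets all three local grid dimensions, giving an n×n×n local grid per process). The bytes-per-grid-point constant below is a ballpark assumption covering the matrix, vectors, and coarse multigrid levels, not an official figure:

#> n=192; BYTES_PER_POINT=400
#> echo "$n^3 * $BYTES_PER_POINT / 2^30" | bc   # approximate GiB per MPI process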

The chart below shows the performance results reported by the different optimized versions of the benchmark.

Figure 1: Performance measurements of Intel Optimized High Performance Conjugate Gradient Benchmark using Intel MKL 2018 and Intel MPI 2017 Update 1.

* Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. 

Run Configuration:

Intel® Xeon® Processor E5-2697 v4, 18 cores, 2 processes per node, each with 18 threads

Intel® Xeon® Scalable Processor 6148, Turbo off, SMT on, 20 cores, 2 processes per node, each with 20 threads

Intel® Xeon® Scalable Processor 8180, Turbo off, SMT on, 28 cores, 2 processes per node, each with 28 threads

Intel® Xeon Phi™ Processor 7250, Turbo on, SMT on, Flat mode, 68 cores, 16 GB MCDRAM, 4 processes per node, each with 34 threads
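For instance, the Intel Xeon Scalable Processor 8180 configuration above corresponds to a launch of the following form (the 128-node count behind -n 256 and the -n192 local size are illustrative assumptions, not part of the published configuration):

#> mpiexec.hydra -n 256 -ppn 2 env OMP_NUM_THREADS=28 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_skx -n192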


System requirements

Pre-built benchmark executables require the following hardware and software:

  • Intel AVX, Intel AVX2, or Intel AVX-512 enabled Intel Xeon processor or Intel Xeon Phi processor
  • Red Hat* Enterprise Linux* 6 (64-bit) or compatible OS
  • Intel® MPI Library 5.1 or later
  • Intel C/C++ Compiler 16.0 or later

Using the source code provided and the optimized kernels, you can build executables compatible with other MPI implementations. See the QUICKSTART file in the package for instructions.


Frequently Asked Questions

Q – Where can I download the Intel Optimized High Performance Conjugate Gradient Benchmark?

A – Starting with Intel MKL 11.3 Update 1, the Intel Optimized High Performance Conjugate Gradient Benchmark is included in the Intel MKL Benchmarks package. See the KB article Intel® Math Kernel Library Benchmarks (Intel® MKL Benchmarks) to download the package.

Q – What is the password of the archived package?

A – Please see the license.txt file for directions on extracting the package contents.

Q – Where can I get support?

A – We encourage you to visit the Intel Math Kernel Library support forum (https://software.intel.com/en-us/forums/intel-math-kernel-library) or use the Intel Online Service Center.

Q – Can I use this package on systems with hosts that do not support Intel AVX, Intel AVX2, or Intel AVX-512?

A – The current version of the Intel Optimized High Performance Conjugate Gradient Benchmark is optimized for Intel AVX, Intel AVX2, and Intel AVX-512 and can run only on systems that support these instruction sets. The binary issues an error message if it is executed on an architecture without Intel AVX support or on a non-Intel processor.

Q – How can I measure performance using this benchmark?

A – Please see the QUICKSTART.txt file in the package.
