Intel® Optimized Technology Preview for High Performance Conjugate Gradient Benchmark
The Intel® Optimized High Performance Conjugate Gradient Benchmark provides an early implementation of the HPCG benchmark (http://hpcg-benchmark.org) optimized for Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) enabled Intel® processors and Intel® Xeon Phi™ coprocessors. The HPCG Benchmark is intended to complement the High Performance LINPACK benchmark used in the TOP500 (http://www.top500.org) system ranking by providing a metric that better aligns with a broader set of important cluster applications.
The HPCG benchmark implementation is based on a 3D regular 27-point discretization of an elliptic partial differential equation. The 3D domain is scaled to fill a 3D virtual process grid for all of the available MPI ranks. The preconditioned conjugate gradient method (CG) is used to solve the intermediate systems of equations and incorporates a local and symmetric Gauss-Seidel preconditioning step that requires a triangular forward solve and a backward solve. A synthetic multigrid V-cycle is used on each preconditioning step to make the benchmark more similar to real-world applications. Sparse matrix-vector multiplication is implemented locally with an initial halo exchange between neighboring processes. The benchmark exhibits irregular accesses to memory and fine-grain recursive computations that dominate many scientific workloads (http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf).
The Intel Optimized High Performance Conjugate Gradient Benchmark contains the HPCG v3.0 reference implementation source code with the modifications that are necessary to include Intel® architecture optimizations, pre-built benchmark executables, and four dynamic libraries with sparse matrix vector multiplication (SpMV), symmetric Gauss-Seidel smoother (SYMGS), and Gauss-Seidel preconditioner (GS) kernels optimized for Intel AVX, Intel AVX2, and Intel Xeon Phi coprocessor instruction sets. This package can be used to evaluate the performance of distributed memory systems based on any generation of Intel® Xeon® processor E3 family, Intel® Xeon® processor E5 family, Intel® Xeon® processor E7 family, and Intel Xeon Phi coprocessor family.
The SpMV and GS kernels are implemented using an inspector-executor model. During the inspection step the best algorithm for the input matrix is chosen and the matrix is converted to a special internal representation. During the execution step the operation itself is executed with high performance.
To start working with the benchmark, unpack the Intel Optimized High Performance Conjugate Gradient Benchmark package to a cluster file system directory accessible by all nodes. Read and accept the license as indicated in the readme.txt file included in the package.
The package includes pre-built HPCG benchmark executables for Intel MPI 4.1.3 and later versions:
ihpcg/bin/xhpcg_avx -- The Intel AVX optimized version (xhpcg_avx) is optimized for systems based on the first and the second generations of Intel Xeon processor E3 family, Intel Xeon processor E5 family, or Intel Xeon processor E7 family.
ihpcg/bin/xhpcg_avx2 -- The Intel AVX2 optimized version (xhpcg_avx2) is optimized for systems based on Intel® Xeon® E3-xxxx v3 processors and future Intel processors with Intel AVX2 support. Running the Intel AVX optimized version of the benchmark on an Intel AVX2 enabled system produces non-optimal performance. The Intel AVX2 optimized version of the benchmark will not run on systems that do not have Intel AVX2 support.
ihpcg/bin/xhpcg_mic -- The Intel Xeon Phi coprocessor optimized version (xhpcg_mic). This version should be used for native runs on Intel Xeon Phi coprocessor. It is also used along with the Intel AVX optimized version (xhpcg_avx) or Intel AVX2 optimized version (xhpcg_avx2) for symmetric runs. A symmetric run involves xhpcg_mic running on the Intel Xeon Phi coprocessors and xhpcg_avx or xhpcg_avx2 running on Intel Xeon processors. MPI ranks can be on both the Intel Xeon hosts and the Intel Xeon Phi coprocessors. This version works only with Intel MPI.
ihpcg/bin/xhpcg_offload -- The Intel Xeon Phi coprocessor optimized version for offload mode (xhpcg_offload). This version runs on an Intel Xeon system and offloads computations to the Intel Xeon Phi coprocessor(s). The difference from the Intel Xeon Phi optimized version (xhpcg_mic) is that MPI ranks are on the Intel Xeon hosts but not on the Intel Xeon Phi coprocessors. Running this version of the benchmark requires the Redistributable Library package for the Intel Parallel Studio XE 2015.
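The choice between xhpcg_avx and xhpcg_avx2 can be scripted. The following is a minimal sketch, not part of the package: the function name pick_hpcg_binary is hypothetical, and it assumes a Linux host where /proc/cpuinfo exposes a "flags" line. It checks only the instruction-set flags; it does not check the processor vendor and does not cover the coprocessor versions.

```shell
# Hypothetical helper (not shipped with the package): print the name of the
# pre-built host binary that matches the instruction sets listed in a
# cpuinfo-style file. Defaults to /proc/cpuinfo on Linux.
pick_hpcg_binary() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # The first "flags" line lists the instruction-set extensions of the CPU.
    flags=$(grep -m1 '^flags' "$cpuinfo")
    # Pad with spaces so that "avx" does not match "avx2" or "avx512f".
    case " $flags " in
        *" avx2 "*) echo "xhpcg_avx2" ;;
        *" avx "*)  echo "xhpcg_avx" ;;
        *)          echo "no Intel AVX support detected" >&2; return 1 ;;
    esac
}
```

Calling pick_hpcg_binary with no argument inspects the local host. AVX2 is tested first because the AVX binary also runs on AVX2 systems, only with non-optimal performance.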
The package also includes the source code and libraries necessary to build the Intel AVX optimized version, the Intel AVX2 optimized version, or the Intel Xeon Phi coprocessor optimized version for offload mode with other MPI implementations, such as SGI MPT*, MPICH2, or OpenMPI. Instructions are available in the QUICKSTART file included with the package. The Intel Xeon Phi coprocessor optimized version for native and symmetric runs (xhpcg_mic) is available only for Intel MPI.
Once the package is unpacked and stored:
- Change to the ihpcg/bin directory
- Determine which pre-built version of the benchmark is best for your system, or follow the QUICKSTART instructions to build a version of the benchmark for your MPI implementation. Note that only the Intel Xeon processor and Intel Xeon Phi coprocessor offload versions can be built with other MPI implementations. Intel Xeon Phi coprocessor native runs and symmetric runs require Intel MPI. Make sure that Intel® C/C++ Compiler 15.0, MPI runtime libraries, and Intel® Manycore Platform Software Stack (Intel® MPSS) runtime libraries (for Intel Xeon Phi coprocessors only) are available via LD_LIBRARY_PATH and MIC_LD_LIBRARY_PATH. Running the offload version of the benchmark also requires Intel Composer XE 2015 or the Redistributable Library package for the Intel Parallel Studio XE 2015 to be installed.
- Intel AVX2 and Intel AVX optimized versions perform best with one process per socket and one OpenMP thread per core skipping hyperthreads (with affinity set as KMP_AFFINITY=granularity=fine,compact,1,0). For a 128-node cluster with two Intel Xeon Processor E5-2697 v3 per node, run the executable in the following way:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 512 -perhost 2 env OMP_NUM_THREADS=14 KMP_AFFINITY=granularity=fine,compact,1,0 bin/xhpcg_avx2 --n=168
- Intel Xeon Phi coprocessor optimized version for offload mode performs best with one MPI process per coprocessor and four threads for each Intel Xeon Phi coprocessor core with a single core left free. For a 128-node cluster with two Intel Xeon Phi coprocessor 7120D per node, run the executable in the following way:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 -perhost 2 env -u OMP_NUM_THREADS -u KMP_AFFINITY MIC_OMP_NUM_THREADS=240 MIC_LD_LIBRARY_PATH=./bin/lib/mic:$MIC_LD_LIBRARY_PATH LD_LIBRARY_PATH=./bin/lib/mic:./bin/lib/intel64:$LD_LIBRARY_PATH ./bin/xhpcg_offload --n=168
- In symmetric mode, the number of MPI processes per host and per coprocessor should be chosen to balance the performance of the processes. For a 128-node cluster with one Intel Xeon Phi coprocessor 7120D per node, two MPI processes per host, and two MPI processes per Intel Xeon Phi coprocessor, run the executable in the following way:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 -perhost 2 env OMP_NUM_THREADS=14 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 --n=144 : -n 256 -perhost 2 env OMP_NUM_THREADS=120 KMP_AFFINITY=compact ./bin/xhpcg_mic --n=144
For symmetric runs in the example above, .machinefile should include a list of Intel Xeon hosts followed by the list of Intel Xeon Phi coprocessors.
- The benchmark completes execution in a few minutes and produces an official YAML results file in the current directory. The performance rating of the benchmarked system is given in the last section of the file:
HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
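To collect the rating from many runs, the summary line can be extracted with standard tools. This is a minimal sketch: the function name hpcg_rating is hypothetical, and the results file is passed explicitly because this article does not state its name.

```shell
# Hypothetical helper: print the GFLOP/s rating from an HPCG YAML results file.
# Prints nothing if the file contains no "result is VALID" line.
hpcg_rating() {
    # Keep only the text that follows the colon on the summary line.
    sed -n 's/^.*HPCG result is VALID with a GFLOP\/s rating of: *//p' "$1"
}
```

An empty result means the file has no line reporting a VALID result, so the run should be inspected by hand before its numbers are used.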
Choosing best configuration and problem sizes
HPCG benchmark performance depends on many parameters of the system including, but not limited to, host hardware configuration, number and configuration of coprocessors, and MPI implementation used. Depending on the configuration, the following three parameters should be chosen to get the best performance:
- number of MPI processes per host and OpenMP threads per process,
- local problem size,
- and execution mode, if Intel Xeon Phi coprocessors are available.
For Intel Xeon processor-based clusters use the Intel AVX or Intel AVX2-optimized version of the benchmark, depending on supported instruction set, and run one MPI process per CPU socket and one OpenMP thread per physical CPU core skipping SMT threads.
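The host recommendation above (one MPI process per socket, one OpenMP thread per physical core) can be derived from the output of lscpu. The sketch below is illustrative only: host_layout is a hypothetical name, and it assumes the English field labels "Socket(s)" and "Core(s) per socket" that lscpu prints under LC_ALL=C.

```shell
# Hypothetical helper: read lscpu-style text on stdin and print the
# mpiexec.hydra options for one MPI process per socket and one OpenMP
# thread per physical core (SMT threads are not counted).
host_layout() {
    awk -F: '
        /^[ \t]*Socket\(s\)$/          { sockets = $2 + 0 }
        /^[ \t]*Core\(s\) per socket$/ { cores = $2 + 0 }
        END {
            # Fail if either field was missing from the input.
            if (sockets == 0 || cores == 0) exit 1
            printf "-perhost %d env OMP_NUM_THREADS=%d\n", sockets, cores
        }' RS='\n' OFS=: |
    cat
}
```

On a node, run it as: LC_ALL=C lscpu | host_layout. Note that awk here matches each label against the whole input line, so the patterns are written against $0 up to the colon; if your lscpu prints different labels, adjust the two patterns.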
Intel Xeon Phi coprocessor-enabled systems support two execution modes: symmetric and offload. In offload mode the benchmark uses the host for MPI communication and offloads computational work to the Intel Xeon Phi coprocessors. In symmetric mode MPI ranks run on both Intel Xeon processors and Intel Xeon Phi coprocessors, potentially resulting in better performance. Offload mode uses fewer MPI processes per system and scales better for large runs. Symmetric mode requires more MPI processes per node to achieve good balancing, which may lead to limited scalability.
For systems with a single Intel Xeon Phi coprocessor, symmetric execution mode is recommended with one MPI process per socket and two MPI processes per coprocessor. On the Intel Xeon host, each process should run one OpenMP thread per processor core skipping hyper threads. On the Intel Xeon Phi coprocessor each process should run four OpenMP threads per core with a single core left free. For instance on Intel Xeon Phi 7120D, which has 61 cores, each of two MPI processes should run 120 OpenMP threads.
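For symmetric runs, the machinefile lists the Intel Xeon hosts followed by the Intel Xeon Phi coprocessors. A minimal sketch for generating such a file is shown here. The function name symmetric_machinefile is hypothetical, and the coprocessor naming scheme (host name plus a "-mic0" suffix) is an assumption that is not stated in this article; substitute the names used on your cluster.

```shell
# Hypothetical helper: read host names on stdin (one per line) and print a
# machinefile body with all hosts first, then one coprocessor per host.
# ASSUMPTION: coprocessors are named "<host>-<suffix>", default suffix "mic0".
symmetric_machinefile() {
    hosts=$(cat)
    # First section: the Intel Xeon hosts, unchanged.
    printf '%s\n' "$hosts"
    # Second section: the same names with the coprocessor suffix appended.
    printf '%s\n' "$hosts" | sed "s/\$/-${1:-mic0}/"
}
```

For example, symmetric_machinefile < hosts.txt > .machinefile writes the file used by the symmetric command line, where hosts.txt is a placeholder for your own host list.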
For systems with two or more Intel Xeon Phi coprocessors, offload execution mode works best with two MPI processes per Intel Xeon Phi coprocessor. The number of OpenMP threads for coprocessors should be set to four for each coprocessor core with a single core left free. For instance, on Intel Xeon Phi 7120D, which has 61 cores, each MPI process should run 120 OpenMP threads.
Intel Xeon Phi coprocessors have 57, 60, or 61 cores, depending on the specific model, with each core supporting four threads. The number of OpenMP threads in benchmark runs with Intel Xeon Phi coprocessors should be set to use all cores but one, which is reserved for MPI or offload communications. For example, for 61-core coprocessors, 240 threads should be used by the benchmark. Finally, choose the problem size to achieve the best performance: it should be large enough to keep all available cores busy, yet small enough that all tasks fit in available memory.
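The thread-count rule (four threads per core, one core left free, divided among the MPI processes on the coprocessor) reduces to simple arithmetic. A minimal sketch, with the hypothetical function name mic_threads:

```shell
# Hypothetical helper: OpenMP threads per MPI process on a coprocessor.
#   $1 = number of coprocessor cores (57, 60, or 61)
#   $2 = MPI processes sharing the coprocessor (default 1)
mic_threads() {
    cores=$1
    ranks=${2:-1}
    # Leave one core free, use four threads on each remaining core,
    # and split the total evenly among the MPI processes.
    echo $(( (cores - 1) * 4 / ranks ))
}
```

For a 61-core coprocessor, mic_threads 61 gives 240 and mic_threads 61 2 gives 120, matching the values used in the command lines in this article.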
The chart below shows the performance results reported by different benchmark modes.
Figure 1: Performance measurements of optimized implementations of HPCG benchmark
All numerical experiments were performed on an Infiniband* FDR-linked cluster of 16 computational nodes. Each node contained two Intel Xeon processor E5-2697 v3 CPUs (dual-socket, 28 cores in total), 64 GB of RAM, and two Intel Xeon Phi 7120 coprocessors. Performance results for the following configurations are reported on the chart:
- Host+2 coprocessors (symmetric): Symmetric version of the benchmark, problem size 144, 2 Intel Xeon Phi 7120 per node, 2 MPI processes per host with 14 OpenMP threads, 2 MPI processes per Intel Xeon Phi 7120 with 120 OpenMP threads.
- Host+2 coprocessors (offload): Offload version of the benchmark, problem size 168, 2 Intel Xeon Phi 7120 per node, 2 MPI processes per host, 240 OpenMP threads per coprocessor.
- Host+coprocessor (symmetric): Symmetric version of the benchmark, problem size 144, 1 Intel Xeon Phi 7120 per node, 2 MPI processes per host with 14 OpenMP threads, 2 MPI processes per Intel Xeon Phi 7120 with 120 OpenMP threads.
- Host: Host version of the benchmark, problem size 168, 2 MPI processes per node (1 per socket), 1 thread per core (14 OpenMP threads per MPI process) skipping hyper threads.
Pre-built benchmark executables require the following hardware and software:
- Intel AVX or Intel AVX2 enabled Intel Xeon processor or Intel Xeon Phi coprocessor
- Red Hat* Enterprise Linux* 6 (64-bit) or compatible OS
- Intel® MPI Library 4.1.3 or later
- Intel C/C++ Compiler 15.0 or later
- Intel MPSS 3.4 or later (for Intel Xeon Phi coprocessors only)
Using the source code provided and the optimized kernels, you can build executables compatible with other MPI implementations. See the QUICKSTART file in the package for instructions.
Q – Where can I download the Intel Optimized High Performance Conjugate Gradient Benchmark?
A – Starting with Intel MKL 11.3 Update 1, the Intel Optimized High Performance Conjugate Gradient Benchmark is included in the Intel MKL Benchmarks package. Please see the KB article "Intel® Math Kernel Library Benchmarks (Intel® MKL Benchmarks)" to download the HPCG package.
Q – What is the password of the archived package?
A – Please see license.txt file for directions for extracting the package contents.
Q - Where can I get support?
A - We encourage you to visit Intel Math Kernel Library support forum (https://software.intel.com/en-us/forums/intel-math-kernel-library) or use Intel Premier Support (https://premier.intel.com).
Q – Can I use this package on systems with hosts that do not support Intel AVX or Intel AVX2?
A – The current version of the Intel Optimized High Performance Conjugate Gradient Benchmark is optimized for Intel AVX or Intel AVX2 and can run only on systems that support these instruction sets. The binary issues an error message if it is executed on an architecture without Intel AVX support or on a non-Intel processor.
Q – How can I measure performance using this benchmark?
A – Please see the QUICKSTART.txt file in the package.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Intel, the Intel logo, Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.