The performance of the Intel Optimized HPCG depends on many system parameters, including (but not limited to) the hardware configuration of the host and the MPI implementation used. To get the best performance for a specific system configuration, tune the combination of these parameters:
The number of MPI processes per host and OpenMP threads per process
Local problem size
On Intel Xeon processor-based clusters, use the Intel AVX, Intel AVX2, or Intel AVX-512 optimized version of the benchmark, depending on the supported instruction set. Run one MPI process per CPU socket and one OpenMP* thread per physical CPU core, skipping SMT threads.
On systems based on Intel Xeon Phi processors, use the Intel AVX-512 optimized version with four MPI processes per processor. Set the number of OpenMP threads to two for each processor core, with SMT turned on. For example, on the Intel Xeon Phi processor 7250, which has 68 cores, each MPI process should run 34 OpenMP threads (68 cores × 2 threads per core ÷ 4 MPI processes).
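The process and thread counts recommended above follow from simple arithmetic, which the sketch below works through. The function name `hpcg_layout` and its parameters are hypothetical helpers introduced for illustration, not part of the benchmark, and the two-socket, 24-core Intel Xeon host is an assumed example configuration rather than one taken from this guide.

```python
def hpcg_layout(sockets, cores_per_socket, ranks_per_socket, threads_per_core):
    """Return (MPI processes per host, OpenMP threads per MPI process).

    Hypothetical helper: it only encodes the arithmetic behind the
    recommendations in the text and does not query the hardware.
    """
    threads_per_socket = cores_per_socket * threads_per_core
    if threads_per_socket % ranks_per_socket != 0:
        raise ValueError("threads per socket do not divide evenly among MPI processes")
    ranks_per_host = sockets * ranks_per_socket
    threads_per_rank = threads_per_socket // ranks_per_socket
    return ranks_per_host, threads_per_rank


# Assumed Intel Xeon host: 2 sockets, 24 physical cores each,
# one MPI process per socket, one OpenMP thread per core (SMT skipped).
print(hpcg_layout(sockets=2, cores_per_socket=24,
                  ranks_per_socket=1, threads_per_core=1))  # (2, 24)

# Intel Xeon Phi processor 7250 from the text: 68 cores,
# four MPI processes per processor, two OpenMP threads per core.
print(hpcg_layout(sockets=1, cores_per_socket=68,
                  ranks_per_socket=4, threads_per_core=2))  # (4, 34)
```

The second value of each pair is the number to use for the OpenMP thread count of each MPI process; the first is the number of MPI processes to start on each host.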
For best performance, use a local problem size that is large enough to keep all available cores busy, but small enough that all tasks fit in the available memory.
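One way to bracket that choice is to estimate the memory footprint of a cubic local problem and pick the largest dimension that stays within a memory budget. The sketch below is only an illustration under stated assumptions: `largest_local_dim` is a hypothetical helper, `bytes_per_point` is a placeholder that should be measured for the benchmark build in use, the `fraction` of memory to target is an arbitrary choice, and the restriction to multiples of 8 is an assumption to verify against the benchmark's own requirements.

```python
def largest_local_dim(mem_bytes_per_rank, bytes_per_point, fraction=0.5, multiple=8):
    """Largest cubic local dimension n (a multiple of `multiple`) such that
    n**3 * bytes_per_point fits within fraction * mem_bytes_per_rank.

    Hypothetical helper; bytes_per_point is an assumed, user-supplied estimate.
    """
    budget = mem_bytes_per_rank * fraction
    n = multiple
    if n ** 3 * bytes_per_point > budget:
        raise ValueError("memory budget is too small for the minimum local size")
    # Grow n in steps of `multiple` while the next size still fits the budget.
    while (n + multiple) ** 3 * bytes_per_point <= budget:
        n += multiple
    return n


# Assumed example: 16 GiB available to each MPI process and a placeholder
# estimate of 1000 bytes per grid point, targeting half of that memory.
n = largest_local_dim(16 * 2**30, bytes_per_point=1000, fraction=0.5)
print(n)  # 200
```

The result is a starting point for experiments, not a tuned value: actual memory use depends on the benchmark version and data layout, so it is worth checking the footprint of a real run before settling on a size.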
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804