# Intel® Math Kernel Library (Intel® MKL) 2018 Update 2 ScaLAPACK Symmetric Eigensolver Enhancements

## Introduction

The symmetric eigenvalue problem arises in the context of numerous scientific applications. Intel MKL provides functions to solve the symmetric eigenvalue problem in the LAPACK domain for shared memory applications, and in the ScaLAPACK domain for distributed memory applications. Intel MKL 2018 Update 2 provides newly optimized ScaLAPACK eigensolver functions that deliver up to three times the performance of the same functions in earlier Intel MKL releases.

## Algorithm

Let A be a real symmetric or complex Hermitian matrix of order N. A scalar λ is called an eigenvalue and a nonzero column vector z the corresponding eigenvector if

A*z = λ*z.

λ is real whenever A is real symmetric or complex Hermitian.

The goal of the symmetric eigenproblem is to compute the eigenvalues λ and optionally, the corresponding eigenvectors z for a given symmetric or Hermitian matrix A. This task can be formulated as a decomposition of A into the following form:

A = Z*Λ*Z^H,

where Z is an orthogonal (unitary, in the complex Hermitian case) matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix containing the eigenvalues of A. Such a decomposition exists for any real symmetric or complex Hermitian matrix.
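As a small numerical illustration of this decomposition (using NumPy here rather than Intel MKL's LAPACK/ScaLAPACK interfaces), one can compute the factorization of a random symmetric matrix and verify that it reconstructs A:

```python
import numpy as np

# Build a small random real symmetric matrix A.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2  # symmetrize

# eigh returns the eigenvalues (real, in ascending order) and an
# orthogonal matrix Z whose columns are the eigenvectors.
lam, Z = np.linalg.eigh(A)

# Reconstruct A = Z * diag(lam) * Z^T and check the residual.
A_rec = Z @ np.diag(lam) @ Z.T
print(np.allclose(A, A_rec))            # True: the decomposition holds
print(np.allclose(Z.T @ Z, np.eye(n)))  # True: Z is orthogonal
```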

The traditional first step in computing this factorization is to reduce A to tridiagonal form:

A = Q*T*Q^H.

Then, an iterative method finds the eigenvalues and eigenvectors of the tridiagonal matrix:

T = V*Λ*V^H.

The final step (called backward transformation) is needed only if eigenvectors are requested. In this step, the eigenvectors of the initial matrix are computed from the orthogonal matrices Q and V as:

A = Q*T*Q^H = Q*V*Λ*V^H*Q^H = Z*Λ*Z^H, where Z = Q*V.
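The three steps above can be sketched in a few lines of SciPy (a shared-memory illustration, not the Intel MKL routines themselves; for a symmetric matrix the Hessenberg form computed by `scipy.linalg.hessenberg` is tridiagonal):

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = (M + M.T) / 2  # real symmetric test matrix

# Step 1: reduce A to tridiagonal form, A = Q * T * Q^T.
T, Q = hessenberg(A, calc_q=True)
d = np.diag(T)      # main diagonal of T
e = np.diag(T, -1)  # off-diagonal of T

# Step 2: solve the tridiagonal eigenproblem, T = V * Lambda * V^T.
lam, V = eigh_tridiagonal(d, e)

# Step 3: backward transformation, Z = Q * V.
Z = Q @ V

# The columns of Z are eigenvectors of the original matrix A.
print(np.allclose(A @ Z, Z * lam))  # True up to round-off
```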

This is the traditional method, used in the reference implementations in Netlib LAPACK and ScaLAPACK. Reduction to tridiagonal form (implemented by the ?SYTRD routine in LAPACK, and P?SYTRD in ScaLAPACK) is the most time-consuming part of this method. About one-half of the operations in the reduction are performed by Level-3 BLAS (the symmetric rank-2k update), while the rest use Level-2 BLAS. For Level-3 BLAS, the numbers of floating-point (FP) and memory operations are estimated as O(n³) and O(n²), respectively, where n is the dimension of A, while for Level-2 BLAS both counts are O(n²). Because FP operations dominate memory accesses in Level-3 BLAS, data can be re-used effectively, and performance close to peak can be achieved on computers with a multi-level memory hierarchy. Level-2 BLAS, in contrast, is memory-bound, which makes it the main bottleneck of the eigensolver.

In 1996, Bischof, Lang and Sun published the Successive Band Reduction (SBR) approach, which addresses this bottleneck. In the SBR approach, the original matrix is first reduced to banded form, and then to tridiagonal form:

A = Q1*B*Q1^H = Q1*(Q2*T*Q2^H)*Q1^H = Q*T*Q^H, where B is a banded matrix and Q = Q1*Q2.

The full-to-band reduction accounts for the majority of FP operations in the overall tridiagonal reduction, and can be implemented entirely with Level-3 BLAS (or PBLAS for ScaLAPACK). Details on the specifics of this algorithm can be found in many papers available online.

Additional performance typically comes at a cost: in the case of the SBR algorithm, internal memory consumption, the number of FP operations, and code complexity all increase compared to the traditional one-step algorithm.

## Practical Suggestions for Intel MKL Eigensolver Users

Since Intel MKL 11.1 Update 3, two LAPACK drivers have used the SBR algorithm: ?SYEV and ?SYEVD. A few SBR-enabled eigensolvers, ?SYEV[X|D]_2STAGE, were introduced in Netlib LAPACK 3.7.0, released about two years after Intel MKL 11.1 Update 3. These *_2STAGE implementations were integrated into Intel MKL 2018 Update 2, but strictly for compatibility with Netlib LAPACK. The Intel MKL LAPACK eigensolvers ?SYEV[D] support the general case where both eigenvalues and eigenvectors are computed, while the *_2STAGE functions only support computing eigenvalues (JOBZ=’N’). Since ?SYEV[D] also provide better performance, it is recommended to use them instead of the *_2STAGE functions.

In Intel MKL 2018 Update 2, the SBR algorithm was implemented in the ScaLAPACK symmetric eigensolver drivers P?SYEVD/P?HEEVD, P?SYEVX/P?HEEVX and P?SYEVR/P?HEEVR. Additionally, redistribution of the user input matrix onto a more optimal process grid was added to the EVX drivers, and improved in the EVD drivers, because it was observed that some applications distribute the matrices passed to ScaLAPACK routines sub-optimally in the block-cyclic layout (https://software.intel.com/en-us/mkl-developer-reference-c-scalapack-array-descriptors). A good rule of thumb for a ScaLAPACK distributed matrix is to make the shape of the process grid similar to the shape of the matrix: for square matrices the grid should be close to square, and for an MxN matrix with M>>N (a tall-and-skinny matrix) a process grid of size PxQ with P>>Q works well. In most cases, a reasonable block size for the block-cyclic distribution is somewhere between 24 and 64. Experiments show that the time needed for the matrix redistribution is negligible compared to the computational time, but additional memory is needed for the redistribution. As memory has become less of a limiting factor in High Performance Computing (HPC), it was decided to increase the demand for additional memory (the WORK/RWORK arrays) when a significant performance speed-up is expected.
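To make the block-cyclic layout concrete, here is a small Python sketch (not MKL code) of the standard ScaLAPACK ownership rule: global entry (i, j) is owned by process (p, q) of a P x Q grid, given row/column block sizes MB and NB and a zero source process. Tiny block and grid sizes are used for readability; real runs would use the 24-64 block sizes suggested above.

```python
def owner(i, j, mb, nb, P, Q):
    """Process-grid coordinates owning global (0-based) matrix entry (i, j)
    under a 2-D block-cyclic distribution with blocks of size mb x nb."""
    return (i // mb) % P, (j // nb) % Q

# 8x8 matrix on a 2x2 process grid with 2x2 blocks:
# print the owning process (pq) of every entry.
P, Q, mb, nb = 2, 2, 2, 2
for i in range(8):
    row = [owner(i, j, mb, nb, P, Q) for j in range(8)]
    print(" ".join(f"{p}{q}" for p, q in row))
```

The printed pattern shows how blocks cycle over the grid in both dimensions, which is why a grid shape similar to the matrix shape keeps the per-process workload balanced.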

When a new algorithm is added to Intel MKL, it is preferable to integrate it under an existing API (rather than creating new entry points): users do not need to change their source code, and the performance improvements are available immediately after relinking with the new Intel MKL version. This principle was followed when integrating SBR in both LAPACK and ScaLAPACK, and conditions inside the existing eigensolvers now determine whether the traditional or the SBR algorithm should be used. Experiments show that SBR pays off for larger matrices. For example, the matrix size should be at least 3000-4000 for a single-node run (<50 MPI ranks) and about 13000 for a run on 400 MPI processes (these thresholds are provided only as an example and are subject to change).

There are four ScaLAPACK drivers available in Intel MKL for solving the symmetric eigenvalue problem. All of them are based on the same reduction approach; the difference between them is the method used for finding the eigenvalues and eigenvectors of the tridiagonal matrix T. Some drivers support more options than others (see the table below).

*redist indicates that internal matrix redistribution is supported

The EVD (divide and conquer) driver is a balanced option providing both good stability and performance for the JOBZ=’V’ case (eigenvectors are needed). If only eigenvalues are needed (JOBZ=’N’), either EVX (bisection followed by inverse iteration) or EVR (Multiple Relatively Robust Representations, or MRRR, algorithm) will be more suitable.
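The same driver names exist in shared-memory LAPACK, and SciPy exposes them through the `driver` argument of `scipy.linalg.eigh`, which makes for an easy single-node illustration of the trade-off (the distributed P?SYEV* drivers themselves require MPI/BLACS and are not shown here):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
n = 100
M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Divide and conquer (EVD): a balanced choice when eigenvectors are needed.
w_evd, Z = eigh(A, driver='evd')

# MRRR (EVR) for eigenvalues only, analogous to JOBZ='N'.
w_evr = eigh(A, eigvals_only=True, driver='evr')

print(np.allclose(w_evd, w_evr))  # both drivers produce the same spectrum
```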

EVX is typically the fastest driver and scales very well, but it should be used with caution for matrices with highly clustered eigenvalues. After computing the eigenvalues, EVX splits the spectrum of the matrix across MPI processes evenly into intervals. This approach works only if each cluster belongs entirely to a single interval; otherwise an error is returned. In Intel MKL 2018 Update 2, this technique was improved by allowing EVX to split the spectrum into uneven intervals, taking information about clusters into account. There is still a limitation: a cluster (and hence an interval) must be no larger than N/4, otherwise an error is returned (the chance of encountering a matrix with such large clusters in real applications is low). A slight downside of this approach is that the amount of memory needed for the computation depends on the number and size of clusters in the spectrum. As this information is not known in advance, Intel MKL allocates extra memory internally when it is needed.
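The idea behind the improved splitting can be sketched as follows. This is a hypothetical illustration, not Intel MKL's actual implementation: eigenvalues closer than `tol` are treated as one cluster, and whole clusters are packed greedily into roughly even per-process intervals so that no cluster is ever cut.

```python
def split_spectrum(w, nprocs, tol=1e-8):
    """Split the sorted eigenvalue list w into nprocs index intervals
    (start, end) of roughly equal size, never cutting a cluster."""
    # Index i starts a new cluster if the gap w[i] - w[i-1] exceeds tol.
    starts = [0] + [i for i in range(1, len(w)) if w[i] - w[i - 1] > tol]
    starts.append(len(w))
    clusters = [(starts[k], starts[k + 1]) for k in range(len(starts) - 1)]

    # Greedily pack whole clusters into intervals of ~len(w)/nprocs entries.
    target = len(w) / nprocs
    intervals, cur_start, count = [], 0, 0
    for lo, hi in clusters:
        count += hi - lo
        if count >= target and len(intervals) < nprocs - 1:
            intervals.append((cur_start, hi))
            cur_start, count = hi, 0
    intervals.append((cur_start, len(w)))
    return intervals

# A cluster of three eigenvalues at 0.1 and two at 1.0 must stay intact.
w = [0.0, 0.1, 0.1, 0.1, 0.5, 0.9, 1.0, 1.0]
print(split_spectrum(w, 2, tol=0.05))  # → [(0, 4), (4, 8)]
```

The resulting intervals are uneven when the clustering demands it, which mirrors why the memory needed by EVX depends on the cluster structure of the spectrum.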

EVD performance is usually slightly behind that of EVX, and EVR struggles to scale when the number of MPI processes is large (several hundred). The EV driver is the slowest and supports neither SBR nor matrix redistribution, but it is also the most reliable. It should be used only if all other eigensolvers have failed.

The SBR approach has not yet been implemented for the case when only part of the eigenvalue spectrum is requested. As a result, finding all the eigenvalues and eigenvectors with EVX or EVR can sometimes be faster than finding only a subset.
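Partial-spectrum requests can again be illustrated on a single node with SciPy, whose `subset_by_index` argument plays the role of ScaLAPACK's RANGE='I' with IL/IU (an analogue for illustration only, not the MKL ScaLAPACK API):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n = 200
M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Request only the 10 smallest eigenvalues (like RANGE='I', IL=1, IU=10).
w_part = eigh(A, eigvals_only=True, subset_by_index=[0, 9], driver='evr')

# Full spectrum for comparison.
w_full = eigh(A, eigvals_only=True, driver='evd')

print(np.allclose(w_part, w_full[:10]))  # the subset matches the full run
```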

## Performance Results

The charts below demonstrate the ScaLAPACK eigensolver performance improvements of Intel MKL 2018 Update 2 over Intel MKL 2018 Update 1. All experiments were carried out on the TACC* Stampede 2* cluster, whose nodes are equipped with two 24-core Intel® Xeon® Platinum 8160 processors. See the configuration section below for more details. The sequential threading layer of Intel MKL was linked.

Chart 1. Intel MKL ScaLAPACK symmetric eigensolver performance for matrix N=20000.

Chart 2. Intel MKL ScaLAPACK symmetric eigensolver performance for matrix N=50000.

## Experiments with Real Applications

The open source applications CP2K* and MiniDFT* were chosen to validate the improvements in Intel MKL ScaLAPACK symmetric eigensolver performance.

CP2K* is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. Reference: https://www.cp2k.org/

MiniDFT* is a plane-wave density functional theory (DFT) mini-app for modeling materials. Reference: http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/minidft/

Below you can find a few details about CP2K* and MiniDFT* runs.

In order to estimate how the actual computations improved, we excluded the initialization/preparation steps from the applications’ total time. This means that the times reported in the chart are the “scf_env_do_scf_inner_loop” and “electrons” timings from the CP2K* and MiniDFT* output logs, respectively.

Chart 3. CP2K* and MiniDFT* performance with Intel MKL.

One can see that with Intel MKL 2018 Update 2 it is possible to improve CP2K* performance by 13% and MiniDFT* performance by 8%, compared to the previous Intel MKL release.

## Test Configuration and Optimization Notices

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configuration: TACC* Stampede 2* cluster equipped with Intel® Xeon® Platinum 8160 nodes, 48 cores on two sockets (24 cores/socket), 2.1 GHz nominal clock rate (1.4-3.7 GHz depending on instruction set and number of active cores), 192 GB (2.67 GHz) RAM. The interconnect is a 100 Gb/sec Intel® Omni-Path (OPA) network. Software: Intel icc/ifort 17.0, Intel MKL 2018 Update 1 and Intel MKL 2018 Update 2. The sequential version of Intel MKL was used (no OpenMP* threading). Benchmark source: Intel Corporation.

Optimization notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804