# Parallel Direct Sparse Solver for Clusters Interface

The Parallel Direct Sparse Solver for Clusters Interface solves large linear systems of equations with sparse matrices on clusters. It is:
• high performing
• robust
• memory efficient
• easy to use

A hybrid implementation combines Message Passing Interface (MPI) technology for data exchange between parallel tasks (processes) running on different nodes, and OpenMP* technology for parallelism inside each node of the cluster. This approach effectively uses modern hardware resources such as clusters consisting of nodes with multi-core processors. The solver code is optimized for the latest Intel processors, but also performs well on clusters consisting of non-Intel processors.
Code examples are available in the examples directory of the Intel® oneAPI Math Kernel Library installation.
Product and Performance Information: Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Notice revision #20201201.

## Parallel Direct Sparse Solver for Clusters Interface Algorithm

Parallel Direct Sparse Solver for Clusters Interface solves a set of sparse linear equations A*X = B with multiple right-hand sides using a distributed LU, LLᵀ, LDLᵀ, or LDLᴴ factorization, where A is an n-by-n matrix, and X and B are n-by-nrhs matrices. The solution comprises four tasks:
• analysis and symbolic factorization;
• numerical factorization;
• forward and backward substitution, including iterative refinement;
• termination to release all internal solver memory.
The solver first computes a symmetric fill-in reducing permutation P based on the nested dissection algorithm from the METIS package [Karypis98] (included with Intel® oneAPI Math Kernel Library), followed by the Cholesky or another type of factorization (depending on the matrix type) [Schenk00-2] of PAPᵀ. The solver uses either diagonal pivoting, or 1x1 and 2x2 Bunch-Kaufman pivoting for symmetric indefinite or Hermitian matrices, before finding an approximation of X by forward and backward substitution and iterative refinement.
The initial matrix A is perturbed whenever numerically acceptable 1x1 and 2x2 pivots cannot be found within the diagonal blocks. One or two passes of iterative refinement may be required to correct the effect of the perturbations. This restricted notion of pivoting with iterative refinement is effective for highly indefinite symmetric systems. For a large set of matrices from different application areas, the accuracy of this method is comparable to a direct factorization method that uses complete sparse pivoting techniques [Schenk04].
Parallel Direct Sparse Solver for Clusters additionally improves the pivoting accuracy by applying symmetric weighted matching algorithms. These methods identify large entries in the coefficient matrix A that, if permuted close to the diagonal, enable the factorization process to identify more acceptable pivots and proceed with fewer pivot perturbations. The methods are based on maximum weighted matching and improve the quality of the factor in a way complementary to the alternative idea of using more complete pivoting techniques.

## Parallel Direct Sparse Solver for Clusters Interface Matrix Storage

The sparse data storage in the Parallel Direct Sparse Solver for Clusters Interface follows the scheme described in the Sparse Matrix Storage Formats section, using the variable ja for columns, ia for rowIndex, and a for values. Column indices in ja must be in increasing order within each row.
When an input data structure is not accessed in a call, a NULL pointer or any valid address can be passed as a placeholder for that argument.

## Algorithm Parallelization and Data Distribution

Intel® oneAPI Math Kernel Library Parallel Direct Sparse Solver for Clusters enables parallel execution of the solution algorithm with efficient data distribution.
The master MPI process performs the symbolic factorization phase to represent matrix A as a computational tree. Matrix A is then divided among all MPI processes in a one-dimensional manner. The same distribution is used for the L-factor (the lower triangular matrix in the Cholesky decomposition). Matrix A and all required internal data are broadcast to the subordinate MPI processes. Each MPI process fills in its own parts of the L-factor with the initial values of matrix A.
Parallel Direct Sparse Solver for Clusters Interface computes all independent parts of the L-factor completely in parallel. When a block of the factor must be updated by other blocks, each updating MPI process first accumulates its updates independently in a temporary array; the solver then gathers the results into the updated block using the MPI_Reduce() routine. The computations within an MPI process are dynamically divided among OpenMP threads using pipelining parallelism with a combination of left- and right-looking techniques similar to those of the PARDISO* software. Level 3 BLAS operations from Intel® oneAPI Math Kernel Library ensure highly efficient performance of the block-to-block update operations.
During forward/backward substitutions, respective Right Hand Side (RHS) parts are distributed among all MPI processes. All these processes participate in the computation of the solution. Finally, the solution is gathered on the master MPI process.
This approach demonstrates good scalability on clusters with InfiniBand* technology. Another advantage of the approach is the effective distribution of the L-factor among cluster nodes. This enables the solution of problems with a much higher number of non-zero elements than is possible with any Symmetric Multiprocessing (SMP) in-core direct solver.
The algorithm ensures that the memory required to keep internal data on each MPI process decreases as the number of MPI processes in a run increases. However, the solver requires that matrix A and some other internal arrays fit completely in the memory of each MPI process.
To get the best performance, run one MPI process per physical node and set the number of OpenMP* threads per node equal to the number of physical cores on the node.
Initialize MPI with MPI_Init()