Developer Reference

Contents

Parallel Direct Sparse Solver for Clusters Interface

The Parallel Direct Sparse Solver for Clusters Interface solves large linear systems of equations with sparse matrices on clusters. It is
  • high-performing
  • robust
  • memory efficient
  • easy to use
A hybrid implementation combines Message Passing Interface (MPI) technology for data exchange between parallel tasks (processes) running on different nodes, and OpenMP* technology for parallelism inside each node of the cluster. This approach effectively uses modern hardware resources such as clusters consisting of nodes with multi-core processors. The solver code is optimized for the latest Intel processors, but also performs well on clusters consisting of non-Intel processors.
Code examples are available in the Intel® MKL installation examples directory.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
This notice covers the following instruction sets: SSE2, SSE4.2, AVX2, AVX-512.

Parallel Direct Sparse Solver for Clusters Interface Algorithm

Parallel Direct Sparse Solver for Clusters Interface solves a set of sparse linear equations A*X = B with multiple right-hand sides using a distributed LU, LL^T, LDL^T, or LDL^* factorization, where A is an n-by-n matrix, and X and B are n-by-nrhs matrices.
The solution comprises four tasks:
  • analysis and symbolic factorization;
  • numerical factorization;
  • forward and backward substitution including iterative refinement;
  • termination to release all internal solver memory.
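The four tasks map onto the phase argument of the solver routine. The following condensed sketch (assuming Intel MKL's cluster_sparse_solver C interface; matrix setup, iparm initialization, and MPI startup are omitted, and mtype = -2 assumes a real symmetric indefinite matrix) shows one plausible call sequence, not a complete program:

```c
#include "mkl_cluster_sparse_solver.h"  /* requires Intel MKL and an MPI library */

/* Condensed phase sequence; pt, iparm, and the CSR arrays (ia, ja, a)
   must be set up beforehand as in the shipped examples. comm is the
   MPI communicator converted with MPI_Comm_c2f(). */
void solve_phases(void *pt, MKL_INT *iparm, MKL_INT n, double *a,
                  MKL_INT *ia, MKL_INT *ja, double *b, double *x, int comm)
{
    MKL_INT maxfct = 1, mnum = 1, mtype = -2;   /* real symmetric indefinite */
    MKL_INT nrhs = 1, msglvl = 0, error = 0, phase;
    MKL_INT *perm = NULL;

    phase = 11;   /* analysis and symbolic factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    phase = 22;   /* numerical factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    phase = 33;   /* forward/backward substitution with iterative refinement */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    phase = -1;   /* termination: release all internal solver memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
}
```

Each call reuses the same argument list; only phase changes. The shipped examples show the complete setup, including error handling after each phase.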
The solver first computes a symmetric fill-in reducing permutation P based on the nested dissection algorithm from the METIS package [Karypis98] (included with Intel® MKL), followed by a Cholesky or other factorization (depending on the matrix type) [Schenk00-2] of PAP^T. The solver uses either diagonal pivoting, or 1x1 and 2x2 Bunch-Kaufman pivoting for symmetric indefinite or Hermitian matrices, before finding an approximation of X by forward and backward substitution and iterative refinement.
The initial matrix A is perturbed whenever numerically acceptable 1x1 and 2x2 pivots cannot be found within the diagonal blocks. One or two passes of iterative refinement may be required to correct the effect of the perturbations. This restricted notion of pivoting with iterative refinement is effective for highly indefinite symmetric systems. For a large set of matrices from different application areas, the accuracy of this method is comparable to a direct factorization method that uses complete sparse pivoting techniques [Schenk04].
Parallel Direct Sparse Solver for Clusters additionally improves pivoting accuracy by applying symmetric weighted matching algorithms. These methods identify large entries in the coefficient matrix A that, if permuted close to the diagonal, enable the factorization process to identify more acceptable pivots and proceed with fewer pivot perturbations. The methods are based on maximum weighted matching and improve the quality of the factor in a way complementary to the alternative idea of using more complete pivoting techniques.

Parallel Direct Sparse Solver for Clusters Interface Matrix Storage

The sparse data storage in the Parallel Direct Sparse Solver for Clusters Interface follows the scheme described in the Sparse Matrix Storage Formats section, using the variable ja for columns, ia for rowIndex, and a for values. Column indices ja must be in increasing order within each row.
When an input data structure is not accessed in a call, a NULL pointer or any valid address can be passed as a placeholder for that argument.

Algorithm Parallelization and Data Distribution

Intel® MKL Parallel Direct Sparse Solver for Clusters enables parallel execution of the solution algorithm with efficient data distribution.
The master MPI process performs the symbolic factorization phase to represent matrix A as a computational tree. Matrix A is then divided among all MPI processes in a one-dimensional manner, and the same distribution is used for the L-factor (the lower triangular matrix in the Cholesky decomposition). Matrix A and all required internal data are broadcast to slave MPI processes. Each MPI process fills in its own parts of the L-factor with initial values of the matrix A.
Parallel Direct Sparse Solver for Clusters Interface computes all independent parts of the L-factor completely in parallel. When a block of the factor must be updated by other blocks, these updates are independently passed to a temporary array on each updating MPI process, and the result is then gathered into the updated block using the MPI_Reduce() routine. The computations within an MPI process are dynamically divided among OpenMP threads using pipelining parallelism with a combination of left- and right-looking techniques similar to those of the PARDISO* software. Level 3 BLAS operations from Intel® MKL ensure highly efficient performance of block-to-block update operations.
During forward/backward substitutions, respective Right Hand Side (RHS) parts are distributed among all MPI processes. All these processes participate in the computation of the solution. Finally, the solution is gathered on the master MPI process.
This approach demonstrates good scalability on clusters with InfiniBand* technology. Another advantage of the approach is the effective distribution of the L-factor among cluster nodes. This enables the solution of problems with a much higher number of non-zero elements than is possible with any symmetric multiprocessing (SMP) in-core direct solver.
The algorithm ensures that the memory required to keep internal data on each MPI process decreases as the number of MPI processes in a run increases. However, the solver requires that matrix A and some other internal arrays fit completely into the memory of each MPI process.
To get the best performance, run one MPI process per physical node and set the number of OpenMP* threads per node equal to the number of physical cores on the node.
Instead of calling MPI_Init(), initialize MPI with MPI_Init_thread() and set the MPI threading level to MPI_THREAD_FUNNELED or higher. For details, see the code examples in <install_dir>/examples.
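A minimal sketch of this initialization pattern, assuming a standard MPI installation (the solver call itself is elided):

```c
#include <mpi.h>     /* requires an MPI implementation */
#include <stdio.h>

/* Initialize MPI at MPI_THREAD_FUNNELED (or higher) before using the
   cluster solver, instead of plain MPI_Init(). */
int main(int argc, char **argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support MPI_THREAD_FUNNELED\n");
        MPI_Finalize();
        return 1;
    }

    /* ... set up the distributed matrix and call the solver phases here ... */

    MPI_Finalize();
    return 0;
}
```

MPI_Init_thread() reports the threading level the library actually provides in the last argument, so the check above guards against an MPI build without thread support.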
