Developer Reference

Contents

p?pbtrsv

Solves a single triangular linear system via frontsolve or backsolve where the triangular matrix is a factor of a banded matrix computed by
p?pbtrf
.

Syntax

void
pspbtrsv
(
char
*uplo
,
char
*trans
,
MKL_INT
*n
,
MKL_INT
*bw
,
MKL_INT
*nrhs
,
float
*a
,
MKL_INT
*ja
,
MKL_INT
*desca
,
float
*b
,
MKL_INT
*ib
,
MKL_INT
*descb
,
float
*af
,
MKL_INT
*laf
,
float
*work
,
MKL_INT
*lwork
,
MKL_INT
*info
);
void
pdpbtrsv
(
char
*uplo
,
char
*trans
,
MKL_INT
*n
,
MKL_INT
*bw
,
MKL_INT
*nrhs
,
double
*a
,
MKL_INT
*ja
,
MKL_INT
*desca
,
double
*b
,
MKL_INT
*ib
,
MKL_INT
*descb
,
double
*af
,
MKL_INT
*laf
,
double
*work
,
MKL_INT
*lwork
,
MKL_INT
*info
);
void
pcpbtrsv
(
char
*uplo
,
char
*trans
,
MKL_INT
*n
,
MKL_INT
*bw
,
MKL_INT
*nrhs
,
MKL_Complex8
*a
,
MKL_INT
*ja
,
MKL_INT
*desca
,
MKL_Complex8
*b
,
MKL_INT
*ib
,
MKL_INT
*descb
,
MKL_Complex8
*af
,
MKL_INT
*laf
,
MKL_Complex8
*work
,
MKL_INT
*lwork
,
MKL_INT
*info
);
void
pzpbtrsv
(
char
*uplo
,
char
*trans
,
MKL_INT
*n
,
MKL_INT
*bw
,
MKL_INT
*nrhs
,
MKL_Complex16
*a
,
MKL_INT
*ja
,
MKL_INT
*desca
,
MKL_Complex16
*b
,
MKL_INT
*ib
,
MKL_INT
*descb
,
MKL_Complex16
*af
,
MKL_INT
*laf
,
MKL_Complex16
*work
,
MKL_INT
*lwork
,
MKL_INT
*info
);
Include Files
  • mkl_scalapack.h
Description
The
p?pbtrsv
function
solves a banded triangular system of linear equations
A
(1:
n
,
ja
:
ja
+
n
-1)*
X
=
B
(
jb
:
jb+n
-1, 1:
nrhs
)
or
A
(1:
n
,
ja
:
ja
+
n
-1)
T
*
X
=
B
(
jb
:
jb+n
-1, 1:
nrhs
)
for real flavors,
A
(1:
n
,
ja
:
ja
+
n
-1)
H
*
X
=
B
(
jb
:
jb+n
-1, 1:
nrhs
)
for complex flavors,
where
A
(1:
n
,
ja
:
ja
+
n
-1)
is a banded triangular matrix factor produced by the Cholesky factorization code
p?pbtrf
and is stored in
A
(1:
n
,
ja
:
ja
+
n
-1)
and
af
. The matrix stored in
A
(1:
n
,
ja
:
ja
+
n
-1)
is either upper or lower triangular according to
uplo
.
The function
p?pbtrf
must be called first.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
This notice covers the following instruction sets: SSE2, SSE4.2, AVX2, AVX-512.
Input Parameters
uplo
(global) Must be
'U'
or
'L'
.
If
uplo
=
'U'
, upper triangle of
A
(1:
n
,
ja
:
ja
+
n
-1)
is stored;
If
uplo
=
'L'
, lower triangle of
A
(1:
n
,
ja
:
ja
+
n
-1)
is stored.
trans
(global) Must be
'N'
or
'T'
or
'C'
.
If
trans
=
'N'
, solve with
A
(1:
n
,
ja
:
ja
+
n
-1)
;
If
trans
=
'T'
or
'C'
for real flavors, solve with
A
(1:
n
,
ja
:
ja
+
n
-1)
T
.
If
trans
=
'C'
for complex flavors, solve with conjugate transpose (
A
(1:
n
,
ja
:
ja
+
n
-1
)
H
.
n
(global)
The number of rows and columns to be operated on, that is, the order of the distributed submatrix
A
(1:
n
,
ja
:
ja
+
n
-1)
.
n
0
.
bw
(global)
The number of subdiagonals in
'L'
or
'U'
, 0
bw
n
-1.
nrhs
(global)
The number of right hand sides; the number of columns of the distributed submatrix
B
(
jb
:
jb
+
n
-1, 1:
nrhs
)
;
nrhs
0
.
a
(local)
Pointer into the local memory to an array with the first size
lld_a
(
bw
+1)
, stored in
desca
.
On entry, this array contains the local pieces of the
n
-by-
n
symmetric banded distributed Cholesky factor
L
or
L
T
*
A
(1:
n
,
ja
:
ja
+
n
-1)
.
This local portion is stored in the packed banded format used in LAPACK. See the
Application Notes
below and the
ScaLAPACK manual for more detail on the format of distributed matrices.
ja
(global) The index in the global in the global matrix
A
that points to the start of the matrix to be operated on (which may be either all of
A
or a submatrix of
A
).
desca
(global and local) array of size
dlen_
. The array descriptor for the distributed matrix
A
.
If
1D type (
dtype_a
= 501)
, then
dlen
7
;
If
2D type (
dtype_a
= 1)
, then
dlen
9
.
Contains information on mapping of
A
to memory. (See ScaLAPACK manual for full description and options.)
b
(local)
Pointer into the local memory to an array of local lead size
lld_b
nb
.
On entry, this array contains the local pieces of the right hand sides
B
(
jb
:
jb+n
-1, 1:
nrhs
)
.
ib
(global) The row index in the global matrix
B
that points to the first row of the matrix to be operated on (which may be either all of
B
or a submatrix of
B
).
descb
(global and local) array of size
dlen_
. The array descriptor for the distributed matrix
B
.
If
1D type (
dtype_b
= 502)
, then
dlen
7
;
If
2D type (
dtype_b
= 1)
, then
dlen
9
.
Contains information on mapping of
B
to memory. Please, see ScaLAPACK manual for full description and options.
laf
(local)
The size of user-input auxiliary fill-in space
af
. Must be
laf
(
nb
+2*
bw
)*
bw
. If
laf
is not large enough, an error code will be returned and the minimum acceptable size will be returned in
af
[0]
.
work
(local)
The array
work
is a temporary workspace array of size
lwork
. This space may be overwritten in between
function calls
.
lwork
(local or global) The size of the user-input workspace
work
, must be at least
lwork
bw
*
nrhs
. If
lwork
is too small, the minimal acceptable size will be returned in
work
[0]
and an error code is returned.
Output Parameters
af
(local)
The array
af
is of size
laf
. It contains auxiliary fill-in space. The fill-in space is created in a call to the factorization
function
p?pbtrf
and is stored in
af
. If a linear system is to be solved using
p?pbtrs
after the factorization
function
,
af
must not be altered after the factorization.
b
On exit, this array contains the local piece of the solutions distributed matrix
X
.
work
[0]
On exit,
work
[0]
contains the minimum value of
lwork
.
info
(local)
= 0
: successful exit
< 0
: if the
i
-th argument is an array and the
j
-th entry
, indexed
j
-1,
had an illegal value,
then
info
= - (
i
*100 +
j
),
if the
i
-th argument is a scalar and had an illegal value,
then
info
= -
i
.
Application Notes
If the factorization
function
and the solve
function
are to be called separately to solve various sets of right-hand sides using the same coefficient matrix, the auxiliary space
af
must not be altered between calls to the factorization
function
and the solve
function
.
The best algorithm for solving banded and tridiagonal linear systems depends on a variety of parameters, especially the bandwidth. Currently, only algorithms designed for the case
N
/
P
>
>
bw
are implemented. These algorithms go by many names, including Divide and Conquer, Partitioning, domain decomposition-type, etc.
Algorithm description: Divide and Conquer. *
The Divide and Conquer algorithm assumes the matrix is narrowly banded compared with the number of equations. In this situation, it is best to distribute the input matrix
A
one-dimensionally, with columns atomic and rows divided amongst the processes. The basic algorithm divides the banded matrix up into
P
pieces with one stored on each processor, and then proceeds in 2 phases for the factorization or 3 for the solution of a linear system.
  1. Local Phase
    : The individual pieces are factored independently and in parallel. These factors are applied to the matrix creating fill-in, which is stored in a non-inspectable way in auxiliary space
    af
    . Mathematically, this is equivalent to reordering the matrix
    A
    as
    P
    A
    P
    T
    and then factoring the principal leading submatrix of size equal to the sum of the sizes of the matrices factored on each processor. The factors of these submatrices overwrite the corresponding parts of
    A
    in memory.
  2. Reduced System Phase
    : A small
    (
    bw
    *
    (
    P
    -1)) system is formed representing interaction of the larger blocks and is stored (as are its factors) in the space
    af
    . A parallel Block Cyclic Reduction algorithm is used. For a linear system, a parallel front solve followed by an analogous backsolve, both using the structure of the factored matrix, are performed.
  3. Back Subsitution Phase
    : For a linear system, a local backsubstitution is performed on each processor in parallel.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804