The Alternatives for Intel® IPP Legacy Small Matrices Domain

Igor Astakhov, Shane A Story

Introduction

Functionally in the Intel® Integrated Performance Primitives (Intel® IPP) Small Matrix Operations (ippMX) domain has been removed from Intel IPP 9.0. This action follows the deprecation announcement (Deprecated Function for Small Matrices and Realistic Rendering) made when Intel IPP 7.1 released. Customers who want to continue using functions in this domain should make a special request here . Note that the last optimizations made to the ippMX domain were a few years ago and only targeted Intel® Advanced Vector Extensions (Intel® AVX). Now that the package that contains the ippMX functionality is separate from the Intel IPP product, performance or stability issues found will not be addressed.

We recommend that you update your application to use the Intel® Math Kernel Library (Intel® MKL) which provides functionality similar to that in the ippMX domain. Functionality included in Intel MKL may be a good alternative to the ippMX functionality for a number of reasons:

Intel MKL is a computing math library of highly optimized, extensively threaded routines for applications that require maximum performance.
The library provides the same C programming language interfaces. Intel MKL can be called from applications written in either C or C++, as well as in any other language that can reference a C interface.
Intel MKL provides comprehensive vector, vector-matrix, and matrix-matrix functionality including the BLAS (level 1, 2, and 3) and LAPACK linear algebra routines.

Differences exist between the ippMX and Intel MKL functionality. These include:

API differences.
- Intel MKL relies on industry standard definitions such as BLAS and LAPACK.The function names are typically 4-6 characters in length, with the first character denoting the data type (D for double precision, corresponding to Ipp64f; S for single precision, corresponding to Ipp32f; Z for complex double precision; C for complex single precision).Additionally, because BLAS and LAPACK APIs are both Fortran-style, parameters are passed by reference rather than by value.However, there are C-style interfaces available in Intel MKL: CBLAS and LAPACKE, where most parameters are passed by value.
Data layout.
- The BLAS and LAPACK APIs are Fortran-style where matrix data is assumed to be stored in column-major order. C-style interfaces in Intel MKL, CBLAS and LAPACKE, support both row-major and col-major matrix storage.
Allowable matrix structures.
- The ippMX objects allow for both regular and irregular matrices, as well as arrays of matrices, with strides between elements, rows, and matrices in the case of an array of matrices.Intel MKL generally works only on regular matrices and only allows one nontrivial stride (the leading dimension).For row-major ordering, the elements within a row must be contiguous, and the rows must be equally spaced in memory. For column-major ordering, the elements within a column must be contiguous, and the columns must be equally spaced in memory.Only regular vectors, where the stride between elements is constant, are allowed.Intel MKL offers batch processing of matrix-matrix multiplications (?gemm_batch) starting from Intel MKL 11.3.Please refer to the Intel MKL Reference Manual for more details on batch processing.
Destructivity of parameters.
- Many of the ippMX routines are non-destructive, whereas many of the Intel MKL routines are destructive.For instance, to perform a Cholesky decomposition of a symmetric positive definite matrix A, the ippMX version will produce the Cholesky factor L without overwriting A while the Intel MKL version will overwrite A with the Cholesky factor L.If the original matrix A is needed after the Cholesky factorization, it can first be copied using an Intel MKL routine such as DLACPY or MKL_DOMATCOPY2. Note that the copy overhead may be significant.
Error reporting.
- The ippMX functions typically return an IppStatus value, signifying whether or not any errors were encountered. Intel MKL routines rely on XERBLA for error handling, and by default MKL will print out a message upon encountering an error.An XERBLA routine can be provided by the user to change the default behavior. Additionally, LAPACK routines may require a parameter (info) that provides error information specific to the routine at hand.
Performance on small matrices.
- The ippMX domain is tuned for matrices of small size (3x3-6x6).Intel MKL is tuned for matrices of all size, though the overheads of a library are more apparent on small sizes.For this reason, the ability to avoid some overheads (such as error checking) was introduced starting from Intel MKL 11.2.See the subsection “Improving Performance of Intel MKL for Small Size Problems” for more details.In addition, Intel MKL has optimizations for more recent architectures, such as Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512), while optimization of ippMX stopped at Intel® Advanced Vector Extensions (Intel® AVX).
Footprint size.
- The Intel MKL footprint is larger than the IPP footprint, both in terms of library size and growth of the binary executable when linking.
Supported precisions.
- For the most part, Intel MKL routines support both real and complex data types whereas the ippMX domain only covers real data types.

The following table describes where Intel MKL provides functionality similar to ippMX functionality.

ippMX functionality	Intel MKL Domain	Example functionality that can be reproduced with Intel MKL(may require multiple calls)	Intel MKL does not cover
Utility functions	BLAS Level-1, LAPACK	Copy, Loading identity matrix, Extract
Vector Functions	BLAS Level-1	Norm	Cross product
Matrix Functions	BLAS Level-2/3	Matrix multiplication	Norm, Determinant, Trace
Linear Systems & Least Squares	LAPACK	Factor, Solve
Symmetric Eigenvalue Problem	LAPACK	Eigenvalues, Eigenvectors
Math Functions for 2D/3D/4D vectors		Vector addition, subtraction	D3DX 10 Microsoft DirectX (Interpolation, Normalization, Rotation)

The following table details several Intel MKL functions equivalent to some of the ippMX functions.

ippMX Routine

Intel MKL Equivalent

Notes

ippmTranspose_m_64f,

ippmTranspose_m_32f

MKL_DOMATCOPY2,

MKL_SOMATCOPY2

There are simpler functions, but these allow 2 strides within the matrices like the ippMX domain

ippmMul_{mv,tv}_64f,

ippmMul_{mv,tv}_32f

DGEMV,

SGEMV

IppmMul_{mm,mt,tm,tt}_64f,

ippmMul_{mm,mt,tm,tt}_32f

DGEMM,

SGEMM

Limited stride support and paramater transformation is required for the column major APIs (see the code example)

ippmInvert_m_64f,

ippmInvert_m_32f

DGETRF/DGETRI,

SGETRF/SGETRI

User must call LU factorization (DGETRF, SGETRF) first before calling matrix inversion functions (DGETRI, SGETRI)

ippmQRDecomp_m_64f,

ippmQRDecomp_m_32f

DGEQRF,

SGEQRF

ippmQRBackSubst_mv_64f,

ippmQRBackSubst_mv_32f

DORMQR/DLATRS,

SORMQR/SLATRS

The expert drivers in LAPACK tend to use LU, so one has to construct this manually

For concreteness, examples are provided for two matrix operations. The first demonstrates matrix-matrix multiplication single precision (float); the second demonstrates a matrix inversion for double precision (double). These examples are only meant to be a starting point for porting code from ippMX to Intel MKL and do not cover all cases.

ippMX matrix-matrix multiplication:

/* Src1 matrix with width=4 and height=3 */
    Ipp32f pSrc1[3*4]  = { 1, 2, 3, 4,
                           5, 6, 7, 8,
                           4, 3, 2, 1 };

    int src1Width   = 4;               
    int src1Height  = 3;              
    int src1Stride2 = sizeof(Ipp32f);
    int src1Stride1 = 4*sizeof(Ipp32f); 

    /* Src2 matrix with width=3 and height=4 */
    Ipp32f pSrc2[4*3] = { 11, 15, 14,
                          12, 16, 13,
                          13, 17, 12,
                          14, 18, 11 };
    
    int src2Width   = 3;               
    int src2Height  = 4;
    int src2Stride2 = sizeof(Ipp32f);
    int src2Stride1 = 3*sizeof(Ipp32f);

    /*
     *  Destination matrix has width=src2Width=3 
     *  and height=src1Height=3 
     */
    Ipp32f pDst[3*3];                 
    int dstStride2 = sizeof(Ipp32f);
    int dstStride1 = 3*sizeof(Ipp32f);

    IppStatus status = ippmMul_mm_32f(
        pSrc1, src1Stride1, src1Stride2, src1Width, src1Height, 
        pSrc2, src2Stride1, src2Stride2, src2Width, src2Height,
        pDst, dstStride1, dstStride2);

Intel MKL matrix-matrix multiplication (CBLAS interface):

 /* Matrix dimensions A is 3x4 (mxk), 
  * B is 4x3 (kxn), and C is 3x3 (mxn) */
    int m   = 3;
    int n   = 3;
    int k   = 4;
    int lda = 4;
    int ldb = 3;
    int ldc = 3;
    /* Initialize matrices */
    float A[3*4] = { 1, 2, 3, 4,
                     5, 6, 7, 8,
                     4, 3, 2, 1 };

    float B[4*3] = { 11, 15, 14,
                     12, 16, 13,
                     13, 17, 12,
                     14, 18, 11 };
    float C[3*3];
    /* Parameters specific to MKL */
    float alpha = 1.0;
    float beta  = 0.0;
    /* CBLAS interface that works with row-major matrices */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
        m, n, k, 
        alpha, A, lda, 
        B, ldb, 
        beta, C, ldc);

Intel MKL matrix-matrix multiplication (column-major Fortran-style interface):

 /* Matrix dimensions A is 3x4 (mxk), 
     * B is 4x3 (kxn), and C is 3x3 (mxn) */
    int m   = 3;
    int n   = 3;
    int k   = 4;
    int lda = 4;
    int ldb = 3;
    int ldc = 3;

    /* Initialize matrices */
    float A[3*4] = { 1, 2, 3, 4,
                     5, 6, 7, 8,
                     4, 3, 2, 1 };
    float B[4*3] = { 11, 15, 14,
                     12, 16, 13,
                     13, 17, 12,
                     14, 18, 11 };
    float C[3*3];

    /* Parameters specific to MKL */
    float alpha = 1.0;
    float beta  = 0.0;

    /* Fortran 77 interface that can be called from C. 
     * Fortran 77 interface assumes column-major storage.
     * Therefore, swap A and B matrices to effectively compute
     * C^T = (A*B)^T = B^T*A^T. */
    sgemm("N", "N", 
        &n, &m, &k, 
        &alpha, B, &ldb, 
        A, &lda, 
        &beta, C, &ldc);

ippMX matrix inversion:

/* Src matrix with widthHeight=3 */
    Ipp64f pSrc[3*3] = {  1,  1, -1,
                         -1,  0,  2,
                         -1, -1,  2 };

    int widthHeight = 3;               
    int srcStride1  = 3*sizeof(Ipp64f);
    int srcStride2  = sizeof(Ipp64f);

    /* Dst matrix */
    Ipp64f pDst[3*3];                 
    int dstStride1 = 3*sizeof(Ipp64f);
    int dstStride2 = sizeof(Ipp64f);

    /* Buffer (work) space */
    Ipp64f pBuffer[3*3+3];

    /* IPP matrix inversion */
    IppStatus status = ippmInvert_m_64f(
        pSrc, srcStride1, srcStride2, 
        pBuffer,
        pDst, dstStride1, dstStride2, 
        widthHeight);

Intel MKL matrix inversion (LAPACKE interface):

/* Matrix dimensions */
    int n   = 3;
    int lda = 3;

    /*
     *  Source/destination (overwritten) A matrix.
     */
    double A[3*3] = {  1,  1, -1,
                      -1,  0,  2,
                      -1, -1,  2 };

    /* Buffer (work) space */
    int lwork = 64*3;
    double work[64*3];

    /* Parameters specific to MKL */
    int ipiv[3];
    int info;

    /* LAPACKE interface */
    /* First need to factorize A, then we can compute the inverse of A */
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, lda, ipiv);
    LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, A, lda, ipiv);

Intel MKL matrix inversion (column-major Fortran-style interface):

/* Matrix dimensions */
    int n   = 3;
    int lda = 3;

    /* Source/destination (overwritten) A matrix.
     * Entries of the A^T since Fortran interfaces are 
  * column-major. Computing inverse of A^T 
     * is equivalent to computing inverse of A stored 
     * in row-major since (A^T)^-1 = (A^-1)^T. */
    double A[3*3] = {  1,  1, -1,
                      -1,  0,  2,
                      -1, -1,  2 };

    /* Buffer (work) space */
    int lwork = 64*3;
    double work[64*3];

    /* Parameters specific to MKL */
    int ipiv[3];
    int info;

    /* Fortran 77 interfaces (callable from C) */
    /* First need to factorize A^T, then we can compute the inverse of A^T */
    dgetrf(&n, &n, A, &lda, ipiv, &info);
    dgetri(&n, A, &lda, ipiv, work, &lwork, &info);

The examples show some of the differences between ippMX and Intel MKL. First note that the CBLAS/LAPACKE interfaces are more similar to the ippMX API in that most parameters are passed by value (rather than by reference). The CBLAS/LAPACKE interfaces also natively allow a row-major data layout. However, for better performance (described in more detail below), the Fortran-style interface is recommended.

For the matrix-matrix multiplication code samples, note that ippmMul_mm_32f essentially performs C = A * B. Other ippMX routines perform matrix multiplication for transposed matrices, e.g., ippmMul_tm_32f performs C = A^T * B. The sgemm routine performs a more generalized operation: C = alpha * op(A) * op(B) + beta * C, where op(*) can represent transpose, non-transpose, or conjugate transpose (for complex data). By setting alpha to 1.0 and beta to 0.0, we get the operation: C = op(A) * op(B). In the example code, the A and B matrices are stored in row-major order, which is why the A and B matrices are swapped in the sgemm call. The transpose of a row-major matrix is equivalent to the same matrix stored in column-major format, (A*B)^T = B^T*A^T, which is why A and B matrices are swapped for the sgemm call. With the CBLAS interface, we can specify that the matrices are stored in row-major order, which is why the CblasRowMajor enum type is used.

In the matrix inversion example, ippmInvert_m_64f requires both a source and a destination matrix, so that after the call: dst = src^-1. The Intel MKL routines work differently in that they operate directly on the matrix so that the original source matrix is replaced with the result. In addition, the matrix must first be factored by calling dgetrf before it can be inverted with dgetri. The Intel MKL matrix inversion example stores A^T in column major order, and exploits the fact that (A^T)^-1= (A^-1)^T. Finally, note that both ippMX and Intel MKL routines require some workspace. For the function dgetri, the size of the workspace must be at least n, but better performance is usually attained with a workspace of size 16*n or 64*n.

Improving Performance of Intel MKL for Small Size Problems

The overhead of calling an Intel MKL function for small problem sizes can be significant when the function has a large number of parameters or internally checks error in input parameters. To reduce the performance overhead for these small size problems, the Intel MKL direct call feature works in conjunction with the compiler to preprocess the calling parameters and directly call or inline special optimized kernels that bypass error checking.

To activate the feature, do the following:

Compile your C code with the preprocessor macro depending on whether a threaded or sequential mode of Intel MKL is required by supplying the compiler option as explained below:

Intel MKL Mode	Macro	Compiler Option (Windows)	Compiler Option (Linux)
Threaded	MKL_DIRECT_CALL	/DMKL_DIRECT_CALL	-DMKL_DIRECT_CALL
Sequential	MKL_DIRECT_CALL_SEQ	/DMKL_DIRECT_CALL_SEQ	-DMKL_DIRECT_CALL_SEQ

Please refer to the Intel MKL User's Guide for further details on using MKL_DIRECT_CALL and to see the current list of supported subroutines.

Another way to reduce overheads for small sizes is to call the Fortran-style API directly, rather than using CBLAS or LAPACKE. After processing the input parameters, which may include making copies of matrices, the CBLAS and LAPACKE interfaces, in turn, call the Fortran-style API.

Summary

We encourage users to transition their ippMX-based applications to use functionality in Intel MKL. If you require further assistance mapping your ippMX functions to Intel MKL functions, please post your request to the Intel MKL forum.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

The Alternatives for Intel® IPP Legacy Small Matrices Domain

Introduction

Improving Performance of Intel MKL for Small Size Problems

Summary

Product and Performance Information