The Alternatives for Intel® IPP Legacy Small Matrices Domain
By Igor Astakhov, Shane A Story
Published:03/24/2015 Last Updated:04/06/2015
Introduction
Functionally in the Intel® Integrated Performance Primitives (Intel® IPP) Small Matrix Operations (ippMX) domain has been removed from Intel IPP 9.0. This action follows the deprecation announcement (Deprecated Function for Small Matrices and Realistic Rendering) made when Intel IPP 7.1 released. Customers who want to continue using functions in this domain should make a special request here . Note that the last optimizations made to the ippMX domain were a few years ago and only targeted Intel® Advanced Vector Extensions (Intel® AVX). Now that the package that contains the ippMX functionality is separate from the Intel IPP product, performance or stability issues found will not be addressed.
We recommend that you update your application to use the Intel® Math Kernel Library (Intel® MKL) which provides functionality similar to that in the ippMX domain. Functionality included in Intel MKL may be a good alternative to the ippMX functionality for a number of reasons:

Intel MKL is a computing math library of highly optimized, extensively threaded routines for applications that require maximum performance.

The library provides the same C programming language interfaces. Intel MKL can be called from applications written in either C or C++, as well as in any other language that can reference a C interface.

Intel MKL provides comprehensive vector, vectormatrix, and matrixmatrix functionality including the BLAS (level 1, 2, and 3) and LAPACK linear algebra routines.
Differences exist between the ippMX and Intel MKL functionality. These include:

API differences.

Intel MKL relies on industry standard definitions such as BLAS and LAPACK.The function names are typically 46 characters in length, with the first character denoting the data type (D for double precision, corresponding to Ipp64f; S for single precision, corresponding to Ipp32f; Z for complex double precision; C for complex single precision).Additionally, because BLAS and LAPACK APIs are both Fortranstyle, parameters are passed by reference rather than by value.However, there are Cstyle interfaces available in Intel MKL: CBLAS and LAPACKE, where most parameters are passed by value.


Data layout.

The BLAS and LAPACK APIs are Fortranstyle where matrix data is assumed to be stored in columnmajor order. Cstyle interfaces in Intel MKL, CBLAS and LAPACKE, support both rowmajor and colmajor matrix storage.


Allowable matrix structures.

The ippMX objects allow for both regular and irregular matrices, as well as arrays of matrices, with strides between elements, rows, and matrices in the case of an array of matrices.Intel MKL generally works only on regular matrices and only allows one nontrivial stride (the leading dimension).For rowmajor ordering, the elements within a row must be contiguous, and the rows must be equally spaced in memory. For columnmajor ordering, the elements within a column must be contiguous, and the columns must be equally spaced in memory.Only regular vectors, where the stride between elements is constant, are allowed.Intel MKL offers batch processing of matrixmatrix multiplications (?gemm_batch) starting from Intel MKL 11.3.Please refer to the Intel MKL Reference Manual for more details on batch processing.


Destructivity of parameters.

Many of the ippMX routines are nondestructive, whereas many of the Intel MKL routines are destructive.For instance, to perform a Cholesky decomposition of a symmetric positive definite matrix A, the ippMX version will produce the Cholesky factor L without overwriting A while the Intel MKL version will overwrite A with the Cholesky factor L.If the original matrix A is needed after the Cholesky factorization, it can first be copied using an Intel MKL routine such as DLACPY or MKL_DOMATCOPY2. Note that the copy overhead may be significant.


Error reporting.

The ippMX functions typically return an IppStatus value, signifying whether or not any errors were encountered. Intel MKL routines rely on XERBLA for error handling, and by default MKL will print out a message upon encountering an error.An XERBLA routine can be provided by the user to change the default behavior. Additionally, LAPACK routines may require a parameter (info) that provides error information specific to the routine at hand.


Performance on small matrices.

The ippMX domain is tuned for matrices of small size (3x36x6).Intel MKL is tuned for matrices of all size, though the overheads of a library are more apparent on small sizes.For this reason, the ability to avoid some overheads (such as error checking) was introduced starting from Intel MKL 11.2.See the subsection “Improving Performance of Intel MKL for Small Size Problems” for more details.In addition, Intel MKL has optimizations for more recent architectures, such as Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX512), while optimization of ippMX stopped at Intel® Advanced Vector Extensions (Intel® AVX).


Footprint size.

The Intel MKL footprint is larger than the IPP footprint, both in terms of library size and growth of the binary executable when linking.


Supported precisions.

For the most part, Intel MKL routines support both real and complex data types whereas the ippMX domain only covers real data types.

The following table describes where Intel MKL provides functionality similar to ippMX functionality.
ippMX functionality 
Intel MKL Domain 
Example functionality that can be reproduced with Intel MKL(may require multiple calls) 
Intel MKL does not cover 
Utility functions 
BLAS Level1, LAPACK 
Copy, Loading identity matrix, Extract 

Vector Functions 
BLAS Level1 
Norm 
Cross product 
Matrix Functions 
BLAS Level2/3 
Matrix multiplication 
Norm, Determinant, Trace 
Linear Systems & Least Squares 
LAPACK 
Factor, Solve 

Symmetric Eigenvalue Problem 
LAPACK 
Eigenvalues, Eigenvectors 

Math Functions for 2D/3D/4D vectors 

Vector addition, subtraction 
D3DX 10 Microsoft DirectX (Interpolation, Normalization, Rotation) 
The following table details several Intel MKL functions equivalent to some of the ippMX functions.
ippMX Routine 
Intel MKL Equivalent 
Notes 
ippmTranspose_m_64f, ippmTranspose_m_32f 
MKL_DOMATCOPY2, MKL_SOMATCOPY2 
There are simpler functions, but these allow 2 strides within the matrices like the ippMX domain 
ippmMul_{mv,tv}_64f, ippmMul_{mv,tv}_32f 
DGEMV, SGEMV 

IppmMul_{mm,mt,tm,tt}_64f, ippmMul_{mm,mt,tm,tt}_32f 
DGEMM, SGEMM 
Limited stride support and paramater transformation is required for the column major APIs (see the code example) 
ippmInvert_m_64f, ippmInvert_m_32f 
DGETRF/DGETRI, SGETRF/SGETRI 
User must call LU factorization (DGETRF, SGETRF) first before calling matrix inversion functions (DGETRI, SGETRI) 
ippmQRDecomp_m_64f, ippmQRDecomp_m_32f 
DGEQRF, SGEQRF 

ippmQRBackSubst_mv_64f, ippmQRBackSubst_mv_32f 
DORMQR/DLATRS, SORMQR/SLATRS 
The expert drivers in LAPACK tend to use LU, so one has to construct this manually 
For concreteness, examples are provided for two matrix operations. The first demonstrates matrixmatrix multiplication single precision (float); the second demonstrates a matrix inversion for double precision (double). These examples are only meant to be a starting point for porting code from ippMX to Intel MKL and do not cover all cases.
ippMX matrixmatrix multiplication:
/* Src1 matrix with width=4 and height=3 */
Ipp32f pSrc1[3*4] = { 1, 2, 3, 4,
5, 6, 7, 8,
4, 3, 2, 1 };
int src1Width = 4;
int src1Height = 3;
int src1Stride2 = sizeof(Ipp32f);
int src1Stride1 = 4*sizeof(Ipp32f);
/* Src2 matrix with width=3 and height=4 */
Ipp32f pSrc2[4*3] = { 11, 15, 14,
12, 16, 13,
13, 17, 12,
14, 18, 11 };
int src2Width = 3;
int src2Height = 4;
int src2Stride2 = sizeof(Ipp32f);
int src2Stride1 = 3*sizeof(Ipp32f);
/*
* Destination matrix has width=src2Width=3
* and height=src1Height=3
*/
Ipp32f pDst[3*3];
int dstStride2 = sizeof(Ipp32f);
int dstStride1 = 3*sizeof(Ipp32f);
IppStatus status = ippmMul_mm_32f(
pSrc1, src1Stride1, src1Stride2, src1Width, src1Height,
pSrc2, src2Stride1, src2Stride2, src2Width, src2Height,
pDst, dstStride1, dstStride2);
Intel MKL matrixmatrix multiplication (CBLAS interface):
/* Matrix dimensions A is 3x4 (mxk),
* B is 4x3 (kxn), and C is 3x3 (mxn) */
int m = 3;
int n = 3;
int k = 4;
int lda = 4;
int ldb = 3;
int ldc = 3;
/* Initialize matrices */
float A[3*4] = { 1, 2, 3, 4,
5, 6, 7, 8,
4, 3, 2, 1 };
float B[4*3] = { 11, 15, 14,
12, 16, 13,
13, 17, 12,
14, 18, 11 };
float C[3*3];
/* Parameters specific to MKL */
float alpha = 1.0;
float beta = 0.0;
/* CBLAS interface that works with rowmajor matrices */
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
m, n, k,
alpha, A, lda,
B, ldb,
beta, C, ldc);
Intel MKL matrixmatrix multiplication (columnmajor Fortranstyle interface):
/* Matrix dimensions A is 3x4 (mxk),
* B is 4x3 (kxn), and C is 3x3 (mxn) */
int m = 3;
int n = 3;
int k = 4;
int lda = 4;
int ldb = 3;
int ldc = 3;
/* Initialize matrices */
float A[3*4] = { 1, 2, 3, 4,
5, 6, 7, 8,
4, 3, 2, 1 };
float B[4*3] = { 11, 15, 14,
12, 16, 13,
13, 17, 12,
14, 18, 11 };
float C[3*3];
/* Parameters specific to MKL */
float alpha = 1.0;
float beta = 0.0;
/* Fortran 77 interface that can be called from C.
* Fortran 77 interface assumes columnmajor storage.
* Therefore, swap A and B matrices to effectively compute
* C^T = (A*B)^T = B^T*A^T. */
sgemm("N", "N",
&n, &m, &k,
&alpha, B, &ldb,
A, &lda,
&beta, C, &ldc);
ippMX matrix inversion:
/* Src matrix with widthHeight=3 */
Ipp64f pSrc[3*3] = { 1, 1, 1,
1, 0, 2,
1, 1, 2 };
int widthHeight = 3;
int srcStride1 = 3*sizeof(Ipp64f);
int srcStride2 = sizeof(Ipp64f);
/* Dst matrix */
Ipp64f pDst[3*3];
int dstStride1 = 3*sizeof(Ipp64f);
int dstStride2 = sizeof(Ipp64f);
/* Buffer (work) space */
Ipp64f pBuffer[3*3+3];
/* IPP matrix inversion */
IppStatus status = ippmInvert_m_64f(
pSrc, srcStride1, srcStride2,
pBuffer,
pDst, dstStride1, dstStride2,
widthHeight);
Intel MKL matrix inversion (LAPACKE interface):
/* Matrix dimensions */
int n = 3;
int lda = 3;
/*
* Source/destination (overwritten) A matrix.
*/
double A[3*3] = { 1, 1, 1,
1, 0, 2,
1, 1, 2 };
/* Buffer (work) space */
int lwork = 64*3;
double work[64*3];
/* Parameters specific to MKL */
int ipiv[3];
int info;
/* LAPACKE interface */
/* First need to factorize A, then we can compute the inverse of A */
LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, lda, ipiv);
LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, A, lda, ipiv);
Intel MKL matrix inversion (columnmajor Fortranstyle interface):
/* Matrix dimensions */
int n = 3;
int lda = 3;
/* Source/destination (overwritten) A matrix.
* Entries of the A^T since Fortran interfaces are
* columnmajor. Computing inverse of A^T
* is equivalent to computing inverse of A stored
* in rowmajor since (A^T)^1 = (A^1)^T. */
double A[3*3] = { 1, 1, 1,
1, 0, 2,
1, 1, 2 };
/* Buffer (work) space */
int lwork = 64*3;
double work[64*3];
/* Parameters specific to MKL */
int ipiv[3];
int info;
/* Fortran 77 interfaces (callable from C) */
/* First need to factorize A^T, then we can compute the inverse of A^T */
dgetrf(&n, &n, A, &lda, ipiv, &info);
dgetri(&n, A, &lda, ipiv, work, &lwork, &info);
The examples show some of the differences between ippMX and Intel MKL. First note that the CBLAS/LAPACKE interfaces are more similar to the ippMX API in that most parameters are passed by value (rather than by reference). The CBLAS/LAPACKE interfaces also natively allow a rowmajor data layout. However, for better performance (described in more detail below), the Fortranstyle interface is recommended.
For the matrixmatrix multiplication code samples, note that ippmMul_mm_32f essentially performs C = A * B. Other ippMX routines perform matrix multiplication for transposed matrices, e.g., ippmMul_tm_32f performs C = A^{T} * B. The sgemm routine performs a more generalized operation: C = alpha * op(A) * op(B) + beta * C, where op(*) can represent transpose, nontranspose, or conjugate transpose (for complex data). By setting alpha to 1.0 and beta to 0.0, we get the operation: C = op(A) * op(B). In the example code, the A and B matrices are stored in rowmajor order, which is why the A and B matrices are swapped in the sgemm call. The transpose of a rowmajor matrix is equivalent to the same matrix stored in columnmajor format, (A*B)^{T} = B^{T}*A^{T}, which is why A and B matrices are swapped for the sgemm call. With the CBLAS interface, we can specify that the matrices are stored in rowmajor order, which is why the CblasRowMajor enum type is used.
In the matrix inversion example, ippmInvert_m_64f requires both a source and a destination matrix, so that after the call: dst = src^{1}. The Intel MKL routines work differently in that they operate directly on the matrix so that the original source matrix is replaced with the result. In addition, the matrix must first be factored by calling dgetrf before it can be inverted with dgetri. The Intel MKL matrix inversion example stores A^{T} in column major order, and exploits the fact that (A^{T})^{1 }= (A^{1})^{T}. Finally, note that both ippMX and Intel MKL routines require some workspace. For the function dgetri, the size of the workspace must be at least n, but better performance is usually attained with a workspace of size 16*n or 64*n.
Improving Performance of Intel MKL for Small Size Problems
The overhead of calling an Intel MKL function for small problem sizes can be significant when the function has a large number of parameters or internally checks error in input parameters. To reduce the performance overhead for these small size problems, the Intel MKL direct call feature works in conjunction with the compiler to preprocess the calling parameters and directly call or inline special optimized kernels that bypass error checking.
To activate the feature, do the following:
Compile your C code with the preprocessor macro depending on whether a threaded or sequential mode of Intel MKL is required by supplying the compiler option as explained below:
Intel MKL Mode 
Macro 
Compiler Option (Windows) 
Compiler Option (Linux) 
Threaded 
MKL_DIRECT_CALL 
/DMKL_DIRECT_CALL 
DMKL_DIRECT_CALL 
Sequential 
MKL_DIRECT_CALL_SEQ 
/DMKL_DIRECT_CALL_SEQ 
DMKL_DIRECT_CALL_SEQ 
Please refer to the Intel MKL User's Guide for further details on using MKL_DIRECT_CALL and to see the current list of supported subroutines.
Another way to reduce overheads for small sizes is to call the Fortranstyle API directly, rather than using CBLAS or LAPACKE. After processing the input parameters, which may include making copies of matrices, the CBLAS and LAPACKE interfaces, in turn, call the Fortranstyle API.
Summary
We encourage users to transition their ippMXbased applications to use functionality in Intel MKL. If you require further assistance mapping your ippMX functions to Intel MKL functions, please post your request to the Intel MKL forum.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.