Using Intel® MKL with Threaded Applications

Memory Allocation
Memory appears to be allocated and not released when calling some Intel® MKL routines (for example, sgetrf).
One of the advantages of using Intel MKL is that it is multithreaded using OpenMP*. OpenMP* requires buffers to perform some operations and allocates memory even on single-processor systems and in single-threaded applications. This allocation occurs once, the first time the OpenMP software is encountered in the program, and it persists until the application terminates. In addition, the Windows* operating system allocates a stack equal in size to the main stack for every additional thread created, so the amount of memory allocated automatically depends on the size of the main stack, the OpenMP allocations, and the number of threads used.

If your program needs to free this memory, call mkl_free_buffers(). If a subsequent call is made to a library function that needs a memory buffer, the memory manager allocates the buffers again, and they again remain allocated until either the program ends or the program explicitly deallocates them. This behavior improves performance, but some tools may report it as a memory leak.

Please refer to the User's Guide for more details.


Using Threading with BLAS and LAPACK

Intel MKL is threaded in a number of places: LAPACK (the *GETRF, *POTRF, and *GBTRF routines, among many others), BLAS, DFTs, and FFTs. For a more comprehensive list of threaded routines, see the chapter "Threaded Functions and Problems" in the User's Guide. Intel MKL uses OpenMP* threading software. There are situations in which conflicts can make the use of threads in Intel MKL problematic. We list them here with recommendations for dealing with them, after a brief discussion of why the problem exists.

If the user threads the program using OpenMP directives and compiles it with the Intel® compilers, Intel MKL and the user program will both use the same threading library. Intel MKL tries to determine whether it is being called from within a parallel region, and if so, it does not spread its operations over multiple threads. However, Intel MKL can detect that it is in a parallel region only if the threaded program and Intel MKL use the same threading library.

Please refer to the KB article "Recommended settings for calling Intel® MKL routines from multi-threaded applications" for our recommendations on how to run your application in a multithreaded environment.


Setting the Number of Threads for OpenMP* (OMP)
The OpenMP* software responds to the environment variable OMP_NUM_THREADS:
  • Windows*: Open the Environment panel of the System Properties dialog in the Control Panel on Microsoft* Windows NT*, or set the variable in the shell the program runs in with the command: set OMP_NUM_THREADS=<number of threads to use>.
  • Linux*: Set and export the variable: export OMP_NUM_THREADS=<number of threads to use>.
Note: Setting this variable when running on Microsoft* Windows* 98 or Windows* Me is meaningless, since multiprocessing is not supported on those systems.

Techniques to Set the Number of Threads
Use one of the following techniques to change the number of threads used by the Intel® Math Kernel Library (Intel® MKL):
  • Set one of the OpenMP or Intel MKL environment variables: OMP_NUM_THREADS, MKL_NUM_THREADS, MKL_DOMAIN_NUM_THREADS
  • Call one of the OpenMP or Intel MKL functions:  omp_set_num_threads(), mkl_set_num_threads(),  mkl_domain_set_num_threads()

Changing the Number of Threads During Runtime
It is not possible to change the number of threads during runtime using the environment variable OMP_NUM_THREADS, because the OpenMP runtime reads it only once, at startup. You can, however, call OpenMP API functions from your program to change the number of threads during runtime. The following sample code demonstrates this using the omp_set_num_threads() routine:

#include "omp.h"
#include "mkl.h"
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

/* Fill a and b with test data and zero out c. */
static void init_matrices(double *a, double *b, double *c)
{
    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            a[i*SIZE + j] = (double)(i + j);
            b[i*SIZE + j] = (double)(i * j);
            c[i*SIZE + j] = 0.0;
        }
    }
}

/* Print the first element of each of the first 10 rows of a and c. */
static void print_results(const double *a, const double *c)
{
    int i;
    printf("row: a c\n");
    for (i = 0; i < 10; i++)
        printf("%d: %f %f\n", i, a[i*SIZE], c[i*SIZE]);
}

int main(void)
{
    double *a = (double *)malloc(SIZE * SIZE * sizeof(double));
    double *b = (double *)malloc(SIZE * SIZE * sizeof(double));
    double *c = (double *)malloc(SIZE * SIZE * sizeof(double));

    double alpha = 1.0, beta = 1.0;
    int m = SIZE, n = SIZE, k = SIZE, lda = SIZE, ldb = SIZE, ldc = SIZE;

    /* First run: MKL uses the default number of threads. */
    init_matrices(a, b, c);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    print_results(a, c);

    /* Restrict the OpenMP runtime (and therefore MKL) to one thread. */
    omp_set_num_threads(1);
    init_matrices(a, b, c);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    print_results(a, c);

    /* Increase to two threads at runtime. */
    omp_set_num_threads(2);
    init_matrices(a, b, c);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    print_results(a, c);

    free(a);
    free(b);
    free(c);
    return 0;
}


Can I use Intel MKL if I thread my application?
The Intel Math Kernel Library is designed and compiled for thread safety, so it can be called from threaded programs. It is safe to call threaded Intel MKL routines from multiple application threads created with, for example, the Windows* API CreateThread() or the POSIX* threads (Pthreads) API.

New Threading Features in MKL 10.x
Please see the article "Intel® MKL 10.0 threading" for the new threading features introduced by Intel MKL 10.x.

 

Intel MKL Threading on Intel MIC Architecture
To avoid the performance drops caused by oversubscribing Intel® Xeon Phi™ coprocessors, Intel MKL limits the number of threads it uses:

  • For native runs on coprocessors, Intel MKL uses 4 * (number of coprocessor cores) threads by default, and scales the number of threads down to this value if you request more threads and MKL_DYNAMIC is true.
  • For runs that offload computations, Intel MKL uses 4 * (number of coprocessor cores - 1) threads by default, and scales the number of threads down to this value if you request more threads and MKL_DYNAMIC is true.

To improve performance of Intel MKL routines, use the following OpenMP and threading settings:

     Set KMP_AFFINITY=balanced

For more information, see the Knowledge Base article at http://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-

New TBB Threading Layer in MKL 11.3

Intel MKL 11.3 Beta Update 1 introduced support for Intel® Threading Building Blocks (Intel® TBB). Intel MKL 11.3 can increase the performance of applications threaded with Intel TBB. Such applications can benefit from the following Intel MKL functions:

  • BLAS: dot, gemm, gemv, gels
  • LAPACK: getrf, getrs, syev, gels, gelsy, gesv, pstrf, potrs
  • Sparse BLAS: csrmm, bsrmm
  • Intel MKL Poisson Solver
  • Intel MKL PARDISO

For more information, see the Knowledge Base article at https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

       

        Optimization Notice

        The Intel® Math Kernel Library (Intel® MKL) contains functions that are more highly optimized for Intel microprocessors than for other microprocessors. While the functions in Intel® MKL offer optimizations for both Intel and Intel-compatible microprocessors, depending on your code and other factors, you will likely get extra performance on Intel microprocessors.

         

        While the paragraph above describes the basic optimization approach for Intel® MKL as a whole, the library may or may not be optimized to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

        Intel recommends that you evaluate other library products to determine which best meets your requirements.

        For detailed information about compiler optimization capabilities, refer to our Optimization Notice.