What's new in Intel® MKL 10.0?

Intel® MKL product changes since Intel® MKL 9.1

  • Linking model change
  • In Version 10.0 we have re-architected Intel® MKL, physically separating the interface, threading, and computational components of the product. This layered design lets the base Intel® MKL package support numerous combinations of interfaces, compilers, and processors through multiple library linking combinations in a single package. The new architecture is intended to provide maximum support for our varied customers’ needs while minimizing the effort it takes to obtain the performance of Intel® MKL. For more information, please refer to the "Using Intel® MKL Parallelism" section of the Intel® MKL User’s Guide
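  As a hedged illustration of the layered model (the exact library names depend on platform, interface, and threading choice; consult the User’s Guide for the authoritative list), a Windows link line combining one interface layer, one threading layer, and the computational core might look like:

```shell
rem Hypothetical link line for the layered model: one interface library,
rem one threading library, the computational core, and the OpenMP runtime
link myapp.obj mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libguide40.lib
```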
  • Cluster enabled capability available in single Intel® MKL product
  • In Intel® MKL 9.1 there were two separate products: Intel® MKL for Windows and Intel® MKL Cluster Edition for Windows. In Intel® MKL 10.0 these have been merged into a single product, Intel® MKL for Windows, which includes ScaLAPACK, distributed-memory FFTs, and all other capabilities of the former Cluster Edition

Performance improvements since Intel® MKL 9.1

  • BLAS
  • DGEMM and SGEMM on Intel® Core™2 Quad processors
  • Large square and large outer product sizes were improved by 1.04 times on 1 thread and 1.1 times to 1.15 times on 8 threads
  • Other level 3 real functions were improved by 1.02 times to 1.04 times on large sizes
  • Several linear equation solvers (?spsv/?hpsv/?ppsv, ?pbsv/?gbsv, ?gtsv/?ptsv, ?sysv/?hesv) have improved dramatically in performance. Cases with banded or packed storage formats and multiple right-hand sides see gains of up to 100 times
  • All symmetric eigensolvers (?syev/?heev, ?syevd/?heevd, ?syevx/?heevx, ?syevr/?heevr) have improved significantly, since the tridiagonalization routines (?sytrd/?hetrd) have sped up by up to 4 times
  • All symmetric eigensolvers in packed storage (?spev/?hpev, ?spevd/?hpevd, ?spevx/?hpevx) have improved significantly, since the tridiagonalization routines in packed storage (?sptrd/?hptrd) perform 3 times better than in the previous version
  • A number of routines which apply orthogonal/unitary transformations (?ormqr/?unmqr, ?ormrq/?unmrq, ?ormql/?unmql, ?ormlq/?unmlq) are up to 2 times faster
  • FFTs
  • Performance of complex 1D FFTs for power-of-two sizes was improved by up to 1.8 times on 1 thread
  • On systems with Intel® EM64T running in 64-bit mode, complex 2D FFTs were sped up by up to 1.1 times on 1 thread for single and double precision
  • Parallel Complex 2D FFTs were sped up for single precision by up to 1.2 times on 8 threads and for double precision by up to 1.3 times
  • Parallel Complex 3D FFTs were sped up by up to 1.15 times for single and double precision
  • Parallel Complex Backward 2D FFTs were sped up for double precision by up to 1.4 times and for single precision up to 1.3 times
  • Single-precision complex backward 1D FFTs of size greater than 2^22 were sped up by up to 2 times on 4 threads and up to 2.4 times on 8 threads on Itanium® processors
  • Performance of VSL functions is improved on non-Intel processors by approximately 2 times on average
  • Performance of VML vdExp, vdSin, and vdCos functions is improved on non-Intel processors by 1.18 times on average
  • Performance of VSL functions is improved on IA-32 and Intel® 64 by 1.07 times on average

Other Improvements

  • Change in threading model
  • Previously, when OMP_NUM_THREADS was undefined, the number of threads used by Intel® MKL defaulted to 1. With Intel® MKL 10.0, when the environment variable OMP_NUM_THREADS is undefined, your compiler's run-time library (e.g., libguide) determines the default number of threads, and Intel® MKL may create multiple threads depending on the problem size and the value of MKL_DYNAMIC or other threading environment variables
  • To provide additional user control over threading, the following environment variables have been added: MKL_NUM_THREADS, MKL_DOMAIN_NUM_THREADS, and MKL_DYNAMIC as well as the corresponding library routines. See the User Guide for details
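  For example, a Windows session might combine these variables as follows (a sketch; the exact domain-string syntax for MKL_DOMAIN_NUM_THREADS is spelled out in the User Guide):

```shell
rem Cap MKL at 4 threads overall, and let MKL use fewer threads
rem when it judges that to be faster (dynamic adjustment)
set MKL_NUM_THREADS=4
set MKL_DYNAMIC=TRUE
rem Per-domain limits (BLAS, FFT, VML) go in MKL_DOMAIN_NUM_THREADS;
rem see the User Guide for the domain-string format
```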
  • Interface changes
  • The C DFTI interface has changed in its ILP64 variant: the MKL_LONG type is now used instead of long, i.e.,
     MKL_LONG Dfti…(…, MKL_LONG, …) instead of long Dfti…(…, long, …). This matters, for example, on Windows, where long is 4 bytes while MKL_LONG is 8 bytes in the ILP64 variant. See the User’s Guide for details
  • Out-of-core (OOC) PARDISO for all types of matrices
  • In version 10.0, we have added out-of-core memory support to PARDISO. While computers have greatly increased in memory capacity, there remain many problems whose sizes are too great to solve in memory. We encourage customers encountering such problem-size limitations to try the new out-of-core PARDISO solver. Opportunities for further performance optimization have been identified, and we plan to release an Intel® MKL update in the coming months with significant performance improvements
  • ZGEMM3M and CGEMM3M functions
  • These complex matrix multiplication functions use three block matrix multiplies and five additions, as opposed to four block matrix multiplies and four additions, to reduce the total number of operations. They are extensions to the standard BLAS in Intel MKL and use the same syntax as ZGEMM and CGEMM, respectively
  • Using [Z/C]GEMM3M instead of [Z/C]GEMM can give up to 1.25 times the performance, although the results are not bit-for-bit identical to those of [Z/C]GEMM
  • Iterative Sparse Solvers
  • An ILUT pre-conditioner has been added
  • Sparse BLAS
  • Support for sparse 0-based indexing has been added
  • mkl_scsrgemv, a single-precision sparse BLAS matrix-vector multiply function, has been added
  • FFTs
  • The DftiCommitDescriptor function has been optimized by avoiding double data initialization for serial and parallel 1D FFT. This function now runs faster and allocates less memory
  • Vector Math Library (VML)
  • New VML EP (enhanced performance) accuracy mode has been introduced. The EP routines are significantly faster than LA (low accuracy) routines and are accurate to at least 11 and 26 bits for single and double precisions respectively. See vmlSetMode function description in the Intel® MKL manual for details
  • New VML functions added: v{s,d,c,z}Mul, v{c,z}MulByConj, v{c,z}Div, v{s,d,c,z}Add, v{s,d,c,z}Sub, v{c,z}Conj, v{s,d}Expm1, v{s,d}Log1p, v{s,d}Sqr, v{s,d}Pow3o2, v{s,d}Pow2o3, v{s,d,c,z}Abs, v{c,z}CIS
  • Vector Statistical Library (VSL)
  • Support for a 64-bit nskip parameter in the vslSkipAheadStream service routine has been introduced in all variants of VSL (not only ILP64)
  • Bugs in vslCopyStream, vslCopyStreamState service routines, and VSL QRNG initialization scheme for the case of user-defined parameters were fixed
  • PDE Support
  • The Trigonometric Transforms have been extended to support various kinds of DCT/DST transforms. In addition to even-size transforms, odd-size transforms are supported starting with this release
  • FFTW 3.x Wrappers
  • New FFTW 3.x wrappers have been developed for real-to-real (DCT/DST) transforms

Operating System:

Red Hat* Linux, Windows Vista*, Windows* XP Starter Edition, SUSE* Linux
For more complete information about compiler optimizations, see our Optimization Notice.