Intel® Math Kernel Library Release Notes and New Features

This page provides the current Release Notes for Intel® Math Kernel Library. The notes are categorized by year, from newest to oldest, with individual releases listed within each year.

Click a version to expand a summary of the new features, changes, and known issues introduced since the previous release, or click the buttons under each major release for important information such as prerequisites, software compatibility, and installation instructions.

You can copy a link to a specific version's section by clicking the chain icon next to its name.

To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2019

Installation Guide | System Requirements | Getting Started | Bug Fix Log

Initial Release
  • BLAS Features:
    • Introduced automatic S/DGEMM JIT capability for small matrix sizes (m,n,k <=16) to improve S/DGEMM performance for Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) when compiling with MKL_DIRECT_CALL_JIT (threaded usage) or MKL_DIRECT_CALL_SEQ_JIT (sequential usage).
    • Introduced new functions to create (JIT-compile) optimized S/DGEMM-like matrix-multiply kernels for small matrix sizes (m,n,k <=16) for Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512), to execute a created kernel on matrices with matching dimensions, and to remove (destroy) the JIT kernel.
  • Sparse BLAS:
    • Introduced SYPR and SP2M functionality for the triple matrix product A*B*A^T and the matrix product A*B (and their transposes).
    • Improved performance of Inspector-Executor Sparse BLAS routines for Intel® TBB and sequential threading layers.
    • Improved performance of the SpMV, MKL_SPARSE_[S,D,C,Z]_SYMGS, and MKL_SPARSE_[S,D,C,Z]_TRSV routines for Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
  • DNN:
    • Deep Neural Network (DNN) component is deprecated and will be removed in the next Intel MKL release. We will continue to provide optimized functions for deep neural networks in Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN).
  • LAPACK:
    • Aligned MKL LAPACK functionality with Netlib LAPACK 3.7.1 and 3.8.0: added routines for symmetric indefinite matrix factorization using a 2-stage Aasen's algorithm.
    • Improved performance of ?GETRF for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and other micro architectures with OpenMP* threading.
    • Improved performance of ?GETRF and ?POTRF with Intel® TBB threading.
  • ScaLAPACK:
    • Improved performance and significantly reduced the memory footprint of the ScaLAPACK eigensolver routines P?[SY|HE]EV[D|X|R].
  • FFT:
    • Improved performance of 1D real-to-complex FFT.
    • Improved performance of C2C 1D and 2D FFT for Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
  • Sparse Solvers:
    • Introduced SparseQR functionality.
    • Introduced Extreme{EVD/SVD} functionality to calculate a set of the most positive or most negative eigenvalues/singular values of a symmetric (Hermitian) matrix.
    • Introduced support of partial inversion of sparse matrices (compute diagonal of inverse) in Intel® MKL PARDISO.
  • Random Generators:
    • Introduced multinomial random number generators.

Product Content

Intel MKL can be installed as part of product suites such as Intel® Parallel Studio XE and Intel® System Studio.

Intel MKL ships as a single package covering both the IA-32 and Intel® 64 architectures, and is also available through an online installer.

Deprecation Notices

  • Deep Neural Network (DNN) component is deprecated and will be removed in the next Intel MKL release. We will continue to provide optimized functions for deep neural networks in Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN).
  • Xcode* 10 no longer supports building 32-bit applications; accordingly, support for 32-bit applications on macOS* is deprecated and will be removed from Intel MKL 2019 in Update 1.
  • Users who require 32-bit support on macOS* should use Intel MKL 2018 or earlier versions.

Known Issues

  • -nocompchk flag:
    • Problem Statement: Intel® MPI 2019 no longer supports the -nocompchk flag with mpiexec (it has been officially removed), and passing it returns an error. Previously, mpiexec ran the compchk.sh scripts by default and -nocompchk turned that behavior off; now the scripts run only when explicitly requested with the -compchk flag.
    • Customer Impact: The file “examples/pblasc/make_lnx.inc” in the Intel® MKL 2019 Gold release still passes this flag on lines 195, 196, 220, and 221, so the pblasc example fails when run with Intel® MPI 2019.
    • Workaround: When using Intel® MPI 2019 to run the pblasc example, remove the -nocompchk flags from “examples/pblasc/make_lnx.inc”.
  • Input stream redirection on Windows:
    • Problem Statement: Intel® MPI 2019 doesn’t support input stream redirection on Windows.
    • Customer Impact: The examples in “examples/cdftf” for the Intel® MKL 2019 Gold release use input stream redirection to read data from a data file and, as a result, hang on Windows when Intel® MPI 2019 is used.
    • Workaround: Remove the stream redirection from “examples/cdftf/makefile” (.exe < data \$*.dat) and update the source code to load the data with OPEN/READ/CLOSE statements.
  • Schur test fail:
    • Problem Statement: Several Intel® MKL PARDISO Schur-complement tests fail in Intel MKL 2019 after a performance improvement of the solving step with many right-hand sides.
    • Customer Impact: Customers who use the Schur complement with the new Intel MKL PARDISO branch may get an error at the factorization step.
    • Workaround: Switch to the old Intel MKL PARDISO branch (classic one-level factorization: set iparm[24-1]=0 in C or iparm(24)=0 in Fortran) or use Intel MKL 2018 Update 3.
  • Custom fatal error handler usage limitation:
    • A custom fatal error handler must stop the Intel® MKL 2019 Gold computation.
    • If a fatal error occurs (Intel® MKL cannot load a dynamic library or encounters an unsupported CPU type) and the custom fatal error handler does not force Intel® MKL to stop, Intel® MKL fails with a segmentation fault.
    • Workaround: Call the exit function (for C applications) or use exceptions (for C++ applications) in the custom fatal error handler implementation.
Technical Support

    If you did not register your Intel software product during installation, please do so now at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates, and upgrades for the duration of the support term.

    For general information about Intel technical support, product updates, user forums, FAQs, tips and tricks and other support questions, please visit http://www.intel.com/software/products/support/.

    Note: If your distributor provides technical support for this product, please contact them rather than Intel.

    For technical information about Intel MKL, including FAQs, tips and tricks, and other support information, please visit the Intel MKL forum: http://software.intel.com/en-us/forums/intel-math-kernel-library/ and browse the Intel MKL knowledge base: http://software.intel.com/en-us/articles/intel-mkl-kb/all/.

    For paid support with Intel MKL, please purchase either Intel® Parallel Studio XE 2019 - https://software.intel.com/en-us/parallel-studio-xe - or Intel® System Studio - https://software.intel.com/en-us/system-studio.

2018

Installation Guide | System Requirements | Getting Started | Bug Fix Log

Update 3

Release Notes

What’s New in Intel® Math Kernel Library (Intel® MKL) version 2018 Update 3:

  • BLAS
    • Addressed ?TRMM NaN-propagation issues on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) for 32-bit architectures.
    • Improved performance of multithreaded {S,D}SYRK and {C,Z}HERK for small sizes on Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
  • LAPACK:
    • Added ?POTRF and ?GEQRF optimizations for the Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction sets.
    • Improved the performance of ?GESVD for very small square matrices (N<6).
    • Improved performance of inverse routines ?TRTRI, ?GETRI and ?POTRI.
  • Sparse BLAS:
    • Improved the performance of SPARSE_OPTIMIZE, SPARSE_SV and SPARSE_SYPR routines for Intel® TBB threading.
    • Added support of BSR format for the SPARSE_SYPR routine.
  • Library Engineering:
    • Added functionality to write the output of MKL_VERBOSE to a file specified by the user.
    • Enabled optimizations for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set with support of Vector Neural Network Instructions via MKL_ENABLE_INSTRUCTIONS.

Known Limitations:

When the leading dimension of matrix A is not equal to the number of rows or columns, the MKL_?GEMM_COMPACT functions can return incorrect results when executed on a processor that does not support Intel® AVX2 or Intel® AVX-512 instructions.

Update 2

What’s New in Intel® Math Kernel Library (Intel® MKL) version 2018 Update 2:

  • BLAS
    • Improved {S,D}GEMM performance for small sizes on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.
    • Improved i{C,S,Z,D}A{MIN,MAX} performance on Intel® AVX-512 instruction set.
    • Improved CSROT performance on Intel® AVX-512 32-bit architectures.
    • Improved parallel and serial performance of BLAS Level 3 routines on Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel AVX-512 32-bit architectures.
    • Improved GEMM_BATCH performance for groups with tall and skinny matrices.
  • DNN:
    • Improved initialization phase performance for convolutions.
  • Sparse BLAS
    • Introduced triple-product functionality A*B*A^T with sparse and dense matrices B.
    • Introduced sparse matrix product functionality A*B that can transpose both matrices and split their multiplication into phases similar to MKL_CSRMULTCSR.
    • Introduced note about deprecation and replacement of the Sparse BLAS API
    • Introduced Intel® Threading Building Blocks (Intel® TBB) support for triangular solver and converter routines.
    • Improved performance of matrix vector and matrix product for CSR and BSR formats.
    • Improved performance of matrix product for CSC format.
  • LAPACK
    • Improved parallel performance of ?SYTRF/?SYTRI/?SYTRI2.
    • Improved performance of numerous LAPACK functions for matrix sizes ≤ 30.
    • Improved parallel performance of (S|D)SYEVX Eigensolver.
  • ScaLAPACK:
    • Improved performance of the P?(DSY|ZHE)EVD and P?(DSY|ZHE)EVX symmetric eigensolver drivers. The observed speed-up is up to 4x depending on the matrix size and the grid configuration.
  • FFT
    • Improved 1D and 3D FFT performance for processors supporting the Intel® AVX-512 and Intel® AVX2 instruction sets.
  • Sparse Solvers:
    • Introduced an OOC version of Parallel Direct Sparse Solver for Clusters.
    • Introduced support for Schur complement (dense), partial solving, and custom reordering in Parallel Direct Sparse Solver for Clusters.
    • Introduced support for Sparse Schur complement in Intel MKL PARDISO functionality.
    • Removed restriction of simultaneous use of VBSR format/Schur complement/OOC algorithm in Intel MKL PARDISO for two-level factorization branch.
    • Improved performance of Intel MKL PARDISO for two-level factorization branch.
    • The Parallel Direct Sparse Solver for Clusters interface now returns the same diagnostic information as the Intel MKL PARDISO interface: peak memory use at the different phases after reordering, and the inertia and number of pivots after factorization.
  • Vector Mathematics
    • Improved performance on processors supporting Intel® AVX2 instruction sets for 64-bit implementations of vsErfc_HA/LA/EP, vdSqrt_HA, vsCbrt_HA, and vsInvCbrt_HA/LA.
    • Improved performance on processors supporting Intel® AVX2 instruction sets for 64-bit implementations of vdAtan_LA, vdTanpi_HA, vdTand_HA, vdTan_HA, vsAtan2pi_LA, vdTand_EP, vsTanh_HA, vdExpInt1_EP, vzDiv_HA, vzDiv_LA, vdTand_LA, vcArg_LA, vdTanh_LA, vdTanh_EP, vsLog10_LA, vsAtan2_LA, vsAtan2pi_EP.
  • Data Fitting and Vector Statistics
    • Improved performance of Vector Statistics Summary Statistics quantiles for Intel® Xeon® processors supporting Intel® AVX-512 (codename Skylake Server) and Intel® Xeon Phi™ processor 72** (formerly Knights Landing) in the OpenMP threading layer for dimensions n > 10^5

Update 1

What’s New in Intel® Math Kernel Library (Intel® MKL) version 2018 Update 1

  • BLAS
    • Improved single precision and single precision complex Level 3 BLAS performance for Intel® Xeon Phi™ processors based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with support for AVX512_4FMAPS instructions
    • Improved irregular-shaped SGEMM performance on Intel® Xeon Phi™ processor x200
    • Added stack unwind support to internal Intel64 assembly kernels on Windows OS
    • Improved MKL_DIRECT_CALL DGEMM performance on Intel® Advanced Vector Extensions 2 (AVX2) for  Intel and GNU C/C++ compilers
  • Sparse BLAS
    • Improved performance of the inspector-executor mode of SpMV for CSR format
    • Improved performance of the SpMM routine for CSR format
    • Improved performance of the inspector-executor mode of SpMV for BSR format in the Intel TBB threading layer
  • LAPACK
    • Improved performance of ?(OR|UN)GQR, ?GEQR and ?GEMQR routines in Intel® TBB threading layer
    • Introduced LAPACKE_set_nancheck routine for disabling/enabling NaN checks in LAPACKE functions
  • ScaLAPACK:
    • Added optimizations (2-stage band reduction algorithm) for pdsyevr/pzheevr for JOBZ='N'|'V' and RANGE='A'. The new algorithm is enabled for N>=4000 and appropriate process grids; otherwise the traditional algorithm is used. The best speed-up is expected for larger matrices.
  • FFT
    • Improved performance of batched real-to-complex 3D FFT for Intel® Xeon® processors supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server)
    • Improved performance with and without scaling factor across all domains.
  • Sparse Solvers
    • Improved Intel MKL PARDISO performance for small matrices (iparm(24)=10)
  • Vector Mathematics
    • The default behavior has changed for unmasked exception handling. By default, all floating-point exceptions are now masked before any internal MKL VM computation, whereas until now exceptions unmasked by the user applied to internal computations as well. As a new feature, the user can employ four newly added modes (VML_TRAP_INVALID, VML_TRAP_DIVBYZERO, VML_TRAP_OVERFLOW, and VML_TRAP_UNDERFLOW) to trap on unmasked exceptions raised during internal computation of vector math functions.
  • Data Fitting and Vector Statistics
    • Introduced TBB-threading layer in MKL Data Fitting and Vector Statistics components
  • Library Engineering
    • Added pkg-config files to simplify compilation of applications and libraries with MKL.

Known limitation:

Data Fitting and Vector Statistics: Oversubscribed mode is not supported in this release. Do not set the number of TBB threads higher than the number of logical cores.

Initial Release

What’s New in Intel® Math Kernel Library (Intel® MKL) version 2018

  • BLAS Features:
    • Introduced compact GEMM and TRSM functions (mkl_{s,d,c,z}gemm_compact and mkl_{s,d,c,z}trsm_compact) to work on groups of matrices in compact format and service functions to support the new format
    • Introduced optimized integer matrix-matrix multiplication routines GEMM_S8U8S32 and GEMM_S16S16S32 to work with quantized matrices for all architectures.
  • BLAS Optimizations: 
    • Optimized SGEMM and SGEMM packed for Intel® Xeon Phi™  processors based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with support of AVX512_4FMAPS and AVX512_4VNNIW instructions
    • Optimized GEMM_S8U8S32 and GEMM_S16S16S32 for AVX2, AVX512 and Intel® Xeon Phi™ processors based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with support of AVX512_4FMAPS and AVX512_4VNNIW instruction groups
  • Deep Neural Network:
    • Added support for non-square pooling kernels
    • Improved performance of large non-square kernels on Intel® Xeon Phi™ processors
    • Optimized conversions between plain (nchw, nhwc) and internal data layouts
  • LAPACK:
    • Added the following improvements and optimizations for small matrices (N<16):
      • Direct Call feature extended with Cholesky and QR factorizations providing significant performance boost
      • Introduced LU and Inverse routines without pivoting with significantly better performance: mkl_?getrfnp and mkl_?getrinp
      • Introduced Compact routines for much faster solving of multiple matrices packed together: mkl_?getr[f|i]np_compact, mkl_?potrf_compact and mkl_?geqrf_compact
    • Added ?gesvd, ?geqr/?gemqr, and ?gelq/?gemlq optimizations for tall-and-skinny/short-and-wide matrices
    • Added optimizations for ?pbtrs routine
    • Added optimizations for ?potrf routine for Intel® Threading Building Blocks layer      
    • Added optimizations for CS decomposition routines: ?dorcsd and ?orcsd2by1
    • Introduced factorization and solve routines based on Aasen's algorithm: ?sytrf_aa/?hetrf_aa, ?sytrs_aa/?hetrs_aa
    • Introduced new, faster ?sytrf_rk/?hetrf_rk routines for symmetric indefinite (or Hermitian indefinite) factorization with the bounded Bunch-Kaufman (rook) pivoting algorithm
  • ScaLAPACK:
    • Added optimizations (2-stage band reduction) for p?syevr/p?heevr routines for JOBZ=’N’ (eigenvalues only) case
  • FFT:
    • Introduced Verbose support for FFT domain, which enables users to capture the FFT descriptor information for Intel MKL
    • Improved performance for 2D real-to-complex and complex-to-real for  Intel® Xeon® Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server) and Intel® Xeon Phi™ processor 72** ( formerly Knights Landing)
    • Improved performance for 3D complex-to-complex for  Intel® Xeon® Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server) and Intel® Xeon Phi™ processor 72** ( formerly Knights Landing)
  • Intel® Optimized High Performance Conjugate Gradient Benchmark:         
    • New version of benchmark with Intel® MKL API
  • Sparse BLAS:
    • Introduced Symmetric Gauss-Seidel preconditioner
    • Introduced Symmetric Gauss-Seidel preconditioner with ddot calculation of the resulting and initial arrays
    • Introduced sparse matrix-vector routine with ddot calculation of the resulting and initial arrays
    • Introduced sparse SYRK routine with both OpenMP and Intel® Threading Building Blocks support
    • Improved performance of Sparse MM and MV functionality for Intel® AVX-512 Instruction Set
  • Direct Sparse Solver for Cluster:
    • Added support for a transpose solver
  • Vector Mathematics:
    • Added 24 new functions: v?Fmod, v?Remainder, v?Powr, v?Exp2; v?Exp10; v?Log2; v?Logb; v?Cospi; v?Sinpi; v?Tanpi; v?Acospi; v?Asinpi; v?Atanpi; v?Atan2pi; v?Cosd; v?Sind; v?Tand; v?CopySign; v?NextAfter; v?Fdim; v?Fmax; v?Fmin; v?MaxMag and v?MinMag, including optimizations for processors based on Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Data Fitting:
    • Cubic spline-based interpolation in the ILP64 interface was optimized by up to 8x on Intel® Xeon® processors supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server) and 2.5x on Intel® Xeon Phi™ processor 72** (formerly Knights Landing)
  • Documentation:
    • Starting with this version of Intel® MKL, most of the documentation for Parallel Studio XE is only available online at https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation. You can also download it from the Intel Registration Center > Product List > Intel® Parallel Studio XE Documentation
  • Intel continually evaluates the markets for our products in order to provide the best possible solutions to our customer’s challenges. As part of this on-going evaluation process Intel has decided to not offer Intel® Xeon Phi™ 7200 Coprocessor (codenamed Knights Landing Coprocessor) products to the market.
    • Given the rapid adoption of Intel® Xeon Phi™ 7200 processors, Intel has decided to not deploy the Knights Landing Coprocessor to the general market.
    • Intel® Xeon Phi™ Processors remain a key element of our solution portfolio for providing customers the most compelling and competitive solutions possible.
  • Support for the Intel® Xeon Phi™ x100 product family coprocessor (formerly code-named Knights Corner) is removed in this release. The Intel® Xeon Phi™ x100 product family coprocessor was officially announced as end-of-life in January 2017. As part of the end-of-life process, support for this family is only available in Intel® Parallel Studio XE 2017, which will be supported for a period of 3 years ending in January 2020 for the Intel® Xeon Phi™ x100 product family. Support will be provided for those customers with active support.

Product Content

Intel MKL can be installed as part of product suites such as Intel® Parallel Studio XE and Intel® System Studio.

Intel MKL ships as a single package covering both the IA-32 and Intel® 64 architectures, and is also available through an online installer.

Known Issues

  • Convolution primitives for the forward pass may return incorrect results or crash when the input spatial dimensions are smaller than the kernel spatial dimensions on Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Intel® MKL FFT – complex-to-complex in-place batched 1D FFT with transposed output returns incorrect output
  • Intel® ScaLAPACK may fail with OpenMPI* 1.6.1 and later releases due to known OpenMPI* issue: https://github.com/open-mpi/ompi/issues/3937. As a workaround, please avoid using OpenMPI
  • Intel® VML functions may raise spurious FP exceptions even if VML_ERRMODE_EXCEPT is not set (the default). Recommendation: do not unmask FP exceptions before calling VML functions.
  • When an application uses Vector Math functions with the single dynamic library (SDL) interface combined with TBB threading layer, the application may generate runtime error “Intel MKL FATAL ERROR: Error on loading function mkl_vml_serv_threader_c_1i_2o.”

2017

Installation Guide | System Requirements | Getting Started | Bug Fix Log

Update 4

What's New in Intel MKL 2017 Update 4

  • BLAS:
    • Addressed an early release buffer issue in *GEMV threaded routines
    • Improved Intel® Threading Building Blocks *GEMM performance for small m, n and large k cases
    • Fixed irregular division by zero and invalid floating point exceptions in {C/Z}TRSM for Intel® Xeon Phi™ processor x200 (aka KNL) and Intel® Xeon® Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) code path
    • Improved threaded {S,D}GEMV performance on the Intel® 64 architecture
    • Addressed an incorrect SSYRK calculation on Intel® Xeon Phi™ processor x200 with Intel® TBB threading that occurs when the value of K is very large
    • Addressed a GEMM multithreading issue that may cause segfaults for large matrices (M, N >= ~30,000; K >= ~5,000) on Intel® Xeon Phi™ processor x200 (aka KNL)
  • Deep Neural Networks:
    • Added support for non-square pooling kernels
  • Sparse BLAS
    • Improved SpMV and SpMM performance for processors supporting the Intel® AVX-512 instruction set, and improved SpMV performance for processors supporting the Intel® AVX2 instruction set
    • Added Intel® TBB support for SparseSyrk and SpMM routines
  • Intel MKL Pardiso
    • Significantly improved factorization and solving steps for “small” matrices
    • Introduced a low-rank approach suitable for solving sets of systems with small changes in matrix elements
  • Parallel Direct Sparse Solver for Cluster:
    • Added Iterative support
    • Improved performance when the number of processes is not a power of 2
  • LAPACK:
    • Improved performance of ?(OR|UN)GQR, ?GEQR and ?GEMQR routines in the Intel® TBB threading layer.
    • Introduced LAPACKE_set_nancheck routine for disabling/enabling NaN checks in LAPACKE functions.
  • FFT:
    • Improved 2D and 3D FFT performance for the processors supporting Intel® AVX512 and Intel® AVX2 Instruction sets.
    • Improved FFT performance with and without scaling factor across all domains.
    • Introduced MKL_VERBOSE mode support for FFT domain.

Update 3

What's New in Intel MKL 2017 Update 3

  • BLAS:
    • Optimized SGEMM for Intel® Xeon Phi™ processor x*** (codename Knights Mill)
    • Improved performance for ?GEMM for medium problem sizes on Intel® Xeon® Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server)
    • Improved performance for SGEMM/DGEMM for small problem sizes on Intel® Xeon® Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server)
    • Improved performance for ?GEMM_BATCH on all architectures
    • Improved performance for SSYMV/DSYMV on Intel® Advanced Vector Extensions 2 (Intel® AVX2) and later architectures
    • Improved performance for DGEMM Automatic Offload (AO) for square sizes (M = N = K > 3000)
    • Improved performance for general BLAS functions on the 32-bit Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architecture
    • Fixed ?AXPBY to propagate NaNs in the y vector when beta = 0 on 64-bit Intel® Advanced Vector Extensions 2 (Intel® AVX2) and later architectures
  • FFT:
    • Improved performance of 3D FFT complex-to-real and real-to-complex problems on Intel® Xeon Phi™ processor 72** (formerly Knights Landing) 
    • Improved performance of 2D FFT complex-to-complex problems with scale on Intel® Xeon Phi™ processor 72** (formerly Knights Landing)
  • High Performance Conjugate Gradients (HPCG):
    • Added support for Intel® Xeon® processors supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) (codename Skylake Server)
  • Deep Neural Networks:
    • Added initial convolution and inner product optimizations for the next generation of Intel Xeon Phi processor (code name Knights Mill)
    • Improved parallel performance of convolutions on Intel Xeon Phi processor (code name Knights Landing)
    • Added an option for average pooling to include padding in the computation of mean values
  • LAPACK:
    • Optimized ?GELQ and ?GEMLQ performance for short-and-wide matrices
    • Optimized performance of the ?ORCSD2BY1 and ?DORCSD routines
    • Fixed LU performance degradation for medium sizes on 6 threads
  • Vector Statistics:
    • Fixed failure of VSL RNG MT19937 on big vector lengths on Intel® Xeon Phi™ Coprocessor x100 Product Family.
    • Improved performance of the Outlier Detection (BACON) algorithm for single and double precision on processors supporting the Intel® AVX2 and Intel® AVX-512 instruction sets
Update 2

What's New in Intel MKL 2017 Update 2

  • Library Engineering:
    • Intel® AVX-512 code is dispatched by default on Intel® Xeon processors 
  • BLAS:
    • Improved performance of non-transposed DGEMV when the number of threads is large (typically on Intel® Xeon Phi™ processor x200, formerly Knights Landing); for example, a 2x speedup when M=K=10000 with 68 threads on Intel® Xeon Phi™ processor x200
    • Improved performance for DGEMM, TN and NN cases, with very small N on Intel® Xeon Phi™ processor x200 and 6th Generation Intel® Core™ processors (also known as Skylake)
    • Introduced the MKL_NUM_STRIPES environment variable and accompanying Intel MKL support functions to control the 2D partitioning of multithreaded *GEMM on all Intel architectures except the Intel® Xeon Phi™ coprocessor x100 product family. Please see the related section in the Intel MKL Developer Guide for details.
    • Improved the {s,d}gemm_compute performance on Intel64 architectures supporting Intel® AVX2 instruction set.
    • Improved ?gemm_batch performance when N==1. 
  • Sparse BLAS
    • Improved performance of BSR matrix-vector functionality for block sizes 3-10, 14, and 18 on processors supporting the Intel® AVX2 and Intel® AVX-512 instruction sets
    • Improved performance of CSRMV functionality on processors supporting the Intel® AVX2 and Intel® AVX-512 instruction sets
    • Added Intel® Threading Building Blocks (Intel® TBB) threading support for CSRMV functionality with symmetric matrices
  • Intel MKL Pardiso
    • Added Intel® TBB threading support for Intel MKL PARDISO at the solving step
  • Deep Neural Networks:
    • Improved performance on Intel Xeon processors with Intel® AVX2 and Intel® AVX512 instruction set support
    • Improved performance on the second generation of Intel® Xeon Phi™ processor x200
    • Introduced support for rectangular convolution kernels
    • Significantly improved reference convolution code performance
    • Added unsymmetric padding support in convolution and pooling
    • Introduced extended Batch Normalization API that allows access to mean, variance, scale and shift parameters
  • LAPACK:
    • Added ?GEQR, ?GEMQR and ?GETSLS functions with performance optimized for tall-and-skinny matrices.
    • Improved LAPACK performance for very small sizes (N<16) in LP64 layer by reducing internal LP64/ILP64 conversion overhead.
    • Improved ?[SY|HE]EVD scalability up to 32 threads and beyond on Intel® Xeon® and Intel® Xeon Phi™ processor x200
    • Significantly improved ?LANGE (‘Frobenius’ norm) performance
  • ScaLAPACK:
    • Added MKL_PROGRESS() support in P?GETRF
    • Improved P?TRSM/P?SYRK performance
    • Optimized ?GE(SD|RV|BS|BR)2D routines in BLACS
    • Fixed failure in P?GEMM (‘N’, ‘N’ case)
  • Vector Mathematics:
    • Added Intel TBB threading support for all mathematical functions.
  • Vector Statistics:
    • Improved C interfaces of vsl*SSEdit*() functions

Known Limitations:   

  • On Intel® Xeon Phi™ processor x200 in leverage boot mode without Hyper-Threading, Intel MKL oversubscribes threads with MPSS versions prior to 4.3.2 because COI occupies 4 cores. This substantially affects Intel MKL performance. As a workaround, the customer can add ‘norespect’ to the MIC_KMP_AFFINITY environment variable.
  • ?GETRF functionality can give incorrect results for some 5x5 matrices when MKL_DIRECT_CALL is enabled. A patch fixing the issue is posted on the Intel MKL forum.
  • Recently added TS QR functionality (?GEQR and ?GEMQR) may demonstrate very slow performance when the number of threads is less than 30.
  • On Intel® Xeon® processors supporting Intel® AVX-512 (SKX), DGEMM does not scale C by beta when transa == N, transb == N, K == 0, and N == 2. A workaround is to set transa == T or transb == T, since with K == 0 the transpose setting does not affect the result.
Update 1

What’s New in Intel MKL 2017 Update 1

  • Added support of Intel® Xeon Phi™ processor x200 leverage boot mode on Windows* OS.
  • BLAS :
    • The Intel Optimized MP LINPACK Benchmark supports various MPI implementations in addition to Intel MPI,  and the contents of the mp_linpack directory have changed.
    • Improved single thread SGEMM/DGEMM performance on Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and Intel® Xeon® for Intel® Many Integrated Core Architecture.
  • Deep Neural Networks (DNN) primitives :
    • Introduced additional optimizations for Intel® Xeon® processor E3-xxxx v5 (formerly Skylake).
    • Added support for non-square convolution kernels
  • Sparse BLAS :
    • Improved Sparse BLAS matrix-vector functionality in block compressed sparse row (BSR) format for block sizes equal to 6, 10, 14, or 18 on Intel® AVX2.
    • Improved Inspector-executor Sparse BLAS matrix-vector and matrix-matrix functionality for symmetric matrices.
  • LAPACK :
    • Improved performance of ?GETRF, ?GETRS and ?GETRI for very small matrices via MKL_DIRECT_CALL. 
    • Improved performance of ?ORGQR and SVD functionality for tall-and-skinny matrices.
    • Parallelized ?ORGQR in Intel® Threading Building Blocks (Intel® TBB) threading layer.
  • Vector Math:
    • Introduced the exponential integral function E1 with three accuracy levels HA, LA, and EP, for single precision and double precision real data types.
  • ScaLAPACK:
    • Improved performance of PZGETRF.
  • Known Limitations for MKL 2017 Update 1:
    • Intel MKL (in Intel® Parallel Studio XE) integration with Microsoft Visual Studio in IA-32 architecture environments is limited. This issue does not affect the Intel® 64 architecture target environment. Intel MKL (in Intel® System Studio) integration with Microsoft Visual Studio is limited in both IA-32 and Intel 64 architecture environments.
    • 1D complex-to-complex FFT may return incorrect results on systems with Intel AVX-512 support if the number of threads is different at DFT descriptor commit time and DFT execution. 
    • The AVX512 code path works when the MKL_ENABLE_INSTRUCTIONS=AVX512 environment variable is set, but mkl_enable_instructions(MKL_ENABLE_AVX512) function call does not.
    • Building the Intel Optimized MP LINPACK Benchmark for a customized MPI implementation on Windows* is not supported for Microsoft Visual Studio 2015 and later.

Workaround: Use an earlier version of Microsoft Visual Studio.

Issue Description: If the user tries to use MSVS 2015 with our provided build.bat script to build their own xhpl.exe executable, they will see a number of unresolved external symbol errors like:

libhpl_intel64.lib(HPL_pdmatgen.obj) : error LNK2001: unresolved external symbol __iob_func

An older version of MSVS was used to build the libhpl_intel64.lib library we provide to link against when building the MP LINPACK benchmark for a customized MPI implementation.  It appears that these functions are now inlined in MSVS2015.
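The instruction-set limitation listed above (the MKL_ENABLE_INSTRUCTIONS environment variable works, the function call does not) can be worked around from the shell. A minimal sketch (the application name is hypothetical; Intel MKL reads the variable when it is first used):

```shell
# Select the Intel AVX-512 code path via the environment variable instead of
# calling mkl_enable_instructions(MKL_ENABLE_AVX512), which does not take effect.
export MKL_ENABLE_INSTRUCTIONS=AVX512
# ./my_mkl_app    <- hypothetical MKL-linked binary, launched with the variable set
echo "$MKL_ENABLE_INSTRUCTIONS"
```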

 

Initial Release

What's New in Intel MKL 2017

  • Introduced optimizations for the Intel® Xeon Phi™ processor x200 (formerly Knights Landing) self-boot platform for Windows* OS.
  • Enabled Automatic Offload (AO) and Compiler Assisted Offload (CAO) modes for the second generation of Intel Xeon Phi coprocessor on Linux* OS
  • Introduced Deep Neural Networks (DNN) primitives including convolution, normalization, activation, and pooling functions intended to accelerate convolutional neural networks (CNNs) and deep neural networks on Intel® Architecture.
    • Optimized for Intel® Xeon® processor E5-xxxx v3 (formerly Haswell), Intel Xeon processor E5-xxxx v4 (formerly Broadwell), and Intel Xeon Phi processor x200 self-boot platform.
    • Introduced inner product primitive to support fully connected layers.
    • Introduced batch normalization, sum, split, and concat primitives to provide full support for GoogLeNet and ResidualNet topologies.
  • BLAS:
    • Introduced new packed matrix multiplication interfaces (?gemm_alloc, ?gemm_pack, ?gemm_compute, ?gemm_free) for single and double precision.
    • The packed interfaces improve performance over standard S/DGEMM on Intel Xeon processor E5-xxxx v3 and later processors.
    • The Intel Optimized MP LINPACK Benchmark pre-built binaries using Intel MPI were moved to the mp_linpack root folder, and support for multiple MPI implementations was added. The benchmark source code in the mp_linpack directory was removed except for HPL_main.c, which can be used to create an Intel Optimized MP LINPACK Benchmark binary for a specific MPI implementation.
  • Sparse BLAS:
    • Improved performance of parallel BSRMV functionality for processors supporting the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set.
    • Improved performance of sparse matrix functionality on the Intel Xeon Phi processor x200.
  • Intel MKL PARDISO:
    • Improved performance of parallel solving step for matrices with fewer than 300000 elements.
    • Added support for mkl_progress in Parallel Direct Sparse Solver for Clusters.
    • Added fully distributed reordering step to Parallel Direct Sparse Solver for Clusters.
  • Fourier Transforms:
    • Improved performance of batched 1D FFT with large batch sizes on processors supporting the Intel® Advanced Vector Extensions (Intel® AVX), Intel AVX2, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and Intel AVX-512 Intel® Many Integrated Core Instructions (Intel® MIC Instructions) instruction sets.
    • Improved performance for small size batched 2D FFT on the Intel Xeon Phi processor x200 self-boot platform, Intel Xeon processor E5-xxxx v3, and Intel Xeon processor E5-xxxx v4.
    • Improved performance for 3D FFT on the Intel Xeon Phi processor x200 self-boot platform. 
  • LAPACK:
    • Included the latest LAPACK v3.6 enhancements. New features introduced are:
      • SVD by Jacobi ([CZ]GESVJ) and preconditioned Jacobi ([CZ]GEJSV)
      • SVD via EVD allowing computation of a subset of singular values and vectors (?GESVDX)
      • Level-3 BLAS versions of generalized Schur (?GGES3), generalized EVD (?GGEV3), generalized SVD (?GGSVD3), and reduction to generalized upper Hessenberg form (?GGHD3)
      • Multiplication of a general matrix by a unitary or orthogonal matrix that possesses a 2x2 block structure ([DS]ORM22/[CZ]UNM22)
    • Improved performance for large-size QR (?GEQRF) on processors supporting the Intel AVX2 instruction set.
    • Improved LU factorization, solve, and inverse (?GETR?) performance for very small sizes (<16).
    • Improved General Eigensolver (?GEEV and ?GEEVD) performance for the case when eigenvectors are needed.
    • Improved ?GETRF, ?POTRF, ?GEQRF, linear solver (?GETRS), and SMP LINPACK performance on the Intel Xeon Phi processor x200 self-boot platform.
  • ScaLAPACK:
    • Improved performance for hybrid (MPI + OpenMP*) mode of ScaLAPACK and PBLAS.
    • Improved performance of P?GEMM and P?TRSM, resulting in better scalability of the Qbox First-Principles Molecular Dynamics code.
  • Data Fitting:
    • Introduced two new storage formats for interpolation results (DF_MATRIX_STORAGE_SITES_FUNCS_DERS and DF_MATRIX_STORAGE_SITES_DERS_FUNCS).
    • Added Hyman monotonic cubic spline.
    • Improved performance of Data Fitting functionality on the Intel Xeon Phi processor x200.
    • Modified callback APIs to allow users to pass information about integration limits.
  • Vector Mathematics:
    • Introduced optimizations for the Intel Xeon Phi processor x200.
    • Improved performance for Intel Xeon processor E5-xxxx v3 and Intel Xeon processor E5-xxxx v4.
  • Vector Statistics:
    • Introduced additional optimization of SkipAhead method for MT19937 and SFMT19937.
    • Improved performance of Vector Statistics functionality, including Random Number Generators and Summary Statistics, on the Intel Xeon Phi processor x200.

Deprecation Notices:

  • Removed pre-compiled BLACS library for MPICH v1; MPICH users can still build the BLACS library with MPICH support via Intel MKL MPI wrappers.
  • The SP2DP interface library is removed.
  • The PGI* compiler on IA32 is no longer supported.
  • Installation on IA-32 architecture hosts is no longer supported, and the Intel MKL packages for Intel® 64 architecture hosts include both 64-bit and 32-bit Intel MKL libraries.
  • Red Hat Enterprise Linux* 5.0 support is dropped.

Known Limitations:

  • cblas_?gemm_alloc is not supported on Windows* OS for the IA-32 architectures with single dynamic library linking.
  • Intel MKL (in Intel Parallel Studio XE) Integration with Microsoft Visual Studio in IA-32 environment is limited. This issue does not affect the Intel® 64 target environment. Intel MKL (in Intel® System Studio) integration with Microsoft Visual Studio is limited in both IA-32 and Intel 64 environments.
  • 1D complex-to-complex FFT may return incorrect results on systems with Intel Advanced Vector Extensions 512 (Intel AVX-512) support if the number of threads differs between DFT descriptor commit time and DFT execution.
  • {s,d}gemm_compute may leak memory if only one of the matrices is packed, in sequential Intel MKL for Intel AVX2 and above. Workaround: Use multi-threaded Intel MKL and set MKL_NUM_THREADS to 1 instead of using sequential Intel MKL.
  • nodeperf.c, which comes with the MP LINPACK Benchmark for Clusters package, may fail to run on Windows*.

    Workaround: Use the Intel Optimized MP LINPACK Benchmark for benchmarking individual nodes on a Windows* cluster. Alternatively, uncomment line 551 and comment out line 552 in nodeperf.c to use malloc instead of mkl_malloc.

    Note: nodeperf.c will be removed in MKL 2017 Update 1. We recommend using the MP LINPACK benchmark directly for measuring cluster performance.

Product Contents

Intel MKL now ships as a single package for both IA-32 and Intel® 64 architectures, and is also available through an online installer.

Technical Support

If you did not register your Intel software product during installation, please do so now at the Intel® Software Development Products Registration Center. Registration entitles you to free technical support, product updates, and upgrades for the duration of the support term.

For general information about Intel technical support, product updates, user forums, FAQs, tips and tricks and other support questions, please visit http://www.intel.com/software/products/support/.

Note: If your distributor provides technical support for this product, please contact them rather than Intel.

For technical information about Intel MKL, including FAQs, tips and tricks, and other support information, please visit the Intel MKL forum: http://software.intel.com/en-us/forums/intel-math-kernel-library/ and browse the Intel MKL knowledge base: http://software.intel.com/en-us/articles/intel-mkl-kb/all/.

Attributions

The original versions of the BLAS from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/blas/index.html.

The original versions of LAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/lapack/index.html. The authors of LAPACK are E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. Our FORTRAN 90/95 interfaces to LAPACK are similar to those in the LAPACK95 package at http://www.netlib.org/lapack95/index.html. All interfaces are provided for pure procedures.

The original versions of ScaLAPACK from which that part of Intel MKL was derived can be obtained from http://www.netlib.org/scalapack/index.html. The authors of ScaLAPACK are L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley.

The Intel MKL Extended Eigensolver functionality is based on the FEAST Eigenvalue Solver 2.0 (http://www.ecs.umass.edu/~polizzi/feast/).

PARDISO (PARallel DIrect SOlver)* in Intel MKL was originally developed by the Department of Computer Science at the University of Basel http://www.unibas.ch . It can be obtained at http://www.pardiso-project.org.

Some FFT functions in this release of Intel MKL have been generated by the SPIRAL software generation system (http://www.spiral.net/) under license from Carnegie Mellon University. The Authors of SPIRAL are Markus Puschel, Jose Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo.

For more complete information about compiler optimizations, see our Optimization Notice.