Intel® oneAPI Math Kernel Library (oneMKL) Release Notes

ID 765830
Updated 12/21/2023
Version 2024.0
Public

Where to Find the Release

Intel® oneAPI Math Kernel Library

2024.0

System Requirements  Bug Fix Log

What’s new? 

  • Integrates Vector Math optimizations into Random Number Generators for high performance computer simulations, statistical sampling, and other areas on x86 CPUs and Intel GPUs. 
  • Supports Vector Math for the FP16 datatype on Intel® GPUs.
  • Delivers high-performance benchmarks HPL and HPL-AI optimized for Intel® Xeon® CPU Max Series and Intel® Data Center GPU Max Series.

Directory Layout

Directory layout is improved across all products to streamline installation and setup. 

The Unified Directory Layout is implemented in 2024.0. If you have multiple toolkit versions installed, the Unified layout ensures that your development environment contains the correct component versions for each installed version of the toolkit. 

The directory layout used before 2024.0, the Component Directory Layout, is still supported on new and existing installations. 

For detailed information about the Unified layout, including how to initialize the environment and the advantages of the Unified layout, refer to Use the setvars and oneapi-vars Scripts with Linux and Use the setvars and oneapi-vars Scripts with Windows.

New Features and Optimizations

  • BLAS

    • Features 
      • Scalar parameters (alpha, beta) to BLAS USM APIs may now be passed by pointer or by value.  
      • Added complex_3m acceleration for GEMM (including batched variants) on Intel® Data Center GPU Max Series. 
      • Added strided versions of gemm3m_batch C and Fortran APIs, including OpenMP* offload support. 
      • Added {cblas_}gemm_f16f16f32 C APIs. These are the half-precision (MKL_F16) analogues of the previously introduced gemm_bf16bf16f32 APIs for bfloat16 (MKL_BF16).
    • Optimizations
      • Enhanced HGEMM performance for small matrices on CPUs. 
      • Improved general performance of GEMV and several BLAS level-1 routines on Intel® Data Center GPU Max Series. 
  • Sparse BLAS

    • Features
      • Inspector Executor Sparse BLAS C APIs now include mkl_sparse_<xyz>_64() APIs using MKL_INT64 for all integers in lp64 and ilp64 modes. 
      • Added std::complex<float> and std::complex<double> support for all existing sparse BLAS SYCL* APIs.
      • Added support for oneapi::mkl::transpose::conjtrans operation to sparse::gemv and sparse::omatcopy SYCL* APIs. 
      • Added support for oneapi::mkl::transpose::{trans, conjtrans} operation on the sparse matrix in sparse::gemm SYCL* API. 
    • Optimizations
      • Improved performance of sparse::gemv/trmv for matrices whose number of non-zeros per row varies widely.
      • Improved sparse::matmat performance for key workloads. 
  • LAPACK 

    • Features
      • Introduced SYCL* APIs to compute LU factorization without pivoting (lapack::getrfnp); added support for OpenMP* offloading in C and Fortran (mkl_?getrfnp).
      • Introduced SYCL* APIs to compute batched matrix inverse of a group of general matrices (lapack::geinv_batch). 
      • Added argument checking for lapack::gerqf, lapack::hetrf, lapack::orgbr, lapack::orgtr, lapack::ormrq, lapack::ormtr, lapack::sytrf, lapack::ungbr, lapack::ungtr, lapack::unmrq, lapack::unmtr, and their scratchpad size functions. 
    • Optimizations
      • Improved performance of QR factorization (lapack::geqrf) on Intel® Data Center GPU Max Series for SYCL* USM APIs as well as for C and Fortran OpenMP* offloading.
      • Improved performance of orthogonal/unitary matrix multiplication (lapack::ormqr/lapack::unmqr) on Intel® GPUs for SYCL* APIs and C and Fortran OpenMP* offloading.
      • Improved performance of batched strided LU inverse (lapack::getri_batch) on Intel® GPUs for SYCL* APIs, especially for a smaller number of larger matrices. 
  • DFT 

    • Features 
      • Enabled FFTs larger than 4 GiB (up to 64 GiB of data) on Intel® Data Center GPU Max Series.
    • Optimizations 
      • Improved double precision FFT performance on Intel® Data Center GPU Max Series for any FFT with at least one dimension divisible by a prime number in the range [11,61]. 
      • Improved 1D complex FFT performance on Intel® Data Center GPU Max Series for power of two sizes in the range [2²¹, 2²⁵].  
  • Vector Math

    • Features
      • Added support for OpenMP* 5.1 offloading in C. 
      • Added SYCL*–OpenMP* interoperability support for OpenMP* offloading. 
      • Status and Mode were aligned in the Classic and Offloading versions of VM. 
      • J0/J1 Bessel functions of the first kind (orders 0 and 1) for real arguments were added for GPUs.
      • Y0/Y1 Bessel functions of the second kind (orders 0 and 1) for real arguments were added for GPUs.
      • I0/I1 modified Bessel functions of the first kind (orders 0 and 1) for real arguments were added for GPUs.
    • Optimizations
      • HA versions of cexp, cln, csqrt were added in native precision for GPUs. 
      • Native FP16 cos/exp/exp10/ln/log10/log2/sin were added for GPUs. 
      • The FP16 host API performance on GPU was improved by up to 30%. 
  • Vector Statistics 

    • Features
      • Enabled Verbose mode support for RNG SYCL* Host API. 
    • Optimizations
      • Optimized mrg32k3a and philox4x32x10 RNG SYCL* Device API performance on Intel® Data Center GPU Max Series. 
  • Sparse Solvers 

    • Features 
      • Improved accuracy of generalized eigenvalues calculated using mkl_sparse_?_gv for symmetric matrix types. 
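
The complex_3m acceleration for GEMM listed under BLAS refers to forming a complex matrix product from three real matrix products instead of the usual four. The sketch below illustrates the textbook 3M identity in plain Python. It is not the oneMKL implementation, and the names matmul, elementwise, and gemm_3m are hypothetical helpers chosen for this illustration.

```python
from operator import add, sub


def matmul(a, b):
    """Multiply two matrices given as lists of rows (real or complex entries)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def elementwise(op, a, b):
    """Apply a binary operation entry by entry to two same-shaped matrices."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gemm_3m(a, b):
    """Complex product a @ b built from three real matrix products."""
    # Split both operands into real and imaginary parts.
    ar = [[z.real for z in row] for row in a]
    ai = [[z.imag for z in row] for row in a]
    br = [[z.real for z in row] for row in b]
    bi = [[z.imag for z in row] for row in b]

    # The three real products of the 3M identity.
    t1 = matmul(ar, br)
    t2 = matmul(ai, bi)
    t3 = matmul(elementwise(add, ar, ai), elementwise(add, br, bi))

    # Re(C) = t1 - t2, Im(C) = t3 - t1 - t2.
    c_real = elementwise(sub, t1, t2)
    c_imag = elementwise(sub, elementwise(sub, t3, t1), t2)
    return elementwise(complex, c_real, c_imag)
```

The identity trades one real matrix product for a few extra matrix additions. Because the imaginary part is obtained by subtraction, its rounding behavior can differ from the conventional four-product formulation, so results should be compared with a tolerance rather than bit for bit.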

 

Library Engineering

  • The following domain-specific SYCL* libraries are now available in addition to the combined mkl_sycl library:
    • libmkl_sycl_blas.so 
    • libmkl_sycl_lapack.so (depends on libmkl_sycl_blas.so) 
    • libmkl_sycl_sparse.so (depends on libmkl_sycl_blas.so) 
    • libmkl_sycl_vm.so 
    • libmkl_sycl_rng.so 
    • libmkl_sycl_stats.so 
    • libmkl_sycl_data_fitting.so 

      MKLConfig.cmake also provides corresponding targets to link domain specific SYCL* libraries via MKL::MKL_SYCL::<domain> 
  • Dropped all SSSE3 and AVX optimizations 
  • With the removal of classic compiler support, all references to this compiler have been replaced with icx. 
  • MKLConfig.cmake now rejects operation when the oneMKL version found in the environment variable MKLROOT differs from the version found by CMake. 
  • Removed find_package_handle_standard_args() in MKLConfig.cmake, as it incorrectly set MKL_FOUND. 
  • MKLConfig.cmake: Removed the oneMKL path from the implicit include directories so that the oneMKL include directory path is always explicitly defined, regardless of whether it is present in the user’s CPATH environment variable. This resolves an issue when cmake is called from different environments. Note that this change applies to C and C++ only, not to Fortran: according to the CMake 3.14+ documentation, the implicit include directories variable is not used for Fortran.
  • Removed __cdecl, its related macros, and *_win.h files. 

 

Fixed Issues

  • Fixed an issue where oneMKL DFT SYCL* APIs could fail to compute correct results for 2D and 3D real FFTs when using a user-allocated SYCL* buffer workspace and the OpenCL* runtime.
  • Improved BLAS support for host USM pointers. 
  • Fixed SYMM/TRSM accuracy issues.  
  • Fixed SGEMM/DGEMM/SYRK failures and memory leaks. 
  • Fixed Fortran OpenMP* issues when complex-precision division is used on Windows on Intel® Iris® Xe Max and Intel® Arc™ A-Series GPUs with static linking. 

Known Issues and Limitations

  • The getri_batch_usm and getri_oop_batch_usm LAPACK examples that are located at ${MKLROOT}/examples/dpcpp/lapack may fail on Intel® Iris® Xe MAX Graphics on Windows* in debug_mode. 
  • On Intel® Iris® Xe MAX Graphics, {c,s}getrfnp_batch functions may hang or cause a segmentation fault. As a workaround, use the {c,s}getrfnp_batch_strided functions instead.
  • OpenMP* offload of Fortran LAPACK functions cpotrf, cpotri, cpotrs, ctrtri, spotrf, spotri, spotrs, strtri to GPU under Windows* in static linking mode may crash. As a workaround, use dynamic linking mode. 
  • oneMKL DFT SYCL* APIs that take SYCL* buffers as data input do not support SYCL* sub-buffer inputs for 1D complex FFTs with large power-of-two sizes in the range [2²¹, 2²⁶].
  • Double-precision FFTs whose sizes are multiples of very large primes may produce incorrect results on CPUs.
  • 2D and 3D FFTs might hang on Intel® Data Center GPU Max Series when GPU debugging is enabled. As a workaround, set the environment variables NEOReadDebugKeys=1 and EnableRecoverablePageFaults=0, or disable GPU debugging by writing 0 to the files /sys/class/drm/card*/prelim_enable_eu_debug.
  • The Mrg32k3a random number engine may fail on Intel® Arc™ A-Series Graphics GPUs on Windows* when the /Od option is enabled.
  • Random number generator Device APIs that use Vector Math Device APIs underneath do not work on Intel® GPUs without native double-precision support, due to Vector Math restrictions.
  • Some Sparse BLAS SYCL* examples (sparse_gemm_col_major/sparse_gemm_row_major) are known to fail with oneMKL 2024.0 on Windows* when run in Debug mode. Please use Release mode linking to use this particular functionality.
  • Use the prebuilt oneMKL 2024.0 HPCG binaries with the oneAPI 2024.0 compiler runtime for the best performance. Compiling HPCG from sources with the current icpx compiler may result in slightly lower performance than when compiling it with compilers from earlier oneAPI releases. 
  • The oneapi::mkl::sparse::trsv() sycl::buffer APIs may crash with a segmentation fault when any of the CSR matrix data or the x or y vectors are sub-buffers of a sycl::buffer.
  • Asynchronous execution of mkl_sparse_optimize() for mkl_sparse_x_mv() using OpenMP* offloading in C can sporadically hang on Intel® Data Center GPU Max Series. As a workaround, use synchronous offloading for mkl_sparse_optimize(). 
  • Strided and group batched non-pivoting LU (getrfnp_batch) for complex precisions provides incorrect values on Intel® Data Center GPU Max Series with certain drivers. 
  • The oneMKL SYCL DLL can leak memory after it is unloaded on Windows. To avoid the problem, call mkl_free_buffers before unloading the DLL.
  • The Intel® oneMKL NuGet packages intelmkl.static.cluster.win-x64 and intelmkl.devel.cluster.win-x64 cannot be added to a .NET Standard 2.0 or higher project because a dependent package (intelmpi.devel.win-x64) is not compatible with the 2.0 standard. An updated intelmpi.devel.win-x64 package will be published to address the compatibility with the 2.0 standard.
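
Several of the items above concern non-pivoting LU factorization (getrfnp and its batched variants). As background, the sketch below shows what a factorization without row exchanges computes, using a plain Python Doolittle loop. It is an illustration only, not the oneMKL implementation; the function name lu_nopivot, the packed return layout, and the choice to raise an exception on a zero pivot are assumptions made for this example.

```python
def lu_nopivot(a):
    """Factor a square matrix as A = L * U without row exchanges.

    Returns one matrix holding U on and above the diagonal and the
    multipliers of L below it (the unit diagonal of L is implied).
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]  # work on a copy
    for k in range(n):
        pivot = lu[k][k]
        if pivot == 0.0:
            # Without pivoting there is no row to swap in, so the sketch stops.
            raise ValueError(f"zero pivot at step {k}; factorization needs pivoting")
        for i in range(k + 1, n):
            lu[i][k] /= pivot  # multiplier l_ik
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]  # update the trailing block
    return lu


# Example: [[4, 3], [6, 3]] factors into L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]].
packed = lu_nopivot([[4, 3], [6, 3]])
```

Skipping the pivot search removes a data-dependent step, which tends to suit batched GPU execution, but the factorization is only numerically reliable for matrices that do not need row exchanges, such as diagonally dominant ones.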

Known Issues and Limitations for Intel® GPU Driver Version 20231219 

The limitations in this section do not apply to the execution of Intel® oneMKL on CPUs. 

  • The LAPACK batch strided least squares solver (oneapi::mkl::lapack::gels_batch, ?gels_batch_strided with OpenMP* offload) may return incorrect results on all Intel® GPUs. As a workaround, the previous GPU driver version 20231031 can be used. A list of supported GPUs of that version can be found in the driver 20231031 release notes. 
  • oneMKL double precision FFT may fail or crash on the integrated GPUs of Intel® Core Ultra processors for driver version 20231219. The issue will be fixed in future releases of the driver.
  • The oneMKL RNG Sobol Host API and Stats routines may throw an exception when executed on any Intel® GPU device. As a workaround, the previous GPU driver version 20231031 can be used. A list of supported GPUs of that version can be found in the driver 20231031 release notes.

Deprecation/Removal 

  • Graph domain APIs have been removed in the oneMKL 2024.0 release. 
  • Intel® oneAPI Math Kernel Library (oneMKL) for macOS was deprecated in release 2023.0 and is discontinued as of Intel® oneMKL release 2024.0.

 

Previous oneAPI Releases

2023

Release Notes, System Requirements and Bug Fix Log

2022

Release Notes, System Requirements and Bug Fix Log

2021

Release Notes, System Requirements and Bug Fix Log

2017-2020

Release Notes, System Requirements and Bug Fix Log

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.