Developer Reference

  • 0.9
  • 09/09/2020
  • Public Content

OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use
Intel® oneAPI Math Kernel Library (oneMKL)
and OpenMP* offload to run standard
computations on a GPU. You can find the list of
features that support OpenMP offload in the
header file which includes:
  • All Level 1, 2, and 3 BLAS functions through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
  • BLAS-like extension:
    , and
    functionality through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
  • LAPACK (currently, only synchronous execution is supported)
    • {s,d,c,z}getrf
    • {s,d,c,z}getri
    • {s,d,c,z}getrs
    • {s,d}orgqr
    • {s,d}ormqr
    • {s,d,c,z}trtrs
    • {s,d,c,z}steqr
    • {s,d,c,z}geqrf
    • {s,d,c,z}gebrd
    • {s,d}syev
    • {s,d}syevd
    • {s,d}syevx
    • {s,d}sygvd
    • {s,d}sygvx
    • {s,d}sytrd
  • DFTI complex-to-complex (C2C) and real-to-complex R2C)/complex-to-real (C2R) transforms in one, two, and three dimensions. For R2C/C2R transforms on the GPU, only DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX is supported (implying DFTI_PACKED_FORMAT=DFTI_CCE_FORMAT). Both synchronous and asynchronous computations are supported.
The OpenMP offload feature from
Intel® oneAPI Math Kernel Library
allows you to run
computations on a GPU through the standard
APIs within an
omp target variant dispatch
section. For example, the standard CBLAS API for single precision real data type matrix multiply is:
void cblas_sgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA, const CBLAS_TRANSPOSE TransB, const MKL_INT M, const MKL_INT N, const MKL_INT K, const float alpha, const float *A, const MKL_INT lda, const float *B, const MKL_INT ldb, const float beta, float *C, const MKL_INT ldc);
If the
function (for example,
) is called outside of an
omp target variant dispatch
section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the
function within a dispatch variant construct is synchronous. OpenMP offload computations may be done asynchronously by adding the
clause to the
target variant dispatch
construct. This ensures that the host thread encountering the task region generated by this target construct will not be blocked by the
call. Rather, the host thread is returned to the caller for further use. To finish the asynchronous (
) computations and ensure memory and execution model consistency (for example, that the results of a computation will be ready in memory to map), the last such
computation is followed by the stand-alone construct
#pragma omp taskwait
From the
OpenMP Application Programming Interface
version 5.0 specification: "The taskwait region binds to the current task region [i.e., in this case, the last nowait computation]. The current task region is suspended at an implicit task scheduling point associated with the construct. The current task region remains suspended until all child tasks that it generated before the taskwait region complete execution [currently, depend clause is not supported]."


Examples for using the OpenMP offload for
are located in the
Intel® oneAPI Math Kernel Library
installation directory, under:
The following code snippet shows how to use OpenMP offload for single-call
features such as most dense linear algebra functionality.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" // MKL header file for OpenMP offload int dnum = 0; int main() { float *a, *b, *c, alpha = 1.0, beta = 1.0; MKL_INT m = 150, n = 200, k = 128, lda = m, ldb = k, ldc = m; MKL_INT sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; // allocate matrices and check pointers a = (float *)mkl_malloc(sizea * sizeof(float), 64); ... // initialize matrices #pragma omp target map(c[0:sizec]) { for (i = 0; i < sizec; i++) { c[i] = 42; } ... } // run gemm on host, use standard MKL interface cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); // map the a, b and c matrices on the device memory #pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum) { // run gemm on gpu, use standard MKL interface within a variant dispatch construct // if offload is not possible, default to cpu // use the use_device_ptr clause to specify that a, b and c are device memory #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c) { cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); } } // Free matrices mkl_free(a); … }
Some of the
functionality requires to call a set of functions to perform the corresponding computation. This is the case for example for the Discrete Fourier Transform which for a typical computation involves calling the functions.
In that case, only a subset of the calls must be wrapped in an
omp target variant dispatch
construct as shown in the following code snippet for DFTI.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" int main(void) { const int devNum = 0; const MKL_LONG N = 64; // Size of 1D transform MKL_LONG status = 0; MKL_LONG statusGPU = 0; DFTI_DESCRIPTOR_HANDLE descHandle = NULL; DFTI_DESCRIPTOR_HANDLE descHandleGPU = NULL; MKL_Complex8 *x = NULL; MKL_Complex8 *xGPU = NULL; printf("Create DFTI descriptor\n"); status = DftiCreateDescriptor(&descHandle, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Create GPU DFTI descriptor\n"); statusGPU = DftiCreateDescriptor(&descHandleGPU, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Commit DFTI descriptor\n"); status = DftiCommitDescriptor(descHandle); printf("Commit GPU DFTI descriptor\n"); #pragma omp target variant dispatch device(devNum) { statusGPU = DftiCommitDescriptor(descHandleGPU); } printf("Allocate memory for input array\n"); x = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Allocate memory for GPU input array\n"); xGPU = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Initialize input for forward FFT\n"); // init x and xGPU ... printf("Compute forward FFT in-place\n"); status = DftiComputeForward(descHandle, x); printf("Compute GPU forward FFT in-place\n"); #pragma omp target data map(tofrom:xGPU[0:N]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(xGPU) device(devNum) { statusGPU = DftiComputeForward(descHandleGPU, xGPU); } } // use results now in x and xGPU ... cleanup: DftiFreeDescriptor(&descHandle); DftiFreeDescriptor(&descHandleGPU); mkl_free(x); mkl_free(xGPU); }
For asynchronous execution of multi-call
computation, the
clause need to be used only on the call to the function performing the actual computation (for example,
). For instance, the following snippet shows how the DFTI example above could be changed to have two, back-to-back, asynchronous (
) computations dispatched, with a
at the end of the second to ensure the completion of both computations before their results are accessed:
printf("Compute GPU forward FFT 1 in-place\n"); #pragma omp target data map(tofrom:x1GPU[0:N1], x2GPU[0:N2]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(x1GPU) device(devNum) nowait { status1GPU = DftiComputeForward(descHandle1GPU, x1GPU); } printf("Compute GPU forward FFT 2 in-place\n"); #pragma omp target variant dispatch use_device_ptr(x2GPU) device(devNum) nowait { status2GPU = DftiComputeForward(descHandle2GPU, x2GPU); } #pragma omp taskwait } if (status1GPU != DFTI_NO_ERROR) goto failed; if (status2GPU != DFTI_NO_ERROR) goto failed;

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804