Developer Reference

  • 2021.1
  • 12/04/2020
  • Public Content
Contents

OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use
Intel® oneAPI Math Kernel Library (oneMKL)
and OpenMP* offload to run standard
oneMKL
computations on Intel GPUs. You can find the list of
oneMKL
features that support OpenMP offload in the
mkl_omp_offload.h
header file which includes:
  • All Level 1, 2, and 3 BLAS functions through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
  • BLAS-like extension:
    ?axpy_batch{_strided}
    ,
    ?gemm_batch{_strided}
    ,
    ?trsm_batch{_strided}
    , and
    ?gemmt
    functionality through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
  • LAPACK (currently, only synchronous execution is supported)
    • {s,d,c,z}getrf
    • {s,d,c,z}getri
    • {s,d,c,z}getrs
    • {s,d}orgqr
      ,
      {c,z}ungqr
    • {s,d}ormqr
      ,
      {c,z}unmqr
    • {s,d,c,z}trtrs
    • {s,d,c,z}steqr
    • {s,d,c,z}geqrf
    • {s,d,c,z}gebrd
    • {s,d}syev
      ,
      {c,z}heev
    • {s,d}syevd
      ,
      {c,z}heevd
    • {s,d}syevx
      ,
      {c,z}heevx
    • {s,d}sygvd
      ,
      {c,z}hegvd
    • {s,d}sygvx
      ,
      {c,z}hegvx
    • {s,d}sytrd
      ,
      {c,z}hetrd
  • Vector Statistics
    • Distributions:
      v?RngUniform
      ,
      v?RngGaussian
      ,
      v?RngExponential
      ,
      v?RngLognormal
      ,
      v?RngCauchy
      ,
      v?RngGumbel
      ,
      v?RngLaplace
      ,
      v?RngRayleigh
      ,
      v?RngWeibull
      ,
      viRngBernoulli
      ,
      viRngGeometric
      ,
      viRngPoisson
      ,
      viRngUniformBits
      ,
      viRngUniformBits32/64
    • Basic Random Number Generators:
      VSL_BRNG_MCG31
      ,
      VSL_BRNG_MCG59
      ,
      VSL_BRNG_PHILOX4X32X10
      ,
      VSL_BRNG_MRG32K3A
  • DFTI complex-to-complex (C2C) and real-to-complex R2C)/complex-to-real (C2R) transforms in one, two, and three dimensions. For R2C/C2R transforms on the GPU, only DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX is supported (implying DFTI_PACKED_FORMAT=DFTI_CCE_FORMAT). Both synchronous and asynchronous computations are supported.
  • Sparse BLAS
    • mkl_sparse_{s, d}_create_csr
    • mkl_sparse_{s, d}_mv
      :
      • only supported with
        SPARSE_MATRIX_TYPE_GENERAL
        and
        SPARSE_OPERATION_NON_TRANSPOSE
      • both synchronous and asynchronous computations are supported
    • mkl_sparse_destroy
    • In Sparse BLAS, there are three stages for the usage model. The create/inspection stage, the execution stage, and the destruction stage. For Sparse BLAS with C OpenMP Offload, only the middle execution stage can be asynchronously executed, provided any data dependencies are already respected. If the computation part were executed asynchronously,
      #pragma omp taskwait
      should be called before the
      mkl_sparse_destroy()
      call.
The OpenMP offload feature from
Intel® oneAPI Math Kernel Library
allows you to run
oneMKL
computations on Intel GPUs through the standard
oneMKL
APIs within an
omp target variant dispatch
section. For example, the standard CBLAS API for single precision real data type matrix multiply is:
void cblas_sgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA, const CBLAS_TRANSPOSE TransB, const MKL_INT M, const MKL_INT N, const MKL_INT K, const float alpha, const float *A, const MKL_INT lda, const float *B, const MKL_INT ldb, const float beta, float *C, const MKL_INT ldc);
If the
oneMKL
function (for example,
cblas_sgemm
) is called outside of an
omp target variant dispatch
section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the
oneMKL
function within a dispatch variant construct is synchronous. OpenMP offload computations may be done asynchronously by adding the
nowait
clause to the
target variant dispatch
construct. This ensures that the host thread encountering the task region generated by this target construct will not be blocked by the
oneMKL
call. Rather, the host thread is returned to the caller for further use. To finish the asynchronous (
nowait
) computations and ensure memory and execution model consistency (for example, that the results of a computation will be ready in memory to map), the last such
nowait
computation is followed by the stand-alone construct
#pragma omp taskwait
.
From the
OpenMP Application Programming Interface
version 5.0 specification: "The taskwait region binds to the current task region [i.e., in this case, the last nowait computation]. The current task region is suspended at an implicit task scheduling point associated with the construct. The current task region remains suspended until all child tasks that it generated before the taskwait region complete execution [currently, depend clause is not supported]."

Example

Examples for using the OpenMP offload for
oneMKL
are located in the
Intel® oneAPI Math Kernel Library
installation directory, under:
examples/c_openmp
The following code snippet shows how to use OpenMP offload for single-call
oneMKL
features such as most dense linear algebra functionality.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" // MKL header file for OpenMP offload int dnum = 0; int main() { float *a, *b, *c, alpha = 1.0, beta = 1.0; MKL_INT m = 150, n = 200, k = 128, lda = m, ldb = k, ldc = m; MKL_INT sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; // allocate matrices and check pointers a = (float *)mkl_malloc(sizea * sizeof(float), 64); ... // initialize matrices #pragma omp target map(c[0:sizec]) { for (i = 0; i < sizec; i++) { c[i] = 42; } ... } // run gemm on host, use standard MKL interface cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); // map the a, b and c matrices on the device memory #pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum) { // run gemm on gpu, use standard MKL interface within a variant dispatch construct // if offload is not possible, default to cpu // use the use_device_ptr clause to specify that a, b and c are device memory #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c) { cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); } } // Free matrices mkl_free(a); … }
Some of the
oneMKL
functionality requires to call a set of functions to perform the corresponding computation. This is the case for example for the Discrete Fourier Transform which for a typical computation involves calling the functions.
DFTI_EXTERN MKL_LONG DftiCreateDescriptor(DFTI_DESCRIPTOR_HANDLE*, enum DFTI_CONFIG_VALUE, enum DFTI_CONFIG_VALUE, MKL_LONG, ...); DFTI_EXTERN MKL_LONG DftiCommitDescriptor(DFTI_DESCRIPTOR_HANDLE); DFTI_EXTERN MKL_LONG DftiComputeForward(DFTI_DESCRIPTOR_HANDLE, void*, ...); DFTI_EXTERN MKL_LONG DftiComputeBackward(DFTI_DESCRIPTOR_HANDLE, void*, ...); DFTI_EXTERN MKL_LONG DftiFreeDescriptor(DFTI_DESCRIPTOR_HANDLE*);
In that case, only a subset of the calls must be wrapped in an
omp target variant dispatch
construct as shown in the following code snippet for DFTI.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" int main(void) { const int devNum = 0; const MKL_LONG N = 64; // Size of 1D transform MKL_LONG status = 0; MKL_LONG statusGPU = 0; DFTI_DESCRIPTOR_HANDLE descHandle = NULL; DFTI_DESCRIPTOR_HANDLE descHandleGPU = NULL; MKL_Complex8 *x = NULL; MKL_Complex8 *xGPU = NULL; printf("Create DFTI descriptor\n"); status = DftiCreateDescriptor(&descHandle, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Create GPU DFTI descriptor\n"); statusGPU = DftiCreateDescriptor(&descHandleGPU, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Commit DFTI descriptor\n"); status = DftiCommitDescriptor(descHandle); printf("Commit GPU DFTI descriptor\n"); #pragma omp target variant dispatch device(devNum) { statusGPU = DftiCommitDescriptor(descHandleGPU); } printf("Allocate memory for input array\n"); x = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Allocate memory for GPU input array\n"); xGPU = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Initialize input for forward FFT\n"); // init x and xGPU ... printf("Compute forward FFT in-place\n"); status = DftiComputeForward(descHandle, x); printf("Compute GPU forward FFT in-place\n"); #pragma omp target data map(tofrom:xGPU[0:N]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(xGPU) device(devNum) { statusGPU = DftiComputeForward(descHandleGPU, xGPU); } } // use results now in x and xGPU ... cleanup: DftiFreeDescriptor(&descHandle); DftiFreeDescriptor(&descHandleGPU); mkl_free(x); mkl_free(xGPU); }
For asynchronous execution of multi-call
oneMKL
computation, the
nowait
clause need to be used only on the call to the function performing the actual computation (for example,
DftiCompute{Forward,Backward}
). For instance, the following snippet shows how the DFTI example above could be changed to have two, back-to-back, asynchronous (
nowait
) computations dispatched, with a
taskwait
at the end of the second to ensure the completion of both computations before their results are accessed:
printf("Compute Intel GPU forward FFT 1 in-place\n"); #pragma omp target data map(tofrom:x1GPU[0:N1], x2GPU[0:N2]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(x1GPU) device(devNum) nowait { status1GPU = DftiComputeForward(descHandle1GPU, x1GPU); } printf("Compute Intel GPU forward FFT 2 in-place\n"); #pragma omp target variant dispatch use_device_ptr(x2GPU) device(devNum) nowait { status2GPU = DftiComputeForward(descHandle2GPU, x2GPU); } #pragma omp taskwait } if (status1GPU != DFTI_NO_ERROR) goto failed; if (status2GPU != DFTI_NO_ERROR) goto failed;
For sparse BLAS computations, the workflow ‘create a CSR matrix handle’ -> ‘compute’ -> ‘destroy the CSR matrix handle’ should be done in a single target data region. For instance, the following snippet shows how the Sparse BLAS OpenMP Offload example for
mkl_sparse_s_mv()
could be run, where
N
is the number of rows,
M
is the number of columns, and
NNZ
is the number of non-zero entries of the sparse matrix
csrA_gpu
,
x
is the input vector, and the output is stored in the
z
array:
#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M]) map(tofrom:z[0:N]) device(devNum) { #pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a) { status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a); } #pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) { status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z); } #pragma omp target variant dispatch device(devNum) { status_gpu3 = mkl_sparse_destroy(csrA_gpu); } }
For asynchronous execution of multi-call
oneMKL
Sparse BLAS computation, the
nowait
clause can be added to the call of the function performing the actual computation (for example, calls to the
mkl_sparse_{s,d}_mv()
function). Adding the
nowait
clause to the
mkl_sparse_s_create_csr()
or
mkl_sparse_destroy()
calls may result in incorrect behavior. As an example, the following snippet shows how the Sparse BLAS example above could be changed to have two asynchronous (
nowait
) computations using the same matrix handle,
csrA_gpu
, but unrelated vector data so there is no read/write dependency between them. Add a
taskwait
at the end of the second execution to ensure the completion of both computations before the
mkl_sparse_destroy()
function is called:
#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M],w[0:M]) map(tofrom:y[0:N],z[0:N]) device(devNum) { #pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a) { status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a); } #pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) nowait { status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z); } #pragma omp target variant dispatch device(devNum) use_device_ptr(w, y) nowait { status_gpu3 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, w, beta, y); } #pragma omp taskwait #pragma omp target variant dispatch device(devNum) { status_gpu4 = mkl_sparse_destroy(csrA_gpu); } }

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.