Developer Guide

Using Libraries for Accelerator Offload

This section discusses using efficient functions from libraries like oneAPI Math Kernel Library (oneMKL) or oneAPI Deep Neural Network Library (oneDNN) instead of hand-coded alternatives. Unless you’re an expert studying a particular mathematical operation, it’s usually a bad idea to write your own version of that operation. For example, matrix multiplication is a common, straightforward mathematical operation:
$$C_{mn} = \sum_{k=0}^{K-1} A_{mk} B_{kn}$$
It’s also easy to implement with just a few lines of code:
    // Multiply matrices A and B
    for (m = 0; m < M; m++) {
        for (n = 0; n < N; n++) {
            C[m][n] = 0.0;
            for (k = 0; k < K; k++) {
                C[m][n] += A[m][k] * B[k][n];
            }
        }
    }
    // End matrix multiplication
However, this naive implementation won’t give the best possible performance. Simple visual inspection of the inner loop shows non-contiguous memory access for matrix B. Cache reuse, and hence performance, will be poor.
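One common way to remove the strided access to B, short of using a tuned library, is loop interchange. The following is a minimal host-only sketch in plain C++ (no SYCL), assuming row-major matrices stored in flat arrays. The helper name `matmul_mkn` and the flat layout are illustrative assumptions, not part of the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector: element (r, c) of a matrix with
// `cols` columns lives at index r * cols + c. (Illustrative layout.)
using Matrix = std::vector<float>;

// Hypothetical helper: C (M x N) = A (M x K) * B (K x N).
// The loops run in m-k-n order, so the innermost loop walks one row of B
// and one row of C with unit stride instead of striding down a column of B.
Matrix matmul_mkn(const Matrix &A, const Matrix &B,
                  std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    Matrix C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; m++) {
        for (std::size_t k = 0; k < K; k++) {
            const float a = A[m * K + k];  // reused for the whole n loop
            for (std::size_t n = 0; n < N; n++) {
                C[m * N + n] += a * B[k * N + n];  // contiguous in n
            }
        }
    }
    return C;
}
```

Calling `matmul_mkn(A, B, M, N, K)` gives the same product as the naive loop nest. Only the order in which memory is traversed changes, and this is just one of several tuning steps a library applies.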
It’s not difficult to port the naive algorithm to Data Parallel C++ (DPC++) to offload the matrix multiplication kernel to an accelerator. The following code initializes the queue to submit work to the default device and allocates space for the matrices in unified shared memory (USM):
    // Initialize SYCL queue
    sycl::queue Q(sycl::default_selector{});
    auto sycl_device = Q.get_device();
    auto sycl_context = Q.get_context();
    std::cout << "Running on: "
              << Q.get_device().get_info<sycl::info::device::name>() << std::endl;

    // Allocate matrices A, B, and C in USM
    auto A = sycl::malloc_shared<float *>(M, sycl_device, sycl_context);
    for (m = 0; m < M; m++)
        A[m] = sycl::malloc_shared<float>(K, sycl_device, sycl_context);

    auto B = sycl::malloc_shared<float *>(K, sycl_device, sycl_context);
    for (k = 0; k < K; k++)
        B[k] = sycl::malloc_shared<float>(N, sycl_device, sycl_context);

    auto C = sycl::malloc_shared<float *>(M, sycl_device, sycl_context);
    for (m = 0; m < M; m++)
        C[m] = sycl::malloc_shared<float>(N, sycl_device, sycl_context);

    // Initialize matrices A, B, and C
Data in USM can be moved between host and device memories by the DPC++ runtime. Explicit buffering is not required. To offload the computation to the default accelerator, it is converted to a DPC++ kernel and submitted to the queue:
    // Offload matrix multiplication kernel
    Q.parallel_for(sycl::range<2>{M, N}, [=](sycl::id<2> id) {
        unsigned int m = id[0];
        unsigned int n = id[1];
        float sum = 0.0;
        for (unsigned int k = 0; k < K; k++)
            sum += A[m][k] * B[k][n];
        C[m][n] = sum;
    }).wait();
    // End matrix multiplication
However, simply offloading such code to an accelerator is unlikely to improve performance. In fact, performance gets worse. Badly written code is still badly written whether it runs on the host or a device. The table later in this section shows how the naive matrix multiplication implementation performs on matrices of different shapes.
Common, computationally demanding operations like matrix multiplication are well studied. Experts have devised a number of algorithms that give better performance than naive implementations of the basic mathematical formulas. They also use tuning techniques like cache blocking and loop unrolling to achieve high performance regardless of the shapes of matrices A and B.
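To make the idea of cache blocking concrete, here is a minimal host-only sketch in plain C++, assuming row-major matrices in flat arrays. The helper name `matmul_blocked` and the default `tile` value are hypothetical and chosen only for illustration. A production library selects tile sizes per device and combines blocking with vectorization, unrolling, and other techniques not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: blocked (tiled) C = A * B for row-major flat arrays.
// Work is split into tile x tile sub-blocks so that the pieces of A, B, and C
// touched by the inner loops are small enough to stay in cache while reused.
// `tile` is an illustrative tuning parameter, not a recommended value.
std::vector<float> matmul_blocked(const std::vector<float> &A,
                                  const std::vector<float> &B,
                                  std::size_t M, std::size_t N, std::size_t K,
                                  std::size_t tile = 64) {
    assert(tile > 0 && A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m0 = 0; m0 < M; m0 += tile) {
        const std::size_t m1 = std::min(m0 + tile, M);  // clamp edge tiles
        for (std::size_t k0 = 0; k0 < K; k0 += tile) {
            const std::size_t k1 = std::min(k0 + tile, K);
            for (std::size_t n0 = 0; n0 < N; n0 += tile) {
                const std::size_t n1 = std::min(n0 + tile, N);
                // Multiply one tile of A by one tile of B into a tile of C.
                for (std::size_t m = m0; m < m1; m++) {
                    for (std::size_t k = k0; k < k1; k++) {
                        const float a = A[m * K + k];
                        for (std::size_t n = n0; n < n1; n++) {
                            C[m * N + n] += a * B[k * N + n];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

The `std::min` clamps let the routine handle dimensions that are not multiples of `tile`, which is one small part of what it takes to perform well across differently shaped matrices.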
oneMKL provides an optimized general matrix multiplication function (oneapi::mkl::blas::gemm) that gives high performance on the host processor or a variety of accelerator devices. The matrices are allocated in USM as before, except that gemm expects each matrix as a single contiguous array rather than an array of row pointers. They are passed to the gemm function along with the device queue, matrix dimensions, and various other options:
    // Offload matrix multiplication
    float alpha = 1.0, beta = 0.0;
    oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
    oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;
    sycl::event gemm_done;
    std::vector<sycl::event> gemm_dependencies;
    gemm_done = oneapi::mkl::blas::gemm(Q, transA, transB, M, N, K,
                                        alpha, A, M, B, K, beta, C, M,
                                        gemm_dependencies);
    gemm_done.wait();
As expected, the library function gives better performance and is more versatile than the naive implementations. For example, the library function can transpose one or both matrices before multiplication, if necessary.
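To clarify what the gemm arguments mean, the following plain C++ sketch spells out the operation a BLAS-style gemm computes, C = alpha * op(A) * op(B) + beta * C, where op(X) is X or its transpose. It assumes column-major storage, which is consistent with the leading dimensions (M, K, M) in the call above. The routine `gemm_reference` is a hypothetical, unoptimized stand-in for illustration only. It is not how oneMKL implements gemm, and it uses `bool` flags in place of `oneapi::mkl::transpose`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (NOT the oneMKL implementation):
// C = alpha * op(A) * op(B) + beta * C with column-major storage, so
// element (r, c) of a matrix X with leading dimension ldx is X[r + c * ldx].
// op(A) is M x K, op(B) is K x N, and C is M x N.
void gemm_reference(bool transA, bool transB,
                    std::size_t M, std::size_t N, std::size_t K,
                    float alpha, const std::vector<float> &A, std::size_t lda,
                    const std::vector<float> &B, std::size_t ldb,
                    float beta, std::vector<float> &C, std::size_t ldc) {
    assert(ldc >= M && C.size() >= ldc * N);
    for (std::size_t n = 0; n < N; n++) {
        for (std::size_t m = 0; m < M; m++) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; k++) {
                // When transposed, the stored matrix is indexed with the
                // row and column roles swapped.
                const float a = transA ? A[k + m * lda] : A[m + k * lda];
                const float b = transB ? B[n + k * ldb] : B[k + n * ldb];
                sum += a * b;
            }
            float &c = C[m + n * ldc];
            // With beta == 0 the previous contents of C are ignored.
            c = alpha * sum + (beta == 0.0f ? 0.0f : beta * c);
        }
    }
}
```

With `alpha = 1` and `beta = 0`, as in the call above, this reduces to a plain matrix product. Setting a transpose flag lets a matrix that is stored transposed be used without first making a transposed copy.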
Matrix A Dimensions (Time in Seconds)

Implementation    4000 x 4000    8000 x 2000    2000 x 8000    Processor
Naive DPC++       19.2           38.1           9.8            Gen9
oneMKL gemm       0.9            1.3            0.8            Gen9

This simple example illustrates the separation of concerns between application developers and tuning experts. The former should rely on the latter to encapsulate common computations in highly optimized libraries. The oneAPI specification defines many libraries to help create accelerated applications, for example:
  • oneMKL for math operations
  • oneDAL for data analytics and machine learning
  • oneDNN for the development of deep learning frameworks
  • oneVPL for video processing
Check whether your required operation is already available in a oneAPI library before implementing it from scratch.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.