Programming Guide

oneMKL Code Sample

To demonstrate a typical workflow for oneMKL with DPC++ interfaces, the following example performs a double-precision matrix-matrix multiplication on a GPU device.
The code example requires additional user code to compile and run, as indicated by the inline comments.
// Standard SYCL header
#include <CL/sycl.hpp>

// STL classes
#include <exception>
#include <iostream>

// Declarations for Intel oneAPI Math Kernel Library DPC++ APIs
#include "oneapi/mkl.hpp"

int main(int argc, char *argv[]) {
    //
    // User obtains data here for A, B, C matrices, along with setting m, n, k, ldA, ldB, ldC.
    //
    // For this example, A, B and C should be initially stored in a std::vector,
    // or a similar container having data() and size() member functions.
    //

    // Create GPU device
    sycl::device my_device;
    try {
        my_device = sycl::device(sycl::gpu_selector());
    } catch (...) {
        std::cout << "Warning: GPU device not found! Using default device instead." << std::endl;
    }

    // Create asynchronous exceptions handler to be attached to queue.
    // Not required; can provide helpful information in case the system isn't correctly configured.
    auto my_exception_handler = [](sycl::exception_list exceptions) {
        for (std::exception_ptr const& e : exceptions) {
            try {
                std::rethrow_exception(e);
            } catch (sycl::exception const& e) {
                std::cout << "Caught asynchronous SYCL exception:\n" << e.what() << std::endl;
            } catch (std::exception const& e) {
                std::cout << "Caught asynchronous STL exception:\n" << e.what() << std::endl;
            }
        }
    };

    // Create execution queue on the GPU device with the exception handler attached
    sycl::queue my_queue(my_device, my_exception_handler);

    // Create SYCL buffers of matrix data for offloading between device and host
    sycl::buffer<double, 1> A_buffer(A.data(), A.size());
    sycl::buffer<double, 1> B_buffer(B.data(), B.size());
    sycl::buffer<double, 1> C_buffer(C.data(), C.size());

    // Add oneapi::mkl::blas::gemm to the execution queue and catch any synchronous exceptions
    try {
        using oneapi::mkl::blas::gemm;
        using oneapi::mkl::transpose;
        gemm(my_queue, transpose::nontrans, transpose::nontrans, m, n, k,
             alpha, A_buffer, ldA, B_buffer, ldB, beta, C_buffer, ldC);
    } catch (sycl::exception const& e) {
        std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n" << e.what() << std::endl;
    } catch (std::exception const& e) {
        std::cout << "\t\tCaught synchronous STL exception during GEMM:\n" << e.what() << std::endl;
    }

    // Ensure any asynchronous exceptions caught are handled before proceeding
    my_queue.wait_and_throw();

    //
    // Post-process results
    //

    // Access data from C buffer and print out part of C matrix
    auto C_accessor = C_buffer.template get_access<sycl::access::mode::read>();
    std::cout << "\t" << "C = [ " << C_accessor[0] << ", " << C_accessor[1] << ", ... ]\n";
    std::cout << "\t    [ " << C_accessor[1 * ldC + 0] << ", " << C_accessor[1 * ldC + 1] << ", ... ]\n";
    std::cout << "\t    [ " << "... ]\n";
    std::cout << std::endl;

    return 0;
}
Consider double-precision matrices A (of size m-by-k), B (of size k-by-n), and C (of size m-by-n) stored in arrays on the host machine with leading dimensions ldA, ldB, and ldC, respectively. Given double-precision scalars alpha and beta, compute the matrix-matrix multiplication (
mkl::blas::gemm
):
C = alpha * A * B + beta * C
Include the standard SYCL headers and the oneMKL DPC++ specific header that declares the desired mkl::blas::gemm API:
// Standard SYCL header
#include <CL/sycl.hpp>

// STL classes
#include <exception>
#include <iostream>

// Declarations for Intel oneAPI Math Kernel Library DPC++ APIs
#include "oneapi/mkl.hpp"
Next, load or instantiate the matrix data on the host machine as usual, then create the GPU device, create an asynchronous exception handler, and finally create the queue on the device with that exception handler attached. Exceptions that occur on the host can be caught using standard C++ exception handling mechanisms; exceptions that occur on a device are considered asynchronous errors and are stored in an exception list to be processed later by this user-provided exception handler.
// Create GPU device
sycl::device my_device;
try {
    my_device = sycl::device(sycl::gpu_selector());
} catch (...) {
    std::cout << "Warning: GPU device not found! Using default device instead." << std::endl;
}

// Create asynchronous exceptions handler to be attached to queue.
// Not required; can provide helpful information in case the system isn't correctly configured.
auto my_exception_handler = [](sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
        try {
            std::rethrow_exception(e);
        } catch (sycl::exception const& e) {
            std::cout << "Caught asynchronous SYCL exception:\n" << e.what() << std::endl;
        } catch (std::exception const& e) {
            std::cout << "Caught asynchronous STL exception:\n" << e.what() << std::endl;
        }
    }
};
The matrix data is now loaded into DPC++ buffers, which enables offloading to the desired device and transfer back to the host on completion. Finally, the mkl::blas::gemm API is called with all the buffers, sizes, and transpose operations, which enqueues the matrix-multiply kernel and data onto the desired queue.
// Create execution queue on the GPU device with the exception handler attached
sycl::queue my_queue(my_device, my_exception_handler);

// Create SYCL buffers of matrix data for offloading between device and host
sycl::buffer<double, 1> A_buffer(A.data(), A.size());
sycl::buffer<double, 1> B_buffer(B.data(), B.size());
sycl::buffer<double, 1> C_buffer(C.data(), C.size());

// Add oneapi::mkl::blas::gemm to the execution queue and catch any synchronous exceptions
try {
    using oneapi::mkl::blas::gemm;
    using oneapi::mkl::transpose;
    gemm(my_queue, transpose::nontrans, transpose::nontrans, m, n, k,
         alpha, A_buffer, ldA, B_buffer, ldB, beta, C_buffer, ldC);
} catch (sycl::exception const& e) {
    std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n" << e.what() << std::endl;
} catch (std::exception const& e) {
    std::cout << "\t\tCaught synchronous STL exception during GEMM:\n" << e.what() << std::endl;
}
At some time after the gemm kernel has been enqueued, it is executed. The queue is asked to wait for all kernels to finish and then pass any caught asynchronous exceptions to the exception handler to be thrown. The runtime handles transfer of the buffer's data between host and GPU device and back. By the time an accessor is created for C_buffer, the buffer data will have been silently transferred back to the host machine if necessary. In this case, the accessor is used to print out a 2x2 submatrix of C_buffer.
// Access data from C buffer and print out part of C matrix
auto C_accessor = C_buffer.template get_access<sycl::access::mode::read>();
std::cout << "\t" << "C = [ " << C_accessor[0] << ", " << C_accessor[1] << ", ... ]\n";
std::cout << "\t    [ " << C_accessor[1 * ldC + 0] << ", " << C_accessor[1 * ldC + 1] << ", ... ]\n";
std::cout << "\t    [ " << "... ]\n";
std::cout << std::endl;

return 0;
Note that the resulting data is still in the C_buffer object and, unless it is explicitly copied elsewhere (such as back to the original C container), it remains available only through accessors until C_buffer goes out of scope.
