Developer Guide

Using multiple heterogeneous devices

Accelerators typically reside in a server that itself has a significant amount of compute resources. For instance, a typical server can have up to eight sockets, with each socket containing over 50 cores. DPC++ provides the ability to treat the CPUs and the accelerators uniformly to distribute work among them. It is the responsibility of the programmer to ensure a balanced distribution of work among the heterogeneous compute resources in the platform.

Overlapping Compute on Various Accelerators in the Platform

DPC++ provides access to different kinds of devices through the device selector abstraction. Queues can be created for each of the devices, and kernels can be submitted to them for execution. All kernel submissions in DPC++ are non-blocking: once a kernel is submitted to a queue for execution, the host does not wait for it to finish unless waiting on the queue is explicitly requested. This allows the host to do some work itself, or to initiate work on other devices, while the kernel is executing on the accelerator.
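The following minimal sketch (not part of the original vector add sample) illustrates this non-blocking behavior; the do_host_work function is a placeholder for arbitrary host-side code, and the selector syntax may differ slightly between DPC++ releases.

#include <sycl/sycl.hpp>

inline void do_host_work() { /* placeholder for arbitrary host-side work */ }

void overlap_example() {
  sycl::queue q{sycl::gpu_selector_v};  // or sycl::gpu_selector{} on older DPC++
  // submit() returns immediately; the kernel runs asynchronously on the device.
  q.submit([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) { /* device work */ });
  });
  do_host_work();  // host work overlaps with the kernel execution
  q.wait();        // block only when the device results are actually needed
}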
The host CPU can itself be treated as an accelerator, and DPC++ can submit kernels to it for execution. This is completely independent of, and orthogonal to, the work done by the host to orchestrate kernel creation and submission. The underlying operating system manages the kernels submitted to the CPU accelerator as another process and uses the same OpenCL/Level Zero runtime mechanisms to exchange information with the host device.
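For example, a queue bound to the host CPU can be created with the CPU device selector. The snippet below is a minimal sketch; the selector syntax differs slightly between SYCL 2020 and earlier DPC++ releases.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // The CPU is exposed as an ordinary SYCL device; kernels submitted to this
  // queue are scheduled by the runtime just like kernels for any accelerator.
  sycl::queue cpu_q{sycl::cpu_selector_v};  // sycl::cpu_selector{} on older DPC++
  std::cout << "Running on: "
            << cpu_q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}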
The following example shows a simple vector add operation that runs on a single GPU device.
int VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    auto e = q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::noinit);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add1 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd1
In the following example, the input vector is split into two parts, and the computation is done on two different accelerators (one a CPU and the other a GPU) that can execute concurrently. Care must be taken to ensure that the kernels, in addition to being submitted, are actually launched on the devices to get this parallelism. The actual time at which a kernel is launched can be substantially later than when it was submitted by the host. The implementation decides when to launch the kernels based on heuristics intended to maximize a metric such as utilization or throughput, or to minimize latency. For instance, in the case of the OpenCL backend, on certain platforms one needs to explicitly issue a clFlush on the queue (shown commented out near the end of VectorAdd2 below) to launch the kernels on the accelerators.
int VectorAdd2(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
               const IntArray &b, IntArray &sum, int iter) {
  sycl::range num_items{a.size() / 2};

  auto start = std::chrono::steady_clock::now();
  {
    sycl::buffer a1_buf(a.data(), num_items);
    sycl::buffer b1_buf(b.data(), num_items);
    sycl::buffer sum1_buf(sum.data(), num_items);

    sycl::buffer a2_buf(a.data() + a.size() / 2, num_items);
    sycl::buffer b2_buf(b.data() + a.size() / 2, num_items);
    sycl::buffer sum2_buf(sum.data() + a.size() / 2, num_items);

    for (int i = 0; i < iter; i++) {
      q1.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::noinit);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      // do the work on host
      q2.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::noinit);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
    // On some platforms this explicit flush of queues is needed
    // to ensure the overlap in execution between the CPU and GPU
    // cl_command_queue cq = q1.get();
    // clFlush(cq);
    // cq = q2.get();
    // clFlush(cq);
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add2 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}  // end VectorAdd2
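A possible driver for the two functions above is sketched below. The IntArray definition, the array size, the iteration count, and the InitializeArray helper are illustrative assumptions; they are not part of the excerpted sample.

#include <sycl/sycl.hpp>
#include <array>

constexpr size_t array_size = 256 * 1024;      // assumed size
using IntArray = std::array<int, array_size>;  // assumed definition of IntArray

static IntArray a, b, sum;  // static keeps the large arrays off the stack

void InitializeArray(IntArray &arr) {
  for (size_t i = 0; i < arr.size(); i++) arr[i] = static_cast<int>(i);
}

int main() {
  sycl::queue q_gpu{sycl::gpu_selector_v};  // GPU queue
  sycl::queue q_cpu{sycl::cpu_selector_v};  // host CPU treated as an accelerator
  InitializeArray(a);
  InitializeArray(b);

  VectorAdd1(q_gpu, a, b, sum, 16);          // single-device version
  VectorAdd2(q_gpu, q_cpu, a, b, sum, 16);   // work split across GPU and CPU
  return 0;
}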
Comparing the running times of the two versions shows that the application runs almost twice as fast, since more hardware resources are dedicated to solving the problem. To achieve good balance, however, the work should be split in proportion to the capability of each accelerator rather than evenly, as was done in the example above.
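One way to express such a split is sketched below: a hypothetical gpu_fraction parameter determines how many elements are assigned to the first (GPU) queue, with the remainder going to the second (CPU) queue. The function name and parameter are illustrative assumptions, not part of the original sample.

// Hypothetical variant of VectorAdd2 that splits the work according to a
// caller-supplied fraction instead of evenly. gpu_fraction should be
// strictly between 0 and 1 so that both buffers are non-empty.
int VectorAdd3(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
               const IntArray &b, IntArray &sum, int iter, float gpu_fraction) {
  size_t gpu_count = static_cast<size_t>(a.size() * gpu_fraction);
  size_t cpu_count = a.size() - gpu_count;
  sycl::range gpu_items{gpu_count};
  sycl::range cpu_items{cpu_count};

  auto start = std::chrono::steady_clock::now();
  {
    sycl::buffer a1_buf(a.data(), gpu_items);
    sycl::buffer b1_buf(b.data(), gpu_items);
    sycl::buffer sum1_buf(sum.data(), gpu_items);

    sycl::buffer a2_buf(a.data() + gpu_count, cpu_items);
    sycl::buffer b2_buf(b.data() + gpu_count, cpu_items);
    sycl::buffer sum2_buf(sum.data() + gpu_count, cpu_items);

    for (int i = 0; i < iter; i++) {
      q1.submit([&](auto &h) {
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::noinit);
        h.parallel_for(gpu_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      q2.submit([&](auto &h) {
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::noinit);
        h.parallel_for(cpu_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  return ((end - start).count());
}  // end VectorAdd3

For example, VectorAdd3(q_gpu, q_cpu, a, b, sum, 16, 0.75f) would assign three quarters of the elements to the GPU and the rest to the CPU; the right fraction for a given platform is best found by measuring both devices.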
