DPC++ Foundations Code Sample Walk-Through

By Dylan Benito

Last Updated: 10/23/2020

A Code Walk-Through for Data Parallel C++ (DPC++) Foundations

This sample walk-through features vector_add, a program that demonstrates oneAPI concepts and functionality using the Data Parallel C++ (DPC++) programming language. The program uses hardware acceleration to add two arrays of integers. In this walk-through, you will learn about:

  • DPC++ headers
  • Asynchronous exceptions from kernels
  • Device selectors for different accelerators
  • Buffers and accessors
  • Queues
  • parallel_for kernel

Download the vector_add source from GitHub.

DPC++ (SYCL*) Headers

DPC++ is based on familiar and industry-standard C++; it incorporates the SYCL* 1.2.1 specification from the Khronos Group* and includes language extensions developed using an open community process. The header file sycl.hpp, as defined by the SYCL specification, is provided with the Intel® oneAPI DPC++/C++ Compiler. FPGA support is provided through a DPC++ extension in the fpga_extensions.hpp header file.

The code snippet below, from vector_add, illustrates the different headers needed when you are supporting different accelerators.

//For CPU or GPU

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

//For FPGA

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif
using namespace sycl;

Catching async Exceptions from DPC++ Kernels

DPC++ kernels run asynchronously on accelerators, in separate stack frames from the host code. Errors that occur in a kernel therefore cannot be propagated up the host call stack. To catch these asynchronous exceptions, the SYCL queue class accepts an asynchronous error handler function. The following code in the vector_add sample shows how:

// Create an exception handler that catches asynchronous exceptions.

static auto exception_handler = [](cl::sycl::exception_list eList) {
	for (std::exception_ptr const &e : eList) {
		try {
			std::rethrow_exception(e);
		}
		catch (std::exception const &e) {
#if _DEBUG
			std::cout << "Failure" << std::endl;
#endif
			std::terminate();
		}
	}
};
… … 
try {
    queue q(d_selector, exception_handler);
    … … 
} catch (exception const &e) {
    … … 
}
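
The handler registered on the queue is not invoked immediately; asynchronous exceptions are delivered when the queue is synchronized. The following minimal sketch (not part of the vector_add sample) shows one such synchronization point:

// A minimal sketch: after work is submitted to q, calling wait_and_throw()
// synchronizes the queue and passes any pending asynchronous exceptions to the
// exception_handler registered above.
q.submit([&](handler &h) {
    h.single_task([=]() { /* device work */ });
});
q.wait_and_throw();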

Using a Default Selector for Accelerators

Selecting an accelerator for offload kernels is straightforward. SYCL and oneAPI provide selectors that discover and provide access to the hardware available in your environment. The default_selector enumerates all the available accelerators and selects the most performant one among them.
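
As a quick check of which device the default selector picked, you can query the device name from the queue. The following is a minimal sketch for illustration only; it is not part of the vector_add sample:

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  // Construct a queue with the default selector and report the chosen device.
  default_selector d_selector;
  queue q(d_selector);
  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}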

DPC++ provides additional selector classes for FPGA accelerators: the fpga_selector and fpga_emulator_selector classes, found in fpga_extensions.hpp. An example is shown in the following code snippet from the vector_add sample:

#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif
… … 
#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  INTEL::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  INTEL::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

Data, Buffers, and Accessors

DPC++ uses kernels that run on accelerators to process large amounts of data or computation. Data declared on the host is wrapped in a buffer and transferred to the accelerators implicitly by the DPC++ runtime. The accelerators read from or write to the buffer through an accessor. The runtime also derives kernel dependencies from the accessors used, then dispatches and runs the kernels in the most efficient order, as illustrated in the sketch after the code snippet below.

  • a_array, b_array, and sum_parallel are array objects from the host
  • a_buf, b_buf, and sum_buf are buffer wrappers
  • a and b are read-only accessors; sum is a write-only accessor

  buffer a_buf(a_array);
  buffer b_buf(b_array);
  buffer sum_buf(sum_parallel.data(), num_items);
… … 
  q.submit([&](handler &h) {

// Create an accessor for each buffer with access permission: read, write, or
// read/write. The accessor is used to access the memory in the buffer.
    
    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

// The sum accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);
… … 
  });
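
Because dependencies are derived from accessors, two submissions that touch the same buffer are ordered automatically. The following sketch is illustrative only and is not part of vector_add; it reuses the sum_buf and num_items names from the code above:

  // The first kernel writes sum_buf; the second reads and updates it. The
  // runtime sees both accessors on sum_buf and runs the kernels in order.
  q.submit([&](handler &h) {
    accessor sum(sum_buf, h, write_only);
    h.parallel_for(num_items, [=](id<1> i) { sum[i] = 0; });
  });

  q.submit([&](handler &h) {
    accessor sum(sum_buf, h, read_write);
    h.parallel_for(num_items, [=](id<1> i) { sum[i] += 1; });
  });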

Queue and parallel_for Kernels

A DPC++ queue encapsulates all the context and state needed for kernel execution. When no parameters are passed, a queue is created and associated with the accelerator chosen by the default selector. A queue can also take a specific device selector and an asynchronous exception handler, as is done in vector_add.
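
For example, both of the following constructions are valid. This is an illustrative sketch, not code from the sample; q1 and q2 are hypothetical names, and d_selector and exception_handler are assumed to be defined as shown earlier:

queue q1;                                // default selector picks the device
queue q2(d_selector, exception_handler); // explicit selector plus async handler, as in vector_add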

Kernels are enqueued to the queue and executed. There are different types of kernels: single task kernel, basic data-parallel kernel, hierarchical parallel kernel, etc. The basic data-parallel parallel_for kernel is used in vector_add.

try {
    queue q(d_selector, exception_handler);
    … … 
    q.submit([&](handler &h) {
    … … 
        h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  });  
} catch (exception const &e) {
    … … 
}

The kernel body, captured in the lambda function, adds the two arrays element by element:

sum[i] = a[i] + b[i]; 

The range of data the kernel processes is specified by the first parameter of h.parallel_for, num_items; in this example it is a 1-D range of size num_items. The two read-only arrays, a_array and b_array, are transferred to the accelerator by the runtime. After the kernel completes, the results are copied back to the host when sum_buf goes out of scope.
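
Putting the pieces together, the following self-contained sketch mirrors the vector_add flow. It is a minimal illustration rather than the sample itself; the array size, initialization values, and printed output are assumptions made for brevity:

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

int main() {
  constexpr size_t num_items = 8;  // illustrative size, not from the sample
  std::array<int, num_items> a_array, b_array, sum_parallel;
  for (size_t i = 0; i < num_items; i++) {
    a_array[i] = static_cast<int>(i);
    b_array[i] = static_cast<int>(i);
    sum_parallel[i] = 0;
  }

  try {
    queue q(default_selector{});
    {
      // Buffers wrap the host data for the duration of this scope.
      buffer a_buf(a_array);
      buffer b_buf(b_array);
      buffer sum_buf(sum_parallel.data(), range<1>(num_items));

      q.submit([&](handler &h) {
        accessor a(a_buf, h, read_only);
        accessor b(b_buf, h, read_only);
        accessor sum(sum_buf, h, write_only);
        h.parallel_for(range<1>(num_items),
                       [=](id<1> i) { sum[i] = a[i] + b[i]; });
      });
    }  // sum_buf goes out of scope: results are copied back to sum_parallel
  } catch (exception const &e) {
    std::cout << "SYCL exception caught: " << e.what() << "\n";
    return 1;
  }

  for (size_t i = 0; i < num_items; i++)
    std::cout << a_array[i] << " + " << b_array[i]
              << " = " << sum_parallel[i] << "\n";
  return 0;
}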

Summary

Device selectors, buffers, accessors, queues, and kernels are the building blocks of oneAPI programming. DPC++ is an open, standards-based evolution of ISO C++ that incorporates SYCL and community extensions to simplify data-parallel programming. DPC++ allows code reuse across hardware targets and enables high productivity and performance across CPU, GPU, and FPGA architectures, while permitting accelerator-specific tuning.
