Code Sample: Vector Add – An Intel® oneAPI DPC++/C++ Compiler Example

By Dylan Benito

Last Updated: 10/23/2020

File(s): GitHub
License: MIT
Optimized for:
  Software (Programming Language, Tool, IDE, Framework): Intel® oneAPI DPC++/C++ Compiler
Prerequisites: Familiarity with C++ and an interest in DPC++

 

This tutorial describes parallel implementations for adding two vectors using Data Parallel C++ (DPC++). The two code samples (vector-add-buffers.cpp and vector-add-usm.cpp), which demonstrate two different memory management techniques, are available on GitHub. They show the basic elements of the DPC++ programming language, and you can use them as a starting point for developing more complex applications.

This tutorial and its code examples use the Intel® oneAPI DPC++/C++ Compiler and assume the compiler and its environment have been set up and configured correctly. Building and running these samples verifies that your development environment is ready to use the core features of DPC++.

Introduction

Writing a DPC++ application involves a few basic tasks: specifying the device to offload to (where the kernels execute) and managing the interaction and data propagation between the device and the host. In these samples, you will learn how to write the kernel and how to use two memory management techniques: buffers and Unified Shared Memory (USM).

The samples use two .cpp files to demonstrate the options independently: vector-add-buffers.cpp and vector-add-usm.cpp:

  • vector-add-buffers.cpp uses buffers, along with accessors, to perform memory copies to and from the device.
    Buffers provide data mapping between the host and the accelerator. 1-, 2-, or 3-dimensional arrays are placed into buffers, and work is submitted to a queue. The queue provides work scheduling, orchestration, and high-level parallel operations. Work is submitted to the queue using a lambda that encapsulates the work kernel and the data needed for its execution.
    Buffers are initialized on the host and accessed by the lambda. The lambda requests read access for the input vectors and write access for the output vector.
  • vector-add-usm.cpp uses USM.
    USM is a DPC++ tool for data management. It is a pointer-based approach that is familiar to C/C++ programmers who use malloc or new to allocate memory.
    USM requires hardware support for a unified virtual address space (this allows for consistent pointer values between the host and the device). All memory is allocated by the host. USM offers three distinct allocation types (see the sketch after this list):
    • Shared: Located either on the host or on the device (migration is managed by the DPC++ runtime), accessible by the host or the device.
    • Device: Located on the device, accessible only by the device.
    • Host: Located on the host, accessible by the host or the device.
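
A minimal sketch of the three allocation calls is shown below. It is illustrative only; it assumes an existing queue q, a count array_size, and the same headers and using directives as the samples:

  // Hypothetical sketch (not sample code): the three USM allocation types.
  int *shared_ptr = malloc_shared<int>(array_size, q);  // accessible by host and device
  int *device_ptr = malloc_device<int>(array_size, q);  // accessible only by the device
  int *host_ptr   = malloc_host<int>(array_size, q);    // host memory the device can reach

  // ... use the pointers on the host and in kernels ...

  free(shared_ptr, q);
  free(device_ptr, q);
  free(host_ptr, q);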

The following diagram illustrates the difference between the two:

Left: both the host and the device (for example, a discrete GPU) may have their own physical memories. Right: the logical view of USM, which provides a unified address space across the host and the device, even when their memories are physically separate.

Problem Statement

You can compute a third vector from two vectors by adding their corresponding elements. This is a simple but fundamental piece of computation that appears in many linear algebra algorithms and in applications across a wide range of areas. Using this problem, the tutorial demonstrates two parallel implementations based on DPC++ buffers and USM. It also provides a sequential implementation to verify that the result of the offloaded computation is correct.
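
For reference, the sequential version of this computation is just an element-wise loop. The following sketch mirrors the vector names used in the samples but is not taken verbatim from them:

#include <vector>

// Sequential vector add on the host: sum[i] = a[i] + b[i] for every element.
void VectorAddSequential(const std::vector<int> &a, const std::vector<int> &b,
                         std::vector<int> &sum) {
  for (size_t i = 0; i < a.size(); i++) sum[i] = a[i] + b[i];
}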

Parallel Implementation and Sample Code Walkthrough

Sample 1: Buffers

The basic DPC++ implementation explained in this code sample includes a device selector, a queue, buffers, accessors, a kernel, and a command group.

Create Device Selector

Using DPC++, you can offload computation from the CPU to a device. The first step is to select a device (this sample can target an FPGA, the FPGA emulator, or the default device). Based on the availability of devices and their intended use cases, create a device selector object:

  • Default selector: default_selector
  • For FPGA: INTEL::fpga_selector
  • For FPGA emulator: INTEL::fpga_emulator_selector

This sample uses preprocessor definitions to choose between device selectors; you can pick the FPGA, the FPGA emulator, or the default device. Provide the appropriate definition when compiling the sample to create the intended device selector object; the makefile illustrates how to specify such a definition. If no device selector definition is specified at compile time, the default device selector object is created. The default selector has the advantage of choosing the most performant device among those available at runtime. If you do not intend to use an FPGA or the FPGA emulator, you can omit the definition and the default device selector object is created.

int main() {
  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  INTEL::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  INTEL::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

Create a Command Queue

The next step is to create a command queue. Any computation you want to offload to a device is queued/submitted to the command queue. This sample instantiates a command queue by passing the following arguments to its constructor:

  • A device selector
  • An exception handler

The first argument, the device selector, is what you created in the Create Device Selector section above. The second argument, the exception handler, is needed to handle any asynchronous exceptions that the offloaded computation may encounter at execution time. An exception_handler has been implemented to handle async exceptions:

 queue q(d_selector, exception_handler);

The exception_handler handles exceptions by invoking std::terminate() to terminate the process. On debug builds, the handler first prints Failure as a message, and then terminates the process.

If you want to handle exceptions differently, for example without terminating the process, you can provide your own exception handler.
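
A custom handler is any callable that accepts a sycl::exception_list. The following is a minimal sketch of such a handler (it is not the handler shipped with the sample, and it assumes the samples' headers and using directives); it reports each asynchronous exception instead of terminating:

  // Hypothetical handler: print every asynchronous exception and keep running.
  auto report_async_exceptions = [](exception_list e_list) {
    for (std::exception_ptr const &e : e_list) {
      try {
        std::rethrow_exception(e);
      } catch (exception const &ex) {
        std::cout << "Asynchronous SYCL exception: " << ex.what() << "\n";
      }
    }
  };

  queue q(d_selector, report_async_exceptions);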

Create and Initialize Vectors

For a vector_add operation, you need two vectors as inputs and a third vector to store the result. This sample creates two result vectors: one holds the output of the sequential computation performed on the host, and the other holds the output of the parallel computation performed on the device.

Two input vectors are initialized with values: 0, 1, 2, …
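
One possible initialization, consistent with the description above (a sketch only; the sample's own code may differ in names and details):

  // Two input vectors plus two result vectors (sequential and parallel).
  std::vector<int> a_array(num_items), b_array(num_items);
  std::vector<int> sum_sequential(num_items), sum_parallel(num_items);

  // Initialize the inputs with the values 0, 1, 2, ...
  for (size_t i = 0; i < num_items; i++) {
    a_array[i] = static_cast<int>(i);
    b_array[i] = static_cast<int>(i);
  }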

Allocate Device Visible Memory

To perform compute on the device, you need to make the input vectors visible to the device. You also need to copy back the computed result from the device to the host.

Along with offloading the compute to the device, this sample demonstrates how to make the input visible to the device and how to copy the computed result back from the device. There are two major options to achieve this goal:

  • Buffers and accessors
  • USM

Buffers and Accessors

This code sample uses buffers and accessors. Below is a high-level summary of this technique and code snippets.

Create Buffers

This implementation demonstrates two ways to create buffers: either pass a reference to a C++ container to the buffer constructor, or pass a pointer and a size. The first option is useful when your data is in C++/STL containers. The second option is useful when the data is in regular C/C++ arrays or accessed through a raw pointer, for example:

  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible for copying the data back to the host
  // when the buffer goes out of scope.

  buffer a_buf(a_array);
  buffer b_buf(b_array);
  buffer sum_buf(sum_parallel.data(), num_items);

Create Accessors

You can create accessors from the respective buffer objects by specifying a data access mode. With vector_add, for example, the device reads the first two vectors as inputs and writes the third vector as output. For an input vector/buffer, for example a_buf, specify access::mode::read as the access mode to obtain the appropriate accessor. For the output vector/buffer, for example sum_buf, specify access::mode::write as the access mode.

Specifying the appropriate access mode is required for correctness and helps performance. Write access for the output vector, for example, tells the DPC++ runtime that the result computed on the device must be copied back to the host at the end of the computation; this is required for correctness. At the same time, the runtime avoids copying the contents of the result vector from the host to the device before the computation runs, which improves performance by removing an extra host-to-device copy. This is correct because the computation writes/overwrites sum_buf.

 // Create an accessor for each buffer with access permission: read, write or
 // read/write. The accessor is a means to access the memory in the buffer.

    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

 // The sum_accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);

Command Group Handler

The command group handler object is constructed by the DPC++ runtime. All accessors defined in a command group take the command group handler as an argument; that way, the runtime keeps track of the data dependencies. The kernel invocation functions, for example parallel_for, are member functions of the command group handler class. A command group handler object cannot be copied or moved.
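
Because each accessor records both its buffer and the handler it was created with, the runtime can order command groups automatically. The following sketch (illustrative only, not part of the sample) shows two submissions where the second depends on the first through sum_buf:

  // The first command group writes sum_buf; the second reads and updates it.
  // The runtime detects the accessor-based dependency and runs them in order.
  q.submit([&](handler &h) {
    accessor sum(sum_buf, h, write_only);
    h.parallel_for(num_items, [=](auto i) { sum[i] = 0; });
  });

  q.submit([&](handler &h) {
    accessor sum(sum_buf, h, read_write);
    h.parallel_for(num_items, [=](auto i) { sum[i] += 1; });
  });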

parallel_for

parallel_for is a commonly used DPC++ programming construct. While the iterations of a C++ for loop run sequentially, multiple logical iterations of a parallel_for can run simultaneously on the device's execution/compute units. As a result, the overall computation runs faster when parallel_for is used. It is suitable for data-parallel compute where each logical iteration typically executes the same code but operates on a different piece of data (also known as single instruction, multiple data: SIMD). parallel_for works best when there is no data dependency between logical iterations.

In this sample, you can use parallel_for because the corresponding elements from the two input vectors can be added independently and in parallel to compute the individual elements of the result vector, with no dependency on other elements of the input or output vectors. parallel_for is used to offload the compute onto the device. The first argument is the number of work items; for vector_add, this is simply the number of elements of the vector. The second argument is the kernel, a lambda function encapsulating the compute for each work item. Each work item is responsible for computing the sum of two elements from the input vectors and writing the result to the corresponding element of the output vector.

The offloaded work (for example, parallel_for) continues asynchronously on the device. The last statement of the following code snippet waits for this asynchronous operation to complete.

    // Use parallel_for to run vector addition in parallel on device. This
    // executes the kernel.
    //    1st parameter is the number of work items.
    //    2nd parameter is the kernel, a lambda that specifies what to do per
    //    work item. The parameter of the lambda is the work item id.
    // DPC++ supports unnamed lambda kernel by default.

    h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  });
}

Kernel for Vector Add (Parallel Compute on the Device)

The compute is offloaded to the device by submitting a lambda function to the command queue. The device gets access to the two input vectors and the output vector through the accessors. The previous sections covered how buffers and accessors are set up and how parallel_for is used to compute the result vector.

When the result buffer (for example, sum_buf) goes out of scope, the DPC++ runtime copies the computed result from the device memory back to the host memory.

// Submit a command group to the queue by a lambda function that contains the
// data access permission and device computation (kernel).

q.submit([&](handler &h) {

  // Create an accessor for each buffer with access permission: read, write or
  // read/write. The accessor is a means to access the memory in the buffer.

    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

  // The sum_accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);

  // Use parallel_for to run vector addition in parallel on device. This
  // executes the kernel.
  // The 1st parameter is the number of work items.
  // The 2nd parameter is the kernel, a lambda that specifies what to do per
  // work item. The parameter of the lambda is the work item ID.
  // DPC++ supports unnamed lambda kernel by default.

    h.parallel_for(num_items, [=](id<1> i) { sum[i] = a[i] + b[i]; });
  });
}
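
The copy back therefore happens when sum_buf is destroyed. A common pattern is to limit the buffers' lifetime with an extra scope so that the host can safely read sum_parallel afterwards. The following is a sketch of that pattern using the same names as above, not a verbatim excerpt from the sample:

  {
    // Buffers live only inside this scope.
    buffer a_buf(a_array);
    buffer b_buf(b_array);
    buffer sum_buf(sum_parallel.data(), num_items);

    q.submit([&](handler &h) {
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor sum(sum_buf, h, write_only);
      h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
    });
  }  // sum_buf is destroyed here; the result is now visible in sum_parallel.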

Sample 2: USM

USM offers three different types of allocations: Device, host, and shared. This sample uses shared allocations.

Shared allocations are accessible on both the host and the device. They are similar to host allocations, but differ in that the data can migrate between host memory and device-local memory. This means that, after migration from host memory to device-local memory has completed, accesses on the device are served from device-local memory instead of remotely accessing host memory. The migration is handled by the DPC++ runtime and lower-level drivers.

Shared allocations use implicit data movement. With this type of allocation, you do not need to explicitly insert copy operations to move data between the host and the device. Instead, you access the data through pointers inside a kernel, and any required data movement is performed automatically. This simplifies porting existing code to DPC++: replace any malloc or new with the appropriate DPC++ USM allocation functions.

Shared allocation is supported through a software abstraction, similar to a SYCL* buffer or a device allocation; the difference is that the data is migrated implicitly rather than explicitly. With the necessary hardware support, page migration kicks in: the data migrates page by page between the host and the device. With page migration, computation overlaps with the incremental data movement instead of waiting for the whole transfer to complete. This implicit overlap can increase the throughput of the computation and is an advantage of using shared allocation. With device allocation and SYCL buffers, by contrast, the computation does not start until the data transfer is complete.
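
To make the contrast concrete, the following sketch compares explicit data movement with a device allocation against the implicit migration that malloc_shared provides. It is illustrative only and assumes an existing queue q, a count array_size, and a hypothetical host array a_host:

  // Device allocation: the host must copy data in and out explicitly.
  int *a_dev = malloc_device<int>(array_size, q);
  q.memcpy(a_dev, a_host, array_size * sizeof(int)).wait();  // host -> device
  // ... run kernels that dereference a_dev ...
  q.memcpy(a_host, a_dev, array_size * sizeof(int)).wait();  // device -> host
  free(a_dev, q);

  // Shared allocation: the runtime migrates the data on demand.
  int *a_shared = malloc_shared<int>(array_size, q);
  // ... both the host and kernels can dereference a_shared directly ...
  free(a_shared, q);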

Memory Allocation and Memory Free

Use malloc_shared() to allocate shared memory that is accessible to both the host and the device. Specify the data type as a template argument (int in this sample). The two function arguments are:

  • The number of elements in the vector, and
  • The queue (associated with the device that the compute is offloaded to)

Use free() to release the memory. free() in DPC++ is similar to its C/C++ counterpart, except that it takes an extra argument, the queue:

    // Create arrays with "array_size" to store input and output data. Allocate
    // unified shared memory so that both CPU and device can access them.

    int *a = malloc_shared<int>(array_size, q);
    int *b = malloc_shared<int>(array_size, q);
    int *sum_sequential = malloc_shared<int>(array_size, q);
    int *sum_parallel = malloc_shared<int>(array_size, q);

    if ((a == nullptr) || (b == nullptr) || (sum_sequential == nullptr) ||
        (sum_parallel == nullptr)) {
      if (a != nullptr) free(a, q);
      if (b != nullptr) free(b, q);
      if (sum_sequential != nullptr) free(sum_sequential, q);
      if (sum_parallel != nullptr) free(sum_parallel, q);

      std::cout << "Shared memory allocation failure.\n";
      return -1;
    }

Kernel Using USM

Like Sample 1, this sample (using USM) employs parallel_for to add the two vectors. e.wait() waits for the submitted work to complete, which ensures, for example, that the computed result has been migrated from device-local memory back to host memory before the host reads it:

// Create the range object for the arrays.

  range<1> num_items{size};

  // Use parallel_for to run vector addition in parallel on device. This
  // executes the kernel.
  //    1st parameter is the number of work items.
  //    2nd parameter is the kernel, a lambda that specifies what to do per
  //    work item. The parameter of the lambda is the work item id.
  // DPC++ supports unnamed lambda kernel by default.

  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });


  // q.parallel_for() is an asynchronous call. DPC++ runtime enqueues and runs
  // the kernel asynchronously. Wait for the asynchronous call to complete.

  e.wait();
}

Verification of Results

Once the computation is complete, you can compare the outputs from sequential and parallel computations to verify that the execution on the device computed the same result as on the host:

    // Compute the sum of the two arrays sequentially for validation.

    for (size_t i = 0; i < array_size; i++) sum_sequential[i] = a[i] + b[i];
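
The device result can then be checked against the sequential result element by element. A minimal sketch of such a check (the sample's actual verification may differ):

    // Verify that the device result matches the host result.
    for (size_t i = 0; i < array_size; i++) {
      if (sum_parallel[i] != sum_sequential[i]) {
        std::cout << "Vector add failed on device.\n";
        return -1;
      }
    }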

If the verification is successful, a few elements from input vectors and the result vector are printed to show the output of the computation:

    // Print out the result of vector add.

    for (int i = 0; i < indices_size; i++) {
      int j = indices[i];
      if (i == indices_size - 1) std::cout << "...\n";
      std::cout << "[" << j << "]: " << j << " + " << j << " = "
                << sum_sequential[j] << "\n";
    }

    free(a, q);
    free(b, q);
    free(sum_sequential, q);
    free(sum_parallel, q);
  } catch (exception const &e) {
    std::cout << "An exception is caught while adding two vectors.\n";
    std::terminate();
  }

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}

Summary

This tutorial demonstrated the basic features of DPC++ that are commonly used in a DPC++ program, including how to create a queue, how to manage memory (using buffers and accessors, or using USM), and how to write a kernel to offload compute to the device. Use this tutorial as a starting point for developing more complex applications.
