Migrating OpenCL™ Designs to DPC++

Published: 05/11/2020

Overview

This document explores the similarities and differences between the Khronos OpenCL™ standard and Data Parallel C++ (DPC++), so that developers can transition existing OpenCL applications to DPC++ easily. It assumes proficiency in OpenCL.

Data Parallel C++

DPC++ is an open standard-based cross-architecture programming language. The language allows developers to extract high performance from various hardware targets, including CPUs and accelerators such as GPUs and FPGAs. At the same time, it offers functional portability so users can reuse code across those architectures. DPC++ is part of the oneAPI cross-industry, open, standards-based unified programming model.

DPC++ is built upon the modern ISO C++ and Khronos SYCL* standards and extends them with additional capabilities that ease programmability and optimization. As a result, DPC++ delivers the productivity benefits of modern C++ and the performance benefits of data and task parallelism in a heterogeneous environment.

Fundamental Comparison to OpenCL

Because DPC++ includes SYCL, a higher-level abstraction layer that builds on OpenCL, most fundamental concepts are the same in DPC++ and OpenCL, and equivalent constructs map easily between the two. DPC++, however, adds the convenience, productivity, and flexibility of single-source C++. With the kernel code embedded in the host code, programmers gain simplicity of coding, and compilers gain the ability to analyze and optimize across the entire program regardless of the device on which the code is to run.

DPC++ accomplishes this through single-source multiple-compiler passes (SMCP). With SMCP, a single source file is parsed by different compilers for different target devices, generating different device binaries. In many cases, those binaries are combined into a single executable. For example, the host compiler generates native host code from an application, while a device compiler parses the same source file, targets only the kernels, and generates the device binary.

Migrating from OpenCL to DPC++

Migrating from OpenCL to DPC++ is relatively straightforward because both languages have similar programming models and features. DPC++ typically requires fewer lines of code to execute kernels because it needs fewer calls to essential host API functions.

Most OpenCL application developers are aware of the somewhat verbose setup code that goes with offloading kernels to devices. Using DPC++, it is possible to develop a clean, modern C++ based application without most of the setup associated with OpenCL C code. This reduces the learning effort and allows for a focus on parallelization techniques.

Let’s demonstrate with a simple example. The vector add kernel and application program are shown below for both OpenCL and DPC++ in their entirety. Both versions perform the same task of offloading the kernel to the default device, but the DPC++ version is much shorter because it relies on DPC++ defaults.

Vector Add Example

OpenCL
#include <CL/cl.h>

const char *kernelsource =  (R"(__kernel void vecAdd(  __global int *a,
                                                       __global int *b,
                                                       __global int *c)
{
   int i = get_global_id(0);
   c[i] = a[i] + b[i];
} )");

void opencl_code(int* a, int* b, int* c, int N)
{
   cl_int err;
   cl_platform_id myplatform;
   cl_device_id mydevice;
   //Choose the first platform and device
   err=clGetPlatformIDs(1, &myplatform, NULL);
   err=clGetDeviceIDs(myplatform, CL_DEVICE_TYPE_DEFAULT, 1, &mydevice, NULL);
   //Set up context and queue
   cl_context mycontext = clCreateContext(0, 1, &mydevice, NULL, NULL, &err);
   cl_command_queue myq = clCreateCommandQueue(mycontext, mydevice, 0, &err);
   //Create program from source code, compile, and create kernel object
   cl_program myprogram = clCreateProgramWithSource(mycontext, 1,
      (const char **) & kernelsource, NULL, &err);
   clBuildProgram(myprogram, 1, &mydevice, NULL, NULL, NULL);
   cl_kernel mykernel = clCreateKernel(myprogram, "vecAdd", &err);

   //Create Buffers and Write A and B to the device.
   size_t nbytes=N*sizeof(int);
   cl_mem buf_a= clCreateBuffer(mycontext, CL_MEM_READ_ONLY, nbytes, NULL, &err);
   cl_mem buf_b= clCreateBuffer(mycontext, CL_MEM_READ_ONLY, nbytes, NULL, &err);
   cl_mem buf_c= clCreateBuffer(mycontext, CL_MEM_WRITE_ONLY, nbytes, NULL, &err);
   err=clEnqueueWriteBuffer(myq, buf_a, CL_TRUE, 0, nbytes,  a, 0, NULL, NULL);
   err=clEnqueueWriteBuffer(myq, buf_b, CL_TRUE, 0, nbytes,  b, 0, NULL, NULL); 
   err = clSetKernelArg(mykernel, 0, sizeof(cl_mem), &buf_a);
   err = clSetKernelArg(mykernel, 1, sizeof(cl_mem), &buf_b);
   err = clSetKernelArg(mykernel, 2, sizeof(cl_mem), &buf_c);
   //Execute Kernel and upon completion copy data back to the host.
   size_t global_size = N;   //clEnqueueNDRangeKernel expects a size_t array, not an int
   err = clEnqueueNDRangeKernel(myq, mykernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
   clFinish(myq);
   err = clEnqueueReadBuffer(myq, buf_c, CL_TRUE, 0, nbytes, c, 0, NULL, NULL);
}
DPC++
#include <CL/sycl.hpp>
using namespace sycl;

void dpcpp_code(int* a, int* b, int* c, int N)
{
   //queue and buffer creation
   queue q;
   buffer<int,1> buf_a(a, range<1>(N));
   buffer<int,1> buf_b(b, range<1>(N));
   buffer<int,1> buf_c(c, range<1>(N));

   q.submit([&](handler &h){
      //Accessors provide kernel access to buffer
      auto A=buf_a.get_access<access::mode::read>(h);
      auto B=buf_b.get_access<access::mode::read>(h);
      auto C=buf_c.get_access<access::mode::write>(h);
      
      //Parallel Launch of Kernel with 1D range of N elements
      h.parallel_for(range<1>(N), [=](item<1> i) {
         C[i] = A[i]+B[i];
      });
   });
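   //The buffers go out of scope here; their destructors wait for the
   //kernel to finish and copy the results back to c
}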

OpenCL interoperability

In DPC++, all OpenCL features can be supported through the SYCL API. In addition, there are interoperability functions that allow OpenCL APIs to be called directly from SYCL. For example, SYCL buffers can be constructed from OpenCL buffers, OpenCL queues can be obtained from SYCL queues, and OpenCL kernels can be invoked from a SYCL program. We will show examples of these at the end of the document.

Programming Model

The DPC++ programming model is very similar, and in many aspects equivalent, to that of OpenCL.

Platform Model

The DPC++ platform model is based on OpenCL but includes additional abstractions. In both DPC++ and OpenCL, the platform model specifies a host that coordinates and controls the compute work performed on one or more devices. Devices can include CPUs, GPUs, FPGAs, and other accelerators. In DPC++, there is always a device corresponding to the host and this guarantees there will always be an available target for device kernel code. The most effective programming approach makes use of both the host and devices, hides latencies caused by data movement to and from the device, and accelerates appropriate workloads on the devices.

Device Terminology Equivalence

The following table shows how OpenCL and DPC++ terminology maps to GPU hardware. As you can see, OpenCL and DPC++ use the same terminology.

OpenCL | DPC++ | GPU Hardware
Compute Unit | Compute Unit | Subslice (Intel), [Dual] Compute Unit (AMD*), Streaming Multiprocessor (Nvidia*)
Processing Element | Processing Element | Execution Unit (Intel), SIMD Unit (AMD), SM Core (Nvidia)

Execution Model Equivalence

Both OpenCL and DPC++ allow hierarchical and parallel execution. The concepts of work-groups, subgroups, and work-items are equivalent in the two languages. Subgroups, which sit between work-groups and work-items, define a grouping of work-items within a work-group. Typically, subgroups map to Single Instruction Multiple Data (SIMD) hardware where it exists. Synchronization of work-items in a subgroup can occur independently of work-items in other subgroups, and subgroups expose communication operations across the work-items in the subgroup.

OpenCL | DPC++ | GPU Hardware
Work-group | Work-group | Thread Group
Subgroup | Subgroup | Execution Unit Thread (Vector Hardware)
Work-item | Work-item | SIMD Lane (Channel)

Host API

The Host API is used in both OpenCL and DPC++ to manage the platforms and devices. The following table shows the various include files that define the host API.

Include Files
OpenCL | DPC++
CL/opencl.h or CL/cl.h, CL/cl.hpp (C++ Binding), CL/cl2.hpp (C++ Binding v2.x) | CL/sycl.hpp

Platform Layer API

The platform layer API sets up the device execution environment. The tasks performed by the platform layer API include the following:

  1. Allow the host to discover devices and capabilities
  2. Query, select, and initialize compute devices
  3. Create compute contexts

In OpenCL, an explicit platform layer API must be called to select the platform (vendor), devices (accelerators), and context. In DPC++, the programmer has the option to do the same as OpenCL or to rely on the DPC++ runtime to choose a default platform and device. The categories of accelerator devices available in DPC++ are similar to OpenCL.

To explicitly specify a device in DPC++, use a subclass derived from the sycl::device_selector abstract class. DPC++ provides built-in device selectors to help get code up and running quickly.

Available built-in device selectors:

  • default_selector : implementation defined, selects host if no other device is available.
  • host_selector : selects host device, always returns a valid device.
  • cpu_selector : attempts to select a CPU device.
  • gpu_selector : attempts to select a GPU device.
  • intel::fpga_selector : attempts to select an FPGA device. 
  • intel::fpga_emulator_selector : attempts to select an FPGA emulation device, for FPGA kernel debugging purposes.
  • accelerator_selector : attempts to select other possible accelerators. 

If an application requires a specific device, such as one particular GPU among the many GPUs available in a system, you can write your own device selector class that inherits from the sycl::device_selector class.
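The selection rule behind a custom selector can be sketched in plain C++ without any SYCL headers. A SYCL device selector returns an integer score for each device; the runtime picks the highest score and never picks a device with a negative score. The sketch below only models that rule: DeviceInfo, score_device, and pick_device are hypothetical names for illustration and are not part of the SYCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the device properties a selector might inspect.
struct DeviceInfo {
    std::string name;
    bool is_gpu;
    unsigned compute_units;
};

// Scoring rule in the spirit of a selector's call operator:
// a higher score wins, and a negative score rejects the device.
int score_device(const DeviceInfo& d) {
    if (!d.is_gpu) return -1;                  // reject anything that is not a GPU
    return static_cast<int>(d.compute_units);  // prefer the GPU with more compute units
}

// Returns the index of the best-scoring device, or -1 if every device is rejected.
int pick_device(const std::vector<DeviceInfo>& devices) {
    int best = -1;
    int best_score = -1;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        int s = score_device(devices[i]);
        if (s > best_score) {   // negative scores can never pass this test
            best_score = s;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

In a real selector, the same scoring logic would live in the call operator of the class derived from sycl::device_selector and would read its inputs from device queries rather than from a hand-built struct.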

Platform Layer API
OpenCL
//Get the first platform ID
cl_platform_id myp;
err=clGetPlatformIDs(1, &myp, NULL);

// Get the device
cl_device_id mydev;
err=clGetDeviceIDs(myp,
    CL_DEVICE_TYPE_GPU, 1, &mydev, NULL);

//Create Context
cl_context context;
context = clCreateContext(NULL, 1,
          &mydev, NULL, NULL, &err);
DPC++
using namespace sycl;

//Optional, prepare device selector
default_selector selector;
//or
//For development and debugging
host_selector selector;
//or
cpu_selector selector;
//or
gpu_selector selector;
//or
intel::fpga_selector selector;
//or write your own selector
my_custom_selector selector;

Command Queue

In both OpenCL and DPC++, the command queue is the mechanism for the host to request action by the device. Each command queue is associated with one device and the host submits commands to the queue. Multiple queues can be mapped to the same device.

In OpenCL, command queues are created from a device ID, and tasks can then be submitted to them through clEnqueue… commands.

In DPC++, queues are created from device selectors. If a device selector is not provided, the runtime simply uses the default_selector. Tasks can then be submitted to the queue using the submit command.

Queue
OpenCL
cl_command_queue q;
q = clCreateCommandQueue(…);
clEnqueue…(q, …);
DPC++
//queue class
queue q(selector);
q.submit([&](handler& h) {
     //Command Group Code
});

Data Management

To manage data in OpenCL, you must create buffers or images that represent an abstract view of the device memory, and use functions to read from, write to, or copy that memory. In DPC++, you can still work with buffers and images, but you may also access device memory through pointers by using the Unified Shared Memory (USM) feature. When using buffers and images in DPC++, accessors are the mechanism for accessing the data in the kernel. With accessors, implicit data dependencies are detected, and kernel execution is scheduled accordingly.

Data Management
OpenCL | DPC++
clCreateBuffer() | buffer class
clEnqueueReadBuffer(), clEnqueueWriteBuffer(), clEnqueueCopyBuffer() | Handled implicitly with accessors or explicitly with handler::copy() or queue::memcpy()
N/A | accessor class

Data Management Example

OpenCL
cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY, N*sizeof(int), NULL, &err);
cl_mem buf_b = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N*sizeof(int), NULL, &err);

clEnqueueWriteBuffer(q, buf_a, CL_TRUE, 0, N*sizeof(int),  a, 0, NULL, NULL); 
//Code to Launch Kernel
…
clEnqueueReadBuffer(q, buf_b, CL_TRUE, 0, N*sizeof(int), b, 0, NULL, NULL);
DPC++
{
   //Create buffers
   //Reading and writing of data automatically derived through scope
   buffer<int,1> buf_a(a, range<1>(N));
   buffer<int,1> buf_b(b, range<1>(N));
   q.submit([&](handler &h){
        auto A=buf_a.get_access<access::mode::read>(h);
        auto B=buf_b.get_access<access::mode::write>(h);
        h.parallel_for…  {         //Launch Kernel
             B[i] = process(A[i]);
        }
   });
}
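The scope-based data movement in the DPC++ example can be illustrated with a host-only model. This is a sketch of the idea, not of how a SYCL runtime is implemented: ScopedBuffer is a hypothetical class that copies the host data when it is constructed and writes the data back when it is destroyed, which is the behavior the surrounding braces rely on in the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffer; this is NOT a SYCL class.
class ScopedBuffer {
public:
    // Take a private copy of the host data, as if it had moved to a device.
    ScopedBuffer(int* host, std::size_t n) : host_(host), device_(host, host + n) {}

    // Write the (possibly modified) copy back when the buffer leaves scope.
    ~ScopedBuffer() {
        for (std::size_t i = 0; i < device_.size(); ++i) host_[i] = device_[i];
    }

    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;

    // Plays the role of an accessor: the only way to reach the data.
    std::vector<int>& data() { return device_; }

private:
    int* host_;
    std::vector<int> device_;
};

// Doubles every element. The host array is updated only when buf is destroyed.
void double_in_place(int* host, std::size_t n) {
    ScopedBuffer buf(host, n);
    for (int& v : buf.data()) v *= 2;   // stands in for the kernel
}
```

The practical consequence for ported code is the same as in the model: results should be read from the host array only after the buffer has gone out of scope, or through a host accessor.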

Unified Shared Memory (USM)

DPC++ also includes support for USM, which simplifies programming by enabling the sharing of memory between the host and device without explicit accessors in the source code. USM provides a pointer-based alternative to buffers. This allows C++ pointer-based programs to be ported to DPC++ easily, since DPC++ can continue to accept pointers. Devices that support USM must support a virtual address space, so that a pointer value returned by a USM allocation routine on the host is guaranteed to be valid on the device. With USM, programmers manage access and enforce dependencies by waiting on events or by declaring a depends_on relationship between events.

Kernel Launch

Because of the single-source nature of DPC++, the process to specify, compile, and launch kernels is different between OpenCL and DPC++. In OpenCL, the kernel function is usually located in a separate source file or precompiled binary. In DPC++, the kernel function is specified inside the kernel scope portion which is inside the command group scope of the unified source file. DPC++ kernels are typically expressed in the form of C++ lambdas or function objects (functors).

To launch a kernel in OpenCL, you first construct a program object from either the source code or a precompiled binary, then run the clBuildProgram function to compile the kernels. Next, you extract the specific kernel from the compiled program using the clCreateKernel command. After that, you use the clSetKernelArg command to map each of the kernel arguments to a buffer, image, or variable. Lastly, you launch the kernel using the clEnqueueNDRangeKernel command. If a hierarchical NDRange launch is required, you need to specify the global_work_size, which is the total number of work-items, as well as the local_work_size, which is the number of work-items in each work-group.

The process to launch a kernel in DPC++ is much simpler. All you need to do is write your kernel as a lambda or functor inside the command group scope created by the queue::submit() call. When you call a function on the command group handler that executes the kernel, for example parallel_for, single_task, or parallel_for_work_group, you pass in the lambda or functor, and the kernel function executes on the device. For parallel kernels, you also pass in either a range, which specifies the global range, or an nd_range, which specifies both the global and local execution ranges.

Kernel Launch
OpenCL | DPC++
clCreateProgramWithSource/Binary(), clBuildProgram(), clCreateKernel(), clSetKernelArg(), clEnqueueNDRangeKernel() | queue::submit(), parallel_for(), parallel_for_work_group(), parallel_for_work_item(), single_task()
global_work_size, local_work_size variables | range class, nd_range class

Kernel Launch Example

OpenCL
cl_program myprogram = clCreateProgramWithSource(…);
clBuildProgram(myprogram…);
cl_kernel mykernel = clCreateKernel(myprogram, "kernel_name", &err);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), (void *) &a_buf);
clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *) &b_buf);
clEnqueueNDRangeKernel(queue, mykernel, 2, NULL, global_size, local_size, …);
DPC++
q.submit([&](handler &h){
     range<2> global(N,N);
     range<2> local(B,B);
     h.parallel_for(nd_range<2>(global,local), [=](nd_item<2> item)
     {
          // Kernel Code
     });
});
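When a local size is given, the global size in each dimension generally has to be a multiple of it (an nd_range requires this, and so does OpenCL unless non-uniform work-groups are available). A common approach is to round the global size up and have the kernel skip the padding work-items. The helpers below are a minimal sketch in plain C++; round_up_global and num_work_groups are hypothetical names, not OpenCL or SYCL functions.

```cpp
#include <cassert>
#include <cstddef>

// Round the problem size n up to the next multiple of the local size.
std::size_t round_up_global(std::size_t n, std::size_t local) {
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

// Number of work-groups launched in one dimension.
// Assumes the global size is already a multiple of the local size.
std::size_t num_work_groups(std::size_t global, std::size_t local) {
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

For example, 1000 elements with a local size of 64 give a global size of 1024 and 16 work-groups per dimension. A kernel launched over the padded range needs a guard such as "if (i < n)" so that the extra 24 work-items do not touch memory past the end of the data.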

Synchronization

The features to synchronize execution between the host and various command queues corresponding to the devices are the same between OpenCL and DPC++. However, in DPC++, there’s also implicit dependency with accessor usage. If two kernels use the same buffer, accessors will automatically create data dependencies in the SYCL graph so that the second kernel will wait for the completion of the first kernel. Explicit synchronization is available in both OpenCL and DPC++, with equivalencies shown in the following table.

Synchronization
OpenCL | DPC++
clFinish() | queue::wait()
clWaitForEvents() | event::wait()

Device Kernel Code

With kernel code, OpenCL and DPC++ are very similar. This section will explain the differences.

The first difference is the use of kernel qualifiers and address space qualifiers. These qualifiers are not needed in DPC++ as they are all abstracted by the runtime classes.

Qualifiers
OpenCL | DPC++
__kernel | N/A
__constant | N/A
__global | N/A
__local | N/A
__private | N/A

The concept of indexing work-groups and work-items inside a kernel is the same across DPC++ and OpenCL. But the way the indexing functions are called is slightly different. In DPC++, the nd_item class provides functions that return various indexes, sizes, and ranges. The nd_item class includes additional functionality that returns the global or local linear index, rather than index within a specific dimension. The nd_item class can also return group and sub_group objects, which encapsulate functionality related to work-groups and sub-groups.

Also, DPC++, through the group and sub_group classes, provides functionality on work-groups and sub-groups such as broadcast, any, all, reduce, exclusive scan, inclusive scan, and shuffles for members of a work-group or sub-group.

Kernel Queries
OpenCL | DPC++
get_global_id() | nd_item::get_global_id()
get_local_id() | nd_item::get_local_id()
get_group_id() | nd_item::get_group_id()
get_global_size() | nd_item::get_global_range()
get_local_size() | nd_item::get_local_range()
get_num_groups() | nd_item::get_group_range()
N/A | get_group()
N/A | get_sub_group()
N/A | get_global_linear_id()
N/A | get_local_linear_id()
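The relationship between these queries can be written out in a few lines of plain C++. The sketch assumes a 2D launch with no global offset and uses the row-major linearization that SYCL defines, in which dimension 1 varies fastest. Index2, global_id, and linear_id are hypothetical helper names that only mirror what the nd_item queries return.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Index2 = std::array<std::size_t, 2>;

// Global id of a work-item, derived from its group id, its local id,
// and the local range (work-group size) in each dimension.
Index2 global_id(const Index2& group, const Index2& local, const Index2& local_range) {
    return {group[0] * local_range[0] + local[0],
            group[1] * local_range[1] + local[1]};
}

// Row-major linear index of a 2D id within a 2D range.
// Applied to a global id and global range it models get_global_linear_id();
// applied to a local id and local range it models get_local_linear_id().
std::size_t linear_id(const Index2& id, const Index2& range) {
    assert(id[0] < range[0] && id[1] < range[1]);
    return id[0] * range[1] + id[1];
}
```

With 8 x 8 work-groups in a 32 x 32 global range, the work-item with local id (3, 5) in group (2, 1) has global id (19, 13), global linear id 19 * 32 + 13 = 621, and local linear id 3 * 8 + 5 = 29.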

As with the indexing functions, kernel synchronization functionality in DPC++ is provided by the nd_item class. Each of the functions takes an access::fence_space argument, which can be local_space, global_space, or global_and_local.

Synchronization
OpenCL | DPC++
barrier() | nd_item::barrier()
mem_fence() | nd_item::mem_fence()
read_mem_fence() | nd_item::mem_fence()
write_mem_fence() | nd_item::mem_fence()

Kernel Example

In this section, we examine and compare the OpenCL and DPC++ implementations of a tiled matrix multiply that calculates c = a x b. Local memories a_tile and b_tile are used to minimize loads and stores to global memory. The kernel is launched as an ND-range kernel with a 2D global size of N x N and a 2D local size of B x B. For DPC++, we first use parallel_for_work_group to enable work-group parallelism and then use parallel_for_work_item to enable work-item parallelism. With these constructs and the corresponding scopes, barriers are implicit.

Kernel Matrix Multiply Example

OpenCL
// N (matrix dimension) and B (tile size) are assumed to be compile-time
// constants defined elsewhere. Matrices are stored row-major in flat arrays.
__kernel void matrix_mul(__global float *restrict a,
                         __global float *restrict b,
                         __global float *restrict c)
{
     __local float a_tile[B][B];
     __local float b_tile[B][B];
     int j=get_global_id(0);
     int i=get_global_id(1);
     int lj=get_local_id(0);
     int li=get_local_id(1);
     for (int kb=0; kb < N/B; ++kb)
     {
          //Load tiles of A and B matrices into local memory
          a_tile[lj][li] = a[j*N + kb*B+li];
          b_tile[lj][li] = b[(kb*B+lj)*N + i];

          //Wait for load into local memory to complete
          barrier(CLK_LOCAL_MEM_FENCE);

          //Compute matrix multiply using local memory
          for (int k=0; k < B; ++k)
          {
               c[j*N + i] += a_tile[lj][k] * b_tile[k][li];
          }

          //Barrier to ensure all work-items are done
          barrier(CLK_LOCAL_MEM_FENCE);
     }
}
DPC++
h.parallel_for_work_group<class matrix_mul>(range<2>(N/B, N/B),
                                            [=] (group<2> grp) {
     float a_tile[B][B];
     float b_tile[B][B];
     int jb=grp.get_id(0);
     int ib=grp.get_id(1);

     for (int kb=0; kb < N/B; ++kb) {
          //This parallel_for_work_item is for loading tiles of A and B
          grp.parallel_for_work_item(range<2>(B,B), [&](h_item<2> item) {
               int lj=item.get_logical_local_id(0);
               int li=item.get_logical_local_id(1);
               int j=jb*B+lj;
               int i=ib*B+li;
               //Load tiles of A and B matrices into local memory
               a_tile[lj][li] = a[j][kb*B+li];
               b_tile[lj][li] = b[kb*B+lj][i];
          });
          //Implicit Barrier Here
          grp.parallel_for_work_item(range<2>(B,B), [&](h_item<2> item) {
               int lj=item.get_logical_local_id(0);
               int li=item.get_logical_local_id(1);
               int j=jb*B+lj;
               int i=ib*B+li;

               //Compute matrix multiply using local memory
               for (int k=0; k < B; ++k)
               {
                    c[j][i] += a_tile[lj][k] * b_tile[k][li];
               }
          });
          //Implicit Barrier Here As well
     }
});
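When porting a kernel like this, it helps to have a host-only reference to compare device results against. The sketch below follows the same two phases as the kernels above (load a tile, then accumulate from it) using ordinary loops in place of work-groups and work-items. It assumes square row-major matrices stored in flat vectors and a tile size that divides the matrix dimension; tiled_matmul and naive_matmul are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: c += a * b for n x n row-major matrices.
// 'tile' plays the role of B in the kernels and must divide n.
void tiled_matmul(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n, std::size_t tile) {
    assert(tile > 0 && n % tile == 0);
    std::vector<float> a_tile(tile * tile), b_tile(tile * tile);
    for (std::size_t jb = 0; jb < n / tile; ++jb) {        // work-group row
        for (std::size_t ib = 0; ib < n / tile; ++ib) {    // work-group column
            for (std::size_t kb = 0; kb < n / tile; ++kb) {
                // Phase 1: each "work-item" loads one element of each tile.
                for (std::size_t lj = 0; lj < tile; ++lj) {
                    for (std::size_t li = 0; li < tile; ++li) {
                        std::size_t j = jb * tile + lj;
                        std::size_t i = ib * tile + li;
                        a_tile[lj * tile + li] = a[j * n + kb * tile + li];
                        b_tile[lj * tile + li] = b[(kb * tile + lj) * n + i];
                    }
                }
                // Phase 2 (after the barrier in the kernels): accumulate.
                for (std::size_t lj = 0; lj < tile; ++lj) {
                    for (std::size_t li = 0; li < tile; ++li) {
                        std::size_t j = jb * tile + lj;
                        std::size_t i = ib * tile + li;
                        for (std::size_t k = 0; k < tile; ++k) {
                            c[j * n + i] += a_tile[lj * tile + k] * b_tile[k * tile + li];
                        }
                    }
                }
            }
        }
    }
}

// Straightforward triple loop used as the ground truth.
void naive_matmul(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) sum += a[j * n + k] * b[k * n + i];
            c[j * n + i] += sum;
        }
    }
}
```

Both functions accumulate into c, so the output matrix should be zero-initialized before the call, just as it must be for the kernels above.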

OpenCL and DPC++ Interoperability Example

When migrating existing OpenCL applications to DPC++, developers may prefer to port their code incrementally. OpenCL and DPC++ are interoperable in several ways, and we’ll examine a few of those in this section.

Executing OpenCL kernels from DPC++ programs

If you would like to keep your kernel code in OpenCL while executing it in a DPC++ environment, SYCL provides the mechanisms to do exactly that. Here are the steps to perform OpenCL kernel ingestion through a DPC++ program.

  • Create a sycl::program object from the same context used by the device queue.
  • Build the kernel code with the program::build_with_source() function by passing in the OpenCL kernel source string.
  • Create accessors for the kernel arguments in the DPC++ command group scope.
  • Assign the accessor(s) as kernel arguments by using the handler::set_arg() or handler::set_args() function.
  • Launch the kernel using the handler::single_task() or handler::parallel_for() function, passing in the range or nd_range along with the kernel object obtained from the program object.

In the example below, we’re performing the same vector add as the example at the beginning of this document, except we’re executing the OpenCL kernel in the SYCL environment.

OpenCL Kernel in a DPC++ Program Example

void dpcpp_code(int* a, int* b, int* c, int N)
{
  queue q{gpu_selector()};       //Create Command Queue Targeting GPU
  program p(q.get_context());  //Create program from the same context as q

  //Compile OpenCL vecAdd kernel, which is expressed as a C++ raw string
  //as indicated by R"
  p.build_with_source(R"( __kernel void vecAdd(__global int *a, 
                                               __global int *b, 
                                               __global int *c) 
                         {
                                 int i=get_global_id(0);
                                 c[i] = a[i] + b[i]; 
                         } )");
    buffer<int, 1> buf_a(a, range<1>(N));
    buffer<int, 1> buf_b(b, range<1>(N));
    buffer<int, 1> buf_c(c, range<1>(N));
    q.submit([&](handler& h) {
            auto A = buf_a.get_access<access::mode::read>(h);
            auto B = buf_b.get_access<access::mode::read>(h);
            auto C = buf_c.get_access<access::mode::write>(h);
            // Set buffers as arguments to the kernel
            h.set_args(A, B, C);
            // Launch vecAdd kernel from the p program object across N elements.
            h.parallel_for(range<1> (N), p.get_kernel("vecAdd"));
    });
}

Converting DPC++ objects to OpenCL objects

If you would like to augment DPC++ programs to also use the OpenCL API, you will need to convert DPC++ objects to their underlying interoperable OpenCL versions. Many DPC++ objects have get() methods that return the OpenCL object for use with the OpenCL API. Below is a list of those objects.

DPC++/SYCL Objects with get()

  • cl::sycl::platform::get() -> cl_platform_id
  • cl::sycl::context::get() -> cl_context
  • cl::sycl::device::get() -> cl_device_id
  • cl::sycl::queue::get() -> cl_command_queue
  • cl::sycl::event::get() -> cl_event
  • cl::sycl::program::get() -> cl_program
  • cl::sycl::kernel::get() -> cl_kernel

Converting OpenCL objects to DPC++ objects

OpenCL objects can be used in SYCL constructors to create the corresponding SYCL object. This technique is most often used to add SYCL runtime functionality to an existing OpenCL code base. Be aware that although DPC++ buffers and images cannot be converted to cl_mem objects through get(), they do have constructors that take cl_mem.

DPC++/SYCL Constructors Using OpenCL objects

  • cl::sycl::platform::platform(cl_platform_id)
  • cl::sycl::context::context(cl_context, …)
  • cl::sycl::device::device(cl_device_id)
  • cl::sycl::queue::queue(cl_command_queue,…)
  • cl::sycl::event::event(cl_event, …)
  • cl::sycl::program::program(context, cl_program)
  • cl::sycl::kernel::kernel(cl_kernel, …)
  • cl::sycl::buffer::buffer(cl_mem, …)
  • cl::sycl::image::image(cl_mem, …)

The following example shows a DPC++ program using OpenCL objects cl_mem, cl_command_queue, cl_kernel, and cl_context.

DPC++ Program Using OpenCL Objects

cl_kernel ocl_kernel = clCreateKernel(ocl_program, "vecAdd", &err);
cl_mem ocl_buf_a=clCreateBuffer(ocl_context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
cl_mem ocl_buf_b=clCreateBuffer(ocl_context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
cl_mem ocl_buf_c=clCreateBuffer(ocl_context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
clEnqueueWriteBuffer(ocl_queue, ocl_buf_a, CL_TRUE, 0, bytes, host_a, 0, NULL, NULL );        
clEnqueueWriteBuffer(ocl_queue, ocl_buf_b, CL_TRUE, 0, bytes, host_b, 0, NULL, NULL );        
{ //DPC++ Application Scope
     //Construct SYCL versions of the context, queue, kernel, and buffers
     context sycl_context(ocl_context);
     queue sycl_queue(ocl_queue, sycl_context);
     kernel sycl_kernel(ocl_kernel, sycl_context); 
     buffer<int, 1> sycl_buf_a(ocl_buf_a, sycl_context);
     buffer<int, 1> sycl_buf_b(ocl_buf_b, sycl_context);
     buffer<int, 1> sycl_buf_c(ocl_buf_c, sycl_context);
     sycl_queue.submit([&](handler& h) {
          // Create accessors for each of the buffers
          auto a_accessor = sycl_buf_a.get_access<access::mode::read>(h);
          auto b_accessor = sycl_buf_b.get_access<access::mode::read>(h);
          auto c_accessor = sycl_buf_c.get_access<access::mode::write>(h);
          // Map kernel arguments to accessors
          h.set_args(a_accessor, b_accessor, c_accessor);
          //Launch Kernel (r is the range of work-items, assumed to be defined earlier)
          h.parallel_for(r, sycl_kernel);
     });
}
//Read buffer content back to host array
clEnqueueReadBuffer(ocl_queue, ocl_buf_c, CL_TRUE, 0, bytes, host_c, 0, NULL, NULL );

Conclusion

OpenCL and DPC++ are both heterogeneous, parallel, open languages with similar features and characteristics, so existing OpenCL applications can be easily migrated to take advantage of DPC++ features. DPC++ includes modern programming features such as single-source programming, type-safe kernel execution, built-in defaults, simplified synchronization, and much more. These features allow applications to be cleanly coded or ported, and performance-optimized to execute on various accelerators.
 