Developer Guide

Performance Impact of USM and Buffers

SYCL offers several choices for managing memory on the device. This section discusses the performance tradeoffs, with a brief introduction to the concepts. See Data Parallel C++ for an in-depth explanation.
As with other language features, the SYCL specification defines the behavior but not the implementation, so performance characteristics can change between software versions and devices. We provide best practices below.
Buffers. A buffer is a container for data that can be accessed from a device and the host. The SYCL runtime manages the memory by providing APIs for allocating, reading, and writing it. The runtime is responsible for moving data between host and device, and for synchronizing access to the data.
Unified Shared Memory (USM). USM allows reading and writing of data with conventional pointers, in contrast to buffers, where data is accessed exclusively through APIs. USM has two commonly used variants. Device allocations can only be accessed from the device and therefore require explicit movement of data between host and device. Shared allocations can be referenced from device or host, with the runtime automatically moving memory.
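As a sketch, the allocation calls for the two variants look like the following fragment (it assumes an existing sycl::queue named q and an element count data_size, and requires a SYCL toolchain; it is illustrative only):

```cpp
#include <sycl/sycl.hpp>

void usm_variants(sycl::queue &q, size_t data_size) {
  // Device allocation: accessible only from the device; data moves
  // via explicit q.memcpy() calls.
  float *device_ptr = sycl::malloc_device<float>(data_size, q);

  // Shared allocation: accessible from host and device; the runtime
  // migrates the memory automatically on access.
  float *shared_ptr = sycl::malloc_shared<float>(data_size, q);

  // Both variants are released with sycl::free.
  sycl::free(device_ptr, q);
  sycl::free(shared_ptr, q);
}
```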
We illustrate the tradeoffs between the choices by showing the same example program written with the three models. To bring out the issues, we use a program in which a GPU and the host compute cooperatively and therefore need to ship data back and forth.
We start by showing the serial computation below. Assume that we want to perform the loop at line 9 on the GPU and the loop at line 14 on the CPU. Both loops read and write the data array, so data must move between host and GPU for each iteration of the time-step loop at line 8.
 1  void serial(int stride) {
 2    // Allocate and initialize data
 3    float *data = new float[data_size];
 4    init(data);
 5
 6    timer it;
 7
 8    for (int i = 0; i < time_steps; i++) {
 9      for (int j = 0; j < data_size; j++) {
10        for (int k = 0; k < device_steps; k++)
11          data[j] += 1.0;
12      }
13
14      for (int j = 0; j < data_size; j += stride)
15        data[j] += 1.0;
16    }
17    put_elapsed_time(it);
18    check(data);
19
20    delete[] data;
21  } // serial

Buffers

Below, we show the same computation using buffers to manage data. A buffer is created at line 3 and initialized by the init function. The init function is not shown; it accepts either an accessor or a pointer. The parallel_for executes the kernel defined on line 13. The kernel uses the device_data accessor to read and write data in buffer_data. Note that the code does not specify the location of data. An accessor indicates when and where the data is needed, and the SYCL runtime moves the data to the device (if necessary) and then launches the kernel. The host_accessor on line 21 indicates that the data will be read/written on the host. Since the kernel also reads and writes buffer_data, the host_accessor constructor waits for the kernel to complete and moves the data to the host to perform the read/write on line 23. In the next iteration of the loop, the accessor constructor on line 11 waits until the data has moved back to the device, which effectively delays launching the kernel.
 1  void buffer_data(int stride) {
 2    // Allocate buffer, initialize on host
 3    sycl::buffer<float> buffer_data{data_size};
 4    init(sycl::host_accessor(buffer_data, sycl::write_only, sycl::noinit));
 5
 6    timer it;
 7    for (int i = 0; i < time_steps; i++) {
 8      // Compute on device
 9      q.submit([&](auto &h) {
10
11        sycl::accessor device_data(buffer_data, h);
12
13        auto compute = [=](auto id) {
14          for (int k = 0; k < device_steps; k++)
15            device_data[id] += 1.0;
16        };
17        h.parallel_for(data_size, compute);
18      });
19
20      // Compute on host
21      sycl::host_accessor host_data(buffer_data);
22      for (int i = 0; i < data_size; i += stride)
23        host_data[i] += 1.0;
24    }
25    put_elapsed_time(it);
26
27    const sycl::host_accessor h(buffer_data);
28    check(h);
29  } // buffer_data
Performance Considerations
The data accesses on lines 15 and 23 appear to be simple array references, but they are implemented by the SYCL runtime with C++ operator overloading. The efficiency of accessor array references depends on the implementation. In practice, device code pays no overhead for the overloading compared to direct memory references. The runtime does not know in advance which part of the buffer a kernel will access, so it must ensure all the data is on the device before the kernel begins. This is true today but may change over time. The same is not currently true for the host_accessor. The runtime does not move all the data to the host; the array references are implemented with more complex code and are significantly slower than native C++ array references. While it is acceptable to reference a small amount of data, computationally intensive algorithms that use a host_accessor pay a large performance penalty and should be avoided.
Another issue is concurrency. A host_accessor can block kernels that reference the same buffer from launching, even if the accessor is not actively being used to read or write data. Limit the scope that contains the host_accessor to the minimum possible. In this example, the host accessor on line 4 is destroyed when the init function returns, and the host accessor on line 21 is destroyed at the end of each loop iteration.

Shared Allocations

Next we show the same algorithm implemented with shared allocations. Data is allocated on line 2. Accessors are not needed because USM-allocated data can be referenced with conventional pointers, so the array references on lines 10 and 15 can be implemented with simple indexing. The parallel_for on line 12 ends with a wait to ensure the kernel finishes before the host accesses data on line 15. Similar to buffers, the SYCL runtime ensures that all the data is resident on the device before launching a kernel. And like buffers, a shared allocation is not copied to the host unless it is referenced. The first time the host references the data, an operating system page fault occurs, a page of data is copied from device to host, and execution continues. Subsequent references to data on the same page execute at full speed. When a kernel is launched, all of the host-resident pages are flushed back to the device.
 1  void shared_usm_data(int stride) {
 2    float *data = sycl::malloc_shared<float>(data_size, q);
 3    init(data);
 4
 5    timer it;
 6    for (int i = 0; i < time_steps; i++) {
 7
 8      auto compute = [=](auto id) {
 9        for (int k = 0; k < device_steps; k++)
10          data[id] += 1.0;
11      };
12      q.parallel_for(data_size, compute).wait();
13
14      for (int k = 0; k < data_size; k += stride)
15        data[k] += 1.0;
16    }
17    q.wait();
18    put_elapsed_time(it);
19    check(data);
20
21    sycl::free(data, q);
22  } // shared_usm_data
Performance Considerations
Compared to buffers, data references are simple pointers and perform well. However, servicing page faults to bring data to the host adds overhead beyond the cost of transferring the data itself. The impact on the application depends on the reference pattern: sparse random access has the highest overhead, while linear scans through the data suffer less from page faults.
Since all synchronization is explicit and under programmer control, concurrency is not an issue for a well designed program.

Device Allocations

The same program with device allocations can be found below. With a device allocation, data can only be directly accessed on the device and must be explicitly copied to the host, as is done on line 21. All synchronization between device and host is explicit. Line 21 ends with a wait so the host code does not execute until the asynchronous copy finishes. The queue definition is not shown, but it uses an in-order queue so the memcpy on line 21 waits for the parallel_for on line 18 to complete.
 1  void device_usm_data(int stride) {
 2    // Allocate and initialize host data
 3    float *host_data = new float[data_size];
 4    init(host_data);
 5
 6    // Allocate device data
 7    float *device_data = sycl::malloc_device<float>(data_size, q);
 8
 9    timer it;
10    for (int i = 0; i < time_steps; i++) {
11      // Copy data to device and compute
12      q.memcpy(device_data, host_data, sizeof(float) * data_size);
13
14      auto compute = [=](auto id) {
15        for (int k = 0; k < device_steps; k++)
16          device_data[id] += 1.0;
17      };
18      q.parallel_for(data_size, compute);
19
20      // Copy data to host and compute
21      q.memcpy(host_data, device_data, sizeof(float) * data_size).wait();
22      for (int k = 0; k < data_size; k += stride)
23        host_data[k] += 1.0;
24    }
25    q.wait();
26    put_elapsed_time(it);
27    check(host_data);
28
29    sycl::free(device_data, q);
30    delete[] host_data;
31  } // device_usm_data
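The queue q used throughout these examples is declared elsewhere. A minimal sketch of an in-order queue, which serializes successively submitted operations in submission order (a fragment requiring a SYCL toolchain; illustrative only):

```cpp
#include <sycl/sycl.hpp>

// In-order property makes each submitted memcpy/parallel_for wait for
// the previously submitted operation on this queue.
sycl::queue q{sycl::property::queue::in_order()};
```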
Performance Considerations
Both data movement and synchronization are explicit and under the full control of the programmer. On the host, array references are native C++ array references, so device allocations have neither the page-fault overhead of shared allocations nor the overloading overhead associated with buffers. Shared allocations only transfer the data that the host actually references, at memory-page granularity. In theory, device allocations allow on-demand movement of any granularity; in practice, fine-grained asynchronous movement of data can be complex, and most programmers simply move the entire data structure once. The requirement for explicit data movement and synchronization makes the code more complicated, but device allocations can provide the best performance.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.