Developer Guide

Data Types for Atomic Operations

Atomics allow multiple work-items to communicate through memory. DPC++ atomics are similar to C++ atomics: accesses to a resource protected by an atomic are guaranteed to execute as a single unit. The following kernel implements a reduction in DPC++ in which every work-item updates a global accumulator atomically. The input vector, and therefore the data type of the addition, is integer. The performance of this kernel is reasonable compared with other reduction techniques such as blocking.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::noinit);
  h.parallel_for(data_size, [=](auto index) {
    size_t glob_id = index[0];
    auto v = sycl::ONEAPI::atomic_ref<int, sycl::ONEAPI::memory_order::relaxed,
                                      sycl::ONEAPI::memory_scope::device,
                                      sycl::access::address_space::global_space>(
        sum_acc[0]);
    v.fetch_add(buf_acc[glob_id]);
  });
});
If the data type of the vector is float or double, as shown in the kernel below, performance on some accelerators suffers due to lack of hardware support for float or double atomics. The following two kernels demonstrate how the time to execute an atomic add can vary drastically depending on whether native atomics are supported.
int VectorInt(sycl::queue &q, int iter) {
  VectorAllocator<int> alloc;
  AlignedVector<int> a(array_size, alloc);
  AlignedVector<int> b(array_size, alloc);

  InitializeArray<int>(a);
  InitializeArray<int>(b);
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_write);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      h.parallel_for(num_items, [=](auto i) {
        auto v = sycl::ONEAPI::atomic_ref<int, sycl::ONEAPI::memory_order::relaxed,
                                          sycl::ONEAPI::memory_scope::device,
                                          sycl::access::address_space::global_space>(
            a_acc[0]);
        v += b_acc[i];
      });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector int completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}
When using atomics, care must be taken to ensure that the hardware supports them and that they can be executed efficiently. Gen9 and Intel® Iris® Xe integrated graphics have no hardware support for atomics on float or double data types, so the performance of VectorDouble will be very poor. On future GPUs where float and double atomics are supported in hardware, the performance of the VectorDouble kernel below will be much better.
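Before committing to an algorithm that relies on float or double atomics, it is worth checking at runtime whether the target device supports them. Below is a minimal sketch, assuming a DPC++ compiler that implements the SYCL 2020 aspect query; note that the atomic64 aspect reports 64-bit atomic capability in general, and does not by itself guarantee hardware floating-point atomics.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();
  // aspect::atomic64 indicates the device can perform 64-bit atomic
  // operations, a prerequisite for efficient atomic doubles.
  if (dev.has(sycl::aspect::atomic64))
    std::cout << "64-bit atomics supported on "
              << dev.get_info<sycl::info::device::name>() << "\n";
  else
    std::cout << "64-bit atomics not supported; expect emulation overhead\n";
  return 0;
}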
int VectorDouble(sycl::queue &q, int iter) {
  VectorAllocator<double> alloc;
  AlignedVector<double> a(array_size, alloc);
  AlignedVector<double> b(array_size, alloc);

  InitializeArray<double>(a);
  InitializeArray<double>(b);
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_write);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      h.parallel_for(num_items, [=](auto i) {
        auto v = sycl::ONEAPI::atomic_ref<double, sycl::ONEAPI::memory_order::relaxed,
                                          sycl::ONEAPI::memory_scope::device,
                                          sycl::access::address_space::global_space>(
            a_acc[0]);
        v += b_acc[i];
      });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector Double completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}
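The comparison below also refers to a VectorFloat kernel, which is not listed in this section. A sketch of what it might look like, assuming it simply mirrors VectorDouble with float and reuses the guide's VectorAllocator, AlignedVector, InitializeArray helpers and the array_size constant:

int VectorFloat(sycl::queue &q, int iter) {
  VectorAllocator<float> alloc;
  AlignedVector<float> a(array_size, alloc);
  AlignedVector<float> b(array_size, alloc);

  InitializeArray<float>(a);
  InitializeArray<float>(b);
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_write);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      h.parallel_for(num_items, [=](auto i) {
        auto v = sycl::ONEAPI::atomic_ref<float, sycl::ONEAPI::memory_order::relaxed,
                                          sycl::ONEAPI::memory_scope::device,
                                          sycl::access::address_space::global_space>(
            a_acc[0]);
        v += b_acc[i];
      });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector float completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
}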
By analyzing these kernels with VTune Profiler, we can measure the impact of native atomic support. You can see that the VectorInt kernel is much faster than VectorDouble and VectorFloat.
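The dynamic instruction data comes from VTune Profiler's GPU hotspots collection. A possible command line, assuming a recent VTune release that provides the instruction-count characterization knob; ./atomics_sample is a placeholder binary name:

vtune -collect gpu-hotspots -knob characterization-mode=instruction-count -- ./atomics_sample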
[Figure: VTune Profiler dynamic instruction count comparison]
VTune Profiler dynamic instruction analysis shows that instruction counts vary dramatically when there is no native atomic support.
Here is the assembly code for our VectorInt kernel.
[Figure: VTune Profiler assembly view for the VectorInt atomic add]
Compare this with the assembly code for VectorDouble: executing the VectorDouble kernel requires 33 million more GPU instructions.
[Figure: VTune Profiler assembly view for the VectorDouble atomic add]
The standard C++ memory model assumes that applications execute on a single device with a single address space. Neither of these assumptions holds for DPC++ applications: different parts of the application execute on different devices (i.e., a host device and one or more accelerator devices); each device has multiple address spaces (i.e., private, local, and global); and the global address space of each device may or may not be disjoint (depending on USM support).
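Because of this, the memory orders available for atomics can differ from device to device. Below is a minimal sketch for discovering them at runtime, assuming the SYCL 2020 device-info query atomic_memory_order_capabilities is available in your DPC++ toolchain:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  // Query which memory orders device atomics support; anything beyond
  // relaxed may be unavailable or carry extra cost on some accelerators.
  auto caps = q.get_device()
                  .get_info<sycl::info::device::atomic_memory_order_capabilities>();
  for (auto cap : caps) {
    if (cap == sycl::memory_order::relaxed) std::cout << "relaxed\n";
    if (cap == sycl::memory_order::acq_rel) std::cout << "acq_rel\n";
    if (cap == sycl::memory_order::seq_cst) std::cout << "seq_cst\n";
  }
  return 0;
}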
When using atomics in the global address space, care must again be taken, because global updates are much slower than local ones.
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  constexpr int N = 256 * 256;
  constexpr int M = 512;
  int total = 0;
  int *a = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++)
    a[i] = 1;
  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> buf(&total, 1);
  sycl::buffer<int> bufa(a, N);
  auto e = q.submit([&](sycl::handler &h) {
    sycl::accessor acc(buf, h);
    sycl::accessor acc_a(bufa, h, sycl::read_only);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id();
      sycl::ONEAPI::atomic_ref<int, sycl::ONEAPI::memory_order_relaxed,
                               sycl::ONEAPI::memory_scope_device,
                               sycl::access::address_space::global_space>
          atomic_op(acc[0]);
      atomic_op += acc_a[i];
    });
  });
  sycl::host_accessor h_a(buf);
  std::cout << "Reduction Sum : " << h_a[0] << "\n";
  std::cout << "Kernel Execution Time of Global Atomics Ref: "
            << e.get_profiling_info<sycl::info::event_profiling::command_end>() -
                   e.get_profiling_info<sycl::info::event_profiling::command_start>()
            << "\n";
  return 0;
}
It is possible to refactor your code to use local memory space as the following example demonstrates.
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  constexpr int N = 256 * 256;
  constexpr int M = 512;
  constexpr int NUM_WG = N / M;
  int total = 0;
  int *a = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++)
    a[i] = 1;
  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> global(&total, 1);
  sycl::buffer<int> bufa(a, N);
  auto e1 = q.submit([&](sycl::handler &h) {
    sycl::accessor b(global, h);
    sycl::accessor acc_a(bufa, h, sycl::read_only);
    auto acc = sycl::accessor<int, 1, sycl::access::mode::read_write,
                              sycl::access::target::local>(NUM_WG, h);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id(0);
      auto group_id = it.get_group(0);
      sycl::ONEAPI::atomic_ref<int, sycl::ONEAPI::memory_order_relaxed,
                               sycl::ONEAPI::memory_scope_device,
                               sycl::access::address_space::local_space>
          atomic_op(acc[group_id]);
      sycl::ONEAPI::atomic_ref<int, sycl::ONEAPI::memory_order_relaxed,
                               sycl::ONEAPI::memory_scope_device,
                               sycl::access::address_space::global_space>
          atomic_op_global(b[0]);
      // Local memory is uninitialized on entry: zero the per-group
      // accumulator before any work-item adds to it.
      if (it.get_local_id() == 0)
        acc[group_id] = 0;
      it.barrier(sycl::access::fence_space::local_space);
      // Accumulate within the work-group using a fast local atomic.
      atomic_op += acc_a[i];
      it.barrier(sycl::access::fence_space::local_space);
      // One work-item per group commits the partial sum with a single
      // global atomic.
      if (it.get_local_id() == 0)
        atomic_op_global += acc[group_id];
    });
  });
  sycl::host_accessor h_global(global);
  std::cout << "Reduction Sum : " << h_global[0] << "\n";
  int total_time =
      (e1.get_profiling_info<sycl::info::event_profiling::command_end>() -
       e1.get_profiling_info<sycl::info::event_profiling::command_start>());
  std::cout << "Kernel Execution Time of Local Atomics : " << total_time << "\n";
  return 0;
}
