Developer Guide

Sub-groups

The index space of an ND-Range kernel is divided into work-groups, sub-groups, and work-items. A work-item is the basic unit. A collection of work-items form a sub-group, and a collection of sub-groups form a work-group. The mapping of work-items and work-groups to hardware execution units (EU) is implementation-dependent. All the work-groups run concurrently but may be scheduled to run at different points in time depending on availability of resources. Work-group execution may or or may not be preempted depending on the capabilities of underlying hardware. Work-items in the same work-group are guaranteed to run concurrently. Work-items in the same sub-group may have additional scheduling guarantees and have access to additional functionality.
A sub-group is a collection of contiguous work-items in the global index space that execute in the same EU thread. When the device compiler compiles the kernel, multiple work-items are packed into a sub-group by vectorization so the generated SIMD instruction stream can perform tasks of multiple work-items simultaneously. Properly partitioning work-items into sub-groups can make a big performance difference.
Let’s start with a simple example illustrating sub-groups:
q.submit([&](auto &h) { sycl::stream out(65536, 256, h); h.parallel_for(sycl::nd_range(sycl::range{32}, sycl::range{32}), [=](sycl::nd_item<1> it) { int groupId = it.get_group(0); int globalId = it.get_global_linear_id(); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; int sgGroupId = sg.get_group_id()[0]; int sgId = sg.get_local_id()[0]; out << "globalId = " << sycl::setw(2) << globalId << " groupId = " << groupId << " sgGroupId = " << sgGroupId << " sgId = " << sgId << " sgSize = " << sycl::setw(2) << sgSize << sycl::endl; }); });
The output of this example may look like this:
Device: Intel(R) Gen12HP globalId = 0 groupId = 0 sgGroupId = 0 sgId = 0 sgSize = 16 globalId = 1 groupId = 0 sgGroupId = 0 sgId = 1 sgSize = 16 globalId = 2 groupId = 0 sgGroupId = 0 sgId = 2 sgSize = 16 globalId = 3 groupId = 0 sgGroupId = 0 sgId = 3 sgSize = 16 globalId = 4 groupId = 0 sgGroupId = 0 sgId = 4 sgSize = 16 globalId = 5 groupId = 0 sgGroupId = 0 sgId = 5 sgSize = 16 globalId = 6 groupId = 0 sgGroupId = 0 sgId = 6 sgSize = 16 globalId = 7 groupId = 0 sgGroupId = 0 sgId = 7 sgSize = 16 globalId = 16 groupId = 0 sgGroupId = 1 sgId = 0 sgSize = 16 globalId = 17 groupId = 0 sgGroupId = 1 sgId = 1 sgSize = 16 globalId = 18 groupId = 0 sgGroupId = 1 sgId = 2 sgSize = 16 globalId = 19 groupId = 0 sgGroupId = 1 sgId = 3 sgSize = 16 globalId = 20 groupId = 0 sgGroupId = 1 sgId = 4 sgSize = 16 globalId = 21 groupId = 0 sgGroupId = 1 sgId = 5 sgSize = 16 globalId = 22 groupId = 0 sgGroupId = 1 sgId = 6 sgSize = 16 globalId = 23 groupId = 0 sgGroupId = 1 sgId = 7 sgSize = 16 globalId = 8 groupId = 0 sgGroupId = 0 sgId = 8 sgSize = 16 globalId = 9 groupId = 0 sgGroupId = 0 sgId = 9 sgSize = 16 globalId = 10 groupId = 0 sgGroupId = 0 sgId = 10 sgSize = 16 globalId = 11 groupId = 0 sgGroupId = 0 sgId = 11 sgSize = 16 globalId = 12 groupId = 0 sgGroupId = 0 sgId = 12 sgSize = 16 globalId = 13 groupId = 0 sgGroupId = 0 sgId = 13 sgSize = 16 globalId = 14 groupId = 0 sgGroupId = 0 sgId = 14 sgSize = 16 globalId = 15 groupId = 0 sgGroupId = 0 sgId = 15 sgSize = 16 globalId = 24 groupId = 0 sgGroupId = 1 sgId = 8 sgSize = 16 globalId = 25 groupId = 0 sgGroupId = 1 sgId = 9 sgSize = 16 globalId = 26 groupId = 0 sgGroupId = 1 sgId = 10 sgSize = 16 globalId = 27 groupId = 0 sgGroupId = 1 sgId = 11 sgSize = 16 globalId = 28 groupId = 0 sgGroupId = 1 sgId = 12 sgSize = 16 globalId = 29 groupId = 0 sgGroupId = 1 sgId = 13 sgSize = 16 globalId = 30 groupId = 0 sgGroupId = 1 sgId = 14 sgSize = 16 globalId = 31 groupId = 0 sgGroupId = 1 sgId = 15 sgSize = 16
Each sub-group in this example has 16 work-items or the sub-group size is 16. This means each thread simultaneously executes 16 work-items and 32 work-items are executed by two EU threads.
By default, the compiler selects a sub-group size using device-specific information and a few heuristics. The user can override the compiler’s selection using a kernel attribute
intel::reqd_sub_group_size
to specify the maximum sub-group size. Sometimes, not always, explicitly requesting a sub-group size may help performance.
q.submit([&](auto &h) { sycl::stream out(65536, 256, h); h.parallel_for( sycl::nd_range(sycl::range{32}, sycl::range{32}), [= ](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(32)]] { int groupId = it.get_group(0); int globalId = it.get_global_linear_id(); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; int sgGroupId = sg.get_group_id()[0]; int sgId = sg.get_local_id()[0]; out << "globalId = " << sycl::setw(2) << globalId << " groupId = " << groupId << " sgGroupId = " << sgGroupId << " sgId = " << sgId << " sgSize = " << sycl::setw(2) << sgSize << sycl::endl; }); });
The output will be:
Device: Intel(R) Gen12HP globalId = 0 groupId = 0 sgGroupId = 0 sgId = 0 sgSize = 32 globalId = 1 groupId = 0 sgGroupId = 0 sgId = 1 sgSize = 32 globalId = 2 groupId = 0 sgGroupId = 0 sgId = 2 sgSize = 32 globalId = 3 groupId = 0 sgGroupId = 0 sgId = 3 sgSize = 32 globalId = 4 groupId = 0 sgGroupId = 0 sgId = 4 sgSize = 32 globalId = 5 groupId = 0 sgGroupId = 0 sgId = 5 sgSize = 32 globalId = 6 groupId = 0 sgGroupId = 0 sgId = 6 sgSize = 32 globalId = 7 groupId = 0 sgGroupId = 0 sgId = 7 sgSize = 32 globalId = 8 groupId = 0 sgGroupId = 0 sgId = 8 sgSize = 32 globalId = 9 groupId = 0 sgGroupId = 0 sgId = 9 sgSize = 32 globalId = 10 groupId = 0 sgGroupId = 0 sgId = 10 sgSize = 32 globalId = 11 groupId = 0 sgGroupId = 0 sgId = 11 sgSize = 32 globalId = 12 groupId = 0 sgGroupId = 0 sgId = 12 sgSize = 32 globalId = 13 groupId = 0 sgGroupId = 0 sgId = 13 sgSize = 32 globalId = 14 groupId = 0 sgGroupId = 0 sgId = 14 sgSize = 32 globalId = 15 groupId = 0 sgGroupId = 0 sgId = 15 sgSize = 32 globalId = 16 groupId = 0 sgGroupId = 0 sgId = 16 sgSize = 32 globalId = 17 groupId = 0 sgGroupId = 0 sgId = 17 sgSize = 32 globalId = 18 groupId = 0 sgGroupId = 0 sgId = 18 sgSize = 32 globalId = 19 groupId = 0 sgGroupId = 0 sgId = 19 sgSize = 32 globalId = 20 groupId = 0 sgGroupId = 0 sgId = 20 sgSize = 32 globalId = 21 groupId = 0 sgGroupId = 0 sgId = 21 sgSize = 32 globalId = 22 groupId = 0 sgGroupId = 0 sgId = 22 sgSize = 32 globalId = 23 groupId = 0 sgGroupId = 0 sgId = 23 sgSize = 32 globalId = 24 groupId = 0 sgGroupId = 0 sgId = 24 sgSize = 32 globalId = 25 groupId = 0 sgGroupId = 0 sgId = 25 sgSize = 32 globalId = 26 groupId = 0 sgGroupId = 0 sgId = 26 sgSize = 32 globalId = 27 groupId = 0 sgGroupId = 0 sgId = 27 sgSize = 32 globalId = 28 groupId = 0 sgGroupId = 0 sgId = 28 sgSize = 32 globalId = 29 groupId = 0 sgGroupId = 0 sgId = 29 sgSize = 32 globalId = 30 groupId = 0 sgGroupId = 0 sgId = 30 sgSize = 32 globalId = 31 groupId = 0 sgGroupId = 0 sgId = 31 sgSize = 32
The valid sub-group sizes are device dependent. You can query the device to get this information:
std::cout << "Sub-group Sizes: "; for (const auto &s : q.get_device().get_info<sycl::info::device::sub_group_sizes>()) { std::cout << s << " "; } std::cout << std::endl;
The valid sub-group sizes supported may be:
Device: Intel(R) Gen12HP Subgroup Sizes: 8 16 32
Next, we will show how to use sub-groups to improve performance.

Vectorization and Memory Access

The Intel
®
graphics device has multiple EUs. Each EU is a multithreaded SIMD processor. The compiler generates SIMD instructions to pack multiple work-items in a sub-group to be executed simultaneously in an EU thread. The SIMD width (thus the sub-group size), selected by the compiler based on device characteristics and heuristics or requested explicitly by the kernel, can be 8, 16, or 32.
Given a SIMD width, maximizing SIMD lane utilization gives optimal instruction performance. If one or more lanes (or kernel instances or work items) diverge, the thread executes both branch paths before the paths merge later, increasing dynamic instruction count. SIMD divergence negatively impacts performance. The compiler works hard to optimize divergence, but still it helps to avoid divergence in the source code, if possible.
How memory is accessed in work-items affects how memory is accessed in the sub-group or how the SIMD lanes are utilized. Accessing contiguous memory in a work-item is often not optimal. For example:
constexpr int N = 1024 * 1024; int *data = sycl::malloc_shared<int>(N, q); auto e = q.submit([&](auto &h) { h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); i = i * 16; for (int j = i; j < (i + 16); j++) { data[j] = -1; } }); }); q.wait();
This simple kernel initializes an array of 1024 x 1024 integers. Each work-item initializes 16 contiguous integers. Assuming the sub-group size chosen by the compiler is 16, 256 integers are initialized in each sub-group or thread. However, the stores in 16 SIMD lanes are scattered.
Instead of initializing 16 contiguous integers in a work-item, initializing 16 contiguous integers in one SIMD instruction is more efficient.
constexpr int N = 1024 * 1024; int *data = sycl::malloc_shared<int>(N, q); auto e = q.submit([&](auto &h) { h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; i = (i / sgSize) * sgSize * 16 + (i % sgSize); for (int j = 0; j < sgSize * 16; j += sgSize) { data[i + j] = -1; } }); });
We use memory writes in our examples, but the same technique is applicable to memory reads as well.
constexpr int N = 1024 * 1024; int *data = sycl::malloc_shared<int>(N, q); int *data2 = sycl::malloc_shared<int>(N, q); memset(data2, 0xFF, sizeof(int) * N); auto e = q.submit([&](auto &h) { h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); i = i * 16; for (int j = i; j < (i + 16); j++) { data[j] = data2[j]; } }); });
This kernel copies an array of 1024 x 1024 integers to another integer array of the same size. Each work-item copies 16 contiguous integers. However, the reads from
data2
are gathered and stores to
data
are scattered. It will be more efficient if we change the code to read and store contiguous integers in each sub-group instead of each work-item.
constexpr int N = 1024 * 1024; int *data = sycl::malloc_shared<int>(N, q); int *data2 = sycl::malloc_shared<int>(N, q); memset(data2, 0xFF, sizeof(int) * N); auto e = q.submit([&](auto &h) { h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; i = (i / sgSize) * sgSize * 16 + (i % sgSize); for (int j = 0; j < sgSize * 16; j += sgSize) { data[i + j] = data2[i + j]; } }); });
Intel
®
graphics have instructions optimized for memory block loads/stores. So if work-items in a sub-group access a contiguous block of memory, we can use the sub-group block access functions to take advantage of these block load/store instructions.
constexpr int N = 1024 * 1024; int *data = sycl::malloc_shared<int>(N, q); int *data2 = sycl::malloc_shared<int>(N, q); memset(data2, 0xFF, sizeof(int) * N); auto e = q.submit([&](auto &h) { h.parallel_for( sycl::nd_range(sycl::range{N / 16}, sycl::range{32}), [= ](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] { sycl::ONEAPI::sub_group sg = it.get_sub_group(); sycl::vec<int, 8> x; using global_ptr = sycl::multi_ptr<int, sycl::access::address_space::global_space>; int base = (it.get_group(0) * 32 + sg.get_group_id()[0] * sg.get_local_range()[0]) * 16; x = sg.load<8>(global_ptr(&(data2[base + 0]))); sg.store<8>(global_ptr(&(data[base + 0])), x); x = sg.load<8>(global_ptr(&(data2[base + 128]))); sg.store<8>(global_ptr(&(data[base + 128])), x); }); });
You probably noticed that we explicitly requested the sub-group size 16. When you use sub-group functions, it is always good to override the compiler choice to make sure the sub-group size always matches what you expect. Please also note that, at the time of writing, block load/store does not work with sub-group size 32 on current Intel
®
hardware. So the group size explicitly requested must be 16 or smaller.

Data Sharing

Because the work-items in a sub-group execute in the same thread, it is more efficient to share data between work-items, even if the data is private to each work-item. Sharing data in a sub-group is more efficient than sharing data in a work-group using shared local memory, or SLM. One way to share data among work-items in a sub-group is to use shuffle functions.
constexpr size_t blockSize = 16; sycl::buffer<unsigned int, 2> m(matrix.data(), sycl::range<2>(N, N)); auto e = q.submit([&](auto &h) { sycl::accessor marr(m, h); sycl::accessor<unsigned int, 2, sycl::access::mode::read_write, sycl::access::target::local> barr1(sycl::range<2>(blockSize, blockSize), h); sycl::accessor<unsigned int, 2, sycl::access::mode::read_write, sycl::access::target::local> barr2(sycl::range<2>(blockSize, blockSize), h); h.parallel_for( sycl::nd_range<2>(sycl::range<2>(N / blockSize, N), sycl::range<2>(1, blockSize)), [=](sycl::nd_item<2> it) [[intel::reqd_sub_group_size(16)]] { int gi = it.get_group(0); int gj = it.get_group(1); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgId = sg.get_local_id()[0]; unsigned int bcol[blockSize]; int ai = blockSize * gi; int aj = blockSize * gj; for (int k = 0; k < blockSize; k++) { bcol[k] = sg.load(marr.get_pointer() + (ai + k) * N + aj); } unsigned int tcol[blockSize]; for (int n = 0; n < blockSize; n++) { if (sgId == n) { for (int k = 0; k < blockSize; k++) { tcol[k] = sg.shuffle(bcol[n], k); } } } for (int k = 0; k < blockSize; k++) { sg.store(marr.get_pointer() + (ai + k) * N + aj, tcol[k]); } }); });
This kernel transposes a 16 x 16 matrix. It looks more complicated than the previous examples, but the idea is simple: a sub-group loads a 16 x 16 sub-matrix, then the sub-matrix is transposed using the sub-group shuffle functions. There is only one sub-matrix and the sub-matrix is the matrix so only one sub-group is needed. A bigger matrix, say 4096 x 4096, can be transposed using the same technique: each sub-group loads a sub-matrix, then the sub-matrices are transposed using the sub-group shuffle functions. We leave this to the reader as an exercise.
There are multiple variants of sub-group shuffle functions available in DPC++. Each variant is optimized for its specific purpose on the specific device. It is always a good idea to use these optimized functions (if they fit your needs) instead of creating your own.

Sub-group Size vs. Maximum Sub-group Size

So far in our examples, the work-group size is divisible by the sub-group size and both the work-group size and the sub-group size (either required by the user or automatically picked by the compiler are powers of two). The sub-group size and maximum sub-group size are the same if the work-group size is divisible by the maximum sub-group size and both sizes are powers of two. But what happens if the work-group size is not divisible by the sub-group size? Consider the following example:
auto e = q.submit([&](auto &h) { sycl::stream out(65536, 128, h); h.parallel_for( sycl::nd_range<1>(7, 7), [= ](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(8)]] { int i = it.get_global_linear_id(); sycl::ONEAPI::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; int sgMaxSize = sg.get_max_local_range()[0]; int sId = sg.get_local_id()[0]; int j = data[i]; int k = data[i + sgSize]; out << "globalId = " << i << " sgMaxSize = " << sgMaxSize << " sgSize = " << sgSize << " sId = " << sId << " j = " << j << " k = " << k << sycl::endl; }); }); q.wait();
The output of this example looks like this:
globalId = 0 sgMaxSize = 8 sgSize = 7 sId = 0 j = 0 k = 7 globalId = 1 sgMaxSize = 8 sgSize = 7 sId = 1 j = 1 k = 8 globalId = 2 sgMaxSize = 8 sgSize = 7 sId = 2 j = 2 k = 9 globalId = 3 sgMaxSize = 8 sgSize = 7 sId = 3 j = 3 k = 10 globalId = 4 sgMaxSize = 8 sgSize = 7 sId = 4 j = 4 k = 11 globalId = 5 sgMaxSize = 8 sgSize = 7 sId = 5 j = 5 k = 12 globalId = 6 sgMaxSize = 8 sgSize = 7 sId = 6 j = 6 k = 13
The sub-group size is seven, though the maximum sub-group size is still eight! The maximum sub-group size is actually the SIMD width so it does not change, but there are less than eight work-items in the sub-group, so the sub-group size is seven. So be careful when your work-group size is not divisible by the maximum sub-group size. The last sub-group with fewer work-items may need to be specially handled.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.