A sub-group represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.
Sub-groups have been part of the OpenCL execution model since OpenCL 2.0, but many important functions are missing: several hardware features are exposed only as vendor-specific extensions, and functions common in other programming models are not exposed at all. This extension defines SYCL syntax and semantics for the core OpenCL functionality, while it also seeks to expose some of these missing functions.
When a feature is common to both OpenCL and C++, this extension opts for C++-style naming:
Providing a generic group abstraction encapsulating the shared functionality of all synchronizable SYCL groups (that is, work-groups and sub-groups) in a single interface will enable users to write more general code and simplify the introduction of additional SYCL groups in the future (for example, device-wide synchronization groups). Some names in this extension demonstrate how this might look:
Many aspects of sub-group behavior are implementation-defined and/or device-specific. In order to maximize the portability of code written to use the sub-group class, all functions are supported for the fundamental standard scalar types that are supported by SYCL (see Section 6.5 of the SYCL 1.2.1 specification): bool, char, signed char, unsigned char, short int, unsigned short int, int, unsigned int, long int, unsigned long int, long long int, unsigned long long int, size_t, float, double, half.
In keeping with Section 6.7 of the SYCL 1.2.1 specification, attributes are made available as a C++11 attribute specifier in the cl namespace, and the attributes of a kernel are the sum of all the kernel attributes of all device functions called. Attribute names are prefixed with intel to denote that they are Intel extensions.
The [[cl::intel_reqd_sub_group_size(n)]] attribute indicates that the kernel must be compiled and executed with a sub-group of size n. The value of n must be set to a sub-group size supported by the device, or device compilation will fail.
In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object and lambda functions, as in the examples below:
class Functor { void operator()(item<1> item) [[cl::intel_reqd_sub_group_size(16)]] { /* kernel code */ } } kernel<class kernel_name>( []() [[cl::intel_reqd_sub_group_size(n)]] { /* kernel code */ });
Under the OpenCL execution model (see Section 3.2.2 of the OpenCL 2.2 specification), several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel), and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels.
To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
Device descriptor | Return type | Description |
---|---|---|
info::device::max_num_sub_groups | cl_uint | Returns the maximum number of sub-groups in a work-group for any kernel executed on the device. The minimum value is 1. |
info::device::sub_group_independent_forward_progress | bool | Returns true if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group. |
info::device::sub_group_sizes | vector_class<size_t> | Returns a vector_class of size_t containing the set of sub-group sizes supported by the device. |
Member function | Description |
---|---|
template <info::kernel_sub_group param>typename info::param_traits<info::kernel_sub_group, param>::return_type get_sub_group_info(const device &dev) const | Queries information from the sub-group from a kernel using the info::kernel_sub_group descriptor for a specific device. |
template <info::kernel_sub_group param>typename info::param_traits<info::kernel_sub_group, param>::return_type get_sub_group_info(const device &dev, typename info::param_traits<info::kernel_sub_group, param>::input_type value) const | Queries information from the sub-group from a kernel using the info::kernel_sub_group descriptor for a specific device and input parameter. The expected value of the input parameter depends on the information being queried. |
Kernel descriptor | Input type | Return type | Description |
---|---|---|---|
info::kernel_sub_group::max_sub_group_size_for_ndrange | range<D> | uint32_t | Returns the maximum sub-group size for the specified work-group size. |
info::kernel_sub_group::sub_group_count_for_ndrange | range<D> | uint32_t | Returns the number of sub-groups for the specified work-group size. |
info::kernel_sub_group::local_size_for_sub_group_count | size_t | range<D> | Returns a work-group size that will contain the specified number of sub-groups. |
info::kernel_sub_group::max_num_sub_groups | N/A | uint32_t | Returns the maximum number of sub-groups for this kernel. |
info::kernel_sub_group::compile_num_sub_groups | N/A | uint32_t | Returns the number of sub-groups specified by the kernel, or 0 (if not specified). |
info::kernel_sub_group::compile_sub_group_size | N/A | size_t | Returns the required sub-group size specified by the kernel, or 0 (if not specified). |
The sub_group class encapsulates all functionality required to represent a particular sub-group within a parallel execution. It is not user-constructable, and it can be accessed only via the nd_item class.
Member function | Description |
---|---|
sub_group get_sub_group() const | Returns the sub-group to which the work-item belongs. |
The following shows example usage of the sub_group class.
parallel_for<class kernel>(..., [&](nd_item item) { sub_group sg = item.get_sub_group(); for (int v = sg.get_local_id(); v < N; v += sg.get_local_range()) { ... } });
With the exception of the common interface members, all member functions of the sub_group class are sub-group functions. Sub-group functions synchronize all work-items in a sub-group (that is, they act as sub-group barriers) and must therefore be encountered within converged control flow across all work-items in the sub-group. All the work-items of a sub-group must execute the sub-group function before any are allowed to continue execution beyond the sub-group function.
Each sub-group function applies only to the work-items within a single sub-group. Communication between multiple sub-groups requires the use of work-group functions, or reads/writes from/to memory with appropriate work-group barriers and/or memory fences.
The sub-group functions comprise a core set of functions that should ideally be supported by all implementations and that have a clear mapping to all devices. The vast majority of these functions have an equivalent in specifications such as OpenCL and SPIR, and the semantics defined here are intended to be compatible. Additional, highly specialized, sub-group functions should be relegated to vendor- or device-specific extensions.
The following functions compose the core functionality for NDRange parallelism:
The following functions compose the extended functionality for NDRange parallelism:
namespace cl { namespace sycl { namespace intel { struct sub_group { /* --- common interface members --- */ id<1> get_local_id() const; range<1> get_local_range() const; range<1> get_max_local_range() const; id<1> get_group_id() const; uint32_t get_group_range() const; uint32_t get_uniform_group_range() const; /* --- vote/ballot functions --- */ bool any(bool predicate) const; bool all(bool predicate) const; /* --- data-sharing --- */ template <typename T> T broadcast(T x, id<1> local_id) const; template <typename T, class BinaryOp> T reduce(T x, BinaryOp binary_op) const; template <typename T, class BinaryOp> T reduce(T x, T init, BinaryOp binary_op) const; template <typename T, class BinaryOp> T exclusive_scan(T x, BinaryOp binary_op) const; template <typename T, class BinaryOp> T exclusive_scan(T x, T init, BinaryOp binary_op) const; template <typename T, class BinaryOp> T inclusive_scan(T x, BinaryOp binary_op) const; template <typename T, class BinaryOp> T inclusive_scan(T x, BinaryOp binary_op, T init) const; /* --- one-input shuffles --- */ template <typename T> T shuffle(T x, id<1> local_id) const; template <typename T> T shuffle_down(T x, uint32_t delta) const; template <typename T> T shuffle_up(T x, uint32_t delta) const; template <typename T> T shuffle_xor(T x, id<1> value) const; /* --- two-input shuffles --- */ template <typename T> T shuffle(T x, T y, id<1> local_id) const; template <typename T> T shuffle_down(T current, T next, uint32_t delta) const; template <typename T> T shuffle_up(T previous, T current, uint32_t delta) const; /* --- sub-group load/stores --- */ template <typename T, access::address_space Space> T load(const multi_ptr<T,Space> src) const; template <typename T, int N, access::address_space Space> vec<T,N> load(const multi_ptr<T,Space> src) const; template <typename T, int N, access::address_space Space> void store(multi_ptr<T,Space> dst, const T& x) const; template <typename T, int N, access::address_space Space> void store(multi_ptr<T,Space> dst, const vec<T,N>& x) const; }; } // intel } // sycl } // cl