Developer Guide

Specify a Work-Group Size

Specify a maximum or required work-group size whenever possible. The Intel® oneAPI DPC++/C++ Compiler relies on this specification to optimize hardware use for the DPC++ kernel without generating excess logic.
  • If you do not specify the [[intelfpga::max_work_group_size(Z, Y, X)]] or [[cl::reqd_work_group_size(Z, Y, X)]] attribute in your kernel, the work-group size assumes a default value that depends on compile-time and runtime constraints.
  • If your kernel contains a barrier, the Intel® oneAPI DPC++/C++ Compiler sets a default maximum scalarized work-group size of 128 work-items.
  • If your kernel does not query any DPC++ intrinsics that allow different threads to behave differently (that is, local or global thread IDs, or the work-group ID), the Intel® oneAPI DPC++/C++ Compiler infers a single-threaded execution mode and sets the maximum work-group size to (1, 1, 1). In this case, the DPC++ runtime also enforces a global enqueue size of (1, 1, 1), and the compiler enables loop pipelining optimizations.
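The default rules above can be summarized as a small decision function. The following is only an illustrative sketch in plain C++, not compiler code: KernelTraits, SizeRule, classify, and implied_max_work_items are hypothetical names, and the order in which the rules are checked is an assumption, because the text above does not state which rule wins when several apply.

```cpp
#include <cstddef>

// Hypothetical summary of the kernel properties that the rules above depend on.
struct KernelTraits {
    bool has_size_attribute;     // max_work_group_size or reqd_work_group_size is present
    bool queries_work_item_ids;  // reads local/global IDs or the work-group ID
    bool has_barrier;            // contains a barrier
};

enum class SizeRule {
    FromAttribute,   // the attribute values apply
    SingleWorkItem,  // maximum work-group size of (1, 1, 1)
    BarrierDefault,  // default maximum of 128 work-items
    OtherDefault     // depends on compile-time and runtime constraints
};

// Assumption: the checks are ordered attribute, then single-threaded
// inference, then barrier default. The guide does not specify a precedence.
SizeRule classify(const KernelTraits& k) {
    if (k.has_size_attribute) return SizeRule::FromAttribute;
    if (!k.queries_work_item_ids) return SizeRule::SingleWorkItem;
    if (k.has_barrier) return SizeRule::BarrierDefault;
    return SizeRule::OtherDefault;
}

// Maximum number of work-items implied by a rule.
// Returns 0 when the rules above do not fix a number.
std::size_t implied_max_work_items(SizeRule r) {
    switch (r) {
        case SizeRule::SingleWorkItem: return 1;
        case SizeRule::BarrierDefault: return 128;
        default: return 0;
    }
}
```

For example, under these assumptions a kernel that uses a barrier and reads its global ID, but carries no size attribute, is classified as BarrierDefault with an implied maximum of 128 work-items.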
To specify the work-group size, modify your kernel code in the following manner:
  • To specify the maximum number of work-items that the compiler provisions for a work-group in a kernel, insert the [[intelfpga::max_work_group_size(Z, Y, X)]] attribute in your kernel source code.
    For example:
        constexpr unsigned MAX_WG_SIZE = 4;
        ...
        cgh.parallel_for<class kernelCompute>(
            nd_range<1>(range<1>(N), range<1>(wg_size)),
            // wg_size (set in the elided code) must not exceed MAX_WG_SIZE
            [=](nd_item<1> it) [[intelfpga::max_work_group_size(1, 1, MAX_WG_SIZE)]] {
              auto gid = it.get_global_id(0);
              accessorRes[gid] = accessorIdx[gid] * 2;
            });
  • To specify the required number of work-items that the Intel® oneAPI DPC++/C++ Compiler provisions for a work-group in a kernel, insert the [[cl::reqd_work_group_size(Z, Y, X)]] attribute in your kernel source code.
    For example:
        [[cl::reqd_work_group_size(1, 1, 64)]]
        void sum(cl::sycl::nd_item<1> item,
                 accessor<float, 1, access::mode::read, access::target::global_buffer> a,
                 accessor<int, 1, access::mode::read, access::target::global_buffer> b,
                 // answer is written by the kernel, so it needs write access
                 accessor<int, 1, access::mode::write, access::target::global_buffer> answer) {
          auto gid = item.get_global_id(0);
          answer[gid] = a[gid] + b[gid];
        }
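Both attributes constrain the work-group size that the host passes in the nd_range at launch. The following host-side sketch in plain C++ (no SYCL headers) shows how a launch configuration could be checked before submission. Dim3, divides_evenly, fits_max, and matches_required are hypothetical helper names, and treating the maximum as a per-dimension limit is an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>

// Sizes ordered (Z, Y, X), matching the attribute argument order.
using Dim3 = std::array<std::size_t, 3>;

// An nd_range is well formed only if every local extent is nonzero
// and divides the corresponding global extent.
bool divides_evenly(const Dim3& global, const Dim3& local) {
    for (std::size_t d = 0; d < 3; ++d) {
        if (local[d] == 0 || global[d] % local[d] != 0) return false;
    }
    return true;
}

// max_work_group_size: the launched size may be smaller than the limit.
// Assumption: the limit applies to each dimension separately.
bool fits_max(const Dim3& local, const Dim3& max_size) {
    for (std::size_t d = 0; d < 3; ++d) {
        if (local[d] > max_size[d]) return false;
    }
    return true;
}

// reqd_work_group_size: the launched size must match exactly.
bool matches_required(const Dim3& local, const Dim3& required) {
    return local == required;
}
```

With the values from the examples above, a local size of (1, 1, 4) fits a maximum of (1, 1, 4) but (1, 1, 8) does not, and a kernel that requires (1, 1, 64) accepts a global size of (1, 1, 1024) only with a local size of exactly (1, 1, 64).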
