• 2019 Update 4
  • 03/20/2019
  • Public Content

Work-Group Size Considerations

It is recommended to let the OpenCL™ implementation automatically determine the optimal work-group size for a kernel: pass NULL as the pointer to the work-group size (the local_work_size argument) when calling clEnqueueNDRangeKernel.
If you want to experiment with work-group size, you need to consider the following:
  • To get the best performance from the vectorization module (see the "Benefitting from Implicit Vectorization" section), the work-group size must be larger than 8 or a multiple of 8.
  • To reduce the overhead of maintaining a work-group, create work-groups that are as large as possible, which means 64 or more work-items. One upper bound is the size of the accessed data set: it is better for a single work-group not to exceed the size of the L1 cache. There should also be a sufficient number of work-groups; see the "Work-Group Level Parallelism" section for more information.
  • To accommodate multiple architectures, query the device for the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo, and set the work-group size accordingly.
  • If your kernel code contains the barrier instruction, choosing the work-group size becomes a tradeoff. The more local and private memory each work-item in the work-group requires, the smaller the optimal work-group size is. The reason is that a barrier also issues copy instructions for the total amount of private and local memory used by all work-items in the work-group, since the state of each work-item that arrived at the barrier is saved before proceeding with another work-item.
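The considerations above can be combined into a small host-side helper. The following is a minimal sketch in plain C, not a definitive implementation: the function names are hypothetical, the per-work-item byte count and the L1 cache size are estimates supplied by the caller, and the two limits are assumed to come from clGetKernelWorkGroupInfo (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE). The sketch caps the work-group size so that the data touched by one work-group fits in the assumed L1 budget, then rounds it down to the preferred multiple.

```c
#include <stddef.h>

/* Largest multiple of `multiple` that is <= `limit`; 0 if there is none. */
size_t round_down_to_multiple(size_t limit, size_t multiple)
{
    if (multiple == 0)
        return 0;
    return (limit / multiple) * multiple;
}

/* Smallest multiple of `multiple` that is >= `n`. */
size_t round_up_to_multiple(size_t n, size_t multiple)
{
    if (multiple == 0)
        return n;
    return ((n + multiple - 1) / multiple) * multiple;
}

/*
 * Hypothetical helper: pick a 1D work-group size.
 *
 *   preferred_multiple  value of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
 *   max_wg_size         value of CL_KERNEL_WORK_GROUP_SIZE
 *   bytes_per_item      caller's estimate of the data one work-item touches
 *                       (for kernels with barriers, include its private and
 *                       local memory); 0 disables the cache-based cap
 *   l1_bytes            assumed L1 data cache size (an assumption, not queried)
 *
 * Returns 0 when no size satisfies the constraints. In that case the caller
 * can pass NULL as local_work_size and let the implementation decide.
 */
size_t pick_work_group_size(size_t preferred_multiple, size_t max_wg_size,
                            size_t bytes_per_item, size_t l1_bytes)
{
    size_t limit = max_wg_size;

    if (bytes_per_item > 0) {
        /* Keep the whole work-group's data set within the L1 budget. */
        size_t cache_limit = l1_bytes / bytes_per_item;
        if (cache_limit < limit)
            limit = cache_limit;
    }
    return round_down_to_multiple(limit, preferred_multiple);
}
```

With a non-zero result `local`, the global size would typically be padded with `round_up_to_multiple(n, local)`, because OpenCL 1.x requires the global size to be evenly divisible by the work-group size; the kernel then needs a bounds check so that the padding work-items do nothing. Whether the chosen size is actually faster than the implementation's default should be confirmed by measurement.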