Work-Group Size Considerations
- To get the best performance from the vectorization module (see the Benefitting from Implicit Vectorization section), make the work-group size at least as large as, and a multiple of, 4, 8, or 16, depending on the SIMD width supported by the CPU. Otherwise, the runtime can wrongly guess a work-group size of one, which results in running the scalar code for the kernel.
- To accommodate multiple architectures, query the device for the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo and set the work-group size accordingly.
- To reduce the overhead of maintaining a work-group, create work-groups that are as large as possible, which means 64 or more work-items. One upper bound is the size of the accessed data set: the data touched by a single work-group should preferably fit in the L1 cache. There should also be a sufficient number of work-groups; see the Work-Group Level Parallelism section for more information.
- If your kernel code contains the barrier instruction, the choice of work-group size becomes a tradeoff. The more local and private memory each work-item in the work-group requires, the smaller the optimal work-group size is. The reason is that a barrier also issues copy instructions for the total amount of private and local memory used by all work-items in the work-group, since the state of each work-item that arrives at the barrier is saved before proceeding with another work-item. Make the work-group size a multiple of 4, 8, or 16; otherwise the scalar version of the resulting code might execute.
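The sizing rules above can be sketched as a small host-side helper. This is a minimal illustration, not part of any runtime API: the function names `pick_work_group_size` and `round_up_to_multiple` are hypothetical, and the sketch assumes the global size must be evenly divisible by the work-group size (as in OpenCL 1.x). In a real application, the `multiple` argument would typically come from `clGetKernelWorkGroupInfo` queried with `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`, and `max_size` would be an upper bound you choose, for example one derived from the L1 cache consideration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of `multiple`.
 * Useful for padding the global size when no suitable work-group size
 * divides it. */
size_t round_up_to_multiple(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Hypothetical helper: return the largest work-group size that
 *   - does not exceed max_size or global_size,
 *   - is a multiple of `multiple` (e.g. 4, 8, or 16), and
 *   - evenly divides global_size (assumed OpenCL 1.x requirement).
 * Returns 0 when no such size exists; the caller can then pad the
 * global size with round_up_to_multiple() and try again. */
size_t pick_work_group_size(size_t global_size, size_t multiple, size_t max_size)
{
    assert(multiple > 0);

    size_t limit = max_size < global_size ? max_size : global_size;
    size_t candidate = limit - limit % multiple; /* round down to a multiple */

    /* candidate is always a multiple of `multiple`, so the subtraction
     * below cannot wrap around before the loop condition fails. */
    while (candidate >= multiple) {
        if (global_size % candidate == 0)
            return candidate;
        candidate -= multiple;
    }
    return 0;
}
```

If the helper returns 0, one option is to pad the global size to a multiple of the preferred value and guard the extra work-items inside the kernel with a bounds check. The helper prefers the largest admissible size, in line with the advice to keep work-groups large; kernels that use barriers with heavy private or local memory use may do better with a smaller `max_size`.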