If your kernel uses local memory and/or barriers, the number of work-groups that can run simultaneously on a single Intel® Graphics sub-slice is limited by the following key factors:
- There are 16 barrier registers per sub-slice, so no more than 16 work-groups can be executed simultaneously.
- There is 64KB of shared local memory (SLM) available per sub-slice. For example, if a work-group requires 32KB of shared local memory, only 2 such work-groups can run concurrently, regardless of work-group size.
Therefore, to keep device utilization high with a limited number of work-groups, larger work-group sizes are required. Use power-of-two work-group sizes between 64 and 256.
The number of sub-slices depends on the hardware generation and specific product. Refer to the See Also section for the details of the architecture.
The minimum SLM allocation size is 4KB per work-group, so even if your kernel requires fewer bytes per work-group, the actual allocation is still 4KB. To accommodate many potential execution scenarios, try to keep local memory usage within the optimal value of 4KB per work-group. Also note that the granularity of SLM allocation is 1KB.
If your kernel does not use local memory or barriers, these restrictions do not apply, and a work-group size of 32 work-items is optimal in most cases.
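The limits above can be combined into a quick estimate of how many work-groups fit on one sub-slice. The following is a minimal sketch in C, not a vendor API: the helper names `slm_allocation_bytes` and `max_concurrent_work_groups` are hypothetical, the constants are the figures quoted in this section, and the sketch assumes that a kernel requesting no SLM reserves none and is limited only by the barrier registers.

```c
#include <stddef.h>

/* Per-sub-slice figures quoted in this section. */
#define BARRIERS_PER_SUBSLICE   16u
#define SLM_BYTES_PER_SUBSLICE  (64u * 1024u)
#define SLM_MIN_ALLOC_BYTES     (4u * 1024u)
#define SLM_GRANULARITY_BYTES   1024u

/* Hypothetical helper: SLM reserved for one work-group, given the
 * number of bytes the kernel requests. */
size_t slm_allocation_bytes(size_t requested_bytes)
{
    if (requested_bytes == 0)
        return 0; /* assumption: no SLM requested means none is reserved */

    /* Round up to the 1KB allocation granularity. */
    size_t rounded = (requested_bytes + SLM_GRANULARITY_BYTES - 1)
                     / SLM_GRANULARITY_BYTES * SLM_GRANULARITY_BYTES;

    /* Apply the 4KB minimum allocation. */
    return rounded < SLM_MIN_ALLOC_BYTES ? SLM_MIN_ALLOC_BYTES : rounded;
}

/* Hypothetical helper: upper bound on work-groups resident on one
 * sub-slice for a kernel that uses local memory and/or barriers.
 * A result of 0 means the request exceeds the SLM of a sub-slice. */
size_t max_concurrent_work_groups(size_t requested_bytes)
{
    size_t alloc = slm_allocation_bytes(requested_bytes);
    if (alloc == 0)
        return BARRIERS_PER_SUBSLICE; /* only the barrier limit applies */

    size_t by_slm = SLM_BYTES_PER_SUBSLICE / alloc;
    return by_slm < BARRIERS_PER_SUBSLICE ? by_slm : BARRIERS_PER_SUBSLICE;
}
```

With these assumptions, a request of 32KB yields 2 concurrent work-groups, matching the example above, while any request of 4KB or less yields the barrier-bound maximum of 16. This is why 4KB per work-group is the value to aim for.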
Try different local sizes to find the value that provides the best performance. You can also pass NULL as the local_work_size argument of clEnqueueNDRangeKernel(), which lets the system choose the work-group size.
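When experimenting with explicit local sizes, it can help to enumerate the candidates up front. The sketch below is a hypothetical helper (`candidate_local_sizes` is not part of OpenCL) that lists the power-of-two sizes in a given range that evenly divide a one-dimensional global size. It assumes the OpenCL 1.x rule that the global size must be a multiple of the local size when a local size is specified. Each candidate would then be timed with the real kernel.

```c
#include <stddef.h>

/* Hypothetical helper: write to `out` every power-of-two local size in
 * [min_size, max_size] that evenly divides `global_size`, up to
 * `capacity` entries. Returns the number of entries written. */
size_t candidate_local_sizes(size_t global_size, size_t min_size,
                             size_t max_size, size_t *out, size_t capacity)
{
    size_t count = 0;

    /* Walk the powers of two; `s != 0` stops the loop if the shift
     * ever overflows size_t. */
    for (size_t s = 1; s != 0 && s <= max_size; s <<= 1) {
        if (s >= min_size && global_size % s == 0 && count < capacity)
            out[count++] = s;
    }
    return count;
}
```

For a kernel that uses local memory or barriers, calling this with the range 64 to 256 follows the guidance above. For a global size of 1920 it yields 64 and 128, since 256 does not divide 1920 evenly.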
More on the Gen7.5 and Gen8 Compute Architectures: https://software.intel.com/en-us/articles/intel-graphics-developers-guides