When you tune your programs for execution on the Intel® Graphics device to improve performance, be aware of the way your kernels are executed on the hardware:
- Optimize the number of work-groups
- Optimize the work-group size
- Use barriers in kernels wisely
- Optimize thread utilization
The primary goal of every throughput computing machine is to keep a sufficient number of work-groups active, so that if one is stalled, another can run on its hardware resource.
The primary things to consider are:
- Launch enough work-items to keep the EU threads busy. Keep in mind that the compiler may pack up to 32 work-items per thread (with SIMD-32).
- In short, lightweight kernels, use short vector data types and compute multiple pixels per work-item to better amortize the thread launch cost.
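The arithmetic behind these two considerations can be sketched in plain C. This is a rough illustration only: the `device_desc` struct and all three helper functions are hypothetical names introduced here, not part of any Intel or OpenCL API, and the EU count and threads-per-EU values must come from your own device documentation or runtime queries.

```c
#include <stddef.h>

/* Hypothetical device description (an assumption for this sketch).
   Fill it in with values for your actual device. */
typedef struct {
    size_t eu_count;        /* number of execution units (EUs) */
    size_t threads_per_eu;  /* hardware threads per EU */
    size_t simd_width;      /* work-items packed per thread, e.g. 32 for SIMD-32 */
} device_desc;

/* Work-items needed to give every EU thread one full SIMD batch. */
size_t min_work_items(const device_desc *d)
{
    return d->eu_count * d->threads_per_eu * d->simd_width;
}

/* Work-items launched for an image when each work-item computes
   pixels_per_item pixels (rounded up to cover the whole image). */
size_t work_items_for_image(size_t width, size_t height, size_t pixels_per_item)
{
    size_t pixels = width * height;
    return (pixels + pixels_per_item - 1) / pixels_per_item;
}

/* Hardware threads launched for a given number of work-items
   (rounded up, since a partially filled thread still has to launch). */
size_t threads_launched(size_t work_items, size_t simd_width)
{
    return (work_items + simd_width - 1) / simd_width;
}
```

As a usage example under assumed numbers, a device described as `{24, 7, 32}` needs at least 5376 work-items to occupy every thread once. For a 1920x1080 image, computing 4 pixels per work-item launches 518400 work-items in 16200 SIMD-32 threads, compared with 64800 threads at one pixel per work-item, so the launch cost is paid a quarter as often while still leaving far more work-items than the occupancy minimum.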