• 2019 Update 4
  • 03/20/2019

Optimizing Utilization of Execution Units

When you tune programs for better performance on the Intel® Graphics device, be aware of how your kernels execute on the hardware:
  • Optimize the number of work-groups
  • Optimize the work-group size
  • Use barriers in kernels wisely
  • Optimize thread utilization
The primary goal of every throughput computing machine is to keep enough work-groups active that when one stalls, another can run on the same hardware resources.
The main points to consider:
  • Launch enough work items to keep the EU threads busy. Keep in mind that the compiler may pack up to 32 work items into a single thread (with SIMD-32).
  • In short, lightweight kernels, use short vector data types and compute multiple pixels per work item to better amortize the thread launch cost.
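
The arithmetic behind these two points can be sketched in plain C. This is a minimal illustration, not part of any Intel API: the helper names (`eu_threads_needed`, `work_items_needed`, `fills_device`) are hypothetical, and the number of hardware thread slots is a value you would have to supply for your own device.

```c
#include <stddef.h>

/* Hypothetical helper: number of EU threads needed to cover `work_items`
   when the compiler packs `simd_width` work items into each thread
   (for example 8, 16, or 32). Rounds up to a whole thread. */
size_t eu_threads_needed(size_t work_items, size_t simd_width)
{
    return (work_items + simd_width - 1) / simd_width;
}

/* Hypothetical helper: number of work items needed to cover `pixels`
   when each work item computes `pixels_per_item` pixels. Rounds up. */
size_t work_items_needed(size_t pixels, size_t pixels_per_item)
{
    return (pixels + pixels_per_item - 1) / pixels_per_item;
}

/* Returns 1 if the launch produces at least one thread per hardware
   thread slot, 0 otherwise. `hw_thread_slots` is an assumed,
   device-specific value (EU count times threads per EU). */
int fills_device(size_t work_items, size_t simd_width, size_t hw_thread_slots)
{
    return eu_threads_needed(work_items, simd_width) >= hw_thread_slots;
}
```

For example, assuming a device with 168 thread slots, a launch of 1024 work items compiled as SIMD-32 yields only 32 threads, so `fills_device(1024, 32, 168)` returns 0. For a 1920x1080 image, computing four pixels per work item cuts the work-item count from 2,073,600 to 518,400, which reduces the number of thread launches by the same factor while still leaving far more threads than slots.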
