• 03/25/2021
  • Public Content

Compute Shader Considerations

When developing compute shaders, the following guidelines can help to achieve optimal performance when selecting thread group sizes:
  • Pick thread group sizes and dimensions that fit your workload’s memory access patterns. For instance, if your application accesses memory in a linear fashion, specify a one-dimensional thread group size, such as 64 x 1 x 1.
  • For two-dimensional thread groups, smaller thread group sizes typically lead to better performance and achieve better execution unit thread occupancy.
  • Generally, a thread group size of 8 x 8 performs well on Xᵉ-LP. In some cases, this may not be optimal due to memory access patterns and/or cache locality. In such cases, experiment with 16 x 16 or larger dimensions and choose the size that performs best in testing.
  • Thread group sizes of 256 threads or more can cause thread occupancy issues.
  • Avoid using atomics on UAVs when possible. Atomics on SLM variables, however, do not show any performance issues.
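The thread group shapes described above can be sketched in HLSL as follows. This is a minimal illustration rather than a tuned kernel: the resource names, register slots, constant buffer layout, and entry point names are all hypothetical, and the best group size for a real workload should still be confirmed by measurement.

```hlsl
// Hypothetical resources; names and register slots are illustrative only.
StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);
Texture2D<float4>         gSrc    : register(t1);
RWTexture2D<float4>       gDst    : register(u1);

cbuffer Params : register(b0)
{
    uint  gElementCount; // number of elements in gInput / gOutput
    uint2 gImageSize;    // width and height of gSrc / gDst in texels
};

// Linear memory access: one thread per buffer element, 64 x 1 x 1 group.
[numthreads(64, 1, 1)]
void ScaleBufferCS(uint3 dtid : SV_DispatchThreadID)
{
    // Guard the tail: the last group may extend past the end of the buffer.
    if (dtid.x >= gElementCount)
        return;

    gOutput[dtid.x] = gInput[dtid.x] * 2.0f;
}

// Two-dimensional access: one thread per texel, 8 x 8 x 1 group (64 threads).
[numthreads(8, 8, 1)]
void CopyImageCS(uint3 dtid : SV_DispatchThreadID)
{
    // Guard the right and bottom edges when the image size is not a multiple of 8.
    if (any(dtid.xy >= gImageSize))
        return;

    gDst[dtid.xy] = gSrc[dtid.xy];
}
```

With these shapes, the host would dispatch ceil(gElementCount / 64) groups for the linear kernel and ceil(width / 8) x ceil(height / 8) groups for the 2D kernel. Trying 16 x 16 only requires changing the numthreads attribute and the matching dispatch counts.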
When developing compute shaders that use SLM, consider the following:
  • Minimize the number of reads and writes. For instance, an array of float4 data should be loaded and stored as a single array of float4 values, rather than as four separate float arrays.
  • Try to keep variables in registers rather than SLM to save on memory access penalties.
  • Load and store data in a manner such that data elements that are consecutively accessed are located back to back. This allows read and write access to be coalesced and to use memory bandwidth efficiently.
  • Use HLSL interlocked functions to perform min, max, OR, and other reductions, instead of moving data to and from SLM to perform the same reduction with user-defined code. The compiler can map HLSL functions to a hardware-implemented version.
  • Avoid using more than 73 bytes of SLM per lane, as this will reduce the SIMD width. For example:
    • For an 8 x 8 thread group (64 lanes x 73 bytes), use less than 4,672 bytes.
    • For a 16 x 16 thread group (256 lanes x 73 bytes), use less than 18,688 bytes.
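The SLM guidelines above can be illustrated with a small per-group minimum reduction. This is a sketch under stated assumptions, not a reference implementation: all names are hypothetical, the kernel assumes a one-dimensional dispatch in which the input buffer holds 64 elements per thread group, and the asuint comparison is only valid if every input value is non-negative and finite.

```hlsl
// Hypothetical resources and names; a sketch, not a tuned kernel.
StructuredBuffer<float4> gInput  : register(t0);
RWStructuredBuffer<uint> gResult : register(u0); // one uint per thread group

#define GROUP_SIZE 64 // 8 x 8 threads

// A single float4 array instead of four separate float arrays.
// SLM use: 64 x 16 bytes + 4 bytes = 1,028 bytes, well under the
// 4,672 bytes quoted above for an 8 x 8 group.
groupshared float4 sTile[GROUP_SIZE];
groupshared uint   sGroupMin;

[numthreads(8, 8, 1)]
void TileMinCS(uint3 gid : SV_GroupID, uint gi : SV_GroupIndex)
{
    // One thread initializes the shared accumulator.
    if (gi == 0)
        sGroupMin = 0xFFFFFFFFu;

    // One float4 load and one float4 store per thread. Consecutive lanes
    // touch consecutive elements, so the accesses can be coalesced.
    // Assumption: Dispatch(groupCount, 1, 1), so gid.x is the group index.
    float4 v = gInput[gid.x * GROUP_SIZE + gi];
    sTile[gi] = v;

    GroupMemoryBarrierWithGroupSync();

    // Read a neighbor's value from SLM; the partial result stays in a register.
    float4 n = sTile[(gi + 1) % GROUP_SIZE];
    float localMin = min(v.x, n.x);

    // Built-in interlocked reduction on an SLM variable.
    // Assumption: inputs are non-negative and finite, so the unsigned
    // ordering of the bit patterns matches the floating-point ordering.
    InterlockedMin(sGroupMin, asuint(localMin));

    GroupMemoryBarrierWithGroupSync();

    // One thread writes the group's result; decode with asfloat on read-back.
    if (gi == 0)
        gResult[gid.x] = sGroupMin;
}
```

The interlocked call replaces a hand-written reduction loop over SLM, and it targets a groupshared variable rather than a UAV, in line with the guidance on atomics above.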

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.