• 03/25/2021

General Shader Guidance

When writing shaders, look for these opportunities to optimize:
  • Xᵉ-LP supports double-rate FP16 math. Use lower precision when possible. Also note that Xᵉ-LP removed FP64 support to improve power and performance. Query hardware support for double precision before using it, and provide a proper fallback.
  • Structure the shader to avoid unnecessary dependencies, especially high latency operations such as sampling or memory fetches.
  • Avoid shader control flow based on results from sampling operations.
  • Aim for uniform execution of shaders by avoiding flow control based on non-uniform variables.
  • Implement early returns in shaders where the output of an algorithm can be predetermined or computed at a lower cost than the full algorithm.
  • Use shader attributes to flatten, branch, loop, and unroll wisely. It is often better to explicitly specify the desired unrolling behavior than to let the shader compiler make those decisions.
  • Branching is preferable when the instruction cycles saved outweigh the cost of the branch.
  • Extended math and sampling operations have a higher weight and may be worth branching around (see the table below for issue rates).
    Xᵉ-LP EU Instruction Issue Rates

    Instruction   Single Precision (ops/EU/clk)   Theoretical Cycle Count
    FMAD          8                               1
    FMUL          8                               1
    FADD          8                               1
    MIN,MAX       8                               1
    CMP           8                               1
    INV           2                               4
    SQRT          2                               4
    RSQRT         2                               4
    LOG           2                               4
    EXP           2                               4
    POW           1                               8
    IDIV          1 – 6                           1.33 – 8
    TRIG          2                               4
    FDIV          1                               8

  • Small branches of code may perform better when flattened.
  • Unroll conservatively. In most cases, unrolling short loops helps performance; however, unrolling loops does increase the shader instruction count. Unrolling long loops with high iteration counts can impact shader residency in instruction caches, and therefore negatively impact performance.
  • Avoid extra sampler operations when it is likely that the result of the sampler operation will later be multiplied by zero. For example, when interpolating between two samples, if there is a high probability of the interpolation factor being zero or one, a branch can be added to speed up the common case and perform the load only when needed.
  • Avoid querying resource information at runtime; for example, calling the High-Level Shading Language (HLSL) GetDimensions method to make control flow decisions, or unnecessarily incorporating resource information into algorithms.
  • When passing attributes to the pixel shader, mark attributes that do not change per vertex within a primitive as constant.
  • For shaders where depth test is disabled, use discard (or other kill operations) where the output will not contribute to the final color in the render target. Blending can be skipped where the output of the algorithm has an alpha channel value of zero, or where zero-valued shader inputs negate the output.
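
As an illustration of the precision and issue-rate guidance above, the following HLSL sketch shows a minimum-precision type and a cheaper form of an extended-math expression. The function names are hypothetical. How a given compiler maps these intrinsics onto the instructions in the issue-rate table is an assumption, and a compiler may already make the second substitution on its own.

```hlsl
// Sketch only: function names are hypothetical, chosen for illustration.

// min16float requests at least 16 bits of precision. Hardware with FP16
// support may execute it at half precision; other hardware may fall back
// to full 32-bit math.
min16float3 ApplyTint(min16float3 color, min16float3 tint)
{
    return saturate(color * tint);
}

// Reciprocal length written as a square root followed by a reciprocal.
// Assuming this maps to SQRT plus INV or FDIV, the table gives a
// theoretical cost of 4 + 4 or 4 + 8 cycles.
float InvLengthViaSqrt(float3 v)
{
    return 1.0f / sqrt(dot(v, v));
}

// The same value from a single reciprocal square root. Assuming this maps
// to RSQRT, the table gives a theoretical cost of 4 cycles.
float InvLengthViaRsqrt(float3 v)
{
    return rsqrt(dot(v, v));
}
```

The two reciprocal-length functions can differ slightly in precision, so it is worth checking the generated code and the visual result before committing to either form.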
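
The flatten, unroll, and loop guidance above can be sketched with the standard HLSL flow-control attributes. The helper names and the buffer binding below are hypothetical, and the compiler is free to ignore an attribute it cannot honor.

```hlsl
// Sketch only: helper names and the register binding are hypothetical.
StructuredBuffer<float> gValues : register(t0);

// Small branch: [flatten] asks the compiler to evaluate both sides and
// select the result instead of emitting a jump.
float ClampToLimit(float x, float limit)
{
    float result = x;
    [flatten]
    if (x > limit)
    {
        result = limit;
    }
    return result;
}

// Short loop with a fixed trip count: [unroll] removes loop overhead at
// the cost of a few extra instructions.
float Average4(float samples[4])
{
    float sum = 0.0f;
    [unroll]
    for (int i = 0; i < 4; ++i)
    {
        sum += samples[i];
    }
    return sum * 0.25f;
}

// Long or data-dependent trip count: [loop] keeps the loop rolled so the
// shader's instruction count stays small.
float SumValues(uint count)
{
    float sum = 0.0f;
    [loop]
    for (uint i = 0; i < count; ++i)
    {
        sum += gValues[i];
    }
    return sum;
}
```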
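
The advice above about skipping sampler operations whose result would be multiplied by zero might look like the following pixel-shader helper. This is a minimal sketch: the texture names, register bindings, and function name are hypothetical, and the blend factor t is assumed to come from an interpolated or constant input rather than from another sample. Derivatives are taken before the branches and passed to SampleGrad so that the texture fetches stay well defined when neighboring pixels take different paths.

```hlsl
// Sketch only: resource names and bindings are hypothetical.
Texture2D    gLayerA : register(t0);
Texture2D    gLayerB : register(t1);
SamplerState gLinear : register(s0);

// Computes lerp(A, B, t) but fetches each layer only when its weight is
// nonzero. Intended for pixel shaders, where ddx and ddy are available.
float4 BlendLayers(float2 uv, float t)
{
    // Compute gradients outside of any divergent control flow.
    float2 duvdx = ddx(uv);
    float2 duvdy = ddy(uv);

    float4 result = float4(0.0f, 0.0f, 0.0f, 0.0f);

    // Layer A contributes (1 - t); skip its fetch when t is one or more.
    [branch]
    if (t < 1.0f)
    {
        result += (1.0f - t) * gLayerA.SampleGrad(gLinear, uv, duvdx, duvdy);
    }

    // Layer B contributes t; skip its fetch when t is zero or less.
    [branch]
    if (t > 0.0f)
    {
        result += t * gLayerB.SampleGrad(gLinear, uv, duvdx, duvdy);
    }

    return result;
}
```

The branches only pay off when t is frequently exactly zero or one across neighboring pixels; when most pixels take both paths, the plain two-sample lerp is likely the better choice.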
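
The last two items above, constant attributes and discard, can be sketched together in one pixel shader. The structure layout, semantics, and resource names are hypothetical, and the example assumes a pass where depth test is disabled and a zero opacity means the pixel cannot contribute to the render target.

```hlsl
// Sketch only: struct layout, semantics, and bindings are hypothetical.
Texture2D    gAlbedo : register(t0);
SamplerState gLinear : register(s0);

struct PSInput
{
    float4 position : SV_Position;
    float2 uv       : TEXCOORD0;
    float  opacity  : TEXCOORD1;

    // Identical for every vertex of the primitive, so mark it constant
    // and skip per-pixel interpolation.
    nointerpolation float4 tint : COLOR0;
};

float4 PSMain(PSInput input) : SV_Target
{
    float4 albedo = gAlbedo.Sample(gLinear, input.uv);

    // A fully transparent pixel adds nothing to the final color, so its
    // output is dropped and blending is not needed for it.
    if (input.opacity <= 0.0f)
    {
        discard;
    }

    return float4(albedo.rgb * input.tint.rgb, albedo.a * input.opacity);
}
```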
