Vertex fetch throughput is eight attributes per clock (and max two vertices per clock). Ensure all bound attributes are used. When a draw is bottlenecked on geometry work, reducing the number of attributes per vertex can improve performance.
Implement a level of detail system that allows flexibility in model accuracy by adjusting the number of vertices per model per level of detail.
Analyze primitive to pixels covered ratios. When this ratio is high, the extra vertex shader threads add very little value to the final render target. Keep this ratio below 1:4.
Implement efficient CPU occlusion culling to avoid submitting hidden geometry. This approach can save both CPU (draw submission), and GPU (render) time—we suggest using our highly optimized
Masked Software Occlusion Culling
for this. It can eventually be used in combination to finer grain GPU culling.
Define input geometries as a structure of arrays for vertex buffers. Try to group position information vertex data in its own input slot to assist the tile binning engine for tile-based rendering.
The Xᵉ-LP vertex cache does not cache instanced attributes. For instanced calls, consider loading attributes explicitly in your vertex shader.
For full-screen post-processing passes, use a single triangle that covers entire screen space instead of two triangles to form a quad. The shared triangle edge in the two tri case results in sub-optimal utilization of compute resources.
Optimize transformation shaders (that is, vertex to geometry shader) to output only attributes that will be used by later stages in the pipeline. For example, avoid defining unnecessary outputs from a vertex shader that will not be consumed by a pixel shader. This enables better use of bandwidth and space with the L3 cache.
Metal 2 Specific—Tessellation factors are calculated in the compute engine, then returned to the 3D engine to get rendered. Calculate tessellation factors back to back for multiple draws to reduce the amount of context switching between 3D and compute workloads.