Optimizing DirectX* 8.0 Vertex Shaders
Once again, welcome back to Maximum FPS! This month Ronen Zohar will provide us with a thorough understanding of how to take advantage of vertex shaders on Intel processors. Ronen is an Intel engineering manager from Haifa, Israel, and has worked closely with Microsoft* on optimizing and maximizing the performance of vertex shaders on Intel processors.
Microsoft* DirectX* 8.0 introduced a new method of processing (transforming) vertices, known as "vertex shaders." When using this method, the application programmer can program a dedicated virtual machine (known as the vertex virtual machine or VVM) to perform any desired algorithm on the vertices. The vertex shader mechanism was designed to run on dedicated hardware within the 3D graphics hardware. However, most of the current 3D hardware in end-user systems lacks this feature, so software emulation of the vertex shader mechanism must be used. Intel worked with Microsoft to optimize the software implementation of this mechanism, and in this column, I will share some tips on how to achieve maximum performance from software vertex shaders when running on Intel processors.
Introduction to Vertex Shaders
A vertex shader is a small program that operates on vertices and runs on a specific "virtual" machine. The vertex shader program is executed once per vertex, using the vertex data and constants shared between all vertices as inputs, while its outputs are homogeneous clip coordinates of the "transformed" vertex and additional vertex properties such as colors and texture coordinates.
The vertex shader program is not responsible for primitive-based operations such as polygon assembly and clipping. Therefore, it cannot add or delete vertices, and no memory is shared between vertices that are processed as part of the same primitive.
The vertex shader includes two components: a vertex shader declaration, which defines how the vertex input stream is mapped into the vertex virtual machine registers, and a vertex shader program, which defines the program to execute for each vertex. When executing a vertex shader, each vertex is mapped to the VVM input registers (according to the declaration) and then the vertex shader program is executed. When the vertex shader program finishes, the outputs are collected from the VVM output registers and sent to the rendering engine.
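The execution model described above can be sketched in plain C++. This is only an illustration, not the DirectX runtime: the names Registers, Element, Program and run_shader are hypothetical, the register file sizes are illustrative, and the (0, 0, 0, 1) default for components missing from the stream is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec4 = std::array<float, 4>;

// Hypothetical register files; sizes are illustrative only.
struct Registers {
    std::array<Vec4, 16> v{};  // input registers, filled from the vertex stream
    std::array<Vec4, 12> r{};  // scratch registers
    Vec4 oPos{};               // output: homogeneous clip coordinates
};

// One declaration entry: 'count' floats starting at 'offset' go to v[reg].
struct Element {
    std::size_t offset;
    std::size_t count;
    std::size_t reg;
};

// The "program" is any callable that reads inputs/constants and writes outputs.
using Program = std::function<void(Registers&, const std::vector<Vec4>&)>;

std::vector<Vec4> run_shader(const std::vector<float>& stream, std::size_t stride,
                             const std::vector<Element>& decl,
                             const std::vector<Vec4>& constants,
                             const Program& program) {
    std::vector<Vec4> out;
    for (std::size_t base = 0; base + stride <= stream.size(); base += stride) {
        Registers regs;  // fresh registers: nothing carries over between vertices
        for (const Element& e : decl) {
            assert(e.count <= 4 && e.offset + e.count <= stride);
            Vec4 value = {0.0f, 0.0f, 0.0f, 1.0f};  // assumed default for missing components
            for (std::size_t i = 0; i < e.count; ++i)
                value[i] = stream[base + e.offset + i];
            regs.v[e.reg] = value;  // map stream data to the input register
        }
        program(regs, constants);   // executed once per vertex
        out.push_back(regs.oPos);   // collect the output register
    }
    return out;
}
```

Because a new Registers object is created for every vertex, the sketch also reflects the restriction that a shader cannot pass data from one vertex to the next.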
The Vertex Virtual Machine
The vertex virtual machine is built from 4-wide single precision floating point registers (made from x, y, z and w components), divided into several register files:
- Input registers - these read-only registers hold per-vertex input data, as mapped from the vertices input stream.
- Constant registers - these read-only registers hold constants shared between all the vertices in a primitive (such as transformation matrices, light attributes, etc.). The constant values can be changed between primitives.
- Scratch registers - these registers hold temporary results needed during the execution of the vertex shader program.
- Address registers - these registers are used to index entries in the constant register file.
- Output registers - these write-only registers hold the results of the vertex virtual machine execution. These values are passed on to the rendering engine.
DirectX 8.0 defines 16 vertex shader instructions:
- mov - move data between registers
- add - component-wise add
- mul - component-wise multiply
- mad - component-wise multiply and add
- sge - component-wise compare (set to 1.0f if greater or equal, else set to 0.0f)
- slt - component-wise compare (set to 1.0f if less than, else set to 0.0f)
- min - component-wise minimum
- max - component-wise maximum
- dp3 - dot product x, y and z components
- dp4 - dot product all 4 components
- lit - given N dot L, N dot H, and specular power, returns diffuse and specular lighting factors
- dst - given a distance, returns light attenuation factors
- logp - partial precision logarithm (base 2)
- expp - partial precision exponent (base 2)
- rcp - reciprocal of input value
- rsq - reciprocal of square root of input value
Each input register can be swizzled before the instruction executes (for instance, swapping the z and y components), and its values can also be negated before the instruction executes (the instruction add r0,r1,-r2 subtracts r2 from r1 and stores the result in r0). When writing results, not all components of the output register have to be written. The programmer can specify which components to write; components that are not written keep their previous values.
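The source modifiers and the destination write mask can be modeled in a few lines of C++. This is a sketch of the semantics only: Vec4, Comp, read_source, write_masked and add are hypothetical names chosen for the example, and the bit layout of the mask is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;  // one 4-wide register: x, y, z, w

enum Comp { X = 0, Y = 1, Z = 2, W = 3 };

// Source modifier: choose a source component for every lane, optionally negate.
Vec4 read_source(const Vec4& r, Comp sx, Comp sy, Comp sz, Comp sw,
                 bool negate = false) {
    const float s = negate ? -1.0f : 1.0f;
    return {s * r[sx], s * r[sy], s * r[sz], s * r[sw]};
}

// Destination write mask: bit i set means component i is written
// (bit 0 = x ... bit 3 = w). Unwritten components keep their old value.
void write_masked(Vec4& dst, const Vec4& value, unsigned mask) {
    assert(mask <= 0xFu);
    for (unsigned i = 0; i < 4; ++i)
        if (mask & (1u << i)) dst[i] = value[i];
}

// Component-wise add, as performed by the add instruction.
Vec4 add(const Vec4& a, const Vec4& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
```

For example, with r1 = (10, 20, 30, 40) and r2 = (1, 2, 3, 4), modeling add r0.xyz, r1, -r2.xzyw as write_masked(r0, add(r1, read_source(r2, X, Z, Y, W, true)), 0x7) writes (9, 17, 28) to r0.xyz and leaves r0.w unchanged.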
DirectX* 8.0 also defines some macros that use the basic instruction-set, such as m4x4 (vector by 4x4 matrix multiply - which is expanded to four dp4 instructions). More information on these macros can be found in the DirectX 8.0 SDK.
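As a rough illustration of the m4x4 expansion, the following C++ sketch computes the macro's result as four dp4 operations, one per output component. The function names mirror the instruction names for readability but are hypothetical helpers, and the sketch assumes that the four matrix rows sit in four consecutive constant registers starting at index n.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

// dp4: dot product of all four components.
float dp4(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// m4x4 modeled as four dp4 operations against constants c[n] .. c[n+3].
Vec4 m4x4(const Vec4& src, const std::vector<Vec4>& c, std::size_t n) {
    assert(n + 4 <= c.size());
    return {dp4(src, c[n]),       // result.x
            dp4(src, c[n + 1]),   // result.y
            dp4(src, c[n + 2]),   // result.z
            dp4(src, c[n + 3])};  // result.w
}
```

With rows (1, 0, 0, 5), (0, 1, 0, 6), (0, 0, 1, 7) and (0, 0, 0, 1), the point (1, 2, 3, 1) is translated to (6, 8, 10, 1).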
Using Vertex Shaders
To create a vertex shader, use the IDirect3DDevice8::CreateVertexShader API method. Its parameters are the byte code representations of the vertex shader declaration and program, and it yields a handle that identifies the created vertex shader. The byte code of the vertex shader program can be assembled from an assembly-like text file using the D3DX functions D3DXAssembleShader or D3DXAssembleShaderFromFile.
To render a primitive using an already created vertex shader, use the IDirect3DDevice8::SetVertexShader method (using the vertex shader handle as input value), set the proper constant register values (using the IDirect3DDevice8::SetVertexShaderConstant method) and then render the primitive using one of the Draw... API methods (such as DrawIndexedPrimitive). The vertices are processed by the vertex shader mechanism before they are rendered.
The Vertex Shader Compiler
When using software vertex shaders on Intel® Pentium® III or Pentium® 4 processor-based systems, a special mechanism, known as the vertex shader compiler, is used. The vertex shader compiler is built into the DirectX 8.0 runtime and runs automatically when the application programmer tries to create and use a vertex shader in software vertex processing mode.
This compiler takes the shader program and declaration at shader creation time (the IDirect3DDevice8::CreateVertexShader method) and compiles them into equivalent Intel architecture machine code in memory, using all the extensions available on the host processor (e.g. Intel Streaming SIMD Extensions (SSE) and Intel Streaming SIMD Extensions 2 (SSE2)). To achieve maximum performance, the compiler tries to minimize memory traffic by rescheduling the generated assembly instructions and exploiting the full range of instructions available on the processor. When a vertex shader is used (any of the Draw... methods called after IDirect3DDevice8::SetVertexShader), the code generated by the compiler is executed.
The shader compiler exploits the parallelism in the SSE and SSE2 extensions by generating code that processes four vertices per call (i.e. for n vertices the generated code is executed n/4 times). Every VVM register is mapped to four physical XMM registers, each of which holds one logical component (x, y, z or w) of four consecutive vertices. For example, the code generated for the mul r0, r0, v0 vertex shader instruction (component-wise multiply of r0 with v0) contains four SIMD multiplies, one for each VVM register component, and each SIMD multiply operates on the data of four consecutive vertices.
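The register mapping can be pictured with a small C++ model. It is not the compiler's actual output: a four-element array stands in for one XMM register so that the sketch stays portable, and the names Lane4, RegisterSoA, pack4, simd_mul and mul are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };   // one vertex attribute as the application stores it

using Lane4 = std::array<float, 4>;  // stand-in for one XMM register

// One VVM register for a batch of four vertices:
// each member holds the same component of four consecutive vertices.
struct RegisterSoA { Lane4 x, y, z, w; };

// Re-arrange four consecutive vertices into the per-component layout.
RegisterSoA pack4(const std::vector<Vec4>& vertices, std::size_t first) {
    assert(first + 4 <= vertices.size());
    RegisterSoA r{};
    for (std::size_t i = 0; i < 4; ++i) {
        const Vec4& v = vertices[first + i];
        r.x[i] = v.x; r.y[i] = v.y; r.z[i] = v.z; r.w[i] = v.w;
    }
    return r;
}

// Models one SIMD multiply: four independent products, one per vertex.
Lane4 simd_mul(const Lane4& a, const Lane4& b) {
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Models the code for "mul dst, a, b": four SIMD multiplies, one per component.
void mul(RegisterSoA& dst, const RegisterSoA& a, const RegisterSoA& b) {
    dst.x = simd_mul(a.x, b.x);
    dst.y = simd_mul(a.y, b.y);
    dst.z = simd_mul(a.z, b.z);
    dst.w = simd_mul(a.w, b.w);
}
```

In this layout no shuffling between components is needed for component-wise instructions, which is why one shader instruction maps cleanly to one SIMD operation per component.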
Optimizing Vertex Shader Programs for the Vertex Shader Compiler
Like any other compiler, the vertex shader compiler can produce shaders that perform differently depending on the different coding alternatives chosen for a given program. This section includes some hints on how to help the compiler generate code that achieves maximum performance. The end result will have an immediate effect on the overall performance of the application that uses the vertex shader.
- Write only needed results - Use the partial write feature of the vertex shader language to write only the needed outputs (for example, when calculating the difference between two 3D vectors, write only the x, y, and z components, as the w component is meaningless). When this guideline is followed, the compiler generates fewer unneeded arithmetic instructions.
- Use macros when possible - The code generated for the vertex shader macros (such as m4x4, frc, etc.) is better optimized than the code that is generated if these macros are expanded to a series of equivalent vertex shader instructions.
- "Squeeze" dependency chains - Order instructions that have direct dependencies between them as close as possible. It will keep more data "alive" in the physical registers and will reduce the memory traffic in the generated code.
- Write final arithmetic instructions directly to an output register - Writing to a temporary register and then copying its contents to the output register will add unnecessary assembly instructions.
- Re-use the same temporary register if possible - This will yield better physical register allocation, and reduce memory traffic in the generated code.
- Don't explicitly saturate color and fog values - The compiler will saturate them for you.
- Use the lowest acceptable accuracy for exponents and logarithms - expp.x and logp.x are the fastest, but have the lowest accuracy; expp.z and logp.z have better accuracy, but they are more expensive computationally.
- Try to avoid using the address register - Because of the SIMD nature of the generated code, using the address register means that a different constant register might be retrieved for each of the four vertices being processed in parallel. In that case a data re-arrangement function is called to re-arrange the data according to the address register values. If the address values of the four vertices being processed are equal, this function is not called and the code runs much faster.
- If you must use the address register - Try to order your vertices so that the address register values are equal within each group of four vertices, which avoids the re-arrangement call described above.
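The cost of the address register can be made concrete with a C++ sketch of a constant fetch for a batch of four vertices. This only models the idea of a fast path for equal addresses and a slower per-vertex path otherwise; the actual re-arrangement function in the runtime is not shown in this column, and the names is_uniform, fetch_constant and count_slow_batches are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;   // one constant register
using Lane4 = std::array<float, 4>;  // stand-in for one XMM register
struct RegisterSoA { Lane4 x, y, z, w; };

// True when all four vertices of a batch index the same constant register.
bool is_uniform(const std::array<std::size_t, 4>& addr) {
    return addr[0] == addr[1] && addr[1] == addr[2] && addr[2] == addr[3];
}

// Fetch the addressed constant for each of the four vertices in a batch.
RegisterSoA fetch_constant(const std::vector<Vec4>& c,
                           const std::array<std::size_t, 4>& addr) {
    RegisterSoA out{};
    if (is_uniform(addr)) {
        // Fast path: one constant, broadcast to all four vertices.
        assert(addr[0] < c.size());
        const Vec4& k = c[addr[0]];
        out.x.fill(k[0]); out.y.fill(k[1]); out.z.fill(k[2]); out.w.fill(k[3]);
    } else {
        // Slow path: gather a different constant per vertex and re-arrange it.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            assert(addr[lane] < c.size());
            const Vec4& k = c[addr[lane]];
            out.x[lane] = k[0]; out.y[lane] = k[1];
            out.z[lane] = k[2]; out.w[lane] = k[3];
        }
    }
    return out;
}

// Counts how many complete batches of four would take the slow path.
std::size_t count_slow_batches(const std::vector<std::size_t>& addr_per_vertex) {
    std::size_t slow = 0;
    for (std::size_t i = 0; i + 4 <= addr_per_vertex.size(); i += 4) {
        const std::array<std::size_t, 4> a = {addr_per_vertex[i], addr_per_vertex[i + 1],
                                              addr_per_vertex[i + 2], addr_per_vertex[i + 3]};
        if (!is_uniform(a)) ++slow;
    }
    return slow;
}
```

For eight vertices whose address values alternate 0, 1, 0, 1, 0, 1, 0, 1, both batches take the slow path in this model; reordering the vertices so the values read 0, 0, 0, 0, 1, 1, 1, 1 leaves no slow batches.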
Vertex Shader Program Optimization Example
Original vertex shader program (transformation, object space point light and one texture coordinate copy)
The code generated for the original program takes 56 cycles per vertex on a Pentium® III processor-based system. The code generated for the optimized program takes only 48 cycles per vertex -- a 17% performance boost (56/48 is roughly 1.17)!
Vertex shaders give application programmers the freedom to use their own custom vertex processing algorithms within the framework of a standard API. Vertex shaders perform very well even without DirectX 8 compatible hardware, and by following the simple guidelines presented here, this performance can be increased even further, allowing developers to start using this exciting technology in their applications today.