Through bundles, the PSO, and descriptor heaps and tables, we have seen how Microsoft Direct3D* (D3D) 12 improves CPU efficiency and gives more control to developers. The PSO and descriptor model allow for bundles, which in turn are used for common and repeated commands. The result is a simpler, “closer to the metal” approach that reduces overhead and allows more efficient use of CPUs for “Console API efficiency and performance”. In Part 1 we discussed how, in the world of PC gaming, thread 0 often does most, if not all, of the work while the other threads only handle other OS or system tasks. Efficient use of multiple cores or threads in gaming is tough. Often the work required to make a game multithreaded is expensive in both manpower and resources. The D3D development team wants to change that with D3D 12.
Command Creation Parallelism:
Deferred command execution has come up several times during this D3D 12 overview: the game thinks that each command is being executed immediately, when in reality commands are queued up and run at a later time. This behavior remains in D3D 12 but is transparent to the game. There is no immediate context; everything is a deferred context now. Threads can therefore generate commands in parallel to complete lists of commands, which are then fed into an API object called the Command Queue. The GPU will not execute commands until they are submitted via the Command Queue. The queue is the ordering of the commands, and the Command List is the recording of those commands. How are command lists different from bundles? Command lists are designed and optimized for one-time generation of commands, so threads can generate commands simultaneously. A command list is used once; then you can delete the list from memory and record a new list in its place. Bundles are designed for repeated use of commonly reused rendering commands within a frame or across multiple frames.
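The relationship between command lists and the Command Queue can be sketched in plain C++. This is only an illustration under stated assumptions: `CommandList`, `CommandQueue`, and `record_and_submit` are hypothetical stand-ins, not D3D 12 types, and the "commands" are just strings. Each thread records into its own list, and nothing counts as executed until a list is submitted to the queue.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT D3D 12 types.
using Command = std::string;

struct CommandList {                 // recorded once, submitted once
    std::vector<Command> commands;
    void record(const Command& c) { commands.push_back(c); }
};

struct CommandQueue {                // defines the order the "GPU" sees
    std::vector<Command> executed;
    void submit(const CommandList& list) {
        executed.insert(executed.end(), list.commands.begin(), list.commands.end());
    }
};

// Each thread records into its own list, so recording needs no locks.
inline std::vector<Command> record_and_submit(int thread_count, int draws_per_list) {
    std::vector<CommandList> lists(thread_count);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lists, t, draws_per_list] {
            for (int d = 0; d < draws_per_list; ++d)
                lists[t].record("list" + std::to_string(t) + ":draw" + std::to_string(d));
        });
    }
    for (auto& w : workers) w.join();

    // Nothing has "executed" until the lists reach the queue; the order of
    // submission, not the order of recording, decides what runs first.
    CommandQueue queue;
    for (const auto& list : lists) queue.submit(list);
    return queue.executed;
}
```

Calling `record_and_submit(2, 2)` records two lists on two threads and returns the four commands in submission order, list 0 first and list 1 second, regardless of which thread finished recording first.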
Command parallelism was attempted in D3D 11 through a feature called the deferred context. However, it did not achieve the D3D team’s performance goals because of the overhead it required. Further analysis showed many places with a lot of serial overhead, resulting in poor scaling across the CPU cores. Some of that serial overhead was removed in D3D 12 by the CPU efficiency designs previously discussed.
Lists and the Queue:
Imagine two threads generating lists of rendering commands, where one sequence of commands is meant to run before the other. There can be hazards: say one thread uses a resource as a texture while the other thread uses that same resource as a render target. In D3D 11 the driver needs to look at the resource usage at render time and then resolve the hazards, ensuring coherent data. This hazard tracking is one area of serialized overhead in D3D 11. With D3D 12 the game, not the driver, is responsible for hazard tracking.
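To make "the game is responsible for hazard tracking" concrete, here is a minimal sketch in plain C++. The names (`Usage`, `Resource`, `Recorder`) are hypothetical and are not the D3D 12 API; the point is only that the game keeps the current usage of each resource and records an explicit transition before using it in a different way.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names for illustration; not the D3D 12 API.
enum class Usage { RenderTarget, Texture };

struct Resource {
    std::string name;
    Usage state;   // the game, not the driver, keeps this up to date
};

struct Recorder {
    std::vector<std::string> commands;

    // The game declares the usage change explicitly.
    void transition(Resource& r, Usage next) {
        if (r.state == next) return;          // no hazard, nothing to record
        commands.push_back("transition " + r.name);
        r.state = next;
    }

    // Returns false if the game forgot to resolve the hazard first.
    bool use(const Resource& r, Usage needed, const std::string& what) {
        if (r.state != needed) return false;
        commands.push_back(what + " " + r.name);
        return true;
    }
};
```

In this sketch, sampling a resource that is still in the `RenderTarget` state is rejected until the game records a `transition` to `Texture`. A real driver would not necessarily reject such a command; the check here only stands in for the coherency problem the text describes.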
D3D 11 allows for any number of deferred contexts, but that comes with a cost. The driver tracks state per resource, so as you start recording commands for a deferred context the driver needs to allocate memory to track the state of every resource used. This memory is kept around while the deferred context is generated, and when recording is done the driver has to delete all of these tracking objects from memory. This is quite a bit of overhead, so in D3D 12 the game declares, at the API level, the maximum number of command lists that can be generated in parallel. The driver can then set up and allocate all the tracking objects up front in a single coherent piece of memory.
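The idea of declaring the maximum up front can be sketched as a fixed-size pool. This is an assumption-laden illustration, not driver code: `TrackingSlot` and `TrackingPool` are hypothetical names, and the per-list tracking data is reduced to a counter. All slots are allocated once in one contiguous block, and acquiring or releasing a slot never touches the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-command-list tracking record; a real driver tracks far more.
struct TrackingSlot {
    bool in_use = false;
    std::size_t resources_tracked = 0;
};

class TrackingPool {
public:
    // One up-front allocation sized by the maximum the game declared.
    explicit TrackingPool(std::size_t max_parallel_lists) : slots_(max_parallel_lists) {}

    // Returns a slot index, or -1 when the declared maximum is already in use.
    int acquire() {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (!slots_[i].in_use) {
                slots_[i].in_use = true;
                slots_[i].resources_tracked = 0;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Marks the slot free for the next command list; no memory is freed.
    void release(int index) { slots_[static_cast<std::size_t>(index)].in_use = false; }

private:
    std::vector<TrackingSlot> slots_;
};
```

With a pool of two slots, a third `acquire()` fails until one of the first two is released, which mirrors the limit the game declares.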
It is common with D3D 11 to use dynamic buffers (constant, vertex, etc.); however, behind the scenes there are multiple instances of memory tracking each discarded buffer. Say you have two command lists being generated in parallel and you call MapDiscard. Once the list is submitted, the driver has to patch into the second command list to correct the discarded buffer information. Like the hazard example earlier, this requires a lot of overhead. D3D 12 has given that renaming control to the game; the dynamic buffer is gone. Instead the game has fine-grained control: it can build its own allocators and subdivide the buffer as needed. The commands can then point to an explicit location in memory.
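One simple shape for a game-built allocator is a linear (bump) allocator that hands out offsets into one large buffer. The sketch below is an assumption, not something taken from the presentation: `LinearAllocator` is a hypothetical name, and it only computes offsets rather than touching GPU memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical linear allocator over one large buffer; it returns byte
// offsets that commands could point at, and does not own any memory itself.
class LinearAllocator {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    explicit LinearAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns the byte offset of the new sub-allocation, or kInvalid if it
    // does not fit. `alignment` must be a power of two.
    std::size_t allocate(std::size_t size, std::size_t alignment) {
        std::size_t start = (offset_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || size > capacity_ - start) return kInvalid;
        offset_ = start + size;
        return start;
    }

    // Reuse the whole buffer. The game must only do this once the GPU is
    // known to have finished reading the old contents.
    void reset() { offset_ = 0; }

private:
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

For example, with a 1024-byte buffer and 256-byte alignment, three small allocations land at offsets 0, 256, and 512, and a further 512-byte request is refused because only 256 bytes remain after alignment. Because every allocation has its own explicit offset, nothing needs to be renamed or patched at submit time.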
As discussed in Part 3, the runtime and driver track resource lifetime in D3D 11, so a lot of reference counting and tracking is required, and it all must be resolved at submit time. Resource lifetime and hazard control have been given to the game in D3D 12, removing that serial overhead for more CPU efficiency. After optimizing these four areas (hazard tracking, deferred-context state tracking, dynamic buffer renaming, and resource lifetime), parallel command generation is more efficient in D3D 12, allowing for improved CPU parallelism. In addition to these changes, the D3D development team is building a new driver model, WDDM 2.0, which has further optimizations to reduce the cost of command list submission.
Command Queue Flow:
Above we have the bundle diagram from Part 5, but now it is multithreaded. On the left is the Command Queue, the sequence of events submitted to the GPU. In the middle are two command lists generated in parallel. On the right are two bundles that were recorded before this scenario started. We start with the command lists: two lists are being generated in parallel for different parts of the scene. Command List 1 completes recording and is submitted to the Command Queue so the GPU can start executing it. Thus, in parallel, the Command Queue control flow starts while Command List 2 is still recording on Thread 2. While the GPU is executing Command List 1, Thread 2 finishes generating Command List 2 and submits it to the Command Queue. When the Command Queue finishes executing Command List 1, it continues on to Command List 2 in serial order. The Command Queue lists the serial order in which the GPU needs to execute commands. So even though Command List 2 was generated and submitted to the Command Queue before the GPU finished executing Command List 1, it was not executed until Command List 1 was complete. As you can see, D3D 12 offers more efficient parallelism across the entire flow, from core and thread usage to the API and driver. More work is done in less time, with less overhead, and in a more efficient manner.
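The ordering in this flow can be replayed with a small single-threaded simulation. Everything here is hypothetical: `SimQueue` and `run_flow` are illustrative names, the "GPU" is a `step()` call that executes one pending command, and the command strings are made up. The sketch shows only the ordering rule: a list submitted while an earlier list is still in flight runs after it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical simulation of a command queue; not a D3D 12 type.
struct SimQueue {
    std::deque<std::string> pending;    // submitted, not yet executed
    std::vector<std::string> executed;  // order the simulated GPU ran commands

    void submit(const std::vector<std::string>& list) {
        pending.insert(pending.end(), list.begin(), list.end());
    }

    // The simulated GPU executes one pending command, if there is one.
    bool step() {
        if (pending.empty()) return false;
        executed.push_back(pending.front());
        pending.pop_front();
        return true;
    }
};

inline std::vector<std::string> run_flow() {
    SimQueue q;
    q.submit({"L1:clear", "L1:draw"});  // Command List 1 is submitted first
    q.step();                           // GPU starts executing list 1
    q.submit({"L2:draw"});              // list 2 arrives while list 1 is in flight
    while (q.step()) {}                 // GPU drains the queue in serial order
    return q.executed;
}
```

`run_flow()` returns the commands of list 1 followed by the command of list 2, even though list 2 was submitted before list 1 finished.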
Next up in Part 7: Dynamic Heaps
Diagrams and code samples from BUILD 2014 presentation. Created by Max McMullen, D3D Development Lead at Microsoft.