In Part 7 we discussed Dynamic Heaps and how they better enable CPU parallelism. Now it is time to bring it all together and show how the new features of Microsoft Direct3D* (D3D) 12 combine to create a truly multi-threaded game on PC, delivering that all-important “Console API efficiency and performance”.
D3D 12 allows several tasks to run in parallel. Command Lists and Bundles allow for parallel command generation and execution. Bundles let frequently repeated commands be recorded once and then run in multiple Command Lists, several times in a frame or across multiple frames. Command Lists can be generated across multiple threads and then fed to the Command Queue for GPU execution. Finally, with persistently mapped buffers you can generate dynamic data in parallel. Both D3D 12 and WDDM 2.0 are designed for parallelism, and it is up to the developer to decide what to make parallel. D3D 12 removes the constraints of past D3D versions, allowing developers to parallelize their game or engine in whatever way makes sense for them.
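To make the pattern concrete, here is a minimal sketch of parallel command generation in portable C++. It is not the D3D 12 API: `CommandList`, `CommandQueue`, and `RecordInParallel` are hypothetical stand-ins invented for illustration, and the "commands" are just strings. In a real D3D 12 application, recording would go through `ID3D12GraphicsCommandList` and submission through `ID3D12CommandQueue::ExecuteCommandLists`. The sketch only shows the shape of the idea: each thread records into its own list without locking, and the finished lists are handed to a single queue in a defined order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: an ordered list of command names.
using CommandList = std::vector<std::string>;

// Hypothetical stand-in for a command queue: it receives finished lists
// and "executes" their commands in submission order.
struct CommandQueue {
    std::vector<std::string> executed;

    void Execute(const std::vector<CommandList>& lists) {
        for (const CommandList& list : lists)
            executed.insert(executed.end(), list.begin(), list.end());
    }
};

// Record draws_per_thread commands on each of thread_count threads.
// Every thread writes only to its own list, so recording needs no locks.
std::vector<CommandList> RecordInParallel(std::size_t thread_count,
                                          std::size_t draws_per_thread) {
    assert(thread_count > 0);
    std::vector<CommandList> lists(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lists, t, draws_per_thread] {
            for (std::size_t d = 0; d < draws_per_thread; ++d)
                lists[t].push_back("draw t" + std::to_string(t) +
                                   " #" + std::to_string(d));
        });
    }

    // Wait for all recording to finish before anything is submitted.
    for (std::thread& worker : workers)
        worker.join();
    return lists;
}
```

Calling `RecordInParallel(4, 3)` and passing the result to `CommandQueue::Execute` yields twelve commands, grouped by the list they were recorded in. The recording order across threads is not deterministic, but the submission order is, which is why the lists are kept separate until they reach the queue.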
The diagram above shows a typical game workload on D3D 11. We see the application logic, D3D runtime, User Mode Driver (UMD), DXGKernel, Kernel Mode Driver (KMD) and Present usage across a CPU with 4 threads. Thread 0 is doing all the heavy lifting. Threads 1-3 are not doing much beyond application logic and the D3D 11 runtime generating rendering commands. Due to the D3D 11 design, the UMD isn’t even generating commands on these threads.
Now let’s take a look at the same workload with D3D 12. Again we have the application logic, D3D runtime, UMD, DXGKernel, KMD and Present usage across a CPU with 4 threads. With the D3D 12 optimizations, however, the work is evenly split across all 4 threads. Thanks to true parallel command generation, the D3D runtime now runs in parallel. With the kernel optimizations in WDDM 2.0, the kernel overhead is drastically reduced. The UMD is now working on all the threads, not just Thread 0, showing true command generation parallelism. Lastly, bundles replace the redundant state change logic of D3D 11 and reduce the application logic time.
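The bundle reuse described above can be sketched in the same spirit. Again, this is a hedged illustration rather than D3D 12 code: `Bundle`, `DirectCommandList`, and `RecordDrawBundle` are hypothetical names, and the real API call for replaying a bundle is `ID3D12GraphicsCommandList::ExecuteBundle`. The point is that the repeated commands are recorded once and then replayed, instead of being rebuilt by application logic each time.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a bundle: commands recorded once, replayed many times.
struct Bundle {
    std::vector<std::string> commands;
};

// Hypothetical stand-in for a direct command list.
struct DirectCommandList {
    std::vector<std::string> commands;

    void Record(const std::string& command) { commands.push_back(command); }

    // Replaying a bundle appends its pre-recorded commands;
    // the application does not re-record them.
    void ExecuteBundle(const Bundle& bundle) {
        commands.insert(commands.end(),
                        bundle.commands.begin(), bundle.commands.end());
    }
};

// Record the repeated state setup and draw once, up front.
Bundle RecordDrawBundle() {
    Bundle bundle;
    bundle.commands = {"set pipeline state", "set vertex buffer", "draw"};
    return bundle;
}
```

A frame that draws the same object twice would call `Record` for the per-object state that differs and `ExecuteBundle` for everything that repeats, so the three shared commands are written by the application only once.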
Here we have the numbers in a side-by-side comparison. With true parallelism we see relatively even CPU usage between Thread 0 and Threads 1-3. Threads 1-3 do more work, so we see an increase there when looking at GFX only. Moreover, with the reduced workload on Thread 0 and the new runtime and driver efficiencies, the overall CPU usage is reduced by ~50%. Looking at the application plus GFX, we see a more even split across threads and an overall CPU usage reduction of ~32%.
This wraps up our overview of the new features in D3D 12. We see greater CPU efficiency and greater CPU scalability. D3D 12 gives developers more control over memory usage, resource lifetime and, of course, CPU parallelism. Through D3D 12 we can get ‘Closer to the Metal’: a thinner API and driver with fewer layers, for increased efficiency and performance.
Diagrams and code samples from BUILD 2014 presentation. Created by Max McMullen, D3D Development Lead at Microsoft.