We have finished covering the new Render Context in Microsoft Direct3D* (D3D) 12. We have seen how D3D 12 gives control back to the game, getting it ‘closer to the metal’. Our last discussion covered descriptor heaps and tables, and how they replace bind points to create a more efficient render command pipeline. Yet D3D 12 does more to remove or streamline API churn. Overhead remains in the API that lowers performance, and there are more ways to use the CPU efficiently. What about command sequences? How many of them are repeated, and how can that repetition be made more efficient?
Redundant Render Commands:
Looking at render commands frame over frame, the Microsoft D3D team saw not only a pattern but an opportunity. Modern games have a staggering 90-95% coherence: across an entire frame, only 5-10% of the command sequences are added or deleted. The rest are reused from the previous frame, so the CPU spends most of its cycles issuing the same command sequences again and again.
How can this be more efficient, and why has D3D not tried this until now? Max McMullen, D3D Development Lead at Microsoft, had this to say at BUILD 2014 in April: “It’s very hard to build a way to record commands that is both conformant and reliable. So it behaves the same way, across multiple different GPUs on multiple different drivers, and simultaneously with that, make it performant.” Recording needs to be both reliable and fast. The game needs to count on any recorded command sequence being executed as quickly as it would be if the commands were issued individually. What changed? D3D changed. With the new PSO, descriptor heaps and tables, the state required to record and play back commands is greatly simplified.
A bundle is a small list of commands that is recorded once and reused many times. Bundles can be reused across frames or within a single frame; there are no restrictions on reuse. They can be created on any thread and executed an unlimited number of times. Bundles are not, however, tied to the PSO state. The descriptor table can be updated between executions, so when the bundle is run again with different bindings the game gets a different result. Like a formula in an Excel spreadsheet, the math is always the same, but the result differs based on the source data. To ensure the driver can implement bundles efficiently, certain restrictions exist. Specifically, a bundle cannot contain commands that change the render target, but that still leaves quite a lot of commands that can be recorded and played back.
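The record-once, replay-with-different-bindings idea can be illustrated with a small mock. This is only a sketch and not the D3D 12 API: MockBundle, Bindings and RecordTwoDraws are hypothetical names invented for illustration, and a single integer stands in for the descriptor table the bundle reads at playback time.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the bindings a bundle reads at playback time.
// In D3D 12 this role is played by descriptor tables; here it is one int.
struct Bindings {
    int constantBufferSlot = 0;  // which constant buffer the draws should use
};

// Hypothetical mock of a bundle: a recorded list of commands that can be
// replayed any number of times. This is NOT the D3D 12 API.
class MockBundle {
public:
    using Command =
        std::function<void(const Bindings&, std::vector<std::string>&)>;

    // Appends a command to the recording; only valid before Close().
    void Record(Command cmd) {
        assert(!closed_ && "cannot record into a closed bundle");
        commands_.push_back(std::move(cmd));
    }

    void Close() { closed_ = true; }

    // Replays every recorded command against the bindings passed in now,
    // not the bindings that existed when the commands were recorded.
    void Execute(const Bindings& bindings,
                 std::vector<std::string>& log) const {
        assert(closed_ && "bundle must be closed before execution");
        for (const Command& cmd : commands_) cmd(bindings, log);
    }

private:
    std::vector<Command> commands_;
    bool closed_ = false;
};

// Records two draws once. Each draw looks up the constant buffer slot at
// playback, like a spreadsheet formula reading its source cells.
MockBundle RecordTwoDraws() {
    MockBundle bundle;
    for (int srv = 0; srv < 2; ++srv) {
        bundle.Record([srv](const Bindings& b, std::vector<std::string>& log) {
            log.push_back("draw srv=" + std::to_string(srv) +
                          " cb=" + std::to_string(b.constantBufferSlot));
        });
    }
    bundle.Close();
    return bundle;
}
```

Executing the bundle returned by RecordTwoDraws() twice, with constantBufferSlot changed in between, logs the same two draws each time but with a different constant buffer slot. The recording itself is never touched after Close().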
The left side of the above diagram is a rendering context sample, a series of commands generated by the CPU and passed to the GPU for execution. On the right are two bundles, each containing a command sequence recorded on a different thread for reuse. As the GPU runs the commands it eventually reaches an Execute Bundle command and plays back the recorded bundle. When done, it returns to the command sequence, continues on and finds a second Execute Bundle command. The second bundle is then read and played back before the GPU continues. This is an example of how bundles can be recorded and used to issue the same commands on the GPU many times.
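The control flow described above can be sketched as a walk over a command stream in which some entries refer to recorded bundles. Everything here is a hypothetical model for illustration: StreamEntry, Bundle and PlayBack are invented names, and commands are represented as plain strings rather than real GPU work.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A bundle is modelled as a recorded sequence of command names.
using Bundle = std::vector<std::string>;

// One entry in a hypothetical command stream: either a plain command
// (bundleIndex < 0) or a request to play back bundles[bundleIndex].
struct StreamEntry {
    std::string name;  // used only for plain commands
    int bundleIndex;   // index into the bundle array, or -1
};

// Walks the stream in order. On an execute-bundle entry it plays the whole
// bundle back, then resumes the main stream where it left off.
std::vector<std::string> PlayBack(const std::vector<StreamEntry>& stream,
                                  const std::vector<Bundle>& bundles) {
    std::vector<std::string> executed;
    for (const StreamEntry& entry : stream) {
        if (entry.bundleIndex < 0) {
            executed.push_back(entry.name);  // ordinary command
            continue;
        }
        const std::size_t index = static_cast<std::size_t>(entry.bundleIndex);
        assert(index < bundles.size() && "entry refers to an unknown bundle");
        const Bundle& bundle = bundles[index];
        executed.insert(executed.end(), bundle.begin(), bundle.end());
    }
    return executed;
}
```

Because the stream stores only an index, the same bundle can appear in it any number of times without its commands being recorded again.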
We have gone through the control flow in the GPU. Now we will see how bundles simplify the code.
Example code without Bundles:
Here we begin with a setup stage that sets the pipeline state and descriptor tables. Next come two object draws. Both use the same command sequence; only the constants are different. This pattern is typical of D3D 11 and older code.
// Setup
pContext->SetPipelineState(pPSO);
pContext->SetRenderTargetViewTable(0, 1, FALSE, 0);
pContext->SetVertexBufferTable(0, 1);
pContext->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
// Draw 1
pContext->SetConstantBufferViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->DrawInstanced(6, 1, 0, 0);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->DrawInstanced(6, 1, 6, 0);
// Draw 2
pContext->SetConstantBufferViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pContext->DrawInstanced(6, 1, 0, 0);
pContext->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pContext->DrawInstanced(6, 1, 6, 0);
Example code with Bundles:
Now let’s see the same command sequence with bundles. The first call below creates a bundle; again, this can happen on any thread. In the next stage the command sequence is recorded, using the same commands we saw in the example above.
// Create bundle
pDevice->CreateCommandList(D3D12_COMMAND_LIST_TYPE_BUNDLE, pBundleAllocator, pPSO, pDescriptorHeap, &pBundle);
// Record commands
pBundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
pBundle->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 0, 1);
pBundle->DrawInstanced(6, 1, 0, 0);
pBundle->SetShaderResourceViewTable(D3D12_SHADER_STAGE_PIXEL, 1, 1);
pBundle->DrawInstanced(6, 1, 6, 0);
pBundle->Close();
The code sample above accomplishes the same thing as the non-bundle code. The draw sequence is recorded once rather than reissued for every object, which is how bundles dramatically reduce the number of calls needed to perform the same task. The GPU is still executing the same commands and getting the same result, just more efficiently.
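A rough tally makes the reduction concrete. The non-bundle counts below are read directly from the first listing: 4 setup calls plus 5 calls per object. The bundle counts rest on assumptions, since the listings do not show the playback stage: setup drops to 3 calls because the topology call moved into the bundle, and each object needs one constant buffer table call plus one bundle execution call. Both function names are hypothetical.

```cpp
#include <cassert>

// Per-frame API calls without bundles, counted from the first listing:
// 4 setup calls, then 5 calls for every object drawn.
int CallsWithoutBundles(int objectCount) {
    assert(objectCount >= 0);
    const int setupCalls = 4;  // PSO, render target, vertex buffer, topology
    const int perObject = 5;   // 1 constant buffer table + 2 SRV tables + 2 draws
    return setupCalls + perObject * objectCount;
}

// Per-frame API calls with bundles. ASSUMPTION: playback costs one call per
// object, and the topology call is no longer needed in the per-frame setup.
int CallsWithBundles(int objectCount) {
    assert(objectCount >= 0);
    const int setupCalls = 3;  // PSO, render target, vertex buffer
    const int perObject = 2;   // 1 constant buffer table + 1 bundle execution
    return setupCalls + perObject * objectCount;
}
```

For the two objects in the listings this gives 14 calls per frame without bundles and 7 with them. The one-time cost of creating and recording the bundle is not included, because it is paid once and not every frame.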
We have finished covering how D3D 12 improves CPU efficiency through the use of bundles, descriptor heaps, descriptor tables and the PSO. More control is given to the game, and fewer layers exist between the game and the hardware. Next we will discuss what the D3D development team is doing to increase parallelism in D3D 12, a key component of “Console API efficiency and performance”.
Next up in Part 6: Command Lists
Diagrams and code samples from BUILD 2014 presentation. Created by Max McMullen, D3D Development Lead at Microsoft.