This Intel® Media SDK tutorial sample illustrates the simplest way of implementing HW decode using system memory surfaces.
Throughout this tutorial we will use the Intel® Graphics Performance Analyzers (Intel® GPA) to measure overall GPU load and to capture traces of detailed GPU operations. In this and some of the following tutorial samples we will explain how to use Intel GPA to identify potential performance bottlenecks, and what can be done to utilize the GPU more efficiently and thus improve the performance of the workload.
Let’s take a look at the overall GPU utilization for this tutorial sample using the Intel GPA “Media Performance” dialog. As can be seen, the GPU is only about 72% utilized. For tutorial snapshot benchmarks comparing all workloads analyzed with Intel GPA, navigate to this page. It’s clear that we are not using the GPU to its full capacity. Let’s explore why.
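One way to read a utilization figure like the one above is as the share of the capture window during which a GPU track is busy. The following is a minimal sketch of that calculation, not a description of how Intel GPA computes its numbers: the `Interval` type and the `utilization` function are hypothetical helpers, and the busy intervals would have to be read off a trace by hand.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Intel GPA or Intel Media SDK.
// Each interval is a [start, end) busy period on one GPU track, in ms.
using Interval = std::pair<double, double>;

// Fraction of [window_start, window_end) covered by at least one busy
// interval. Overlapping intervals are only counted once.
double utilization(std::vector<Interval> busy,
                   double window_start, double window_end) {
    if (window_end <= window_start) return 0.0;
    std::sort(busy.begin(), busy.end());
    double covered = 0.0;
    double cursor = window_start;  // time before cursor is already accounted for
    for (const Interval& iv : busy) {
        const double s = std::max(iv.first, cursor);
        const double e = std::min(iv.second, window_end);
        if (e > s) {
            covered += e - s;
            cursor = e;
        }
    }
    return covered / (window_end - window_start);
}
```

Gaps between busy intervals lower the result, which is the same effect that shows up as empty stretches on the GPU tracks of a trace.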
The next screenshot is a close-up segment of the trace captured for this workload using Intel GPA's “Analyze Application” feature.
We can draw the following conclusions by analyzing the decode workload trace:
- The call to DecodeFrameAsync() can be seen in the “MSDK app” track. The call results in a series of operations (D3D9_BeginFrame, several D3DCompBuffer calls, DXVA2_Execute, D3D9_EndFrame) in the connected sub-track, representing the required DXVA decode operations including slice handling. The D3D9_EndFrame call finalizes submission of the decode task to HW. As can be seen in the trace, the call is closely followed by the actual frame decoding on the GPU, represented by the DECODE operation on the “GPU DECODE” track.
- Since the decode workload is implemented in a simple synchronous fashion, the application calls SyncOperation() directly after the DecodeFrameAsync() call completes. A previously decoded frame is ready, so SyncOperation() fetches it. However, because we are using system memory surfaces, the decoded frame must be copied from D3D memory to system memory. The copy operation is represented by “FastCopySSE” in the “simple_decode.exe” track. The copy adds processing overhead, impacting both CPU utilization and performance. The performance impact of the copy operation is partly hidden because Intel Media SDK decode implicitly buffers many decode operations at the start of a decode workload. Note that for workloads that only supply one compressed frame at a time to the decoder, the performance impact of the surface copy operation is much greater.
- It is also clear from the visible gaps in the “GPU MFX Queue” and “GPU DECODE” tracks that the GPU is not fully utilized.
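To make the interaction between the blocking loop, the surface copy, and decoder buffering more concrete, the following is a toy timeline model of the pattern described above. It is a sketch under stated assumptions, not Intel Media SDK code: `LoopModel`, `Result`, and `simulate` are hypothetical names, one CPU thread and one GPU decode engine are assumed, and all times are inputs we make up rather than measurements. The `depth` field stands in for the number of frames the decoder keeps in flight before it returns output.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of a synchronous decode loop (hypothetical, not Media SDK code).
// All times are in milliseconds and are invented inputs, not measurements.
struct LoopModel {
    double submit_ms;  // CPU time spent submitting one frame for decode
    double decode_ms;  // GPU time needed to decode one frame
    double copy_ms;    // CPU time to copy one decoded frame to system memory
    int    depth;      // frames kept in flight before output (0 = no buffering)
};

struct Result {
    double total_ms;         // time until the last frame has been copied
    double gpu_utilization;  // GPU busy time divided by total_ms
};

Result simulate(const LoopModel& m, int frames) {
    assert(frames > 0 && m.depth >= 0);
    std::vector<double> decoded_at(frames, 0.0);
    double cpu = 0.0;       // time at which the application thread is free
    double gpu_free = 0.0;  // time at which the GPU can start the next frame
    for (int i = 0; i < frames; ++i) {
        cpu += m.submit_ms;                            // submit frame i
        const double start = std::max(cpu, gpu_free);  // GPU starts when it can
        gpu_free = start + m.decode_ms;
        decoded_at[i] = gpu_free;
        const int out = i - m.depth;                   // frame returned this pass
        if (out >= 0) {
            cpu = std::max(cpu, decoded_at[out]);      // blocking sync on output
            cpu += m.copy_ms;                          // copy to system memory
        }
    }
    // Drain the frames still buffered when the input ends.
    for (int out = std::max(0, frames - m.depth); out < frames; ++out) {
        cpu = std::max(cpu, decoded_at[out]);
        cpu += m.copy_ms;
    }
    return {cpu, frames * m.decode_ms / cpu};
}
```

With the invented inputs `simulate({0.0, 3.0, 1.0, 0}, 4)`, every copy stalls the GPU and the model reports a utilization of 0.75. Raising `depth` to 1 lets most of the copy overlap with the next decode (about 0.92), and setting `copy_ms` to 0 with `depth` 1 removes the gaps entirely (1.0). These figures only illustrate the trend in this simplified model and should not be compared with the utilization measured above.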
Based on the above analysis, we should be able to improve the performance of the workload by using D3D memory surfaces instead of system memory surfaces. The next tutorial sample will explore such a scenario.
This tutorial sample is found in the tutorial samples package under the name "simple_2_decode". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.