This Intel® Media SDK tutorial sample operates in the same way as the previous "simple_2_decode" sample, except that it uses D3D memory surfaces instead of system memory surfaces.
This tutorial sample introduces the use of Microsoft DirectX*. DirectX is required to allocate D3D (GPU memory) surfaces and to obtain the HW device handle connected to the GPU adapter. Device creation, adapter detection, and D3D surface management are illustrated in the D3D tutorial sample package by code located in the tutorial "common" folder. Note that the Intel® GPA traces and screen captures presented throughout this tutorial were gathered from DirectX 9 workload execution.
Tutorial samples that use D3D surfaces (such as this one) ship with two Microsoft Visual Studio* solution/project (sln/prj) files: one for DirectX 9, created with Microsoft Visual Studio 2010, and one for DirectX 11, created with Microsoft Visual Studio 2012. Microsoft Visual Studio* 2012 is used for DirectX 11 to ensure full DirectX 11.1 environment support.
Since the introduction of Microsoft Windows* 8, Intel Media SDK can be used with DirectX 11 devices and surfaces. Note that Intel Media SDK relies on features that are part of DirectX 11.1, so the DirectX 11 path cannot be used on Microsoft Windows 7. If your target application must run on Microsoft Windows 7, use the DirectX 9 path via Intel Media SDK.
Analyzing the workload with the Intel® GPA “Media Performance” dialog shows much improved overall GPU utilization (~95%), which results in better performance. For tutorial snapshot benchmarks comparing all workloads analyzed with Intel GPA, navigate to this page.
Let’s explore what is going on by studying the Intel GPA workload trace below.
The workload trace looks very different, primarily due to the elimination of system memory surfaces:
- The GPU is now utilized to a much greater extent, as indicated by the queue size in the “GPU MFX Queue” track. The GPU decode task submitted in the “MSDK app” via the DecodeFrameAsync() call is queued and is not processed by the GPU (the “GPU DECODE” track) until it is scheduled according to the queue order.
- As in the previous workload, SyncOperation() is called directly after DecodeFrameAsync() to retrieve the fully decoded frame. Since there is no need to copy from D3D memory to system memory, the SyncOperation() call is more efficient, but there is still a delay because we must wait for the frame to be completely decoded. Frame completion is indicated by the MFX Async Task and DXVA2_Execute calls on the “simple_decode-d3d.exe” track.
- The lack of large gaps in the “GPU MFX Queue” and “GPU DECODE” tracks also shows that the GPU is close to fully utilized.
Are there further opportunities to improve overall decode performance? The GPU is already utilized to a very large extent, so there is not much headroom left, but further improvement may be achieved by making the decode pipeline asynchronous. We'll explore this approach further when we discuss encoding workloads in the following tutorial sections. Improved GPU utilization can also be achieved by executing several decode workloads concurrently.
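To make the effect of an asynchronous pipeline more concrete, here is a minimal C++ sketch of a toy timing model. It is not Intel Media SDK code and does not come from the tutorial package. The `simulate` function, the `Timeline` struct, and the cost parameters are all hypothetical names and values chosen for illustration. The model assumes a single GPU queue processed in submission order, a fixed application-side cost per submit, and a fixed GPU cost per frame. A `depth` of 1 mirrors the submit-then-sync pattern described above, while a larger `depth` lets the application queue further frames before it waits for the oldest one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Result of one modeled run, in abstract time units.
struct Timeline {
    long total_time;  // time at which the last frame has been synchronized
    long gpu_busy;    // time the modeled GPU spent decoding
};

// Toy model (hypothetical, not part of Intel Media SDK):
//   frames      - number of frames to decode
//   depth       - maximum number of submitted-but-unsynchronized frames
//   submit_cost - application time spent preparing and submitting one frame
//   decode_cost - GPU time needed to decode one frame
Timeline simulate(int frames, std::size_t depth, long submit_cost, long decode_cost) {
    assert(frames >= 0 && submit_cost >= 0 && decode_cost >= 0);
    if (depth == 0) depth = 1;   // at least one task must be allowed in flight

    std::deque<long> in_flight;  // completion times of pending tasks, oldest first
    long app_time = 0;           // current time on the application thread
    long gpu_free = 0;           // time at which the GPU has drained its queue

    // Block the application until the oldest pending task has completed.
    auto sync_oldest = [&] {
        app_time = std::max(app_time, in_flight.front());
        in_flight.pop_front();
    };

    for (int i = 0; i < frames; ++i) {
        app_time += submit_cost;                    // submit returns without waiting
        long start = std::max(app_time, gpu_free);  // GPU handles tasks in queue order
        gpu_free = start + decode_cost;
        in_flight.push_back(gpu_free);
        if (in_flight.size() >= depth) sync_oldest();
    }
    while (!in_flight.empty()) sync_oldest();       // drain the remaining tasks

    return {app_time, static_cast<long>(frames) * decode_cost};
}
```

With illustrative costs of 1 unit per submit and 4 units per decode, `simulate(100, 1, 1, 4)` gives a total of 500 units for 400 units of GPU work, whereas `simulate(100, 4, 1, 4)` completes in 401 units because the modeled GPU never waits for the application after the first submit. These figures describe the toy model only and are not measurements of the tutorial workload.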
This tutorial sample is found in the tutorial samples package under the name "simple_2_decode_d3d". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.