This Intel® Media SDK tutorial sample illustrates the simplest way of implementing HW encode using system memory surfaces.
Analyzing the workload using the Intel® GPA “Media Performance” dialog shows an overall GPU utilization of ~70%, indicating that we are clearly not using the GPU to its full extent. For tutorial snapshot benchmarks comparing all workloads analyzed with Intel GPA, navigate to this page.
Let’s look at the Intel GPA workload trace to analyze what is going on:
- The call to EncodeFrameAsync() in the “MSDK app” track is very short and leads to a set of activities in the “simple_encode.exe” track, required to submit the encode operation to the GPU. The first thing to notice is the surface copy from system memory to D3D memory illustrated by the “ippCopyManaged” calls (see “A” in the trace above). This naturally has a CPU performance and utilization impact. The copy is followed by DXVA2_Execute which effectively submits the D3D surface planes to the GPU.
- Since this encode workload operates in a synchronous fashion, the SyncOperation() call waits for a frame to be completely encoded. The current approach is for Intel Media SDK to use polling (the EncodeQuery operation), which queries the GPU to check whether the encode operation has completed. The GPU is queried every 1 ms until the compressed frame is ready (the polling approach will be eliminated in future Intel® HD Graphics driver releases). This is another performance bottleneck: the frame is typically ready before the next polling event, so the application waits longer than necessary (see “B” in the trace above). After EncodeQuery determines that the frame is ready, the encoded bit stream is copied to the bit stream buffer in system memory (a very small delay due to the relatively small size of the buffer).
- The gaps in the “GPU MFX Queue”, “GPU EU Queue”, “04: GPU ENCODE” and “06: GPU ENCODE” tracks show that the GPU is not fully utilized (see “C” and “D” in the trace above). The MFX (encoding) and EU (motion estimation) units sit idle part of the time because operation is purely serial. For a single frame, the encoding stage depends on the preceding motion estimation stage, so we cannot parallelize these operations. Instead, to achieve better GPU utilization we must explore encoding several frames simultaneously.
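To put rough numbers on points “B” through “D”, the C++ sketch below models the two effects described above: the delay added by polling at a fixed interval, and the utilization of the EU and MFX units when frames are processed serially versus overlapped. It is a back-of-the-envelope model, not Intel Media SDK code. The function names, the 5 ms encode time and the 2.5 ms of per-frame overhead are assumptions chosen for illustration; only the 1 ms polling interval and the 6 to 9 ms EU range come from the trace analysis.

```cpp
#include <algorithm>
#include <cassert>

// Time (in microseconds) at which a poller that wakes every `poll_us`
// first notices a job that actually finished at `done_us`.
// Both times are measured from the moment polling started.
long long observed_completion_us(long long done_us, long long poll_us) {
    assert(poll_us > 0 && done_us >= 0);
    // Round up to the next polling tick.
    return ((done_us + poll_us - 1) / poll_us) * poll_us;
}

// Fraction of time each GPU unit is busy (0.0 to 1.0).
struct Utilization {
    double eu;   // motion estimation on the EUs
    double mfx;  // encoding on the MFX unit
};

// Purely serial operation: each frame runs motion estimation, then
// encoding, then pays some overhead (copy, polling delay) before the
// next frame starts. Nothing overlaps.
Utilization serial_utilization(double me_ms, double enc_ms, double overhead_ms) {
    const double frame_ms = me_ms + enc_ms + overhead_ms;
    return { me_ms / frame_ms, enc_ms / frame_ms };
}

// Idealized overlap of several frames: in steady state the two stages
// form a pipeline whose frame period is set by the slower stage.
Utilization pipelined_utilization(double me_ms, double enc_ms) {
    const double period_ms = std::max(me_ms, enc_ms);
    return { me_ms / period_ms, enc_ms / period_ms };
}
```

For example, `observed_completion_us(7300, 1000)` returns 8000: a frame that finishes 7.3 ms after submission is only detected at the 8 ms poll, so 0.7 ms is lost. With the assumed timings, `serial_utilization(7.5, 5.0, 2.5)` keeps the EUs busy only half of the time, whereas `pipelined_utilization(7.5, 5.0)` keeps them fully busy in steady state, which is why encoding several frames simultaneously is worth exploring.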
Before moving on, note that the GPU EU frame encode processing time for this workload varies between 6 and 9 ms (see “E” in the graph). We will compare later results against this benchmark.
Based on the above analysis, we should be able to improve the performance of the workload. First, let’s enhance the workload by using D3D memory surfaces instead of system memory surfaces. The next tutorial section will explore this workload.
This tutorial sample is found in the tutorial samples package under the name "simple_3_encode". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.