In this part of the Intel® Media SDK tutorial, we will explore transcode workloads, starting with the simplest transcode sample, which uses system memory frame surfaces.
The overall GPU load captured using the Intel® GPA “Media Performance” dialog shows that the GPU is not fully utilized: the load is only about 70%. For tutorial snapshot benchmarks comparing all workloads analyzed with Intel GPA, navigate to this page.
Let’s investigate the Intel GPA transcode workload trace below to identify potential bottlenecks:
- As with the workloads we studied earlier, the GPU is clearly not utilized efficiently, as can be seen from the large gaps in the GPU EU, MFX and Queue tracks. The encode and decode tracks are mostly idle because of the strictly serial behavior (decode -> encode motion compensation -> encode): this workload operates synchronously, with no task concurrency.
- As expected, DecodeFrameAsync() (via submission of DXVA2_Endframe) is directly followed by the decode operation in the “GPU Decode” track. Since we are using system memory surfaces, the decoded surface must first be copied to system memory (FastCopySSE). Then, before encode in the “04 GPU Encode” track, the surface is copied back to D3D memory (ippCopyManaged). Both copy operations can be seen in the “simple_transcode.exe” track, and they have a large impact on CPU load and performance.
- As noted when exploring the encode workloads, the “Encode Query” polling method also introduces a slight inefficiency in the pipeline after the GPU has completed the encoding task.
- The performance is also indirectly degraded by the fact that the GPU remains in lower frequency states due to the relatively low GPU activity.
In the following sections we will explore how to enhance Intel Media SDK transcode pipelines for improved GPU utilization leading to better performance.
This tutorial sample can be found in the tutorial samples package under the name "simple_5_transcode". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.