In this part of the Intel® Media SDK tutorial we explore transcode workloads, starting with the simplest transcode sample, which uses system memory frame surfaces.
In this simple implementation there are barriers to full GPU utilization, as with the other examples:
- System memory surfaces add implicit copies when hardware acceleration is used.
- The synchronous implementation leads to less efficient internal scheduling of the decode and encode stages, so there are fewer opportunities to keep the hardware pipeline fully loaded.
Because we are using system memory surfaces, each decoded surface must first be copied from video memory to system memory, and then copied back to video memory before encode. Both copy operations have a large impact on CPU load and performance.
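To make the cost of these two barriers concrete, the following is a minimal back-of-the-envelope model, not code from the tutorial sample and not a use of the Media SDK API. The `StageCosts` struct, both function names, and any stage timings you feed in are hypothetical and chosen only for illustration. The second function describes an idealized pipeline in which the copies disappear and decode of the next frame fully overlaps encode of the current one; a real pipeline will typically fall somewhere between the two results.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-frame stage costs in microseconds (illustrative, not measured).
struct StageCosts {
    std::int64_t decode_us;       // hardware decode of one frame
    std::int64_t copy_to_sys_us;  // video memory -> system memory copy after decode
    std::int64_t copy_to_vid_us;  // system memory -> video memory copy before encode
    std::int64_t encode_us;       // hardware encode of one frame
};

// Synchronous pipeline on system memory surfaces: every stage of a frame,
// including both copies, completes before the next frame is started.
inline std::int64_t sync_sysmem_total_us(const StageCosts& c, std::int64_t frames) {
    if (frames <= 0) return 0;
    return frames * (c.decode_us + c.copy_to_sys_us + c.copy_to_vid_us + c.encode_us);
}

// Idealized overlapped pipeline on video memory surfaces (an assumption of this
// model): no copies, and after the first frame fills the pipeline, each further
// frame costs only as much as the slower of the two stages.
inline std::int64_t overlapped_vidmem_total_us(const StageCosts& c, std::int64_t frames) {
    if (frames <= 0) return 0;
    const std::int64_t slowest_stage = std::max(c.decode_us, c.encode_us);
    return c.decode_us + c.encode_us + (frames - 1) * slowest_stage;
}
```

For example, with assumed costs of 2000 µs decode, 1000 µs per copy, and 4000 µs encode, 100 frames take 800000 µs in the synchronous system memory model and 402000 µs in the idealized overlapped model. The numbers are made up; the point is only that the copies and the lack of overlap both add directly to the per-frame time.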
As noted when exploring the encode workloads, the “Encode Query” polling method also introduces a slight inefficiency in the pipeline after the GPU has completed the encoding task.
Performance also suffers indirectly because the relatively low GPU activity keeps the GPU in lower frequency states.
In the following sections we will explore how to enhance Intel Media SDK transcode pipelines for improved GPU utilization, leading to better performance.
This tutorial sample is found in the tutorial samples package under the name "simple_5_transcode". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.