This Intel® Media SDK tutorial sample keeps multiple encode tasks “in flight” simultaneously, and SyncOperation() is not called until absolutely necessary (when all surface input buffers have been exhausted).
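The submission pattern described above can be sketched as a small simulation. This is an illustration under stated assumptions, not the sample's actual code: `Task`, `PipelineStats`, `findFreeTask`, `syncTask` and `runAsyncPipeline` are hypothetical names, no Media SDK calls are made, and frames are assumed to complete in submission order. In a real pipeline, the submit step would correspond to an asynchronous encode call and the wait step to SyncOperation().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one in-flight encode task. A real task would hold
// a sync point and an output bitstream; a flag and frame number suffice here.
struct Task {
    bool inFlight = false;
    int frame = -1;
};

struct PipelineStats {
    int framesSubmitted = 0;
    int syncsWhileSubmitting = 0;      // waits forced by an exhausted pool
    std::vector<int> completionOrder;  // frame numbers in the order they were synced
};

// Returns the index of a free slot, or -1 when every slot is in flight.
int findFreeTask(const std::vector<Task>& pool) {
    for (std::size_t i = 0; i < pool.size(); ++i) {
        if (!pool[i].inFlight) return static_cast<int>(i);
    }
    return -1;
}

// Simulated wait: where a real pipeline would block in SyncOperation(),
// this records the frame as complete and frees the slot.
void syncTask(Task& task, PipelineStats& stats) {
    stats.completionOrder.push_back(task.frame);
    task.inFlight = false;
}

PipelineStats runAsyncPipeline(int numFrames, std::size_t poolSize) {
    assert(poolSize > 0);
    std::vector<Task> pool(poolSize);
    PipelineStats stats;
    std::size_t oldest = 0;  // slot holding the oldest in-flight frame once the pool is full

    for (int frame = 0; frame < numFrames; ++frame) {
        int slot = findFreeTask(pool);
        if (slot < 0) {
            // Pool exhausted: only now wait, and only for the oldest task.
            syncTask(pool[oldest], stats);
            ++stats.syncsWhileSubmitting;
            slot = static_cast<int>(oldest);
            oldest = (oldest + 1) % poolSize;
        }
        // Simulated asynchronous submit: the call returns immediately.
        pool[slot].inFlight = true;
        pool[slot].frame = frame;
        ++stats.framesSubmitted;
    }

    // Drain: wait for whatever is still in flight, oldest first.
    for (std::size_t i = 0; i < poolSize; ++i) {
        Task& task = pool[(oldest + i) % poolSize];
        if (task.inFlight) syncTask(task, stats);
    }
    return stats;
}
```

With `runAsyncPipeline(10, 4)`, the first four frames are submitted without any wait, the remaining six submissions each wait for the oldest outstanding frame first, and the last four frames are collected in the drain loop.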
Like the "simple_2_decode" tutorial sample, this sample supports both Microsoft DirectX* 9 and DirectX* 11. For more details on this topic, please refer to the "simple_2_decode" sample description.
Let’s first take a look at the overall GPU load using the Intel® GPA “Media Performance” dialog. For tutorial snapshot benchmarks comparing all workloads analyzed with Intel GPA, navigate to this page.
As can be seen, the GPU is now almost fully utilized, which indicates that the asynchronous encoding workload is more efficient. The shorter time to completion for the workload confirms this.
By studying the captured Intel GPA trace we can explore what is going on:
- The first observation that tells us that the GPU is highly utilized is the fact that the “GPU MFX Queue” and “GPU EU Queue” tracks have no gaps, indicating a full GPU queue.
- Another observation is that the GPU EU execution of encode motion estimation algorithms (the “04: GPU ENCODE” track) now overlaps with execution of GPU MFX encode (the “06: GPU ENCODE” track). This is of central importance for improved performance and illustrates the benefits of processing several frame encodes simultaneously.
- The SyncOperation() call still awaits completion of a frame for a relatively long time, but since the GPU is continuously busy encoding frames this is not a performance issue.
- You can see another very important reason for the improved performance by comparing the GPU EU execution time for one frame in the “04: GPU ENCODE” track. Looking back at the “simple_3_encode” benchmark, the execution time varied between 6 and 9 ms. Compare this to the optimized workload, which executes the same amount of work in just 3.5 to 4.5 ms (see “A” in the trace above). How can that be?
The performance gain is due to Intel® Turbo Boost technology¹. For instance, suppose a platform has a base graphics frequency of 650 MHz and a maximum graphics frequency of 1.3 GHz. If an application does not use the GPU continuously, the GPU remains at the base frequency. As GPU usage increases, Intel Turbo Boost technology kicks in and raises the GPU frequency, which shortens the per-frame execution time.
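As a rough plausibility check, the frequency figures from this example can be combined with the measured per-frame times. The sketch below assumes that execution time scales inversely with GPU clock frequency, which is a simplification, and `scaledTimeMs` is a hypothetical helper name. The 650 MHz and 1.3 GHz values are the illustrative figures from the text, not measured frequencies of the benchmark platform.

```cpp
#include <cassert>

// Assumption: execution time is inversely proportional to the GPU clock.
// Real workloads also depend on memory bandwidth and other factors.
double scaledTimeMs(double timeAtBaseMs, double baseMHz, double boostedMHz) {
    assert(baseMHz > 0.0 && boostedMHz > 0.0);
    return timeAtBaseMs * baseMHz / boostedMHz;
}
```

Under that assumption, `scaledTimeMs(6.0, 650.0, 1300.0)` gives 3.0 ms and `scaledTimeMs(9.0, 650.0, 1300.0)` gives 4.5 ms. That range is close to the 3.5 to 4.5 ms observed for the asynchronous workload.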
Are there more opportunities to improve overall performance of encode? Yes, marginal throughput improvements may be achieved by executing several encode workloads concurrently.
This tutorial sample is found in the tutorial samples package under the name "simple_3_encode_d3d_async". The code is extensively documented with inline comments detailing each step required to set up and execute the use case.
1 Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo