In the Intel® Media SDK samples, sample_multi_transcode can be used for single-pipeline or multiple-pipeline transcoding. Each Media SDK transcoding pipeline contains at least one decoding session and one encoding session. More complicated transcoding can also contain several video pre- and post-processing (VPP) sessions between decoding and encoding; since Media SDK allows only one VPP per session, multiple VPP stages have to be split into multiple sessions. Media SDK provides a few API parameters that control pipeline performance, such as AsyncDepth and the join parameter, which can be used to optimize pipeline performance. There is also a threading API that can adjust the number of threads and the priority and scheduler of sessions. This article investigates the effects of AsyncDepth and the join operation in a simple transcoding case on Windows, where no VPP is involved in the pipelines.
Asynchronous operation allows Media SDK to process multiple tasks without syncing after each one, so more tasks can be processed in parallel before a sync is required to free resources. AsyncDepth is the parameter that controls how many asynchronous operations can be queued before syncing. With a larger AsyncDepth, Media SDK allocates more surfaces to support asynchronous processing of multiple tasks. As such, although a larger AsyncDepth allows more tasks to be processed in parallel, it can also exhaust the resources available for further processing.
The join operation allows Media SDK to reuse resources across sessions, thereby reducing the allocation needed for each session. However, joined sessions share common surfaces, which can reduce asynchronous operation among the sessions.
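In sample_multi_transcode, joining is requested per session by appending the -join flag to each line of a par file. As a minimal sketch (the output file names here are placeholders; the other flags follow the experiments below), a two-session joined par file would look like:

-hw -i::h264 park_joy_1080p.h264 -o::h264 joined_1.out -async 2 -f 30 -b 10000 -u 7 -join
-hw -i::h264 park_joy_1080p.h264 -o::h264 joined_2.out -async 2 -f 30 -b 10000 -u 7 -join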
This article shows some experiments on how AsyncDepth and the join operation affect transcoding performance in three scenarios: single-pipeline transcoding, N to N transcoding, and 1 to N transcoding. We will see what balance or compromise between AsyncDepth and the join operation works best in each case.
All the experiments done below are based on Intel® Core™ i7-4770R with Iris™ Pro Graphics 5200 in Windows 8.1 Professional. We used sample_multi_transcode.exe in Samples 18.104.22.168, with Media Server Studio 2015 R6 installed.
In single-pipeline transcoding without VPP, there is only one session, so the join operation does not apply. The command line looks like this:
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_1.out -async 4 -f 30 -b 10000 -u 7
We ran the same transcoding test with various AsyncDepth values; the results are shown in Figure 1.
As shown in the figure above, the single transcode has the worst performance when AsyncDepth = 1, and performance generally improves as AsyncDepth increases. However, when AsyncDepth > 5, there is not much further improvement as AsyncDepth grows. AsyncDepth = 4 or 5 is a good tradeoff for a single-session transcode, since a larger AsyncDepth would require more surfaces without winning much performance.
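The sweep behind Figure 1 simply varies the -async value on an otherwise identical command line; for example, the two ends of a typical range would look like (file names as above):

-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_1.out -async 1 -f 30 -b 10000 -u 7
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_1.out -async 6 -f 30 -b 10000 -u 7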
The reason to use AsyncDepth > 1 is that a single-session transcode has only one decoding session. To fully parallelize the GPU engines, the only option is to submit multiple decoding tasks, and to do that, AsyncDepth needs to be set larger than 1.
In N to N pipeline transcoding, the par file for joined operation looks like this:
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_1.out -async 2 -f 30 -b 10000 -u 7 -join
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_2.out -async 2 -f 30 -b 10000 -u 7 -join
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_3.out -async 2 -f 30 -b 10000 -u 7 -join
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_4.out -async 2 -f 30 -b 10000 -u 7 -join
We first test the case where all N sessions are joined. In N-pipeline transcoding, N decoding tasks are submitted at the beginning, so the decoding engine is busy processing multiple tasks in parallel. If the pipelines are not asynchronous, or when AsyncDepth is very small (such as AsyncDepth = 1), the encoding engine may idle for a while at the very beginning, which can degrade performance. As shown in Figure 2, when AsyncDepth is equal to or larger than 2, there is an obvious performance improvement, especially when the number of pipelines N is less than 12. For 12 or more pipelines, performance continues to improve as AsyncDepth increases to 4 or 5, but there is little further improvement once AsyncDepth is larger than 5.
As a recommendation, for joined multiple transcoding pipelines, AsyncDepth = 2 gives good enough performance. Although setting AsyncDepth > 2 can slightly improve performance, we recommend keeping AsyncDepth < 5.
In non-joined N to N transcoding tests, the results are shown in Figure 3. For fewer than 12 pipelines, a similar performance improvement is seen when AsyncDepth is 2 or larger. For 12 or more pipelines, AsyncDepth > 2 can cause obvious performance degradation. Because non-joined sessions do not share resources and each pipeline has its own, when many asynchronous pipelines are submitted at the beginning, there may be few resources left for Media SDK to allocate the additional surfaces that a larger asynchronous depth requires.
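For reference, the non-joined N to N par file is identical to the joined one except that the -join flag is dropped from every session line; for N = 4:

-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_1.out -async 2 -f 30 -b 10000 -u 7
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_2.out -async 2 -f 30 -b 10000 -u 7
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_3.out -async 2 -f 30 -b 10000 -u 7
-hw -i::h264 park_joy_1080p.h264 -o::h264 temp_4.out -async 2 -f 30 -b 10000 -u 7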
To see how the join operation affects performance, we tested both joined and non-joined transcoding at AsyncDepth = 5; the results are shown in Figure 4.
For a smaller number of sessions (N < 10), joined sessions outperform non-joined sessions, while for N > 10, because of resource limitations, joined sessions no longer have an advantage over non-joined sessions.
In 1 to N pipeline transcoding, there is only one decoding session and N encoding sessions. The par file looks like this:
-hw -i::h264 park_joy_1080p.h264 -o::sink -async 2
-i::source -o::h264 temp_1.out -f 30 -b 10000 -u 7 -async 2
-i::source -o::h264 temp_2.out -f 30 -b 10000 -u 7 -async 2
-i::source -o::h264 temp_3.out -f 30 -b 10000 -u 7 -async 2
-i::source -o::h264 temp_4.out -f 30 -b 10000 -u 7 -async 2
Similar tests were done for 1 to N transcoding as for N to N transcoding; the results are shown in Figure 5 and Figure 6. In 1 to N transcoding, for both joined and non-joined sessions, AsyncDepth has a more variable effect than in N to N transcoding, but AsyncDepth = 2 gives reasonably good performance in both cases.
For the join operation in 1 to N transcoding, as shown in Figure 7, non-joined sessions outperform joined sessions when the number of sessions N < 10, and the advantage disappears as N > 10. The reason non-joined outperforms joined operation is that in 1 to N transcoding only one decoding session runs at the beginning, and decoding is generally faster than encoding. To speed up the N encodings that follow, it makes sense to allocate more independent resources, which means non-joined operation, so that the N independent encodings can happen simultaneously.
So for 1 to N pipeline transcoding, we suggest using non-joined operation and setting AsyncDepth to a small value, such as AsyncDepth = 2, to avoid performance fluctuation.
From the experiments above, we see that in simple transcoding without VPP, the three transcoding scenarios require different settings to achieve the best performance. In single-pipeline transcoding, AsyncDepth > 1 gives better performance, and AsyncDepth = 4 or 5 provides a good balance in resource utilization. In N to N transcoding, joined operation performs better than non-joined when there are fewer than 10 sessions, and AsyncDepth = 2 is a good compromise for both joined and non-joined operation. In 1 to N transcoding, non-joined operation works better than joined; AsyncDepth can cause performance fluctuation in both cases, but AsyncDepth = 2 again provides a good balance.
The experiments in this article are based on the Intel® Haswell architecture; Broadwell adds more GPU engines for decoding, encoding, and VPP. The AsyncDepth recommendation remains roughly the same: do not expect a large AsyncDepth to bring a performance jump. Instead, a small AsyncDepth keeps more resources available for joining a larger number of asynchronous N to N pipelines. For 1 to N pipelines, a small AsyncDepth also helps in non-joined processing.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804