Exploring Video Decoder Parallelism on MTL

Last year we published a paper entitled “Mapping and optimization of the AVS Video Decoder on a high-performance chip multiprocessor” [1]. As the title suggests, its main purpose was to optimize the decoder for the Chinese “Audio Video Standard” (AVS) [2] on Intel’s quad-core i7. In the context of this work we evaluated the performance of different code versions on a varying number of cores, with and without the Hyper-threading and Turbo Boost features.

Unfortunately, at that time the largest number of cores we had access to was four. Figure 1 depicts the results of our experiments on three different full high definition (1920x1080) videos at various bitrates. While real-time Full HD video decoding was achieved, performance deteriorated once Hyper-threading was enabled (i.e. execution with more than four threads on four SMT cores). Intel VTune and its performance counters helped us identify the data and instruction load unit of the cores as the main culprit. However, since we could only test the code on a quad-core machine, we could not be sure whether this was the only cause, or how much better performance we could achieve with more physical cores.

By granting us access to the Many-core Testing Lab (MTL), Intel gave us the opportunity to better explore the parallel nature of the AVS video decoder, and of video decoding in general. With the previous limitation removed, we expected performance to keep improving with more physical cores. However, as initial experiments on the MTL show (Figure 2), performance still deteriorates when the application runs on more than four physical cores. This is an interesting finding: our code does not scale as well as we had expected, which leads us to re-examine both our overall parallelization strategy and particular parallelization decisions, and to try to make them more efficient. Resolutions higher than 1080p are expected to become prevalent in the very near future, and although larger frames inherently offer more parallelism, more work needs to be done to cope with the additional computational workload.
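To make "does not scale as well as we had expected" more concrete, scaling measurements of this kind are often summarized as speedup and parallel efficiency relative to the single-threaded run. The following is only a minimal sketch of that bookkeeping, not the analysis used in [1]: the function names are our own, and the timings in `EXAMPLE_TIMES` are made-up placeholders rather than MTL measurements. The second function inverts Amdahl's law to estimate the serial fraction implied by a measured speedup, which is a simplification that ignores memory and synchronization effects.

```python
def scaling_report(times_by_threads):
    """Return (threads, speedup, efficiency) tuples relative to the 1-thread run.

    times_by_threads maps a thread count to an execution time in seconds
    and must contain an entry for 1 thread.
    """
    baseline = times_by_threads[1]
    report = []
    for threads in sorted(times_by_threads):
        speedup = baseline / times_by_threads[threads]
        efficiency = speedup / threads  # 1.0 would be perfect linear scaling
        report.append((threads, speedup, efficiency))
    return report


def amdahl_serial_fraction(speedup, threads):
    """Estimate the serial fraction s implied by a measured speedup.

    Amdahl's law: speedup = 1 / (s + (1 - s) / n)
    Solved for s:  s = (n / speedup - 1) / (n - 1)
    """
    if threads < 2:
        raise ValueError("need at least 2 threads to estimate a serial fraction")
    return (threads / speedup - 1) / (threads - 1)


# Hypothetical placeholder timings (seconds), NOT measured on the MTL.
EXAMPLE_TIMES = {1: 8.0, 2: 4.0, 4: 2.5, 8: 2.5}

if __name__ == "__main__":
    for threads, speedup, efficiency in scaling_report(EXAMPLE_TIMES):
        print(f"{threads:2d} threads: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

A flat or falling speedup with a sharply dropping efficiency beyond a certain thread count is the pattern that would point to a bottleneck other than the number of physical cores.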

The initial changes we made after these measurements yielded execution times similar to the previous ones for up to four cores, but this time we also observed a small speedup for up to eight cores. Still, more needs to be done to exploit the currently available parallelism, and to apply known techniques or find new ones to expose more [3].

With this first blog post, a big “thank you” goes to Intel for giving us the opportunity to get our hands on the Many-core Testing Lab. Given the aforementioned results and observations, we plan to re-examine our code and proceed with the next steps in optimizing the AVS video decoder, aiming at better thread scalability. Another interesting area of research would be exploring the application’s behavior on varying combinations of cores, taking into account the NUMA architecture of the MTL. Since most video standards are based on the same principles, our final results and findings on AVS may generalize to other video standards.
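As a rough illustration of what "varying combinations of cores" could mean on a NUMA machine, the sketch below selects core IDs either packed onto as few sockets as possible ("compact") or spread round-robin across sockets ("scatter"). Everything here is an assumption made for illustration: the function names, the policy names, and the two-socket topology in the example do not describe the real MTL layout. The pinning helper uses `os.sched_setaffinity`, which Python exposes only on some platforms (notably Linux), so it is guarded accordingly.

```python
import os


def placement(topology, n_threads, policy="compact"):
    """Pick n_threads core IDs from topology, a dict {socket_id: [core_ids]}."""
    sockets = [list(topology[s]) for s in sorted(topology)]
    if policy == "compact":
        # Fill one socket completely before moving on to the next.
        order = [core for cores in sockets for core in cores]
    elif policy == "scatter":
        # Take one core from each socket in turn (round-robin).
        order = []
        longest = max((len(cores) for cores in sockets), default=0)
        for i in range(longest):
            for cores in sockets:
                if i < len(cores):
                    order.append(cores[i])
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    if n_threads > len(order):
        raise ValueError("more threads requested than cores available")
    return order[:n_threads]


def pin_current_process(core_ids):
    """Restrict the calling process to core_ids; returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(core_ids))  # pid 0 means the calling process
        return True
    return False


# Hypothetical topology: 2 sockets with 4 cores each (not the real MTL layout).
EXAMPLE_TOPOLOGY = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

if __name__ == "__main__":
    print(placement(EXAMPLE_TOPOLOGY, 4, "compact"))
    print(placement(EXAMPLE_TOPOLOGY, 4, "scatter"))
```

Running the same decoder under both placements, for the same thread count, would be one way to separate the cost of remote memory accesses from the cost of sharing a single socket's caches and memory bandwidth.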


Figure 1: Sensitivity analysis


Figure 2: Test runs on the MTL for Rush Hour, 20Mbps


References:
[1] Krommydas, Konstantinos; Tsoublekas, George; Antonopoulos, Christos D.; Bellas, Nikolaos, "Mapping and optimization of the AVS video decoder on a high performance chip multiprocessor," 2010 IEEE International Conference on Multimedia and Expo (ICME), pp. 896-901, 19-23 July 2010. doi: 10.1109/ICME.2010.5582558
[2] AVS Workgroup website: www.avs.org.cn/en
[3] Refer to paper in [1] for relevant references
