Exploring Video Decoding Parallelism on MTL

Last year we published a paper entitled "Mapping and optimization of
the AVS video decoder on a high performance chip multiprocessor" [1].
As the title suggests, its main purpose was to optimize the Chinese
Audio Video Standard (AVS) [2] decoder on Intel's quad-core i7. In that
work we evaluated the performance of different code versions on a
varying number of cores, with and without the Hyper-Threading and Turbo
Boost features.

Unfortunately, the largest number of cores we had access to at that
time was four. Figure 1 shows the results of our experiments on three
different full high definition (1920x1080) videos at various bitrates.
Although real-time Full HD video decoding was achieved, we observed
that performance deteriorated once Hyper-Threading was enabled (i.e.
execution with more than four threads on four SMT cores). Intel VTune
and its performance counters helped us identify the data and
instruction load unit of the cores as the main culprit. However, since
we could only test the code on a quad-core machine, we could not be
sure whether this was the only cause, or how much better the
performance would be with more physical cores.

By granting us access to the Many-core Testing Lab (MTL), Intel gave us
the opportunity to better explore the parallel nature of the AVS video
decoder, and of video decoding in general. With the previous limitation
lifted, we would expect better performance with more physical cores.
However, as initial experiments on the MTL show (Figure 2), performance
still deteriorates even though the application now runs on more than
four physical cores. This is an interesting finding: our code does not
scale as well as we had expected. It leads us to re-examine both our
overall parallelization strategy and particular parallelization
decisions, and to try to make them more efficient. Resolutions higher
than 1080p are expected to become prevalent in the very near future;
although larger frames inherently offer more parallelism, more work is
needed to cope with the additional computational workload.

The initial changes we made after these recent measurements yielded
execution times similar to the previous ones for up to four cores, but
this time we also observed a small speedup for up to eight cores. Still,
more needs to be done to exploit the parallelism currently available,
and to apply known techniques, or find new ones, that expose more of
it [3].
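
For readers who want to quantify this kind of scaling behavior, speedup
and parallel efficiency can be derived from wall-clock decoding times
per thread count. The following is only a minimal sketch in Python, not
the tooling used for the measurements in this post: the function name
scaling_report is hypothetical, and the timings in the usage example are
invented placeholders, not results from the MTL.

```python
def scaling_report(times_by_threads):
    """Map each thread count to (speedup, efficiency).

    times_by_threads: dict of thread count -> wall-clock time in seconds.
    It must contain an entry for 1 thread, which serves as the baseline.
    """
    baseline = times_by_threads[1]
    report = {}
    for threads in sorted(times_by_threads):
        speedup = baseline / times_by_threads[threads]
        # Efficiency is the fraction of ideal (linear) speedup achieved.
        report[threads] = (speedup, speedup / threads)
    return report


if __name__ == "__main__":
    # Placeholder timings, invented for illustration only (not measurements).
    example = {1: 8.0, 2: 4.0, 4: 2.0, 8: 2.0}
    for threads, (speedup, efficiency) in scaling_report(example).items():
        print(f"{threads} threads: speedup {speedup:.2f}, "
              f"efficiency {efficiency:.2f}")
```

An efficiency that drops as threads are added, as in the last
placeholder entry, is the kind of pattern that signals a scalability
limit worth investigating.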

With this first blog post, we would like to say a big thank you to
Intel for giving us the opportunity to get our hands on the Many-core
Testing Lab. Given the results and observations above, we plan to
re-examine our code and proceed with the next steps in optimizing the
AVS video decoder, with the aim of achieving greater thread
scalability. Another interesting area of research would be exploring
the application's behavior on varying combinations of cores, taking
into account the NUMA architecture of the MTL. Since most video
standards are based on the same principles, our final results and
findings on AVS may generalize to other video standards.

Figure 1: Sensitivity analysis

Figure 2: Test runs on the MTL for Rush Hour, 20Mbps


References:
[1] Krommydas, Konstantinos; Tsoublekas, George; Antonopoulos, Christos
D.; Bellas, Nikolaos, "Mapping and optimization of the AVS video
decoder on a high performance chip multiprocessor," 2010 IEEE
International Conference on Multimedia and Expo (ICME), pp. 896-901,
19-23 July 2010. doi: 10.1109/ICME.2010.5582558
[2] AVS Workgroup website: www.avs.org.cn/en
[3] See the paper in [1] for relevant references.
