Intel® processor architecture is becoming more and more GPU-centric, unlocking amazing opportunities for much faster performance simply by off-loading media processing from the CPU to the GPU. There are a variety of tools that developers can use to get better performance out of their media applications, including ones that are free and fairly easy to use.
This blog covers:
If you're looking for performance improvements in your multimedia processing but are not sure where to start, experiment with FFmpeg first. Benchmark using software processing, then switch to hardware acceleration and check the performance difference. Then move on to the Intel Media SDK and compare again across different codecs and configurations.
In order to put the relevance of the GPU evolution into perspective, let's reflect back on advances made in CPU architecture.
Stepping back to the 1990s, the first major advance came with Superscalar architecture, which provides greater throughput by exploiting instruction-level parallelism within a single processor.
Figure 1. Superscalar Architecture
Then, the early 2000s brought Multicore Architecture (one CPU with more than one processing unit). Homogeneous cores (all identical) could run multiple threads at once (thread-level parallelism).
Meanwhile, multicore was hitting "walls":
Figure 2. Multicore Architecture
Heterogeneous architecture allows multiple processors to have a pipelined data flow where each can be optimized for separate functions of encoding, decoding, transforming, scaling, interlacing, and more.
In other words, there are significant performance and power advantages that were not available in the past. Figure 3 shows how the GPU has progressed over the past 5 generations, becoming more and more relevant. Whether you are using H.264 or moving to the newer H.265 codec, the GPU provides the processing headroom to make 4K and higher-resolution processing possible and fast.
Figure 3. Evolution of Heterogeneous Architecture
Figure 4 shows the dramatic increase in processing power in just the few generations that the GPU has been on the CPU die. If your application includes any media processing, you need to be offloading to the GPU to take advantage of the ~5X or faster improvement (depending on the age of your system and configuration).
Figure 4. Rapid Graphics Improvements with each Intel Processor Generation
Step 1 usually involves running a benchmark for H.264 so you can compare performance improvements as you tweak your code. FFmpeg is commonly used both for the benchmark and for the initial hardware acceleration comparison. It is a powerful yet fairly easy-to-use tool.
Step 2 involves testing with different codecs and configurations. You can switch to hardware acceleration just by changing the codec from "libx264" to "h264_qsv", which uses Intel® Quick Sync Video.
Note: This blog focuses on running these tools on Windows*. If you are interested in learning about Linux* implementations, see Accessing Intel Media Server Studio for Linux Codecs with FFmpeg.
Start with H.264 (AVC), since libx264 is the default software H.264 encoder in FFmpeg and produces good quality using software alone. Create your benchmark with libx264, then run it again with the codec switched to h264_qsv. H.265 codecs are covered later.
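The benchmark-then-switch step can be scripted. The following is a minimal sketch, not a definitive harness: the function names, the file names, and the choice to drop audio with -an (so only video encoding is timed) are assumptions made for illustration. It only builds the FFmpeg argument list and times a run of it.

```python
import subprocess
import time


def build_encode_cmd(src, dst, codec, preset="medium", ffmpeg="ffmpeg"):
    """Return the ffmpeg argument list for a video-only encode with `codec`."""
    return [
        ffmpeg,
        "-y",            # overwrite the output file without prompting
        "-i", src,       # input file
        "-an",           # assumption: drop audio so only video encoding is timed
        "-c:v", codec,   # "libx264" (software) or "h264_qsv" (Quick Sync Video)
        "-preset", preset,
        dst,
    ]


def time_encode(cmd):
    """Run `cmd` to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start
```

To compare the two encoders, call time_encode(build_encode_cmd("input.mp4", "out.mp4", codec)) once with codec set to "libx264" and once with "h264_qsv", using the same input file and preset for both runs.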
It is important to note that when working with video streams, there are tradeoffs between quality and speed. Faster processing will almost always reduce the quality as well as the file size. You will need to find your quality threshold based on the amount of time spent. There are 11 "presets" for tuning quality vs. speed, from Ultra-fast to Very slow. There are also several rate control algorithms:
Intel Quick Sync Video allows decoding and encoding using the Intel CPU and integrated GPU. Note that the Intel processor needs to be compatible with both Quick Sync Video and OpenCL*. For more information, see the Intel SDK for OpenCL* Applications Release Notes. Support for decoding and encoding is integrated in FFmpeg through codecs with the _qsv suffix. Quick Sync Video is currently supported for MPEG2 video, VC1 (decoding only), H.264, and H.265.
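Before benchmarking, it can help to confirm that your FFmpeg build actually exposes the _qsv encoders. The sketch below filters the output of `ffmpeg -encoders` for names ending in _qsv. The helper names are hypothetical, and the parsing assumes the usual "flags name description" row layout of that listing.

```python
import subprocess


def parse_qsv_encoders(encoders_text):
    """Extract encoder names ending in "_qsv" from `ffmpeg -encoders` output."""
    names = []
    for line in encoders_text.splitlines():
        fields = line.split()
        # Assumed row layout: "<flags> <name> <description ...>"
        if len(fields) >= 2 and fields[1].endswith("_qsv"):
            names.append(fields[1])
    return names


def list_qsv_encoders(ffmpeg="ffmpeg"):
    """Ask the local ffmpeg binary which Quick Sync encoders it was built with."""
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-encoders"],
        capture_output=True, text=True, check=True,
    )
    return parse_qsv_encoders(result.stdout)
```

An empty list from list_qsv_encoders() suggests the build was compiled without libmfx support, in which case h264_qsv will not be available as a codec name.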
If you want to experiment with Quick Sync Video in FFmpeg, you must include libmfx. The easiest way to install the library is to use the libmfx version packaged by lu_zero.
Encoding example with Quick Sync Video hardware acceleration:
>ffmpeg -i INPUT -c:v h264_qsv -preset:v faster out.qsv.mp4
FFmpeg can also take advantage of hardware acceleration for decoding using the -hwaccel option (https://trac.ffmpeg.org/wiki/HWAccelIntro).
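As a rough illustration of a decode-only test, the sketch below builds a command that decodes the input and discards the frames through FFmpeg's null muxer, with or without -hwaccel. The function name is hypothetical, and which -hwaccel values work depends on your build and platform (`ffmpeg -hwaccels` lists them).

```python
def build_decode_cmd(src, hwaccel=None, ffmpeg="ffmpeg"):
    """Return an ffmpeg argument list that decodes `src` and discards the output.

    hwaccel=None gives a software decode; otherwise the value is passed to
    -hwaccel (for example "auto", or a name reported by `ffmpeg -hwaccels`).
    """
    cmd = [ffmpeg]
    if hwaccel is not None:
        # -hwaccel is an input option, so it must come before -i.
        cmd += ["-hwaccel", hwaccel]
    # "-f null -" sends decoded frames to the null muxer, so nothing is written.
    cmd += ["-i", src, "-f", "null", "-"]
    return cmd
```

Timing build_decode_cmd("input.mp4") against build_decode_cmd("input.mp4", hwaccel="auto") isolates the decoding cost from the encoding cost.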
While h264_qsv leaves a lot of performance on the table, you should see that even the slowest hardware-accelerated encode is significantly faster than the fastest, lowest-quality software-only encode.
To test with the H.265 codecs, you will need a build that has libx265 enabled, or you can build your own version by following the instructions in the FFmpeg and H.265 Encoding Guide or the x265 documentation.
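To put a number on "significantly faster", the comparison reduces to simple arithmetic on the measured run times. This is a small sketch with hypothetical helper names. Any values you feed it should come from your own runs, since results vary with the system and configuration.

```python
def frames_per_second(frame_count, seconds):
    """Throughput of one encode: frames processed per second of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return frame_count / seconds


def speedup(software_seconds, hardware_seconds):
    """How many times faster the hardware run was than the software run."""
    if hardware_seconds <= 0:
        raise ValueError("hardware_seconds must be positive")
    return software_seconds / hardware_seconds
```

A speedup above 1.0 means the hardware-accelerated run finished sooner. Compare output quality and file size alongside it, because a faster encode is only a win if the result still meets your quality threshold.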
>ffmpeg -i input -c:v libx265 -preset medium -x265-params crf=28 -c:a aac -strict experimental -b:a 128k output.mp4
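When experimenting with libx265, it is often useful to repeat the command above at several CRF values and compare the resulting file sizes and quality. The sketch below generates one command per CRF value. The helper names, the output file naming, and the particular CRF values are assumptions chosen for illustration.

```python
def build_x265_cmd(src, dst, crf=28, preset="medium", ffmpeg="ffmpeg"):
    """Return the ffmpeg argument list for a libx265 encode at a given CRF."""
    return [
        ffmpeg, "-y",
        "-i", src,
        "-c:v", "libx265",
        "-preset", preset,
        "-x265-params", f"crf={crf}",  # lower CRF: higher quality, larger file
        "-c:a", "aac", "-strict", "experimental", "-b:a", "128k",
        dst,
    ]


def crf_sweep(src, crfs=(24, 28, 32)):
    """Map each CRF value to its command; the outputs are named out.crf<N>.mp4."""
    return {crf: build_x265_cmd(src, f"out.crf{crf}.mp4", crf=crf) for crf in crfs}
```

Each command in the returned dictionary can be run with subprocess.run(cmd, check=True), after which the output sizes can be compared with os.path.getsize.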
If you want a bigger performance boost than FFmpeg alone can deliver, the next step is to optimize your app with the Intel Media SDK. The Media SDK is a cross-platform API for developing and optimizing media applications to take advantage of Intel's fixed-function hardware acceleration.
To get started using the Intel Media SDK, here are some simple steps to follow:
Commands are similar to those in FFmpeg. Examples:
>VideoTranscoding_folder\_bin\x64>sample_multi_transcode.exe -hw -i::mpeg2 in.mpeg2 -o::h264 out.h264
>VideoTranscoding_folder\_bin\x64>sample_multi_transcode.exe -hw -i::mpeg2 in.mpeg2 -o::h265 out.h265
There are many other options to specify in the command line. The -u <quality, speed, balanced> option allows you to set the Target Usage (TU) which is much like the tuning presets that FFmpeg uses. TU = 4 is the default setting. Figure 5 charts typical tradeoffs experienced in the various TU settings.
Figure 5. Example of H264 Performance Characteristics With Respect to Target Usages
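To explore the Target Usage tradeoff yourself, you can generate one sample_multi_transcode command per -u value and time each run. This is a sketch under stated assumptions: the helper name is hypothetical, the executable is assumed to be on the current path, and only the three -u values named above are accepted.

```python
TARGET_USAGES = ("quality", "balanced", "speed")


def build_transcode_cmd(src, dst, in_codec, out_codec,
                        target_usage="balanced",
                        exe="sample_multi_transcode.exe"):
    """Return the argument list for a hardware transcode at one Target Usage."""
    if target_usage not in TARGET_USAGES:
        raise ValueError(f"target_usage must be one of {TARGET_USAGES}")
    return [
        exe,
        "-hw",                  # use hardware acceleration
        f"-i::{in_codec}", src,  # input codec should match the input stream
        f"-o::{out_codec}", dst,
        "-u", target_usage,
    ]


def target_usage_sweep(src, in_codec, out_codec):
    """One command per Target Usage, writing out.<tu>.<out_codec> each time."""
    return {
        tu: build_transcode_cmd(src, f"out.{tu}.{out_codec}",
                                in_codec, out_codec, target_usage=tu)
        for tu in TARGET_USAGES
    }
```

Running the three commands from target_usage_sweep("in.mpeg2", "mpeg2", "h264") on the same input gives you your own version of the speed-versus-quality comparison that Figure 5 charts.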
Augment with other Advanced Intel Software Tools
To help you further tune your code, Intel produces other optimization and profiling tools that may be helpful, including Intel® Graphics Performance Analyzers (GPA) and Intel® VTune™ Amplifier. In addition, Intel® Video Pro Analyzer and Intel® Stress Bitstreams and Encoder can help you get brilliant video quality and streaming, improve encoders and decoders, and reduce validation time so you can bring your solutions to market faster.
Computer architecture has changed tremendously over the past 20 years, but even just the last 5 years have brought major performance gains. Intel CPUs are now being designed to process multimedia directly on the GPU, which opens doors to many new consumer and business usages.
See for yourself with FFmpeg, and optimize code more completely by using the free Intel Media SDK APIs. Moving from software to hardware acceleration will improve the system's performance and reduce power usage (and costs) and provide headroom to consider that eventual move to H.265 codecs.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804