Improve Media App Performance with Hardware Acceleration

Intel® processor architecture is becoming more and more GPU-centric, unlocking amazing opportunities for much faster performance simply by off-loading media processing from the CPU to the GPU. There are a variety of tools that developers can use to get better performance out of their media applications, including ones that are free and fairly easy to use. 

This blog covers:

  • An overview of computing architectures and current Intel GPU capabilities
  • How to implement hardware acceleration using FFmpeg
  • How to implement hardware acceleration using the Intel® Media SDK, or the same component in Intel® Media Server Studio (depending on your platform target)

If you're looking for performance improvements in your multimedia processing but are not sure where to start, experiment with FFmpeg first. Benchmark using software processing, then switch to hardware acceleration and check the performance difference. Then add the Intel Media SDK and compare again across different codecs and configurations.
 

Computing Architecture: From Superscalar to Heterogeneous

To put the evolution of the GPU into perspective, let's look back at the advances made in CPU architecture.

In the 1990s, the first major advance came with superscalar architecture, which provided greater throughput by exploiting instruction-level parallelism within a single processor.

Figure 1. Superscalar Architecture

 

Then, the early 2000s brought Multicore Architecture (one CPU with more than one processing unit). Homogeneous (identical) cores could run multiple threads at once (thread-level parallelism).

Meanwhile, multicore was hitting "walls":

  • Memory wall: the increasing gap between processor and memory speeds
  • ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single core busy
  • Power wall: the trend of power consumption (and heat) rising much faster than operating frequency as clock speeds increase

Figure 2. Multicore Architecture
 

Today's Heterogeneous Architecture: GPU is expanding 

Heterogeneous architecture allows multiple processors to share a pipelined data flow in which each processor can be optimized for a separate function: encoding, decoding, transforming, scaling, interlacing, and more.

In other words, there are significant performance and power advantages that were not available in the past. Figure 3 shows how the GPU has progressed over the past five generations, becoming more and more relevant. Whether you are using H.264 or moving to the newer H.265 codec, the GPU provides the processing headroom to make 4K and higher-resolution processing possible and fast.

Figure 3. Evolution of Heterogeneous Architecture
 

Generations of GPU Performance

Figure 4 shows the dramatic increase in processing power in just the few generations that the GPU has been on the CPU die. If your application includes any media processing, you should offload it to the GPU to take advantage of the ~5X or faster improvement (depending on the age of your system and its configuration).

Figure 4.  Rapid Graphics Improvements with each Intel Processor Generation
 

Getting Started with GPU Programming

Step 1 usually involves running a benchmark for H.264 so you can compare performance improvements as you tweak your code. FFmpeg is commonly used both to benchmark and to make the initial hardware acceleration comparison. It is a powerful yet fairly easy-to-use tool.

Step 2 involves testing with different codecs and configurations. You can switch to hardware acceleration just by changing the codec from "libx264" to "h264_qsv", which uses Intel® Quick Sync Video.

Step 3 adds use of the Intel Media SDK. (Attend an upcoming GPU Acceleration webinar on March 30 for more insight into these topics.) 

Note: This blog focuses on running these tools on Windows*. If you are interested in learning about Linux* implementations, see Accessing Intel Media Server Studio for Linux Codecs with FFmpeg.

Encoding and Decoding with FFmpeg

Start with H.264 (AVC), since libx264 is the default software H.264 encoder in FFmpeg and produces good quality using just software. Create your benchmark, then test again after switching the codec from libx264 to h264_qsv. We'll cover H.265 codecs later.
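A minimal Python sketch of that benchmark is shown below. It assumes ffmpeg is on the PATH and that input.mp4 is a placeholder test clip; the helper names (build_encode_cmd, time_encode, run_benchmark) and file names are illustrative and not part of FFmpeg. The same command is built twice with only the video codec changed, and each run is timed:

```python
import subprocess
import time


def build_encode_cmd(input_path, output_path, codec):
    """Build an FFmpeg argument list that encodes the video stream with `codec`."""
    # -y overwrites an existing output file so repeated runs do not prompt.
    return ["ffmpeg", "-y", "-i", input_path, "-c:v", codec, output_path]


def time_encode(cmd):
    """Run one encode to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def run_benchmark(input_path="input.mp4", codecs=("libx264", "h264_qsv")):
    """Encode the same clip once per codec and return {codec: seconds}."""
    results = {}
    for codec in codecs:
        cmd = build_encode_cmd(input_path, f"out.{codec}.mp4", codec)
        results[codec] = time_encode(cmd)
    return results


# Printing the commands works without FFmpeg installed;
# call run_benchmark() to actually execute and time them.
for codec in ("libx264", "h264_qsv"):
    print(" ".join(build_encode_cmd("input.mp4", f"out.{codec}.mp4", codec)))
```

It is usually worth running each codec several times and comparing typical values, since a first run can include one-time costs such as file caching.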

It is important to note that when working with video streams, there are tradeoffs between quality and speed. Faster settings almost always cost quality at a given bitrate, or bitrate (file size) at a given quality, so you will need to find the quality threshold that is acceptable for the encoding time you can afford. libx264 provides ten "presets" for tuning quality vs. speed, from ultrafast through veryslow, plus placebo. There are also several rate control algorithms:

  • 1-pass constant bitrate (set -b:v)
  • 2-pass constant bitrate
  • Constant Rate Factor (CRF)
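The three rate control modes above map onto FFmpeg options roughly as sketched below for libx264. This is an illustration under assumptions: the helper names, the 2M bitrate, and the file names are made up for the example, while -b:v, -pass, -crf, and -preset are standard FFmpeg/libx264 options. Note that -b:v on its own sets a target bitrate; a strictly constant bitrate needs additional buffer settings that are not shown here.

```python
def one_pass_cmd(input_path, output_path, bitrate="2M", preset="medium"):
    """Single pass aiming at a target video bitrate (-b:v)."""
    return ["ffmpeg", "-y", "-i", input_path, "-c:v", "libx264",
            "-preset", preset, "-b:v", bitrate, output_path]


def two_pass_cmds(input_path, output_path, bitrate="2M", preset="medium",
                  null_sink="NUL"):
    """Two passes: the first only gathers statistics, the second writes the file.

    null_sink is "NUL" on Windows and "/dev/null" on Linux.
    """
    base = ["ffmpeg", "-y", "-i", input_path, "-c:v", "libx264",
            "-preset", preset, "-b:v", bitrate]
    first = base + ["-pass", "1", "-an", "-f", "null", null_sink]
    second = base + ["-pass", "2", output_path]
    return first, second


def crf_cmd(input_path, output_path, crf=23, preset="medium"):
    """Constant Rate Factor: fix the quality level and let the bitrate vary."""
    return ["ffmpeg", "-y", "-i", input_path, "-c:v", "libx264",
            "-preset", preset, "-crf", str(crf), output_path]


# Sweep a few presets at a fixed CRF to see how speed trades against file size.
for preset in ("ultrafast", "medium", "veryslow"):
    print(" ".join(crf_cmd("input.mp4", f"out.{preset}.mp4", preset=preset)))
```

Building the argument lists separately from running them keeps each configuration easy to log next to its timing result.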

Intel Quick Sync Video allows decoding and encoding using the Intel CPU and integrated GPU¹. Note that the Intel processor needs to be compatible with both Quick Sync Video and OpenCL*. For more information, see the Intel SDK for OpenCL* Applications Release Notes. Support for decoding and encoding is integrated in FFmpeg through codecs with the _qsv suffix. Quick Sync Video support currently covers MPEG-2 video, VC-1 (decoding only), H.264, and H.265.

If you want to experiment with Quick Sync Video in FFmpeg, you must include libmfx. The easiest way to install the library is to use the libmfx version packaged by lu_zero.

Encoding example with Quick Sync Video hardware acceleration:

>ffmpeg -i INPUT -c:v h264_qsv -preset:v faster out.qsv.mp4

FFmpeg can also take advantage of hardware acceleration for decoding using the -hwaccel option (see https://trac.ffmpeg.org/wiki/HWAccelIntro).
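As a sketch of a transcode that uses hardware for both decoding and encoding, the helper below (a hypothetical name) places -hwaccel qsv and the QSV decoder before -i, so that they apply to the input, and the QSV encoder after it. Whether the resulting command runs depends on your FFmpeg build and graphics driver, so treat it as a starting point rather than a guaranteed recipe.

```python
def build_qsv_transcode_cmd(input_path, output_path,
                            decoder="h264_qsv", encoder="h264_qsv"):
    """Build an FFmpeg command that decodes and encodes with Quick Sync Video."""
    return [
        "ffmpeg", "-y",
        # Input options: everything before -i applies to the input file.
        "-hwaccel", "qsv", "-c:v", decoder,
        "-i", input_path,
        # Output options: everything after -i applies to the output file.
        "-c:v", encoder,
        output_path,
    ]


print(" ".join(build_qsv_transcode_cmd("input.mp4", "out.qsv.mp4")))
```

The ordering matters: FFmpeg applies each option to the next file named on the command line, so the same -c:v flag selects a decoder before -i and an encoder after it.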

While h264_qsv leaves a lot of performance on the table, you should see that even the slowest hardware-accelerated encode is significantly faster than the fastest, lowest-quality software-only encode.
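To put numbers on that comparison, a small sketch like the following converts measured encode times into frames per second and a speedup ratio. The frame count and timings are hypothetical placeholders, not measurements from this blog.

```python
def frames_per_second(frame_count, elapsed_seconds):
    """Encoding throughput in frames per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return frame_count / elapsed_seconds


def speedup(software_seconds, hardware_seconds):
    """How many times faster the hardware encode finished."""
    if hardware_seconds <= 0:
        raise ValueError("hardware_seconds must be positive")
    return software_seconds / hardware_seconds


# Hypothetical example: a 1800-frame clip, 60 s in software, 12 s in hardware.
frames = 1800
software_time, hardware_time = 60.0, 12.0

print(f"software: {frames_per_second(frames, software_time):.1f} fps")  # 30.0 fps
print(f"hardware: {frames_per_second(frames, hardware_time):.1f} fps")  # 150.0 fps
print(f"speedup:  {speedup(software_time, hardware_time):.1f}x")        # 5.0x
```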

To test the H.265 codecs, you will need a build that has libx265 enabled, or you can build one yourself by following the instructions in the FFmpeg and H.265 Encoding Guide or the x265 documentation.

H.265 example:

>ffmpeg -i input -c:v libx265 -preset medium -x265-params crf=28 -c:a aac -strict experimental -b:a 128k output.mp4

For more information on using FFmpeg and Quick Sync Video, see Cloud Computing Intel QuickSync Video and FFmpeg.
 

Using the Intel Media SDK (sample_multi_transcode)

If you want a bigger performance boost than FFmpeg alone provides, the next step is to optimize your app with the Intel Media SDK. The Media SDK is a cross-platform API for developing and optimizing media applications to take advantage of Intel's fixed-function hardware acceleration.

To get started using the Intel Media SDK, here are some simple steps to follow:

  1. Download the Intel Media SDK  for your target device
  2. Download the Tutorials and read through them to get an idea of how to set up your software using the SDK
  3. Install the Intel Media SDK. If using Linux, access the Linux Installation Guide
  4. Download the SDK sample code so you can experiment with the pre-built sample applications
  5. Build and run the Video Transcoding sample: sample_multi_transcode

Commands are similar to those in FFmpeg. Examples:

>VideoTranscoding_folder\_bin\x64\sample_multi_transcode.exe -hw -i::h264 in.mpeg2 -o::h264 out.h264

>VideoTranscoding_folder\_bin\x64\sample_multi_transcode.exe -hw -i::h265 in.mpeg2 -o::h265 out.h265

  • Note that in order to take advantage of hardware acceleration, you must specify the -hw option in the argument list.
  • This sample also works with the HEVC (H.265) decoder and encoder, but these must be installed from the Intel Media Server Studio Pro edition.

There are many other options you can specify on the command line. The -u <quality, speed, balanced> option sets the Target Usage (TU), which is much like the tuning presets that FFmpeg uses. TU = 4 is the default setting. Figure 5 charts typical tradeoffs experienced across the various TU settings.
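When you want to sweep Target Usage settings, the sample's command line can be assembled programmatically. In this sketch the helper name, executable path, and file names are hypothetical; the -hw, -i::, -o::, and -u options are the ones shown in this section. The numeric values 1, 4, and 7 for quality, balanced, and speed are the commonly documented Media SDK target-usage constants and should be checked against the headers of your SDK version.

```python
# Named Target Usage settings and their assumed numeric equivalents.
TARGET_USAGE = {"quality": 1, "balanced": 4, "speed": 7}


def build_transcode_cmd(exe, in_codec, in_path, out_codec, out_path,
                        usage="balanced", hardware=True):
    """Build a sample_multi_transcode argument list for one transcode."""
    if usage not in TARGET_USAGE:
        raise ValueError(f"usage must be one of {sorted(TARGET_USAGE)}")
    cmd = [exe]
    if hardware:
        cmd.append("-hw")  # required for hardware acceleration
    cmd += [f"-i::{in_codec}", in_path,
            f"-o::{out_codec}", out_path,
            "-u", usage]
    return cmd


# Print one command per Target Usage so each can be timed separately.
for usage, tu in TARGET_USAGE.items():
    cmd = build_transcode_cmd("sample_multi_transcode.exe",
                              "h264", "in.h264", "h264", f"out_{usage}.h264",
                              usage=usage)
    print(f"TU={tu}:", " ".join(cmd))
```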

Figure 5.  Example of H264 Performance Characteristics With Respect to Target Usages

Augment with other Advanced Intel Software Tools

To help you further tune your code, Intel produces other optimization and profiling tools that may be helpful, including Intel® Graphics Performance Analyzer (GPA) and Intel® VTune™ Amplifier. In addition, Intel® Video Pro Analyzer and Intel® Stress Bitstreams and Encoder can help you get brilliant video quality and streaming, improve encoders and decoders, and reduce validation time so you can bring your solutions to market faster.

Summary

Computer architecture has changed tremendously over the past 20 years but even just the last 5 years have brought major performance gains. Intel CPUs are now being designed to process multimedia directly on the GPU, which opens doors to many new consumer and business usages.

See for yourself with FFmpeg, and optimize code more completely by using the free Intel Media SDK APIs. Moving from software to hardware acceleration will improve the system's performance and reduce power usage (and costs) and provide headroom to consider that eventual move to H.265 codecs.

More Resources

  1. Install and Run Intel® Media SDK on Windows
  2. FFMPEG.ORG
  3. Integrate Intel Media SDK with FFMPEG for Mux-demuxing and Audio Encode Decode Usages
  4. Intel Media SDK Tutorials for Client and Server
  5. Intel Graphics Performance Analyzers 
  6. Intel VTune Amplifier 
  7. Intel Media Server Studio 
  8. Accelerate your FFmpeg-based Applications with Intel Quick Sync Video
  9. Intel QuickSync Video and FFmpeg*
  10. Intel QuickSync Video and FFmpeg Installation and Validation
  11. Accessing Intel Media Server Studio for Linux Codecs with FFmpeg
  12. Learn about the Significance of HEVC (H.265)

About the Authors/Contributors

  • Gael Hofemeier is a Senior Software Engineer enabling Business Client and Consumer applications, and a Technical Content writer
  • Jeff McAllister is a Senior Software Engineer/Technical Consulting Engineer enabling Intel Media SDK for OpenCL Applications

 

 

1 See Technical specifications.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

For more complete information about compiler optimizations, see our Optimization Notice.