Google VP9 Optimization

Published on March 25, 2016

Introduction

We’ve all watched a movie or music video and become frustrated as it stalled or buffered right at the best part. Portals including YouTube*, Netflix*, and Amazon* work constantly to give customers the highest quality with the fastest streaming available. Toward that end, Intel gathered a team of senior engineers with very strong performance optimization backgrounds early in 2015 to tune the VP9 video codec for greater performance on the Intel® Atom™ platform.

What is VP9?

  • It is a royalty-free open video format that Google is developing. Compare it to High Efficiency Video Coding (HEVC), which requires a license.
  • It is used for 4K-resolution content on YouTube*, other video services, and some smart TVs.
  • It supports HTML5 playback in Chrome*, Chromium*, Firefox*, and Opera*.

How does it work?

VP9 uses different-sized pixel blocks for encoding and decoding video content. Using a sophisticated algorithm, it compresses raw video data so that it can be streamed over the Internet. Specifically, VP9 can operate on a combination of block sizes from 64x64 down to 4x4 for various levels of detail, which makes it efficient for recreating everything from an open blue sky to the details of your loved one’s face.
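To make the idea of mixed block sizes concrete, here is a minimal Python sketch of a recursive split from 64x64 down to 4x4. It is only an illustration: the variance test, the threshold value, and the function names are assumptions made for this example, and a real VP9 encoder chooses its partitions in a more involved way.

```python
def variance(frame, x, y, size):
    """Pixel variance of the size x size block whose top-left corner is (x, y)."""
    vals = [frame[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def partition(frame, x=0, y=0, size=64, min_size=4, threshold=25.0):
    """Split a block into four quadrants until it is flat enough or 4x4.

    The variance threshold is a hypothetical stand-in for an encoder's
    real decision logic.
    """
    if size == min_size or variance(frame, x, y, size) <= threshold:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += partition(frame, x + dx, y + dy, half, min_size, threshold)
    return blocks

# A flat 64x64 area (the "open blue sky") stays one large block.
flat = [[128] * 64 for _ in range(64)]

# The same area with a detailed 8x8 checkerboard in the top-left corner.
detailed = [row[:] for row in flat]
for r in range(8):
    for c in range(8):
        detailed[r][c] = 255 if (r + c) % 2 else 0

print(len(partition(flat)), len(partition(detailed)))
```

In this sketch the flat area is covered by a single 64x64 block, while the detailed corner is refined down to 4x4 blocks and the untouched regions keep their larger blocks.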

Further, VP9 includes 10 different prediction models to efficiently rebuild pictures during decoding and has a range of features to enhance reliability for streaming video content. This is critical for high-definition video. A full HD image has about 2 million pixels and potentially millions of colors making up an individual frame, with hundreds of thousands of frames making up a movie. Google has already announced plans to use VP9 for 4K content on YouTube*. The Google Play* store will also likely use VP9 for its streaming video service.
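The numbers above can be checked with a few lines of arithmetic. The sketch below assumes a 1920x1080 frame, 8 bits per color channel, 24 frames per second, and a two-hour movie; the frame rate, bit depth, and running time are illustrative assumptions rather than figures from this article.

```python
# Assumed parameters for the illustration (not taken from the article).
WIDTH, HEIGHT = 1920, 1080      # full HD frame
BITS_PER_CHANNEL = 8            # three channels of 8 bits each
FPS = 24
MOVIE_MINUTES = 120

pixels_per_frame = WIDTH * HEIGHT                    # about 2 million pixels
possible_colors = 2 ** (3 * BITS_PER_CHANNEL)        # "millions of colors"
frames_per_movie = FPS * 60 * MOVIE_MINUTES          # "hundreds of thousands"
raw_bytes = pixels_per_frame * 3 * frames_per_movie  # uncompressed size

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{possible_colors:,} possible colors per pixel")
print(f"{frames_per_movie:,} frames per movie")
print(f"{raw_bytes / 1e12:.2f} TB uncompressed")
```

Under these assumptions the uncompressed movie comes to roughly a terabyte, which is why an efficient codec matters for streaming.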

Planning the Approach

One of the first objectives for our team was to define our test case. VP9 is relevant across various types of streaming video, but we noted that generally a videoconference changes less, frame-to-frame, than a YouTube video. We decided to start with the simpler videoconference case. Our plan was to verify key optimizations in videoconferencing and then test them against more complex video usages.

Our first architecture target was Bay Trail under 32-bit Android*. Once we achieved the best decode performance we could on Bay Trail, we would then focus on other platforms.

Meeting the Challenge

Our team, which included a principal engineer and an architect, used a CPU simulator to identify hotspots in the existing code. We took an iterative approach to optimization: identifying possible issues using internal tools and microarchitectural knowledge, coding a new solution and measuring its performance, and then returning to code review. Specifically, during WebM/libvpx optimization we came across a lot of front-end-related issues, which we describe below. These issues are easy to spot and fix, and fixing them might provide substantial performance improvements to your application.

Front-End starvation due to MSROM instruction flow

One of the performance issues we found on the Silvermont microarchitecture was excessive use of pshufb instructions. According to the optimization manual [1], pshufb requires decoder assistance from the MSROM (microcode sequencer ROM) and has 5-cycle throughput and latency.

An MSROM lookup creates a delay in the front end and limits the number of instructions decoded per cycle. In many cases the MSROM lookup penalty is acceptable (when the back end cannot consume and execute uops at a higher rate), but if the number of such instructions is high, performance can suffer because the IDQ (instruction decode queue) runs out of uops.

We found that excessive pshufb instructions in the “vp9_filter_block1d16_h8_ssse3” function [2] were creating the issue explained above. In general, pshufb instructions rearrange bytes in the vector register based on an arbitrary mask.

We drilled down to the actual operation that was required (see diagram below) using two pshufb instructions.

We realized that exactly the same operation can be done using just four simple operations (punpcklbw, punpckhbw, and two palignr instructions) as shown on the next diagram.

The optimized code can be found in [3]; we measured a 15% speedup at the function level.
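The replacement described above can be illustrated with a small byte-level emulation in Python. This is a sketch under assumptions: the two shuffle masks below are hypothetical examples of the "adjacent byte pairs" pattern that an 8-tap filter needs, and the helper functions only model the SSE instructions on 16-element lists. For the exact masks and instruction sequence, see the libvpx sources in [2] and [3].

```python
def pshufb(src, mask):
    """Byte shuffle: out[i] = src[mask[i] & 15], or 0 if mask bit 7 is set."""
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

def punpcklbw(dst, src):
    """Interleave the low 8 bytes of dst and src: d0, s0, d1, s1, ..."""
    return [b for pair in zip(dst[:8], src[:8]) for b in pair]

def punpckhbw(dst, src):
    """Interleave the high 8 bytes of dst and src: d8, s8, d9, s9, ..."""
    return [b for pair in zip(dst[8:], src[8:]) for b in pair]

def palignr(dst, src, imm):
    """Concatenate dst:src (src in the low half), shift right by imm bytes."""
    return (src + dst)[imm:imm + 16]

# Hypothetical masks: pairs (x0,x1),(x1,x2),... and (x2,x3),(x3,x4),...
MASK_A = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8]
MASK_B = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10]

def pairs_with_pshufb(x):
    """Two pshufb instructions, each needing MSROM assistance on Silvermont."""
    return pshufb(x, MASK_A), pshufb(x, MASK_B)

def pairs_without_pshufb(x):
    """Same result from punpcklbw, punpckhbw, and two palignr instructions."""
    lo = punpcklbw(x, x)   # x0, x0, x1, x1, ..., x7, x7
    hi = punpckhbw(x, x)   # x8, x8, x9, x9, ..., x15, x15
    return palignr(hi, lo, 1), palignr(hi, lo, 5)

pixels = list(range(100, 116))
print(pairs_with_pshufb(pixels) == pairs_without_pshufb(pixels))
```

The trick in this sketch is that duplicating every byte first turns a shift by an odd number of bytes into the overlapping pair pattern, so no arbitrary-mask shuffle is needed.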

8+ byte instructions

We identified another front-end feature that was limiting performance. However, unlike the previous example, it was caused by a characteristic of the microarchitecture: “The Silvermont microarchitecture can only decode one instruction per cycle if the instruction exceeds 8 bytes” [1].

In this particular case, the mb_lpf_horizontal_edge_w_sse2_16 [4] code suffered from heavy vector register pressure and multiple loads/stores to the stack, which made operations with a memory source operand necessary.

Addressing the stack through the rsp register added a SIB byte to each instruction encoding and produced a large number of instructions longer than 8 bytes. As a result, front-end throughput was limited (just slightly higher than 1 instruction decoded per cycle) and could not sustain good back-end utilization of 2 vector instructions per cycle.

We fixed this issue by using the rbp register to address the stack, which does not add a SIB byte to the instruction, and achieved a 20% function-level speedup. Patch submission is still pending.
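The effect of the SIB byte on instruction length can be sketched in Python by assembling the operand bytes by hand. The instruction (a movdqa load), the register xmm1, and the stack offset 0x100 are hypothetical choices for illustration and are not taken from the libvpx code; the helper names are likewise made up for this example and only cover base registers 0-7 with xmm0-xmm7.

```python
# Register numbers as used in x86-64 ModRM/SIB encoding.
RSP, RBP = 4, 5

def mem_operand(reg, base, disp):
    """Return the ModRM (+SIB) (+displacement) bytes for [base + disp]."""
    if disp == 0 and base != RBP:
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, disp.to_bytes(1, "little", signed=True)
    else:
        mod, disp_bytes = 0b10, disp.to_bytes(4, "little", signed=True)
    modrm = (mod << 6) | ((reg & 7) << 3) | (base & 7)
    # r/m = 100b does not mean "rsp"; it means "a SIB byte follows".
    # SIB 0x24 then encodes base = rsp with no index register.
    sib = bytes([0x24]) if base == RSP else b""
    return bytes([modrm]) + sib + disp_bytes

def movdqa_load(xmm, base, disp):
    """Encode 'movdqa xmm<n>, [base + disp]' for xmm0-xmm7 (no REX prefix)."""
    return bytes([0x66, 0x0F, 0x6F]) + mem_operand(xmm, base, disp)

via_rsp = movdqa_load(1, RSP, 0x100)   # movdqa xmm1, [rsp + 0x100]
via_rbp = movdqa_load(1, RBP, 0x100)   # movdqa xmm1, [rbp + 0x100]
print(via_rsp.hex(" "), "->", len(via_rsp), "bytes")
print(via_rbp.hex(" "), "->", len(via_rbp), "bytes")
```

With these assumed operands the rsp-based load is 9 bytes and the rbp-based load is 8 bytes, so only the former crosses the 8-byte limit quoted from the manual.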

4th prefix

The last, but not least important, microarchitectural feature we had to work around in the code was the 3-cycle decoding delay for instructions that have 4 or more prefixes. The Silvermont architecture can decode up to 2 instructions per cycle, or up to 6 instructions in 3 cycles. Decoding just one instruction in 3 cycles can therefore often create performance issues.

Quite often a 4th prefix is added when any of the upper 8 xmm registers (xmm8-xmm15) are used with some of the vector instructions, because using those registers adds an extra REX prefix to the instruction. The pmaddubsw instruction is a good example of such a case [5]. In particular, “pmaddubsw xmm1, xmm2” results in the encoding “66 0F 38 04 …”, while “pmaddubsw xmm1, xmm8” adds the REX prefix: “66 41 0F 38 04 …”

We were able to achieve a 25% speedup at the function level by limiting vector register usage for those particular instructions.
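The two encodings quoted above can be reproduced with a short Python sketch that builds the register-to-register form of pmaddubsw by hand. The function names are hypothetical, and counting the 0F 38 escape bytes together with 66 and REX is an assumption about how the "4th prefix" total is reached.

```python
def encode_pmaddubsw(dst, src):
    """Encode 'pmaddubsw xmm<dst>, xmm<src>' (register-register form)."""
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("SSE code can only address xmm0-xmm15")
    # REX.R extends the destination (ModRM.reg), REX.B the source (ModRM.rm).
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)
    leading = [0x66] + ([rex] if rex != 0x40 else []) + [0x0F, 0x38]
    return bytes(leading + [0x04, modrm])

def leading_byte_count(encoding):
    """Bytes before the final opcode byte (04) and the ModRM byte."""
    return len(encoding) - 2

low = encode_pmaddubsw(1, 2)    # pmaddubsw xmm1, xmm2
high = encode_pmaddubsw(1, 8)   # pmaddubsw xmm1, xmm8
print(low.hex(" "), "->", leading_byte_count(low), "leading bytes")
print(high.hex(" "), "->", leading_byte_count(high), "leading bytes")
```

Under that counting assumption, keeping both operands in xmm0-xmm7 leaves three leading bytes, while touching any of xmm8-xmm15 brings the total to four.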

Performance Results

The overall results were outstanding. The team improved user-level performance by up to 16 percent (6.2 frames per second) in 64-bit mode and by about 12 percent (1.65 frames per second) in 32-bit mode. This testing included evaluation of 32-bit and 64-bit GCC and Intel® compilers, and concluded that the Intel compilers delivered the best optimizations by far for Intel® Atom™ processors. When you multiply this improvement by millions of viewers and thousands of videos, it is significant. The WebM team at Google also recognized this performance gain as extremely significant. Frank Gilligan, a Google engineering manager, responded to the team’s success: “Awesome. It looks good. I can’t wait to try everything out.” Testing against the Intel Atom Goldmont platform, the VP9 optimizations delivered additional gains.

While the video decoder was already highly optimized on the Intel® Core™ processor, it wasn’t very efficient on Intel® Atom™ architecture. Now, watching YouTube content or video conferencing can lead to a better user experience with higher quality and no lag. This effort was also important because of the increasing number of YouTube users who will enjoy watching video on Intel® architecture-based devices. The team also identified and fixed possible performance issues on 64-bit Intel Atom platforms before anyone could experience them on end-user devices.

The success has proved durable. Our performance team delivered significant, long-lasting improvements in end-user experience. After meeting and exceeding goals on 32-bit Android, our team expanded our horizons and extended our scope to include the 64-bit environment in anticipation of 64-bit Android.

About the Authors

Tamar Levy and Ilya Albrekht are software engineers on the Client Platform Enabling Team in the Software and Services Group working with individual customers on projects for Windows* and Android media performance-enabling applications.

Related References

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

[2] https://chromium.googlesource.com/webm/libvpx/+/v1.3.0/vp9/common/x86/vp9_subpixel_8t_ssse3.asm#778

[3] https://chromium.googlesource.com/webm/libvpx/+/v1.5.0/vpx_dsp/x86/vpx_subpixel_8t_ssse3.asm#364

[4] https://chromium.googlesource.com/webm/libvpx/+/v1.3.0/vp9/common/x86/vp9_loopfilter_intrin_sse2.c#373

[5] https://chromium.googlesource.com/webm/libvpx/+/v1.4.0/vp9/common/x86/vp9_subpixel_8t_intrin_ssse3.c#237

[6] https://software.intel.com/en-us/intel-stress-bitstreams-and-encoder/reviews

[7] https://software.intel.com/content/www/us/en/develop/articles/video-quality-caliper-quick-overview.html

[8] http://www.webmproject.org/vp9

[9] http://www.tomsguide.com/us/what-is-vp9-4k-streaming,news-18221.html

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804