Tencent (Tencent Technology Company Ltd) integrated the Intel® Media SDK to optimize performance and reduce power consumption of its video conferencing app, QQ*. The app went from a maximum resolution of 480p at a low frame rate to 720p resolution at 15-30 frames per second (fps) while consuming only 35% of the original amount of power. It now also supports 4-way conferencing while lowering CPU utilization from 80% to less than 20%, reducing power consumption from 14 W to 6 W, and cutting RAM usage in half.
These techniques, which use the hardware acceleration of Intel® graphics to optimize the entire pipeline from camera capture through decoding, encoding, and final display, can also be used by other media applications.
Tencent QQ is a popular instant messaging service for mobile devices and computers. QQ boasts a worldwide base of more than one billion registered users and is particularly popular in China. QQ has more than 100 million people logged in at any time and offers not only video calls, voice chats, rich texting, and built-in translation (text) but also file and photo sharing.
Like all video on the Internet, QQ performs best when there’s plenty of data bandwidth available, but video conferencing is bi-directional, so both uplink and downlink speeds are important. Unfortunately, in many countries, including China, uplink speed may be only 512 kbps. So to please customers, Tencent needed good compression and low latency while still leaving CPU and RAM bandwidth available for multitasking. Plus, the devices need to remain cool and power efficient while balancing high quality with available bandwidth.
So Tencent engineers worked with Intel engineer Youwei Wang to first diagnose the bottlenecks and power consumption of their app and then improve the performance of the data flow pipeline. The main changes involved using the CPU and GPU in parallel to increase performance and making major memory handling changes to decrease memory usage, both of which provided a significant decrease in power consumption.
This article details how the improvements were accomplished using the special features of Intel® processors by integrating the Intel Media SDK and using Intel® Streaming SIMD Extensions (Intel® SSE4) instructions.
Performance and Power Analysis Tools
Significant data capture and analysis can be done using tools currently available free on the Internet. From Microsoft, the team used the Windows* Assessment and Deployment Kit (Windows ADK) (available at http://go.microsoft.com/fwlink/p/?LinkID=293840), which includes:
- Windows Performance Analyzer (WPA)
- Windows Performance Toolkit (WPT)
- Windows Performance Recorder (WPR)
The Intel® tools used were:
- Intel® Performance Bottleneck Analyzer https://software.intel.com/en-us/articles/intel-performance-bottleneck-analyzer
- Graphics Performance Analyzers https://software.intel.com/en-us/vcsource/tools/intel-gpa
- Intel® Power Gadget https://software.intel.com/en-us/articles/intel-power-gadget-20
- Battery Life Analyzer https://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=19351
The Intel® Media Software Development Kit (Intel® Media SDK)
The Intel® Media SDK is a cross-platform API that includes features for video editing and processing, media conversion, streaming and playback, and video conferencing. The SDK makes it easy for developers to optimize applications for Intel® HD Graphics hardware acceleration, which is available starting with the 2nd generation Intel® Core™ processors as well as the latest Intel® Celeron® and Intel® Atom™ processors.
Features of the Intel® Media SDK include:
- Low Latency Encode and Decode
- Allows dynamic control of bit rate via filter settings shown in the UI including
mfxVideoParam::AsyncDepth (limits internal frame buffering and forces per frame sync)
mfxInfoMFX::GopRefDist (stops use of B frames)
mfxInfoMFX::NumRefFrame (can set to only use previous P-frame)
(extends buffer, can set to show frame immediately)
- Dynamic Bit Rate and Resolution Control
- Adapts target and max Kbps to actual bandwidth at any time, OR customizes bit rate encoding per frame with the Constant Quantization Parameter (CQP) DataFlag.
- Reference List Selection
- Uses client side frame reception feedback to adjust reference frames, can improve robustness and error resilience.
- Provides 3 types of Lists: Preferred, Rejected, and Long Term.
- Reference Picture Marking Repetition SEI Message
- Repeats the decoded reference picture marking syntax structures of earlier decoded pictures to maintain status of the reference picture buffer and reference picture lists - even if frames were lost.
- Long Term Reference
- Allows temporal scalability through use of layers providing different frame rates.
- MJPEG decoder
- Accelerates H.264 encode/decode and video processing filters. Allows delivery of NV12 and RGB4 color format decoded video frames.
- Blit Process
- Option to combine multiple input video samples into a single output frame. Then post-processing can apply filters to the image buffer (before display) and use de-interlacing, color-space conversion, and sub-stream mixing.
- Hardware-accelerated and software-optimized media libraries built on top of Microsoft DirectX*, DirectX Video Acceleration (DXVA) APIs, and platform graphics drivers.
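The low-latency settings named at the top of the feature list can be pictured as a small configuration step. The sketch below is illustrative only and does not include the Intel Media SDK headers: `LowLatencySettings` and `make_low_latency_settings` are hypothetical names, the struct fields merely mirror the SDK field names from the list, and the value 1 for each field is an assumption based on the parenthetical descriptions rather than a setting quoted from Tencent's code.

```cpp
#include <cstdint>

// Hypothetical stand-in for the three encoder fields named in the list.
// In the SDK they belong to mfxVideoParam and mfxInfoMFX; this struct only
// mirrors their names so the sketch compiles without the SDK installed.
struct LowLatencySettings {
    std::uint16_t AsyncDepth;   // operations that may be queued before a sync
    std::uint16_t GopRefDist;   // distance between I/P frames (1 = no B frames)
    std::uint16_t NumRefFrame;  // number of reference frames
};

// Returns settings aimed at minimum latency (assumed values, see lead-in).
inline LowLatencySettings make_low_latency_settings() {
    LowLatencySettings s{};
    s.AsyncDepth = 1;   // sync after every frame instead of buffering
    s.GopRefDist = 1;   // I and P frames only, so no reordering delay
    s.NumRefFrame = 1;  // predict only from the previous frame
    return s;
}
```

In a real integration these values would be written into the SDK's own parameter structures before the encoder is initialized.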
Understanding the Video Pipeline
Sending video data between devices is more complex than most people imagine. Figure 3 shows the key steps that the QQ app takes to send video data from a camera (device A) to the user’s screen (device B).
Figure 3: Serial processing
As you can see, many steps require data format conversion or ‘data swizzling’. When these are handled serially in the CPU, significant latency occurs. The pre-optimized version of QQ had limited pre- and post-processing. But since each packet of data is independent of the next, the Intel Media SDK can parallelize the tasks, split them between CPU and GPU, and optimize the flow.
Figure 4: Optimized multi-thread flow
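Because each frame is independent of the next, one stage can work on frame N+1 while another stage is still busy with frame N. The following is a minimal sketch of that idea using two CPU threads and a queue; it does not use the Intel Media SDK or the GPU, and `FrameQueue`, `convert`, `encode`, and `run_pipeline` are hypothetical names with placeholder work inside them.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Frame = std::vector<std::uint8_t>;

// Minimal thread-safe queue that hands frames from one stage to the next.
class FrameQueue {
public:
    void push(Frame f) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(f));
        }
        ready_.notify_one();
    }
    // Signals that no more frames will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until a frame is available; returns false once closed and empty.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Frame> items_;
    bool closed_ = false;
};

// Placeholder for a format conversion step: inverts every byte.
inline Frame convert(Frame f) {
    for (auto& b : f) b = static_cast<std::uint8_t>(255 - b);
    return f;
}

// Placeholder for an encode step: returns the sum of the bytes.
inline std::uint32_t encode(const Frame& f) {
    std::uint32_t sum = 0;
    for (auto b : f) sum += b;
    return sum;
}

// Runs conversion and encoding on separate threads so the two stages overlap.
inline std::vector<std::uint32_t> run_pipeline(const std::vector<Frame>& captured) {
    FrameQueue converted;
    std::vector<std::uint32_t> results;
    std::thread convert_stage([&] {
        for (const Frame& f : captured) converted.push(convert(f));
        converted.close();
    });
    std::thread encode_stage([&] {
        Frame f;
        while (converted.pop(f)) results.push_back(encode(f));
    });
    convert_stage.join();
    encode_stage.join();
    return results;
}
```

With a single producer and a single consumer the queue preserves frame order, so the output sequence matches the capture sequence.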
Changing SIMD instructions
Another major improvement came from replacing the older Intel SIMD instruction set (MMX) with the Intel® Streaming SIMD Extensions (Intel® SSE4) instructions. This doubled the data handled per instruction by moving from the 64-bit MMX registers, which are shared with the standard floating-point registers (and where two 32-bit integers can be swizzled simultaneously), to 128-bit registers (using the _mm_store_si128 functions). Besides providing larger registers, Intel SSE uses a dedicated register set that is separate from the floating-point registers. This means the processor can work on multiple data elements with a single instruction, which greatly improves data throughput and execution efficiency. The change from MMX to Intel SSE4 calls alone increased QQ performance 10x. (See Additional References at the end of this article for more information on how to rewrite copy functions using Intel SSE4 and conversion of SIMD instructions.)
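As a rough illustration of a copy loop built on 128-bit registers, the sketch below moves 16 bytes per iteration and uses `_mm_store_si128` when the destination is 16-byte aligned. It is not Tencent's code: `copy_128` is a hypothetical name, and the sketch uses only SSE2-level intrinsics (which include `_mm_store_si128`) so that it builds without extra compiler flags. The SSE4 copy routines the article refers to are covered in the linked references.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics, including _mm_store_si128

// Copies n bytes from src to dst, 16 bytes at a time.
// The buffers must not overlap.
inline void copy_128(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<std::uint8_t*>(dst);
    const auto* s = static_cast<const std::uint8_t*>(src);
    // If dst starts on a 16-byte boundary, every 16-byte step stays aligned.
    const bool dst_aligned = (reinterpret_cast<std::uintptr_t>(d) % 16) == 0;

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        if (dst_aligned) {
            _mm_store_si128(reinterpret_cast<__m128i*>(d + i), v);   // aligned store
        } else {
            _mm_storeu_si128(reinterpret_cast<__m128i*>(d + i), v);  // unaligned fallback
        }
    }
    // Copy the tail that does not fill a whole 128-bit register.
    std::memcpy(d + i, s + i, n - i);
}
```

The aligned store faults on a misaligned address, which is why the sketch checks alignment first and falls back to the unaligned form.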
Additionally, Tencent was using C libraries to do the many large memory copies for each frame, which was too slow for HD video. The code was changed to use system memory only for the software pipeline and the hardware pipeline was changed so that D3D surfaces handle all the sessions/threads. For copies between system memory and the D3D surface, the engineers used the Intel SSE and Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions to decrease any unnecessary memory copies in the pipeline.
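One detail that matters when copying between a D3D surface and system memory is that the rows of a locked surface are generally padded, so the distance between row starts (the pitch) can be larger than the visible row width. This is general Direct3D behavior rather than something the article spells out. The sketch below shows a pitch-aware plane copy; `copy_plane` and its parameters are hypothetical, and plain `memcpy` stands in for the SIMD row copy the engineers used.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies a plane of `rows` rows, each `width_bytes` wide, between two buffers
// whose rows may be padded. The pitches are the byte distances between the
// starts of consecutive rows in each buffer.
inline void copy_plane(std::uint8_t* dst, std::size_t dst_pitch,
                       const std::uint8_t* src, std::size_t src_pitch,
                       std::size_t width_bytes, std::size_t rows) {
    if (dst_pitch == width_bytes && src_pitch == width_bytes) {
        // No padding on either side: the plane is contiguous, copy it at once.
        std::memcpy(dst, src, width_bytes * rows);
        return;
    }
    for (std::size_t y = 0; y < rows; ++y) {
        // Skip the padding at the end of each row.
        std::memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
    }
}
```

Collapsing to a single copy when neither buffer is padded avoids per-row call overhead, which is one way to keep the number of copy operations per frame low.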
Using Dynamic Features of the Intel® Media SDK
Another improvement was to use the proper codec level when encoding (doing MJPEG decoding in the GPU). The team used the Intel Media SDK dynamic buffer and dynamic bit and frame rates, which decreased latency and reduced buffer use. Adding pre- and post-processing in hardware improved compression, which helped performance on low-bandwidth networks.
For the user experience, the teams added de-noise in the preprocessing and used post processing to adjust colors (hue/saturation/contrast). By also using the integrated skin tone detection and face color adjustment, user experience was greatly improved.
Figure 5: Optimized Skin Tones
Changing Reference Frames
Regardless of the efficiency of the encode and decode processing, the user’s experience in a video conference will suffer if the network connection can’t consistently deliver the data. Without data, the decoder will skip ahead to a new reference frame (since the incremental frames arrived late or were missing). Both frame type selection and accurate bit-rate control are necessary for a stable bit-stream transfer. Tencent found that setting I-frames to 30% of bandwidth gave the best balance. In addition, the Intel Media SDK allowed the elimination of B frames and allows changes to the maximum frame size and the buffer size.
Moving away from only using I-Intra frames and P-Inter frames, the new SP frames in H.264 allow switching between different bit rate streams without requiring an intra-frame. Tencent moved to using SP frames between P frames (reducing the importance of the P frame) and allowed dynamic adjustment to get the best balance between network conditions and video quality.
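To make the 30% I-frame figure concrete, the sketch below splits the bits available for one group of pictures (GOP) between its I-frame and the remaining frames. The interpretation is an assumption: the article does not say how the 30% was measured, so this treats it as the I-frame's share of each GOP, and `FrameBudget`, `split_gop_budget`, and the GOP length are hypothetical.

```cpp
#include <cstdint>

struct FrameBudget {
    std::uint64_t i_frame_bits;      // budget for the single I-frame in a GOP
    std::uint64_t other_frame_bits;  // budget for each remaining frame
};

// Splits the bits available during one GOP so that the I-frame receives
// i_share_percent of them and the other frames share the rest equally.
inline FrameBudget split_gop_budget(std::uint64_t bits_per_second,
                                    std::uint32_t fps,
                                    std::uint32_t gop_frames,
                                    std::uint32_t i_share_percent) {
    // A GOP of gop_frames frames lasts gop_frames / fps seconds.
    const std::uint64_t gop_bits = bits_per_second * gop_frames / fps;
    const std::uint64_t i_bits = gop_bits * i_share_percent / 100;
    const std::uint64_t other_bits =
        gop_frames > 1 ? (gop_bits - i_bits) / (gop_frames - 1) : 0;
    return {i_bits, other_bits};
}
```

For a 512 kbps uplink (taking 1 kbps as 1,000 bits per second), 15 fps, and an assumed GOP of 15 frames, `split_gop_budget(512000, 15, 15, 30)` gives 153,600 bits (19,200 bytes) for the I-frame and 25,600 bits (3,200 bytes) for each of the other 14 frames.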
Reducing Power Consumption
In addition to improving the performance of QQ, the changes to memory copies, reference frames, and post processing also reduced the power consumption of the app. This is a natural consequence of doing the same amount of work in less time. But Tencent further reduced the amount of power required by throttling down the power states of the processor cores when they weren’t actually processing. Using the findings from the power tests, the engineers reworked areas that were keeping the processor unnecessarily active. Video conferencing apps don’t need to run the CPU continuously, since data supplied by the network is never continuous and there is no value in drawing new frames faster than the screen refresh rate. Tencent added short, timed lower power states using Windows API wait functions such as WaitForSingleObject. Such a wait ends when its timeout elapses or when an event is signaled, such as data arriving on the network. The resulting improvements can be seen in Figure 8:
Figure 8: Power savings per release
Summary of QQ Improvements
Using the Intel Media SDK and changing to the Intel SSE4 instruction set, Tencent made the following improvements to the QQ app:
- Offloaded H.264 and MJPEG encode and decode tasks to GPU
- Moved pre and post process tasks (when possible) to hardware
- Used both CPU and GPU simultaneously
- Reduced memory copies
- Reduced processor high power states (sleep calls, WaitForSingleObject, and timers)
- Changed MMX to Intel SSE4 instructions
- Optimized reference frame flow
Figure 9: Pre optimized 640x480 versus optimized 1280x720
The performance of Tencent QQ was dramatically increased by using key features of the Intel Media SDK. QQ was transformed from an app that could deliver 480p resolution images at low frame rate over a DSL connection into an app that could deliver 720p resolution images at 30 fps over that same DSL connection and support 4-way conferencing.
After moving key functions into hardware using the Intel Media SDK, the power consumption of QQ was reduced to almost 50% of its initial value. Then, by optimizing processor power states, power usage was further reduced to about 35% of its initial value. This is a remarkable power savings that permits QQ users to run the optimized app for more than twice as long as the preoptimized app. While improving customer satisfaction, QQ became a more capable (and greener) app by integrating the Intel Media SDK.
If you are a media software developer, be sure to evaluate how the Intel Media SDK can help increase performance and decrease memory usage and power consumption of your app by providing an efficient data flow pipeline with improved video quality and user experience even with limited bandwidth. And don’t forget the Intel tools available to help find problem spots and bottlenecks.
Additional References
- Intel® Media SDK: https://software.intel.com/en-us/vcsource/tools/media-sdk-clients
- Intel® Media SDK sample code: https://software.intel.com/en-us/articles/media-sdk-tutorial-tutorial-samples-index
- Intel® Media SDK features: https://software.intel.com/en-us/articles/video-conferencing-features-of-intel-media-software-development-kit
- Intel® Media SDK video conferencing sample: https://software.intel.com/en-us/vcsource/samples/video-conferencing-using-media-sdk
- Copying Accelerated Video Decode Frame Buffers: https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
- Intel® SSE4 instructions guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide
- ooVoo* Video Conferencing case study: https://software.intel.com/en-us/articles/oovoo-intel-enabling-hd-video-conferencing
- Windows Performance Analyzer: http://go.microsoft.com/fwlink/?LinkID=293840
- QQ Official Site: http://www.imqq.com
About the Authors
Colleen Culbertson is an Application Engineer in Intel’s Developer Relation Division Scale Enabling in Oregon. She has worked for Intel for more than 15 years. She works with various teams and customers to enable developers to optimize their code.
Youwei Wang is an Application Engineer in Intel’s Developer Relation Division Client Enabling in Shanghai. Youwei has worked at Intel for more than 10 years. He works with ISVs on performance and power optimization of applications.
Some performance results were provided by Tencent. Intel performance results were obtained on a Lenovo* Yoga 2 Pro 2-in-1 platform with a 4th generation Intel® Core™ mobile processor and Intel® HD Graphics 4400.
Intel, the Intel logo, Intel Atom, Intel Celeron, and Intel Core are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.