Author: Zvi Danovich, Senior SW Application Engineer, Intel
Most Android applications, even those based only on scripting and managed languages (Java*, HTML5,…), eventually use middleware features that would benefit from optimization.
This paper will discuss optimization needs and approaches on Android and walk through a case study of how to optimize a multimedia and augmented reality application.
Intel offers several Intel® Atom™ processors for Android platforms (smartphones and tablets) that provide at least the Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) level of vectorization capability and usually two cores with hyperthreading.
So take a hint and use these optimization capabilities!
iOnRoad* Case Study
iOnRoad* is an augmented reality app that provides real-time personal driving assistance functions, such as collision warning, using the power of modern computer vision algorithms on smartphones.
Being real-time, the iOnRoad app needs to transform each input frame produced by the phone’s camera in YUV420/NV21 form into RGB form before processing.
The original iOnRoad code for this transform took about 40% of the CPU’s runtime, restricting use of image processing and making optimization mandatory.
The only available optimized software we found was the Intel® Integrated Performance Primitives routine YUV420ToRGB, but it doesn't have the necessary combination of input and output formats the iOnRoad application requires. And besides, this routine is NOT multithreaded!
So we made the decision to create new optimized code to perform the needed transformation.
YUV420/NV21 to RGB Transformation
YUV420/NV21 format has 8-bit luminance (black/white) Y and two chrominance (color) U & V components.
Each 2x2 quad of Y components shares a single pair of V & U components. Combining a pixel's own Y with its quad's V & U pair produces the three color components of that pixel in the standard RGB format.
The figure above shows appropriate quads of Y’s and, colored with the same color, are the pairs of V & U serving them.
This format (widely referred to as YUV) needs 12 bits per pixel, half the 24 bits per pixel of RGB.
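To make the layout concrete, here is a minimal C sketch of how the chroma pair shared by a pixel's quad can be located in an NV21 buffer. It assumes tightly packed rows and even dimensions; the function names are illustrative and are not taken from the iOnRoad code.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one NV21 frame: a full-resolution Y plane followed by a
   half-resolution plane of interleaved (V, U) byte pairs.
   Assumes even width/height and no row padding. */
size_t nv21_frame_size(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + (width / 2) * (height / 2) * 2;
}

/* Offset of the V byte shared by the 2x2 quad that contains pixel (x, y).
   The matching U byte is at the next offset (V first is what makes it NV21). */
size_t nv21_v_offset(size_t width, size_t height, size_t x, size_t y)
{
    assert(x < width && y < height);
    size_t chroma_row = y / 2;        /* two luma rows share one chroma row */
    size_t chroma_col = (x / 2) * 2;  /* each (V, U) pair occupies two bytes */
    return width * height + chroma_row * width + chroma_col;
}
```

For a 640x480 frame this gives 460,800 bytes, compared with 921,600 bytes for 24-bit RGB.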
YUV to RGB Transformation – Integer Look-Up Table Approach
The YUV to RGB transformation uses simple linear formulas.
To avoid casting to float, we used the following well-known integer approximation:
These formulas have intermediate results > 2^16, a point we return to in the vectorization discussion.
For scalar computations iOnRoad used a look-up table (LUT) approach: since Y, U and V are 8-bit values, the multiplications in the above formulas can be pre-computed in five 256-entry LUTs with 32-bit outputs.
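A scalar LUT conversion of this kind might look like the following C sketch. The coefficients (298, 409, 100, 208, 516, all scaled by 256) are the widely quoted BT.601 integer approximation and are an assumption here, as are the table and function names; the app's actual tables may differ.

```c
#include <assert.h>
#include <stdint.h>

/* One table per multiplication in the assumed formulas:
     R = (298*(Y-16) + 409*(V-128) + 128) >> 8
     G = (298*(Y-16) - 100*(U-128) - 208*(V-128) + 128) >> 8
     B = (298*(Y-16) + 516*(U-128) + 128) >> 8                   */
static int32_t lut_y[256], lut_vr[256], lut_ug[256], lut_vg[256], lut_ub[256];
static int lut_ready = 0;

void yuv_lut_init(void)
{
    for (int i = 0; i < 256; ++i) {
        lut_y[i]  = 298 * (i - 16) + 128;   /* rounding term folded in */
        lut_vr[i] = 409 * (i - 128);
        lut_ug[i] = 100 * (i - 128);
        lut_vg[i] = 208 * (i - 128);
        lut_ub[i] = 516 * (i - 128);
    }
    lut_ready = 1;
}

/* Clamp to 0..255 after dropping the 8 fractional bits. */
static uint32_t clip_shift8(int32_t v)
{
    if (v < 0) return 0;
    v >>= 8;
    return (uint32_t)(v > 255 ? 255 : v);
}

/* Returns one FRGB pixel: 0xff in the top byte, then R, G, B. */
uint32_t yuv_to_frgb_lut(uint8_t y, uint8_t u, uint8_t v)
{
    assert(lut_ready);                      /* call yuv_lut_init() first */
    int32_t luma = lut_y[y];
    uint32_t r = clip_shift8(luma + lut_vr[v]);
    uint32_t g = clip_shift8(luma - lut_ug[u] - lut_vg[v]);
    uint32_t b = clip_shift8(luma + lut_ub[u]);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}
```

With these coefficients the luma term alone reaches 298 * 239 = 71,222, which is why 32-bit table entries are needed.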
YUV to RGB Transformation – Fixed Point Approach Using Intel SSE
Intel® SSE doesn’t have vectorized LUT (gather) instructions. Packed 16-bit multiplication is usually faster than the combination of scalar LUT operations with subsequent packing.
But the basic Intel SSE 16-bit multiplication (PMULLW) cannot be used because it keeps only the low 16 bits of each product, and the expected intermediate results are > 2^16.
Intel® SSSE3 has the _mm_mulhrs_epi16() intrinsic (the PMULHRSW instruction), which performs a full 16-bit multiplication, shifts the 32-bit intermediate result right with rounding, and returns a 16-bit result.
To provide the maximal number of significant bits in the final result, the original multiplication operands should be shifted left (in our case we can achieve a 13-bit final result).
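As an illustration of this scaling, the sketch below computes the luma term 1.164 * (Y - 16) for eight pixels. _mm_mulhrs_epi16() returns (a * b + 2^14) >> 15 per lane, so shifting the data left by 7 and the x256 coefficient left by 5 leaves a result scaled by 32, which is a 13-bit value. The two shift amounts and the coefficient 298 are assumptions chosen so that every operand fits in a signed 16-bit lane; the article does not state the exact values used.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_mulhrs_epi16 */

#define DATA_SHIFT 7     /* assumed left shift applied to Y-16, U-128, V-128 */
#define COEF_SHIFT 5     /* assumed left shift applied to the x256 coefficients */

/* Computes 1.164 * (Y - 16) for eight pixels as 13-bit fixed point
   (the 8-bit value scaled by 32). Inputs are already-subtracted 16-bit
   values in 0..239. The target attribute lets GCC or Clang build this one
   function for SSSE3 without a global -mssse3 flag. */
__attribute__((target("ssse3")))
void luma_term_q5(const int16_t y_minus_16[8], int16_t out[8])
{
    for (int i = 0; i < 8; ++i)
        assert(y_minus_16[i] >= 0 && y_minus_16[i] <= 239);

    __m128i y    = _mm_loadu_si128((const __m128i *)y_minus_16);
    __m128i coef = _mm_set1_epi16((short)(298 << COEF_SHIFT));   /* 9536 */

    y = _mm_slli_epi16(y, DATA_SHIFT);            /* at most 239 << 7 = 30592 */
    __m128i term = _mm_mulhrs_epi16(y, coef);     /* (y * coef + 2^14) >> 15 */

    _mm_storeu_si128((__m128i *)out, term);
}
```

The chroma coefficients fit the same scheme under these assumptions: the largest one, 516 << 5 = 16,512, is still below 32,767.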
YUV to RGB Transformation – Implementation of Fixed Point SSE Approach
The process begins by loading two portions of 16x8-bit Y’s and 8 (U,V) 8-bit pairs.
By the end, this data will be converted to 16x32-bit RGB pixels (in the FRGB format, the highest byte is 0xff).
16 is subtracted from 16x8-bit Y’s using the 8-bit subtraction operation with saturation, so that we avoid having to check and correct that the result is not negative.
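A minimal sketch of this step, using the SSE2 intrinsic _mm_subs_epu8() for unsigned saturating subtraction (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_subs_epu8 */

/* Subtracts 16 from sixteen 8-bit Y values; results below zero
   saturate to 0 instead of wrapping around. */
void sub16_saturating(const uint8_t y_in[16], uint8_t y_out[16])
{
    assert(y_in && y_out);
    __m128i y       = _mm_loadu_si128((const __m128i *)y_in);
    __m128i sixteen = _mm_set1_epi8(16);
    _mm_storeu_si128((__m128i *)y_out, _mm_subs_epu8(y, sixteen));
}
```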
8 (U,V) pairs are “serving” 2 lines of 16 Y’s.
To unpack input data, the byte shuffle operation is used, producing 2 portions of:
- 2 sets of 8x16-bit Y’s
- 1 set of 4x16-bit doubled U’s
- 1 set of 4x16-bit doubled V’s
Here is the detailed scheme for producing a portion:
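In code, one portion could be produced with _mm_shuffle_epi8() roughly as follows. This is a sketch under assumptions: the mask layouts and the function name are illustrative, and only the first 8 Y bytes and the first 4 (V,U) pairs of each input register are used.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

#define Z (-1)   /* a mask byte with the high bit set produces a zero byte */

/* Widens 8 Y bytes to 16 bits and expands 4 (V,U) byte pairs into
   8 doubled 16-bit V's and 8 doubled 16-bit U's, one per Y in a row. */
__attribute__((target("ssse3")))
void unpack_portion(const uint8_t y_bytes[16], const uint8_t vu_bytes[16],
                    int16_t y_out[8], int16_t u_out[8], int16_t v_out[8])
{
    assert(y_bytes && vu_bytes && y_out && u_out && v_out);

    const __m128i y_mask = _mm_setr_epi8(0, Z, 1, Z, 2, Z, 3, Z,
                                         4, Z, 5, Z, 6, Z, 7, Z);
    const __m128i v_mask = _mm_setr_epi8(0, Z, 0, Z, 2, Z, 2, Z,
                                         4, Z, 4, Z, 6, Z, 6, Z);
    const __m128i u_mask = _mm_setr_epi8(1, Z, 1, Z, 3, Z, 3, Z,
                                         5, Z, 5, Z, 7, Z, 7, Z);

    __m128i y  = _mm_loadu_si128((const __m128i *)y_bytes);
    __m128i vu = _mm_loadu_si128((const __m128i *)vu_bytes);

    _mm_storeu_si128((__m128i *)y_out, _mm_shuffle_epi8(y,  y_mask));
    _mm_storeu_si128((__m128i *)v_out, _mm_shuffle_epi8(vu, v_mask));
    _mm_storeu_si128((__m128i *)u_out, _mm_shuffle_epi8(vu, u_mask));
}
```

The same V and U outputs serve the corresponding 8 Y's of the second line, since both lines share one chroma row.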
Before using U & V, 128 is subtracted using the packed 16-bit _mm_sub_epi16() instruction.
After the subtractions, all 8x16-bit Y, U & V packed data are shifted to the left to optimally fit the _mm_mulhrs_epi16() instruction, which is used with appropriately packed coefficients.
Note: The preparation steps above (subtractions and shifts) replace the LUT operations of the scalar algorithm.
Multiplication results are summed to receive final 16-bit packed values that are clipped between 0 and 2^13-1 (8191) using _mm_min_epi16() and _mm_max_epi16().
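For example, the red channel could be summed and clipped as in this sketch, which takes two already-multiplied 13-bit terms as input (the function and parameter names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_add_epi16, _mm_min_epi16, _mm_max_epi16 */

/* r = clamp(y_term + v_term, 0, 8191) for eight signed 16-bit lanes. */
void sum_and_clip_q5(const int16_t y_term[8], const int16_t v_term[8],
                     int16_t r[8])
{
    assert(y_term && v_term && r);
    __m128i sum = _mm_add_epi16(_mm_loadu_si128((const __m128i *)y_term),
                                _mm_loadu_si128((const __m128i *)v_term));
    sum = _mm_max_epi16(sum, _mm_setzero_si128());      /* lower bound 0 */
    sum = _mm_min_epi16(sum, _mm_set1_epi16(8191));     /* upper bound 2^13-1 */
    _mm_storeu_si128((__m128i *)r, sum);
}
```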
When all computations are completed, the results are presented in the form of packed 13/16-bit separated R, G, & B values.
Repacking these values into FRGB form (where F is an alpha-channel filled with all ones, which the iOnRoad app demands) is done in two phases.
In the first phase we repack 13/16-bit separated R, G, & B into 16-bit pairs FR & GB, using an additional register filled with 16-bit <0xff00> value.
This repacking phase consists of logical left and right shifts and logical OR/AND operations, as the following diagram shows:
In the second phase, the FR & GB intermediate results are finally packed into FRGB using the interleaving _mm_unpacklo_epi16() and _mm_unpackhi_epi16() instructions:
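Both repacking phases can be sketched in C as follows. The shift amounts assume 13-bit inputs in the range 0..8191, and the function name is illustrative; the exact instruction sequence in the iOnRoad code may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 shifts, logic and unpack intrinsics */

/* Packs eight 13-bit R, G, B values into eight 32-bit FRGB pixels. */
void pack_frgb(const int16_t r[8], const int16_t g[8], const int16_t b[8],
               uint32_t out[8])
{
    assert(r && g && b && out);
    __m128i vr = _mm_loadu_si128((const __m128i *)r);
    __m128i vg = _mm_loadu_si128((const __m128i *)g);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    const __m128i ff00 = _mm_set1_epi16(-256);          /* 0xff00 in every lane */

    /* Phase 1: FR = 0xff00 | R8, GB = (G8 << 8) | B8, where X8 = X13 >> 5. */
    __m128i fr = _mm_or_si128(_mm_srli_epi16(vr, 5), ff00);
    __m128i gb = _mm_or_si128(_mm_and_si128(_mm_slli_epi16(vg, 3), ff00),
                              _mm_srli_epi16(vb, 5));

    /* Phase 2: interleave so each 32-bit lane holds GB low and FR high. */
    _mm_storeu_si128((__m128i *)out,       _mm_unpacklo_epi16(gb, fr));
    _mm_storeu_si128((__m128i *)(out + 4), _mm_unpackhi_epi16(gb, fr));
}
```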
Intel SSE intrinsic-based code for the YUV-to-RGB transform described above shows about a 4x improvementΩ compared to the original scalar LUT-based code.
Using Intel® Cilk™ Plus for Parallelization
Most Intel Atom processors used in smartphones and tablets have at least two logical cores, either through two physical cores or through hyperthreading, and some models offer both. The number of cores will certainly increase in the future, so parallelizing algorithms will only become more relevant.
The simplest parallelization approach is provided by the Intel® Cilk™ Plus extensions in the Intel® Compiler, which are available for both C and C++ code (Intel® Threading Building Blocks, by contrast, works only with C++).
Intel Cilk Plus’s simplest parallelization operator “cilk_for” (used for the outer loop of the YUV-to-RGB transform instead of the standard C/C++ “for”) provides an additional 2x performance boostΩ on a 2-core Intel Atom processor Z2760-based device (code name Clover Trail).
Using Intel SSE intrinsics for vectorization together with Intel Cilk Plus parallelization provides 8x overall improvement.Ω
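The outer-loop change can be sketched as below. The worker function is a hypothetical stand-in that writes grey pixels from Y only; a real one would run the SSE kernel on the two luma rows and their shared chroma row. The sketch also assumes that the compiler defines the __cilk macro when Cilk Plus is enabled, and falls back to a serial loop otherwise.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __cilk
#include <cilk/cilk.h>      /* Cilk Plus: defines cilk_for */
#else
#define cilk_for for        /* serial fallback when Cilk Plus is unavailable */
#endif

/* Hypothetical stand-in for the SSE kernel: converts the two luma rows
   (row, row + 1) that share one chroma row, here as grey FRGB pixels. */
static void convert_row_pair(const uint8_t *y_plane, uint32_t *frgb,
                             int width, int row)
{
    for (int i = row * width; i < (row + 2) * width; ++i)
        frgb[i] = 0xff000000u | (y_plane[i] * 0x010101u);
}

void convert_frame(const uint8_t *y_plane, uint32_t *frgb,
                   int width, int height)
{
    assert(height % 2 == 0);
    /* Row pairs read and write disjoint data, so iterations are independent
       and can safely run on different cores. */
    cilk_for (int row = 0; row < height; row += 2)
        convert_row_pair(y_plane, frgb, width, row);
}
```

Stepping by two rows keeps each quad of Y's and its chroma pair inside a single iteration, so no synchronization between iterations is needed.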
ΩSoftware and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: [dual-core Clover Trail Atom 2.00Ghz Z2760 processor-based engineering sample system with 1Gb RAM memory, running Android 4.1.2 (Jelly Bean) OS and special iOnRoad test application]. For more information go to http://www.intel.com/performance
Conclusions and Call to Action
Intel® SSE intrinsics (Intel SSSE3 level) provide amazing performance improvements, and Intel® Cilk™ Plus is great for parallelizing apps on Intel Atom processor-based devices running Android.
We recommend that Android developers writing apps targeted for Intel Atom processor-based devices optimize their multimedia apps and games using Intel® SSE and Intel® Cilk™ Plus. These trusted tools can provide amazing performance boosts!
For a description of the company, please go to the iOnRoad web site.
Intel, the Intel logo, Atom, and Cilk are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.