Finer Points of Using SSE Instructions for Android* Media Apps on the Intel® Atom™ Platform

By Michael Stoner, Sr. Application Engineer, Intel Corporation
Victoria Zhislina, Sr. Application Engineer, Intel Corporation
Jackie Lee, Application Engineer, Intel Corporation

Abstract

Power efficiency has always been an important factor in designing handheld devices like smartphones and tablets. As device capabilities have evolved, we’ve come to rely on them for entertainment, e.g., playing music and movies for long periods of time, which steadily drains the battery. Anyone stuck on an overseas flight can attest to this (especially on older planes with no seat-back entertainment system!). Intel has responded by providing highly efficient hardware support for media codecs in the Intel® Atom™ processor, which is now featured in several Android-based phone and tablet designs, including the newly released Samsung Galaxy* Tab 3. However, some vendors still require a software implementation, for instance if they want to run a customized post-processing effects chain before sending the audio to the output device. For these developers, software optimization using the Intel® Streaming SIMD Extensions (SSE) instruction set is essential to maximizing media processing speed, which allows the CPU to spend more time in “sleep” states and extends battery life. While establishing a working SSE implementation is a great first step, this paper goes further to outline several high-impact optimizations that can greatly improve the performance of existing SSE code on the Intel Atom processor. By following these guidelines you can ensure that users will enjoy a long and fruitful experience with their device on that 18-hour flight from Newark to Singapore!

Introduction

In 2012, the Lava XOLO X900* launched as the first Android-based smartphone using the Intel Atom processor. Since then, many device manufacturers (OEMs/ODMs) have followed suit, requiring them to port their media processing frameworks from ARM to Intel® x86 architecture.

Elsewhere on this site you can find valuable tutorials describing how to build a native Android app for the Intel Atom processor, using the Android NDK. Here is one example:
http://software.intel.com/en-us/articles/android-application-development-and-optimization-on-the-intel-atom-platform

The above article provides some direction on how to port an existing ARM NEON vectorized algorithm to SSE code that will run on an Intel Atom processor. Additionally, we often see that some extra instruction-level tweaking can yield even bigger leaps forward in performance, beyond the initial SSE implementation. In the rest of this paper you’ll find tips on how to extract these additional SSE performance benefits on an Intel Atom processor, based on some real-life examples that we encountered in tuning a reference mp3 decoder and other media algorithms.

Intel® SSE Instructions

First, a quick overview of SIMD instructions. They first emerged in Intel architecture in 1996 with MMX™ technology, which provides SIMD computations on packed byte, word, and doubleword integer datatypes. The Pentium® III processor extended the SIMD computation model with the introduction of Streaming SIMD Extensions (SSE), which allows SIMD computations to be performed on operands containing four packed single-precision floating-point data elements.

Figure 1 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4 and Y1, Y2, Y3, and Y4) are operated on in parallel, with the same operation being performed on each vertical pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are delivered as a set of four packed data elements.


Figure 1. Typical SIMD Operation
(Source: Intel® 64 and IA-32 Architectures Optimization Reference Manual, Intel Corporation, 2013)
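
To make Figure 1 concrete, here is a minimal sketch (the function and variable names are ours) of the same vertical operation expressed with SSE intrinsics:

#include <xmmintrin.h>  // SSE intrinsics

// One ADDPS performs four single-precision additions in parallel,
// exactly the vertical operation shown in Figure 1.
void add4(const float* x, const float* y, float* result)
{
    __m128 vx = _mm_loadu_ps(x);                 // X4 X3 X2 X1
    __m128 vy = _mm_loadu_ps(y);                 // Y4 Y3 Y2 Y1
    _mm_storeu_ps(result, _mm_add_ps(vx, vy));   // X4+Y4 X3+Y3 X2+Y2 X1+Y1
}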

The Pentium® 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). And the Intel® Xeon® processor 5100 series introduced Supplemental Streaming SIMD Extensions 3 (SSSE3). The Intel Atom processor supports up to SSSE3 (see Figure 2).

The SIMD extensions operate the same way in Intel® 64 architecture as in IA-32 architecture, with the following enhancements:

  • 128-bit SIMD instructions referencing XMM registers can access 16 XMM registers in 64-bit mode. In 32-bit mode only 8 XMM registers are available.
  • In 32-bit mode, instructions can access 8 general purpose registers, 32 bits wide. In 64-bit mode, there are 16 GPRs, 64-bits wide.


Figure 2. Intel® Atom™ processor supports up to SSSE3

Microarchitectural quirks

Blocked Store-forward

A “blocked store-forward” occurs when a store to memory cannot be forwarded to a subsequent load that accesses some part of the recently stored data. When this happens, the load stalls for several cycles while the data is fully written to the processor caches, and then the portion of interest is reloaded. See chapter 2 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual for a detailed graphic showing the different forwarding and non-forwarding scenarios on Intel® Core™ microarchitecture.

The Intel Atom processor supports a very limited number of cases in which the data can be forwarded efficiently without a stall. Forwarding happens only when the load and store use the same address and the same operand size, and only if the store comes from a general-purpose register (e.g., eax, ebx, …). Stores of SSE operands (from xmm registers) are never forwarded to subsequent loads, so programmers should avoid this pattern whenever possible. Usually the data can be manipulated within the registers to continue the computation without incurring a stall.

We saw an example of this issue in the subband synthesis filter of the mp3 decoder, where a “windowing” loop accumulates the results of a series of multiplications between the frequency samples and the window coefficients. An SSE-optimized routine would use ADDPS instructions to compute four intermediate sums in one xmm register. Then to complete the calculation, the four values are summed in a horizontal sense across the register. One possible implementation might save the 16-byte packed SSE operand to memory and then sum the four individual elements from memory:

// windowing code

__m128 sum4;
__declspec(align(16)) float pSum[4];  // must be 16-byte aligned for _mm_store_ps

sum4 = _mm_mul_ps(_mm_load_ps(&window[0]),
                  _mm_load_ps(&b0[0]));
sum4 = _mm_add_ps(sum4,
    _mm_mul_ps(_mm_load_ps(&window[4]),
               _mm_load_ps(&b0[4])));
...			

_mm_store_ps (pSum, sum4);	// sum3 sum2 sum1 sum0
int pcm_sample = 
    (int)(pSum[0]+pSum[1]+pSum[2]+pSum[3]);

pcm_out[i] = CLIP(pcm_sample);

This will incur a blocked store-forward stall between the 16-byte store to the pSum array and the following 4-byte loads from pSum. To avoid this, we could draw up a different version that computes the horizontal sum in the xmm registers:

// Horizontal add of elements from sum4:
// sum4 = sum4[0]+sum4[1]+sum4[2]+sum4[3]
sum4 = _mm_hadd_ps(sum4, sum4);  // --- --- sum3+sum2 sum1+sum0
sum4 = _mm_hadd_ps(sum4, sum4);  // --- --- --------- sum3+2+1+0

// Clipping to 16-bit integer range, with bounds prepared earlier, e.g.:
//   lowerBound4 = _mm_set1_ps(-32768.0f);
//   upperBound4 = _mm_set1_ps( 32767.0f);
sum4 = _mm_max_ss(sum4, lowerBound4);
sum4 = _mm_min_ss(sum4, upperBound4);

// Convert to integer and store to pcm buffer
pcm_out[i] = _mm_cvtt_ss2si(sum4);

The horizontal sum can be computed with HADDPS instructions as shown above, or with a series of adds and shuffles. The HADDPS sequence is faster on Intel Atom processors, but slower on many variants of Intel Core microarchitecture. Regardless, both will run faster than the code that exhibited the store-forwarding problem.
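
For reference, here is one common adds-and-shuffles sequence for the same horizontal sum (a minimal sketch using only SSE instructions; the helper name is ours):

// Horizontal sum of the four floats in v using only shuffles and adds;
// this variant tends to be the faster choice on Intel Core microarchitecture.
static inline float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // swap element pairs
    __m128 sums = _mm_add_ps(v, shuf);     // (0+1) (0+1) (2+3) (2+3)
    shuf = _mm_movehl_ps(shuf, sums);      // move (2+3) into the low element
    sums = _mm_add_ss(sums, shuf);         // (0+1)+(2+3) in the low element
    return _mm_cvtss_f32(sums);
}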

Note that we can also take advantage of SSE min and max instructions to perform clipping on samples that exceed the 16-bit range. In our experiment with a reference decoder, this overall change generated a 15% improvement on the SSE-optimized subband synthesis filter.

Misaligned loads

Memory alignment is a very important consideration in getting the most out of your SSE routines. While Intel Core microarchitecture has steadily improved its ability to manage misaligned loads (and the resulting cache-line splits), on the Intel Atom processor we still see a significant performance hit. In either case you will find value in reducing misaligned accesses, but especially on the simpler Intel Atom cores.

In most cases with SSE code, we’ll be shooting to ensure 16-byte alignment on a data structure, so that we can use aligned moves (i.e., MOVAPS, MOVDQA) when loading from that structure into XMM registers. Figure 3 shows an example where we have a “hybrid structure-of-arrays” format containing four sets of 3D coordinates. If we can guarantee 16-byte alignment on the structure, all four coordinate sets can be accessed with MOVAPS instructions.


Figure 3. Loading 16-byte aligned data
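
As a sketch of the layout in Figure 3 (the struct and function names are our own, hypothetical), the structure could be declared and loaded like this:

#include <xmmintrin.h>  // SSE

// Hybrid structure-of-arrays holding four 3D coordinate sets; the
// 16-byte alignment lets each row be read with MOVAPS (_mm_load_ps).
__declspec(align(16)) struct Points4 {
    float x[4];   // x0 x1 x2 x3
    float y[4];   // y0 y1 y2 y3
    float z[4];   // z0 z1 z2 z3
};

__m128 load_x(const struct Points4* p) { return _mm_load_ps(p->x); }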

The programmer can control the alignment using several methods, depending on whether the data is allocated statically or dynamically (see Figure 4).

        __declspec(align(16)) float a[N];                // static or automatic storage
        _MM_ALIGN16 float a[N];                          // compiler macro for static alignment
        int* b = (int*)_mm_malloc(N * sizeof(int), 16);  // dynamic allocation
        F32vec4 c, d[N / 4];                             // vector classes are always aligned

Figure 4. Methods for aligning static and dynamic data structures


A more complex scenario arises when the SSE algorithm has an inherently misaligned access pattern. This was another problem we faced in tuning the subband synthesis windowing routine. Though the windowing coefficient buffer can be 16-byte aligned, the pointer to the buffer is incremented in a circular fashion such that the alignment changes with each pass of the code. Since these are 32-bit float values loaded in chunks of 4, the offset relative to a 16-byte boundary can be 0, 4, 8, or 12 bytes.

Several options were considered to handle this. One was to execute unaligned moves (MOVUPS) across the board and take the full performance hit. A second idea was to check for the aligned case (zero offset) and use MOVAPS or MOVUPS accordingly. This would mitigate the problem 25% of the time.

The fastest (and slightly more complex) solution was to use the PALIGNR instruction, which concatenates two XMM registers and shifts by a specified number of bytes to obtain the desired operand. This required us to write separate code paths to support all four possible offsets. Figure 5 shows the case where the offset is known to be 8 bytes, so we load consecutive 16-byte chunks from aligned locations surrounding the data of interest and use PALIGNR (actually the _mm_alignr_epi8 intrinsic) to shift in the desired values.


Figure 5. Using PALIGNR to avoid cache-line splitting loads

OK, this may be more than slightly complex, but it was a reasonable amount of work to implement and worth another 15% gain over the original SSE routine.
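
As a rough sketch of the idea (the pointer and function names are ours, and only the 8-byte-offset path is shown), the PALIGNR load from Figure 5 looks like this:

#include <tmmintrin.h>  // SSSE3: _mm_alignr_epi8

// Load four floats that start 8 bytes past a 16-byte boundary without
// issuing a misaligned (potentially cache-line splitting) load.
// 'p' is the 16-byte-aligned address just below the data of interest.
static inline __m128 load_offset8(const float* p)
{
    __m128i lo = _mm_load_si128((const __m128i*)p);        // bytes 0..15
    __m128i hi = _mm_load_si128((const __m128i*)(p + 4));  // bytes 16..31
    // Concatenate hi:lo and shift right by 8 bytes, extracting bytes 8..23,
    // i.e., the four floats beginning at offset 8.
    return _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 8));
}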

Partial register dependencies

Be aware that some instructions load only part of an XMM register (e.g., MOVLPS, MOVHPS, PINSRW), so the machine has to keep track of the previous register contents to complete the operation. The programmer can remove these false dependencies by zeroing the full register ahead of time, either by XOR-ing or subtracting the register with itself:

PXOR xmm0, xmm0
PSUBD xmm0, xmm0

The execution pipeline interprets these as dependency-chain breaking idioms, so no stalls will occur on subsequent partial accesses to the register. In other words, even if the register has a read-after-write (RAW) dependency on some earlier instruction, the machine doesn’t need to check for it, because all of the bits are going to be set to zero regardless of what values resided there previously.
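
In intrinsics code the same idea looks like this (a minimal sketch; the function name is ours, and the compiler emits the PXOR idiom for _mm_setzero_si128):

#include <emmintrin.h>  // SSE2

// Zeroing the register first breaks any dependency on its previous
// producer, so the partial write below (PINSRW) does not stall.
static inline __m128i insert_low_word(int sample)
{
    __m128i v = _mm_setzero_si128();       // compiles to pxor xmm, xmm
    return _mm_insert_epi16(v, sample, 0); // partial write: low word only
}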

Long-latency instructions

You can gain some performance advantages by reviewing the optimization guide for latency/throughput information of each instruction as you are implementing an SSE algorithm. Especially note that divide and sqrt operations take more than 30 cycles due to the iterative nature of the circuitry that computes the result. Consider whether these can be replaced with a table-lookup operation, a reciprocal approximation (RCPPS instruction), or a Newton-Raphson sequence.
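
For example, a full-precision divide can often be replaced by the following sketch (our own helper, shown under the assumption that reduced precision is acceptable): RCPPS provides roughly 12 accurate bits, and one Newton-Raphson step refines the result.

#include <xmmintrin.h>  // SSE

// Approximate a/b: RCPPS gives a ~12-bit 1/b, and one Newton-Raphson
// iteration (r' = r*(2 - b*r)) roughly doubles the number of accurate bits.
static inline __m128 fast_div_ps(__m128 a, __m128 b)
{
    __m128 r = _mm_rcp_ps(b);                           // ~12-bit reciprocal
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f),
                                 _mm_mul_ps(b, r)));    // one N-R refinement
    return _mm_mul_ps(a, r);                            // a * (1/b)
}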

In the mp3 decoder, the dequantization phase had a computational step that required dividing by a set of values that were all powers of 2. In integer code, this would translate to a fast right-shift operation. However, since the x86 algorithm used floating-point representation, the code was executing full-precision divides. We improved this by replacing the divide with multiplication by a set of precomputed reciprocal values (e.g., y = x * 0.125, instead of y = x / 8.0). This generated over a 5% speedup on the entire decoder, not too bad for one small change!

Streaming stores

One last tip, related to memory optimization: unlike Intel Core microarchitecture, Intel Atom processors have no L3 cache, and the L2 cache is typically only 512 KB, so consider whether your data set will fit. If not, your routine may benefit from streaming stores (a.k.a. non-temporal store instructions, such as MOVNTPS or MOVNTQ), which write directly to main memory without polluting the caches. Normally a store allocates a cache line in the cache hierarchy (if the line isn’t already there), and when that line is evicted, a “dirty writeback” occurs to update main memory, further consuming bus bandwidth.

Recently we worked on a colorspace conversion routine operating on an image size of 960x540. The input pixels were in 16-bit format (rgb565), and the code converted them to 32 bits (rgb8888). That summed to a total of 3 MB in the working set, much bigger than the 512 KB L2 cache on the Intel Atom processor. We implemented streaming stores via the _mm_stream_si128 intrinsic and saw performance improve by over 1.4x! Figure 6 shows a code snippet:

for (j = 0; j < w0_8; j += 8)
{
  // ... convert eight rgb565 pixels into rgba8888_0 and rgba8888_1 ...
  _mm_stream_si128((__m128i*)&rgba[j], rgba8888_0);
  _mm_stream_si128((__m128i*)&rgba[j+4], rgba8888_1);
}

Figure 6. Streaming store example

Table 1 shows the progression of performance improvements as we modified the code. After setting 16-byte alignment on the buffers, the kernel was bottlenecked on memory traffic. Streaming stores eliminated all the “dirty writebacks,” which greatly freed up the bus, allowing the input buffer to be loaded much more quickly. Also, note that the Intel® Compiler was able to vectorize the CSC algorithm, hence the minimal difference when intrinsics were first used.

Color Space Conversion                      Time (ms)
Baseline, Visual Studio* 2010 compiler         25,430
Intel® Compiler, /QxSSSE3_ATOM switch           9,350
SSE2 intrinsics                                 9,520
16-byte alignment                               9,000
Non-temporal stores                             6,210

Table 1. Progression of Color Space Conversion optimizations

Conclusion

So, we’ve outlined a handful of tips you can apply to squeeze more performance out of code that already has been optimized with SSE instructions. The chart below shows the progression of mp3 decode performance improvements from removing store-forward issues, mitigating misalignments, replacing divides with multiply-by-reciprocal, and in the last phase recoding the iMDCT routine to use SSE intrinsics.


Figure 7. Incremental improvement of MP3 decode performance

For further study, we recommend reviewing the Intel Architecture optimization guide, among the other links listed in the References section below. Also, use the Intel® VTune™ Amplifier to identify hotspots in your workload and experiment with Intel Compiler options for turn-key performance improvements.

Performance disclaimer

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

The mp3 decoder was compiled for the Intel® Atom™ microarchitecture with Intel Compiler 13.0 and executed on a reference Android phone (engineering sample) based on the Intel Atom Z2760 processor (codename Cloverview). The mp3 decode test used a 4-minute song as input, and performance was evaluated by how fast the track could be decoded offline (i.e., as fast as possible, regardless of real-time playback speed). The subband synthesis and colorspace conversion kernels were compiled in the same fashion and run on a Samsung 500T tablet, also based on Cloverview, running Windows 7 32-bit. The kernels were measured by the processing time required to execute a fixed number of repetitions.

References

Intel® Intrinsics Guide
http://software.intel.com/en-us/articles/intel-intrinsics-guide

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2
http://download.intel.com/products/processor/manual/325383.pdf

Intel® 64 and IA-32 Architectures Optimization Reference Manual
https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Store Forward Block
http://software.intel.com/en-us/forums/topic/333586

Reducing the impact of Misaligned Memory Accesses
http://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses

SSE (Streaming SIMD Extensions)
http://www.songho.ca/misc/sse/sse.html

Notices

Intel, the Intel logo, Atom, Core, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.