Porting Low-Level Parts of Android* Native Applications to Intel® Architecture-based Platforms


There are two types of applications for Android. One is the Dalvik application, which is a Java*-based application that can run on any architecture without modification. The second type is an NDK application that has part of its code written in C/C++ or ASM and must be recompiled for a specific CPU architecture.

This discussion focuses on NDK applications. For these, we generally just need to modify the APP_ABI parameter in Application.mk and recompile the NDK part, which then allows the application to run on the corresponding device. However, some NDK applications cannot simply be recompiled if they contain certain types of code, such as assembly language code or single instruction, multiple data (SIMD) code. This article explains how to handle these issues and introduces the major considerations developers should know about when porting an app from a non-x86 platform to an Intel® Architecture (x86) platform. We also discuss endian conversion between x86 and non-x86 platforms.
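
For example, targeting x86 is often just a one-line change in Application.mk; the snippet below follows the standard NDK conventions:

# Application.mk
# Build the native libraries for both ARM and x86 devices.
# APP_ABI := all would target every ABI the installed NDK supports.
APP_ABI := armeabi-v7a x86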

Single Instruction Multiple Data (SIMD)

SIMD is a class of parallel computers, as described in Flynn's taxonomy, with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism as there are simultaneous (parallel) computations. SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve the performance of multimedia workloads. For the x86 mobile platform, SIMD is called Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3, etc.). For the ARM* platform, SIMD is called NEON* technology. If you need any background on NEON, refer to the manufacturer's documentation.

Intel® Streaming SIMD Extensions (Intel® SSE)

First, what is Intel SSE? Basically, it is a collection of 128-bit CPU registers. These registers can be packed with four 32-bit scalars, which allows an operation to be performed on all four elements simultaneously. In contrast, it may take four or more operations in regular assembly to do the same thing. In the following diagram, you can see two vectors (Intel SSE registers) packed with scalars. The registers are multiplied with MULPS, which stores the result. That's four multiplications reduced to a single operation. The benefits of using Intel SSE are too significant to ignore.

Figure 1: Two vectors (Intel® SSE registers) packed with scalars
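
As a minimal sketch of this idea in C, using the standard SSE intrinsics from <xmmintrin.h> rather than raw assembly (the array and function names here are ours, chosen for illustration):

#include <xmmintrin.h>   /* Intel SSE intrinsics */

float a[4] __attribute__ ((aligned (16))) = {1.0f, 2.0f, 3.0f, 4.0f};
float b[4] __attribute__ ((aligned (16))) = {5.0f, 6.0f, 7.0f, 8.0f};
float r[4] __attribute__ ((aligned (16)));

void mul4(void)
{
    __m128 va = _mm_load_ps(a);      /* pack four scalars into one SSE register */
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_mul_ps(va, vb);  /* compiles to MULPS: four multiplies in one op */
    _mm_store_ps(r, vr);             /* r = {5, 12, 21, 32} */
}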

Now with the basic idea of Intel SSE in mind, let's take a look at some of the more common instructions.

Data Movement Instructions


MOVUPS - Move 128 bits of data to an SIMD register from memory or another SIMD register. Unaligned.
MOVAPS - Move 128 bits of data to an SIMD register from memory or another SIMD register. Aligned.
MOVHPS - Move 64 bits to the upper bits of an SIMD register (high).
MOVLPS - Move 64 bits to the lower bits of an SIMD register (low).
MOVHLPS - Move the upper 64 bits of the source register to the lower 64 bits of the destination register.
MOVLHPS - Move the lower 64 bits of the source register to the upper 64 bits of the destination register.
MOVMSKPS - Move the sign bits of each of the four packed scalars to an x86 integer register.
MOVSS - Move 32 bits to an SIMD register from memory or another SIMD register.
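
The aligned/unaligned distinction above matters in practice: MOVAPS (intrinsic _mm_load_ps) requires a 16-byte-aligned address, while MOVUPS (_mm_loadu_ps) accepts any address. A small sketch (the variable names are ours):

#include <xmmintrin.h>

float buf[8] __attribute__ ((aligned (16)));

void loads(void)
{
    __m128 x = _mm_load_ps(buf);       /* MOVAPS: buf is 16-byte aligned */
    __m128 y = _mm_loadu_ps(buf + 1);  /* MOVUPS: buf+1 is not aligned   */
    _mm_storeu_ps(buf + 1, _mm_add_ps(x, y));
}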

Arithmetic Instructions

NOTE: A scalar instruction (suffix SS) performs the operation on the first element only; the parallel version (suffix PS) performs the operation on all of the elements in the register. A short example of the difference follows the list.

ADDSS - Adds operands
SUBSS - Subtracts operands
MULSS - Multiplies operands
DIVSS - Divides operands
SQRTSS - Square root of operand
MAXSS - Maximum of operands
MINSS - Minimum of operands
RCPSS - Reciprocal of operand
RSQRTSS - Reciprocal of square root of operand
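
A minimal sketch of the scalar/parallel difference noted above (the values are chosen for illustration):

#include <xmmintrin.h>

void scalar_vs_packed(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     /* a = {1, 2, 3, 4}     */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); /* b = {10, 20, 30, 40} */

    __m128 s = _mm_add_ss(a, b); /* ADDSS (scalar):   s = {11, 2, 3, 4}    */
    __m128 p = _mm_add_ps(a, b); /* ADDPS (parallel): p = {11, 22, 33, 44} */
    (void)s; (void)p;
}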

Comparison Instructions

CMPSS / CMPPS - Compares operands and returns all 1s or all 0s per element

Logical Instructions

ANDPS - Bitwise AND of operands
ANDNPS - Bitwise AND NOT of operands
ORPS - Bitwise OR of operands
XORPS - Bitwise XOR of operands
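
Comparison and logical instructions are typically used together: the all-1s/all-0s result of a compare serves as a bit mask for branch-free selection. A sketch (the function name is ours):

#include <xmmintrin.h>

/* Branch-free max(a, b): select a where a > b, else b. */
__m128 select_max(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmpgt_ps(a, b);          /* CMPPS: all 1s where a > b    */
    return _mm_or_ps(_mm_and_ps(mask, a),      /* ANDPS: keep a where mask set */
                     _mm_andnot_ps(mask, b));  /* ANDNPS: keep b elsewhere     */
}

(MAXPS does this particular job in a single instruction; the point here is the mask idiom, which generalizes to any predicate.)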

Shuffle Instructions

SHUFPS - Shuffle numbers from one operand to another or itself
UNPCKHPS - Unpack high-order numbers to an SIMD register
UNPCKLPS - Unpack low-order numbers to an SIMD register
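
SHUFPS is exposed in C as _mm_shuffle_ps, with an immediate built by the _MM_SHUFFLE macro. For example, reversing the four elements of a register:

#include <xmmintrin.h>

__m128 reverse4(__m128 v)
{
    /* SHUFPS with _MM_SHUFFLE(0,1,2,3) picks elements 3,2,1,0 in order,
       turning {a, b, c, d} into {d, c, b, a}. */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}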

Other instructions not covered here include data conversion between x86 and MMX registers, cache control instructions, and state management instructions.

How to Convert NEON to Intel SSE

Intel SSE instructions and NEON do not translate 1:1. Although they are based on the same design principle, the method of implementation is different. You have to translate each instruction one by one. Following are two code segments showing the same function, one using NEON and the other using Intel SSE instructions.

The following code uses NEON instructions:

#include <arm_neon.h>  /* NEON intrinsics; k, d, and threshold come from the elided surrounding code */

int16x8_t q0 = vdupq_n_s16(-1000), q1 = vdupq_n_s16(1000);
int16x8_t zero = vdupq_n_s16(0);
for( k = 0; k < 16; k += 8 )
{
    int16x8_t v0 = vld1q_s16((const int16_t*)(d+k+1));
    int16x8_t v1 = vld1q_s16((const int16_t*)(d+k+2));
    int16x8_t a = vminq_s16(v0, v1);
    int16x8_t b = vmaxq_s16(v0, v1);
    v0 = vld1q_s16((const int16_t*)(d+k+3));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k+4));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k+5));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k+6));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k+7));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k+8));
    a = vminq_s16(a, v0);
    b = vmaxq_s16(b, v0);
    v0 = vld1q_s16((const int16_t*)(d+k));
    q0 = vmaxq_s16(q0, vminq_s16(a, v0));
    q1 = vminq_s16(q1, vmaxq_s16(b, v0));
    v0 = vld1q_s16((const int16_t*)(d+k+9));
    q0 = vmaxq_s16(q0, vminq_s16(a, v0));
    q1 = vminq_s16(q1, vmaxq_s16(b, v0));
}
q0 = vmaxq_s16(q0, vsubq_s16(zero, q1));
// a first attempt using vzipq produced the wrong result:
//q0 = vmaxq_s16(q0, vzipq_s16(q0, q0).val[1]);
// maybe someone knows a faster/better way?
int16x4_t a_hi = vget_high_s16(q0);
q1 = vcombine_s16(a_hi, a_hi);
q0 = vmaxq_s16(q0, q1);

// this is _mm_srli_si128(q0, 4)
q1 = vextq_s16(q0, zero, 2);
q0 = vmaxq_s16(q0, q1);

// this is _mm_srli_si128(q0, 2)
q1 = vextq_s16(q0, zero, 1);
q0 = vmaxq_s16(q0, q1);

// read the result
int16_t __attribute__ ((aligned (16))) x[8];
vst1q_s16(x, q0);
threshold = x[0] - 1;

The following code uses Intel SSE:

#include <emmintrin.h>  /* SSE2 intrinsics; k, d, and threshold as in the NEON version */

__m128i q0 = _mm_set1_epi16(-1000), q1 = _mm_set1_epi16(1000);
for( k = 0; k < 16; k += 8 )
{
    __m128i v0 = _mm_loadu_si128((__m128i*)(d+k+1));
    __m128i v1 = _mm_loadu_si128((__m128i*)(d+k+2));
    __m128i a = _mm_min_epi16(v0, v1);
    __m128i b = _mm_max_epi16(v0, v1);
    v0 = _mm_loadu_si128((__m128i*)(d+k+3));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+4));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+5));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+6));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+7));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k+8));
    a = _mm_min_epi16(a, v0);
    b = _mm_max_epi16(b, v0);
    v0 = _mm_loadu_si128((__m128i*)(d+k));
    q0 = _mm_max_epi16(q0, _mm_min_epi16(a, v0));
    q1 = _mm_min_epi16(q1, _mm_max_epi16(b, v0));
    v0 = _mm_loadu_si128((__m128i*)(d+k+9));
    q0 = _mm_max_epi16(q0, _mm_min_epi16(a, v0));
    q1 = _mm_min_epi16(q1, _mm_max_epi16(b, v0));
}
q0 = _mm_max_epi16(q0, _mm_sub_epi16(_mm_setzero_si128(), q1));
q0 = _mm_max_epi16(q0, _mm_unpackhi_epi64(q0, q0));
q0 = _mm_max_epi16(q0, _mm_srli_si128(q0, 4));
q0 = _mm_max_epi16(q0, _mm_srli_si128(q0, 2));
threshold = (short)_mm_cvtsi128_si32(q0) - 1;

For more information about converting NEON to Intel SSE, refer to the blog listed in the References section [1]. It provides a header file that can be used to automatically map NEON intrinsics to Intel SSE intrinsics.
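
With that header, NEON source can often be left almost untouched. A hypothetical usage sketch (the header file name varies by version, so check the download from [1]):

#if defined(__i386__) || defined(__x86_64__)
#    include "NEON_2_SSE.h"  /* maps NEON intrinsics to Intel SSE equivalents */
#else
#    include <arm_neon.h>    /* native NEON intrinsics on ARM */
#endif

/* The same NEON-style code now compiles for both ABIs. */
int16x8_t add_s16(int16x8_t a, int16x8_t b)
{
    return vaddq_s16(a, b);
}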

Enable Assembly Language

ARM processors are Reduced Instruction Set Computer (RISC) processors that aim for simpler, more generalized instructions and efficiency; they have a comparatively large number of general-purpose registers, and their data instructions generally use three registers: one destination and two operands. In contrast, Complex Instruction Set Computer (CISC) processors, like Intel's, have a rich instruction set capable of performing complex actions with a single instruction. If you need any background on ARM instructions/architecture, refer to the manufacturer's documentation.

Intel processors (i.e., 386 and beyond) have eight 32-bit general purpose registers, as shown in the following diagram. The register names are mostly historical. For example, EAX used to be called the accumulator since it was used by a number of arithmetic operations, and ECX was known as the counter since it was used to hold a loop index. While most of the registers have lost their special purposes in the modern instruction set, by convention, two are reserved for special purposes: the stack pointer (ESP) and the base pointer (EBP).

Figure 3: Intel® x86 processors with eight 32-bit general purpose registers

For the EAX, EBX, ECX, and EDX registers, subsections may be used. For example, the least significant two bytes of EAX can be treated as a 16-bit register called AX. The least significant byte of AX can be used as a single 8-bit register called AL, while the most significant byte of AX can be used as a single 8-bit register called AH. These names refer to the same physical register: when a 2-byte quantity is placed into DX, the update affects the values of DH, DL, and EDX. These sub-registers are mainly holdovers from older, 16-bit versions of the instruction set. However, they are sometimes convenient when dealing with data smaller than 32 bits (e.g., one-byte ASCII characters).
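
A minimal GCC inline-assembly sketch of this aliasing (x86 only; the constant values are arbitrary):

#include <stdio.h>

int main(void)
{
    unsigned int eax_val;
    __asm__ volatile (
        "movl $0x11223344, %%eax\n\t"  /* EAX = 0x11223344            */
        "movb $0xFF, %%al\n\t"         /* write only AL, the low byte */
        "movl %%eax, %0\n\t"
        : "=r" (eax_val)
        :
        : "eax");
    printf("EAX = 0x%08X\n", eax_val); /* prints EAX = 0x112233FF */
    return 0;
}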

Because of the differences between ARM and x86 assembly language, ARM ASM code cannot be used directly on x86 platforms. However, there are two ways for ARM ASM code to be used when porting an ARM Android application to x86:

  1. Implement the same function with x86 assembly language.
  2. Replace the ARM code with C code.

In many open source projects, performance-critical functions were rewritten in ASM to improve performance. Performance is less of a concern in this case because processors are more powerful than ever before. Crucially, the C code that implements the same functions as the rewritten ASM is usually retained in the source tree, and we can compile that C code for an x86 platform.

As an example, an ISV developed a game that used the Vorbis audio compression format through an open source library containing segments of ARM ASM code, so it could not be built directly with the x86 NDK. Instead of rewriting this section of code in x86 ASM, the ISV rebuilt it in C, and it ran smoothly on x86. The issue was resolved by disabling the macro _ARM_ASSEM_ and enabling the macro _LOW_ACCURACY_, as the sketch below illustrates.
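
The pattern looks roughly like the following sketch, simplified from Tremor-style sources (MULT31 is one of the functions guarded this way; the ARM code in the real tree differs in detail):

#include <stdint.h>

#if defined(_ARM_ASSEM_)
/* ARM-only path: 32x32 -> top 33 bits of the product via SMULL. */
static inline int32_t MULT31(int32_t x, int32_t y)
{
    int lo, hi;
    asm volatile ("smull %0, %1, %2, %3"
                  : "=&r" (lo), "=&r" (hi)
                  : "r" (x), "r" (y));
    return (hi << 1) | ((unsigned int)lo >> 31);
}
#else
/* Portable C path: compiles unchanged for x86. */
static inline int32_t MULT31(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 31);
}
#endif

Disabling _ARM_ASSEM_ simply selects the portable #else branch everywhere it appears.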

Endian Conversion in ARM and x86

In cross-platform projects, we often face an age-old problem: endian conversion. For example, if a file is generated on a little-endian machine, the integer 255 may be stored like the following:

ff 00 00 00

But when it is read into memory, the resulting value differs between platforms, which causes a porting problem.

int a;
fread(&a, sizeof(int), 1, file);
// on a little-endian machine, a = 0xff;
// but on a big-endian machine, a = 0xff000000;

A very simple and effective way to solve this issue is to write a function called readInt():

void readInt(void* p, FILE* file)
{
    unsigned char buf[4];  /* unsigned, so the shifts below are well defined */
    fread(buf, 4, 1, file);
    /* Assemble the bytes in a fixed order, independent of host endianness. */
    *((uint32*)p) = (uint32)buf[0] << 24 | (uint32)buf[1] << 16
                  | (uint32)buf[2] << 8  | buf[3];
}

The function has the advantage of working identically on both big- and little-endian platforms. But it is inconsistent with the common way of reading a whole structure at once:

fread(&header, sizeof(struct MyFileHeader), 1, file);

If MyFileHeader contains a lot of integers, this approach results in many separate reads. It is not only cumbersome to code, but also runs slowly due to the increased I/O operations. So I propose another method: leave the reading code unchanged and use several macros to post-process the data.

fread(&header, sizeof(struct MyFileHeader), 1, file);
CQ_NTOHL_ARRAY(&header.box, 4); // box is a RECT structure 

If the computer's endianness doesn't match that of the data file, these macros are defined to perform the conversion; otherwise they are defined as empty:

#ifdef ENDIAN_CONVERSION
#    define CQ_NTOHL(a) {a = ((a) >> 24) | (((a) & 0xff0000) >> 8) | \
         (((a) & 0xff00) << 8) | ((a) << 24); }
#    define CQ_NTOHL_ARRAY(arr, num) {uint32 i; \
         for(i = 0; i < num; i++) {CQ_NTOHL(arr[i]); }}
#else
#    define CQ_NTOHL(a)
#    define CQ_NTOHL_ARRAY(arr, num)
#endif

This approach has the advantages of not wasting any CPU cycles if ENDIAN_CONVERSION is not defined, and the code can be preserved in its natural form of reading a whole structure at a time.
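
How ENDIAN_CONVERSION gets defined is up to the build; one possible compile-time check (an assumption on our part, not from the original code) on Android/glibc toolchains:

#include <endian.h>  /* provides __BYTE_ORDER on Android (bionic) and glibc */

/* The data file in this example is big-endian, so conversion is
   needed only on little-endian hosts such as x86. */
#if __BYTE_ORDER == __LITTLE_ENDIAN
#    define ENDIAN_CONVERSION
#endif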


Conclusion

ARM and x86 have different architectures and instruction sets, so there are some differences in low-level functionality. I hope the information in this article will help you address these differences when developing Android NDK apps for multiple platforms.

About the Author

Peng, Tao (tao.peng@intel.com) is a software apps engineer in the Intel Software and Services Group. He currently focuses on enabling and performance optimization of gaming and multimedia applications, in particular on Android mobile platforms.


References

[1] "From ARM NEON to Intel SSE - The Automatic Porting Solution, Tips and Tricks," Intel Developer Zone blog.