Intel for Android* Developers Learning Series #7: Creating and Porting NDK-based Android* Applications for Intel® Architecture

Android* applications can incorporate native code using the Native Development Kit (NDK) toolset. It allows developers to reuse legacy code, code to low-level hardware, or differentiate their applications by taking advantage of features otherwise not optimal or possible.

This article is a basic introduction on how to create NDK-based applications for Intel architecture from start to finish, and also on simple use cases for porting existing NDK-based applications to devices based on Intel architecture. We will walk through a simple step-by-step app development scenario to demonstrate the process.

1. Introduction to Native Development Kit

Building a native application should be considered only when performance is an issue. The drawback of native development is the effort required to support the application on multiple architectures and on different generations of the same architecture.

We assume that you have set up the environment for Java application development correctly and that you are familiar with creating simple Java applications. Before proceeding further you need to install NDK from

2.    Building a “Hello, world!” Application with NDK

The purpose of this chapter is to build the first sample application from NDK for x86 target with both the GNU* compiler and Intel® C++ Compiler. You will learn how to tweak the build system to change optimization flags for the application, module, specific file, and so on. Intel C++ Compiler is supported only on a Linux host, so we restrict attention to Linux.

2.1.               Prepare and Validate the Environment

To start experimenting with samples from NDK distribution, you need to become familiar with ndk-build utility and with the syntax of 2 makefiles: and

The ndk-build utility abstracts out the details of build system; and contain variable definitions to configure the build system. For example, these makefiles specify the list of source files from which to build the application, the external components upon which the application depends on, and so on.

Make sure that the ndk-build and android utilities from NDK and SDK correspondingly are in your PATH:


The easiest way to validate the environment is to rebuild the hello-jni sample from the NDK. Copy hello-jni to a working directory and run the following command in the new directory:

ndk-build -B

Check that you have at least one target supporting x86 ABI[1] by running:

android list targets

If no such target is listed, install one using:

android sdk


2.2.  Building with GNU* Compiler

In the previous section we validated setup and built in libs/armeabi. The new library can be run on ARM* only because ARM target is the default target for native applications. As our primary interest is developing for x86, we need to configure the application properly.

Create the file hello-jni/jni/ and add the following definition into this file: 

APP_ABI := x86

This indicates that a cross-compiler and other binary tools targeting x86 should be used to create the library. Now you are ready to build the library for x86. This time we will enable verbose ndk-build output with the V=1 option to see the commands and their parameters as they are invoked by make. Run ndk-build -B V=1. The -B option causes a complete rebuild of the libhello-jni.so library. Once you have finished creating the native library, finalize the creation of the test APK.

The name of the x86 target in NDK version r8b is android-15. You can always check the list of available targets by using the command android list targets.

At this point HelloJni will appear in the list of the installed applications on the device or emulator, as shown in Figure 1.

Figure 1: HelloJni appears in the list of the installed applications on the device or emulator.

There are several important variables from that are relevant to performance optimization.

The first one is APP_OPTIM. It should be set to the value release to obtain an optimized binary. By default the application is built for debugging. To check which optimization options are enabled by default, update APP_OPTIM and run ndk-build -B V=1.

If you are not satisfied with the default optimization options, there are special variables to add additional options during compilation. They are APP_CFLAGS and APP_CPPFLAGS. Options specified in the APP_CFLAGS variable are added during compilation of both C and C++ files and the APP_CPPFLAGS variable applies only to C++ files.

To increase optimization level to –O3, add APP_CFLAGS:=-O3 to

2.3  Building with the Intel® C++ Compiler

The Intel compiler is not part of the NDK and must be integrated into the NDK before it can be used with the ndk-build utility. Essentially you need to create a new toolchain.

  • Create the directory for the new toolchain: mkdir -p <NDK>/toolchains/icc-12.1/prebuilt/.
  • The icc-12.1 name was chosen arbitrarily. It is used as the value of the ndk-build parameter NDK_TOOLCHAIN.
  • Copy the top-level directory <ICC_ROOT> containing the installed Intel compiler to the directory <NDK>/toolchains/icc-12.1/prebuilt/intel. To save space you may instead create a symbolic link at <NDK>/toolchains/icc-12.1/prebuilt/intel that points to <ICC_ROOT>.
  • Copy and from the GCC directory:
  • cp <NDK>/toolchains/x86-4.6/{,}  <NDK>/toolchains/icc-12.1
  • Change the TOOLCHAIN_NAME variable in the new file to anything convenient, for example, to icc.
  • Modify the TOOLCHAIN_PREFIX variable in to point to the new toolchain:
  • TOOLCHAIN_PREFIX := $(TOOLCHAIN_ROOT)/prebuilt/intel/bin/
  • Note the trailing ‘/’ in the TOOLCHAIN_PREFIX value.
  • Specify the path to the x86 GCC compiler in
  • export ANDROID_GNU_X86_TOOLCHAIN=$(TOOLCHAIN_ROOT)/../x86-4.6/prebuilt/linux-x86/
  • Specify the path to the target system root. The system root is required to locate system libraries and header files:
  • export ANDROID_SYSROOT=$(call host-path,$(SYSROOT))[1]
  • Specify the version of the GCC used by the Intel compiler:
  • Specify paths to Intel compiler components in the <NDK>/toolchains/icc /
  • TARGET_STRIP:=$(ANDROID_GNU_X86_TOOLCHAIN)/i686-android-linux/bin/strip
  • TARGET_LIBGCC:=$(shell env ANDROID_GNU_X86_TOOLCHAIN=$(ANDROID_GNU_X86_TOOLCHAIN) $(TARGET_CC) -print-libgcc-file-name)

Run ndk-build -B V=1 NDK_TOOLCHAIN=icc-12.1. Note that the shared library is now built by the Intel compiler. This time, however, the application will not work because it depends on supplementary shared libraries from the Intel compiler distribution. The easiest solution is to turn on static linking of the Intel libraries. Add the following line to <WORK_DIR>/hello-jni/jni/

LOCAL_LDFLAGS := -static-intel

Now rebuild and reinstall the application. It should work as expected on the emulator or device.

There are several warnings generated about unsupported options '-funwind-tables' and '-funswitch-loops'. Warnings can be safely ignored. We will cover option compatibility in a later section.

2.4. Packaging Intel® C++ Compiler Shared Libraries

If the application is big enough, then static linking is not justified. In this case you need to package libraries with your application. The application built with the Intel compiler depends on the following libraries[2]:

  • containing optimized string and memory routines;
  • and containing optimized mathematical functions.

To package Intel compiler libraries, add the following lines into <WORK_DIR>/hello-jni/jni/ 

	include $(CLEAR_VARS)

	LOCAL_MODULE    := libintlc




You also need similar configuration for and Then copy, and libraries into <WORK_DIR>/hello-jni/jni. Finally rebuild and reinstall the package.

3.    Intel® C++ Compiler Options

The Intel compiler supports most, but not all, of the GNU compiler options. When the Intel compiler encounters an unknown or unsupported option, it issues a warning, as in the case of '-funswitch-loops'. Always review warnings.

3.1. Compatibility Options

There are a number of incompatibilities related to warnings. It is a matter of much debate just which constructs are dangerous and which are not. In general, the Intel compiler produces more warnings than the GNU compiler. We noticed that the GNU compiler also changed warning setup between releases 4.4.3 and 4.6. The Intel compiler tries to support the GNU options format for warnings, but the support may be incomplete as the GNU compiler evolves.

The GNU compiler uses various mnemonic names with -W<diag name> and -Werror=<diag name>. The first option enables an additional warning and the second option makes the GNU compiler treat it as an error. In that case, the compiler does not generate an output file and produces only diagnostic data. There is a complementary option -Wno-<diag name> that suppresses the corresponding warning. The GNU compiler option -fdiagnostics-show-option helps to disable warnings: each warning emitted carries a hint explaining how to control it.

The Intel compiler does not recognize some of these options, and they can be ignored or, even better, fixed. Sometimes all warnings are turned into errors with the -Werror option. In this case the build with the Intel compiler may break. There are two ways to avoid this problem: either fix the warning in the source code, or disable it with -diag-disable <id>, where <id> is the unique number assigned to the warning. This unique <id> is part of the warning's text. Fixing the source code is the preferred way if you feel that the reported language construct is dangerous.

We built the whole Android OS image with the Intel compiler and found several options that are not supported and are not related to warnings. Most of them can be ignored as explained in Table 1. For several options we used equivalent Intel compiler options. 

| GNU compiler option | Equivalent option of Intel compiler |
| --- | --- |
| -mbionic: makes the compiler aware that the target C library implementation is Bionic | Not needed; it is the default mode of the Intel compiler for Android |
| -mandroid: enables code generation according to the Android ABI | Not needed; it is the default mode of the Intel compiler for Android |
| -fno-inline-functions-called-once: inline heuristics override | Not needed |
| -mpreferred-stack-boundary=2: aligns the stack pointer on 4 bytes | |
| -mstackrealign: aligns the stack to 16 bytes in each prologue. Needed for compatibility between old code assuming 4-byte alignment and new code with 16-byte stack alignment | |
| -mfpmath=sse: uses SSE instructions for scalar FP arithmetic | Not needed. When the target instruction set is at least SSE2, the Intel compiler generates SSE instructions for FP arithmetic[3] |
| -funwind-tables: turns on the generation of the stack unwinding tables | Not needed. The Intel compiler generates these tables by default |
| -funswitch-loops: overrides GNU compiler heuristics and turns on the loop unswitching optimization at -O2 and -O1 | Not needed. Allow the Intel compiler to use its own heuristics |
| -fconserve-stack: disables optimizations that increase stack size | Not needed |
| -fno-align-jumps: disables the optimization that aligns branch targets | Not needed |
| -fno-delete-null-pointer-checks: removes assumptions needed to implement some optimizations | Not needed |
| -fprefetch-loop-arrays: enables generation of prefetch instructions. Prefetching may lead to performance degradation. Use with care | Not needed, but if you want to experiment use -opt-prefetch |
| -fwrapv: according to the C standard the result of signed integer arithmetic is undefined if overflow occurs. This option completes the specification and states that the result should be as if wrapping around takes place. This option may disable some optimizations | Not needed. It is the default mode for the Intel compiler |
| -msoft-float: implements floating point arithmetic in software. This option is used during the kernel build | Not implemented. During the kernel build the generation of processor instructions for floating point operations is disabled. The Intel compiler would generate an error if the code contained operations on floating point data. We did not encounter such errors |
| -mno-mmx, -mno-3dnow: disable generation of MMX* and 3DNow* instructions | Not needed. The Intel compiler does not generate them |
| -maccumulate-outgoing-args: enables the optimization that allocates space for calls' output arguments on the stack in advance | Not needed |

Table 1: Compiler Option Comparison
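The wraparound behavior that -fwrapv requests can also be written out explicitly when code must not depend on a compiler flag. The following is a minimal sketch, not part of the NDK or of either compiler: the helper name wrapping_add is hypothetical, and the final conversion back to int is assumed to be modular, which holds on mainstream two's-complement compilers but is implementation-defined before C++20.

```cpp
#include <climits>

// Hypothetical helper (an illustration only): adds two ints with wraparound
// semantics without relying on the -fwrapv option.
int wrapping_add(int a, int b)
{
    // Unsigned arithmetic is defined by the language to wrap modulo 2^N.
    unsigned int sum = static_cast<unsigned int>(a) + static_cast<unsigned int>(b);
    // Assumption: converting back to int is modular (two's complement).
    return static_cast<int>(sum);
}
```

Under that assumption, wrapping_add(INT_MAX, 1) yields INT_MIN, which is the result the table describes for overflow with wrapping enabled.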

Additional details about Intel and GNU compiler compatibility can be found in the white paper at

3.2. Performance Options

When you work on the performance, there is always a tradeoff. X86 processors differ by the microarchitecture and for optimal performance, processor-specific optimization is required. Before you start tuning your code, you should decide whether the application is going to run on Intel or AMD processors, whether the application is targeted for smartphones or tablets, and so on.

Most of the optimizations in the Intel compiler are tuned for Intel processors. We assume that the target is the Intel® Atom™ processor, because this is the Intel processor for mobile devices. In this case, for the best performance you need to add the -xSSSE3_ATOM option during compilation. If you expect that the code will also run on AMD-based devices, use -march=atom instead. In this case the application will run on any processor that supports Intel Atom instructions, but some aggressive optimizations may be turned off.

To enable -xSSSE3_ATOM for all files in the “Hello, world!” application, add the following line to hello-jni/jni/


After you have chosen the target processor, it is up to you to modify the optimization level. By default, the build system disables optimizations completely with -O0 in debug mode and sets the default -O2 optimization level in release mode. You may try aggressive optimizations with -O3, but this may increase code size. On the other hand, if code size is an important parameter, try -Os.

The optimization level for the whole application can be changed by adding one of -O0 through -O3 to APP_CFLAGS:


Remaining optimization options will be covered in separate sections: “Vectorization” and “Interprocedural optimization.”

4.    Vectorization

The Intel compiler supports advanced code generation including auto-vectorization. For the Intel C/C++ compiler, vectorization is loop unrolling with generation of SIMD instructions operating on several elements at a time. The developer can unroll loops manually and insert appropriate function calls corresponding to SIMD instructions. This approach is not forward-scalable and incurs high development costs, because the work must be redone when a new microprocessor with advanced instruction support is released. For example, early Intel Atom microprocessors did not benefit from vectorization of loops processing double-precision floating point, while single-precision data was processed effectively by SIMD instructions.
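To make the manual approach concrete, here is a minimal sketch of a hand-vectorized copy using SSE2 intrinsics. The function name copy_int_manual is hypothetical and not taken from the NDK or Android sources; the sketch assumes 32-bit int and non-overlapping buffers, and it falls back to a plain scalar loop when SSE2 is not available.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (an illustration only): copies num ints, four at a time
// where SSE2 is available. Assumes 32-bit int and non-overlapping buffers.
void copy_int_manual(int *dst, const int *src, std::size_t num)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    // Main loop: one unaligned 128-bit load and store moves four ints.
    for (; i + 4 <= num; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), v);
    }
#endif
    // Scalar tail; also the whole copy on targets without SSE2.
    for (; i < num; ++i) {
        dst[i] = src[i];
    }
}
```

Even this small routine is tied to one vector width and one instruction set, which illustrates why hand-written SIMD code has to be revisited for each new processor generation.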

Vectorization simplifies programming tasks by relieving the programmer of learning instruction sets for a particular microprocessor, because the Intel compiler always supports the latest generations of Intel microprocessors.

-vec options turn on vectorization at the default optimization level for microprocessors supporting IA32 architecture: both Intel and non-Intel. To improve the quality of vectorization, you need to specify the target microprocessor on which the code will execute. For optimal performance on Android smartphones based on Intel architecture, it is advised to use the –xSSSE3_ATOM option. Vectorization is enabled with the Intel C++ Compiler at optimization levels of -O2 and higher.

Many loops are vectorized automatically, and most of the time the compiler generates optimal code on its own, but sometimes it requires guidance from the programmer. The biggest obstacle to efficient vectorization is getting the compiler to estimate data dependencies as precisely as possible.

To take full advantage of Intel compiler vectorization the following techniques are useful:

  • Generate and understand a vectorization report
  • Improve performance by pointer disambiguation
  • Improve performance using interprocedural optimization
  • Compiler pragmas

4.1. Vectorization Report

We will start with the implementation of memory copying. The loop has the structure commonly used in Android sources: 

	// It is assumed that the memory pointed to by dst
	// does not intersect with the memory pointed to by src
	void copy_int(int *dst, int *src, int num)
	{
	    int left = num;
	    if (left <= 0) return;
	    do {
	        left--;
	        *dst++ = *src++;
	    } while (left > 0);
	}

For experiments with vectorization we will not create a separate project but will instead reuse the hello-jni project. Add the function to the new file jni/copy_int.cpp. Add this file to the list of source files in jni/

LOCAL_SRC_FILES := hello-jni.c copy_int.cpp

To enable a detailed vectorization report, add the -vec-report3 option to the APP_CFLAGS variable in jni/

APP_CFLAGS := -O3 -xSSSE3_ATOM  -vec-report3

If you rebuild the library, you will notice several remarks generated:

jni/copy_int.cpp(6): (col. 5) remark: loop was not vectorized: existence of vector dependence.                             

jni/copy_int.cpp(9): (col. 10) remark: vector dependence: assumed ANTI dependence between src line 9 and dst line 9.     

jni/copy_int.cpp(9): (col. 10) remark: vector dependence: assumed FLOW dependence between dst line 9 and src line 9.

Unfortunately auto-vectorization failed, because too little information was available to compiler. If the vectorization were successful, then the assignment 

*dst++ = *src++;

would be replaced with

*dst = *src;

*(dst+1) = *(src+1);

*(dst+2) = *(src+2);

*(dst+3) = *(src+3);

dst += 4; src += 4;

and the first four assignments would be performed in parallel by SIMD instructions. But parallel execution of the assignments is invalid if memory written on the left side is also read on the right side of an assignment. Consider, for example, the case when dst+1 is equal to src+2; in this case the final value at the dst+2 address would be incorrect.
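The overlap hazard can be reproduced without any SIMD hardware. In the sketch below, copy_scalar mirrors the original element-by-element loop, while copy_blocked models a 4-wide vector copy by performing all four loads before any store. Both function names are hypothetical and exist only for illustration.

```cpp
#include <cstddef>

// Scalar reference: one element per step, as in the original loop.
void copy_scalar(int *dst, const int *src, std::size_t num)
{
    for (std::size_t i = 0; i < num; ++i) {
        dst[i] = src[i];
    }
}

// Models a 4-wide SIMD copy: all four loads complete before any store.
// Assumption: num is a multiple of 4.
void copy_blocked(int *dst, const int *src, std::size_t num)
{
    for (std::size_t i = 0; i < num; i += 4) {
        int t0 = src[i];
        int t1 = src[i + 1];
        int t2 = src[i + 2];
        int t3 = src[i + 3];
        dst[i] = t0;
        dst[i + 1] = t1;
        dst[i + 2] = t2;
        dst[i + 3] = t3;
    }
}
```

If both functions copy four elements of the array {0, 1, 2, 3, 4, 5, 6, 7} with src at the start and dst one element later (so that dst+1 equals src+2), the scalar version propagates the first element and leaves 0 at dst+2, whereas the blocked version leaves 2 there. The two results differ, which is why the compiler must not vectorize unless it can rule out overlap.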

The remarks indicate which kinds of dependencies are conservatively assumed by the compiler preventing vectorization:

  • FLOW dependence is a dependence between earlier store and later load from the same memory location
  • ANTI dependence is a dependence, on the contrary, between earlier load and later store to the same memory location
  • OUTPUT dependence is the third kind of dependencies between 2 stores to the same memory location
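The three kinds of dependence can be illustrated with small loops. This is a sketch with hypothetical function names; each loop is written so that the dependence is carried from one iteration to another, which is the situation that matters for vectorization.

```cpp
// FLOW dependence: iteration i loads a[i - 1], which iteration i - 1 stored.
void flow_example(int *a, int n)
{
    for (int i = 1; i < n; ++i) {
        a[i] = a[i - 1] + 1;
    }
}

// ANTI dependence: iteration i loads a[i + 1], which iteration i + 1
// later overwrites.
void anti_example(int *a, int n)
{
    for (int i = 0; i + 1 < n; ++i) {
        a[i] = a[i + 1] + 1;
    }
}

// OUTPUT dependence: every iteration stores to the same location *last,
// so the order of the stores determines the final value.
void output_example(const int *a, int n, int *last)
{
    for (int i = 0; i < n; ++i) {
        *last = a[i];
    }
}
```

In copy_int the compiler cannot tell whether dst and src refer to the same array, so it has to assume that the first two patterns may occur.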

From the code comment we know that the author required memory pointed to by dst and src not to overlap. To communicate information to the compiler it is sufficient to add restrict qualifiers to dst and src arguments: 

void copy_int(int * __restrict__ dst, int * __restrict__ src, int num)


The restrict qualifier was added to the C standard published in 1999. To enable support of C99 you need to add -std=c99 to the options. Alternatively you may use the -restrict option to enable it for C++ and other C dialects. In the code above we inserted the __restrict__ keyword, which is always recognized as a synonym for the restrict keyword.

If you rebuild the library again you will notice that the loop is now vectorized:

jni/copy_int.cpp(6): (col. 5) remark: LOOP WAS VECTORIZED.

In our example, vectorization failed due to the compiler's conservative analysis. There are other cases when a loop is not vectorized:

  • The instruction set does not allow efficient vectorization; the following remarks indicate this type of issue:
    • “Non-unit stride used”
    • “Mixed Data Types”
    • “Operator unsuited for vectorization”
    • “Contains unvectorizable statement at line XX”
    • “Condition may protect exception”
  • Compiler heuristics prevent vectorization; vectorization is possible but may actually lead to a slowdown; the diagnostic will contain:
    • “Vectorization possible but seems inefficient”
    • “Low trip count”
    • “Not Inner Loop”
  • Shortcomings of the vectorizer:
    • “Condition too Complex”
    • “Subscript too complex”
    • “Unsupported Loop Structure”

The amount of information produced by vectorizer is controlled by –vec-reportN. You may find additional details in compiler documentation.

4.2. Pragmas

As we saw, the restrict pointer qualifier can be used to avoid conservative assumptions about data dependencies. But sometimes it might be tricky to insert restrict keywords. If many arrays are accessed in the loop, it might also be too laborious to annotate all pointers. To simplify vectorization in these cases, there is an Intel-specific pragma “simd”. It is used to vectorize inner loops assuming there are no dependencies between iterations.

Pragma simd applies only to for-loops operating on native integer and floating point types[4]:

  • The for-loop should be countable with the number of iterations known before loop starts
  • The loop should be innermost
  • All memory references in the loop should not fault (it is important for masked indirect references)

To vectorize our loop with a pragma, we need to rewrite it into a for-loop:

	void copy_int(int *dst, int *src, int num)
	{
	    #pragma simd
	    for (int i = 0; i < num; i++) {
	        *dst++ = *src++;
	    }
	}

Rebuild the example and note that the loop was vectorized.

Simple loop restructuring for “pragma simd” and insertions of “#pragma simd” in Android OS sources allowed us to improve the performance of Softweg benchmark by 1.4x without modification of the benchmark itself.

4.3.   Interprocedural Optimizations

In previous sections we described approaches when you have good understanding of the code before you start to work on performance. If you are not familiar with the code, then you can help the compiler to analyze the code by extending the scope of the analysis. In the example with copying, the compiler should take conservative assumptions because it knows nothing about the copy_int routine’s parameters. If call sites are available for analysis then the compiler can try to prove that the parameters are safe for vectorization.

To extend the scope of the analysis, you need to enable interprocedural optimizations. A few of these optimizations are already enabled by default during single-file compilation. Interprocedural optimizations will be described in a separate section.

4.4.  Limitations of Auto-Vectorization

Vectorization cannot be used to speed up the Linux kernel code, because SIMD instructions are disabled in kernel mode with the -mno-sse option. This was done intentionally by the kernel developers.

5.    Interprocedural Optimizations

The compiler can perform additional optimizations if it is able to optimize across function boundaries. For example, if the compiler knows that some function call argument is constant then it may create special version of the function specifically tailored for this constant argument. This special version later can be optimized with knowledge about parameter value.
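As a rough illustration of this kind of specialization, the sketch below shows a general routine, a call site that always passes a constant, and a hand-written stand-in for the specialized version a compiler might derive. All three function names are hypothetical, and the clone is written by hand here only to show the idea; an actual compiler decides on its own whether such a clone is profitable.

```cpp
// General routine: the stride is known only at run time.
// Assumption: stride is positive.
int sum_strided(const int *data, int n, int stride)
{
    int sum = 0;
    for (int i = 0; i < n; i += stride) {
        sum += data[i];
    }
    return sum;
}

// Call site with a constant argument. With interprocedural analysis the
// compiler can see that stride is always 1 here.
int sum_all(const int *data, int n)
{
    return sum_strided(data, n, 1);
}

// Hand-written stand-in for the specialized clone: with stride fixed to 1
// the loop has unit-stride accesses and is a simpler vectorization target.
int sum_strided_1(const int *data, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

The specialized version computes the same result as sum_all for any input, which is the property the compiler has to preserve when it substitutes a clone for the general routine.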

To enable optimization within a single file, specify the -ip option. When this option is specified, the compiler generates a final object file that can be processed by the system linker. The disadvantage of generating an object file is almost complete information loss; the compiler does not even attempt to extract information from object files.

Single-file scope may be insufficient for the analysis due to this information loss. In this case you need to add the -ipo option. When this option is given, the compiler compiles files into an intermediate representation that is later processed by special Intel tools: xiar and xild.

The xiar tool should be used for creating static libraries instead of the GNU archiver ar, and xild should be used instead of the GNU linker ld. It is only required when linker and archiver are called directly. A better approach is to use compiler driver icc or icpc for final linking[5]. The downside of the extended scope is that advantage of separate compilation is lost: each modification of the source requires relinking and relinking causes complete recompilation.

There is an extensive list of advanced optimization techniques that benefit from global analysis. Some of them are listed in reference documentation. Note that some optimizations are Intel-specific and are enabled with –x* options[6].

Unfortunately things are slightly more complicated on Android with respect to shared libraries. By default all global symbols are preemptable. Preemptability is easy to explain by example. Consider two libraries linked into the same executable:

	// First library
	int id(void) {
	  return 1;
	}

	// Second library
	int id(void) {
	  return 2;
	}

	int foo(void) {
	  return id();
	}

Assume that the libraries were created simply by executing icc -fpic -shared -o <libname>.so <libname>.c. Only the strictly required options -fpic[7] and -shared[8] are given.

If the system dynamic linker loads the first library before the second, then the call to the function id() from the function foo() is resolved in the first library.

When the compiler optimizes the function foo(), it cannot use its knowledge about id() from the same library. For example, it cannot inline the id() function. If the compiler inlined id(), it would break the scenario above, in which the definition from the first library must take precedence.

As a consequence, when you write shared libraries you should carefully specify which functions can be preempted. By default all global functions and variables are visible outside a shared library and can be preempted. This default is inconvenient when you implement only a few native methods. In this case you need to export only the symbols that are called directly by the Dalvik* Java* virtual machine.

Symbol’s visibility attribute is a means to specify whether a symbol is visible outside the module and whether it can be preempted:

  • “default” visibility makes a global[9] symbol visible outside the shared library and preemptable
  • “protected” visibility makes a symbol visible outside the shared library, but the symbol cannot be preempted
  • “hidden” visibility makes a global symbol visible only within the shared library and forbids preemption
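The attributes can be applied per symbol in the source. The sketch below uses the GCC-style visibility attribute, which the Intel compiler and Clang also accept; the macro and function names (LIB_EXPORT, LIB_LOCAL, helper_id, exported_foo) are hypothetical and chosen only for illustration. It uses the “default” and “hidden” values; “protected” is written the same way.

```cpp
// Hypothetical convenience macros wrapping the GCC-style attribute.
#if defined(__GNUC__)
#define LIB_EXPORT __attribute__((visibility("default")))
#define LIB_LOCAL  __attribute__((visibility("hidden")))
#else
#define LIB_EXPORT
#define LIB_LOCAL
#endif

// Hidden: not visible outside the shared library, so it cannot be
// preempted and the compiler is free to inline it into callers.
LIB_LOCAL int helper_id(void)
{
    return 1;
}

// Default: visible outside the shared library (and preemptable).
LIB_EXPORT int exported_foo(void)
{
    return helper_id();
}
```

Because helper_id cannot be replaced by a definition from another library, the call inside exported_foo always reaches the local definition, which is the guarantee the compiler needs for cross-function optimization.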

Returning to hello-jni application, we need to specify that the default visibility is hidden and that the functions exported for JVM have protected visibility.

To set the default visibility to hidden, add -fvisibility=hidden to the APP_CFLAGS variable in jni/

APP_CFLAGS := -O3 -xSSSE3_ATOM  -vec-report3 -fvisibility=hidden -ipo

To override the visibility of the Java_com_example_hellojni_HelloJni_stringFromJNI, add the attribute to the function definition:

jstring __attribute__((visibility("protected")))
  Java_com_example_hellojni_HelloJni_stringFromJNI(JNIEnv* env, jobject thiz)

Rebuild and reinstall.

[1]This variable will not be required for NDK integration with Intel® C/C++ compiler version 13.0

[2]Intel C++ Compiler version 13.0 has additional library containing an optimized function for pseudo-random number generation.

[3]Scalar SSE instructions do not perform arithmetic in extended precision. If your application depends on extended precision for intermediate calculations, use –mp option.

[4]There are other minor limitations; please check references. The compiler will warn if syntax was violated or if the vectorization with simd pragma failed.

[5]During NDK configuration we made TARGET_AR point to xiar and TARGET_LD to point to xild

[6]For example, for Intel Atom target specify –xSSSE3_ATOM.

[7]The –fpic option specifies that the code should be compiled in a way allowing loading at the arbitrary address. PIC stands for position independent code.

[8]The –shared option specifies that the linked module is shared library.

[9]A global symbol is a symbol that is visible from other compilation units. Global functions and variables can be accessed from any of the object files that comprise a given shared library. Visibility attributes specify relations between functions and variables in different shared libraries.

[1] Application binary interface. Android follows the System V ABI for i386 architecture:

[2]We did not use the ant release command, because it requires properly configured package signing.

For more complete information about compiler optimizations, see our Optimization Notice.