How to use Intel Cilk Plus extension, to speed up your Android application with threading features

Intel(R) CilkTM Plus is a set of language extension of Intel(R) compiler, which can help to quickly and easily implement parallelism and vectorization for C/C++ code. You may refer to its main page to know the basics of Cilk Plus:

Intel Cilk Plus: https://software.intel.com/en-us/intel-cilk-plus

The Open Source Cilk Plus: http://www.cilkplus.org/

This article is mainly to introduce the Cilk Plus support from Intel C/C++ Android compiler, I will guide you how to use Cilk Plus to speedup your applicatoin's performance on Android.

The three simple keywords:

Intel Cilk Plus provides 3 simple keywords, to help you quickly threading your application. The keywords are cilk_spawn, cilk_sync, cilk_for, but, these are not all of Cilk Plus. To use Cilk Plus, you will also need to link to Cilk runtime library (libcilkrts.so).

How to use Cilk Plus in an Android NDK project:
1. first of all, your project needs to be built with Intel Compiler, you need to add below in "jni/Application.mk" file (if not exists, you may create one):

NDK_TOOLCHAIN = x86-icc
APP_ABI = x86

2. Add relevant STL in your project's Application.mk file

APP_STL:=stlport_shared
OR
APP_STL:=gnustl_static
OR
APP_STL:=gnustl_shared

Notes: After you specific the STL implement, Intel Compiler will automatically linked to Cilk Plus runtime. Without adding STL, you may get lots of "undefined reference" linking errors when building the code.

3. Loading the runtime libraries in your Java code

System.loadLibrary("gnustl_shared"); 
System.loadLibrary("cilkrts");
System.loadLibrary("yourJNILib");
OR
System.loadLibrary("stlport_shared");
System.loadLibrary("cilkrts");
System.loadLibrary("yourJNILib");

"yourJNILib" is the library name of your JNI. Note: in latest Android system, loadLibrary have fixed the dependency issue. In previous Android versions, when using System.loadLibrary, if libyourJNILib.so depends on libcilkrts.so and libcilkrts.so depends on libgnustl_shared.so, you must load them according to above orders. But, from my testing, the latest Android may have already fixed the issue to load dependency libraries automatically, thus, you only need to load your JNI library for above code (only need one line of to load "yourJNILib").

4. Using Cilk Plus in C/C++ code
Now, you can use Cilk Plus in your code. Taking calculation of fib (fibonacci sequence) as an example,  the two implementations are as below:

int fib(int n) {
	if (n == 0 || n == 1)
		return n;
	int x = fib(n - 1);
	int y = fib(n - 2);
	return x + y;
}

#include <cilk/cilk.h>
int fib_cilk(int n) {
	if (n == 0 || n == 1)
		return n;
	int x = cilk_spawn
	fib_cilk(n - 1);
	int y = fib_cilk(n - 2);
	cilk_sync;
	return x + y;
}

To use Cilk Plus, you will need to include the headers, for the 3 simple keywords, including cilk/cilk.h is enough. The meaning of cilk_spawn is, spawn a new strand (it can be understood as a task of Cilk Plus), Cilk Plus runtime will dynamically manager the actual working therads (workers) (which can be understood as the physical threads), and the runtime will map the strands to the working threadins accordingly (the default working threads number is same with the process cores). 

Notes: for the case above, you can easily understand it as, "2 threads exists, one to execute fib_cilk(n-1) and another will execute the rest of the code a.k.a continuation code fib_cilk(n-2)".

The meaning of cilk_sync is "synchronization", that is, it will wait for all the working threads to finish, and then continue to execute. Obviously, for the calculation of fib, we need to wait for fib(n-1) and fib(n-2) and then we can continue to execute.

Testing results for your reference:

From the testing results, the code without Cilk Plus takes 2831ms, and the code with Cilk Plus takes 10315ms. why is it even much more slow when using Cilk Plus? Obviously, this is not the reason I write this article.

5. Optimize the workload

In above example, every time when we calculate fib(N), we will have 2 workers to run fib(N-1) and fib(N-2), but when N is very small (examle: N=3), the workload of fib(N-1) and fib(N-2) will be very small, thus, the overhead of creating the threads/workers and context switch between the threads/workers may impact the performance badly. As this is a recursive, such overhead will be introduced for many times, this is the reason that the performance is much worse than none cilk plus code.

How to fix it? We just need to invoke serial version when N is small enough, example, we can write like below:

int fib_cilk_optimize(int n) {
	if (n == 0 || n == 1)
		return n;
	if (n < 20) {
		int x = fib(n - 1);
		int y = fib(n - 2);
		return x + y;
	} else {
		int x = cilk_spawn
		fib_cilk_optimize(n - 1);
		int y = fib_cilk_optimize(n - 2);
		cilk_sync;
		return x + y;
	}
}

Now, the cilk threads will be invoked only for the workloads where N>=20. Testing the results for modified cilk code above, we get the following results:

Below is the chart for above results:

Note: the full source code (Android project) is attached at the end of the article, you can download it and test it out on your own device.



Other resources:

https://software.intel.com/en-us/android/articles/using-advanced-intelr-c-compiler-features-for-android-applications

Testing Device: Dell Venue 8 (2 core, with 4 HT threads) (Refer to http://software.intel.com/en-us/android/articles/mobile-development-kit-for-android on how to get the dev-edition of the image for the device)

This article applies to:
Products: Intel(R) INDE; INTEL(R) System Studio
Host OS/platforms: Windows (IA-32, Intel(R) 64), Linux* (IA32, Intel(R) 64)
Target OS/platforms: Android* (IA-32)

There are downloads available under the Intel Sample Source Code License Agreement license. Download Now
For more complete information about compiler optimizations, see our Optimization Notice.