Intel® Cilk™ Plus is a new method for implementing SIMD (SSE) vectorization and parallel programs. It is provided by the Intel Parallel Composer 2011 product and by Intel C++ Composer XE (previously called Intel C++ Compiler Professional Edition) for Windows and Linux*.
When adding parallelism to an application for better performance or scaling, do not forget hardware-level data parallelism: vectorization. It typically pays off most when combined with a multi-threading technology such as the Intel Cilk Plus keywords described below.
The steps for adding parallelism with vectorization are:
- identify hotspot loops or functions in the application, using Intel Parallel Advisor, Intel Parallel Amplifier, or Intel VTune™ Amplifier XE.
- take advantage of the hardware-level SIMD support, using Intel Parallel Composer 2011 or Intel C++ Composer XE:
  - Build your application at -O2 to enable auto-vectorization. This option is available for both Intel® and non-Intel microprocessors, but it may result in more optimizations for Intel microprocessors than for non-Intel microprocessors. You may use processor-specific options to take advantage of additional features of your processor, such as /QxSSE2,SSE3,SSE4.1 or /QaxSSE2,SSE3,SSE4.1 (-xSSE.. or -axSSE for Linux) or /arch:SSE2,SSE3,SSE4.1 (-msse2,sse3,sse4.1 for Linux). For more details see /en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations.
  - use /Qvec-report[n] (-vec-report[n] for Linux) to check whether the loops are vectorized
  - apply array notation or elemental functions to hotspot loops and functions where auto-vectorization didn't work
- add the Intel Cilk Plus keywords to parallelize the application, using Intel Parallel Composer 2011 or Intel C++ Composer XE.
- test and debug with various tools: Intel Parallel Debugger Extension to Visual Studio or Intel Parallel Inspector 2011 or XE on Windows, and Intel Debugger on Linux*.
- Note: the Intel Cilk Plus feature is supported by Intel C++ Composer XE and Intel Parallel Composer 2011 at the time of this writing. To keep the source code compilable by other compilers, the following methods can be used:
  - include the following in the header file where the Intel Cilk Plus keywords are used:
    #ifdef __cilk
    #include <cilk/cilk.h>
    #else
    #define cilk_spawn
    #define cilk_sync
    #define cilk_for for
    #endif
  - use "#ifdef __cilk" where the array notation or elemental functions are used:
    #ifdef __cilk
    c[:] = foo_calc(a[:], b[:]);
    #else
    for (int i = 0; i < ARRSIZE; i++)
        c[i] = foo_calc(a[i], b[i]);
    #endif
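As a minimal sketch of how the two guards fit together in one source file, the following should build both with a Cilk Plus compiler and with a compiler that does not define __cilk. The helper name calc_all and the value of ARRSIZE are illustrative assumptions, and the kernel is written as a plain scalar function so that no compiler-specific attribute is needed.

```cpp
// Portable use of Intel Cilk Plus keywords and array notation.
// Assumption: __cilk is defined only when Cilk Plus is enabled.
#ifdef __cilk
#include <cilk/cilk.h>
#else
// Serial fallbacks: the keywords expand to plain C++.
#define cilk_spawn
#define cilk_sync
#define cilk_for for
#endif

const int ARRSIZE = 8;  // illustrative size, not from the article

// Scalar kernel with the same logic as foo_calc in this article.
double foo_calc(double a, double b)
{
    return (a > b) ? a : a + b;
}

// Hypothetical helper: computes c[i] = foo_calc(a[i], b[i]) for all i.
void calc_all(const double *a, const double *b, double *c)
{
#ifdef __cilk
    // Pointers carry no size, so the section needs an explicit
    // lower bound and length.
    c[0:ARRSIZE] = foo_calc(a[0:ARRSIZE], b[0:ARRSIZE]);
#else
    for (int i = 0; i < ARRSIZE; i++)
        c[i] = foo_calc(a[i], b[i]);
#endif
}
```

Keeping both branches next to each other makes it easier to confirm that the serial fallback and the array-notation version compute the same result.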
- For adding SIMD vectorization to the program:
  - Array Notation - a language extension for data parallelism on whole arrays or sections of arrays and the operations on them. It maps parallel constructs to the underlying SIMD hardware.
    - single dimension:
      A[<lower bound>:<length>:<stride>]; <stride> is optional and has a default value of 1.
      A[:] - refers to the entire array
      A[10:5] - refers to A[10], A[11], A[12], A[13], A[14]
      A[1:4:2] - refers to A[1], A[3], A[5], A[7]
    - multiple dimensions:
      A[<lower bound>:<length>:<stride>][<lower bound>:<length>:<stride>]
      A[:][:] - refers to the entire 2-dimensional array
  - Elemental Functions - data parallelism of whole functions or operations, which can then be used with array notation.
    __declspec (vector [clauses]) return_type function_name (arguments)
    Example of array notation and elemental function used together:
    __declspec (vector) double foo_calc(double a, double b)
    {
        if (a > b)
            return a;
        else
            return a+b;
    }
    void foo()
    {
        double a[ARRSIZE], b[ARRSIZE];
        double c[ARRSIZE];
        ..........
        c[:] = foo_calc(a[:], b[:]);
    }
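Because the second field of a section is a length rather than an upper bound, it can help to see which elements a section selects. The sketch below is plain C++, not Cilk Plus, and spells out the scalar meaning of A[<lower bound>:<length>:<stride>]. The helper names section_indices and copy_section are hypothetical, and the sketch assumes non-negative strides for simplicity.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices selected by A[lower:length:stride].
// "length" is the number of elements, not the last index.
std::vector<std::size_t> section_indices(std::size_t lower,
                                         std::size_t length,
                                         std::size_t stride = 1)
{
    std::vector<std::size_t> idx;
    idx.reserve(length);
    for (std::size_t k = 0; k < length; ++k)
        idx.push_back(lower + k * stride);
    return idx;
}

// Scalar equivalent of: dst[0:length] = src[lower:length:stride];
void copy_section(const double *src, std::size_t lower,
                  std::size_t length, std::size_t stride, double *dst)
{
    for (std::size_t k = 0; k < length; ++k)
        dst[k] = src[lower + k * stride];
}
```

For instance, section_indices(1, 4, 2) yields the indices 1, 3, 5, 7, matching the A[1:4:2] case listed above.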
- For adding data or task parallelism to the program:
  - New C/C++ language keywords:
    - cilk_for
    - cilk_spawn
    - cilk_sync
  - The following hyperobjects (reducers) are used with the above keywords for creating multi-threaded code:
    - reducer_min<Type>
    - reducer_max<Type>
    - reducer_opadd<Type>
    - reducer_opand<Type>
    - reducer_opor<Type>
    - reducer_opxor<Type>
    - reducer_basic_string<Elem, Traits, Alloc>
    - reducer_list_append<Type, Allocator>
    - reducer_list_prepend<Type, Allocator>
    - class reducer_ostream
    Note: It is possible to create a customized reducer to suit specific needs. Use the provided templates and class as examples.
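To illustrate why a reducer is needed with cilk_for, here is a rough sketch of a parallel sum using reducer_opadd. The helper name sum_array is hypothetical, and the sketch assumes the reducer header is available as <cilk/reducer_opadd.h> when __cilk is defined. Without Cilk Plus, it falls back to an ordinary serial loop.

```cpp
#ifdef __cilk
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#else
#define cilk_for for   // serial fallback
#endif

// Hypothetical helper: returns the sum of data[0..n-1].
long sum_array(const int *data, int n)
{
#ifdef __cilk
    // A plain "long total" updated from cilk_for would be a data race.
    // The reducer gives each worker its own view and merges the views.
    cilk::reducer_opadd<long> total;
    cilk_for (int i = 0; i < n; i++)
        total += data[i];
    return total.get_value();
#else
    long total = 0;
    cilk_for (int i = 0; i < n; i++)   // expands to a plain for loop
        total += data[i];
    return total;
#endif
}
```

The other reducers in the list follow the same pattern: declare the reducer outside the parallel region, update it inside, and read the result after the parallel work has finished.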
Example of using an elemental function and cilk_for together:
__declspec (vector) double foo_calc(double a, double b)
{
    if (a > b)
        return a;
    else
        return a+b;
}
void bar()
{
    double a[ARRSIZE], b[ARRSIZE];
    double c[ARRSIZE];
    ..........
    cilk_for (int i = 0; i < ARRSIZE; i++)
        c[i] = foo_calc(a[i], b[i]);
}
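The example above uses only cilk_for. As a rough sketch of the other two keywords, cilk_spawn and cilk_sync, the recursive function below spawns one branch and synchronizes before combining results. The function fib is an illustrative choice rather than part of the original example, and the empty macro definitions are the serial fallback for compilers that do not define __cilk.

```cpp
#ifdef __cilk
#include <cilk/cilk.h>
#else
// Serial fallback: the keywords disappear and the code runs in order.
#define cilk_spawn
#define cilk_sync
#endif

// Illustrative task-parallel recursion (Fibonacci numbers).
long fib(int n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);  // may run in parallel with the next call
    long y = fib(n - 2);             // executed by the current strand
    cilk_sync;                       // wait until the spawned call has finished
    return x + y;                    // safe to read x only after the sync
}
```

Reading x before cilk_sync would be a race in the parallel build, which is why the sync precedes the addition.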

