Using Intel® Cilk™ Plus to Achieve Data and Thread Parallelism: A Case Study for Visual Computing

Abstract

Multicore systems with advanced vector instructions are moving into many domains: from servers, workstations, and desktops to laptops and mobile devices. There is potential for improving the performance of user applications by exploiting both data and thread parallelism. Intel® Cilk™ Plus provides simple extensions to the C and C++ languages that make it easy to express both data and thread parallelism in the source code. The intent is to provide a natural way of expressing an algorithm that yields reliable and predictable parallelization while keeping the source code readable, maintainable, and scalable to future hardware. Furthermore, the extensions work seamlessly with existing C and C++ data structures and code.

In this paper, we demonstrate the use of Intel Cilk Plus to express data and thread parallelism to achieve significant performance improvements for a visual computing benchmark: AOBench (Ambient Occlusion Benchmark), compared to its original sequential C form.

1.     Introduction

Intel provides new programming models for parallel programming as part of the Intel Parallel Building Blocks (Intel PBB). At the same time, Intel continues to support and invest in standards such as OpenMP* and MPI. One of the components of Intel PBB is Intel Cilk Plus, a set of language extensions that provides both data- and task-parallel constructs. Intel Cilk Plus is available in Intel Parallel Composer 2011 and Intel C++ Composer XE 2011[3] and later, which contain the Intel C/C++ Compiler. First, we give an overview of the parallel language extensions available in Intel Cilk Plus, then an introduction to the Ambient Occlusion Benchmark (AOBench), and finally a description of the porting effort to Intel Cilk Plus and the results achieved.

2.     Intel Cilk Plus

Intel Cilk Plus provides language extensions to explicitly express task- and data-level parallelism in the source code. Simply adding the new keywords to existing code makes it possible to achieve task parallelism. For data parallelism, or vectorization, an algorithm can be written using the array notation extensions of Intel Cilk Plus to ensure maintainable and predictable vectorization. Furthermore, a scalar function can be declared as an elemental function (vector function), which instructs the compiler to create a vector version of the function for calls in a vector context.

2.1 Task parallelism

The Intel Cilk Plus language provides two constructs for task level parallelism. The first one provides two new keywords: _Cilk_spawn and _Cilk_sync. Figure 1 compares the recursive implementation of the well-known Fibonacci code in serial C against a parallel implementation, which was derived from the serial code by adding the new keywords: 


Figure 1. On the left is a serial implementation of Fibonacci. On the right is the same code made parallel using Intel Cilk Plus keywords. Note that no original code was changed. Only the Intel Cilk Plus keywords were inserted.

Changing the serial program to a parallel program requires only insertion of the minimally intrusive keywords as appropriate.  It does not require changes to the underlying program.  The insertion also does not impact the readability and maintainability of the program.  The parallel code is easy to see and the program logic is as evident as the original serial code.  The Intel Cilk framework dynamically load-balances the potentially large number of tasks caused by the recursion. This observation highlights one of the main design principles of Intel Cilk Plus.  It is designed for parallelizing existing, serial C/C++ code.  Thus, a benefit of Intel Cilk Plus is the ability to easily add parallelism to code by making minimal changes while providing strong guarantees for serial equivalence.
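The Cilk keywords themselves require the Intel compiler, but their semantics can be approximated in standard C++ with std::async. The sketch below mirrors the Figure 1 structure; note that std::async tasks are much heavier than Cilk's work-stealing spawns, so this illustrates the semantics, not the performance:

```cpp
#include <future>

// Serial version, as in the left column of Figure 1.
long fib_serial(long n) {
    if (n < 2) return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}

// Rough std::async analogy of the Cilk version:
//   x = _Cilk_spawn fib(n-1);  y = fib(n-2);  _Cilk_sync;  return x + y;
long fib_parallel(long n) {
    if (n < 2) return n;
    auto x = std::async(std::launch::async, fib_parallel, n - 1); // ~_Cilk_spawn
    long y = fib_parallel(n - 2);   // parent continues while the child runs
    return x.get() + y;             // ~_Cilk_sync: wait for the child's result
}
```

Both versions compute the same values; only the scheduling differs.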

2.1.1 The _Cilk_spawn Keyword

The _Cilk_spawn keyword applies to a function call, such as fib(n-1) in Figure 1 above.  It means that the caller can continue to execute while the callee is executing.  Except for this concurrency, the situation is identical to a normal function call. The _Cilk_sync keyword means that the calling function, the parent, has to wait until all child tasks have completed and returned control back to the parent.  Intel Cilk Plus does not introduce any new parallel region concepts but reuses the concept of a function as the parallel region seen from the parent and as the unit which can be spawned. Arguments to the spawned function must be evaluated before the parent task detaches the child, similar to how the C/C++ language requires the arguments to be evaluated before the call. In the example shown in Figure 2 the function calls g() and h() will be evaluated before the call to f() is spawned. The calls to g() and h() will therefore execute serially before the call to f().
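The argument-evaluation rule can be demonstrated with a standard-C++ approximation (the functions f, g, and h below are hypothetical stand-ins for those in Figure 2): std::async, like _Cilk_spawn, evaluates the arguments in the caller before the task is detached.

```cpp
#include <future>
#include <string>

static std::string log_order; // records which argument calls ran, in order

int g() { log_order += "g"; return 1; }
int h() { log_order += "h"; return 2; }
int f(int a, int b) { return a + b; }

int demo() {
    // Analogy of:  _Cilk_spawn f(g(), h());
    // g() and h() run serially in the parent before the child task starts.
    // (C/C++ leaves the order *between* g() and h() unspecified; the
    // guarantee is only that both complete before f() is spawned.)
    auto fut = std::async(std::launch::async, f, g(), h());
    return fut.get();
}
```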

 

2.1.2 The _Cilk_sync Keyword

 

The _Cilk_sync keyword in the example of the fib() function (Figure 1) is necessary to guarantee that the values in the variables x and y are updated by the asynchronous execution of fib(n-1) and fib(n-2) before they are used.  In addition to the explicit form of the keyword, there is an implicit _Cilk_sync at the end of every spawning function.  The effect of the implicit _Cilk_sync is that the spawning function waits for its child tasks to complete before it returns. Thus a called function never leaves dangling tasks after it returns.

2.1.3 The _Cilk_for Keyword

The second tasking construct provides the keyword _Cilk_for.  It enables programmers to parallelize C/C++ for loops.  The key to using _Cilk_for is that the number of iterations in the for loop can be computed before the loop is executed. Consider the statement in Figure 3.



Figure 3: The _Cilk_for keyword

The restrictions on the _Cilk_for construct include:

 

  • The index variable (i in this example), which appears in all three expressions of the loop, must be initialized in the loop's initialization expression
  • The index variable must be compared to a value that does not change within the loop
  • The index variable must be incremented by a value that does not change inside the loop
  • The index variable cannot be changed within the body of the loop

The compiler will detect violations and enforce these restrictions in most cases; the exceptions are when the comparison value or the index variable is updated through pointers.
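A loop satisfying these restrictions looks like the sketch below (a plain-C++ illustration; with the Intel compiler the for would be spelled _Cilk_for):

```cpp
// i is initialized in the loop header, compared against a loop-invariant
// limit, stepped by a loop-invariant increment, and never modified in the
// body -- so the iteration count is computable before the loop runs.
void scale(float* a, int n, float s) {
    for (int i = 0; i < n; ++i)   /* _Cilk_for (int i = 0; i < n; ++i) */
        a[i] *= s;
    // Not allowed under _Cilk_for (index modified in the body):
    //   for (int i = 0; i < n; ++i) { if (a[i] < 0) ++i; ... }
}
```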

2.2 Data-Parallelism with Intel Cilk Plus Array Notation

A new syntax provided with Intel Cilk Plus allows simple operations on whole arrays. Array notation is intended to let users directly express data-parallel array (vector) operations in their programs. Most C/C++ implementations do not provide a way to express data-parallel operations on ordinary declared C/C++ arrays, e.g. float a[N];. Instead, the programmer has to write a loop and express the operation in terms of individual elements of the arrays, thus imposing a serial order on the operations. The downside of writing loops with serial ordering is that, in order to convert the serial loop to a vector loop, the compiler has to prove that the vector processing would be equivalent to the scalar processing implied by the loop and mandated by the language. For example, when the program uses pointers to reference the array, the compiler may not be able to prove that two pointers point to disjoint areas of memory without additional attributes or keywords. Nor will it be obvious to maintainers that the iteration order is irrelevant.

Array notation such as a[:] = b[:] + c[:] indicates to the compiler that the elements of the arrays b and c are to be added and that the additions carry no ordering requirement. The semantics of array notation also state that if there is overlap, but not complete overlap, between the right-hand side and the left-hand side, the behavior is undefined. These semantics make it possible for the compiler to generate vector code without further analysis and without pragmas, attributes, or keywords such as restrict. Figure 4 shows an example of the Double-precision real Alpha X Plus Y (daxpy) algorithm expressed using array notation, where x[:] and y[:] denote double-precision arrays and a is a double-precision scalar[1]. Note the complete overlap for y[:], allowing the use of the += operator. Scalar operands used in array notation expressions are applied to every element of the array section.



Figure 4. Daxpy/saxpy algorithm expressed using array notation.
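For reference, the scalar loop that the Figure 4 array notation replaces looks like this (the array-notation form is shown in the comment):

```cpp
// Scalar daxpy.  The array-notation form collapses the loop to a single
// statement:   y[0:n] = a * x[0:n] + y[0:n];   (or  y[:] += a * x[:]
// for whole arrays, exploiting the complete overlap of y on both sides).
void daxpy(int n, double a, const double* x, double* y) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```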

 

2.2.1 Array Section Operator

The Intel Cilk Plus language extensions define an array section operator whose syntax is array_base[begin:length:stride] where

  • array_base is any array expression allowed in C/C++, including arrays, pointers and C99 variable length arrays
  • begin is the index of the first array element to which the section applies
  • length provides the number of array elements in the section
  • stride is an increment between elements. The stride is optional.  If omitted, the default value 1 is used.

The shorthand [:] for the array section operator denotes all the elements of the array.

The section operator can be applied to multi-dimensional arrays and can be intermixed with ordinary array subscripts, as in a[0:5][3][0:8:2]. In this example, a has to be a three-dimensional array or a pointer to such an array. The rank of the expression is the number of dimensions in which the section operator, rather than a subscript, is used. In the example above, the rank is 2; in other words, this section is a two-dimensional slice through a three-dimensional array.
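The elements that a[0:5][3][0:8:2] selects can be enumerated with plain loops (the array dimensions below are a hypothetical choice, large enough to hold the section):

```cpp
// Elements selected by the rank-2 section a[0:5][3][0:8:2] of a
// hypothetical array  int a[5][4][16]:
//   first dim:  begin 0, length 5, stride 1  ->  i = 0..4
//   second dim: plain subscript 3 (contributes no rank)
//   third dim:  begin 0, length 8, stride 2  ->  k = 0, 2, ..., 14
// i.e. a 5 x 8 two-dimensional slice, 40 elements in total.
int sum_section(int a[5][4][16]) {
    int sum = 0;
    for (int i = 0; i < 5; ++i)
        for (int e = 0; e < 8; ++e)
            sum += a[i][3][0 + e * 2];
    return sum;
}
```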

2.2.2 Intrinsic functions

Intrinsic functions[2] are provided for some commonly used operations, such as summation of the elements of an array. A dot product operation might be expressed as


Figure 5.  A dot product operation expression using _sec_reduce_add.
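The reduction has a direct standard-C++ counterpart; the sketch below computes the same result that the Cilk form _sec_reduce_add(a[0:n] * b[0:n]) would produce (elementwise multiply, then sum):

```cpp
#include <numeric>

// Dot product: the serial equivalent of the Figure 5 reduction.
// _sec_reduce_add imposes no ordering on the additions; the std library
// form below is simply its sequential counterpart.
double dot(const double* a, const double* b, int n) {
    return std::inner_product(a, a + n, b, 0.0);
}
```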

 

Array notation is a natural way for a programmer to express their algorithms at the level of array operations. Array notation also provides maintainable and readable code that may scale with future hardware, as the source code is written in an architecture-independent way.

2.2.3 Intel Cilk Plus Elemental Function

A third programming construct allows the programmer to write a scalar function in standard C/C++ and declare it as an elemental function.  When using an elemental function, the programmer's typical intent is to deploy it on many elements of arrays without prescribing an order of operations among the array elements.  The simple example in Figure 6 shows the use of elemental functions to perform element-wise addition of arrays.


Figure 6.  Example of the use of elemental functions to perform element-wise addition of arrays.
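The pattern looks like the following sketch (the signature of v_add is assumed for illustration; the __declspec(vector) annotation, shown in a comment because it requires the Intel compiler, is what marks the function as elemental):

```cpp
// Scalar elemental function.  With the Intel compiler the declaration
// would carry __declspec(vector), causing a SIMD version to be generated
// in addition to this scalar one.
/* __declspec(vector) */ float v_add(float a, float b) {
    return a + b;
}

// Invoked across arrays; in a vector context the compiler substitutes the
// SIMD version, processing several elements per call.  Replacing 'for'
// with '_Cilk_for' would add thread parallelism on top.
void add_arrays(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = v_add(a[i], b[i]);
}
```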

The use of __declspec(vector) indicates to the compiler that the function v_add is intended to be used as an elemental function. The Intel C++ compiler will generate one version of the function to be called in a scalar context and another to be called in a vector context, meaning that the function body operates on vectors instead of scalars and the arguments are vectors as well. The benefit may be enhanced performance without requiring the programmer to write a vector version of the function using intrinsic calls or assembly instructions.

Instead of calling the function in a plain for loop, the programmer may choose _Cilk_for as the construct from which to call the function. In that case, the compiler calls the vector version and processes multiple vectors in parallel using threads. With this use of the language constructs, the programmer may get the combined benefit of both multicore and vector-level parallelism.

 

3.     Benchmark Description

Ambient Occlusion Benchmark (AOBench) is a popular visual computing benchmark. It renders ambient occlusion, a shading method that darkens surface points in proportion to how occluded they are, which helps add realism to the local reflection models used in 3D graphics. AOBench was originally coded in the Processing language [http://processing.org/]. Later, it was ported to C, Java, Objective-C, CUDA, OpenCL, Go, and many other languages to support various platforms, from high-end CPUs to GPUs and smartphones [http://code.google.com/p/aobench].

AOBench is focused on benchmarking real-world floating-point performance across languages and platforms. It consists of only 400 lines of code and requires only standard math functions. Because it is easy to understand, the community was encouraged to port it to various new languages and platforms. Porting AOBench to Intel Cilk Plus therefore allowed us to compare the performance of Intel Cilk Plus against the other available implementations. Computationally, AOBench may be described as a ray-tracing kernel applied to a fixed test scene of 3 spheres and 1 plane. For each ray intersecting an object, an ambient occlusion approximation is computed by casting random rays back into the scene.

The main focus of this paper is to improve compute resource utilization in AOBench by addressing the following characteristics of its code:

  1. array-of-structures data layout
  2. nested computational loops
  3. data-dependent branches
  4. use of short C functions
  5. use of built-in math functions (sin, cos, sqrt, rand)

AOBench's data structures are plain struct types. A scene consisting of 3 spheres and a plane is defined by these structs. As output, AOBench generates a grayscale image of the intersecting spheres and plane.

The pseudo code structure of AOBench is as follows:
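(The listing below is a sketch reconstructed from the surrounding description; function names follow the AOBench source, while details such as the occlusion formula are illustrative.)

```
render():
    for each pixel (x, y) in the image:
        cast a primary ray into the scene (3 spheres + 1 plane)
        find the nearest hit via ray_sphere_intersect() / ray_plane_intersect()
        if the ray hits an object:
            pixel = ambient_occlusion(hit point, surface normal)

ambient_occlusion(point, normal):
    occlusion = 0
    for ntheta x nphi sample directions:        # small nested loop
        theta, phi = random numbers             # random ray scattering
        build a scattered ray from (theta, phi)
        test the ray against every scene object
        if the ray is blocked: occlusion += 1
    return shading derived from occlusion / (ntheta * nphi)
```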





Of note, all the functions listed call other small vector helper functions such as dot product, cross product, and normalize. They also use standard math functions such as square root, cosine, and sine. The most computationally heavy function is ambient_occlusion(). This function has its own small nested loop, in which it calls a random number generator to create random ray scattering. Each of the created rays results in calls to a function that calculates intersections with scene objects. In total, there are 12 functions; they can be nested as deep as 6 calls during execution.

4.     Intel Cilk Plus Modifications for Parallelism

In this section we describe the basic steps of mapping AOBench's serial algorithm to a parallel implementation that exploits both data-level parallelism (DLP) and task-level parallelism (TLP). We first show how Intel Cilk Plus array notation can map the DLP within the serial algorithm to the VPU (vector processing unit, e.g. Intel SSE (Intel Streaming SIMD Extensions)) of a single core, and then show how Intel Cilk Plus's threading mechanism maps the algorithm's higher-level TLP to multiple cores.

The first step in exploiting the VPU resources is finding Single-Instruction-Multiple-Data (SIMD) operations in the algorithm. Like many compute-intensive applications, AOBench uses loops to traverse large data sets of floating-point data. Since the same operations are performed in each iteration and the operands are (or can be made) independent, we consider these loops rich sources of DLP well suited to the VPU. Once found, we express these operations in array notation, which clearly identifies the SIMD nature of the operations to the compiler. We leave it to the compiler to parse the array notation and strip-mine the large array operations into smaller SIMD operations that map to the native width of the VPU. For our benchmark tests, we used an Intel® Core™ i7 CPU, which provides 4-element-wide vector instructions on 32-bit floating-point data elements.

To further take advantage of multiple cores (each with a VPU), we denote thread-level parallelism in AOBench using Intel Cilk Plus's threading mechanism. Figure 7 shows the overall transformation of AOBench, exploiting DLP (data-level parallelism) and TLP (task-level parallelism). The remainder of this section provides more detail on the process of porting AOBench to Intel Cilk Plus.



Figure 7: Transformation of AOBench using Cilk Plus exploiting first DLP and then TLP

While examining AOBench for conversion to Intel Cilk Plus, we found the following steps helpful:

  1. Find the candidate functions or blocks of code consuming most of the execution time.
  2. Detect a candidate loop that does not have a loop-carried dependency (or can be made so).
  3. Select the safe instructions, where the transformation will maintain existing data dependencies.
  4. Convert the data layout of the variable defined by the safe instruction to array format if it is not already, and replace all of its uses.
  5. Expand the single data reference in the safe instruction to a multiple-data reference using array notation.
  6. Express task-level parallelism using the Intel Cilk Plus threading mechanism to execute independent iterations of the array-notation processing in parallel.




Algorithm 1: Array notation generation for functions

From our hotspot analysis (which can be done using a performance analyzer such as Intel Parallel Amplifier[3], or by setting probes in the code), we find ambient_occlusion() to be the candidate function, because it consumes 90% of the execution time and contains an internal candidate loop. The first eight instructions of the candidate loop are the definitions of eight scalar variables, which are used only inside the loop. The variables theta and phi are computed using the standard math library, and the remaining variables are computed from them. Since these eight instructions carry no loop dependencies (they do not depend on the results of previous or future loop iterations), they are selected for safe transformation. Since the data processed by these instructions are not in array format, we need to create a local array for each scalar variable defined by these instructions in order to apply array notation. Following Algorithm 1, applied to the serial code snippet of Figure 8, we declare eight array variables outside the loop and replace each scalar variable's definitions and uses with the corresponding array element, indexed by the loop iteration.

Here, the array size is determined by the total number of loop iterations. To map a single-data operation to a multiple-data operation, we express the data parallelism using array notation across the entire array. Since the compiler handles mapping the entire array to the VPU, creating the data chunks that are fed to the VPU's vector instructions, the instruction becomes loop-independent and we can simply move it out of the loop.

Figure 9 shows the equivalent array notation code. Since theta and phi are calculated from random numbers, we buffer the random values in rand1 and rand2 prior to their use so that the results match the serial version.


Figure 8: Serial code snippet of ambient_occlusion()

Moving further, we process the structure variables and the six instructions that assign different values to the structure's scalar members in different iterations. Just as we converted scalar variables to array variables above to support array notation, we need to convert the structure's internal scalar member variables to arrays. In doing so, the data structure is changed from a structure of scalar variables to a structure of arrays (SOA). Converting to SOA form allows us to apply Intel Cilk Plus array notation, effectively replacing the single instructions that define single member variables with instructions defining multiple member variables at a time. Of course, to make the transformation correct, we must replace all uses of the structure members with the proper elements of the corresponding array members of the new structure. Currently, this conversion to SOA must be done manually by the programmer, but future versions of Intel Cilk Plus are intended to handle it automatically.


Figure 9: Equivalent Array notation for scalar variables of figure 8
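The scalar-to-array pattern described above can be sketched in standard C++ as follows. The formulas and the tiny random generator are illustrative stand-ins, not the exact AOBench code; the point is the shape of the transformation, and that buffering the random inputs first makes the array form reproduce the serial results:

```cpp
#include <cmath>

const int NSAMPLES = 8;
const double PI = 3.14159265358979323846;

// Tiny deterministic LCG standing in for AOBench's RNG (illustrative).
static double next_rand(unsigned& s) {
    s = s * 1664525u + 1013904223u;
    return s / 4294967296.0;
}

// Serial form: scalars defined and consumed within one iteration.
void scatter_serial(double* x, double* y, unsigned seed) {
    for (int i = 0; i < NSAMPLES; ++i) {
        double theta = std::sqrt(next_rand(seed));
        double phi = 2.0 * PI * next_rand(seed);
        x[i] = std::cos(phi) * theta;
        y[i] = std::sin(phi) * theta;
    }
}

// Array form: buffer the random inputs first, then turn each scalar
// definition into a whole-array operation.  In array notation each loop
// below collapses to one statement, e.g.  theta[:] = sqrt(rand1[:]);
void scatter_array(double* x, double* y, unsigned seed) {
    double rand1[NSAMPLES], rand2[NSAMPLES], theta[NSAMPLES], phi[NSAMPLES];
    for (int i = 0; i < NSAMPLES; ++i) {
        rand1[i] = next_rand(seed);
        rand2[i] = next_rand(seed);
    }
    for (int i = 0; i < NSAMPLES; ++i) theta[i] = std::sqrt(rand1[i]);
    for (int i = 0; i < NSAMPLES; ++i) phi[i] = 2.0 * PI * rand2[i];
    for (int i = 0; i < NSAMPLES; ++i) x[i] = std::cos(phi[i]) * theta[i];
    for (int i = 0; i < NSAMPLES; ++i) y[i] = std::sin(phi[i]) * theta[i];
}
```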


Figure 10: Array notation transformation for struct variables
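The shape of the SOA conversion is sketched below (type and member names are hypothetical, chosen to resemble a 3-D vector type):

```cpp
// AoS -> SoA reshaping, as described above (names hypothetical).
const int N = 4;

struct Vec  { float x, y, z; };          // original: structure of scalars
struct VecN { float x[N], y[N], z[N]; }; // converted: structure of arrays

// After conversion, a single array-notation statement can define a whole
// member at once, e.g.:  dir.x[:] = cos(phi[:]) * theta[:];
// Here, the plain-loop equivalent of gathering N Vec values into a VecN:
void to_soa(const Vec* in, VecN* out) {
    for (int i = 0; i < N; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```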

At this point we have applied data parallelism to most of the ambient occlusion computation. The next instruction processed in the loop is a call to ray_sphere_intersect(). A function called inside the loop can also provide opportunities for parallelization through array notation, provided all of the instructions inside the function are safe instructions. The instructions within the function can be transformed by following almost the same steps used on the loop above: we convert the function arguments to array references if they are not already, introduce array section notation inside the function body so that the function processes multiple array elements in parallel, and adjust the call site accordingly. Figure 11 shows the partial transformation of ray_sphere_intersect() and its call site. In this case, the function arguments are structures of scalar variables, which were already converted to SOA.

So far we have converted all of the instructions in the innermost loop of ambient_occlusion() to array notation, except for one accumulation instruction. We leave it as is, since it carries a data dependency from the previous iteration. Measuring the runtime, we find the application is 2.49x faster than serial execution (Table 1). Here the fun part begins: if we inline some of the functions (which lets us remove some redundant instructions), we achieve a 3.21x speedup (Table 1). For example, instead of computing rs in ray_sphere_intersect() nphi times, we only need to compute it once, since ray->org and sphere->center do not change in the loop.

So far, only data parallelism is exploited, utilizing a single core's vector instructions. Since we would like to fully utilize the Core i7's 4-core, 8-thread hardware, running 4 VPUs in parallel, we turn our focus to task-level parallelism. Exploiting task-level parallelism with Intel Cilk Plus is quite straightforward. To express it, we look for candidate loops, since AOBench's heavy computations are done in loops. In general, the coarser the granularity of each task, the more performance can be achieved through task parallelism. With that in mind, we select the outermost loop of render() for spawning tasks. Of Intel Cilk Plus's threading mechanisms, we choose the _Cilk_for keyword, since the loop violates none of the restrictions described in section 2, and we achieve a 6.7x speedup using the threading mechanism alone. Figure 12 shows the _Cilk_for code transformation.



Figure 12: Intel Cilk Plus transformation for supporting task-level-parallelism
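The transformation can be approximated in standard C++ as below: the outer scanline loop of render() becomes _Cilk_for (int y = 0; y < h; ++y) under the Intel compiler, sketched here with std::thread over row ranges. render_row is a hypothetical stand-in for the per-scanline work, and the static round-robin split is a simplification of Cilk's dynamic work stealing:

```cpp
#include <thread>
#include <vector>

// Hypothetical per-scanline work (placeholder shading).
void render_row(std::vector<float>& img, int w, int y) {
    for (int x = 0; x < w; ++x)
        img[y * w + x] = static_cast<float>(x + y);
}

// ~ _Cilk_for (int y = 0; y < h; ++y) render_row(img, w, y);
void render(std::vector<float>& img, int w, int h, int workers = 4) {
    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t)
        pool.emplace_back([&img, w, h, t, workers] {
            for (int y = t; y < h; y += workers)  // rows are independent
                render_row(img, w, y);
        });
    for (auto& th : pool) th.join();
}
```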

5.     Performance Results

In this section we present our experimental evaluation of the Intel Cilk Plus version, built with Intel Parallel Studio XE 2011 SP1, against the serial version. The serial source code of AOBench used for the experiment was taken from the Google Project Hosting website:

http://code.google.com/p/aobench/ .

For the Intel Cilk Plus transformation we used the Intel Parallel Studio XE 2011 SP1 C++ compiler. The system used for this experiment has an Intel® Core™ CPU at 2.80 GHz with 4 GB RAM. Table 1 presents the execution times of AOBench in seconds, together with the speedups from array notation alone, from Intel Cilk Plus task parallelism alone, and finally from the complete Intel Cilk Plus version. On average, we achieve a 16.47x speedup using the data and thread parallelism of Intel Cilk Plus.

 

Version                               Execution time (s)   Speedup
Serial                                1.862                1x
Intel Cilk Plus (array notation)      0.745                2.49x
Array notation with inlining          0.579                3.21x
Intel Cilk Plus (task parallelism)    0.287                6.7x
Complete Intel Cilk Plus              0.159                11.71x
Complete Intel Cilk Plus, inlined     0.113                16.47x

Table 1: Execution times and speedup results for AOBench

6.     Conclusion

In this paper, we presented the design and empirical evaluation of Intel Cilk Plus on the well-known benchmark AOBench. We demonstrated how Intel Cilk Plus eases parallel programming by allowing the programmer to focus on identifying parallel blocks of code while leaving it to the compiler and runtime to manage the complexity of mapping the code onto parallel resources. Our experimental results show that Intel Cilk Plus utilizes the underlying multicore resources efficiently through its work-stealing scheduler, gaining over a 16x speedup.


Notes

1. This example demonstrates double-precision code, but the example can equally well illustrate integer or single-precision (saxpy) code.

2. Intrinsic functions are implemented and specially handled by the compiler. While their usage is similar to that of a user-defined function, you may get the performance benefits of inlined code.

3. Intel Parallel Amplifier details: http://software.intel.com/en-us/articles/intel-parallel-amplifier/. Intel Parallel Composer details: http://software.intel.com/en-us/articles/intel-parallel-composer/



*Other names and brands may be claimed as the property of others.


For more complete information about compiler optimizations, see the Optimization Notice.