Parallel Building Blocks - Vectorize Inside the Cilk 'for' with Array Notations

In the previous blog, I gave a very fast crash course on the Cilk ‘for’. There are a few other ways you can manipulate that for loop, but I will go over those another time. The key is to understand what is possible first, get the basics running and parallelized, and save the more expert features for later. I think people new to parallel programming get too caught up in speedups and start using the full gamut of Cilk features right away. I really feel the best course of action with any of the PBB models is the following:

1. Get it to scale across some cores with threading. Use the Parallel Studio tools to understand how well you are scaling across cores for your data size.
2. See how you can vectorize. I will discuss a very convenient option for vectorization with the Array Notations feature here.

Even for parallel programming veterans, vectorization can be a scary black art.

DEFINITION: Vectorization takes code that performs operations on individual operands and utilizes Intel® Streaming SIMD Extensions to perform those operations on multiple sets of data in parallel.

This means you're using one of the possible SIMD instruction sets (right now usually a variant of SSE or AVX for Sandy Bridge) to take advantage of parallelism *inside* each individual core.

If you use only the Cilk 'for', threads are spawned across however many cores are available, but the SIMD registers mostly go unused, apart from whatever vectorization the compiler performs on its own (and that auto-vectorization is something you have to understand and specify yourself; the compiler flags and pragmas are not that hard, and we'll discuss them in another blog). In other words, the vector lanes are not filled. You need to fill them!

Thread parallelism: Take advantage of many cores; divide and conquer.
Vector parallelism: Use CPU instructions that process multiple data elements at once.

You need both to maximize that fancy new processor. Otherwise you're leaving some performance on the table.

What’s the philosophy behind PBB vectorization? Enable all developers to use the vector hardware (SSE instructions) in the CPU easily, without having to resort to intrinsic functions or inline assembly. A convenient consequence of letting the compiler generate the code is portability: you don’t have to rewrite your SIMD code every time a new processor or instruction set comes out.

Conveniently, Cilk™ Plus consists of two components: the Cilk component for threading and the Array Notations component for vectorization. As I explained in the previous blog, these are actual language extensions, not templates, and no new data types are added. The compiler makes its decisions at compile time based on the code you wrote, which makes it very easy to convert your existing code.

So what do these Array Notations look like? It’s a colon/bracket style syntax and there are lots of things you can do within the language to manipulate the colon/brackets. Behind the scenes the compiler generates tailored SIMD code based on your expression.

Consider the basic slice expression A[0:N]. Here,
A is a C/C++ array or pointer variable.
0 is the starting index (lower bound).
N is the length (the number of elements), not an upper bound.
In that respect it is like memcpy(), which also takes a starting address and a length rather than two endpoints.
So A[0:N] means that we want to do a parallel operation on A[0], A[1], A[2], ..., A[N-1].
We do not need special vector or matrix types for parallel operations.

So you express your Cilk ‘for’ loop, and inside the loop you express your arrays/vectors/matrices/cubes/hypercubes/whatever_you_want_to_call_them with this new syntax and define the operations you would like performed on them.

In the next blog entries, I will go into more details on the syntax with code samples.

For more complete information about compiler optimizations, see our Optimization Notice.