Parallel Building Blocks - Slick Array Notations Syntax Inside a Cilk ‘for’

In the previous blog, I explained the rationale behind vectorizing with Array Notations inside of the Cilk ‘for’. There are lots of ways to vectorize inside a for loop once you’ve used the Cilk ‘for’, but using this colon/bracket style syntax has been the easiest way for me to get vectorization without having to think about using AVX128 vs. AVX256 at all. That’s simply much harder to wrap your head around than just changing your data structures to use a different, compiler-friendly style of expressing code:"Hi compiler, please vectorize this stuff inside my for loop. Thanks!"

Today let’s go into the specifics. I really didn’t show much in Blog 4.

The syntax goes like this


A[lower_bound : length : stride]


This is a length sized array section (vector) starting from lower_bound, with elements separated by stride
• Note that all operands can be omitted: A[:] which means all array elements.
• Operators and assignments are applied element-wise in unspecified order
• You can also use a reduction to combine a set of elements into a single value. Reduction is an excellent data parallel programming pattern. Reduction combines array section elements to generate a scalar result.

Here are some examples that will explain what I mean:


a[:] = 5.0; // fill everything inside “a” with 5.0 – NO ‘for’ loop needed
b[2:6] = 6.0; // fill elements 2 thru 7 with 6.0 – again, specifying a range but not explicitly iterating
c[:] = a[:] * b[:]; // element-wise multiplication
d[0:n][0:n] = a[0:n][0:n] + b[0:n][0:n];
// n x n matrix addition – this definitely is something everyone had to do in school with a for loop, now we are abstracting away from that



Notice, that in many cases, you DON’T need the Cilk ‘for’ loop altogether to use these. Typically you would have to iterate through all your elements to do these things. You can use the Cilk ‘for’ if you need to do many Array Notations operations. So you could use the Cilk ‘for’ to do 100 of the operations above. The Cilk ‘for’ would separate each of the individual operations across available cores, and then each of the operations would then run across each of the individual SIMD registers.

An array section is a portion of an array that acts like a vector. It can be the whole array, or just a small part of it. The lower bound and length do not have restrictions regarding alignment, vector size, etc. Operations on array sections will be vectorized by the compiler as much as possible. If the hardware cannot support a vector operation, the fastest scalar code will be generated.

All the array sections in a statement must have the same length. But they can have a different lower bound and different stride.
Stride is 1 by default. A stride of 1 means that we want consecutive array elements.

It is possible to write a statement, A[0:N] = B[0:N:2]. This copies B[0],B[2],B[4],... into A[0],A[1],A[2],...

Now think about how you would do a fast sum. You can do it with a few lines serially, but how about getting it parallelized + vectorized? Quite a few more lines. There’s a powerful parallel construct within Array Notations that takes that compacts all those of code into one.


sum = __sec_reduce_add(a[:]); // sum of all elements

int a[] = {1,2,3,4};
sum = __sec_reduce_add(a[:]); // sum is 10



There are 9 built-in reduction functions that you can use that are already optimized.

add, mul, max, max_ind, min, min_ind, all_zero, all_non_zero, any_nonzero

You can also create your own custom reduction which we’ll go over in a future blog.

Here’s some more things you can do on sections


a[:]++ // increment all a[]
a[0:3:2] = 5; // elements 0,2,4 = 5
a[:][:] + b[0][1] // adds b[0][1] to all elements of a
a[0:4] = b[2:4]; // copy 4 elements in b starting at index 2, to elements in a starting at 0
a[3:2][3:2] + b[5:2][5:2] // 2x2 matrix addition.
// The left corners of the matrices are a[3][3] and b[5][5].



An operator map is a C/C++ operation that is performed in parallel across all the elements in an array section. The syntax of this can be understood easily, as it is like standard C/C++ operations. Please keep in mind that the second element in the array section is length. In an expression such as a[0:3:2], the length is 3. That is why there are 3 elements a[0],a[2],a[4] that are addressed by the array section.
Similarly, a[3:2][3:2] refers to elements a[3][3],a[3][4],a[4][3],a[4][4]. The lower bound is a[3][3]. The length is 2 in both array dimensions.
Type conversion from float to int, etc. is done automatically, as it is in C/C++.

So, after seeing how the basic array notations operations work, you can see that if you’re doing one simple computation, you may not need to use the Cilk ‘for’ altogether. With Cilk, you’re thinking more about what you’d like to happen from a parallel algorithms perspective without specifying how exactly you’d like it to get done.

For more complete information about compiler optimizations, see our Optimization Notice.