Parallel Building Blocks - Have Cilk Vectorize your own Scalar Functions

In the previous blog, I explained two mini-kernels, the scatter and gather, which can be written up quickly and still have the benefits of compiler vectorization with Array Notations. There’s also a variety of run-of-the-mill functions, primarily in the “raw plug and chug math” category that you can look up in the Compiler User’s Guide. Here’s a few examples

a[:] = sin(b[:]);
a[:] = pow(b[:], c); which is the equivalent of b[:]**c

That’s all fine and good, but what are your options for some fast vectorization if the function you need to use doesn’t really fall within a commonly-used parallel pattern or just isn’t in the User’s Guide? You can teach the compiler to convert your scalar function to vector when you use Array Notation arguments instead of your conventional C++ data structures.

You just need to instruct the compiler to convert your function to vector, when called with array notation arguments, with __declspec(vector).

Let’s take this step by step. There’s a lot of “implicit magic” here.

First, you have your plan C scalar function declared with __declspec

// Plain C scalar function declared with __declspec(vector)

Seem’s cryptic? Don’t worry, it will make sense in a second.

Now, let’s declare our *scalar* function that we’re going to allow for vectorized vector inputs.

float saxpy (float a, float x, float y) {
// This is now saying that a is scalar – do NOT vectorize it . But now x and y are converted to vectors

return (a * x + y);

You can now call the saxpy scalar function with array sections

Z[:] = saxpy(A, X[:], Y[:]);

What’s happening behind the scenes? The compiler *automatically* maps the function across multiple array elements (in this example, the function becomes “a * x[:] + y[:]”). Why? Because you told it to with __declspec()! The __declspec(vector) syntax is used so that the compiler will generate a vectorized version of the function saxpy().

If you don’t use it, the compiler will generate a loop that will call saxpy on every element in Z[],X[],Y[]. NOT GOOD.

Here is what happens:

Without __declspec()

for (int i = 0; i < N; i++) {
Z[i] = saxpy(A,X[i],Y[i]);

With __declspec(), the compiler will generate something that looks like this. Lots of room for error for a beginner. Best let the compiler do it for you and forget about it.

__m128 v_saxpy(__m128 *, __m128 *);
for (int i = 0; i < N; i += 4) {
*((__m128 *)(&Z[i])) = v_saxpy(A,(__m128 *)(&X[i]), (__m128 *)(&Y[i]));

What is happening there? A v_saxpy() function will be generated that takes __m128 vectors (or __m256 on AVX). It can compute on 4 array elements at a time (8 on AVX). The compiler will break the arrays into __m128 sized pieces and pass them to this function in a loop.

The simple act of adding that one line can eliminate a lot of code writing and really take advantage of those vector units.

For more complete information about compiler optimizations, see our Optimization Notice.