Outer Loop Vectorization via Intel Cilk Plus Array Notations

Compiler Methodology for Intel® MIC Architecture

Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations


Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. Sometime, though, the number of iterations in the innermost loop are small. In this case, vectorization is not profitable. However, if an outer loop contains more work, C++ Array Notation can be used to simulate outer loop vectorization. C++ Array Notation is part of Intel® Cilk™ Plus which is a feature of the Intel® C++ Composer XE.


Outer loop vectorization via array notation is demonstrated below.

C++ Array Notation can be used to block loops in such a way that simulates “outer loop vectorization” (vectorization of an outer loop, while retaining the structure of the inner loop). Here is an example, based on the calculation of the Mandelbrot set:

Serial code:

void serial_mandel(
   float c_re[SIZE], float c_im[SIZE],
   unsigned int max_recurrences,
   char output[SIZE])
   for (int i = 0; i < SIZE; i++) {
     unsigned int iter = 0;
     float z_re = c_re[i];
     float z_im = c_im[i];

     // If the loop reaches max_recurrences, c is an element of
     // the Mandelbrot set.
     while (iter < max_recurrences) {
       if (z_re * z_re + z_im * z_im > 4.0) {
         break; // escape from a circle of radius 2

       float new_z_re = c_re[i] + z_re*z_re – z_im*z_im;
       float new_z_im = c_im[i] + 2.*z_re*z_im;

     // Scale the number of iterations to the range of an 8-bpp image
     output = static_cast<unsigned char>(static_cast<float>(iter) /
     max_recurrences * 255);

This code has two loops. The outer loop iterates through all the points in an input set, which represent coordinates in the complex plane. For each point, a inner loop calculates a recurrence with the coordinate as an initial condition, and counts how many iterations it takes to saturate.

The inner loop is not a great vectorization candidate because it is a iterative summation with an unknown number of iterations. But the outer loop, on the other hand, could work well because the recurrence can be calculated for each point independently.

Loop interchange cannot be done automatically by the compiler because of the code existing in the outer loop (the compiler prefers “perfect” loop nests with no intervening code).

We can use array notation to simulate outer loop vectorization:

void mandel_v(
  double c_re[SIZE], double c_im[SIZE],
  int max_recur,
  char output[SIZE])
  float count[VLEN];
  for (int i = 0; i < size; i += VLEN) {
    double z_re[VLEN], z_im[VLEN], new_z_re[VLEN], new_z_im[VLEN];
    int test[VLEN];
    z_re[:] = c_re[i:VLEN];
    z_im[:] = c_im[i:VLEN];
    int iter;

    // Count how many times you can multiply z by itself
     // until it is greater than 4. count[:] is the result.
    count[:] = 0;
      for (iter=0; iter<max_recur; iter++) {
        test[:] = (z_re[:] * z_re[:] + z_im[:] * z_im[:]) <= 4.;
         if (__sec_reduce_all_zero(test[:])) {
        count[:] += test[:];

        // add 1 or 0 depending on whether z[:] is <= 4
        new_z_re[:] = c_re[:] + z_re[:]*z_re[:] - z_im[:]*z_im[:];
        new_z_im[:] = c_im[:] + 2.*z_re[:]*z_im[:];
        z_re[:] = new_z_re[:];
        z_im[:] = new_z_im[:];
      output[i:VLEN] = (char)((float)count[:] / max_recur*255);

This code uses the basic technique of widening scalar variables to array sections of VLEN size. We don’t want to widen to the entire array length as it may be very large, and the large array section operations may cause sub-optimal code to be generated. Therefore we don’t entirely remove the outer loop; we iterate through the points in chunks of VLEN size.

The interesting feature of this conversion is in how to treat the while-loop in the original code. In the original code, we run the loop on each point until it saturates, and then we stop. Now that we are handling VLEN points in parallel, we must stop iterating each point when it saturates, and still use the same code for each point (SIMD semantics). We do this by testing the points in parallel for saturation, which generates a 1 or 0 result. We add this 1 or 0 value to the count vector: 1 if the point is not saturated, and 0 if it is. When a point is saturated, the vector counter increment therefore becomes a no-op for that vector element. We use the __sec_reduce_all_zero() intrinsic to test if all the points have become saturated and break out of the loop, so that we don’t get stuck doing no-op adds until max_recur is hit.

The vector speedup depends on how many no-op adds are generated, which depends on the uniformity of the result values for a given VLEN. If all points but one are saturated, and that one point keeps iterating to the max, we are wasting a computation on the saturated points. For this reason, it may be possible that a smaller-than-machine-size VLEN may perform better; the user will have to experiment. The __sec_reduce_all_zero() also adds some overhead; if the counter result is always close to max_recur for some input set, it may be omitted.

This technique has the same result of outer loop vectorization, without actually doing it (we help the compiler by blocking the outer loop in VLEN-size chunks and creating vector operations based on those chunks---each array notation statement is a short one-statement inner loop).

You can find more examples of outer-loop vectorization described in the following paper:

ISCA 2012 Paper: Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications?(June 2012)

Take Aways

By default, compilers will attempt to vectorize the innermost loop of a nested loop structure. If there are not enough iterations in this innermost loop, it may not be profitable for the compiler to vectorize that innermost loop. However, if an outer loop has more iterations, vectorization at that outer level may be effective. To do this, C++ Array Notation can be used to simulate outer loop vectorization. C++ Array Notation is part of Intel® Cilk™ Plus which is a feature of the Intel® C++ Composer XE.

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ architecture. The paths provided in this guide reflect the steps necessary to get best possible application performance.

Back the main chapter Vectorization Essentials.

For more complete information about compiler optimizations, see our Optimization Notice.