Array Notation Tradeoffs

Compiler Methodology for Intel® MIC Architecture

Overview

C/C++ Array Notation is part of Intel® Cilk™ Plus, which is a feature of the Intel® C++ Composer XE. Array Notation is one way to express parallelism, and it helps the compiler with vectorization. However, one has to be careful in its use. Array expressions often require the creation of temporary copies of the intermediate arrays used in evaluating the expression. A side effect is that these temporary vectors can spill out of cache, eliminating reuse and causing a performance loss compared to the original loop equivalent. Rewriting the array syntax with shorter vectors can avoid this cache overflow.

Topics

Tradeoffs in Array Notation due to vector lengths.

When converting code to array notation, it is useful to keep in mind what operations are directly supported by the target hardware. Consider this scalar code:

void scalar(T *R, T *A, T *B, int S, T k) {
  // S is the array size
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  for (int i = 0; i < S; i++) {
    T tmp = A[i] * k - B[i];
    if (tmp > 5.0f) {
      tmp = tmp * sin(B[i]);
    }
    A[i] = tmp;
  }
}

If the scalar code were converted directly to array notation using the basic technique of replacing loop-index-subscripts “[i]” with array notation “[0:S]”, this code would be the result:

void longvector(T *R, T *A, T *B, int S, T k) {
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  T tmp[S];
  tmp[0:S] = A[0:S] * k - B[0:S];
  if (tmp[0:S] > 5.0f) {
    tmp[0:S] = tmp[0:S] * sin(B[0:S]);
  }
  A[0:S] = tmp[0:S];
}

If the array size "S" is large (so that the arrays exceed the L2 cache size), the above code may not perform optimally because the array sections are very large. Specifically:

1) The temporary array "tmp", which was a single scalar value in the original code, is now a large array. This data is reused several times in the algorithm, but does not fit in the cache and must be reloaded from memory. It may even lead to stack allocation problems.

2) The array "B" has the same problem: it is reused but does not fit in the cache.

3) The size-"S" array section operations are much larger than the hardware vector length, so the compiler must decide how to break them up for efficient vectorization.
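To put rough numbers on point 1, a back-of-the-envelope calculation shows when the temporaries stop fitting. The 512 KiB per-core L2 size (as on the Intel® Xeon Phi™ coprocessor) and T = float are illustrative assumptions here, not measurements:

```c
#include <stddef.h>

/* Assumed per-core L2 capacity: 512 KiB. */
enum { L2_BYTES = 512 * 1024 };

/* Element count at which the temporary array "tmp" alone fills L2. */
size_t tmp_overflow_elems(void) {
    return L2_BYTES / sizeof(float);        /* 131072 elements */
}

/* The long-vector version keeps three S-element arrays (A, B, tmp)
   live across the whole expression, so the combined working set
   overflows L2 roughly three times sooner. */
size_t fused_overflow_elems(void) {
    return L2_BYTES / (3 * sizeof(float));  /* 43690 elements */
}
```

Beyond those element counts, every reuse of "tmp" or "B" in the expression becomes a reload from memory rather than a cache hit.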

The compiler may be able to "fuse" all the code together, which would improve reuse, but it may be hampered by the unknown size "S" and the generic declaration of the arrays as pointers to T.

A way to write the above code which relies less on aggressive compiler analysis is:

void shortvector(T *R, T *A, T *B, int S, T k) {
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  for (int i = 0; i < S; i += VLEN) {
    T tmp[VLEN];
    tmp[:] = A[i:VLEN] * k - B[i:VLEN];
    if (tmp[:] > 5.0f) {
      tmp[:] = tmp[:] * sin(B[i:VLEN]);
    }
    A[i:VLEN] = tmp[:];
  }
}

This "short vector style" re-introduces the for-loop and iterates through the loop in groups of VLEN elements. Within the loop, the compiler will generate operations that handle VLEN elements at a time. If VLEN is chosen to match the target hardware vector length, these operations can map directly into vector instructions and register operations.

The obvious question to ask is: isn't the "short vector" style more complicated to read than the original for-loop---wouldn't it be better to use a scalar for-loop and rely on compiler vectorization and other optimizations? The answer, of course, is "it depends". The advantage of the short vector style is that it tells the compiler exactly what is expected, and it assists with dependency analysis because of the array notation semantics (array notation implies no dependencies within a statement). If the scalar for-loop performs optimally, there is of course no reason to use this style. But if the compiler is not providing the desired code generation for the scalar for-loop, then short vectors are a good way to tune the loop without unnatural-looking constructions like manual strip-mining.

Another natural question is: "What is the optimal value of VLEN?" A good rule of thumb is to start with the target hardware vector size and try multiplying or dividing by 2. Increasing the size will increase the number of vector registers needed to compute the loop, but may improve performance by reducing the trip count and exposing more optimization opportunities. Decreasing the size will increase the trip count, but may be needed if the loop operations require more vector registers than are available (it is a good idea to reduce VLEN when mixing floats and doubles, for instance). For the Intel® Xeon Phi™ coprocessor, 16 seems to be the optimal VLEN for the above routine.
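One way to encode that rule of thumb is to derive the starting VLEN from the vector register width at build time rather than hard-coding 16. `SIMD_BYTES` and `VLEN_OF` are hypothetical names for illustration:

```c
#include <stddef.h>

/* Starting-point VLEN derived from the vector register width.
   SIMD_BYTES is an assumed build-time constant: 64 for the 512-bit
   vectors of the Intel Xeon Phi coprocessor, 32 for AVX, 16 for SSE. */
#define SIMD_BYTES 64
#define VLEN_OF(type) (SIMD_BYTES / sizeof(type))

/* With 64-byte vectors: VLEN_OF(float) == 16, VLEN_OF(double) == 8.
   When a loop mixes floats and doubles, the double value is the safer
   starting point, matching the advice above to reduce VLEN. */
```

From that starting point, doubling or halving VLEN and re-measuring, as described above, finds the sweet spot for a particular loop.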

On the Intel® Xeon Phi™ coprocessor, the short vector code above runs about 25% faster than the scalar code. The full code:

// Example of short vector style coding vs. scalar. With benchmark timing.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define S 8192*4
#define T float
#define ITERS 100
#define VLEN 16 // Reduce for AVX, SSE, etc.

__declspec(noinline) void scalar(T *R, T *A, T *B, T k) {
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  for (int i = 0; i < S; i++) {
    T tmp = A[i] * k - B[i];
    if (tmp > 5.0f) {
      tmp = tmp * sin(B[i]);
    }
    A[i] = tmp;
  }
}

// NOT EXECUTED; CAUSES STACK OVERFLOW DUE TO LARGE STACK ALLOCATION
__declspec(noinline) void longvector(T *R, T *A, T *B, T k) {
//__declspec(noinline) void longvector(T R[S], T A[S], T B[S], T k) {
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  T tmp[S];
  tmp[0:S] = A[0:S] * k - B[0:S];
  if (tmp[0:S] > 5.0f) {
    tmp[0:S] = tmp[0:S] * sin(B[0:S]);
  }
  A[0:S] = tmp[0:S];
}

__declspec(noinline) void shortvector(T *R, T *A, T *B, T k) {
  __assume_aligned(R,64);
  __assume_aligned(A,64);
  __assume_aligned(B,64);
  for (int i = 0; i < S; i += VLEN) {
    T tmp[VLEN];
    tmp[:] = A[i:VLEN] * k - B[i:VLEN];
    if (tmp[:] > 5.0f) {
      tmp[:] = tmp[:] * sin(B[i:VLEN]);
    }
    A[i:VLEN] = tmp[:];
  }
}

bool compare(T ref, T cean) {
  return (fabs(ref - cean) < 0.00001);
}

int main() {
  volatile __int64 start=0, end=0, perf_ref=0, perf_short=0, max, tmpdiff;
  T *A, *B, *C;
  posix_memalign((void **)&A, 64, sizeof(T)*S);
  posix_memalign((void **)&B, 64, sizeof(T)*S);
  posix_memalign((void **)&C, 64, sizeof(T)*S);
  int short_vs_ref;
  T ref_result, short_result;
  float k = 0.5;

  // Time the scalar reference; discard the worst iteration as noise.
  max = 0;
  for (int i = 0; i < ITERS; i++) {
    A[0:S] = __sec_implicit_index(0);
    B[0:S] = __sec_implicit_index(0);
    C[0:S] = __sec_implicit_index(0);
    start = __rdtsc();
    scalar(A,B,C,k);
    tmpdiff = __rdtsc() - start;
    perf_ref += tmpdiff;
    if (max < tmpdiff) max = tmpdiff;
    ref_result = __sec_reduce_add(A[0:S]);
  }
  perf_ref -= max;

  // Time the short vector version the same way.
  max = 0;
  for (int i = 0; i < ITERS; i++) {
    A[0:S] = __sec_implicit_index(0);
    B[0:S] = __sec_implicit_index(0);
    C[0:S] = __sec_implicit_index(0);
    start = __rdtsc();
    shortvector(A,B,C,k);
    tmpdiff = __rdtsc() - start;
    perf_short += tmpdiff;
    if (max < tmpdiff) max = tmpdiff;
    short_result = __sec_reduce_add(A[0:S]);
  }
  perf_short -= max;

  tmpdiff = (perf_ref - perf_short) * 100 / perf_ref;
  short_vs_ref = (int)tmpdiff;
  if (!compare(ref_result, short_result)) {
    printf("MISCOMPARE SHORT: FAILED\n");
    return -1;
  } else if (short_vs_ref < 15) {
    printf("SHORT VECTOR IS < 15%% FASTER THAN SCALAR : %d%%\n", short_vs_ref);
    printf("FAILED\n");
    return -2;
  }
  printf("SHORT VS SCALAR SPEEDUP >= 15%%: %d%%\n", short_vs_ref);
  printf("PASSED\n");
  return 0;
}

You can find more examples of outer-loop vectorization described in the following paper:

ISCA 2012 Paper: Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? (June 2012)

Take Aways

C/C++ Array Notation is part of Intel® Cilk™ Plus, which is a feature of the Intel® C++ Composer XE. Array Notation is one way to express parallelism, and it helps the compiler with vectorization. However, one has to be careful in its use. Array expressions often require the creation of temporary copies of the intermediate arrays used in evaluating the expression. A side effect is that these temporary vectors can spill out of cache, eliminating reuse and causing a performance loss compared to the original loop equivalent. Rewriting the array syntax with shorter vectors can avoid this cache overflow.

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ architecture. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Back to the main chapter: Vectorization Essentials.

For more complete information about compiler optimizations, see our Optimization Notice.