Common Vectorization Tips

Compiler Methodology for Intel® MIC Architecture

Common Vectorization Tips

Handling user-defined function-calls inside vector-loops

If you want to vectorize a loop that has a user-defined function call, (possibly re-factor the code and) make the function-call a vector-elemental function.

Specifying unit-stride accesses inside elemental functions

If your SIMD-enabled functionaccesses memory in unit-stride, these are the two ways you can write:

  • uniform pointer indexed by linear integer
  • linear pointer
__declspec(vector(uniform(a),linear(i:1)))
float foo(float *a, int i){
  return a[i]++;
}

__declspec(vector(linear(a:1)))
float foo1(float *a){
  return (*a)++;
}

Handling memory disambiguation inside vector-loops

Consider vectorization for a simple loop:

void vectorize(float *a, float *b, float *c, float *d, int n) {
    int i;
    for (i=0; i<n; i++) {
        a[i] = c[i] * d[i];
        b[i] = a[i] + c[i] - d[i];
    }
} 
  • Here, compiler has no idea where those 4 pointers are pointing to. As a programmer, you may know they point to totally independent locations. Compiler thinks differently. Unless programmer explicitly tells the compiler that they point to independent locations, compiler has to assume they are VERY BADLY aliased to each other --- for example, c[1] and a[0] may be at the same addr and thus the loop cannot be vectorized at all.
  • When the number of unknown pointers are VERY SMALL, compiler can generate a runtime check and generate optimized and unoptimized versions of the loops (w/ overhead in compile time, code size, and also runtime testing). Since the overhead grows quickly, that "VERY SMALL" number has to be really small ---- like two --- and even then you are paying the price for not telling the compiler that "pointers are independent".
  • So, the better thing to do is to tell the compiler that "pointers are independent". One way to do it is to use C99 "restrict pointer" keyword. Even if you are not compiling with "C99 standard", you can use -restrict (Linux) and -Qrestrict (Windows) flag to let the Intel compilers accept "restrict" keyword.
void vectorize(float *restrict a, float *restrict b, float *c, float *d, int n) {
    int i;
    for (i=0; i<n; i++) {
        a[i] = c[i] * d[i];
        b[i] = a[i] + c[i] - d[i];
    }
}
  • Here, the programmer tells the compiler that "a" isn't aliased with anything else, and "b" isn't aliased with anything else.
  • Another way is to use IVDEP pragma. Semantics of IVDEP is different from the restrict pointer, but it allows the compiler to eliminate some of the assumed dependencies ---- just enough to let the compiler think vectorization is safe.
void vectorize(float *a, float *b, float *c, float *d, int n) {
    int i;
#pragma ivdep
    for (i=0; i<n; i++) {
        a[i] = c[i] * d[i];
        b[i] = a[i] + c[i] - d[i];
    }
}

Avoid 64bit integer convert to/from floating point

  • Don't convert 64bit integer to/from floating point.
  • Use 32bit integer instead.
  • Use 32bit signed integer if possible (most efficient)

Avoid gather/scatter indexed by 64bit integer

  • No efficient hardware support available for 64bit index gather/scatter

Option to influence the compiler's vectorization cost-model

-vec-threshold[n]: Sets a threshold for the vectorization of loops based on the probability of profitable execution of the vectorized loop in parallel is a number between 0 and 100. Default is 100. If you want to vectorize whenever a loop is vectorizable regardless of the cost model, use the option -vec-threshold0.

Vectorization of loops with indirect accesses

Consider vectorization of a loop with indirect memory loads/stores as shown below:

 for (i = kstart; i < kend; ++i) {
    istart = iend;
    iend   = mp_dofStart[i+1];
    float w = xd[i];

    for (j = istart; j < iend; ++j) {
        index  = SCS[j];
        xd[index] -= lower[j]*w;
    }
 }

For the code shown above, the key pre-requisite to vectorization is that the xd values are distinct (otherwise, there are genuine dependences that will make the loop NOT vectorizable. If that is the case, the only alternative is to rewrite the algorithm in a vector-friendly way). If the xd values are guaranteed (by the user) to be distinct, then one can use the ivdep/simd pragmas (before the inner j-loop) to vectorize. The compiler will still generate gather/scatter vectorization - if there is an alternative algorithmic formulation where unit-strides can be used, that may be beneficial.

Vectorization of loop-forms that involve a monotonic induction variable

Compiler can vectorize certain kind of loops that use a monotonic induction variable (induction variable that gets updated only under some condition inside the loop). An example of this is shown below (this loop-form is sometimes referred to as having a “compress idiom”):

    int index_0 = 0;
    for(int k0=0; k0<count0; k0++) {
        TYPE X1 = *(Pos0 + k0);        TYPE Y1 = *(Pos0 + k0 +   count0);
        TYPE Z1 = *(Pos0 + k0 + 2*count0);
        #pragma loop_count min(220) avg (300) max (380)
        #pragma ivdep
        for(int k1=0; k1<count1; k1+=1) {
            TYPE X0 = *(Pos1 + k1);
            TYPE Y0 = *(Pos1 + k1 +   count1);
            TYPE Z0 = *(Pos1 + k1 + 2*count1);
            TYPE diff_X = (X0 - X1);
            TYPE diff_Y = (Y0 - Y1);
            TYPE diff_Z = (Z0 - Z1);
            TYPE norm_2 = (diff_X*diff_X) +  (diff_Y*diff_Y) + (diff_Z*diff_Z);

            if ( (norm_2 >= rmin_2) && (norm_2 <= rmax_2))
                   Packed[index_0++] = norm_2;
        }
    }

Here the variable index_0 is a getting updated under a condition. Note that it is incorrect to use the SIMD pragma to this loop-form using any of the currently supported clauses (ivdep pragma is okay).

You can improve the code generation sequence for such loops by adding the –qopt-assume-safe-padding option. This option determines whether the compiler assumes that variables and dynamically allocated memory are padded past the end of the object. When -opt-assume-safe-padding is specified, the compiler assumes that variables and dynamically allocated memory are padded. This means that code can access up to 64 bytes beyond what is specified in your program. The compiler does not add any padding for static and automatic objects when this option is used, but it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in the program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.

Here is a detailed example that shows how the code generated for the compress-idiom loop gets improved using this option:

void foo(float* restrict a, float* restrict b, float* restrict c)
{
   int i;
   int j = 0;
   for(i = 0; i < N; i++) {
       if (b[i])  {
         a[j++] = c[i];
       }
   }
}

In the default configuration without the -opt-assume-safe-padding option, the compiler is forced to be conservative (to protect against memory faults) and generates the following vectorized code sequence that uses the vpackstore and vscatter instructions:

..B1.6:
        vloadunpackld (%rsi,%rax,4), %zmm2
        vprefetch1 512(%rsi,%rax,4)
        vloadunpackld (%rdx,%rax,4), %zmm3
        vprefetch0 256(%rsi,%rax,4)
        vloadunpackhd 64(%rsi,%rax,4), %zmm2
        vprefetch1 512(%rdx,%rax,4)
        vloadunpackhd 64(%rdx,%rax,4), %zmm3
        vprefetch0 256(%rdx,%rax,4)
        vcmpneqps %zmm1, %zmm2, %k1
        movl      $65535, %r10d
        vpackstorelps %zmm3, -64(%rsp){%k1}
        lea       (%rdi,%r8,4), %r11
        vmovaps   -64(%rsp), %zmm4
        kmov      %k1, %r9d
        popcnt    %r9d, %ecx
        addq      $16, %rax
        movl      %ecx, %ecx
        shll      %cl, %r10d
        addq      %rcx, %r8
        xorl      $-1, %r10d
        kmov      %r10d, %k2
..L7:
        vscatterdps %zmm4, (%r11,%zmm0,4){%k2}
        jkzd      ..L6, %k2
        vscatterdps %zmm4, (%r11,%zmm0,4){%k2}
        jknzd     ..L7, %k2
..L6:
        cmpq      $992, %rax
        jb        ..B1.6

When the option -opt-assume-safe-padding is added (For this example, this can be done safely if the user adds a padding of 64 bytes to the array a when it is allocated), the compiler generates the following higher-performing version:

..B1.6:
        vloadunpackld (%rsi,%rax,4), %zmm1
        vprefetch1 512(%rsi,%rax,4)
        vloadunpackld (%rdx,%rax,4), %zmm2
        vprefetch0 256(%rsi,%rax,4)
        vloadunpackhd 64(%rsi,%rax,4), %zmm1
        vprefetch1 512(%rdx,%rax,4)
        vloadunpackhd 64(%rdx,%rax,4), %zmm2
        vprefetch0 256(%rdx,%rax,4)
        vcmpneqps %zmm0, %zmm1, %k1
        movl      $65535, %r10d
        vpackstorelps %zmm2, -64(%rsp){%k1}
        addq      $16, %rax
        vmovaps   -64(%rsp), %zmm3
        kmov      %k1, %r9d
        popcnt    %r9d, %ecx
        movl      %ecx, %ecx
        shll      %cl, %r10d
        xorl      $-1, %r10d
        kmov      %r10d, %k2
        vpackstorelps %zmm3, (%rdi,%r8,4){%k2}
        vmovaps   -64(%rsp), %zmm4
        nop
        vpackstorehps %zmm4, 64(%rdi,%r8,4){%k2}
        addq      %rcx, %r8
        cmpq      $992, %rax
        jb        ..B1.6

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ architecture. The paths provided in this guide reflect the steps necessary to get best possible application performance.

Back the main chapter Vectorization Essentials.

For more complete information about compiler optimizations, see our Optimization Notice.