# Outer Loop Vectorization

Published on September 7, 2012, updated September 3, 2014

*Compiler Methodology for Intel® MIC Architecture*

## Vectorization Essentials, Outer Loop Vectorization

### Overview

Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But, in some cases, the number of iterations in the innermost loop is small. In this case, inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.

### Topics

Outer loop vectorization is demonstrated below.

**Example loop-nest:**

If you have a sparse-matrix-vector pattern as follows:

for(LocalOrdinalType row=ib*BLOCKSIZE; row<top; ++row) { local_y[row]=0.0; for(LocalOrdinalType i=Arowoffsets[row]; i<Arowoffsets[row+1]; ++i) { local_y[row] += Acoefs[i]*local_x[Acols[i]]; } }

*Vectorization at inner-level vs. outer-level:*

You can vectorize the inner i-loop. If you do this default vectorization, one gets unit-stride load for arrays "Acoefs" and "Acols", and gathers for the elements of array "local_x". The "local_y" storage accesses get moved out of the loop and a reduction-temp will be introduced by the compiler in the vectorized loop.

As an alternative, you can vectorize the outer-loop (using elemental functions): In this case, for the inner-loop-body the compiler will generate unit-stride load/store for "local_y", "Acoefs" and "Acols" loads become gather, "local_x" continues to be gather.

**Why this may be useful:** If the average inner-loop trip-counts are low (not enough to fill up a full vector) and the outer-loop trip-counts are large, then you may get better vectorization-efficiency. The other considerations are: Are there any expensive operations in the outer-loop (say a divide) that now get vectorized due to outer-loop vectorization? The potential gains have to be weighed against the costs of extra gathers/scatters compared to inner-loop vectorization (Plus, specifying alignment becomes harder for outer-loop vectorization).

**How to do the outer-loop vectorization:**

- Add the simd pragma (with the right clauses) to the outer-loop.
- Directly vectorize at outer loop level by outlining the body of the outer-loop into a vector-elemental function and using the simd pragma
- Strip-mine outer loop iterations and change each statement in the loop body to operate on the strip. Intel® Cilk™ Plus array notation extension helps the programmers to express the second approach in a natural fashion. See the article here for a more detailed description and example of using this technique: Outer Loop Vectorization via Intel® Cilk™ Plus Array Notations

For the example given above, the outer-loop can be vectorized using the first method simply by adding "#pragma simd" to the outer loop. Alternatively, it can be vectorized using the second method as follows:

#pragma simd for(LocalOrdinalType row=ib*BLOCKSIZE; row<top; ++row) { local_y[row]=0.0; Inner_loop_elem_function(local_y, row, Acoefs, local_x, Acols, Arowoffsets); } __declspec(vector(uniform(Arowoffsets, Acoefs, local_x, Acols, local_y), linear(row)))) Inner_loop_elem_function(float *local_y, int row, float *Acoefs, float *local_x, int *Acols, int *Arowoffsets) { for(LocalOrdinalType i=Arowoffsets[row]; i<Arowoffsets[row+1]; ++i) { local_y[row] += Acoefs[i]*local_x[Acols[i]]; } }

**Here is another example of outer-loop vectorization via use of an elemental function:**

#pragma simd // simd pragma for outer-loop at call-site of elemental-function for (int i = beg*16; i < end*16; ++i) { particleVelocity_block(px[i], py[i], pz[i], destvx + i, destvy + i, destvz + i, vel_block_start, vel_block_end); }

Definition of the elemental function:

__declspec(vector(uniform(start,end), linear(velx,vely,velz))) static void particleVelocity_block(const float posx, const float posy, const float posz, float *velx, float *vely, float *velz, int start, int end) { __assume_aligned(velx,64); __assume_aligned(vely,64); __assume_aligned(velz,64); for (int j = start; j < end; ++j) { const float del_p_x = posx - px[j]; const float del_p_y = posy - py[j]; const float del_p_z = posz - pz[j]; const float dxn = del_p_x * del_p_x + del_p_y * del_p_y + del_p_z * del_p_z +pa[j]* pa[j]; const float dxctaui = del_p_y * tz[j] - ty[j] * del_p_z; const float dyctaui = del_p_z * tx[j] - tz[j] * del_p_x; const float dzctaui = del_p_x * ty[j] - tx[j] * del_p_y; const float dst = 1.0f/std::sqrt(dxn); const float dst3 = dst*dst*dst; *velx -= dxctaui * dst3; *vely -= dyctaui * dst3; *velz -= dzctaui * dst3; } }

Note the use of the uniform and linear clauses on the vector-elemental function here. The last two function arguments vel_block_start and vel_block_end are the same for all **i,** so these two are marked as uniform. The pointer arguments** velx, vely**, and** velz** are marked as linear to correspond to the single call-site.

These pointer variables are also marked as aligned (by the user taking advantage of the alignment properties for the pointers at the call-site) using appropriate clauses inside the elemental function.

ISCA 2012 Paper:Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? (June 2012)

## Take Aways

By default, compilers will attempt to vectorize the innermost loop of a nested loop structure. If there are not enough iterations in this innermost loop, it may not be profitable for the compiler to vectorize that innermost loop. However, if an outer loop has more iterations, vectorization at that outer level may be effective. To do this, a combination of using elemental functions and pragma/directive SIMD can move the vectorization to this outer level.

## NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon® Phi™ architecture. The paths provided in this guide reflect the steps necessary to get best application performance on Xeon Phi architecture.

Back the main chapter Vectorization Essentials.