Optimize embarrassingly parallel loops

Introduction

Algorithms that exhibit data parallelism with independent iterations lead to loops that are 'embarrassingly parallel'. Let us look at techniques that maximize the performance of such loops with minimal effort. One such example is a loop that calculates the logarithm of each element of an array.

Auto vectorization

Intel® Composer can automatically detect loops that lend themselves to auto-vectorization. This includes explicit for loops over static or dynamic arrays, vector containers, and valarray containers. Implicit valarray loops can either be auto-vectorized or directed to invoke optimized Intel® Integrated Performance Primitives (Intel® IPP) library primitives. Please see the section on how to enable the Intel-optimized valarray headers.

The following example includes explicit vector and valarray loops and an implicit valarray loop.

```cpp
valarray<float> vf(size), vfr(size);
vector<float> vecf(size), vecfr(size);

// log function, vector, explicit loop
for (int j = 0; j < size; j++) {
    vecfr[j] = log(vecf[j]);
}

// log function, valarray, explicit loop
for (int j = 0; j < size; j++) {
    vfr[j] = log(vf[j]);
}

// log function, valarray, implicit loop
vfr = log(vf);
```

• Make sure that the Intel IPP libraries are selected in the ‘Build Component Selection’ dialog.

Limitations

Algorithms whose iterations are not independent require fine-grained parallelism instead. See this link for additional information.

Currently, valarray calculations that produce implicit temporary arrays incur a severe performance penalty, even with the Intel IPP interface. An example is an equation such as y = (b^2 - 4*a*c) / (2*a), where a, b, c, and y are valarrays. In such cases it is recommended that you not use valarray.