# Vectorizing Intel(R) TBB parallel_for block

This article demonstrates how to write vector-friendly code inside an Intel(R) TBB parallel_for block. Consider the following code snippet:

```
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <cstdlib>

using namespace std;
using namespace tbb;

long len = 0;
float *a;
float *b;
float *c;

class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (long i=x.begin(); i!=x.end(); ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};

int main(int argc, char* argv[]) {
    // The array length is read from the first command-line argument.
    if (argc < 2) {
        cerr << "usage: " << argv[0] << " <length>" << endl;
        return 1;
    }
    cout << atol(argv[1]) << endl;
    len = atol(argv[1]);
    a = new float[len];
    b = new float[len];
    c = new float[len];
    parallel_for(blocked_range<size_t>(0,len, 100), Test() );
    return 0;
}
```

The code above has a parallel_for block that invokes the Test functor. When this program is compiled, the vectorization report states that the loop was not vectorized, as shown below:

```
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
```

Looking at the loop closely, the compiler cannot determine whether it is a countable loop, because the loop bounds are function calls (x.begin() and x.end()). Storing the bounds in local variables before the loop, as shown below, removes this ambiguity for the compiler:

From:

```
class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (long i=x.begin(); i!=x.end(); ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
```

To:

```
class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        long j = x.begin();
        long k = x.end();
        for (long i=j; i!=k; ++i ) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
```

The vectorization report for the above change is:

```
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
```

The loop is still not vectorized, but this time because the compiler assumes that there is a vector dependence. The compiler has no way of knowing whether the arrays “a”, “b” and “c” are aliased (that is, whether they point to overlapping memory locations). Since the arrays are disjoint in memory in this case, declaring them as restrict pointers helps. The __restrict__ keyword explicitly informs the compiler that there is no aliasing. The code change is shown below:

From:

```
float *a;
float *b;
float *c;
```

To:

```
float * __restrict__ a;
float * __restrict__ b;
float * __restrict__ c;
```

Compiling the code with the __restrict__-qualified pointers vectorizes the loop, as shown below:

```
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
```

Because the functor is inlined, the reports above attribute the loop to parallel_for.h instead of the user's source file. A vectorization report that maps back to the original source can be obtained with the compiler option -debug inline-debug-info, as shown below:

```
$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s -debug inline-debug-info
partitioner.h(171): (col. 9) remark: loop was not vectorized: unsupported loop structure
test60.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
partitioner.h(164): (col. 9) remark: loop was not vectorized: unsupported loop structure
test60.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
partitioner.h(245): (col. 33) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(265): (col. 52) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence
```

## 1 comment

Hi Anoop,

You have parallelized and vectorized a loop at the same time.

The solution does not apply to nested loops where the first one needs to be parallelized and the second one vectorized. Have you tried that scenario? What is your opinion?