Vectorizing Intel(R) TBB parallel_for block

This article demonstrates how to write vector-friendly code inside an Intel(R) TBB parallel_for block. Consider the following code snippet:

	#include <iostream>
	#include <tbb/tbb.h>
	#include <tbb/parallel_for.h>
	#include <cstdlib>

	using namespace std;
	using namespace tbb;

	long len = 0;
	float *a;
	float *b;
	float *c;

	// Body functor: parallel_for calls operator() once per sub-range.
	class Test {
	public:
	    void operator()( const blocked_range<size_t>& x ) const {
	        for (long i=x.begin(); i!=x.end(); ++i ) {
	            c[i] = (a[i] * b[i]) + b[i];
	        }
	    }
	};

	int main(int argc, char* argv[]) {
	    if (argc < 2) {   // the array length is a required argument
	        cerr << "usage: " << argv[0] << " <length>" << endl;
	        return 1;
	    }
	    cout << atol(argv[1]) << endl;
	    len = atol(argv[1]);
	    a = new float[len];
	    b = new float[len];
	    c = new float[len];
	    // Run Test over [0, len) with a grain size of 100.
	    parallel_for(blocked_range<size_t>(0,len, 100), Test() );
	    return 0;
	}


The code above contains a parallel_for call that invokes the Test functor on sub-ranges of the index space. When this program is compiled, the vectorization report states that the loop was not vectorized, as shown below:


	$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence
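
It helps to keep in mind how the functor is invoked: parallel_for does not run the loop over the whole range in one call. It splits the blocked_range and calls operator() on the pieces, so the bounds seen inside the functor differ from call to call. The following serial sketch models that splitting under stated assumptions: Chunk, split_and_run and CountBody are hypothetical names, the recursion follows the documented blocked_range rule (a range is divisible while its size exceeds the grain size, and it is split at its midpoint), and neither threads nor partitioner heuristics are modeled.

```cpp
#include <cstddef>

// Hypothetical half-open index range [first, last) with a grain size.
struct Chunk {
    std::size_t first, last, grain;
    std::size_t size() const { return last - first; }
    bool is_divisible() const { return grain < size(); }
};

// Serial model: split at the midpoint until a piece is no longer
// divisible, then hand that piece to the body, as a functor would see it.
template <typename Body>
void split_and_run(const Chunk& r, Body& body) {
    if (!r.is_divisible()) {
        body(r);
        return;
    }
    const std::size_t middle = r.first + r.size() / 2;
    const Chunk left = {r.first, middle, r.grain};
    const Chunk right = {middle, r.last, r.grain};
    split_and_run(left, body);
    split_and_run(right, body);
}

// Example body that records how many pieces it saw and how large they were.
struct CountBody {
    std::size_t calls, covered, largest;
    CountBody() : calls(0), covered(0), largest(0) {}
    void operator()(const Chunk& r) {
        ++calls;
        covered += r.size();
        if (r.size() > largest) largest = r.size();
    }
};
```

For a range of 1000 with a grain size of 100, this model calls the body on pieces of 62 or 63 iterations. Depending on the partitioner in use, real TBB may stop splitting earlier and pass larger pieces, and it runs the pieces on worker threads.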

	

Looking closely at the loop in Test::operator(), the compiler cannot tell whether it is a countable loop, because the loop bounds are function calls (x.begin() and x.end()) rather than values known before the loop starts. Copying the bounds into local variables before the loop, as shown below, removes this ambiguity:

From:


	class Test {
	public:
	    void operator()( const blocked_range<size_t>& x ) const {
	        for (long i=x.begin(); i!=x.end(); ++i ) {
	            c[i] = (a[i] * b[i]) + b[i];
	        }
	    }
	};

	

To:


	class Test {
	public:
	    void operator()( const blocked_range<size_t>& x ) const {
	        long j = x.begin();
	        long k = x.end();
	        for (long i=j; i!=k; ++i ) {
	            c[i] = (a[i] * b[i]) + b[i];
	        }
	    }
	};
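
The same hoisting pattern can be tried outside TBB. The following is a minimal sketch, not code from the test program above: IndexRange is a hypothetical stand-in that only mimics the begin()/end() interface of blocked_range<size_t>, and fma_chunk is an illustrative name. It copies both bounds into const locals of the range's own index type, so the loop condition compares against a fixed value and no signed/unsigned conversion is involved.

```cpp
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<size_t>: only begin()/end().
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Applies c[i] = a[i] * b[i] + b[i] to every index in the range.
// The bounds are read once, before the loop, so the loop condition
// compares against a local value instead of calling x.end() each time.
template <typename Range>
void fma_chunk(const Range& x, const float* a, const float* b, float* c) {
    const std::size_t first = x.begin();
    const std::size_t last = x.end();
    for (std::size_t i = first; i < last; ++i) {
        c[i] = (a[i] * b[i]) + b[i];
    }
}
```

Whether a given compiler needs this hoisting depends on how much of the range class it can inline, so the vectorization report remains the way to check.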

	

After hoisting the loop bounds in Test::operator(), the vectorization report is:


	$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: existence of vector dependence
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence

	

The loop is still not vectorized, but this time the reason is that the compiler assumes a vector dependence. The compiler has no way of knowing whether the arrays “a”, “b” and “c” are aliased (that is, whether they point to overlapping memory locations), so it must assume that they might be. Since the arrays in this example are disjoint in memory, declaring them as restrict pointers helps. The __restrict__ keyword explicitly informs the compiler that there is no aliasing. The code change is shown below:

From:


	float *a;
	float *b;
	float *c;

	

To:

	float * __restrict__ a;
	float * __restrict__ b;
	float * __restrict__ c;
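
To see why the compiler has to be conservative without this promise, consider what happens when the pointers really do overlap. The sketch below is illustrative only: fma_loop, run_disjoint, run_overlapping and the buffer values are assumptions, not part of the test program. It runs the same loop body once on disjoint arrays and once with the output pointer placed one element past the input pointer. In the overlapping call each iteration reads the value the previous iteration wrote, which is the kind of dependence a vectorized loop would break.

```cpp
#include <cstddef>

// Same loop body as the functor, with plain (non-restrict) pointers,
// so calling it with overlapping arrays is well defined.
void fma_loop(float* c, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = (a[i] * b[i]) + b[i];
    }
}

// Disjoint case: every output depends only on the inputs.
void run_disjoint(float out[4]) {
    const float a[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    const float b[4] = {2.0f, 2.0f, 2.0f, 2.0f};
    fma_loop(out, a, b, 4);
}

// Overlapping case: c aliases a shifted by one element, so c[i] is a[i + 1].
// Iteration i + 1 then reads what iteration i just stored.
void run_overlapping(float buf[5]) {
    const float b[4] = {2.0f, 2.0f, 2.0f, 2.0f};
    for (std::size_t i = 0; i < 5; ++i) buf[i] = 1.0f;
    fma_loop(buf + 1, buf, b, 4);
}
```

With these inputs the disjoint call fills out with 4, while the overlapping call leaves buf as 1, 4, 10, 22, 46. A vectorized loop that loaded several input elements before storing any output would not reproduce the second result, so without further information the compiler either keeps the loop scalar or has to add a run-time overlap check. __restrict__ is a promise that the overlapping case never occurs; if that promise is broken, the behavior is undefined.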

	

Compiling the code with the __restrict__-qualified pointers vectorizes the loop, as shown below:


	$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: unsupported loop structure
	parallel_for.h(108): (col. 22) remark: LOOP WAS VECTORIZED
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	parallel_for.h(108): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	partitioner.h(158): (col. 9) remark: loop was not vectorized: existence of vector dependence

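Because the functor is inlined into the TBB headers' template code, the remarks above are attributed to parallel_for.h rather than to the original source file. A vectorization report that maps back to the original source can be obtained by adding the compiler option -debug inline-debug-info, as shown below: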


	$ icpc -S -O3 -vec-report2 test1.cc -o test1_O3_icc.s -debug inline-debug-info
	partitioner.h(171): (col. 9) remark: loop was not vectorized: unsupported loop structure
	test60.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
	partitioner.h(164): (col. 9) remark: loop was not vectorized: unsupported loop structure
	test60.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
	partitioner.h(245): (col. 33) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	partitioner.h(265): (col. 52) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
	partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence
