Vectorization and parallelism in Cilk with elemental functions

This question is not Xeon Phi-specific, but I know that some Cilk Plus gurus are monitoring this forum, so maybe somebody could help.

I am trying to figure out how to use elemental functions in a loop that is both parallelized and vectorized. I am using ICPC 13.3.

1) This works: elemental function declaration and implementation, _Cilk_for loop:

// Function declaration and implementation 
__attribute__((vector)) double ef_add (double x, double y) {
  return x + y;
}
// 
int main() {
  const int n = 100000;
  double a[n], b[n], c[n];
  a[:]=1.0; b[:]=1.0; c[:]=1.0;
  // Gets vectorized 
  _Cilk_for (int j = 0; j < n; ++j) {
    a[j] = ef_add(b[j],c[j]);
  }
}

[avladim@dublin ~]$ icpc -vec-report3 -c clk1.cpp
clk1.cpp(9): (col. 3) remark: LOOP WAS VECTORIZED.
clk1.cpp(2): (col. 60) remark: FUNCTION WAS VECTORIZED.
clk1.cpp(2): (col. 60) remark: FUNCTION WAS VECTORIZED.
clk1.cpp(11): (col. 37) remark: LOOP WAS VECTORIZED.
[avladim@dublin ~]$

2) This doesn't work: elemental function declaration only (function implemented in a different file), _Cilk_for loop. The vectorizer reports "subscript too complex":

// Only declaration 
__attribute__((vector)) double ef_add (double x, double y);
// 
// 
int main() {
  const int n = 100000;
  double a[n], b[n], c[n];
  a[:]=1.0; b[:]=1.0; c[:]=1.0;
  // Does not get vectorized 
  _Cilk_for (int j = 0; j < n; ++j) {
    a[j] = ef_add(b[j],c[j]);
  }
}

[avladim@dublin ~]$ icpc -vec-report3 -c clk2.cpp
clk2.cpp(8): (col. 3) remark: LOOP WAS VECTORIZED.
clk2.cpp(11): (col. 21) remark: loop was not vectorized: subscript too complex.
[avladim@dublin ~]$

3) This works: elemental function declaration only, _Cilk_for loop with strip-mining:

// Only declaration 
__attribute__((vector)) double ef_add (double x, double y);
// 
// 
int main() {
  const int n = 100000;
  double a[n], b[n], c[n];
  a[:]=1.0; b[:]=1.0; c[:]=1.0;
  // With strip-mining, gets vectorized 
  _Cilk_for (int j = 0; j < n; j+=1000) {
    a[j:1000] = ef_add(b[j:1000],c[j:1000]);
  }
}

[avladim@dublin ~]$ icpc -vec-report3 -c clk3.cpp
clk3.cpp(8): (col. 3) remark: LOOP WAS VECTORIZED.
clk3.cpp(11): (col. 17) remark: LOOP WAS VECTORIZED.
clk3.cpp(10): (col. 41) remark: loop was not vectorized: not inner loop.
[avladim@dublin ~]$

So,

(1) The compiler vectorizes a unit-stride parallel loop when the elemental function is implemented right there, in the same file.

(2) It does not vectorize a unit-stride parallel loop when only the elemental function declaration is available, not the implementation.

(3) Then again, it does vectorize a parallel loop when only the elemental function declaration is available, provided the loop is strip-mined.

My question is: why doesn't (2) get vectorized? Should it?

Thanks for any hints!

Andrey


Just cleaning up some old issues. What follows are comments passed to me by one of the developers. When he talks about a pedigree, he means a unique identifier for a strand of execution in a Cilk program. Basically, the compiler is worried that something done inside the function is going to mess with that identifier. At the developer's suggestion, a feature request has been opened to look into getting the second case to vectorize.

The developer said:

The Cilk_for loop has a function call inside it. When the compiler can see the body of the callee, it can make sure that the Cilk pedigree does not get updated by the callee, and the pedigree function calls inside the loop get cleaned up before the vectorizer runs, so vectorization happens.

In the case of clk2.cpp, there is no special processing being done to account for the fact that the callee is a simd-enabled function (I am not sure if we can, but this needs to be explored further). The pedigree call remains inside the loop, and vectorization does not happen.

When the user does the strip-mining in clk3.cpp, the pedigree call exists in the outer loop (that is the Cilk_region), and the inner loop gets optimized properly.
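For readers who want to see the loop structure that strip-mining produces, here is a minimal sketch in plain C++. It is an illustration under stated assumptions, not the compiler's actual transformation: the outer loop is serial here and only marks where _Cilk_for would go, ef_add is defined locally as a stand-in for the elemental function, and the names strip_mined_add and chunk are hypothetical. Unlike clk3.cpp, it also handles the case where n is not a multiple of the chunk size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the elemental function. In the original post it carries
// __attribute__((vector)) and may be defined in another file.
double ef_add(double x, double y) { return x + y; }

// Strip-mined element-wise addition (hypothetical helper).
// Outer loop: one iteration per chunk; this is where _Cilk_for would go,
// so any per-iteration bookkeeping happens once per chunk.
// Inner loop: unit stride over a single chunk; this is the loop the
// vectorizer gets to work on.
void strip_mined_add(std::vector<double>& a, const std::vector<double>& b,
                     const std::vector<double>& c, std::size_t chunk) {
  assert(chunk > 0);  // a zero chunk would never advance the outer loop
  assert(b.size() == a.size() && c.size() == a.size());
  const std::size_t n = a.size();
  for (std::size_t j = 0; j < n; j += chunk) {
    // The last chunk may be shorter when n is not a multiple of chunk.
    const std::size_t len = std::min(chunk, n - j);
    for (std::size_t k = 0; k < len; ++k) {
      a[j + k] = ef_add(b[j + k], c[j + k]);
    }
  }
}
```

Calling strip_mined_add(a, b, c, 1000) on arrays of length 100000 mirrors the chunking in clk3.cpp. The chunk size trades scheduling granularity against per-chunk overhead, so 1000 is a starting point to tune rather than a recommended value.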

Thank you, Frances! Compiler support for this feature would be in line with the paradigm "program in tasks not threads" ( http://www.drdobbs.com/parallel/rules-for-parallel-programming-for-multi... )

 
