Virtual Vector Function Supported in Intel® C++ Compiler 17.0

Intel® C++ Compiler 17.0 adds support for virtual vector functions. A vector function (also known as a SIMD-enabled function) allows you to vectorize a loop containing calls to user-defined functions without inlining them. This is especially helpful for virtual functions, which normally cannot be inlined, so that loops calling them can still be vectorized.

The syntax of a virtual vector function is exactly the same as for ordinary vector functions [1, 2].

All vector declarations on a virtual function are inherited and cannot be altered in overrides. This means that you must either fully replicate the vector specifications on an override or omit them entirely; either way, every override receives the same set of vector versions as declared for the original function. It also follows that vector declarations cannot be introduced on an override: a virtual function must be declared vector when it is first introduced.

Vector declarations may also be applied to pure virtual functions, so you can declare a vector interface; all implementations then become vector functions with the same set of versions as declared for the interface.
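As a minimal sketch of this, the vector declaration can be placed on a pure virtual function of an interface (the `Shape`/`Square` names and `sum_areas` helper below are hypothetical, not from the example that follows):

```cpp
class Shape {
public:
    // Vector declaration on the pure virtual interface; it is inherited
    // by every implementation with the same set of versions.
    #pragma omp declare simd uniform(this) linear(i)
    virtual int area(int i) = 0;
    virtual ~Shape() {}
};

class Square : public Shape {
public:
    // No vector declarations may be added or changed here; the
    // uniform(this) linear(i) variant is inherited from Shape.
    int area(int i) { return i * i; }
};

// Sum areas 0..n-1 through the interface; with the Intel compiler the
// inherited uniform(this) variant lets this loop vectorize.
int sum_areas(Shape* s, int n) {
    int sum = 0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += s->area(i);
    return sum;
}
```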

An Example:

#include <stdio.h>
#define N 100

class A {
public:
    #pragma omp declare simd linear(X)
    #pragma omp declare simd uniform(this) linear(X)
    virtual int foo(int X);
};

// #pragma omp declare simd linear(X)  -  inherited
// #pragma omp declare simd uniform(this) linear(X)
int A::foo(int X){ return X+1; }

class B : public A {
public:
  // #pragma omp declare simd linear(X)  -  inherited
  // #pragma omp declare simd uniform(this) linear(X)
  int foo(int X) { return (X*X); }
};

int main() {
  A* b[N], *a = new B();
  int sum=0;

  for (int i=0; i < N; i++) {
    b[i] = (i % 6) < 2 ? new A() : new B();
  }

  #pragma omp simd reduction (+:sum)
  for (int i=0; i < N; i++) {
     sum += a->foo(i);  // uniform(this) matched
  }                     // one call per chunk

  #pragma omp simd reduction (+:sum)
  for (int i=0; i < N; i++) {
    sum += b[i]->foo(i);  // linear(X) matched
  }

  printf("sum=%d\n", sum);
  return 0;
}

Here the virtual function "foo" has two vector forms, declared with two OpenMP declare simd directives. One directive includes the uniform(this) clause, which means the same actual function is called throughout the loop. The other does not, which means different actual functions may be called within the loop.

The second loop (at vv.cpp line 31 in the optimization report) always calls the overriding B::foo(), since a points to a B object. It matches the vector form whose declare simd directive contains the uniform(this) clause, which makes vectorization easier and more effective. The optimization report (generated by compiling with -xCORE-AVX2 -qopt-report3) confirms this:

LOOP BEGIN at vv.cpp(31,13)
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 103
   remark #15477: vector loop cost: 27.750
   remark #15478: estimated potential speedup: 3.710
   remark #15484: vector function calls: 1
   remark #15488: --- end vector loop cost summary ---
   remark #15489: --- begin vector function matching report ---
   remark #15490: Function call: (Indirect call) with simdlen=4, actual parameter types: (uniform,linear:1)   [ vv.cpp(32,20) ]
   remark #15492: A suitable vector variant was found (out of 4) with xmm, simdlen=4, unmasked, formal parameter types: (uniform,linear:1)
   remark #15493: --- end vector function matching report ---
   remark #25015: Estimate of max trip count of loop=2500
LOOP END

You can see that the vector function matches the vector form with both the uniform and linear declarations. The estimated potential speedup with vector length 4 (simdlen=4) reaches 3.71x!

The third loop (at vv.cpp line 36) calls different actual functions depending on the dynamic type of b[i]. It matches the vector form without the uniform(this) clause. To vectorize this loop, the compiler first gathers the actual call targets for a chunk of loop iterations into one vector and checks whether all vector lanes contain the same target. If so, an indirect vector call to that single target is executed, as in the uniform case. If not, the compiler finds each unique target across the lanes and forms a mask selecting the lanes that share it; the masked version of the target function is then called with that mask.

In this example a simdlen of 4 is chosen, so for the first chunk b[0:3] = {A, A, B, B}: the vector version of A::foo() is called with mask 0b0011 and the version of B::foo() with mask 0b1100. For the second chunk b[4:7] = {B, B, A, A}, the vector version of B::foo() is called with mask 0b0011 and the version of A::foo() with mask 0b1100. For the third chunk b[8:11] = {B, B, B, B}, a single call to the unmasked version of B::foo() is made. Details can be seen in the optimization report (compiled with -xCORE-AVX2 -qopt-report5):
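The target-grouping logic described above can be sketched in plain scalar C++. This is a hand-written emulation under our own assumptions, not compiler output; the `FooFn` type, `foo_a`/`foo_b` stand-ins, and `process_chunk` helper are all hypothetical:

```cpp
// Scalar emulation of a non-uniform indirect vector call, simdlen = 4.
typedef int (*FooFn)(void* self, int x);

static int foo_a(void*, int x) { return x + 1; } // stands in for A::foo
static int foo_b(void*, int x) { return x * x; } // stands in for B::foo

// Process one chunk of 4 lanes: gather the call targets, then, for each
// unique target, build a lane mask and invoke the target for those lanes.
int process_chunk(FooFn targets[4], void* objs[4], int base) {
    int sum = 0;
    unsigned done = 0;                       // lanes already handled
    for (int lane = 0; lane < 4; lane++) {
        if (done & (1u << lane)) continue;
        FooFn t = targets[lane];
        unsigned mask = 0;                   // lanes sharing this target
        for (int j = lane; j < 4; j++)
            if (targets[j] == t) mask |= 1u << j;
        // The real compiler issues one masked vector call here; we
        // emulate it with scalar calls on the masked lanes.
        for (int j = 0; j < 4; j++)
            if (mask & (1u << j)) sum += t(objs[j], base + j);
        done |= mask;
    }
    return sum;
}
```

For the first chunk b[0:3] = {A, A, B, B}, this would invoke the A target with mask 0b0011 and the B target with mask 0b1100, mirroring the description above.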

LOOP BEGIN at vv.cpp(36,13)
   remark #15388: vectorization support: reference b[i] has aligned access   [ vv.cpp(37,12) ]
   remark #15388: vectorization support: reference b[i] has aligned access   [ vv.cpp(37,12) ]
   remark #15415: vectorization support: gather was generated for the variable <b[i]->__vptr>, indirect access, 64-bit indexed, b[i] is read from memory   [ vv.cpp(37,12) ]
   remark #15415: vectorization support: gather was generated for the variable <*(b[i]->__vptr+0)>, indirect access, 64-bit indexed, b[i]->__vptr is read from memory   [ vv.cpp(37,12) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.138
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15458: masked indexed (or gather) loads: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 105
   remark #15477: vector loop cost: 36.250
   remark #15478: estimated potential speedup: 2.890
   remark #15484: vector function calls: 1
   remark #15488: --- end vector loop cost summary ---
   remark #15489: --- begin vector function matching report ---
   remark #15490: Function call: (Indirect call) with simdlen=4, actual parameter types: (vector,linear:1)   [ vv.cpp(37,22) ]
   remark #15492: A suitable vector variant was found (out of 4) with xmm, simdlen=4, unmasked, formal parameter types: (vector,linear:1)
   remark #15493: --- end vector function matching report ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=2500
LOOP END

You can see that the vector function matches the vector form with linear only, without uniform. The estimated potential speedup with vector length 4 drops to 2.89x.

In conclusion, the performance of a virtual vector function depends on the divergence of call targets. A call within a loop that matches a uniform(this) declaration is fastest, with a single call per chunk, while calls with non-uniform targets are slower and less efficient.
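When divergence hurts, one application-level mitigation (our own suggestion, not a compiler feature) is to arrange the object array so that same-type objects are adjacent, making most simdlen-sized chunks see a single call target. A minimal sketch with hypothetical `Base`/`Derived` types and a `group_by_type` helper:

```cpp
#include <algorithm>
#include <typeinfo>
#include <vector>

struct Base { virtual int f(int x) { return x + 1; } virtual ~Base() {} };
struct Derived : Base { int f(int x) { return x * x; } };

// Reorder the pointer array so objects of the same dynamic type are
// adjacent; stable_sort preserves relative order within each type.
void group_by_type(std::vector<Base*>& v) {
    std::stable_sort(v.begin(), v.end(),
        [](Base* a, Base* b) { return typeid(*a).before(typeid(*b)); });
}
```

After grouping, only the chunks straddling a type boundary still need masked calls; all others take the faster single-target path.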

Limitations

  • Multiple inheritance is not supported.
  • Changing vector specifications on overrides is not supported. 
  • Pointers to virtual vector methods are not supported. This also precludes the use of virtual vector functions in C++11 constructs such as std::bind(), which decay the function to a pointer internally.
  • This feature is Intel-specific and will not interoperate with a caller or callee built using a third-party OpenMP 4.0 compiler.

Reference

1. SIMD-enabled functions: https://software.intel.com/en-us/node/522650

2. OpenMP declare simd directive: https://software.intel.com/en-us/node/524514

3. OpenMP user-defined reductions: https://software.intel.com/en-us/articles/openmp-40-new-features-supported-in-intel-compiler-160


For more complete information about compiler optimizations, see our Optimization Notice.