Call site dependence for SIMD-enabled functions in C++

Introduction

An elemental function (a SIMD-enabled function) is a tool for explicit vectorization that shipped as part of the Intel(R) Cilk(TM) Plus package. Many loops in real-world applications contain a function call in the loop body. Often these calls are to user-defined functions, and if no vector version of the function is defined, vectorization of the loop body is hindered. Elemental functions address this problem. But a single vector version of a function won't meet the needs of every call site. The compiler decides whether to generate a vector call from the loop body based purely on the list of available vector versions of the function and on how the function is called at the call site. This article demonstrates this in detail.

Case 1

$cat header.h

__declspec(noinline,vector(uniform(a),linear(i), vectorlength(4)))
void foo(int *a, int i);
 

$cat header.cc

#include"header.h"
#include<iostream>
void foo(int *a, int i){
        std::cout<<a[i]<<"\n";
}

$cat testmain.cc

#include"header.h"
int main(){
int a[40], b[40], k;
a[:] = __sec_implicit_index(0);
b[:] = 39 - __sec_implicit_index(0);
#pragma simd vectorlength(4)
for(int i = 0; i < 40; i++)
        foo(b, i);
return 0;
}
 

$ icpc testmain.cc header.cc -vec-report2
testmain.cc(5): (col. 13) remark: LOOP WAS VECTORIZED
testmain.cc(7): (col. 1) remark: SIMD LOOP WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED

Callee or Definition site: 

Here a function is declared as an elemental function. It receives an integer pointer "a" and an index variable "i", and prints the array element at that index. Because the array base address is constant across all iterations and only the array index changes, linearly with a stride of 1, variable "a" is declared uniform (the value of "a" is broadcast to all iterations) and variable "i" is declared linear, which defaults to unit stride. Assuming the target is an SSE architecture, whose 128-bit registers hold four 32-bit integers, the vector length is set to 4.
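
For readers who do not have the Cilk Plus `__declspec(vector(...))` syntax available, the same contract can be sketched with OpenMP's `declare simd` directive, which this article does not use. The mapping of vectorlength to `simdlen`, and the names `load_at` and `sum_linear`, are assumptions made for illustration. The function returns the element instead of printing it so the result is easy to check.

```cpp
#include <cassert>

// Assumed OpenMP counterpart of vector(uniform(a), linear(i), vectorlength(4)):
// 'a' is the same for all lanes, 'i' advances by 1 per lane, 4 lanes per call.
#pragma omp declare simd uniform(a) linear(i : 1) simdlen(4)
int load_at(const int* a, int i) {
    return a[i];
}

// Hypothetical caller: the pointer is loop-invariant and the index is the
// loop counter, so the call fits the uniform/linear contract above.
int sum_linear(const int* a, int n) {
    assert(n >= 0);
    int total = 0;
    #pragma omp simd reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += load_at(a, i);
    }
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs as ordinary scalar C++. Whether a vector call is actually emitted depends on the compiler and its flags.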

Call Site:

Here a for loop iterates from 0 to 39, and every iteration calls foo(). To vectorize this loop, the compiler looks for a matching vector version of the function. The checks the compiler makes in this case are:

1. Is there an elemental version of this function?
2. Do the arguments passed at the call site satisfy the clauses specified at the callee site?

Only when both conditions hold does the compiler generate a vector call to foo() inside the loop.

By default, for a function declared as an elemental function with no mask clause, the compiler generates two vector versions:

1. Non-masked vector function (normal vector function)
2. Masked vector function
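
The checks described above can be pictured with a small model. This is only an illustrative sketch, not the compiler's actual algorithm: the types `ArgKind`, `ParamReq`, `Variant`, and `CallSite` and the rule that a masked variant serves only conditional calls are assumptions chosen to mirror the behavior reported in this article.

```cpp
#include <cstddef>
#include <vector>

// How an argument behaves across consecutive iterations at the call site.
enum class ArgKind { Uniform, LinearUnitStride, Varying };

// What a vector variant requires of the corresponding parameter.
enum class ParamReq { Uniform, LinearUnitStride, Vector };

struct Variant {
    std::vector<ParamReq> params;
    bool masked;  // true for the masked flavor of the variant
};

struct CallSite {
    std::vector<ArgKind> args;
    bool conditional;  // true if the call sits under a condition in the loop
};

// A Vector parameter accepts any argument; Uniform and LinearUnitStride
// parameters accept only arguments with exactly that behavior.
bool satisfies(ArgKind arg, ParamReq req) {
    switch (req) {
        case ParamReq::Uniform:          return arg == ArgKind::Uniform;
        case ParamReq::LinearUnitStride: return arg == ArgKind::LinearUnitStride;
        case ParamReq::Vector:           return true;
    }
    return false;
}

// Assumed rule: masked variants serve conditional calls, non-masked
// variants serve unconditional calls, and every argument must fit.
bool matches(const Variant& v, const CallSite& c) {
    if (v.masked != c.conditional || v.params.size() != c.args.size()) return false;
    for (std::size_t k = 0; k < c.args.size(); ++k) {
        if (!satisfies(c.args[k], v.params[k])) return false;
    }
    return true;
}

// Check 1 is a non-empty variant list; check 2 is the per-variant match.
bool has_suitable_variant(const std::vector<Variant>& variants, const CallSite& c) {
    for (const Variant& v : variants) {
        if (matches(v, c)) return true;
    }
    return false;
}
```

In this model, a declaration with uniform(a) and linear(i) yields a non-masked and a masked `Variant`, and an unconditional call that passes a loop-invariant pointer and the loop counter finds a match. A call whose index is loaded from memory would be classified as `Varying` and would find none.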

Case 2

In this case, a second loop is added, as shown below:

$cat testmain.cc

#include"header.h"
int main(){
int a[40], b[40], k;
a[:] = __sec_implicit_index(0);
b[:] = 39 - __sec_implicit_index(0);
#pragma simd vectorlength(4)
for(int i = 0; i < 40; i++)
        foo(b, i);
#pragma simd vectorlength(4)
for(int i = 0; i < 40; i++)
{
        k = b[i];
        foo(a, k);
}
return 0;
}
 

$ icpc testmain.cc header.cc -vec-report2
testmain.cc(4): (col. 8) remark: LOOP WAS VECTORIZED
testmain.cc(5): (col. 13) remark: LOOP WAS VECTORIZED
testmain.cc(7): (col. 1) remark: SIMD LOOP WAS VECTORIZED
testmain.cc(13): (col. 2) remark: No suitable vector variant of function '_Z3fooPii' found
testmain.cc(10): (col. 1) remark: SIMD LOOP WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED

To show that the compiler checks the call site before deciding to generate a vector function call inside the loop, consider the extra for loop added above. The only difference from the first loop is that the new loop passes foo() a second argument, k = b[i], that does not change linearly with unit stride at the call site. Following the checks described in the previous case, the compiler should detect this and should not generate a vector call in the second loop.
The vectorization report confirms it: the foo() call in the second loop is not a vector call, and the vectorizer reports “No suitable vector variant of function '_Z3fooPii' found”. Although a vector version of foo() exists, the compiler determines that it is not applicable here, because of the linear(i) clause (unit stride by default) imposed on the second argument.

Case 3

In the previous case, the foo() call in the second loop was not vectorized because no vector version of foo() matched the second argument passed at the call site. The obstacle was the linear(i) clause. Here the compiler is told explicitly to generate multiple flavors of the elemental function:

1. One with the linear clause
2. Another without the linear clause (as shown in the code below)

$cat header.h
__declspec(noinline,vector(uniform(a),linear(i), vectorlength(4)))
__declspec(noinline, vector(uniform(a),vectorlength(4)))
void foo(int *a, int i);
 
$ icpc testmain.cc header.cc -vec-report2
testmain.cc(4): (col. 8) remark: LOOP WAS VECTORIZED
testmain.cc(5): (col. 13) remark: LOOP WAS VECTORIZED
testmain.cc(7): (col. 1) remark: SIMD LOOP WAS VECTORIZED
testmain.cc(10): (col. 1) remark: SIMD LOOP WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED

This time the second loop vectorizes (the compiler generates a vector call to foo()), because there is now also a vector version of foo() that does not require the second argument to be linear with unit stride.
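
Declaring more than one vector flavor of the same function is also possible in OpenMP, where several `declare simd` directives can precede one function. The following is a minimal sketch under that assumption; the OpenMP spelling and the names `load_at` and `sum_indirect` are not from this article.

```cpp
#include <cassert>

// Two assumed OpenMP variants of one function: the first requires a
// unit-stride index, the second accepts any per-lane index values.
#pragma omp declare simd uniform(a) linear(i : 1) simdlen(4)
#pragma omp declare simd uniform(a) simdlen(4)
int load_at(const int* a, int i) {
    return a[i];
}

// Hypothetical caller mirroring the second loop: the index comes from
// memory, so only the variant without linear(i) can serve this call.
int sum_indirect(const int* a, const int* idx, int n) {
    assert(n >= 0);
    int total = 0;
    #pragma omp simd reduction(+ : total)
    for (int j = 0; j < n; ++j) {
        int k = idx[j];
        total += load_at(a, k);
    }
    return total;
}
```

The result is the same whether or not the compiler vectorizes the call. The extra variant only widens the set of call sites for which a vector call is possible.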

Case 4

$cat header.h
__declspec(noinline, vector(uniform(a), linear(i), vectorlength(4), mask))
void foo(int *a, int i);
$cat testmain.cc
#include"header.h"
int main(){
int a[40], b[40], k;
a[:] = __sec_implicit_index(0);
b[:] = 39 - __sec_implicit_index(0);
#pragma simd vectorlength(4)
for(int i = 0; i < 40; i++)
{
        if(a[i]>10)
           foo(a, i);
}
return 0;
}
$ icpc testmain.cc header.cc -vec-report2
testmain.cc(4): (col. 8) remark: LOOP WAS VECTORIZED
testmain.cc(7): (col. 1) remark: SIMD LOOP WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED

By default the compiler will generate two versions of the vector function:

1. Non-masked vector function (normal vector function)
2. Masked vector function

But if the developer knows that the function will be called only under a condition at every call site, they can specify a mask clause when declaring the elemental function. The compiler then generates only the masked version of the vector function, in addition to the scalar version, and not the non-masked vector function. This can save considerable binary size when the application contains many potential elemental functions.
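
OpenMP offers a comparable control through the `inbranch` clause of `declare simd`, which states that the function is always called from inside a conditional statement of a SIMD loop. The sketch below assumes that correspondence with the mask clause; the names `load_at` and `sum_above` are hypothetical.

```cpp
#include <cassert>

// Assumed OpenMP counterpart of the mask clause: inbranch requests only
// the masked vector variant, for calls made under a condition.
#pragma omp declare simd uniform(a) linear(i : 1) simdlen(4) inbranch
int load_at(const int* a, int i) {
    return a[i];
}

// Hypothetical caller mirroring the conditional loop: foo-like call is
// executed only for elements greater than the threshold.
int sum_above(const int* a, int n, int threshold) {
    assert(n >= 0);
    int total = 0;
    #pragma omp simd reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        if (a[i] > threshold) {
            total += load_at(a, i);
        }
    }
    return total;
}
```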

Case 5

Specifying a mask clause on the elemental function declaration and then calling the function unconditionally, as shown below, leads to a scalar call to foo() within the loop body. As the vectorization report below shows, no suitable non-masked vector version of foo() is available.
$cat testmain.cc
#include"header.h"
int main(){
int a[40], b[40], k;
a[:] = __sec_implicit_index(0);
b[:] = 39 - __sec_implicit_index(0);
#pragma simd vectorlength(4)
for(int i = 0; i < 40; i++)
        foo(b, i);
return 0;
}
$ icpc testmain.cc header.cc -vec-report2
testmain.cc(5): (col. 13) remark: LOOP WAS VECTORIZED
testmain.cc(8): (col. 9) remark: No suitable vector variant of function '_Z3fooPii' found
testmain.cc(7): (col. 1) remark: SIMD LOOP WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED
header.cc(3): (col. 24) remark: FUNCTION WAS VECTORIZED