# How to overcome 'vector dependence' while Loop Vectorizing on ATOM processor using icc compiler

## How to overcome 'vector dependence' while Loop Vectorizing on ATOM processor using icc compiler

Hi,

I'm trying to optimize the code to use it on ATOM processor. I come across one of the loop, which is not vector dependent (after analysing) but still gives the message saying "vector dependent". Following is the code snippet of the loop.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ll_band=in_buf+band_size*band_size*3;
hl_band=in_buf;
low_coeff=out_buf;

lh_band=in_buf+band_size*band_size;
hh_band=in_buf+band_size*band_size*2;
high_coeff=out_buf+band_size*band_size*2;

for(i=0;i<band_size;i++)
{

low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1)>>1);
high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1)>>1);

/*even coefficients computation*/
for(j=1;j<band_size;j++) //line 671 is this line
{
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
high_coeff[2*j]=lh_band[j]-((hh_band[j-1]+hh_band[j]+1)>>1);
}

/*odd coefficients computation*/
for(j=0;j<band_size-1;j++) //line 679 is this line
{
low_coeff[2*j+1]=2*hl_band[j]+((low_coeff[2*j]+low_coeff[2*j+2])>>1);
high_coeff[2*j+1]=2*hh_band[j]+((high_coeff[2*j]+high_coeff[2*j+2])>>1);
}

low_coeff[2*j+1] = (hl_band[j]<<1) + (low_coeff[2*j]);
high_coeff[2*j+1] = (hh_band[j]<<1) + (high_coeff[2*j]);

ll_band+=band_size;
hl_band+=band_size;
low_coeff+=t_band_size;

lh_band+=band_size;
hh_band+=band_size;
high_coeff+=t_band_size;

}

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When i get the vector report by following command

icc -c -O3 -Wall -march=core2 -vec-report3 xxxxxx.c

i get the following report with respect to the above loops.

xxxxxx.c(671): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(674): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 674 and hl_band line 673.
xxxxxx.c(673): (col. 7) remark: vector dependence: assumed ANTI dependence between hl_band line 673 and high_coeff line 674.

xxxxxx.c(679): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(682): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 682 and low_coeff line 681.
xxxxxx.c(681): (col. 7) remark: vector dependence: assumed ANTI dependence between low_coeff line 681 and high_coeff line 682.

In the first loop, which complains about the dependence between hl_band and high_coeff, are two independent memory locations in in_buf (input_buffer) and out_buffer (output_buffer).

In the second loop, which complains about the depedence between low_coeff and high_coeff, are two independent memort locations in out_buf (first half of the out_buffer is for low_coeff and second half of out_buffer is for high_coeff). So, these two variables are also independent.

As these loops have independent statements, i tried to vectorise forcefully using the #pragma ivdep and #pragma vector always. Then, both loops got vectorised. But in actual the timing to execute these two loops got increased (by few millisec and didn't get reduced for sure).

So, inorder to vectorize the loop by compiler itself in normal way (not forcefully), i modified the code to separate the two statements in the loop which initially said vector dependent by the compiler. The modified code is as follows.

for(i=0;i<band_size;i++)
{

low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1)>>1);
high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1)>>1);

for(j=1;j<band_size;j++) //line 671 is this line
{
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
}

for(j=1;j<band_size;j++) //line 676 is this line
{
high_coeff[2*j]=lh_band[j]-((hh_band[j-1]+hh_band[j]+1)>>1);
}

for(j=1;j<band_size;j++) //line 684 is this line
{
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
}

for(j=0;j<band_size-1;j++) //line 689 is this line
{
high_coeff[2*j+1]=2*hh_band[j]+((high_coeff[2*j]+high_coeff[2*j+2])>>1);
}

low_coeff[2*j+1] = (hl_band[j]<<1) + (low_coeff[2*j]);
high_coeff[2*j+1] = (hh_band[j]<<1) + (high_coeff[2*j]);

}

Now, the compiler says, for first two inner loops:

xxxxxx.c(671): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(676): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: loop skipped: multiversioned.

and next two inner loops, it says:

xxxxxx.c(684): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(684): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(689): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(689): (col. 5) remark: loop skipped: multiversioned.

Then, i used #pragma vector always for last 2 inner loops to get them vectorized.

With these changes, i got the vectorised loop. But, again good amount of timing reduction didn't happen. May be because i'm running the loop mulitple times (4 times of inner loop) to compute the values than earlier (2 times of inner loop).

Can anyone please, help me out with options how to tell the compiler that these variables are not vector dependent and hence vectorize, which yields me some good amount of reduction in execution time for these loops?

Thanks.

Karthik
4 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

Hello Karthik,

There is an Intel c++ compiler forum. See http://software.intel.com/en-us/forums/intel-c-compiler

Can you post your question there? The compiler folks would give you better responses.

Pat

Thanks Pat for directing me to correct forum. I'll post this there.

Karthik

Please don't submit any posts in that thread and take a look at a similar thread on Intel C++ compiler forum ( it is in active state ) . Thanks.