# How to overcome 'vector dependence' while Loop Vectorizing on ATOM processor using icc compiler

Hi,

I'm trying to optimize some code for the ATOM processor. I came across a loop that, by my analysis, has no vector dependence, yet the compiler still reports "vector dependence". The code snippet of the loop follows. All the variables in the loop are of type 'short'.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ll_band=in_buf+band_size*band_size*3;
hl_band=in_buf;
low_coeff=out_buf;

lh_band=in_buf+band_size*band_size;
hh_band=in_buf+band_size*band_size*2;
high_coeff=out_buf+band_size*band_size*2;

for(i=0;i<band_size;i++)
{

low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1)>>1);
high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1)>>1);

/*even coefficients computation*/
for(j=1;j<band_size;j++) //line 671 is this line
{
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
high_coeff[2*j]=lh_band[j]-((hh_band[j-1]+hh_band[j]+1)>>1);
}

/*odd coefficients computation*/
for(j=0;j<band_size-1;j++) //line 679 is this line
{
low_coeff[2*j+1]=2*hl_band[j]+((low_coeff[2*j]+low_coeff[2*j+2])>>1);
high_coeff[2*j+1]=2*hh_band[j]+((high_coeff[2*j]+high_coeff[2*j+2])>>1);
}

low_coeff[2*j+1] = (hl_band[j]<<1) + (low_coeff[2*j]);
high_coeff[2*j+1] = (hh_band[j]<<1) + (high_coeff[2*j]);

ll_band+=band_size;
hl_band+=band_size;
low_coeff+=t_band_size;

lh_band+=band_size;
hh_band+=band_size;
high_coeff+=t_band_size;

}

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When I generate the vector report with the following command

icc -c -O3 -Wall -march=core2 -vec-report3 xxxxxx.c

I get the following report for the loops above.

xxxxxx.c(671): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(674): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 674 and hl_band line 673.
xxxxxx.c(673): (col. 7) remark: vector dependence: assumed ANTI dependence between hl_band line 673 and high_coeff line 674.

xxxxxx.c(679): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(682): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 682 and low_coeff line 681.
xxxxxx.c(681): (col. 7) remark: vector dependence: assumed ANTI dependence between low_coeff line 681 and high_coeff line 682.

In the first loop, the report complains about a dependence between hl_band and high_coeff, but these point into two independent buffers: in_buf (the input buffer) and out_buf (the output buffer).

In the second loop, it complains about a dependence between low_coeff and high_coeff, which point to two independent memory regions of out_buf (the first half of out_buf is for low_coeff and the second half is for high_coeff). So these two variables are also independent.

As the statements in these loops are independent, I tried to force vectorization using #pragma ivdep and #pragma vector always. Both loops were then vectorized, but the time to execute the two loops actually increased by a few milliseconds rather than decreasing.
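For reference, a minimal sketch of how #pragma ivdep is attached to one of these loops. The helper name even_coeffs_row and its parameter list are made up for illustration; the loop body is the even-coefficient statement from the snippet above. The pragma is Intel-specific, so it is guarded here so that other compilers simply skip it. It only tells the compiler to ignore assumed dependences; it does not make the stride-2 store any cheaper.

```c
/* Hypothetical helper: even coefficients of one row for one band pair. */
void even_coeffs_row(short *low_coeff, const short *ll_band,
                     const short *hl_band, int band_size)
{
    int j;

    low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1) >> 1);

    /* Ask icc to ignore *assumed* dependences in the next loop. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (j = 1; j < band_size; j++) {
        low_coeff[2 * j] = ll_band[j] - ((hl_band[j - 1] + hl_band[j] + 1) >> 1);
    }
}
```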

So, in order to get the compiler to vectorize the loops on its own (without forcing it), I modified the code so that the two statements the compiler initially reported as vector dependent are in separate loops. The modified code is as follows.

for(i=0;i<band_size;i++)
{

low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1)>>1);
high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1)>>1);

for(j=1;j<band_size;j++) //line 671 is this line
{
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
}

for(j=1;j<band_size;j++) //line 676 is this line
{
high_coeff[2*j]=lh_band[j]-((hh_band[j-1]+hh_band[j]+1)>>1);
}

for(j=0;j<band_size-1;j++) //line 684 is this line
{
low_coeff[2*j+1]=2*hl_band[j]+((low_coeff[2*j]+low_coeff[2*j+2])>>1);
}

for(j=0;j<band_size-1;j++) //line 689 is this line
{
high_coeff[2*j+1]=2*hh_band[j]+((high_coeff[2*j]+high_coeff[2*j+2])>>1);
}

low_coeff[2*j+1] = (hl_band[j]<<1) + (low_coeff[2*j]);
high_coeff[2*j+1] = (hh_band[j]<<1) + (high_coeff[2*j]);

}

Now, for the first two inner loops, the compiler says:

xxxxxx.c(671): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(676): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: loop skipped: multiversioned.

and for the next two inner loops, it says:

xxxxxx.c(684): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(684): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(689): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(689): (col. 5) remark: loop skipped: multiversioned.

Then I used #pragma vector always on the last two inner loops to get them vectorized.

With these changes, all the loops were vectorized. But again there was no significant reduction in execution time. Maybe that is because I now run four inner loops to compute the values, instead of the earlier two.

Can anyone please help me with options to tell the compiler that these variables are not vector dependent, so that it vectorizes the loops in a way that gives a good reduction in their execution time?

Thanks.

Karthik

Does -xSSSE3 give the same result as -march=core2?  I don't know the history of icc treatment of -march options; besides, you don't say which version you are using.  I guess you didn't need -ansi-alias or restrict.

Does it make a difference whether you use all short data types rather than mixing short and int?  For 2* you could try replacement by addition in case the compiler didn't do it that way.  You don't seem to have a consistent style anyway.

If you don't need j as a global short variable, try making it a local int inside each loop, using C99 or C++ int j=1; and then of course leaving the loop ending conditional as an int expression.  Check whether the loop ending condition is treated as loop invariant.  Without a working example, there's no way for us to know, except that any report of vectorization would imply it's OK.

When the compiler says "seems inefficient" it should hardly be a surprise when your results bear that out.

The compiler should fuse your loops if it finds that is better than splitting them in two.

Vectorization with non-contiguous array items is always problematical. gcc sometimes has better non-vector optimization than icc  (although of course the required options are more complicated).

Yes, -xSSSE3 is giving me the same result as that of -march=core2.

I'm using 13.0.1 version of icc compiler.

I didn't find any difference between mixing short and int (short for all data variables and int for loop iterations), and using short for all variables.

Also, i didn't find any difference between 2* and addition (j+j).

Using local int inside each loop didn't help me either.

Maybe your statement "Vectorization with non-contiguous array items is always problematical." is right, because I was able to vectorize a loop that operates on contiguous memory locations.

My basic question is about the "vector dependence" message. 'high_coeff' and 'hl_band' point (as initialised before the loop) into 'out_buf' and 'in_buf', which are two independent buffers. I'm puzzled why the compiler says the two variables are vector dependent. Are there any options, like assembler directives, to tell the compiler that these two are not vector dependent? How can I resolve this?

Karthik

Are there any options, like compiler directives, to tell the compiler that these two are not vector dependent?

Karthik
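On why the compiler assumes a dependence at all: func receives in_buf and out_buf as plain short * parameters, and nothing in C stops a caller from passing overlapping regions, so when compiling func on its own the compiler has to allow for that. A minimal sketch (the name scale_shift and the values below are made up for illustration) with the same shape as a loop that stores to out[j+1] while reading in[j] shows that the result really does change when the two pointers overlap:

```c
/* Stores to out[j+1] while reading in[j]. If out and in overlap,
   iteration j+1 reads what iteration j just wrote. */
void scale_shift(short *out, const short *in, int n)
{
    int j;
    for (j = 0; j < n - 1; j++) {
        out[j + 1] = (short)(2 * in[j]);
    }
}
```

Called with two separate buffers holding {1, 1, 1, 1}, elements 1..3 of the output all become 2. Called with the same buffer for both arguments, they become 2, 4, 8, because each iteration consumes the previous iteration's store. That is the kind of FLOW dependence the vectorization report is assuming might exist.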

The #pragma simd directives or the CEAN notation over-rule all considerations of vector dependence, but you said originally that you were able to get all loops vectorized (except possibly with the original fused version, where restrict might have been sufficient).

In principle, restrict might be useful to work on multiple arrays per loop even if you choose a scalar optimized unrolled compilation with gcc.

ATOM doesn't have as full a set of blend instructions as other current CPUs which could support these cases where you modify some but not all elements of a vector.

Out of curiosity, what happens when you replace the shifts with *2 and /2?

Jim Dempsey

Back in the Pentium4 days, when using compilers which didn't optimize the choice between add, shift, and multiply, the add was recommended.   Right shift is usually recommended rather than signed /2 if applicable.  Vectorizing compilers would be expected to make such choices automatically.

I was wondering about inconsistencies of style which do give us more to consider, and whether an int shift count or multiplier or divider would alter behavior by forcing promotion from short int.

Do you think it is always possible to rely on the C++ compiler instead of restructuring your own code? We recently had a case where the most aggressive optimization option, /O3, of the Intel C++ compiler did not work well and produced the slowest code compared to the /O1 and /O2 options.

>>...Is there any options like compiler directives to tell the compiler that these 2 are not vector dependent?

Blocks of lines ( for example, 671 - 674 ) in your initial post look like some kind of software pipelining. Take a look at the #pragma swp directive in the Intel C++ Compiler User and Reference Guide.

What about #pragma optimize ( "", off ) before your function and #pragma optimize ( "", on ) after with manual unrolling of these loops and prefetching?

The #pragma directives do vectorize the loop. But, as this overrides the compiler's own analysis, I'm not getting much reduction in the loop's execution time.

My understanding from these experiments is that vectorization is not very efficient because we write the values to non-contiguous memory locations: I compute the array indices (addresses) separately to place the odd and even components in the odd and even positions of the array (buffer). Am I right to conclude so?

I did another experiment in which I removed all these array index calculations and made the statements in the loop very simple; its result didn't support my conclusion above. The code snippet is as follows:

/*even coefficients computation*/
for(j=1;j<band_size;j++) //line 671 is this line
{
low_coeff[j]= ll_band[j];
high_coeff[j]=lh_band[j];
}

/*odd coefficients computation*/
for(j=0;j<band_size-1;j++) //line 679 is this line
{
low_coeff[j+1]=2*hl_band[j];
high_coeff[j+1]=2*hh_band[j];
}

xxxxxx.c(671): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(674): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 674 and ll_band line 673.
xxxxxx.c(673): (col. 7) remark: vector dependence: assumed ANTI dependence between ll_band line 673 and high_coeff line 674.

xxxxxx.c(679): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(682): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 682 and hl_band line 681.
xxxxxx.c(681): (col. 7) remark: vector dependence: assumed ANTI dependence between hl_band line 681 and high_coeff line 682.

Here too the compiler says that these two variables are vector dependent, even though they point into two different buffers (in_buf and out_buf).

The #pragma basically overrides these messages and vectorizes the loop, but perhaps not efficiently. Is there any other way to tell the compiler that these two variables are not dependent, so that it vectorizes efficiently?

I also tried declaring the variables initialized from in_buf as 'const', but that did not help.

Now I feel I'm left with the option of restructuring my code to make the memory writes contiguous, or switching to intrinsics.

Karthik

By the way, How big are all these arrays? Could you post a complete simplified test case with declarations including a value for band_size variable?

You haven't shown us how you informed the compiler that hh_band, low_coeff, and high_coeff point to non-overlapping data regions.   However you did it, the message didn't get through.

Once again, frequently used alternatives include:

buffer definitions local to compilation unit

short int *restrict hh_band,.....

#pragma ivdep

#pragma simd vectorlength(16)  // or any other number which may be accepted

compiler options to ask for assumption of standard compliance:  -std=c99 -ansi-alias

noalias options more aggressive than standard
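As a rough illustration of the restrict alternative in the list above, here is a minimal sketch in which the inner loop is moved into a helper whose pointer parameters are all restrict-qualified. The helper name even_coeffs_restrict and its signature are assumptions made for this example, not code from the thread; restrict requires C99 (for example via the -std=c99 option mentioned above). Whether icc 13 then vectorizes the stride-2 stores efficiently on Atom is not something this sketch can promise.

```c
/* Hypothetical helper: even coefficients of one row for both band pairs.
   Each restrict qualifier promises that, inside this function, the object
   is accessed only through that pointer. */
void even_coeffs_restrict(short *restrict low_coeff,
                          short *restrict high_coeff,
                          const short *restrict ll_band,
                          const short *restrict hl_band,
                          const short *restrict lh_band,
                          const short *restrict hh_band,
                          int band_size)
{
    int j;
    for (j = 1; j < band_size; j++) {
        low_coeff[2 * j]  = ll_band[j] - ((hl_band[j - 1] + hl_band[j] + 1) >> 1);
        high_coeff[2 * j] = lh_band[j] - ((hh_band[j - 1] + hh_band[j] + 1) >> 1);
    }
}
```

The promise is the caller's responsibility: passing overlapping regions to such a function is undefined behavior.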

Each of these arrays has 4096 elements. Here is the function which I'm trying to vectorise.

void func(short *in_buf, short *out_buf, short band_size)
{
    short *low_coeff, *high_coeff;
    short *ll_band, *hl_band, *lh_band, *hh_band;
    short t_band_size, j, i;

    t_band_size = band_size*2;

    ll_band = in_buf + band_size*band_size*3;
    hl_band = in_buf;
    low_coeff = out_buf;

    lh_band = in_buf + band_size*band_size;
    hh_band = in_buf + band_size*band_size*2;
    high_coeff = out_buf + band_size*band_size*2;

    for(i=0; i<band_size; i++)
    {
        low_coeff[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1)>>1);
        high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1)>>1);

        /*even coefficients computation*/
        for(j=1; j<band_size; j++) //line 671 is this line
        {
            low_coeff[2*j] = ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1)>>1);
            high_coeff[2*j] = lh_band[j] - ((hh_band[j-1] + hh_band[j] + 1)>>1);
        }

        /*odd coefficients computation*/
        for(j=0; j<band_size-1; j++) //line 679 is this line
        {
            low_coeff[2*j+1] = 2*hl_band[j] + ((low_coeff[2*j] + low_coeff[2*j+2])>>1);
            high_coeff[2*j+1] = 2*hh_band[j] + ((high_coeff[2*j] + high_coeff[2*j+2])>>1);
        }

        low_coeff[2*j+1] = (hl_band[j]<<1) + (low_coeff[2*j]);
        high_coeff[2*j+1] = (hh_band[j]<<1) + (high_coeff[2*j]);

        ll_band += band_size;
        hl_band += band_size;
        low_coeff += t_band_size;

        lh_band += band_size;
        hh_band += band_size;
        high_coeff += t_band_size;
    }
}

It is called with these parameters:

func(in_buf, out_buf, 32); //band_size = 32

Karthik

I didn't inform the compiler explicitly that hl_band and high_coeff are non-overlapping memory regions. I initialised these two variables from different buffers. Won't the compiler then treat them as being in non-overlapping memory regions? How do I explicitly inform the compiler about it?

I have used restrict (short *restrict hh_band...), #pragma ivdep and #pragma simd. #pragma ivdep and #pragma simd vectorize the loop, but there is no change in the execution time of these loops. So I'm assuming that the vectorization obtained with the #pragma is not efficient. restrict didn't help me.

Karthik

>>...Isn't the compiler going to consider these 2 variables in non-overlapping memory regions? How do i explicitly

There are several Intel C++ compiler options for rearranging memory and you could try these options. Since your array sizes are 4096 elements ( short type / 8192 bytes for each array ) you possibly have some issues related to L2 cache line and VTune could provide you additional technical details.

You may get some advantage from __declspec(align(32)) at the point of definition of the buffers, if you also specify __assume_aligned(.... for those just before the loops to be optimized.  #pragma vector aligned is like vector always with the additional implied assertion that all operands are 16-byte aligned.

The advantage of 32-byte over 16-byte alignment varies with CPU model, and I haven't seen it documented.  It's not directly associated with the chosen instruction set architecture, so I don't know about atom.
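A minimal sketch of the alignment idea on Linux, under a few assumptions: the buffer names in_data and out_data, the helper names, and the size constant are invented for this example; the GCC-style __attribute__((aligned(16))) is used in place of __declspec; and the GCC builtin __builtin_assume_aligned stands in for icc's __assume_aligned. Note that the attribute belongs on the buffer definition. Placed on a pointer variable it aligns only the pointer itself, not the data it points to.

```c
#include <stdint.h>

#define N_ELEMS 4096

/* Alignment is requested where the storage is defined. */
static short in_data[N_ELEMS]  __attribute__((aligned(16)));
static short out_data[N_ELEMS] __attribute__((aligned(16)));

/* Returns 1 if p is 16-byte aligned, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Inside a function that only sees pointers, the alignment has to be
   asserted again, because the attribute on the definition is not visible. */
void copy_aligned(short *out, const short *in, int n)
{
    short *o = (short *)__builtin_assume_aligned(out, 16);
    const short *s = (const short *)__builtin_assume_aligned(in, 16);
    int j;

    for (j = 0; j < n; j++) {
        o[j] = s[j];
    }
}
```

Alignment information can help the compiler pick aligned loads and stores, but it says nothing about aliasing, so on its own it would not be expected to remove a "vector dependence" remark.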

Thanks for your valuable inputs.

The byte alignment option __attribute__((aligned(16))) (as I'm on Linux) didn't help me to vectorise the loop. I'm still getting the 'vector dependence' between those two variables. Are there any other options to make this loop vectorise?

I'm now trying intrinsics to vectorize this loop.

Karthik

Karthik, Could you post a complete list of command line options and a list of #pragma directives for Intel C++ compiler you currently use?

I use this command to generate the vector report: icc -c -Wall -xSSSE3 -vec-report3 xxxxxx.c

I declare the variables as follows to align memory:

short *ll_band __attribute__((aligned(16))); //declared in the same way for the other memory variables in the loop

I used the following #pragma directives; the result of each is given after it.

#pragma simd : vectorized the loop

#pragma vector aligned : didn't vectorize, says vector dependence between 2 variables

#pragma ivdep : vectorized the loop

#pragma vector always : didn't vectorize, says vector dependence between 2 variables

Currently I'm using #pragma simd

Whenever I got the compiler message LOOP WAS VECTORIZED by using the #pragma directives, I didn't get a substantial (reasonable) reduction in execution time when I ran it. So I concluded that the vectorization was not efficient.

Karthik

>>...Whenever i got compiler message LOOP WAS VECTORIZED by using the #pragma directives, i didn't get the substantial
>>(reasonable) amount of execution timing reduction when i ran it. So, i concluded that the vectorization was not efficient.
>>

Here is a generic comment: I don't think Intel would waste time and resources on some technology that has no effect.

Could you try to reduce size of your input arrays, for example to 256 elements, and complete a set of tests increasing arrays sizes by 64?

>>icc -c -Wall -xSSSE3 -vec-report3 xxxxxx.c

I don't see any optimization options in your command line. Could you try /O2 and /O3 options?

icc assumes -O2 unless -g or another value of -O are set.  If anything is vectorized, it's clear that -O2 or -O3 are in effect.

#pragma simd includes the effects of #pragma ivdep and #pragma vector always, plus ignoring any proven dependencies, of which there shouldn't be any if you have shown everything.

As you saw, #pragma vector always|aligned don't take effect unless the compiler has convinced itself there are no dependencies.

As you also saw, the compiler's analysis of vectorization effectiveness is usually accurate when you specify the appropriate architecture.

I don't remember whether we covered #pragma vector nontemporal, which tells the compiler to avoid using cache for stores, as this may be effective for vectors of the length you mentioned (provided that you don't access the data again while they could have remained in cache).

Sorry, I forgot that optimization for speed ( /O2 ) is a default option.

This is a minor observation or suggestion for improving your code. Data layout has a major impact on performance. I notice that you have arrays low_coeff and high_coeff but are filling them in using a stride of two (even indexes and odd indexes). Consider splitting these arrays into four stride-1 arrays containing the evens and odds: even_low_coeff, odd_low_coeff, even_high_coeff, odd_high_coeff. Doing so will eliminate a scatter/gather and make much more efficient use of memory bandwidth.

Jim Dempsey
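The split layout suggested above can be sketched as follows for the low band pair (the high pair would be identical with lh_band and hh_band). The helper name lift_row_split and the array names even_low and odd_low are made up for this example; even_low[j] corresponds to low_coeff[2*j] and odd_low[j] to low_coeff[2*j+1] in the original code. This is only a sketch of the idea, and it assumes the code consuming the output can work with two separate arrays.

```c
/* Hypothetical stride-1 version of one row of the low band pair.
   restrict (C99) states that the four arrays do not overlap. */
void lift_row_split(short *restrict even_low, short *restrict odd_low,
                    const short *restrict ll_band,
                    const short *restrict hl_band, int band_size)
{
    int j;

    /* Even coefficients: every store is to consecutive elements. */
    even_low[0] = ll_band[0] - ((hl_band[0] + hl_band[0] + 1) >> 1);
    for (j = 1; j < band_size; j++) {
        even_low[j] = ll_band[j] - ((hl_band[j - 1] + hl_band[j] + 1) >> 1);
    }

    /* Odd coefficients: neighbours 2*j and 2*j+2 become j and j+1. */
    for (j = 0; j < band_size - 1; j++) {
        odd_low[j] = 2 * hl_band[j] + ((even_low[j] + even_low[j + 1]) >> 1);
    }

    /* Last odd coefficient has no right-hand neighbour. */
    odd_low[j] = (hl_band[j] << 1) + even_low[j];
}
```

With this layout both loops read and write unit-stride data, which is the case the thread reports as vectorizing without any #pragma.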

Sorry, I left out the -O option. I was using -O3 initially and changed to -O2 to see its effect, since you mentioned above that -O2 sometimes gives better results than -O3. For this loop, I didn't see any difference. Currently I'm using -O2.

I tried #pragma vector nontemporal. It didn't help with vectorization.

The memory for the 4096 elements (array size of 4096) is allocated at run time. So, in order to try your suggestion, I declared the arrays locally with size 256 and compiled. Here the compiler with -O3 says "existence of vector dependence" (it doesn't say between which variables), and with -O2 it says "vectorization possible but seems inefficient". The operation causing these messages is the array access with indices j-1 and j:
low_coeff[2*j]= ll_band[j]-((hl_band[j-1]+hl_band[j]+1)>>1);
high_coeff[2*j]=lh_band[j]-((hh_band[j-1]+hh_band[j]+1)>>1);
Adding #pragma vector always vectorizes this loop. I also tried changing the array indices to j and j (as a check). Then the loop is vectorized without any #pragma directives.
Increasing the array size in steps of 64 also works fine. But the compiler treats the array accesses with indices j-1 and j as a dependency. In this case the compiler does not consider the 'high_coeff' and 'hl_band' variables to be vector dependent.
My question with respect to this implementation is:
How do I inform the compiler that the 'high_coeff' variable has no relation to the 'hl_band' variable in my original code, where these variables are initialized from 'out_buf' and 'in_buf'?

Karthik

>> Here is a generic comment: I don't think Intel would waste time and resources on some technology that has no effect.
I agree with you. What I meant was that the #pragma directives were forcing the vectorization, which might be why I was not getting a reasonable reduction in execution time. The write I was doing was a non-contiguous array write (odd values were written at computed odd indices of the array and even values at computed even indices), and that might be causing the adverse effect. As Tim mentioned above, vectorization with non-contiguous array items is always problematical.
In another loop in my code, the writes went down a column (non-contiguous writes), and the reads needed to compute those elements also went down a column; that loop was not vectorized. I modified it to read along a row and write along a row, which makes the memory writes contiguous. Then the loop was vectorized without #pragma directives.

So I'm thinking of restructuring the code or using intrinsics: compute the odd and even coefficients separately and then do an interleaved write. This makes the memory writes contiguous and hence may reduce the execution time.

What is your view on this? Do you agree that the non-contiguous memory write is causing this problem? Or would you say we can still vectorize this loop efficiently despite the non-contiguous writes?

Karthik

Hi Karthik,

Sure. One good rule for efficient vectorization is to make the memory accesses CONTIGUOUS. I would also suggest the "-fno-alias" compiler option, but you should be sure that there is no actual aliasing in the source files you compile with this option.

Regards,

Sukruth H V

This is a long story already and we're back to almost the same question:

>>...
>>My question with respect to this implementation is:
>>How do I inform compiler that 'high_coeff' variable has no relation with 'hl_band' variable in my original code...
>>...

You're trying to get rid of an Intel C++ compiler remark about a dependence between two variables. I would try to implement the fastest possible version of that processing, and only a performance evaluation can prove which one that is. If the message is still displayed but the code is the fastest, I would consider my objectives achieved.

Compiler option "-fno-alias" helped to vectorize the loop. Basically using this option, I didn't get the vector dependence message for those 2 variables.

I agree that I was trying to get rid of Intel C++ compiler warning thinking that, by doing so compiler would do the better job for me.

I understood that, to make the code faster, restructuring it is also required sometimes.

In order to gain performance with this loop, I implemented it with intrinsics and found an improvement in performance.
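The thread does not show the intrinsics version that was finally used. Purely as an illustration of the idea proposed earlier (compute the even and odd coefficients separately, then do an interleaved write), here is a minimal SSE2 sketch. The function name interleave_store and the temporary arrays even and odd are assumptions for this example. It uses _mm_unpacklo_epi16 and _mm_unpackhi_epi16 to merge eight even and eight odd values into sixteen contiguous outputs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Writes out[2*j] = even[j] and out[2*j+1] = odd[j] for j in [0, n),
   using contiguous 16-byte stores instead of stride-2 scalar stores. */
void interleave_store(short *out, const short *even, const short *odd, int n)
{
    int j = 0;

    for (; j + 8 <= n; j += 8) {
        __m128i e = _mm_loadu_si128((const __m128i *)(even + j));
        __m128i o = _mm_loadu_si128((const __m128i *)(odd + j));

        /* unpacklo: e0,o0,e1,o1,e2,o2,e3,o3; unpackhi: e4,o4,...,e7,o7 */
        _mm_storeu_si128((__m128i *)(out + 2 * j),     _mm_unpacklo_epi16(e, o));
        _mm_storeu_si128((__m128i *)(out + 2 * j + 8), _mm_unpackhi_epi16(e, o));
    }

    /* Scalar tail for the elements that do not fill a whole vector. */
    for (; j < n; j++) {
        out[2 * j]     = even[j];
        out[2 * j + 1] = odd[j];
    }
}
```

Unaligned loads and stores are used so the sketch is safe for any buffer; if the buffers are known to be 16-byte aligned, the aligned variants _mm_load_si128 and _mm_store_si128 could be substituted.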

Thanks a lot for sharing your valuable information and helping me learn on Intel C++ compiler. This is for the first time I used intel compiler to enhance the performance.

Karthik