vectorization help...

Consider the following loop I am trying to get to vectorize, where pblock is a member pointer to a struct containing several arrays.

for (int i = 0; i < len; i++)
{

ftemp = in[i] * out;

1853: out = pblock->filter1.coeffsL[0] * ftemp
1854: + pblock->filter1.coeffsL[1] * pblock->filter1.indata[0]
1855: + pblock->filter1.coeffsL[2] * pblock->filter1.indata[1]
1856: + pblock->filter1.coeffsL[3] * pblock->filter1.indata[2]
1857: + pblock->filter1.coeffsL[4] * pblock->filter1.indata[3];
1858: pblock->filter1.indata[1] = pblock->filter1.indata[0];
1859: pblock->filter1.indata[0] = ftemp;
1860: pblock->filter1.indata[3] = pblock->filter1.indata[2];
1861: pblock->filter1.indata[2] = out;
1862: ftemp = out;

1871: temp1 = pblock->filter2.coeffsL[0] * ftemp;
1872: temp2 = pblock->filter2.coeffsL[1] * ftemp;
1873: out = temp1 + pblock->filter2.buff[0];
1874: pblock->filter2.buff[0] = (out * pblock->filter2.coeffsL[2]) + temp2;

}

1>..\..\..\src\Filter\filter.cpp(1861): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1861, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1861.
1>..\..\..\src\Filter\filter.cpp(1861): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1861, and pblock line 1860.
1>..\..\..\src\Filter\filter.cpp(1860): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1860, and pblock line 1861.
1>..\..\..\src\Filter\filter.cpp(1860): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1860, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1860.
1>..\..\..\src\Filter\filter.cpp(1860): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1860, and pblock line 1861.
1>..\..\..\src\Filter\filter.cpp(1861): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1861, and pblock line 1860.
1>..\..\..\src\Filter\filter.cpp(1859): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1859, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1859.
1>..\..\..\src\Filter\filter.cpp(1859): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1859, and pblock line 1858.
1>..\..\..\src\Filter\filter.cpp(1858): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1858, and pblock line 1859.
1>..\..\..\src\Filter\filter.cpp(1858): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1858, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1858.
1>..\..\..\src\Filter\filter.cpp(1858): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1858, and pblock line 1859.
1>..\..\..\src\Filter\filter.cpp(1859): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1859, and pblock line 1858.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1859.
1>..\..\..\src\Filter\filter.cpp(1859): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1859, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1858.
1>..\..\..\src\Filter\filter.cpp(1858): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1858, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1861.
1>..\..\..\src\Filter\filter.cpp(1861): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1861, and pblock line 1853.
1>..\..\..\src\Filter\filter.cpp(1853): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1853, and pblock line 1860.
1>..\..\..\src\Filter\filter.cpp(1860): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1860, and pblock line 1853.

1>..\..\..\src\Filter\filter.cpp(1874): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1874, and pblock line 1873.
1>..\..\..\src\Filter\filter.cpp(1873): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1873, and pblock line 1874.
1>..\..\..\src\Filter\filter.cpp(1873): (col. 3) remark: vector dependence: proven ANTI dependence between pblock line 1873, and pblock line 1874.
1>..\..\..\src\Filter\filter.cpp(1874): (col. 3) remark: vector dependence: proven FLOW dependence between pblock line 1874, and pblock line 1873.

What does it all mean? Is there a good doc that explains what these errors mean and how to fix them?
Is this just not going to happen because of the arrays changing within the loop? Or will some combination of temp variables fix this?

Any suggestions much appreciated.
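
For readers new to the terms in the remarks: a FLOW dependence is a read of a location that an earlier statement or iteration wrote (read-after-write), and an ANTI dependence is a write to a location that an earlier statement or iteration read (write-after-read). The sketch below is not the filter code from this post; the function names are hypothetical and only illustrate the three cases. Whether a particular compiler vectorizes any of them depends on the compiler and its options.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// No loop-carried dependence: a[i] depends only on b[i], so iterations
// can run in any order or several at a time.
void scale(std::vector<float>& a, const std::vector<float>& b, float k) {
    assert(b.size() >= a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = k * b[i];
}

// Loop-carried FLOW (read-after-write) dependence: iteration i reads a[i-1],
// which iteration i-1 has just written. This is the kind that blocks
// straightforward vectorization, and it is what a filter's state update creates.
void running_sum(std::vector<float>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] += a[i - 1];
}

// Loop-carried ANTI (write-after-read) dependence: iteration i reads a[i+1]
// before iteration i+1 overwrites it. On its own this is usually harmless,
// because a vector load of the old values can happen before the store.
void add_right_neighbour(std::vector<float>& a) {
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        a[i] = a[i] + a[i + 1];
}
```

In the loop above, indata[] and buff[0] are both read and written on every iteration, which is why each FLOW remark is paired with an ANTI remark on the same two lines.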


drd,
Try using an intermediary array (for the compiler to optimize out)
for (int i = 0; i < len; i++)
{

ftemp = in[i] * out;
fload hack[4];
hack[0] =pblock->filter1.coeffsL[1] * pblock->filter1.indata[0];
hack[1] =pblock->filter1.coeffsL[2] * pblock->filter1.indata[1];
hack[2] =pblock->filter1.coeffsL[3] * pblock->filter1.indata[2];
hack[3] =pblock->filter1.coeffsL[4] * pblock->filter1.indata[3];

out = pblock->filter1.coeffsL[0] * ftemp
+ hack[0] + hack[1] + hack[2] + hack[3];
...
Start with that and then work your way into other areas that may have problems.
Jim Dempsey

www.quickthreadprogramming.com

Extend the use of the local hack[4] array:

for (int i = 0; i < len; i++)
{

ftemp = in[i] * out;
fload hack[4];
hack[0] = pblock->filter1.coeffsL[1] * pblock->filter1.indata[0];
hack[1] = pblock->filter1.coeffsL[2] * pblock->filter1.indata[1];
hack[2] = pblock->filter1.coeffsL[3] * pblock->filter1.indata[2];
hack[3] = pblock->filter1.coeffsL[4] * pblock->filter1.indata[3];

out = pblock->filter1.coeffsL[0] * ftemp
+ hack[0] + hack[1] + hack[2] + hack[3];

hack[0] = ftemp;
hack[1] = pblock->filter1.indata[0];
hack[2] = out;
hack[3] = pblock->filter1.indata[2];

pblock->filter1.indata[0] = hack[0];
pblock->filter1.indata[1] = hack[1];
pblock->filter1.indata[2] = hack[2];
pblock->filter1.indata[3] = hack[3];

hack[0] = pblock->filter2.coeffsL[0] * out;
hack[1] = pblock->filter2.coeffsL[1] * out;
out = hack[0] + pblock->filter2.buff[0];
pblock->filter2.buff[0] = (out * pblock->filter2.coeffsL[2]) + hack[1];

}
Jim Dempsey

www.quickthreadprogramming.com

Hi!

Yes, I would suggest breaking up the dependencies between the current and the previous/next iteration by adding more (possibly precalculated) arrays. Please read Aart Bik's excellent Software Vectorization Handbook for the theory of data dependencies:

http://www.amazon.com/Software-Vectorization-Handbook-Multimedia-Performance/dp/0974364924

Declaring the arrays aligned, and not using structs, may also help avoid confusing the compiler, this way:

__declspec(align(16)) float bufffilter1coeffsL[5];

Recently I saw a similar problem in a filter I used myself, where one needs to calculate

for (i=0;i<4;i++)

x[i] *= a*y[i];

This can be vectorized by precalculating the constants

__declspec(align(16))const float A[4] = {a,a*a,a*a*a,a*a*a*a};

And the following becomes vectorizable:

for (i=0;i<4;i++)

x[i] = A[i]*y[i];

In general this can be extended to a filter of length N, but precalculating A may become computationally costly for large N.
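
One possible reading of this trick, sketched under the assumption that the serial loop carries a running gain that is multiplied by a on every step. The running-gain variable g and all function names below are hypothetical and are not taken from the filter described above.

```cpp
#include <cassert>
#include <cstddef>

// Serial form: g is carried from one iteration to the next, which is a
// loop-carried flow dependence.
void apply_gain_serial(float* x, const float* y, float a, std::size_t n) {
    assert(x && y);
    float g = 1.0f;
    for (std::size_t i = 0; i < n; ++i) {
        g *= a;          // after this line g holds a^(i+1)
        x[i] = g * y[i];
    }
}

// Precompute A[i] = a^(i+1) once, outside the hot loop.
void make_powers(float* A, float a, std::size_t n) {
    assert(A);
    float g = 1.0f;
    for (std::size_t i = 0; i < n; ++i) {
        g *= a;
        A[i] = g;
    }
}

// With the table in hand every iteration is independent of the others.
void apply_gain_table(float* x, const float* y, const float* A, std::size_t n) {
    assert(x && y && A);
    for (std::size_t i = 0; i < n; ++i)
        x[i] = A[i] * y[i];
}
```

The table only pays off if it can be reused across many calls, which matches the caveat about large N.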

Best Regards,

Lars Petter Endresen

Lars,

Thanks for mentioning the __declspec for alignment. I was attempting to write more of pseudo code than supply actual code. I misspelled float too and assumed float when the proper value may be double or other.

Jim

www.quickthreadprogramming.com

If you are certain that there is no data overlap between iterations, and restrict keywords don't resolve the compiler's concern about dependencies, #pragma ivdep may do so. If you then are able to get "vectorization" with SSE4.1 (Penryn required), and performance is not limited by cache misses, you may see a performance gain of perhaps 3%. As others pointed out, it may be helpful if you are able to assure the compiler that the short vector components of your structure are aligned.
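
A minimal sketch of what that looks like on a loop that really has no overlap. The NOALIAS macro and the function name are made up for this example, and the pragma is guarded so that compilers other than Intel's simply skip it. As far as I know, ivdep only tells the compiler to ignore assumed dependences, so it is a promise from the programmer and would not be expected to remove the "proven" dependences reported for the original filter loop.

```cpp
#include <cassert>
#include <cstddef>

// __restrict is a common compiler extension (MSVC, GCC, Clang, Intel);
// fall back to nothing elsewhere. NOALIAS is a hypothetical helper macro.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__INTEL_COMPILER)
#define NOALIAS __restrict
#else
#define NOALIAS
#endif

// dst[i] depends only on dst[i] and src[i]; the only obstacle to
// vectorization is that the compiler cannot see whether dst and src overlap.
void scale_add(float* NOALIAS dst, const float* NOALIAS src, float k, std::size_t n) {
    assert(dst != src);  // the caller promises the buffers do not overlap
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```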

Thank you wise Kung Fu Masters! (Sorry, had to say it once ;) All the stuff I read in the docs is starting to make a little more sense now... unfortunately I only learn via example, and well, TBH, there's not enough in the docs ;) So it looks as though my stack of big blue bound bunnymen books will get even bigger *sigh*.

Unfortunately a combination of all your suggestions doesn't stop the compiler from complaining... and of course it takes an hour to tell me I'm still wrong, so at this point I just need to break it out into a test file and start from scratch.

Quick question though regarding the alignment... I was under the impression that the compiler option "struct member alignment" was aligning those arrays for me... is this not the case? I haven't actually stepped through to look at the addresses yet, perhaps I have too much faith.
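
On the alignment question, a hedged sketch rather than an authoritative answer: my understanding is that a struct member alignment (packing) option caps how strictly members are aligned and does not raise a float array above its natural 4-byte alignment, so 16-byte alignment has to be requested explicitly. The example uses the standard C++11 alignas as a stand-in for the __declspec(align(16)) spelling used elsewhere in this thread; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Without an explicit request the array only gets float alignment.
struct Plain {
    float gain;
    float coeffs[5];
};

// With an explicit request the array starts on a 16-byte boundary and the
// whole struct inherits that alignment.
struct Aligned {
    float gain;
    alignas(16) float coeffs[5];
};

static_assert(alignof(Plain) == alignof(float), "Plain is only float-aligned");
static_assert(alignof(Aligned) == 16, "Aligned carries the 16-byte requirement");
static_assert(offsetof(Aligned, coeffs) % 16 == 0, "coeffs starts on a 16-byte boundary");

// Runtime check, equivalent to stepping through and looking at the address.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

float* checked_coeffs(Aligned& a) {
    assert(is_aligned16(a.coeffs));
    return a.coeffs;
}
```

Calling is_aligned16 on the real struct members is a quick way to confirm what the build settings are actually doing.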

Quoting - drd

Thank you wise Kung Fu Masters! (Sorry, had to say it once ;) All the stuff I read in the docs is starting to make a little more sense now... unfortunately I only learn via example, and well, TBH, there's not enough in the docs ;)

I second that opinion. Aart Bik's book is very good, but one shouldn't have to buy it when you spend that amount of money on a compiler. The docs should offer far more examples to illustrate the problems and solutions. None of the concepts is actually difficult; they are just badly explained. Look at MATLAB's toolbox help for a better example.

Hi
I'd suggest you provide this feedback via a support issue on http://premier.intel.com, requesting additional docs with examples on how to vectorize. Any large software project gets lots of requests, and the development teams need to prioritize based on user input.
I'd point you to a very good article on working with the compiler to vectorize:
"Assessing the accelerator buzz: Tips and Tricks for Intel Compiler vectorization". The author wrote three articles; this is the first. It gives several examples of workarounds to get the compiler to vectorize various C++ loops.
In addition to Aart Bik's book, I recommend The Software Optimization Cookbook, 2nd edition, of which Aart is a co-author. It covers additional topics such as OpenMP, floating-point optimizations, etc. There is also a paper online about changes to the compiler vectorization and optimization framework, available at http://download.intel.com/technology/itj/2007/v11i4/1-inside/1-Inside_the_Intel_Compilers.pdf.
JohnO

Taking a quick look at this segment:

1853:        out = pblock->filter1.coeffsL[0] * ftemp
1854:            + pblock->filter1.coeffsL[1] * pblock->filter1.indata[0]
1855:            + pblock->filter1.coeffsL[2] * pblock->filter1.indata[1]
1856:            + pblock->filter1.coeffsL[3] * pblock->filter1.indata[2]
1857:            + pblock->filter1.coeffsL[4] * pblock->filter1.indata[3];

This raises the question of whether the coeffsL[] member could be changed so that it is indexed from 0 to 3, with the current 0th coefficient kept in a separate member variable. The coeffsL[4] array would then be aligned and have the same stride as indata[], which should enable the compiler to use MULPS and HADDPS (or the SSE4.1 DPPS), or at least make it easier for you to use the relevant intrinsics.
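
A rough sketch of that layout, under stated assumptions: the struct, its member names and the function are hypothetical, and the horizontal sum is done with plain SSE shuffles so the example builds without extra flags on x86-64. HADDPS or DPPS could replace the shuffle sequence where SSE3 or SSE4.1 is available. A scalar fallback is included for non-SSE targets.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define FILTER_SKETCH_HAVE_SSE 1
#endif

// Hypothetical layout: the 0th coefficient lives in its own member so the
// four remaining taps line up one-to-one with indata[0..3].
struct Filter1 {
    alignas(16) float coeffs[4];
    alignas(16) float indata[4];
    float coeff0;
};

// Computes coeff0 * ftemp + sum(coeffs[k] * indata[k]) for k = 0..3.
float filter1_output(const Filter1& f, float ftemp) {
#ifdef FILTER_SKETCH_HAVE_SSE
    assert(reinterpret_cast<std::uintptr_t>(f.coeffs) % 16 == 0);
    __m128 p = _mm_mul_ps(_mm_load_ps(f.coeffs), _mm_load_ps(f.indata)); // MULPS
    __m128 hi = _mm_movehl_ps(p, p);               // (p2, p3, p2, p3)
    __m128 s = _mm_add_ps(p, hi);                  // lane0 = p0+p2, lane1 = p1+p3
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    s = _mm_add_ss(s, s1);                         // lane0 = p0+p1+p2+p3
    return f.coeff0 * ftemp + _mm_cvtss_f32(s);
#else
    float sum = 0.0f;
    for (int k = 0; k < 4; ++k)
        sum += f.coeffs[k] * f.indata[k];
    return f.coeff0 * ftemp + sum;
#endif
}
```

Note that the summation order differs from the scalar expression, so results can differ in the last bits for general inputs.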

Regards,
Igor Levicki

First my apologies for not posting compilable code to begin with, the production code is far more complex, but I did break it down to the root of my problem:

float ftemp, out = 1.0;
int i;

__declspec(align(16)) float data[256] = {0.5};

__declspec(align(16)) float filter1CoeffsL[4] = {1.0, 0.0, 1.0, 0.0};
__declspec(align(16)) float filter2CoeffsL[6] = {1.0, 0.0, 0.0, 1.0, 0.0, 0.0};

__declspec(align(16)) float filter1Buff[2] = {1.0, 1.0};
__declspec(align(16)) float filter2Buff[4] = {1.0, 1.0, 1.0, 1.0};

#pragma ivdep
#pragma vector aligned
for (i = 0; i < 256; i++)
{
float ftemps[8];

ftemp = data[i] * out;

ftemps[0] = filter1CoeffsL[0] * ftemp;
ftemps[1] = filter1CoeffsL[1] * ftemp;
ftemps[2] = ftemps[0] + filter1Buff[0];
ftemps[3] = ftemps[2] * filter1CoeffsL[2];
ftemps[4] = ftemps[3] + ftemps[1];

filter1Buff[0] = ftemps[4];
ftemp = ftemps[2];

ftemps[0] = filter2CoeffsL[1] * filter2Buff[0];
ftemps[1] = filter2CoeffsL[2] * filter2Buff[1];
ftemps[2] = filter2CoeffsL[3] * filter2Buff[2];
ftemps[3] = filter2CoeffsL[4] * filter2Buff[3];

out = filter2CoeffsL[0] * ftemp + ftemps[0] + ftemps[1] + ftemps[2] + ftemps[3];

ftemps[4] = ftemp;
ftemps[5] = ftemps[0];
ftemps[6] = out;
ftemps[7] = ftemps[2];

filter2Buff[0] = ftemps[4];
filter2Buff[1] = ftemps[5];
filter2Buff[2] = ftemps[6];
filter2Buff[3] = ftemps[7];
}

I'm sure it's pretty clear to you guys just by looking at it, and it's clear to me now, that this will not vectorize at all. At least not the way it's currently designed. The real problem is the fact that the loop is "stateful" and relies on values from the previous iteration... the "out" variable for example, and of course the _Buff[] arrays. It doesn't seem to matter how I spin it, this "flow" dependency cannot be overcome. Back to the drawing board :(
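
One common way around a stateful loop, offered only as a sketch: if several independent instances of the filter exist (for example audio channels, which this thread does not state), the sample loop stays serial but the loop across instances has no dependences. The one-pole recurrence, the interleaved layout and every name below are assumptions made for illustration, not the production filter.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kChannels = 4;  // hypothetical number of independent filters

// Per channel: y[n] = b * x[n] + a * y[n-1].
// The outer loop over n carries state and must run in order; the inner loop
// over c touches a different state element each time, so a compiler may be
// able to vectorize it. Samples are interleaved: index = n * kChannels + c.
void process_channels(const float* in, float* out, float* state,
                      float a, float b, std::size_t frames) {
    assert(in && out && state);
    for (std::size_t n = 0; n < frames; ++n) {
        for (std::size_t c = 0; c < kChannels; ++c) {
            const std::size_t idx = n * kChannels + c;
            state[c] = b * in[idx] + a * state[c];
            out[idx] = state[c];
        }
    }
}
```

With a single filter instance this does not apply, and the recurrence itself remains serial.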

You should get some measure of vectorization in the code you posted. In particular,

ftemps[0] = filter2CoeffsL[1] * filter2Buff[0];
ftemps[1] = filter2CoeffsL[2] * filter2Buff[1];
ftemps[2] = filter2CoeffsL[3] * filter2Buff[2];
ftemps[3] = filter2CoeffsL[4] * filter2Buff[3];

and

filter2Buff[0] = ftemps[4];
filter2Buff[1] = ftemps[5];
filter2Buff[2] = ftemps[6];
filter2Buff[3] = ftemps[7];

are likely candidates for vectorization.

Jim Dempsey

www.quickthreadprogramming.com

Also, eliminate ftemps[4:7] (use a different temp in place of ftemps[4]).

Jim Dempsey

www.quickthreadprogramming.com
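
A sketch of how the second filter stage of the reduced test case might look with ftemps[4:7] removed. The function wrapper and the prod[] name are hypothetical; the values written to the buffer are the same ones the posted loop stores, including buff[1] and buff[3] receiving products rather than raw buffer values. Whether the four-wide multiply is actually vectorized depends on the compiler and its options.

```cpp
#include <cassert>

// coeffs points at 5 usable coefficients, buff at 4 state values.
float filter2_step(const float* coeffs, float* buff, float ftemp) {
    assert(coeffs && buff);

    // Four independent products: the candidate for a packed multiply.
    float prod[4];
    for (int k = 0; k < 4; ++k)
        prod[k] = coeffs[k + 1] * buff[k];

    const float out = coeffs[0] * ftemp + prod[0] + prod[1] + prod[2] + prod[3];

    // State update written straight into the buffer, with no second set of
    // temporaries in between.
    buff[0] = ftemp;
    buff[1] = prod[0];
    buff[2] = out;
    buff[3] = prod[2];
    return out;
}
```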

ahah! yes indeed, yes indeed... with a little fudging ;) Thanks!
