Is there a way to vectorize that?

Community Admin:

Hello guys,


I am trying to write a fast version of the following matrix operation:


for(all rows){
    for(all cols){
        if(matrix(row, col) < 0.04045){
            matrix(row, col) /= 12.92;
        }else{
            matrix(row, col) = pow(((matrix(row, col)+0.055)/1.055), 2.4);
        } // if
    } // for col
} // for row
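In compilable form, assuming a row-major float buffer (the function name and layout are illustrative, not my actual Image class), the loop is roughly:

```cpp
#include <cmath>

// Compilable sketch of the loop above over a row-major float buffer
// (names and layout are assumptions, not the actual Image class).
void srgb_to_linear(float* data, int rows, int cols) {
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            float v = data[row * cols + col];
            if (v < 0.04045f) {
                v /= 12.92f;
            } else {
                v = std::pow((v + 0.055f) / 1.055f, 2.4f);
            }
            data[row * cols + col] = v;
        }
    }
}
```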



Can this be vectorized? Thanks in advance for any suggestions.


Alex

gordan:

Not with that else block in there. If you can do away with it, then depending on what matrix() is, you may get it to vectorize.

To speed it up further, multiply with the reciprocal instead of dividing.

You may also do a lot better to use an array instead of the matrix.
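The reciprocal tip, as a minimal sketch (names illustrative): hoist 1/12.92 out of the loop so each element pays a cheap multiply instead of a divide.

```cpp
// Minimal sketch of the reciprocal trick: one divide folded into a
// constant before the loop, then a multiply per element.
void scale_down(float* data, int n) {
    const float inv = 1.0f / 12.92f;  // computed once, outside the loop
    for (int i = 0; i < n; ++i)
        data[i] *= inv;               // multiply instead of divide
}
```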

Community Admin:
wordy79@hotmail.com: Not with that else block in there. If you can do away with it,


unfortunately no, that's my problem.


wordy79@hotmail.com: then depending on what matrix() is, you may get it to vectorize.


matrix is a 2D array of floating-point data.

wordy79@hotmail.com: You may also do a lot better to use an array instead of the matrix.


You mean a 1D array vs a 2D matrix? Why would that make a difference?

Tim Prince:

icc 10.0 has added a masked powf() function in svml. As there is no double counterpart (masked or not), it looks like the support you would need for double is not there. As you would need aggressive optimizations (default or -fp-model fast=2) to invoke vectorization here, you would expect the compiler also to convert your divisions to multiplication by reciprocal.
It may be that the library team didn't find a generally satisfactory way to write a vector pow() (double) function. There is an exp2() (not masked), and log2() is present in both masked and unmasked versions. If you can accept the limitations of substituting those functions for pow(), you may be able to approximate the result you requested.

gordan:
You mean a 1D array vs a 2D matrix? Why would that make a difference?

I mean a 2D array (e.g. double foo[10][10]) instead of an abstract data type.

Either way, it's academic if you can't re-design to lose the else part. :-(

Community Admin:

> icc 10.0 has added a masked powf() function in svml.

> As there is no double counterpart (masked or not), it

> looks like the support you would need for double is not

> there.

The data I process are actually float (not double). Can it be vectorized in this case? I thought the presence of the if-else made the vectorization almost impossible? Is that right?

Also, what is the difference between a masked and an unmasked function?

Thanks a lot for the informative answer.

Community Admin:
wordy79@hotmail.com:
You mean a 1D array vs a 2D matrix? Why would that make a difference?

I mean a 2D array (e.g. double foo[10][10]) instead of an abstract data type.

When I write foo(row, col), foo is actually an instance of an Image class, and the operator() is simply an inlined "foo.data[row*width + col]". It's not a static array though, it can't be.

wordy79@hotmail.com:
Either way, it's academic if you can't re-design to lose the else part. :-(

I don't quite understand what you mean by "academic"? (English is not my first language)

Thanks for the answer.

Community Admin:
tim18: icc 10.0 has added a masked powf() function in svml. As there is no double counterpart (masked or not), it looks like the support you would need for double is not there.

It's interesting you assumed the algo was working on double, and not float. What made you think that?

Corollary question: Say I've got a template as follows

template <typename T>
void ExpMultiply(Matrix<T> mat)
// Matrix<T> contains a 2D array of type T
{
    for(all rows){
        for(all cols){
            mat(row, col) = exp(mat(row, col) * 1.0);
        }
    }
}

and say I compile the app with parameters for fastest speed. Which svml version of exp() will be called? The one for double or the one for float?

What about the initial function of my first post? Say the matrix data are float - will the computation be made in double or float?

Thanks again

Alex

jimdempseyatthecove:

Alex,


For operations on the entire matrix add an overload for single cell reference. i.e. permit access via either matrix(row, col) or via matrix(cellNumber).

// if a cell is < 0.04045 make it even smaller
for all cellNumber {
    if(matrix(cellNumber) < 0.04045){
        matrix(cellNumber) /= 12.92;
    } // if
} // for cellNumber

// Now catch the cells >= 0.04045
// Avoid repeating binary fraction problem
// Note, any cell that was < 0.04045 is now < 0.04045/12.92
// And therefore is also < 0.04045/10.
// Therefore 0.004045 is known to be > any modified cell above
// as well as known to be < any unmodified cell above
for all cellNumber {
    if(matrix(cellNumber) > 0.004045){
        matrix(cellNumber) = pow(((matrix(cellNumber)+0.055)/1.055), 2.4);
    } // if
} // for cellNumber
Jim Dempsey
www.quickthreadprogramming.com
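Jim's two-pass idea, sketched as compilable C++ over a flat float array (the function and names are illustrative, not from the thread's actual code):

```cpp
#include <cmath>

// Pass 1 shrinks the small cells; anything it touched is now below
// 0.04045/12.92, which is less than 0.004045, so pass 2's watermark
// selects exactly the cells pass 1 skipped. Each loop has a single
// simple branch and is a better candidate for vectorization.
void srgb_two_pass(float* m, int n) {
    for (int i = 0; i < n; ++i)
        if (m[i] < 0.04045f) m[i] /= 12.92f;
    for (int i = 0; i < n; ++i)
        if (m[i] > 0.004045f)
            m[i] = std::pow((m[i] + 0.055f) / 1.055f, 2.4f);
}
```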
Tim Prince:

In your first post, you didn't show the array definitions, but you specify double arithmetic by the use of double constants and math functions. If you had shown float array definitions, we could have asked you why you were promoting to double, and why you did not heed compiler diagnostics about mixed data types preventing vectorization.
The float version of exp() is expf(), and the float version of 1.0 is 1.0f. The compiler will not demote your expressions to float. If you are expecting the compiler to change your double code to float in order to vectorize, you will be frustrated.
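That distinction, sketched in C++ (where std::exp's float overload plays the role Tim ascribes to C's expf):

```cpp
#include <cmath>

// Promotes to double: the 1.0 literal widens v, so the multiply and
// the exp() call happen in double and the result is narrowed back.
float mixed(float v)     { return static_cast<float>(std::exp(v * 1.0)); }

// Stays float end to end: float literal, and std::exp's float overload
// (the C++ counterpart of C's expf) keeps the call single precision.
float all_float(float v) { return std::exp(v * 1.0f); }
```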

Community Admin:

Thanks for the answer.


JimDempseyAtTheCove:

For operations on the entire matrix add an overload for single cell reference. i.e. permit access via either matrix(row, col) or via matrix(cellNumber).


Is it faster to do that?


The reason I am not doing it is because each row is 16-byte aligned, hence there can be up to 15 extra pixels per row. That is not much because most of the matrices I am dealing with are 3000x2000, but still. Do you think I should reconsider? I'd be glad to hear any recommendations on this issue.

JimDempseyAtTheCove:
// if a cell is < 0.04045 make it even smaller
for all cellNumber {
    if(matrix(cellNumber) < 0.04045){
        matrix(cellNumber) /= 12.92;
    } // if
} // for cellNumber

// Now catch the cells >= 0.04045
// Avoid repeating binary fraction problem
// Note, any cell that was < 0.04045 is now < 0.04045/12.92
// And therefore is also < 0.04045/10.
// Therefore 0.004045 is known to be > any modified cell above
// as well as known to be < any unmodified cell above
for all cellNumber {
    if(matrix(cellNumber) > 0.004045){
        matrix(cellNumber) = pow(((matrix(cellNumber)+0.055)/1.055), 2.4);
    } // if
} // for cellNumber
Jim Dempsey
This is very interesting. What is the benefit of doing that? Or more specifically, is this vectorizable when written your way?
Thanks again, and looking forward to your answer.
Alex
Community Admin:
tim18: The float version of exp() is expf(), and the float version of 1.0 is 1.0f. The compiler will not demote your expressions to float. If you are expecting the compiler to change your double code to float in order to vectorize, you will be frustrated.


So it looks like I completely misunderstood then.


So, just to make sure: if I call exp() on a float variable, the variable will first be cast from float to double, the computation will be done in double, and the output will also be double? If I want a float-only pipeline, I have to call expf() instead of exp()?


Thanks a lot for the help!


Alex


jimdempseyatthecove:

Alex,


The array is the same size. With matrix(row, col) you reduce the two indices to a single offset via [row*rowSize+col]. The alternate matrix(cellNumber) accesses the array directly via [cellNumber]. My suggestion was to eliminate the computation for the indexing.
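A hypothetical skeleton of the two accessors (the member names are guesses, not the actual Image class):

```cpp
#include <cstddef>

// Two accessors over the same flat storage: the 2D form pays a
// multiply and an add per call, the 1D form indexes directly.
struct Image {
    float* data;
    std::size_t width;
    float& operator()(std::size_t row, std::size_t col) { return data[row * width + col]; }
    float& operator()(std::size_t cell)                 { return data[cell]; }
};
```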


As for the two loops vs one loop.


The two loops eliminate the else clause (branch inside loop). So each loop will run faster and may vectorize (at least the first loop). As to whether the sum of the two loop times is less than the single loop - give it a try.


Your matrix size may exceed your cache size. Therefore, recode the two loops to do part of the matrix, then enclose the two loops in a major loop to advance through the parts of the matrix. e.g. inner loops work on col, outer loop advances row.
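A sketch of that blocking, assuming a flat float array and an arbitrary, tunable band height:

```cpp
#include <cmath>

// Blocked two-pass variant: process a band of rows at a time so the
// cells pass 1 wrote are still in cache when pass 2 reads them back.
// ROWS_PER_BLOCK is a tuning knob, not a recommended value.
const int ROWS_PER_BLOCK = 64;

void srgb_two_pass_blocked(float* m, int rows, int cols) {
    for (int r0 = 0; r0 < rows; r0 += ROWS_PER_BLOCK) {
        int r1 = (r0 + ROWS_PER_BLOCK < rows) ? r0 + ROWS_PER_BLOCK : rows;
        int lo = r0 * cols, hi = r1 * cols;
        for (int i = lo; i < hi; ++i)        // pass 1 on this band
            if (m[i] < 0.04045f) m[i] /= 12.92f;
        for (int i = lo; i < hi; ++i)        // pass 2 on the same band
            if (m[i] > 0.004045f)
                m[i] = std::pow((m[i] + 0.055f) / 1.055f, 2.4f);
    }
}
```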


Jim


www.quickthreadprogramming.com
Tim Prince:

You could test with #include <tgmath.h> and the -std=c99 option, to see whether that makes exp() change to expf() when only float arguments are present. You never said anything about tgmath, and your expressions are promoted to double anyway by the use of double constants.

Community Admin:
tim18: You could test with #include <tgmath.h> and the -std=c99 option, to see whether that makes exp() change to expf() when only float arguments are present. You never said anything about tgmath, and your expressions are promoted to double anyway by the use of double constants.


I did not know about <tgmath.h>, thanks for the tip. I'll definitely test its usefulness.


As for the use of double constants, that was not my intention. I am actually trying to write generic code that can work on both float and double. Does anyone know a way to write a floating-point constant so that the compiler will interpret it as float when the template argument is float, and double with a double argument? To illustrate, it should be as follows


template <typename T>
T func(T value){
    return exp(value * 8.0);
}


If T=float, the expression should become expf(value * 8.0f);


If T=double, the expression should become exp(value * 8.0);


Any ideas/hints/suggestions _more_ than welcome.


Alex

Tim Prince:

While icc supports mixtures of C99 and C++, other compilers which support similar mixtures do it in a different way. There is not even full compatibility between icc and gcc in this respect. For example, one compiler might accept tgmath.h as an implicit extension to C89 or even C++, while another might accept it only in C99 mode.
In the example you give, where the constant is represented exactly in either float or double, 8.0f gives no problem. The compiler can decide whether it is more efficient to promote to double at compile or run time.
More generally, it is a bit awkward. You might define each of your constants with a name, using a T typedef, e.g. const T eight = 8; or put (T) casts on all your constants.
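One concrete way to write that suggestion, with the conversion folded into the expression (a sketch based on the generic function from the earlier post):

```cpp
#include <cmath>

// The T(8.0) conversion happens at compile time, so a float
// instantiation multiplies by 8.0f and calls the float overload of
// std::exp, while a double instantiation stays entirely in double.
template <typename T>
T func(T value) {
    return std::exp(value * T(8.0));
}
```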

Community Admin:
tim18: The compiler can decide whether it is more efficient to promote to double at compile or run time.


My only worry is speed, hence I want to make sure that all computations are done with the correct datatype, and with no cast operations. Worst case would be an expression like "exp(v[m] * 8.0)" applied to an array of float. If the compiler had to cast each cell to double, make a call to vmldExp2, and cast the result back to float instead of simply calling vmlsExp4, that'd be very bad. Data throughput from SVML functions would be halved, and the two casts would slow things down even further.


Is this issue discussed somewhere in Intel's reference manuals?


tim18: More generally, it is a bit awkward. You might define each of your constants with a name, using a T typedef, e.g. const T eight = 8; or put (T) casts on all your constants.


Any ideas for LUTs? They are not defined in functions' bodies, hence T is not known.


Anyway, thanks a lot for the terrific answers.


Alex

Tim Prince:

There should be a discussion of the problems of vectorizing mixed precision in Aart Bik's book, available from Intel Press.
My point is that multiplying a double array by 8.0f is no problem, as the compiler should promote the constant to double at compile time, at least when that will improve vectorization. In that case, there is no reason to worry about changing the constant data type when you change the array from float to double.
Forcing calculations into double unnecessarily must be avoided, as you have pointed out. We have seen large applications which were designed for x87, which did not vectorize effectively until the problem with double constants mixed with float variables was corrected. This could prevent vectorization, or it could involve splitting up the loop ("distribution"), which you would see by PARTIAL LOOP VECTORIZED reports. If the loop is big enough that the data don't reside in L1 cache between the split loops, the purpose of vectorization may be defeated, even when all the bits are vectorized.
