Optimization problem (beginner questions)

Hello,

I have only recently started using the IPP primitives and the Intel Compiler. I tried optimizing the following simple loop (orig_code):

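int orig_code( unsigned short *data, unsigned short *gain)
{
    int pixel_value = 0;
    int idex = 0;
    int ImageSize = 2000*2000;   /* 2K x 2K pixels */

    /* only gain advances here; data always points at the first pixel */
    for( idex = 0; idex < ImageSize; idex++, gain++ )
    {
        pixel_value = (int)*data;
        pixel_value *= (int)*gain;   /* overwritten each pass, never stored */
    }
    return(0);
}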

Converting this code to a loop that uses the vector data types (slow_code) gave worse results. orig_code takes 10 msec to process a 2K x 2K 16-bit grayscale image; slow_code takes 18 msec. Any suggestions? I was expecting roughly a factor of 8 improvement in the processing time.

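#include <ipp.h>    /* Ipp16u */
#include <dvec.h>   /* Iu16vec8 (SSE2 integer vector classes) */

int slow_code( Ipp16u *data, Ipp16u *dst, Ipp16u *gain)
{
    int idex = 0;
    int nLoop = 2000*2000/8;   /* 8 pixels per 128-bit vector */
    Iu16vec8 *vdata, *vgain, *vdst;

    /* these casts assume all three buffers are 16-byte aligned */
    vdst = (Iu16vec8 *) dst;
    vgain = (Iu16vec8 *) gain;
    vdata = (Iu16vec8 *) data;

    for( idex = 0; idex < nLoop; idex++, vgain++, vdata++, vdst++ )
    {
        *vdst = (*vdata) * (*vgain);   /* 8 multiplies at once, stored to dst */
    }
    return(0);
}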
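
A rough sanity check on the two timings, as a back-of-the-envelope sketch rather than a measurement. The helper name traffic_gb_per_s is made up for illustration. It converts an image size, a count of arrays touched, and a run time into an effective memory traffic figure.

```c
/* Effective memory traffic in GB/s when `arrays` 16-bit images of
   width x height pixels are streamed through in `msec` milliseconds.
   Hypothetical helper, not part of IPP or of the posted code. */
double traffic_gb_per_s(int width, int height, int arrays, double msec)
{
    /* 2 bytes per 16-bit pixel */
    double bytes = (double)width * (double)height * 2.0 * (double)arrays;
    return bytes / (msec * 1e-3) / 1e9;
}
```

slow_code reads data and gain and writes dst, so traffic_gb_per_s(2000, 2000, 3, 18.0) gives about 1.3 GB/s. If that is close to what the machine's memory system can sustain, the loop is limited by memory traffic rather than by arithmetic, and doing 8 multiplies per instruction cannot by itself make it 8 times faster. Note also that orig_code as posted walks only the gain array and stores nothing, so the two functions are not doing the same amount of work.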

Hi,

It seems to me that you are trying to multiply the corresponding pixels of two images. Please look at the ippiMul functions in the IPP manual.

Regards,
Vladimir

Vladimir,

You are correct. The code as written could be done with the image processing functions. I left out some other, more complicated parts of the code that would prevent me from using the image processing functions in general, so I was trying to understand why this simple example does not give an 8 times speed improvement. Any suggestions besides using the image processing library functions?

Thanks.

Hi,

With this much data (2K x 2K 16-bit images in your case), it is important to use the processor's cache efficiently. That means organizing the processing so that you work on a limited amount of data, small enough to fit in the processor's L1 or L2 cache, for as long as you can, and only then move the processing window to another part of the data. For example, processing in a row-by-row fashion should generally improve performance. If you can't use the image processing functions for some reason, you should still be able to take advantage of the Intel Compiler's vectorization. If you are going to run this code on multi-core processors, it is also important to use the OpenMP parallelization supported by the Intel Compiler. I hope this helps improve performance for such tasks.

Regards,
Vladimir
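
The advice above could look roughly like the following. This is a minimal sketch in plain C, not code from the thread: the function name mul_rows_16u and its width/height parameters are assumptions, and products are truncated to their low 16 bits. Each row is processed with a simple unit-stride loop that a vectorizing compiler can handle, and the rows are split across cores with an OpenMP pragma (ignored when OpenMP is not enabled).

```c
#include <stddef.h>

/* Multiply two 16-bit images pixel by pixel, one row at a time.
   Hypothetical helper for illustration only. `restrict` promises the
   compiler that the three buffers do not overlap. */
void mul_rows_16u(const unsigned short *restrict data,
                  const unsigned short *restrict gain,
                  unsigned short *restrict dst,
                  int width, int height)
{
    int y;

    /* Rows are independent, so they can be divided among threads. */
    #pragma omp parallel for
    for (y = 0; y < height; y++) {
        const unsigned short *d = data + (size_t)y * (size_t)width;
        const unsigned short *g = gain + (size_t)y * (size_t)width;
        unsigned short *o = dst + (size_t)y * (size_t)width;
        int x;

        /* Unit-stride inner loop with no aliasing: a candidate for
           compiler auto-vectorization. */
        for (x = 0; x < width; x++) {
            /* unsigned arithmetic avoids signed overflow; the cast
               keeps the low 16 bits of the product */
            o[x] = (unsigned short)((unsigned)d[x] * g[x]);
        }
    }
}
```

For the images in this thread the call would be mul_rows_16u(data, gain, dst, 2000, 2000). For a single multiply, working row by row does not reduce memory traffic on its own. The benefit appears when several processing steps are applied to a row while it is still in cache, which is where the "more complicated parts" mentioned earlier would go.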
