I am wondering whether IPP (7.1) in general, and ippiWarpAffine* in particular, takes advantage of TBB's parallel_for and, if so, how to enable it. When I enabled TBB in OpenCV I got a significant speed boost in warpAffine().

My test images (CT medical images) are 512x512 (8u) and I am using CUBIC interpolation with a destination size of 1590x820. OpenCV (with TBB) is more than 3 times faster than IPP for exactly the same affine transform. It is worth mentioning that in both cases (IPP and OpenCV) I am using Java wrappers under 64-bit Linux (RH6). For IPP I compiled the Java language support (from IPP 7.0.7) against 7.1 and I am using jipp.ip.ippiWarpAffine_8u_C1R(). From OpenCV I am using Imgproc.warpAffine().

Any ideas? Please note that I am new to IPP and TBB and I am evaluating different products in order to find a good basis for a rendering library (64-bit: Win7, Linux, Mac). From Intel I downloaded Intel C++ Composer XE 2013, which bundles IPP and TBB along with MKL and Intel's compiler, and it seems a nice fit for us so far.

Thank You,



Hi Dacian,

ippiWarpAffine is not internally threaded (check Documentation\en_US\ipp\ThreadedFunctionsList.txt for the list of threaded functions), so it cannot benefit from IPP's internal threading. If you want threaded performance, you need to implement the high-level threading yourself, with TBB or by other means.


Thanks Chao,

I did notice ThreadedFunctionsList.txt and I decomposed my affine transform into mirror, rotate and resize. Overall I got pretty good results; however, I am wondering if you can be a little more specific about how I can proceed in using TBB's parallel_for with ippiWarpAffine_8u_C1R(). Are you suggesting decomposing the source image into smaller parts (with some overlap perhaps)?



>>...how I can proceed in using TBB's parallel_for with ippiWarpAffine_8u_C1R().

For TBB's parallel_for you should look at the TBB samples, for example the set of classes for partitioning.

>>...Are you suggesting to decompose the source image in smaller parts (with some overlap perhaps)?

Overall, yes, but you will need to verify that the final result after processing several parts of the image in parallel is identical (rounding errors are possible) to regular processing of the whole image (without partitioning).
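
To make the partition-and-verify idea concrete, here is a minimal, dependency-free C++ sketch. It is not IPP or TBB code. The Image and InverseAffine types and the warpRows, warpWhole, warpBanded and makeTestImage functions are all hypothetical names invented for illustration, and bilinear interpolation is swapped in for CUBIC to keep it short. It splits the destination image (rather than the source) into bands of rows. Each band samples the full source through the inverse transform, so in this sketch no overlap between bands is needed. The loop over bands runs serially here, but each iteration writes a disjoint set of destination rows, which is the kind of body one would hand to tbb::parallel_for. With ippiWarpAffine_8u_C1R the analogous step would presumably be to give each task its own destination ROI; treat that as an assumption to check against the IPP manual.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal 8-bit, single-channel image (not an IPP type).
struct Image {
    int width, height;
    std::vector<std::uint8_t> px;
    Image(int w, int h)
        : width(w), height(h), px(static_cast<std::size_t>(w) * h, 0) {}
};

// Inverse affine map, destination (x, y) -> source (sx, sy):
//   sx = m[0]*x + m[1]*y + m[2],  sy = m[3]*x + m[4]*y + m[5]
struct InverseAffine { double m[6]; };

// Fill destination rows [rowBegin, rowEnd) by bilinear sampling of the
// FULL source image. Pixels that map outside the source are left at 0.
void warpRows(const Image& src, Image& dst, const InverseAffine& inv,
              int rowBegin, int rowEnd) {
    auto at = [&](int x, int y) {
        return static_cast<double>(src.px[static_cast<std::size_t>(y) * src.width + x]);
    };
    for (int y = rowBegin; y < rowEnd; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const double sx = inv.m[0] * x + inv.m[1] * y + inv.m[2];
            const double sy = inv.m[3] * x + inv.m[4] * y + inv.m[5];
            if (sx < 0 || sy < 0 || sx > src.width - 1 || sy > src.height - 1)
                continue;
            const int x0 = static_cast<int>(std::floor(sx));
            const int y0 = static_cast<int>(std::floor(sy));
            const int x1 = std::min(x0 + 1, src.width - 1);
            const int y1 = std::min(y0 + 1, src.height - 1);
            const double fx = sx - x0, fy = sy - y0;
            const double top = at(x0, y0) * (1 - fx) + at(x1, y0) * fx;
            const double bot = at(x0, y1) * (1 - fx) + at(x1, y1) * fx;
            dst.px[static_cast<std::size_t>(y) * dst.width + x] =
                static_cast<std::uint8_t>(std::lround(top * (1 - fy) + bot * fy));
        }
    }
}

// Reference: the whole destination in one call.
Image warpWhole(const Image& src, int dstW, int dstH, const InverseAffine& inv) {
    Image dst(dstW, dstH);
    warpRows(src, dst, inv, 0, dstH);
    return dst;
}

// Partitioned: the destination is split into `bands` groups of rows.
// Iterations are independent, so this loop is the candidate for a
// parallel_for; it is run serially here to avoid any dependency.
Image warpBanded(const Image& src, int dstW, int dstH,
                 const InverseAffine& inv, int bands) {
    assert(bands > 0);
    Image dst(dstW, dstH);
    const int rowsPerBand = (dstH + bands - 1) / bands;
    for (int b = 0; b < bands; ++b) {
        const int begin = b * rowsPerBand;
        const int end = std::min(dstH, begin + rowsPerBand);
        if (begin < end) warpRows(src, dst, inv, begin, end);
    }
    return dst;
}

// Deterministic synthetic test pattern (stands in for a real CT slice).
Image makeTestImage(int w, int h) {
    Image img(w, h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            img.px[static_cast<std::size_t>(y) * w + x] =
                static_cast<std::uint8_t>((x * 7 + y * 13) % 256);
    return img;
}
```

To run the check, build one image with warpWhole and one with warpBanded using the same transform, then compare the two px vectors for equality. In this sketch they match exactly, because every destination pixel is computed from the same absolute coordinates regardless of which band it falls in. A library routine may not behave the same way, so the comparison against the single-call result is still worth running on real data.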
