parallel_for alternative?


I have two independent functions, and I'm trying to run one on each core. The way I'm doing this is by creating a class, copying all images to the class, and running a parallel_for loop with range (0,2,1) so that it splits into two subranges, each one calling a different piece of code.

From my understanding, the parallel_for call should spawn two threads, each one running one of the functions inside operator(), and this should happen across all cores (I'm using an Intel Core 2 Duo). However, this provides no speedup and in fact slows things down CONSIDERABLY. I am using the IPP static libraries. My code is below:

*old code*

//function call
ippiWTFwdCol_B53(src, high, low);
ippiWTFwdRow_B53(high, lxly, hxly);
ippiWTFwdRow_B53(low, lxhy, hxhy);

*end old code*

*tbb attempt*

class Class {
    image high, low, lxly, hxly, lxhy, hxhy;
public:
    Class(image _high, image _low, image _lxly, image _hxly, image _lxhy, image _hxhy)
        : high(_high), low(_low), lxly(_lxly), hxly(_hxly), lxhy(_lxhy), hxhy(_hxhy) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        // each subrange holds one index; the index picks which call to run
        for (int i = r.begin(); i != r.end(); ++i) {
            if (i == 0) ippiWTFwdRow_B53(high, lxly, hxly);
            else        ippiWTFwdRow_B53(low, lxhy, hxhy);
        }
    }
};

//function call
//tbb thread pool is already init
ippiWTFwdCol_B53(src, high, low);
Class c(high, low, lxly, hxly, lxhy, hxhy);
tbb::parallel_for(tbb::blocked_range<int>(0,2,1), c);

*end tbb attempt*

Is there any alternative to parallel_for that could possibly speed things up (like calling the task scheduler directly and writing my own policy), or is there anything I should know about parallelizing IPP wavelet transform calls?


[I'm not an IPP expert.]

What is the cost of copying an "image" object? The example appears to be copying the image objects by value, possibly twice. E.g., high is copied to the constructor parameter _high by value, and _high is copied to the member field high. Perhaps passing by reference would help. E.g.:

class Class {
image& high, &low, ...;
Class(image& _high, image& _low,...


Another possible issue could be the grain size. How long do the ippiWTFwdRow_B53 calls take? If they take less than about 10,000 instructions, you might not see any speedup.
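One way to check the grain size is simply to time each call on its own before parallelizing anything. Here is a minimal sketch using only the C++ standard library; elapsed_ms and sum_buffer are hypothetical names made up for illustration, and sum_buffer is a stand-in workload, not an IPP call.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in milliseconds.
template <typename Work>
double elapsed_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in workload (an assumption, NOT an IPP function): sums a buffer
// so that there is something to time.
inline long long sum_buffer(const std::vector<int>& data) {
    long long total = 0;
    for (int v : data) total += v;
    return total;
}
```

In the real program you would wrap each ippiWTFwdRow_B53 call in a lambda, pass it to elapsed_ms, and compare the serial total against the parallel run.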

I am already using references; sorry I didn't specify that in my example. The WTFwd calls take 4-9 milliseconds on a 1.86 GHz Intel Core 2 Duo, and the time is nearly doubled when I run the TBB version.

I'm trying to understand your code sample in light of the API provided by IPP for wavelet transform calculations. I'm basically reading from the manual here, so perhaps you'll be able to educate me.

The documentation I've got for IPP 5.3 offers two flavors of the ippiWTFwd call, for 1-channel or 3-channel images in 32-bit floats. Each of these calls has a source pointer and four destination pointers, to receive the high and low detail bands of the source image in X and Y. It does this as the result of a single call, whereas the sample code you provided appears to split this two-dimensional process into a sequential pair of 1D filter steps. It is not clear to me how the calls in your example code map to the calls provided by IPP. Is there more to the WT API that I'm missing? Or maybe you're working with a different version?

If your functions are based upon this IPP interface, I could easily see how the work might be doubled if ippiWTFwd got called twice, but I don't see how this can even work because this function does not provide the intermediate images your sample uses. I must be missing something. Any details you can provide will be helpful.

So the WTFwd call that returns 4 destination pointers is in Wavelet transforms (Section 13 ippiman.pdf). Look in Section 15 Image Compression functions -> JPEG2000 Coding -> Wavelet Transform Functions. There you will see that you can separate the WTFwd into rows and columns, where each call returns 2 dest pointers.

If you go back and match this with my pseudocode (I didn't include line widths and ROI sizes in my pseudocode), you can see that the two calls to WTFwdRow should be independent, using two different sources and returning two different pairs of destination pointers.

Thanks for your time; please reply if you see an error in what I'm trying to do.

I've had a chance to bone up a little on the one-dimensional functions, so now your original code makes a lot more sense to me. You're using the static libraries, so internal threading shouldn't be an issue (if you were trying to thread a call to an already threaded function, that could sure explain the slowdown you reported, but no dice here).

The only thing that raises an alarm with me is the parallel calls to ippiWTFwdRow_B53. In the 2D calls there's an opportunity to pass a buffer pointer. My guess is that with the filter convolution and the downsampling, there must be some internal buffering. If the static version gets those buffers from the stack, that should pose no problem. If they're somehow sharing a common buffer, that might suggest potential contention that might explain the slowdown. I need to do some further research.

Meanwhile, if you have a copy of Intel Thread Checker or could get a copy on an evaluation license, it might be interesting to find out if it shows any problems in your code. How big are the regions of interest you're compressing?

The original source image I'm using is 3456x3456 pixels, so high and low will be 3456x1728 pixels each, and lxly, lxhy, hxly, and hxhy will be 3456x864 pixels each.

So is using tbb::parallel_for(tbb::blocked_range<int>(0,2,1), c) and making two cases in the class's operator() the preferred way to run two pieces of independent code, or is there another approach that might make a difference in speedup?

Using parallel_for like you did is the simplest approach right now in TBB. I've used it myself in writing regression tests. I think your example has ample computation to amortize the parallel scheduling overheads.

We've kicked around the idea of writing a template parallel_compose that would more directly deal with cases like yours. E.g., parallel_compose(F,G,H) would run function objects F, G, and H in parallel, and complete when F, G, and H complete.
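To make the idea concrete, here is a rough sketch of what such a parallel_compose could look like. It is written with plain std::thread rather than TBB tasks, so it is a stand-in for illustration and not a TBB API; the name and signature are assumptions taken from the description above. (Later TBB releases do provide tbb::parallel_invoke for this purpose.)

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical parallel_compose: runs each function object on its own
// std::thread and returns once all of them have finished.
// Standard-library sketch only; it uses no TBB scheduler or thread pool.
template <typename... Fs>
void parallel_compose(Fs&&... fs) {
    std::vector<std::thread> workers;
    workers.reserve(sizeof...(Fs));
    // Launch one thread per function object (pack expansion, C++11 style).
    int launch[] = {0, (workers.emplace_back(std::forward<Fs>(fs)), 0)...};
    (void)launch;
    // Wait for every function object to complete.
    for (std::thread& w : workers) w.join();
}
```

For the case in this thread you would pass two lambdas, each wrapping one of the ippiWTFwdRow_B53 calls. Unlike a TBB version, this creates fresh threads on every call, so it only makes sense when each function object does a substantial amount of work.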

There is the option to write in terms of raw tbb::task objects. E.g., create a parent task that spawns tasks for F, G, and H, and then waits on them to complete. But I think that would make a relatively small difference in scheduling overhead. Something more fundamentally odd is happening that needs to be better understood.

How very strange. I wrote a reply to this, noted the day I did it (Feb 18th), but it appears to have disappeared. Let's see what I can recall of it.

I had a chance to exchange mail with one of the IPP developers and learned a little about the wavelet compression filters. I'm pretty sure the reason you don't see scaling is because you're pushing the bandwidth limit of your machine.

It's not clear whether you are using 16-bit or 32-bit samples in your images, but even assuming 16 bits, your starting image is 3456x3456x2 = 22.8 MB. Each of the high and low images out of the column filter are then 11.4 MB.

Each of the ippiWTFwdRow calls takes an 11.4 MB input image, which it runs the wavelet filter over twice, producing a high and a low detail image, each another 11.4 MB, before downsampling into the resultant images, which should be 1728x1728, not 3456x864, since the filtering happens on the other axis. The end result is two more images per call, requiring another 22.8 MB. These numbers exceed the cache sizes of conventional processors by many times, so these operations are constantly going out onto the bus for more memory. Bus-bound apps may actually slow down with more processors because of increased contention.
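The size arithmetic above can be reproduced with a few lines of C++. This is only a sketch: the helper names are made up for illustration, the 16-bit sample size is the same assumption as above, and any cache size you pass to times_larger_than_cache is your own assumption about the machine.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a single-channel image of the given dimensions.
inline std::uint64_t image_bytes(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t bytes_per_sample) {
    return width * height * bytes_per_sample;
}

// Convert bytes to megabytes, counting 1 MB as 1,048,576 bytes
// (the convention that yields the 22.8 MB figure).
inline double to_mb(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// How many times larger a buffer is than a cache of `cache_bytes`.
// The cache size is a caller-supplied assumption, not a measured value.
inline double times_larger_than_cache(std::uint64_t buffer_bytes,
                                      std::uint64_t cache_bytes) {
    assert(cache_bytes > 0);
    return static_cast<double>(buffer_bytes) / static_cast<double>(cache_bytes);
}
```

With 16-bit samples, image_bytes(3456, 3456, 2) is 23,887,872 bytes, about 22.8 MB, and each half-size band is 11,943,936 bytes, about 11.4 MB.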

Sorry it took so long to get this answer out.
