Parallel_For on a 2D matrix that stores some pixel on a global Vector

Parallel_For on a 2D matrix that stores some pixel on a global Vector

Hi everybody,

I want to use a parallel_for in order to do some heavy calculations on a 2D matrix. Those calculations are really heavy, since I am implementing a SIFT algorithm for image processing.

I divided the matrix in big blocks (ex. 120 x 120), and within each block I apply my calculation pixel per pixel using a parallel_for on it. At the end of each calculation (per pixel), I have to decide if that pixel is important or not, and if it's important, I want to accumulate that pixel (with some other stuff related to it) on a vector. To do so, I have to use a "global" vector to accumulate those special points.

For now, I am using a mutex that protects the vector whenever each point is added to it. However, using a mutex in such a way, it could impact negatively on the performances: in this way, what is the best algorithm or strategy that could be used?

I tried to use parallel_reduce, in order to let each block have its own STL vector, and then join the vectors after each block has finished it's computation, but is slower that the version with mutex!

I think that is quite a common problem, can you help me?

Thanks a lot for your time,

Riccardo

5 Beiträge / 0 neu
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.

With synchronisation, the first thing to consider is to amortise its cost by adding all output corresponding to an input block/chunk at the same time instead of individually.

You should also consider concurrent_vector, which is tailor-made for such a "task" (non-TBB meaning). Don't forget the advice above.

But if this really is a "map" (the pattern, not the container, aka. embarrassing parallelism), couldn't you then also use an output vector with the same dimensions, and hence no need to synchronise? You could subsequently use parallel_scan() to compact it, if needed.

Zitat:

Raf Schietekat schrieb:

With synchronisation, the first thing to consider is to amortise its cost by adding all output corresponding to an input block/chunk at the same time instead of individually.

The only thing that is important to me, is that all of the element that can be appended to the vector are CREATED and SAVED DURING the parallel computation, but they will be used (thus read) ONLY AFTER the parallel computation. I don't need to access each element during the computation. And yes, the computation is really embarassing parallelism.

The parallel code is invoked here:

```//the variable f is a variable that contains useful settings for doing the parallel computation
//such settings are only read and not write during the parallel computation
parallel_for(tbb::blocked_range2d<size_t>(0, h / realBlockSize + 1, 0,w / realBlockSize + 1),
ParallelBody(realBlockSize, f, blocksize), tbb::auto_partitioner() );
```

while the parallel_for body is:

```void operator()( const tbb::blocked_range2d<size_t>& r ) const {
FbFilt * my_fb = _fb_block_filterer(my_blockSize, my_blockSize); //created using scalable_allocators
int currentH, currentW;
int initialH = r.rows().begin() * my_realBlockSize;
int initialW = r.cols().begin() * my_realBlockSize;
for( size_t currentIndexH=r.rows().begin(), currentH = initialH; currentIndexH!=r.rows().end(); ++currentIndexH, currentH += my_realBlockSize){
for( size_t currentIndexW=r.cols().begin(), currentW = initialW; currentIndexW!=r.cols().end(); ++currentIndexW, currentW += my_realBlockSize ) {
expensiveComputations(my_f, my_fb, currentH, currentW, 0); //inside this function is analyzed a rectangle
//of pixels and is decided whether to add a new element in a vector contained in my_f or not
}
}
_fb_free_filterer(my_fb); //freed using scalable deallocators
}
```

Zitat:

Raf Schietekat schrieb:

You should also consider concurrent_vector, which is tailor-made for such a "task" (non-TBB meaning). Don't forget the advice above.

I will try using concurrent_vector and I will update you about it. However, I noticed that you should use concurrent_vector only if you need it due to its performance, but how much they are really that worse respect normal stl vectors?

Zitat:

Raf Schietekat schrieb:

But if this really is a "map" (the pattern, not the container, aka. embarrassing parallelism), couldn't you then also use an output vector with the same dimensions, and hence no need to synchronise? You could subsequently use parallel_scan() to compact it, if needed.

I don't understand your advice, can you please be more specific? I need to fill an array appending to it some elements while I am processing a 2D matrix, so each thread is not reading the vector at all, but is appending to it some element according to a custom policy about pixel in 2D matrix.

If there are unclear points, please ask me and I'll be more clear.

Thank you very much for all your support and suggestions.

Riccardo

Zitat:

Riccardo Ancona schrieb:

The only thing that is important to me

I got that, but this is about decreasing the overhead of adding to a shared container, by adding groups of elements instead of individual elements. Well, maybe it's both easier and cheaper to split the output of each block/chunk up into groups of a fixed size in an array on the stack, I'm not sure, but don't just add them one by one.

Zitat:

Riccardo Ancona schrieb:

I will try using concurrent_vector and I will update you about it. However, I noticed that you should use concurrent_vector only if you need it due to its performance, but how much they are really that worse respect normal stl vectors?

I'm not sure I understand what you mean, but it's quite good to concurrently collect results if you're not going to use random access for individual elements, which is its weak point compared to std::vector (I don't have numbers about that). It could be, if you already add enough elements at once, that the difference with a mutex-protected vector becomes quite small, but such amortisation is not always practical or possible, and then concurrent_vector is a clear winner.

Zitat:

Riccardo Ancona schrieb:

I don't understand your advice, can you please be more specific? I need to fill an array appending to it some elements while I am processing a 2D matrix, so each thread is not reading the vector at all, but is appending to it some element according to a custom policy about pixel in 2D matrix.

If you have an output collection with the same dimensions as the input, and a way to flag whether the pixel was "interesting" (special value or maybe a separate flag), then you have no need to lock a mutex: all elements can be assigned concurrently, without conflict. Afterwards, you skip the pixels that are not "interesting", and you can even parallelise compaction if that's what you're doing. I would recommend this if a sufficient fraction of pixels are "interesting" and/or you need random access afterwards, but not if the result is sparse and has a lot of extra information for the "interesting" pixels.

Raf's last suggestion is good. Memory is cheap, mutexes are not. The trade-offs are:

shared vector with internal mutex producing compact vector of interesting points
verses
shared array with no mutex producing sparse vector (an array can be thought of as a vector), post process loop has to skip uninteresting points.

IOW