Compositional efficiency with parallel libraries

Suppose I have written a library that processes image files and that includes a class to generate histograms from an image.  Suppose this histogram generator is written with TBB:

#include <tbb/blocked_range2d.h>
#include <tbb/parallel_reduce.h>

using namespace tbb;

template<typename Image, typename Histogram>
class histogram_generator {
public:
    histogram_generator(const Image& i): _image(i) {}
    // Splitting constructor used by parallel_reduce.
    histogram_generator(const histogram_generator& i, split): _image(i._image) {}
    void generate() {
        blocked_range2d<size_t, size_t> range(0, _image.size(), 0, _image[0].size());
        parallel_reduce(range, *this);
    }
    }
    void operator () (const blocked_range2d<size_t, size_t>& r) {
        for (size_t i = r.rows().begin(); i != r.rows().end(); ++i) {
            for (size_t j = r.cols().begin(); j != r.cols().end(); ++j) {
                _histogram.add(_image[i][j]);
            }
        }
    }
    void join(const histogram_generator& rhs) {
        _histogram.join(rhs._histogram);
    }
    const Histogram& histogram() const { return _histogram; }
private:
    const Image& _image;
    Histogram _histogram;
};

Now suppose someone else decides that they need to process large numbers of in-memory images like so:

typedef vector<histogram_generator<rgb_image<u_char>, rgb_histogram>> genvec;
genvec generators;
blocked_range<genvec::iterator> range(generators.begin(), generators.end());
parallel_for(range, [](const blocked_range<genvec::iterator>& r) { for (auto& e : r) e.generate(); });

We now have a parallel_for() operating over a vector of generators, each of which will in turn call parallel_reduce(). If the interface were separated from the implementation, we might be nesting parallel algorithms like this unwittingly.

I am curious how TBB handles this nesting, or composition, of algorithms that may or may not be parallel. Is there something wrong with naively building up libraries and functions in this way? I suspect it might be more efficient to run the parallel_for over a set of data structures whose members each operate serially, rather than having each member also use TBB internally. I'm also concerned that composing things this way becomes difficult when the implementation details of a library are hidden: you may not be aware whether the library you are using was written with TBB to run in parallel or in a more conventional, serial way. Here, the separation of interface from implementation could prove to be a real burden.

It would be nice if TBB just "did the right thing" here, but I wanted to confirm if this were the case or not.

Bill

When your vector is small, you may want to keep the parallelization inside the generator; when the vector is large, you may want the parallelization outside. You could extend your histogram_generator template so that the constructor also accepts a vector of images. Then generate() can a) determine whether it was given a single image or b) a vector, and in case b) check the number of elements and decide which way to go.
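That dispatch can be sketched in plain standard C++ so it stays self-contained; the strategy enum, the choose_strategy function, and the cutoff value are all illustrative assumptions, not TBB API:

```cpp
#include <cstddef>

// Hypothetical dispatch: with few images, parallelize inside each
// generator (inner parallel_reduce); with many, parallelize across the
// vector and run each generator serially. The cutoff must be tuned
// experimentally for the target machine and image sizes.
enum class strategy { inner_parallel, outer_parallel };

strategy choose_strategy(std::size_t n_images, std::size_t cutoff) {
    return n_images < cutoff ? strategy::inner_parallel
                             : strategy::outer_parallel;
}
```

generate() would then either call parallel_reduce() or fall back to a plain double loop over the pixels, depending on the strategy chosen.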

Jim Dempsey

www.quickthreadprogramming.com

If I may add to that reply, the first thing to realise is that task-based parallelism is a big leap forward in composability over thread-based parallelism. TBB thrives on recursive parallelism, either homogeneous or heterogeneous (nesting different algorithms), and can easily handle many thousands of tasks. The basic approach is to make sure, at each level, that limited amounts of work are not further subdivided but executed as a sequential base case, and so to limit parallel overhead to an acceptable level (limiting parallelism to an outer loop is just one easy way to do that if there is enough parallelism at that level already). Other than that, use the recursiveness to build up parallel slack, the freedom for the scheduler to do its thing, and enjoy the performance boost that comes from activating all those cores. TBB will make a lot of good decisions behind the scenes, even about cache locality. If you're really concerned about getting optimal performance, and you've written a few programs already, you can start thinking about details. In the case of image files, your primary concern will be about I/O, not about nesting algorithms.
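The "sequential base case" idea can be illustrated without any TBB calls: recursively split a range, but stop splitting once a chunk is at or below a grainsize and handle it serially. This mirrors what the grainsize parameter of blocked_range does for TBB's algorithms; the function below is an illustrative sketch, not TBB code:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative recursive sum: chunks at or below `grainsize` form the
// sequential base case; larger ranges are split in half, and the two
// halves are exactly where a task scheduler could run work in parallel.
long recursive_sum(const std::vector<int>& v,
                   std::size_t lo, std::size_t hi,
                   std::size_t grainsize) {
    if (hi - lo <= grainsize)                 // base case: plain serial loop
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;     // split point: parallel slack
    return recursive_sum(v, lo, mid, grainsize)
         + recursive_sum(v, mid, hi, grainsize);
}
```

A larger grainsize means fewer, bigger tasks (less scheduling overhead, less parallel slack); a smaller one means the opposite.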

An alternate way would be to: a) inquire from TBB how many threads are in the thread pool; b) as you partition the histogram, atomically increment a partition counter, remembering to decrement it when you complete the partition; c) prior to partitioning, check the ratio of the number of partition tasks in flight (the atomic counter) against the number of threads in the TBB thread pool. When this ratio gets larger than X (determined experimentally), avoid partitioning. Think of this as a governor/auto-tune. Maybe you can get this done in fewer than 10 statements of code.
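A sketch of that governor in standard C++ (the ratio test, the guard type, and the names are assumptions; the pool size itself would come from TBB, e.g. task_scheduler_init::default_num_threads() in the TBB of that era):

```cpp
#include <atomic>

// Hypothetical auto-tuning governor: track partition tasks in flight
// and stop subdividing once their count reaches x times the pool size.
std::atomic<int> tasks_in_flight{0};

bool should_partition(int pool_size, int x) {
    // Partition only while in-flight tasks stay under x * pool_size.
    return tasks_in_flight.load(std::memory_order_relaxed) < pool_size * x;
}

// RAII guard: increment on entering a partition, decrement on leaving,
// so the counter stays correct even on early returns.
struct partition_guard {
    partition_guard()  { tasks_in_flight.fetch_add(1, std::memory_order_relaxed); }
    ~partition_guard() { tasks_in_flight.fetch_sub(1, std::memory_order_relaxed); }
};
```

Each partitioning step would construct a partition_guard and consult should_partition() before splitting further.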

Jim Dempsey

www.quickthreadprogramming.com
