Range-based for loops with blocked_range2d

Range-based for loops with blocked_range2d

Bill L.'s picture

I have implemented a modest single-dimension histogram using parallel_reduce:

#include <vector>
#include <algorithm>
#include "tbb/blocked_range.h"
using namespace tbb;
using namespace std;
template<typename Integral>
struct histogram {
    histogram(): counts(numeric_limits<Integral>::max() + 1, 0) { }
    histogram(const histogram& h, split):
        counts(numeric_limits<Integral>::max() + 1, 0) { }
    void operator () (blocked_range<typename vector<Integral>::const_iterator>& r) {
        for (const auto& e: r) ++counts[e];
    }
    void join(const histogram& rhs) {
        transform(rhs.counts.begin(), rhs.counts.end(),
                       counts.begin(), counts.begin(), plus<size_t>());
    } 
    vector<size_t> counts;
};
#include "tbb/parallel_reduce.h"
int main() {
    const u_char value = 10;
    vector<u_char> data(1000001, value);
    histogram<u_char> total;
    parallel_reduce(blocked_range<vector<u_char>::const_iterator>(data.begin(), data.end()), total);
    cout << (total.counts[value] != data.size() ? "fail" : "ok") << endl;
}

I am impressed with the concision that is possible with tbb, particularly the range-based for loop support, e.g.:

void operator () (blocked_range<typename vector<Integral>::const_iterator>& r) {
    for (const auto& e: r) ++counts[e];
 }

However, when trying something similar with the blocked_range2d, it appears that I am unable to use range-based for loops and must revert to the old-fashioned way of iterating:

void operator () (const blocked_range2d<size_t, size_t>& r) {
    for (size_t i = r.rows().begin(); i != r.rows().end(); ++i) {
        for (size_t j = r.cols().begin(); j != r.cols().end(); ++j) {
            something_something_dark_side[i][j];
        }
    }
 }

I would have thought that this would be possible:

void operator () (const blocked_range2d<size_t, size_t>& r) {
    for (const auto& i: r.rows()) {
        for (const auto& j: r.cols()) {
            something_something_dark_side[i][j];
        }
    }
 }

but that does not appear to be the case.  Is there any good reason this should be so?  If not, could support for range-based for loops with blocked_range2d be added?

Thanks.

Bill

11 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
Raf Schietekat's picture

Does it make a difference if you make copies of rows() and cols()?

Raf Schietekat's picture

(Accidental double posting deleted.)

Bill L.'s picture

Quote:

Raf Schietekat wrote:

Does it make a difference if you make copies of rows() and cols()?

No, there is no difference.  I am using gcc 4.8.0 on a mac running mountain lion, and if I change the code to this:

void operator () (const blocked_range2d<size_t, size_t>& r) {
    const blocked_range<size_t>& rows = r.rows();
    const blocked_range<size_t>& cols = r.cols();
    for (const auto& i: rows) {
        for (const auto& j: cols) {
            something_something_dark_side[i][j];
        }
    }
 }

I still get the following errors:

t.cc:34:29: error: invalid type argument of unary ‘*’ (have ‘tbb::blocked_range<long unsigned int>::const_iterator {aka long unsigned int}’)
        for (const auto& i: rows) {
                                   ^
t.cc:35:33: error: invalid type argument of unary ‘*’ (have ‘tbb::blocked_range<long unsigned int>::const_iterator {aka long unsigned int}’)
            for (const auto& j: cols) {
                                       ^

Bill L.'s picture

Quote:

Bill L. wrote:

Quote:

Raf Schietekatwrote:

Does it make a difference if you make copies of rows() and cols()?

No, there is no difference.  [...]

I now note that if I change my one-dimensional example to this:

void operator () (const blocked_range<size_t>& r) {
    for (const auto& e: r) ++data[e];
 }

that it fails as well.  It must be that using STL iterators in the parallel reduce and the '()' operator allows for the use of the '*' operator in the for loop.  I had assumed that blocked_range and blocked_range2d (etc.) would behave similar to a container with begin() and end() methods, and hence be usable in range-based for loops.  Does this seem a poor assumption that could not be supported in future versions?

Bill

Raf Schietekat's picture

Sorry, I wanted to make you test something first, because it didn't make sense to me either, but the reason didn't immediately occur to me. (The copy was to get rid of the reference, though, so your rewrite didn't really help there.)

I tried the linear example and a local-only use of the blocked_range2d, and clang++ on OS X very helpfully explained "error: indirection requires pointer operand ('unsigned long' invalid)". So you can use it with pointers and iterators to something, but not with integral index values. If you want to do that, you have to find or make an iterator that dereferences to those values even if they don't really exist anywhere in memory. I'm sure somebody has made one of those before, perhaps in boost, but I'll leave that to you.

So there's nothing TBB should be doing differently here.

Raf Schietekat's picture

With less distraction now I see that I should have scrolled to the right in your first reply, and looked again for additional postings before finishing my reply just above a while later. Your "poor assumption" is actually quite valid, but there's still nothing TBB should or could be doing differently.

Have a look at boost::counting_iterator for what you want to do.

There's an important caveat, though: the compiler may not be able to deduce anymore that it can optimise those loops, perhaps even vectorise them. Often it makes a big difference if you "hoist" the end() out of the loop, as in "for (int i = r.begin(), r_end = r.end(); i != r_end; ++i)", but now you're taxing the compiler even more, so you may end up paying a big price for that syntactical sugar (it will rot your code's performance...). The same suggestion I made above about making a local copy of the range may very well make all the difference here, because it decreases the scope that the compiler must scan for outside interference, and the compiler may now be able to deduce that end() doesn't change after all, and that it can safely apply those optimisations. Still, that leaves you with the greater burden perhaps of having to profile the outcome (at least initially to validate the approach), so it seems safer to forego all that, even do the hoisting thing on top of it, and know that this at least won't come back to bite you.

Perhaps somebody might want to investigate this further and run some benchmarks?

Raf Schietekat's picture

I've had another look at the Standard about that range-based for loop. Apparently it makes copies of the begin() and end() values, so that should relieve you of making your own copy of the thing that exposes them, but you should probably validate that at least once. Then you have to decide between using an adaptor and pervasive use of that special iterator, if you really want that range-based for loop with integers, but I'm not sure it's the right trade-off.

jimdempseyatthecove's picture

>>for (size_t i = r.rows().begin(); i != r.rows().end(); ++i) {

Consider: for (T i = r.rows().begin(); i != r.rows().end(); ++i) {

Where you do not know what T is nor how the container is organized, You won't know how large the iteration space is, you won't know how to split it (TBB won't).

In your case you have "size_t i =..." with the implication that the container has numerical addresses. The end of for statement has ++i, implying n numerical such as n, n+1, n+2, ... to some value that is != end(), which also means (due to ++i) < end(). Therefore you can prefetch the end() immediately before the for, intptr_t iEnd = r.rows().end();, and then use that in your termination clause in the for( but with using < inplace of !=.

This is what Raf is telling you.

This does have a requirement that the container not alter size during processing, I believe this to be the case, you will have to confirm.

Jim Dempsey

www.quickthreadprogramming.com
Raf Schietekat's picture

The syntactical overhead seems minimal, so I might be warming to the idea. But even though the range-based for statement is supposed to make copies of begin() and end(), what would be their type, and what implications has that for the compiler's ability to optimise? Would anybody like to track that down?

#include <stdlib.h> // EXIT_SUCCESS
#include <iostream> // std::cout
#include <type_traits> // std::is_same
#include <string> // std::to_string()
#include <mutex>
#include <boost/iterator/counting_iterator.hpp>
#include "tbb/tbb.h"

template<typename T> using counting_range =
    tbb::blocked_range<boost::counting_iterator<T>>;

using std::cout;
std::mutex m; // for std::cout

int main(int argc, char* argv[]) {
    typedef unsigned char T;

    // easy enough to use...
    tbb::parallel_for(counting_range<T>(0,20), [] (const counting_range<T>& r) {
        std::lock_guard<std::mutex> l(m);
        for (T i : r) cout << std::to_string(i) << " ";
    });
    cout << "\n" ;
    
    // ...but can it be optimised?
    counting_range<T> r(0,20);
    auto begin = r.begin();
    cout << (std::is_same<decltype(begin),T >::value ? "" : "not ") << "T\n" ;
    cout << (std::is_same<decltype(begin),T&>::value ? "" : "not ") << "T&\n" ;
    cout << ((sizeof(begin)==sizeof(T)) ? "" : "not ") << "same size\n" ;
    
    return EXIT_SUCCESS;
}

(Added 2013-06-29) Not to say that it couldn't still be optimised at a lower level. The iterator type is obviously strictly different from the referent type, but it could still be the same thing underneath (as shown by the changed code that now uses unsigned char instead of int and compares sizes to show that the iterator is not a pointer), and perhaps the compiler exposes that in time for the optimiser to run with it. But this still has to be validated for a representative set of compiler environments for it to be used in applications.

Raf Schietekat's picture

BTW, does anybody know whether the compiler is able to optimise the parallel_for overloads with separate first and last parameters? I see that the implementation uses a parallel_for_body that takes a blocked_range by reference and then loops until end(), so that's taking chances, I would think...

Login to leave a comment.