Aggregators

Aggregators

Dmitry Vyukov的头像

Hi,

TBB features aggregator feature. I've posted some thoughts on aggregators here:

http://software.intel.com/en-us/blogs/2013/02/22/combineraggregator-synchronization-primitive

perhaps you may find them interesting. I think some ideas can be applied to TBB, e.g. better aggregation or prefetching.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
17 帖子 / 0 全新
最新文章
如需更全面地了解编译器优化,请参阅优化注意事项
Raf Schietekat的头像

I was looking at aggregator.h first (TBB 4.1 update 2, as in tbb41_20130116oss_src.tgz), and I couldn't help wondering about the memory semantics. Would you agree that the synchronisation through aggregator_operation::status on the way in is ficticious, i.e., that "call_itt_notify(releasing, &(op.status));" in aggregator_ext::execute_impl() and "void start() { call_itt_notify(acquired, &status); }" are at best somewhat misleading? Luckily it's not required, as the state appears to have already been synchronised through aggregate::mailbox (on the way in, that is). Only synchronisation on the way out is through status. I'll probably have to study this further, to make sure that this is just a white lie (but why?) and not a sign of trouble (as with task affinity handling a while ago), but it would be nice to synchronise already (lame pun).

Raf Schietekat的头像

Interesting additional performance gains...

It seems to me that an iterator should be provided to traverse the aggregator operations, with a very limited interface, e.g., only consisting of next(), returning a pointer (to an operation or NULL). It can enforce synchronisation (now explicit through start()/finish(), with start() not really functional at this time), next() doesn't have to be called on the operation anymore (so it cannot be called accidentally after finish()), and the iterator can also add prefetching where available. The interface would probably be transparent for reimplementation with some of the combiner ideas, and should decrease the gap between basic and expert use. I may or may not contribute such a proposal (with odds increasing on expressed interest).

(Added 2013-02-26) For clarity: the first paragraph above is about Dmitry's proposal, the second again about TBB's aggregators... or both.

Alexey Kukanov (Intel)的头像

Citation :

Raf Schietekat a écrit :

I was looking at aggregator.h first (TBB 4.1 update 2, as in tbb41_20130116oss_src.tgz), and I couldn't help wondering about the memory semantics. Would you agree that the synchronisation through aggregator_operation::status on the way in is ficticious, i.e., that "call_itt_notify(releasing, &(op.status));" in aggregator_ext::execute_impl() and "void start() { call_itt_notify(acquired, &status); }" are at best somewhat misleading? Luckily it's not required, as the state appears to have already been synchronised through aggregate::mailbox (on the way in, that is). Only synchronisation on the way out is through status. I'll probably have to study this further, to make sure that this is just a white lie (but why?) and not a sign of trouble (as with task affinity handling a while ago), but it would be nice to synchronise already (lame pun).

Raf,

call_itt_notify() is indeed not a synchronization; it's a notification to Intel(R) Inspector that a synchronization either will happen ('releasing') or has happened ('acquired'). This is necessary for the tool to understand custom synchronization algorithms and do not report false data races. The real synchronization of course happens through the mailbox (on the way in). I can get into deeper details why the notifications are placed the way they are, if you wish.

Citation :

Raf Schietekat a écrit :

It seems to me that an iterator should be provided to traverse the aggregator operations, with a very limited interface, e.g., only consisting of next(), returning a pointer (to an operation or NULL). It can enforce synchronisation (now explicit through start()/finish(), with start() not really functional at this time), next() doesn't have to be called on the operation anymore (so it cannot be called accidentally after finish()), and the iterator can also add prefetching where available. The interface would probably be transparent for reimplementation with some of the combiner ideas, and should decrease the gap between basic and expert use. I may or may not contribute such a proposal (with odds increasing on expressed interest).

I'm afraid that it may also lose some possibilities of the expert use, much reducing it to the "traverse the list once" model that the basic use provides under the hood. The current expert model does not couple traversing through the list and processing its elements, and it's rather good for advanced use. Terry provided some thoughts on that in her blog on the TBB aggregator: http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-commun... (see When to use Aggregator and why use Expert Mode? at the very end of her article).

Raf Schietekat的头像

Citation :

Alexey Kukanov (Intel) a écrit :

call_itt_notify() is indeed not a synchronization; it's a notification to Intel(R) Inspector that a synchronization either will happen ('releasing') or has happened ('acquired'). This is necessary for the tool to understand custom synchronization algorithms and do not report false data races. The real synchronization of course happens through the mailbox (on the way in). I can get into deeper details why the notifications are placed the way they are, if you wish.

I knew that the call_itt_notify() calls are not the synchronisation itself (there are other itt_... calls that merge synchronisation and notification), but I was concerned that their appearance might be indicative of another real synchronisation problem underneath (even if x86 would not be affected), which fortunately didn't turn out to be the case (unlike with the other mailbox, if I remember correctly). This is where source-code documentation counts, if nothing else: to avoid distracting (or sometimes even laborious) second-guessing at the expense of more useful work!

If the motivation for the current implementation is anything else than that the Inspector cannot deal with nonsymmetrical synchronisation (in and out through different pathways), do tell.

Citation :

Alexey Kukanov (Intel) a écrit :

I'm afraid that it may also lose some possibilities of the expert use, much reducing it to the "traverse the list once" model that the basic use provides under the hood. The current expert model does not couple traversing through the list and processing its elements, and it's rather good for advanced use. Terry provided some thoughts on that in her blog on the TBB aggregator: http://software.intel.com/en-us/blogs/2012/05/02/aggregator-a-new-commun... (see When to use Aggregator and why use Expert Mode? at the very end of her article).

OK, multiple passes is a good argument... How about an optional iterator that can be initialised from that pointer parameter? When a single pass suffices, which is in most cases, it can still take out some of the noise. Here's an application to the code in test_aggregator.cpp:

classmy_handler {
    pq_t *pq;
public:
    my_handler() {}
    my_handler(pq_t *pq_) : pq(pq_) {}
   voidoperator()(tbb::aggregator_operation* op_list)const{
#if0
       while(op_list) {
            op_data& request =static_cast<op_data&>(*op_list);
            op_list = op_list->next();
            request.start();
#else
       for(tbb::aggregator_iterator it(op_list); !it.eos(); ++it) {
            op_data& request =static_cast<op_data&>(*it);
#endif
           if(request.tid >=0) pq->push(request.tid);
           else{
                ASSERT(!pq->empty(),"queue should not be empty!");
               intelem = pq->top();
                pq->pop();
                shared_data[elem]++;
            }
#if0
            request.finish();
#endif
        }
    }
};

Dmitry Vyukov的头像

Hi,

Regarding memory fences.

} while (mailbox.compare_and_swap(&op, res) != res);
must be at least release.

pending_operations = mailbox.fetch_and_store(NULL);
must be at least acquire.

void finish() { itt_store_word_with_release(status, uintptr_t(agg_finished)); }
must be at least release.

spin_wait_while_eq(op.status, uintptr_t(aggregator_operation::agg_waiting));
must be at least acquire.

itt_store_word_with_release(handler_busy, uintptr_t(0));
must be at least release.

spin_wait_until_eq(handler_busy, uintptr_t(0));
must be at least acquire.

__TBB_store_with_release(handler_busy, uintptr_t(1));
does not need to be release.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov的头像

>I'm afraid that it may also lose some possibilities of the expert use, much reducing it to the "traverse the list once" model that the basic use provides under the hood. The current expert model does not couple traversing through the list and processing its elements, and it's rather good for advanced use.

Are there any other usages of "batch processing" besides the priority queue? Have you measured performance win? I mean it looks rather esoteric.

>call_itt_notify() is indeed not a synchronization; it's a notification to Intel(R) Inspector that a synchronization either will happen ('releasing') or has happened ('acquired'). This is necessary for the tool to understand custom synchronization algorithms and do not report false data races.

Btw, in ThreadSanitizer we use a different approach. The tool unserstands the atomic operations and memory ordering constraints in the code. This eliminates the need to additional annotations and at the same time protectes from incorrect annotations.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Alexey Kukanov (Intel)的头像

Citation :

Raf Schietekat a écrit :

If the motivation for the current implementation is anything else than that the Inspector cannot deal with nonsymmetrical synchronisation (in and out through different pathways), do tell.

Inspector has no problem to deal with non-symmetrical synchronization, the reason is in the protocol used by the aggregator.

A thread that puts an operation request into the aggregator needs to "synchronize with" a handler thread that will process this request. The real synchronization happens through the mailbox, the linked list where all incoming requests are put via compare-and-swap, and then grabbed by the handler thread for processing. Now, by the Inspector notification rules, the requester thread should issue a "releasing" notification with a tag(*) prior to publishing the work, and the handler thread needs to issue an "acquired" notification with the same tag after it grabbed the work.

The first problem arise with the CAS loop in the request publishing protocol. A thread only knows if publishing was successful after checking CAS return value, while the "releasing" notification should be done before. Thus we had a choice: either to re-issue the notification inside the loop, right before CAS (upfront of the physical synch communication), or only do it once before the whole loop, i.e. upfront fo the logical commnication. For the obvious reasons of reducing overhead caused by possible (but not really necessary) duplications, the second variant was chosen.

Another problem to solve was which tag to use. The first choice that came in is the address of the list head pointer modified with CAS, for all threads that operate with the aggregator. The problem of that is "imaginary synchronization" in Inspector's eyes: as the "releasing" annotation is done in advance and the actual time of request publication is not known, it may happen that a handler thread issues an "acquired" after that "releasing" but the submitted request was not yet in the portion the handler grabbed. In this case, Inspector would think threads are synchronized, but logically they aren't (though certainly there is a CAS sequence that allows to build happens-before relations for a particular interleaving). And by the way, even if on the previous step we would choose to issue "releasing" multiple times inside the CAS loop, the tag problem would still remain. Since false negatives (i.e. real races missed by the tool) are much worse than false positives, I wanted to avoid such "imaginary synchronization", and so suggested to use an address associated with the request for the tag, rather than a head address common for all threads. Meanwhile, the list head address is used for notifications, but only to mark synchronization between subsequent handler threads.

Finally, the place to put "acquired" in the handler thread was basically pre-determined: right after it gets the address of a request and before touching any data associated with this request. All in all, these Inspector annotations describe well the logical synchronization through the mailbox, though do not really match the "physical" protocol of atomic operations.

(*) Typically, a memory address associated with the synchronization object is passed as the argument into notification calls; but Inspector only interprets it as a tag to match "paired" notifications.

Alexey Kukanov (Intel)的头像

Citation :

Dmitry Vyukov a écrit :

Are there any other usages of "batch processing" besides the priority queue? Have you measured performance win? I mean it looks rather esoteric.

Yes, it did give additional performance win for priority_queue in our measurements. Whether esoteric or not, we did not want to disable/hide it in the aggregator.

Citation :

Dmitry Vyukov a écrit :

Btw, in ThreadSanitizer we use a different approach. The tool unserstands the atomic operations and memory ordering constraints in the code. This eliminates the need to additional annotations and at the same time protectes from incorrect annotations.

Certainly different approaches have their own pros and cons, but my knowledge in this area is rather limited so I will restrain from a discussion of this topic.

Alexey Kukanov (Intel)的头像

Citation :

Raf Schietekat a écrit :

OK, multiple passes is a good argument... How about an optional iterator that can be initialised from that pointer parameter? When a single pass suffices, which is in most cases, it can still take out some of the noise.

It needs to be an input_iterator then. Feel free to contribute :) The aggregator is not in active development and we don't have other extension requests for it (perhaps the basic mode suites most people :)), so an optional feature like this iterator have little chances to get attention, unless some code is contributed.

Also note this comment in basic_handler:
    // IMPORTANT: need to advance op_list to op_list->next() before calling request.finish()
 This is important because request.finish() signals the waiting owner of the request about its completion, after which the request object can be deleted at any time, so it's unsafe to use any its field, including next(). [<outdated, supposed to be stroken-through>I am not sure how to handle that in the iterator paradigm, except maybe if operator++ is a noop while operator*, after obtaining the value, advances the iterator to the next node. But that would be against the existing practice and may result in bad surpises.</outdated>]
Update: silly me, I got it now: operator++ would first call next() and then call finish() :)

Raf Schietekat的头像

Citation :

Dmitry Vyukov a écrit :

Regarding memory fences.

[…]

I had the same for mailbox and status. I also agree on the changes for handler_busy (that I have so far neglected). But wouldn't the notifications there cause the same problems that precluded their use with mailbox as described by Alexey?

Citation :

Alexey Kukanov (Intel) a écrit :

Inspector has no problem to deal with non-symmetrical synchronization, the reason is in the protocol used by the aggregator.

[…]

Thanks for the nice explanation.

Citation :

Alexey Kukanov (Intel) a écrit :

It needs to be an input_iterator then. Feel free to contribute :) The aggregator is not in active development and we don't have other extension requests for it (perhaps the basic mode suites most people :)), so an optional feature like this iterator have little chances to get attention, unless some code is contributed.

Perhaps expert mode is just too cumbersome to contemplate right now (for those who use aggregator at all)? If there really is extra performance to be had, better usability should attract more users.

Should it really be an official STL-style iterator, though? Conjuring up out of thin air an end() to compare with just to evaluate end-of-sequence is probably a bit contrived and awkward… like writing code for parallel_do() instead of parallel_while(). The following code works with (a modification of) test_aggregator.cpp:

class aggregator_iterator : public std::iterator<std::input_iterator_tag, aggregator_operation> {
    aggregator_operation* my_op;
    // TODO: keep my_op->next() here as well, or just look it up each time?
    void start() {
        __builtin_prefetch(my_op->next()); // suggested by Dmitry Vyukov
        my_op->start();
    }
public:
    aggregator_iterator(aggregator_operation* op) : my_op(op) {
        __TBB_ASSERT(my_op!=NULL, NULL); // at least this thread's own operation should be there
        start();
    }
    ~aggregator_iterator() { __TBB_ASSERT(!my_op, NULL); } // don't quit before exhausting the list
    aggregator_iterator& operator++() {
        __TBB_ASSERT(my_op!=NULL,NULL); // don't increment past end of list
        aggregator_operation* op = my_op->next(); // call before finish()
        my_op->finish(); // *my_op now invalid
        my_op = op;
        if(my_op!=NULL) start();
        return *this;
    }
    // post-increment has no valid use (just doing "it++;" is bad style anyway)
    aggregator_operation& operator*() {
        __TBB_ASSERT(my_op!=NULL, NULL); // don't dereference when list exhausted
        return *my_op;
    }
#if 0
    bool eos() const { return my_op!=NULL; }
#else
    aggregator_iterator() : my_op(NULL) {}
    bool operator==(const aggregator_iterator &other) const {return my_op==other. my_op; }
    bool operator!=(const aggregator_iterator &other) const {return my_op!=other. my_op; }
#endif
};

Citation :

Alexey Kukanov (Intel) a écrit :

Also note this comment in basic_handler:
// IMPORTANT: need to advance op_list to op_list->next() before calling request.finish()
This is important because request.finish() signals the waiting owner of the request about its completion, after which the request object can be deleted at any time, so it's unsafe to use any its field, including next(). [<outdated, supposed to be stroken-through>I am not sure how to handle that in the iterator paradigm, except maybe if operator++ is a noop while operator*, after obtaining the value, advances the iterator to the next node. But that would be against the existing practice and may result in bad surpises.</outdated>]
Update: silly me, I got it now: operator++ would first call next() and then call finish() :)

All that would be hidden by the iterator. At first I thought only of letting finish() return the next operation, but that would not allow for prefetching, and it wouldn't hide a whole lot nor allow a nice and concise for loop header. I would prefer to omit the postincrement operator, though. There's no reason to use it++ instead of ++it even in a language rather inappropriately called C++: it's a silly imitation of whoever started this, it betrays ignorance of what goes on underneath (there's wasted extra work unless the optimiser can get rid of it), and even the declaration indicates that it plays second fiddle to the preincrement operator. But, more than enforcing better style, it would help avoid the accidental use of *it++.

(Added 2013-02-28) Er, prefetching doesn't require an iterator, of course...

(Added 2013-03-11) Updated aggregator_iterator code (copy&paste from working code, then manually corrected whitespace mishaps): added "public std::iterator<std::input_iterator_tag, aggregator_operation>" parent, clarified two "bad use" comments, presented alternative for eos(). An input iterator supposedly also requires the postincrement operator: this code has been tested on OS X with std::for_each without it, but other implementations addicted to postincrement may still fail, unless it is implemented returning void or a more convoluted implementation is provided that progresses based on the observed use of dereferencing. An arrow operator is probably never used, the copy constructor must be available, the copy assignment operator probably depends on algorithm implementation, but probably it won't give rise to problems if both are kept. Still, maybe it would be better, if only for peace of mind, to just forget about iterator category and official signature anyway, and instead require an explicit for loop with eos()...

Raf Schietekat的头像

(2013-03-03 Removed: g++ on OS X not recent enough for lambda support, presumed partial support was based on mistake.)

Raf Schietekat的头像

TBB team: please report at least once (per build) if lambda testing is being skipped.

(Edited.)

Raf, Could you clarify what you ask us to do? Do you think we silently skip lambda testing on certain platforms?

Raf Schietekat的头像

When I do "make test" in the current version TBB 4.1 update 2 (tbb41_20130116oss), I see "Known issue: some tests are skipped on Mac OS* X" several times for test_malloc_compliance.exe and there are some others in the source code (like "Known issue: GCC builtins aren't available"), so there's at least an expectation to be alerted about such matters. While there are several positive REMARKs (only with -v) about lambda testing, it would be nice to have at least one negative general REPORT (even without -v) about skipping lambda in all tests, e.g., from test_tbb_version.exe, which would provide some clarity when testing in an unfamiliar environment. It took a while before I became aware that g++ as shipped is still pre-lambda on OS X (which favours Clang instead).

Likewise, if mainly for people less familiar with TBB, it would be nice to be alerted if graphical output is being skipped in the examples (from examples/common/gui/convideo.cpp), as in a contribution I made a while ago (for another environment). Please also confirm whether you have received my recent contribution to restore graphical output on OS X (tbb41_20130116oss.Raf_Schietekat.2013-02-14.tgz).

I would give an opposite suggestion about some flow graph examples output, which should depend on "-verbose".

(While I have your attention, there's a typo "//! prepare wait by inserting 'thr' into the wai[l]t queue" ([emphasis] mine) in src/tbb/concurrent_monitor.h.)

Thanks a lot, Raf.

The last one (about the typo) is easy, I will fix it.

For the rest, I think you have fair points. We will discuss them among outselves, and let you know how we are going to address them.

Thanks!

 

Hi, Raf,

Please ask Alexey K about your contiribution tbb41_20130116oss.Raf_Schietekat.2013-02-14.tgz if he has not told about it.

As for the other suggestions, we are mostly o.k. with them. Some are already done (independently) in the trunk, and the others will trickle out in upcoming releases.

Thanks!!

登陆并发表评论。