Clean up inside gate.h

AJ:

Well, I get to be the jerk who calls out a piece of very confusing code inside TBB. This masterpiece took half an hour to understand, and I'm still very confused as to what this function does. Inside the function:

void try_update( intptr_t value, intptr_t comparand, bool flip=false )

There is this gem:

if( flip ? old!=comparand : old==comparand )

It took me a while to parse how a ternary operator was being used within an if statement. Once that was out of the way, I am still very confused as to why try_update takes a flip argument at all... the documentation says this:

//! Update state=value if state==comparand (flip==false) or state!=comparand (flip==true)

I am still very confused... so this function seems to be something that will flip around the logic... a complement, perhaps. Why can't this be split into two functions? Is there a really good performance reason for doing this? I don't want to clean up code inside TBB that is the way it is for a very good reason... so other than possibly minimizing instruction cache footprint, why is the function written this way?

Thanks,

AJ

Raf Schietekat:

"Why can't this be split into two functions?" It took me a half hour to understand why flip might be so confusing... no, I still don't get it. :-)

AJ:

Because it looks like try_update is trying to do too much... its behaviour depends on the last parameter, and I just don't see a real reason for that. In addition, that last parameter has a default argument. For someone reading it for the first time, it's just confusing to try to understand this strange function. The function isn't used in many places, so I think a better name, and perhaps splitting the function, can be done without any harm.

Alexey Kukanov (Intel):

Quoting - AJ
I am still very confused... so this function seems to be something that will flip around the logic... a complement, perhaps. Why can't this be split into two functions? Is there a really good performance reason for doing this? I don't want to clean up code inside TBB that is the way it is for a very good reason... so other than possibly minimizing instruction cache footprint, why is the function written this way?

It could be split, of course; but if all code except one condition is exactly the same for the two cases, why would we duplicate this code and confuse ourselves with two names for essentially the same thing?

AJ:

No reason other than to make it easier to trace through the code for people (like me) not familiar with it.

robert-reed (Intel):

Quoting - AJ

No reason other than to make it easier to trace through the code for people (like me) not familiar with it.

I guess you're familiar with it now? ;-)

Dmitry Vyukov:

Quoting - AJ

There is this gem:

if( flip ? old!=comparand : old==comparand )

Is this better (note: no ternary operator)?

if(flip^old==comparand)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
robert-reed (Intel):

Quoting - Dmitriy Vyukov

Is this better (note: no ternary operator)?

if(flip^old==comparand)

Good one, Dmitriy! No matter how obscure, there's always a more obscure way to do most things.

Dmitry Vyukov:

I think there are better ways to spend time in gate.h. For example:

- add fenced version of get_state() with preceding store-load fence

- move it to public interface

- reuse it in queue as wait logic

Also probably:

- make it fine-grained to prevent thundering herd in queue (which is currently there). Here is description and algo:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

or

- make it portable general-purpose low-overhead gate/eventcount/condition_variable. Here is algo:

http://groups.google.com/group/comp.programming.threads/msg/16987ca1f707...

AJ:

Quoting - Dmitriy Vyukov

I think there are better ways to spend time in gate.h. For example:

- add fenced version of get_state() with preceding store-load fence

- move it to public interface

- reuse it in queue as wait logic

Also probably:

- make it fine-grained to prevent thundering herd in queue (which is currently there). Here is description and algo:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

or

- make it portable general-purpose low-overhead gate/eventcount/condition_variable. Here is algo:

http://groups.google.com/group/comp.programming.threads/msg/16987ca1f707...

I don't know how to work on any of those issues, but I do know how to clean up unclear code. I'll put this on my TODO list, to make a quick patch for this area of gate.h.

Code review is an important part of open source software; unless there is a good reason not to, code and interfaces should be open to adjustment. Many open source projects have their commits posted to a mailing list for review, to ensure that what enters the repository is clear and peer-reviewed.

Raf Schietekat:

"I do know how to clean up unclear code" That's why yesterday, in another project, I introduced some instances of "if(A?B:C)" into existing code, because I dislike redundancy even more than ternary operators in predicates. :-)

robert.jay.gould:

Quoting - Raf Schietekat

"I do know how to clean up unclear code" That's why yesterday, in another project, I introduced some instances of "if(A?B:C)" into existing code, because I dislike redundancy even more than ternary operators in predicates. :-)

I agree that ternary operators are nice for keeping code simple and clean; they're not obscure at all to me. But then again, ternary operators might be losing popularity with younger programmers, who knows? One day ternary operators might get the GOTO treatment and be considered something best left out unless utterly necessary.

AJ:

It's not the ternary operator by itself; it was the ternary operator within an if statement. It was the first time I had ever seen this usage. If it's an accepted pattern, that's fine; I just have never seen it, and it seemed overly cryptic.

It also seemed to me that this function is trying to do too much... the behaviour of the function changes with an argument.

Now, it could be that I just don't understand exactly what this function is trying to do, and some better documentation would make it a lot clearer for me. However, separating this function into something like try_update_if_equal and try_update_if_different, or something like swap_if_equal()/swap_if_not_equal(), would be easier to read and understand.

Raf Schietekat:

"the behaviour of the function changes with an argument"

I generally like it when functions don't ignore what I tell them to do. :-)

Alexey Kukanov (Intel):

Remember, this stuff was initially developed as strictly internal, so it was enough that a few TBB scheduler mavens understood it :). It's good that its unclearness was caught by your eyes - a chance to improve. I agree with you that better documentation inside the code would be helpful for everyone who, like you, tries to understand what's going on there. I also agree that better naming is possible, both for the function and its arguments. Still, duplicating the code just for the sake of function naming is redundant, and our internal peer review would, I believe, say no-go to any such proposal.

Arch D. Robison (Intel):

I'm with Dmitriy - time is best spent fixing the overall algorithm (blocking threads until there is work to do). I will study Dmitriy's references in more detail, as I'm already trying to redo the algorithm.

AJ:

Quoting - Alexey Kukanov (Intel)

Remember, this stuff was initially developed as strictly internal, so it was enough that a few TBB scheduler mavens understood it :). It's good that its unclearness was caught by your eyes - a chance to improve. I agree with you that better documentation inside the code would be helpful for everyone who, like you, tries to understand what's going on there. I also agree that better naming is possible, both for the function and its arguments. Still, duplicating the code just for the sake of function naming is redundant, and our internal peer review would, I believe, say no-go to any such proposal.

Could someone spend a moment and provide a clear explanation of the purpose of this function?

Also, I will be going through a lot more of the messy details of the scheduler. With everyone's help, I'd like to write a bit of a guide to help others understand its internals.

Dmitry Vyukov:

Quoting - AJ

Could someone spend a moment and provide a clear explanation of the purpose of this function?

It's really hard to explain the purpose of this single function, because it's part of an entire hardcore lock-free blocking/signaling algorithm.

Condition variables are usually used for blocking/signaling in lock-based algorithms. But condvars are not suitable for lock-free algorithms, because they require a mutex and thus eliminate all the benefits of lock-freedom.
Eventcounts/gates are basically "condition variables for lock-free algorithms". The main point is that they make the fast path non-blocking (and preferably atomic-free); the slow path is still blocking, because it would be irrational to have a non-blocking blocking algorithm.

I think you can start by studying this excellent eventcount algorithm from Chris Thomasson:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380
(see Joe Seigh's comments about memory fences and usage example)

I think Chris Thomasson can say more about eventcounts/gates and provide more links.

Btw, I was thinking about writing an article for ISN about eventcounts and their usage in conjunction with lock-free containers (queues, stacks, etc), because currently my queue is only half-useful:
http://software.intel.com/en-us/articles/single-producer-single-consumer-queue

Dmitry Vyukov:

Quoting - Arch Robison (Intel)

I'm with Dmitriy - time is best spent fixing the overall algorithm (blocking threads until there is work to do). I will study Dmitriy's references in more detail, as I'm already trying to redo the algorithm.

I can describe the eventcount interface I am currently using:

class eventcount
{
public:
    typedef uint32_t state_t;

    eventcount();
    ~eventcount();

    state_t prepare_wait();
    void retire_wait();
    void wait(state_t cmp);

    void signal();
    void signal_relaxed();
};



//Here is a little helper class for blocking threads:

class eventcount_blocking
{
public:
    eventcount_blocking(eventcount& ec)
        : ec_(ec)
    {
        cmp_ = ec_.prepare_wait();
        wait_ = false;
    }

    void wait()
    {
        assert(false == wait_);
        wait_ = true;
        ec_.wait(cmp_);
    }

    ~eventcount_blocking()
    {
        if (false == wait_)
            ec_.retire_wait();
    }

private:
    eventcount& ec_;
    eventcount::state_t cmp_;
    bool wait_;

    eventcount_blocking(eventcount_blocking const&);
    eventcount_blocking& operator = (eventcount_blocking const&);
};

And here is usage example:

queue q;
eventcount ec;

void produce(void* data)
{
    q.enqueue(data);
    ec.signal(); // or signal_relaxed() depending on queue implementation
    // overhead for signaling is: [MFENCE] + load + branching
}

void* consume()
{
    void* data = 0;
    if ((data = q.dequeue()))
        // fast-path (no overhead)
        return data;
    for (;;)
    {
        eventcount_blocking block (ec);
        if ((data = q.dequeue()))
            return data;
        block.wait();
        if ((data = q.dequeue()))
            return data;
    }
}

The main point is that (1) the eventcount logic is completely non-intrusive wrt the lock-free container, and (2) it's possible to aggregate several containers with a single eventcount:

void* consume_impl()
{
    // || on raw pointers yields bool, so return the first non-null
    // result via short-circuit assignment instead:
    void* data = 0;
    (data = high_prio_queue.dequeue())
        || (data = normal_prio_queue.dequeue())
        || (data = work_stealing_dequeue.pop())
        || (data = low_prio_queue.dequeue())
        || (data = global_root_task_queue.dequeue());
    return data;
}

void* consume()
{
    void* data = 0;
    if ((data = consume_impl()))
        // fast-path (no overhead)
        return data;
    for (;;)
    {
        eventcount_blocking block (ec);
        if ((data = consume_impl()))
            return data;
        block.wait();
        if ((data = consume_impl()))
            return data;
    }
}

Impressive! IMHO, it's extremely flexible and powerful.

P.s. I'm not aware of the legal details, but if you need, I can submit some of my algorithms as official TBB contributions, such as the hash map.

Alexey Kukanov (Intel):

Quoting - AJ

Could someone spend a moment and provide a clear explanation of the purpose of this function?

Also, I will be going through a lot more of the messy details of the scheduler. With everyone's help, I'd like to write a bit of a guide to help others understand its internals.

Wanna become one of those scheduler mavens? :):)

The gate is basically a state machine. The states are described in task.cpp as three constants: SNAPSHOT_EMPTY means there are no tasks available for stealing, SNAPSHOT_FULL means there is at least one task (yes, you might find the name misleading :)), and SNAPSHOT_PERMANENTLY_OPEN means all masters have left, there will be no more work, and workers should just exit. There is a fourth state, BUSY, which means some thread is checking all task pools to find out whether there are any tasks; this state does not have a permanent constant but is instead represented by a unique ID of the thread doing the exercise.

Between states, some transitions are possible and some are not. Transitions are performed by the function called try_update; as you might guess from its name, it does not guarantee a new state is set, because some other thread could change the current state. The semantics of try_update is almost the same as for compare-and-swap: it changes the state of the gate to Value (the first argument) if it is currently equal to Comparand... or if it is not equal, in some cases, depending on which transitions to the new state are possible. To distinguish inside try_update what logic is required for a particular operation, the third argument (flip) is used. If flip is false (the default), the regular test-and-set logic is used. If flip is true (which must be explicitly specified), the opposite logic is used, i.e. the current value is changed iff it is not equal to Comparand.

Let's see some examples.

inline void GenericScheduler::mark_pool_full() {
    // ...
    Gate::state_t snapshot = arena->prefix().gate.get_state();
    if( snapshot!=SNAPSHOT_FULL ) 
        arena->prefix().gate.try_update( SNAPSHOT_FULL, SNAPSHOT_PERMANENTLY_OPEN, true );
}

As you might guess from its name, mark_pool_full signals that some task pool is "full", i.e. there is some job. Transition to FULL is possible from the EMPTY and BUSY states, but not from PERMANENTLY_OPEN (which cannot change to anything else, by the way). Thus we set the third argument, instructing try_update to only make the change if the state is not PERMANENTLY_OPEN, i.e. using the "flipped" logic.

Now let's look at the piece of code from wait_while_pool_is_empty:

            case SNAPSHOT_FULL: {
                // Use unique id for "busy" in order to avoid ABA problems.
                const Gate::state_t busy = Gate::state_t(this);
                // Request permission to take snapshot
                arena->prefix().gate.try_update( busy, SNAPSHOT_FULL );

Currently, the state is FULL (as seen in the very first line); but a worker could not steal a task for a while, and so wants to check whether tasks to steal really exist, and refresh the state. To do that, it first tries to set the state to BUSY; and the change is only allowed if the state is still FULL, because otherwise either some other thread already does/has done the global check, or all masters quit and the gate is permanently open. Therefore the "normal" test-and-set logic is used.

Hope it will help you to understand this piece of the scheduler.

Arch D. Robison (Intel):

If someone is planning to rewrite the entire snapshot/gate logic, here is what we want to accomplish in a scalable manner:

  1. A thread should sleep if there is no task in the system to work on.
  2. If a thread puts a task in its pool, and there are worker threads asleep, one of the workers should be woken up.

Your solution gets extra credit if, when a task is mailed to another thread, (2) wakes up that thread. (1) may get more complicated if it takes into account the restrictions discussed for solving multiple pipeline deadlock.

You may gain or lose points for the number of ternary operators used, depending on who is grading.

Dmitry Vyukov:

Quoting - Arch Robison (Intel)

If someone is planning to rewrite the entire snapshot/gate logic, here is what we want to accomplish in a scalable manner:

  1. A thread should sleep if there is no task in the system to work on.

Do you mean that threads must sleep on depth levels > 0? Currently I am not sure whether it's worth doing... it will definitely substantially complicate the implementation.

Quoting - Arch Robison (Intel)

If someone is planning to rewrite the entire snapshot/gate logic, here is what we want to accomplish in a scalable manner:

  1. A thread should sleep if there is no task in the system to work on.
  2. If a thread puts a task in its pool, and there are worker threads asleep, one of the workers should be woken up.

Your solution gets extra credit if, when a task is mailed to another thread, (2) wakes up that thread. (1) may get more complicated if it takes into account the restrictions discussed for solving multiple pipeline deadlock.

What exactly do you mean here? If threads don't sleep on depth levels > 0, then employment can't affect the blocking logic, because sleeping threads will be unemployed. And if threads do sleep on depth levels > 0, then I think it (the depth) will be the cause of most difficulties.

Quoting - Arch Robison (Intel)

You may gain or lose points for the number of ternary operators used, depending on who is grading.

:)

Dmitry Vyukov:

Quoting - Arch Robison (Intel)

If someone is planning to rewrite the entire snapshot/gate logic, here is what we want to accomplish in a scalable manner:

  1. A thread should sleep if there is no task in the system to work on.
  2. If a thread puts a task in its pool, and there are worker threads asleep, one of the workers should be woken up.

Your solution gets extra credit if, when a task is mailed to another thread, (2) wakes up that thread. (1) may get more complicated if it takes into account the restrictions discussed for solving multiple pipeline deadlock.

You may gain or lose points for the number of ternary operators used, depending on who is grading.

Also, an interesting case is spawning a list of tasks - spawn( task_list& list ) - where some of those tasks can be mailed to mailboxes. The straightforward approach would be to handle each task individually, just as if there were N independent spawn( task& child ) calls. But this can be suboptimal.

Another interesting point is the interaction of mailboxes, blocking, and employment. The spawning thread is unable to determine whether the target thread will have the right employer at the time the task is dequeued from its mailbox. So it is unable to correctly decide whether it has to wake up another thread or not. It can only make a prediction based on the target thread's current employer.

Arch D. Robison (Intel):

To clarify: the worker threads only have to sleep at depth levels <= 0, i.e., under the conditions in which the current logic in task.cpp calls "wait_while_pool_is_empty()".

AJ:

Quoting - Alexey Kukanov (Intel)

Wanna become one of those scheduler mavens? :):)

Yes, I do actually :-) Could I ask you to blog on this topic? I've enjoyed your previous blogs on concurrent_vector and task cancellation. I would like to understand the internals of the scheduler in detail, and this post has answered some of my questions.

Also, judging by this thread, I sense that others have no issue with try_update and the ternary expression. I'm glad I stimulated a conversation anyway. Many open source projects feature commit reviews, and my intention was simply to review a piece of code that I felt could have been better. I accept that the majority disagrees with me, and I'll learn your way :-)

Alexey Kukanov (Intel):

Quoting - AJ
Yes, I do actually :-) Could I ask you to blog on this topic? I've enjoyed your previous blogs on concurrent_vector and task cancellation. I would like to understand the internals of the scheduler in detail, and this post has answered some of my questions.

Honestly, I did not write any of those blog posts. I will deliver your praise to Anton and Andrey, though :)

The internals of the scheduler could be the subject of a whole lot of blog posts, but I am not sure whether those would be of interest to many people. Answering questions on the forum gives me the feeling that I directly address the needs of those who asked. Writing a blog post about scheduler internals, full of technical details, seems much less useful because there is no feeling of matching the audience's needs, so no reward.

Quoting - AJ
Also, judging by this thread, I sense that others have no issue with try_update and the ternary expression. I'm glad I stimulated a conversation anyway. Many open source projects feature commit reviews, and my intention was simply to review a piece of code that I felt could have been better. I accept that the majority disagrees with me, and I'll learn your way :-)

Thanks. In fact, that piece of code might be made better - through better documentation, or by reimplementing it wholly, e.g. as Dmitry V. suggested.

Dmitry Vyukov:

Quoting - Alexey Kukanov (Intel)

Honestly, I did not write any of those blog topics. I will deliver your praise to Anton and Andrey, though:)

Alexey, you've already described the TBB scheduler ;)

http://www.intel.com/technology/itj/2007/v11i4/5-foundations/4-tbb.htm

Also, a MUST READ is Grandfather Cilk's "Scheduling Multithreaded Computations by Work Stealing":

http://supertech.csail.mit.edu/papers/steal.pdf
