Under the hood: Learning more about task scheduling

I’m back with another challenge, encountered during my support work for Intel® Threading Building Blocks.  I’ve been working with several TBB users who appreciate the general philosophy of Cilk task scheduling embodied in TBB but have run into some practical challenges applying it to their applications.  Often the issue revolves around the need to block some computations until other computations complete.  It may be that they need to handle either inter-object or intra-object threading-their application may at different times encounter a bunch of objects to run in parallel or just one big object that could use several threads to chew on simultaneously.  Or they may have objects they need to compute upon which other objects are dependent-here it would be great to suspend dependent object processing while the weight of the associated threads is thrown to computing the shared object.

These are tough problems and I don’t have answers for them yet.  But I have a few ideas and I hope in the next few posts to share them and get community feedback that might inspire some solutions.  So bear with me as I stumble about and maybe we can all learn something.

Anyone who has had much exposure to presentations about Intel TBB has probably heard in one form or another the motto, “process locally, steal globally.”  The Cilk philosophy embodied in the TBB task scheduler is to constrain active threads to their own local region in memory to exploit any memory that may already reside in the caches of the processing element running that thread; meanwhile, idle threads should try to steal work from memory regions as yet untouched by the active thread(s) to avoid interrupting those running threads.

When one of my customers presented an example using a nested pair of parallel_for statements, I realized that all my knowledge on subject of scheduling was theoretical: I hadn’t dug down into the guts of this code.  Now might be the time. 

First, here’s a sketch of the test code.  You’ll note an outer and inner loop with a lock on the outer and some spinning work on the inner.  This is intended to represent the structure of a program operating on a set of objects that block because of some postulated resource contention, and the inner loop represents the work done to compute that shared resource.  You’ll see that the lock is currently commented out.  When running this code on an 8-core machine, it will sometimes lock up.  Some ideas have been bounced around about why, but I’m curious whether I can demonstrate the workings of the problem rather than just speculating about it.  There’s also explicit references to the auto_partitioner, but with the advent of the affinity_partitioner, use of the auto_partitioner has been deprecated.  Still, we’ll start with this one and then see if we can tell the difference when looking under the hood.
b08050501.JPG

So what’s going on here?  As is my wont, I turned to Intel® Thread Profiler for a first look at the processing:

b08050502.JPG

Hrumph!  I can see eight threads cranking at the work, which is good because I’m running on an 8-core processor.  32% of the run time is spent at concurrency level 8.  Most of the time is spent in that serial tail that must represent startup processing, but the whole run is under 0.2 seconds so I suspect that’s a fixed overhead that will be negligible compared to the overall time of real work.  But that’s about all.  Zooming in on the high concurrency zone:
b08050503.JPG

Where the work is done, the test is keeping 8 threads busy 94% of the time. But still there’s no clue about how the task scheduler is dividing the work, though it looks like it gets interesting towards the end of the run. Guess I’ll have to revert to an old standard, inserting print statements to expose what is happening in the code:
b08050504.JPG

Unfortunately, while I can print out the loop bounds that tell me where I am in the nested executions of the array, I can’t print which thread is handling which range.  The TBB task object that might contain that information lies outside the scope of the outer_loop class.  Stuck.

But not for long.  In the evolving library that is Intel Threading Building Blocks, there is a new feature, available in at least the latest open source release (upgraded to Stable as tbb20_20080408oss), called task_scheduler_observer.  I’ve implemented an observer that lets me identify which thread executes which range.  In my next post I describe it and start exploring the behavior of the scheduler.

For more complete information about compiler optimizations, see our Optimization Notice.

13 comments

Top
robert-reed's picture

Most of the codes I know of are proprietary, not open-source, so I have nothing I can hand you. The examples I know of in scene graphs and other dependency networks come from customers whose code I cannot share. Anyone else out there who might have a sharable example?

jsukha's picture

Not sure if there are more recent posts on this topic, but I have been interested in investigating how to schedule in parallel computations with these DAG-like dependencies, and someone referred me to this post.

Are there examples of existing open-source applications out there which have this kind of structure that one might want to write using TBB, Cilk, etc.? I am curious to know if there are any somewhat-realistic benchmarks out there which can be experimented with to see whether a scheduling algorithm is going to be effective or not?

As a student I'm ideally looking for codes which are just complicated "enough" for a small project, but not "too" complicated. :)

anonymous's picture

I have just been looking at the code for the task scheduler and I 'think' the following would work...

Add another state, similar to reexecute, perhaps delayed_reexecute. Normally a reexecute calls an immediate spawn on the task. delayed_reexecute would set a local task* delayed_task variable (initially set to NULL) and then after the next successful get_task or steal_task the delayed_task would be spawned and the delayed_task reset to NULL.

This method adds a couple of if tests against the delayed_task variable, one in the get_task middle loop and one in the steal_task outer loop, so reasonably small overhead.

Unfortunately, my app is not is in a position to easily test this, but I will do so as soon as I can.

Dmitry Vyukov's picture

Re [Charles E. Leiserson]: Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span >> P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see
http://supertech.csail.mit.edu/cilk/lecture-1.ppt.
Other lectures and materials can be found here:
http://supertech.csail.mit.edu/cilk/.

I've read Cilk papers some time ago. Can you, please, explain why
(A) work must be structured as balanced tree (I mean not ideally balanced, just sufficiently balanced)
not followed from
(B) work/span >> P
?

I can't imagine unbalanced tree with work/span >> P. Also take into account that stealing is expensive operation.

Dmitriy V'jukov

anonymous's picture

I have a similar issue. My app is effectively a OLAP type database. The task I am parallelising is resolution of user data selections to a data set. So I have a tree of operators on picked items from dimensions. A node in the tree (and possibly occuring more than once in a tree) could be a previously saved data set. This saved data set may or may not exist in memory and may require construction from its initial definition and thereby generating a new resolution of data set task. As soon as such a saved data set appears more than once I could have a second tbb thread looking to use it whilst it is being constructed. Without getting more sophisticated it would have to block and wait.

Wouldn't the following provide a solution?

If a tbb thread could recycle what is left of its task as a new task and then force a task steal rather than immediately checking to see what other work it had available then there is a chance it would do some useful work - possibly even helping the first tbb thread as it is busy resolving the saved data set (assuming that was broken into many tasks, which mine will be). Once it had completed the stolen task it would go back to normal processing of its task set.

I wouldn't claim my TBB knowledge to be anywhere near 100% yet... but I believe a TBB thread will always execute all task in its task set before looking to steal, so I don't think there is a way of simulating this process by playing around with levels in the stack of task lists.

It may be better if this thread moved onto some other task in its task set if such tasks were available and only looked to steal if there were no other tasks. Then what I am proposing is that the task that would otherwise block would temporarily become invisible to the thread, probably only for one 'get next task' call.

I.e. Have a task.spawntemporarilyinvisible() call

Mick

robert-reed's picture

Thanks very much, Charles! I think I understand the basics of your algorithm and I'd like to dig deeper as I have the time. I'll certainly report back here anything I find interesting. Translating from the theoretical DAG to a practical object computing code, it seems to me that the grey-following threads would reexecute the serial regions in the object computation functions that without the following would only get executed once. So once-per-object operations like reference counting, memory allocations and such would have to be guarded from the follower(s). Also in that list, there'd need to be some mechanism for the code to figure out that the initial spawn for a successor had already occurred and a way to find the outgoing spans for that parallel region in, say the TBB task tree. Another complication would be serial regions between parallel regions within a particular function that may require a synchronizer to ensure tasks associated with the first parallel block were complete before spawning tasks for the second parallel block.

Unfortunately, I do not have any data sets to share. The customers suggesting these models have their own concerns about privacy and have either not shared more than the sample code I show above or have constrained me under a nondisclosure agreement. But I can share anything I come up with personally, and I can point them to this exchange in the hope that they may participate to the level they are able.

Thanks for your comments. I hope we'll continue to keep your interest.

anonymous's picture

I think you can actually do the algorithm I proposed easily using only synchronization through memory, because the state bits change monotonically: white => grey => black and no-value => value. You might get some redundant computation, but it would likely be minimal.

Work-stealing need not be balanced. The key issues are work (total ops) and span (critical-path length). Parallelism is work/span. As long as there's substantially more parallelism than processors (work/span >> P), work-stealing will give linear speed-up. Your [Dmitriy V'jukov] example has a large span compared to work, and so the parallelism is minimal. For more details, see http://supertech.csail.mit.edu/cilk/lecture-1.ppt. Other lectures and materials can be found here: http://supertech.csail.mit.edu/cilk/.

anonymous's picture

Code formatting in previous post is lost. Damn!

Re [Charles E. Leiserson]: Unfortunately, not all computational parallelism falls into simple packages of tree-structured tasks and data dependencies.

Btw, work-stealing scheduler supports only parallelism structured as *balanced* tree. Am I right here?
So user MUST structure parallelism as *balanced* tree. Provided extreme case of unbalanced tree, user can get extremely bad performance.

For example following structure *allows* parallelism, but will be executed extremely badly by work-stealing scheduler:

work11 forks work21, work22

work21 forks nothing
work22 forks work31, work32

work31 forks nothing
work32 forks work41, work42
etc

Is my understanding correct?

anonymous's picture

Re [Charles E. Leiserson]: But before that, let me back up and repose your question: do you see a means to efficiently compute in the face of such object dependencies? I'd love to find that I'm overlooking an obvious solution that doesn't require locks. Got an idea?

What do about following approach?

// pseudo-code
void task::operator()
{
secondary_t* sec = get_associated_secondary_element();
if (sec)
{
if (false == sec->is_computed())
{
if (sec->try_lock())
{
sec->compute();
sec->unlock();
}
else
{
// in fifo order, not in lifo!
postpone_task(this);
// maybe next time
return;
}
}
}
// sec is computed, can use it
fork_child1(sec);
fork_child2(sec);
}

anonymous's picture

You're right that there are issues of memory use. A properly implemented work-stealing scheduler guarantees that if S1 is the serial stack space, with P processors the stack space is at most P*S1. If you repurpose a processor every time it hits somewhere another processor is executing, you can easily blow out memory.

You can model the computation to performed as a dag (directed acyclic graph). Each vertex has a value to be computed which depends on the values of its successors in the dag. Sinks (vertices with out-degree 0) can be computed with no further exploration. A vertex whose value has been assigned can provide its value to its predecessors without further computation.

Here's a randomized nondeterministic algorithm to compute the dag which should work well using Cilk or TBB, although I haven't implemented it. It uses at most P*S1 stack space. Recursively walk the dag in parallel, spawning each exploration of a successor. Label edges as white, grey, and black. All edges are initially white. If you traverse a white edge, color it grey. If you return across a grey edge, color it black. Now, in your parallel recursive tree walk, when you visit a vertex, if its value has already been computed, return the value. If not and there exists at least one white edge, take a _random_ white edge. Otherwise (all edges are black or grey), take a _random_ grey edge. (You're now chasing the tail of the processor that colored the edge grey speculatively hoping to help out with that subdag.) Use locks only to guarantee the atomicity of these operations on a single node.

I'll look at this problem with my students, both theoretically and in practice. If you have any data sets, let me know. Also, if you implement it, I'd be curious as to the results.

Pages

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.