LIFO vs FIFO

Quoting - MADadrobiso

TBB uses an approximation of Cilk-style
scheduling (work LIFO; steal FIFO). It has good locality and load
balancing properties while making certain space guarantees. It enables
clean nesting of software components, analogous to the way serial code
composes by nested subroutine calls.

FIFO has terrible locality and often has no space guarantees, which
often leads to complex throttling mechanisms. Nonetheless FIFO
scheduling definitely has its uses, notably for soft real-time.
Finding a rational marriage of Cilk-style and FIFO scheduling remains
an open problem. If done rigorously, it would be worthy of a Ph.D. thesis.

The "two roots" idea suggests that perhaps said marriage could be
accomplished by some kind of epoch mechanism, where each task belongs
to an epoch and older epochs are given precedence. There's a long way
between a notion and a practical system.
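(As an aside, "work LIFO; steal FIFO" can be pictured as each worker owning a deque: the owner pushes and pops work at one end, thieves take from the other. Below is a minimal lock-based sketch with invented names; real implementations, TBB's included, use far more refined lock-free deques.)

#include <deque>
#include <mutex>

struct Task { virtual void execute() = 0; virtual ~Task() {} };

class WorkerDeque {
    std::deque<Task*> d;
    std::mutex m;
public:
    void push(Task* t) {     // owner spawns: push on the "hot" end
        std::lock_guard<std::mutex> l(m);
        d.push_back(t);
    }
    Task* pop() {            // owner works LIFO: newest task first
        std::lock_guard<std::mutex> l(m);
        if (d.empty()) return 0;
        Task* t = d.back(); d.pop_back(); return t;
    }
    Task* steal() {          // thief steals FIFO: oldest task first,
        std::lock_guard<std::mutex> l(m);  // i.e. the largest subtree
        if (d.empty()) return 0;
        Task* t = d.front(); d.pop_front(); return t;
    }
};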

There are two kinds of user algorithms.
First are 'Cilk-style' computational algorithms ('HPC'). They don't require and won't benefit from 'fair' processing. Even worse, under fair processing they will degrade and require more memory.
Second are general-purpose algorithms (games, client/server software). Sometimes they simply require fair (at least to some degree) processing; otherwise they can effectively deadlock.
So, I think, the first decision one has to make is how to cope with these two kinds of user algorithms. For example, TBB can postulate that it doesn't support the second kind at all. Or TBB can have two modes of operation, i.e. the user has to set up the proper type of scheduler at the beginning of the program. Or TBB can support both kinds simultaneously, for example on a per-task basis, i.e. these tasks are HPC, and those are general-purpose. Or just provide a single compromise scheduler.

I'm curious what the .NET TPL team will decide on this problem:
http://blogs.msdn.com/pfxteam/archive/2008/08/01/8800195.aspx
I hope they will publish some results. Btw, I see some interesting words in the comments, like PRIO, SIZE and AGE.


Is it too restricting to eliminate programs that make assumptions about order of execution beyond the dependencies expressed in the task tree? cancel_group_execution() as a means to schedule an end to the execution of a program should only be "spawned" in a tbb_thread.

A round-robin between LIFO and FIFO seems like a recipe for space requirement explosions: a worker thread should always first exhaust its own LIFO task tree before doing anything else (like stealing or executing FIFO tasks).

Quoting - MADadrobiso

The "two roots" idea suggests that perhaps said marriage could be
accomplished by some kind of epoch mechanism, where each task belongs
to an epoch and older epochs are given precedence. There's a long way
between a notion and a practical system.

Hmm.... Scheduling based on epochs looks interesting...
When I was designing a task-processing framework, I was thinking about something like: N tasks processed in LIFO order, then 1 task dequeued in FIFO order, then again N tasks processed in LIFO order.
But anyway this can blow up space requirements...

The interesting thing is to try to determine whether the user algorithm makes forward progress or not. False negatives are OK, false positives are unacceptable. If the user algorithm makes forward progress then there is no need to change the epoch, or to use FIFO order. If there is no forward progress, then the epoch must be changed, or some tasks must be dequeued in FIFO order. This can ensure maximum performance and space guarantees for HPC tasks, and at the same time guarantees of forward progress for general-purpose tasks.
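(A tiny single-threaded model of that "N LIFO, then 1 FIFO" interleave; Task and Scheduler are invented for illustration. The LIFO pops give depth-first locality, the occasional FIFO pop acts as a forward-progress valve.)

#include <deque>
#include <queue>

struct Task { virtual void execute() = 0; virtual ~Task() {} };

struct Scheduler {
    std::deque<Task*> lifo;  // locally spawned tasks, newest at the back
    std::queue<Task*> fifo;  // "fair" queue, oldest at the front

    void run(int N) {
        while (!lifo.empty() || !fifo.empty()) {
            for (int i = 0; i < N && !lifo.empty(); ++i) {
                Task* t = lifo.back(); lifo.pop_back();
                t->execute();          // LIFO: depth-first, good locality
            }
            if (!fifo.empty()) {
                Task* t = fifo.front(); fifo.pop();
                t->execute();          // FIFO valve: eventual execution
            }
        }
    }
};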

I wonder how much different an epoch-based scheme is from simply having both a LIFO and a FIFO, and giving the LIFO preference. I.e., the outer-most task graph would run FIFO. The LIFO would handle all the inner levels. That yields a system that would sort of be like a bunch of sequential communicating processes, where each process is running a traditional call stack.

Quoting - raf_schietekat

That would be the general idea, but there are obviously ways to trade absolute fairness for better performance. Using a distributed queue could be one. The merge could be done with a coarse timestamp granularity (did somebody say "epochs"?), and an idle thread could take a bigger bite than just the front task. A thread could keep tasks local until the end of the execute() or beyond (based on a maximum delay).

Hmm... it looks like plain FIFO. I.e. no locality, no space guarantees...

I think it's unacceptable trade-off for computational tasks which don't require any fairness at all.

"Hmm... it looks like plain FIFO. I.e. no locality, no space guarantees..." That would be where "keep tasks local until the end of the execute() or beyond (based on a maximum delay)" comes in (see added emphasis), and I guess it is appropriate to first execute locally queued tasks (after a quick check that other threads are not getting too far behind). Perhaps it may still make sense to have affinity in the global queue, but between FIFO and stealing the locality would have already decreased.

I wonder how radical a revision of the scheduler should be: somewhere between absolute fairness (everything in a global queue, which would be inefficient) and mere guaranteed progress (no global queue, where one thread can get behind while other threads are having a good time doing nothing in particular, all with maximum efficiency). But what elements should be taken into consideration? Maybe tasks can have different priorities (GUI focus, GUI non-focus, background computation), maybe they have soft real-time constraints, maybe they have ordering constraints, ...

"I wonder how much different an epoch-based scheme is from simply having both a LIFO and a FIFO": tasks and LIFO work would be unbounded, so what would be the details of an epoch here vs. with a pre-emptive scheduler?

Quoting - ARCH ROBISON (Intel)

I wonder how much different an epoch-based scheme is from simply having both a LIFO and a FIFO, and giving the LIFO preference. I.e., the outer-most task graph would run FIFO. The LIFO would handle all the inner levels. That yields a system that would sort of be like a bunch of sequential communicating processes, where each process is running a traditional call stack.

I think that this scheme can still deadlock... unless I am missing something.

Assume that a 'critical task' (which will resolve the deadlock) is spawned at the "inner levels". If inner levels use LIFO then this task can never be executed. So to be fair, the scheduler must guarantee eventual execution of ALL tasks, spawned on any level, at any time.

A scheduler based on epochs provides such a guarantee.

"If inner levels use LIFO than this task can never be executed." What, no stealing anymore? Why not? Or is this a scenario where TBB would deadlock too?

Quoting - raf_schietekat

"If inner levels use LIFO than this task can never be executed." What, no stealing anymore? Why not? Or is this a scenario where TBB would deadlock too?

It can deadlock even with work stealing.

The most trivial situation is when we have only 1 thread (processor).

If we have more than 1 thread (processor), then assume all threads are livelocked by infinitely resurrecting tasks.

TBB will deadlock here.

Quoting - raf_schietekat

I wonder how radical a revision of the scheduler should be: somewhere between absolute fairness (everything in a global queue, which would be inefficient) and mere guaranteed progress (no global queue, where one thread can get behind while other threads are having a good time doing nothing in particular, all with maximum efficiency). But what elements should be taken into consideration? Maybe tasks can have different priorities (GUI focus, GUI non-focus, background computation), maybe they have soft real-time constraints, maybe they have ordering constraints, ...

I think the main requirement for "non-computational" tasks is that every task must be eventually executed.

First of all sorry for the weird stuff in some of my writings since the forum was changed (which also left me unable to go in and correct anything). With Firefox (1.5 and 2.0 alike, between Linux, OpenSolaris and Windows) it's a war zone you would not believe, yet the weird stuff that got inserted through copy&paste was invisible on it except for its side effect of causing wrapping at character boundaries. Apparently Intel has never heard of this strange little browser that nobody uses (imagine smiley for irony here), but it also has to be said that Firefox gets in a strange trance when visiting the forum, forcing me to kill it (no site should be able to do that; in its defence, I cannot remember having experienced this with Firefox before).

Also, coming back to: "If inner levels use LIFO then this task can never be executed.", my reply "What, no stealing anymore?" was rather thoughtless. Here's a new attempt:

Considering that TBB is for optional parallelism, where programs must make no assumptions about the number of available threads, would any existing program written for it deadlock with the new scheme, and why? Would any programs written for the new scheme deadlock, and why? I guess I don't understand exactly what you mean with "critical task (which will resolve deadlock)" etc. Or is optional parallelism itself a problem?

Quoting - raf_schietekat

Considering that TBB is for optional parallelism, where programs must make no assumptions about the number of available threads

While parallelism is optional, order of execution of tasks still has significance.

A program can work without parallelism with one order of execution, and the same program can deadlock without parallelism with another order of execution.

Quoting - raf_schietekat

would any existing program written for it deadlock with the new scheme, and why?

NO!

Some programs will deadlock with the current TBB scheduler. But it's NOT a bug in TBB. Such programs are just not supported (supporting such programs WOULD sacrifice performance for other programs).

So the main question is whether it's possible to support those programs too with strictly insignificant performance degradation for other programs.

Quoting - raf_schietekat

Would any programs written for the new scheme deadlock?

YES.

Quoting - raf_schietekat

and why? I guess I don't understand exactly what you mean with "critical task (which will resolve deadlock)" etc. Or is optional parallelism itself a problem?

"critical task" is one which, for example, executes task_group_context::cancel_group_execution() (or manually interrupts execution of group of tasks) when it is executed. If it is NOT executed then other tasks will be executing infinitely (assume they recursively respawn themselves).

Quoting - Dmitriy V'jukov

Some programs will deadlock with the current TBB scheduler. But it's NOT a bug in TBB. Such programs are just not supported (supporting such programs WOULD sacrifice performance for other programs).

So the main question is whether it's possible to support those programs too with strictly insignificant performance degradation for other programs.

The more I think about the problem, the more I tend toward the conclusion that it's impossible to solve (at least in a single scheduler).

For every scheme I can come up with, I can also come up with a counterexample in which the scheme has unbounded space requirements and/or worse locality.

So solutions I see:

1. Don't support fairness at all (good variant - no need to change anything :) )

2. Just incorporate 2 schedulers: 1 LIFO + 1 FIFO. The scheduler is determined in the task description, or specified as a parameter to spawn(). A worker thread can just process N tasks from the LIFO scheduler, and then N tasks from the FIFO scheduler.
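(A hypothetical spelling of option 2; the enum, task type and spawn() overload below are invented for illustration, NOT TBB API.)

enum scheduling_policy { policy_lifo, policy_fifo };

struct my_task { virtual void execute() = 0; virtual ~my_task() {} };

// Routes a task either to the work-stealing LIFO scheduler or to the
// fair FIFO one; a worker alternates N tasks from each (sketch only).
void spawn(my_task& t, scheduling_policy p);

// Usage sketch:
//   spawn(recursive_decomposition_task, policy_lifo);  // HPC-style work
//   spawn(incoming_request_task,        policy_fifo);  // needs fairness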

Quoting - raf_schietekat

Is it too restricting to eliminate programs that make assumptions about order of execution beyond the dependencies expressed in the task tree?

I can't answer this question. This depends on TBB's aims. But the .NET TPL team takes it very seriously:

http://blogs.msdn.com/pfxteam/archive/2008/08/01/8800195.aspx

(also read user comments)

Quoting - raf_schietekat

A round-robin between LIFO and FIFO seems like a recipe for space requirement explosions: a worker thread should always first exhaust its own LIFO task tree before doing anything else (like stealing or executing FIFO tasks).

I meant that tasks in the FIFO scheduler can't spawn tasks in the LIFO scheduler, i.e. they are completely separate types and graphs of tasks. So space requirements are 2*N, so to say: N tasks in the LIFO scheduler and N tasks in the FIFO scheduler.

"I can't answer this question." Well, it was mostly a rhetorical question related to the sentence that follows it. I would add more explicitly that no program should assume exact LIFO for correctness (only for performance and cache locality). Likewise, when I say that the LIFO tree should be exhausted before anything else is done, that would only be an advice to the scheduler, but again a program should not rely on it: the fewer assumptions, the better. But other than that, the TBB project should go to work on adding forward progress, priorities, time limits (best effort), etc.

Separating LIFO and FIFO seems rather limiting, so I would allow them to freely interoperate. The only thing that needs to be prevented is a scheduler-induced space requirement explosion (the user does not need to be protected from himself any more than for, e.g., normal recursive calls), and that seems to be assured by exhausting the local "LIFO" tree before executing any "FIFO" task.

Quoting - raf_schietekat

Separating LIFO and FIFO seems rather limiting, so I would allow them to freely interoperate. The only thing that needs to be prevented is a scheduler-induced space requirement explosion (the user does not need to be protected from himself any more than for, e.g., normal recursive calls), and that seems to be assured by exhausting the local "LIFO" tree before executing any "FIFO" task.

The problem is exactly when the local LIFO tree is infinite... so it will be quite difficult to exhaust it :)

"The problem is exactly when local LIFO tree is infinite..." I would consider that a bug like any old infinite loop, and not TBB's rightful concern. Any programs that are, e.g., running a dispatch loop or somesuch from inside a task for lack of specific support from TBB would continue to work; any adaptation of such programs needs specific attention to its pseudo scheduler anyway to either make any new parts tap into it or overhaul the whole thing. Did I miss anything?

Quoting - raf_schietekat

"The problem is exactly when local LIFO tree is infinite..." I would consider that a bug like any old infinite loop, and not TBB's rightful concern. Any programs that are, e.g., running a dispatch loop or somesuch from inside a task for lack of specific support from TBB would continue to work; any adaptation of such programs needs specific attention to its pseudo scheduler anyway to either make any new parts tap into it or overhaul the whole thing. Did I miss anything?

Then I don't understand what problem you are trying to solve. Can you clarify and maybe describe some examples?

Why do you want to reject LIFO in favor of FIFO?

Hey, the forum has eaten my reply (I think)! An attempt at reconstructing its main elements:

I am merely proposing a combination of LIFO and FIFO (both are needed, I'm rejecting neither), like Arch's post #2 above, but without any guarantees other than respecting user-specified task dependencies (support for diamonds would be nice), i.e., a program must not fail if the task graph is randomly rearranged (although it can be assumed for concerns of locality that this happens rarely if ever). Likewise, only forward progress would be guaranteed for queued tasks, not strict FIFO, which would be too expensive (an implementation could construct a queue of task buckets, which may be stolen by other worker threads). Queued tasks may already have LIFO dependencies before they are transferred to the current LIFO tree/dag, executing tasks can queue tasks as well as spawn dependencies, worker threads repeatedly grab a batch of queued tasks that are then executed to completion in LIFO order.

For this to work, tasks must execute to completion (including closure of dependencies) in a finite amount of time, just like serial programs, but any existing programs that don't know about the new TBB-provided FIFO functionality and cheat out of necessity by, e.g., serving a concurrent_queue from inside a long-running execute(), will continue to work, so the change would not be disruptive. Stopping a runaway execution can only be done synchronously (perhaps by polling), not by a scheduled task (neither spawned nor queued), or else by an external intervention, e.g., a tbb_thread, although this should probably be frowned upon, like using exceptions as part of the expected flow of control.

Is this clearer, even without examples?

Quoting - raf_schietekat

Hey, the forum has eaten my reply (I think)! An attempt at reconstructing its main elements:

I am merely proposing a combination of LIFO and FIFO (both are needed, I'm rejecting neither), like Arch's post #2 above, but without any guarantees other than respecting user-specified task dependencies (support for diamonds would be nice), i.e., a program must not fail if the task graph is randomly rearranged (although it can be assumed for concerns of locality that this happens rarely if ever). Likewise, only forward progress would be guaranteed for queued tasks, not strict FIFO, which would be too expensive (an implementation could construct a queue of task buckets, which may be stolen by other worker threads). Queued tasks may already have LIFO dependencies before they are transferred to the current LIFO tree/dag, executing tasks can queue tasks as well as spawn dependencies, worker threads repeatedly grab a batch of queued tasks that are then executed to completion in LIFO order.

For this to work, tasks must execute to completion (including closure of dependencies) in a finite amount of time, just like serial programs, but any existing programs that don't know about the new TBB-provided FIFO functionality and cheat out of necessity by, e.g., serving a concurrent_queue from inside a long-running execute(), will continue to work, so the change would not be disruptive. Stopping a runaway execution can only be done synchronously (perhaps by polling), not by a scheduled task (neither spawned nor queued), or else by an external intervention, e.g., a tbb_thread, although this should probably be frowned upon, like using exceptions as part of the expected flow of control.

Is this clearer, even without examples?

Not entirely...

If we are assuming that there are no 'infinitely resurrecting' tasks (i.e. infinite recursion), then forward progress and the guarantee that every task will eventually be processed are ensured by the current LIFO scheduler... unless I am completely missing something.

So if we add a portion of FIFO into the scheduler, it can only decrease locality of processing...

My initial intention wrt FIFO was to provide guarantees of eventual processing for ALL situations. Although now I see that it's impossible w/o substantial degradation of properties for 'normal well-behaved' tasks...

It is impossible for a purely LIFO scheduler to guarantee forward progress unassisted. Why not provide an integrated scheduler to more easily keep the worker threads busy while providing this guarantee?

Each FIFO batch would run in LIFO fashion to exploit locality.

Quoting - raf_schietekat

It is impossible for a purely LIFO scheduler to guarantee forward progress unassisted.

Why? The whole stack in the work-stealing deque will eventually be drained. This inevitably means that all tasks will be processed. What am I missing here?

Quoting - raf_schietekat

Each FIFO batch would run in LIFO fashion to exploit locality.

As I understand, locality is when you process a task AND its CHILDREN one after another, not when you process just some batch of tasks one after another.

"This invietably means that all tasks will be processed." Yes, but also that the LIFO scheduler isn't the whole story. How about a worthy superstructure? It takes some effort to provide efficient forward progress, not to mention handling priorities and timing constraints and whatnot. I would like a Building Block that handles that for me.

"As I understand, locality is when you process task AND its CHILDS one after another, not when you process just some batch of tasks one after another." Yes, "in LIFO fashion": while(true){ task::spawn_root_and_wait(new_batch); }

Quoting - raf_schietekat

"This invietably means that all tasks will be processed." Yes, but also that the LIFO scheduler isn't the whole story. How about a worthy superstructure? It takes some effort to provide efficient forward progress, not to mention handling priorities and timing constraints and whatnot. I would like a Building Block that handles that for me.

"As I understand, locality is when you process task AND its CHILDS one after another, not when you process just some batch of tasks one after another." Yes, "in LIFO fashion": while(true){ task::spawn_root_and_wait(new_batch); }

I still can't catch your point. Can you briefly outline high-level pseudo-code for the scheduler, please?

A worker thread always first exhausts its LIFO tasks.

A very simplistic ultra-fair but not terribly efficient addition would be a global queue. Just before stealing from another worker thread, a thread tries to grab a task from the global queue instead. This avoids interfering with locality, thus possibly improving throughput compared to stealing first.

Maybe just before that, a thread could try to steal from any thread that is falling behind based on the last time it tried to steal or grab a new task.

Communication costs could be amortised by also using (fixed-size) task buckets. This may mean that a task that is added later could be processed earlier than some tasks in preceding buckets, but there is still global progress (no task left behind).
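(As a sketch, that acquisition order might read as follows; all names are invented, and this is not TBB code.)

struct Task { virtual void execute() = 0; virtual ~Task() {} };

struct Worker {
    Task* pop_local_lifo();      // 1. own deque, newest first (locality)
    Task* steal_from_laggard();  // 2. help a thread that is falling behind
    Task* pop_global_fifo();     // 3. global queue; could take a whole bucket
    Task* steal_random();        // 4. random victim, oldest task first

    Task* next_task() {          // first rule that yields a task wins
        if (Task* t = pop_local_lifo())     return t;
        if (Task* t = steal_from_laggard()) return t;
        if (Task* t = pop_global_fifo())    return t;
        return steal_random();
    }
};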

Quoting - raf_schietekat

A worker thread always first exhausts its LIFO tasks.

A very simplistic ultra-fair but not terribly efficient addition would be a global queue. Just before stealing from another worker thread, a thread tries to grab a task from the global queue instead. This avoids interfering with locality, thus possibly improving throughput compared to stealing first.

Maybe just before that, a thread could try to steal from any thread that is falling behind based on the last time it tried to steal or grab a new task.

Communication costs could be amortised by also using (fixed-size) task buckets. This may mean that a task that is added later could be processed earlier than some tasks in preceding buckets, but there is still global progress (no task left behind).

Who adds tasks to the global queue, and when?

Btw, now this strongly recalls the .NET Task Parallel Library scheduler:

  1. A new thread pool thread needs to allocate its work stealing queue.
  2. When queuing a new work item, we must check to see if we're on a pool thread. If so, we will queue the work item into the work stealing queue instead of the global queue.
  3. When a pool thread looks for work, it needs to:
    • First consult its local work stealing queue.
    • If that fails, it then looks at the global queue.
    • Lastly, if that fails, it needs to steal from other work stealing queues.

http://www.bluebytesoftware.com/blog/2008/09/17/BuildingACustomThreadPoolSeriesPart3IncorporatingWorkStealingQueues.aspx

"Who and when adds tasks to global queue?" A task or a tbb_thread, basically when they detect work that is not a dependency of what they are currently doing (with some care, limited amounts of work can still be stacked locally, and a realistic solution would be more than just a global queue).

"Btw, now this strongly recalls .NET Task Parallel Library scheduler" I'm sure they also use two's complement numbers and other inescapable concepts. But I prefer being able to say I came up with them all by myself (also seems better for making a contribution), so I'll ignore this particular reference.

Quoting - Raf Schietekat

"Who and when adds tasks to global queue?" A task or a tbb_thread, basically when they detect work that is not a dependency of what they are currently doing (with some care, limited amounts of work can still be stacked locally, and a realistic solution would be more than just a global queue).

Now I am starting to get your point. I think this can make sense. For big "user-level tasks" (not TBB tasks) fairness is definitely good.

Though one must be very careful here: how are you going to detect whether a task depends on what the thread is currently doing? Currently I don't see any way...

Quoting - Raf Schietekat

"Btw, now this strongly recalls .NET Task Parallel Library scheduler" I'm sure they also use two's complement numbers and other inescapable concepts. But I prefer being able to say I came up with them all by myself (also seems better for making a contribution), so I'll ignore this particular reference.

Oh, I see... then to be honest you must not look into the current TBB scheduler implementation either :)

"Now I am starting getting your point. I think this can make sense. For big "user-level tasks" (not TBB tasks) fairness is definitely good." Well, with appropriate scheduling available there's no reason to make that distinction anymore.

"Though one must be very careful here, how are you going to detect whether task depends on what thread currently doing? Currently I don't see any ways..." I meant the other way around: if you don't "need" to wait for the new task to complete, then maybe you shouldn't.

"Oh, I see... then to be honest you must not look into current TBB scheduler implementation too :)" Hmm, if I want to contribute to something, I had better know how it works now.

Quoting - Raf Schietekat

"Now I am starting getting your point. I think this can make sense. For big "user-level tasks" (not TBB tasks) fairness is definitely good." Well, with appropriate scheduling available there's no reason to make that distinction anymore.

It seems that the conversation is starting to cycle :)

Total fairness (FIFO) is not compatible with locality. Absence of locality is not compatible with performance. Hence total fairness is not compatible with performance.

What am I missing here?

Quoting - Raf Schietekat

"Though one must be very careful here, how are you going to detect whether task depends on what thread currently doing? Currently I don't see any ways..." I meant the other way around: if you don't "need" to wait for the new task to complete, then maybe you shouldn't.

Assume one is implementing a fork/join algorithm, but with an empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. the user will catch an out-of-memory exception).

"What I am missing here?" Perhaps that tasks can be the common currency for things that need locality for performance and things that need forward progress?

"Assume one implementing fork/join algorithm, but with empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. user will catch out-of-memory exception)." It's not an algorithm, it's a guideline for a human programmer, who should recognise this situation as a need to wait.

Quoting - Raf Schietekat

"What I am missing here?" Perhaps that tasks can be the common currency for things that need locality for performance and things that need forward progress?

Please rephrase the above somehow. I am unable to catch the meaning :(

Quoting - Raf Schietekat

"Assume one implementing fork/join algorithm, but with empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. user will catch out-of-memory exception)." It's not an algorithm, it's a guideline for a human programmer, who should recognise this situation as a need to wait.

Waiting does introduce overheads too (reference counting for the 'parent' task, a second execution of the parent, etc). So it looks like trading bad for worse...

"Please, rephrase above somehow. I am unable to catch the meaning :(" Whether you need to do something that needs locality for performance, or something that needs forward progress, they're all tasks. The difference is in how you allocate/spawn them so that the scheduler knows how to handle them.

"Waiting do introduces overheads too (reference counting for 'parent' task, second execution for parent etc). So it looks like trading bad for worse..." Hmm, the opportunity for a simple optimisation hardly seems worse than a memory explosion.

Quoting - Raf Schietekat

"Please, rephrase above somehow. I am unable to catch the meaning :(" Whether you need to do something that needs locality for performance, or something that needs forward progress, they're all tasks. The difference is in how you allocate/spawn them so that the scheduler knows how to handle them.

Are you talking about something similar to what I've described in #12:

http://software.intel.com/en-us/forums/showpost.php?p=66177

(item 2)

?

Quoting - Raf Schietekat

"Waiting do introduces overheads too (reference counting for 'parent' task, second execution for parent etc). So it looks like trading bad for worse..." Hmm, the opportunity for a simple optimisation hardly seems worse than a memory explosion.

Currently with the LIFO scheduler there is no memory explosion. And no obligation to wait for task completions.

"Are you talking about something similar to what I've described in #12" No round robin.

"Currently with LIFO scheduler there is no memory explosion. And no need to obligatory wait for task completions." Oh yes... Well, luckily I did say "maybe" before to leave an out for myself. :-) So the guideline should be "whichever style avoids the memory explosion should be chosen", which can then be elaborated further.

Quoting - Raf Schietekat

"Are you talking about something similar to what I've described in #12" No round robin.

Well, if the user manually and explicitly distinguishes which tasks require FIFO and which are OK with LIFO, then I think it's a viable idea.

Currently I am sceptical only about automatic distinguishing based on some heuristics.

I wrote: "which can then be elaborated further". And before you challenge me on that, it is meant to imply that something like a chain reaction should be scheduled for LIFO, because the need to keep memory requirements down to O(log n) (right?) sort of implies the soft kind of "wait" I intended, i.e., depth-first scheduling, while external inputs (with externally imposed granularity) in a continuous process imply FIFO even if only for the simple reason that starving earlier tasks would cause them to pile up (fairness for self-preservation instead of for its own sake). Well, the effect of FIFO here is only partial of course: finishing up business with your earlier customers first gets them out of the store faster, which does clear up space, but without throttling you can still be overrun.

Just an opinion as a library user, programming a non-HPC application. Instead, I'm working on a discrete simulation server.

After I read that tbb::task works in a simple LIFO manner (of course optimized and all), I decided on the spot it wasn't for me.

Which is good because I knew immediately that it wouldn't work in my scenario, so I ended up writing more or less a custom task manager (FIFO + epochs) that suits my needs.
Basically I accumulate tasks into epochs (within which task order is required to be irrelevant) and then operate on the epoch in a random manner (which just happens to be LIFO). But I do want the epochs to be FIFO of course, as time is supposed to flow forward :)
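(That epoch scheme might be modeled roughly as below; the types are invented, not the poster's actual code. New work always joins the newest epoch, while execution always drains the oldest one.)

#include <deque>
#include <vector>

struct Task { virtual void execute() = 0; virtual ~Task() {} };

class EpochManager {
    std::deque<std::vector<Task*> > epochs;  // front = oldest epoch
public:
    void submit(Task* t) {              // new work joins the NEWEST epoch
        if (epochs.empty()) epochs.push_back(std::vector<Task*>());
        epochs.back().push_back(t);
    }
    void run_one_epoch() {              // drain the OLDEST epoch entirely
        if (epochs.empty()) return;
        std::vector<Task*> e;
        e.swap(epochs.front());
        epochs.pop_front();
        epochs.push_back(std::vector<Task*>());  // back-burner for new work
        while (!e.empty()) {            // within-epoch order is free (LIFO)
            Task* t = e.back(); e.pop_back();
            t->execute();               // may submit() into the next epoch
        }
    }
};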

On the other hand if tbb::tasks were processed in a pure FIFO manner, I probably wouldn't have used it either :)

Basically what I'm trying to say is that anyone working on a non-HPC simulation will likely not use a default task manager anyway, simply because it can't possibly map directly to their needs. So I don't think tbb needs a FIFO manager; it would be an addition probably no one would use.

"So I don't think tbb needs a FIFO manager, it would be an addition no one probably use." Well, LIFO isn't strictly LIFO either, just a general characterisation, but I would not rule out the occasional use for strict FIFO. (Note that I only mentioned the idea of a global queue to avoid the nitty-gritty of implementation details, not to suggest that it might somehow provide strict FIFO.)

Quoting - Raf Schietekat

"So I don't think tbb needs a FIFO manager, it would be an addition no one probably use." Well, LIFO isn't strictly LIFO either, just a general characterisation, but I would not rule out the occasional use for strict FIFO. (Note that I only mentioned the idea of a global queue to avoid the nitty-gritty of implementation details, not to suggest that it might somehow provide strict FIFO.)

Hmm, I seem to be a bit careless lately: I linked this sentence with the previous paragraph. Now I would have to simply disagree: forward progress is a very simple and general concept to start with.

Quoting - robert.jay.gould

On the other hand if tbb::tasks were processed in a pure FIFO manner, I probably wouldn't have used it either :)

Basically what I'm trying to say is that anyone working on a non-HPC simulation will likely not use a default task manager anyway, simply because it can't possibly map directly to their needs. So I don't think tbb needs a FIFO manager; it would be an addition probably no one would use.

That pretty much sums up the difficulty. If we are going to add support in TBB for this sort of thing, the support has to be general enough to be useful to many users. Furthermore, it has to survive composition in the following sense: if two modules A and B each have their own custom scheduling, they need to be able to coexist in the application.

I've occasionally thought about this from two angles. One is adding a bottom-level FIFO to the scheduling rules. Each thread would follow the first rule below that applies. Items 3 and 4.1 are the additions to the current rules.

  1. Pop a task from my deque (LIFO)
  2. Pop a task from my affinity mailbox
  3. Pop a task from my queue (FIFO)
  4. Steal a task from a randomly chosen victim.
    1. Take from the victim's queue (FIFO) if it is non-empty
    2. Otherwise take from its deque (FIFO).

But because victims are randomly chosen, this scheme does not guarantee fairness, which is likely to make it unacceptable to many users. Alas, strictly fair shared FIFOs are probably not scalable without hardware support for scalable fetch-and-add.

The other angle I've considered, and never completely baked, is "virtual steal". Currently, a thief removes a stolen task from its victim's task pool. But what if there were another kind of task where the stealing action could be customized? It could, for example, not remove the stolen task, and instead execute a special virtual function defined by the task. Some examples of how this might be used:

  • The task could contain a queue, and the "steal" would be defined to pop an element from the queue. So the original task would continue to sit in its thread's pool, until the pool's thread popped it.
  • The special task could generate a clone of itself in response to a steal unless some condition was met. This has the general behavior of causing each idle thread to get a clone until the condition is met.

I have not thought through all the implications, either for implementation or usage models. It seems like a hook worth thinking about some more, though it may still be too LIFO-ish to be generally useful.

Would either of these hooks do any good for your simulator, or are they still too far off the mark? I appreciate that a library cannot hope to anticipate all possible wants, though I'd like to anticipate more.
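(One way to picture the "virtual steal" hook; the names are invented and this is not a TBB interface. queue_server corresponds to the first bullet above.)

#include <queue>

struct task { virtual void execute() = 0; virtual ~task() {} };

// A task whose "steal" behaviour is customizable: instead of removing
// it, a thief calls on_steal() and runs whatever that returns.
struct stealable_task : task {
    virtual task* on_steal() = 0;   // what the thief gets; 0 = nothing
};

// The task wraps a queue, each steal pops one element, and the
// wrapper itself stays in its owner's pool until the owner pops it.
struct queue_server : stealable_task {
    std::queue<task*> items;
    task* on_steal() {
        if (items.empty()) return 0;
        task* t = items.front(); items.pop();
        return t;
    }
    void execute() { /* owner-side processing of remaining items */ }
};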

There are degrees of fairness. Simply doing "pop front, steal back" (the original Cilk?) already provides forward progress and has the power of conceptual elegance. The first "angle" you describe is slightly fairer by splitting up the deque into a front and a queue-only back (plus the details of the affinity mailbox, which I have not yet examined). Neither will directly support Robert's simulator's concept of epochs, but the mere fact that not everyone will immediately be fully gratified is no reason to hold back on the basics. I am somewhat bewildered by what the "other angle" is supposed to achieve.

Sorry for accidentally giving five stars to my own message just above (why does the forum even allow that?), I was just looking around for a way to edit.

Actually, perhaps Robert might instead prefer a more scalable way to spawn lots of tasks at once, by proactively/recursively distributing them among the worker threads eagerly waiting behind the gate, instead of letting the workers steal tasks one by one? But the reality with TBB now is probably that the user deals with such issues by, e.g., running parallel_for on a concurrent_vector, which seems fine for problems with implicit barriers (like a simulation), maybe even better (by amortising allocation overhead), but "suboptimal" otherwise.

But maybe I should not be leading the witness...

Quoting - Arch Robison (Intel)

The other angle I've considered, and never completely baked, is "virtual steal". Currently, a thief removes a stolen task from its victim's task pool. But what if there were another kind of task where the stealing action could be customized? It could, for example, not remove the stolen task, and instead execute a special virtual function defined by the task. Some examples of how this might be used:

  • The task could contain a queue, and the "steal" would be defined to pop an element from the queue. So the original task would continue to sit in its thread's pool, until the pool's thread popped it.
  • The special task could generate a clone of itself in response to a steal unless some condition was met. This has the general behavior of causing each idle thread to get a clone until the condition is met.

I have not thought through all the implications, either for implementation or usage models. It seems like a hook worth thinking about some more, though it may still be too LIFO-ish to be generally useful.

Would either of these hooks do any good for your simulator, or are they still too far off the mark? I appreciate that a library cannot hope to anticipate all possible wants, though I'd like to anticipate more.

That idea does sound very interesting indeed. Currently I create transactions on the stack that copy the data to be modified, and roll back if there was an error (using RAII, with the transaction's destructor doing the rollback). So if the task could throw or had some other way to fail, it could easily roll back and nothing would get broken. A combination of virtual stealing and rollbacks would be a good solution.

As for stealing, currently most of my tasks are grouped in "families of tasks". There are several "pipeline-like" stages. But it's not a real pipeline; it's more like a state machine. So the state machine's states progress in a FIFO manner, but each state processes data in LIFO. And once the state machine completes a cycle, the epoch gets kicked up and all new pending work gets pushed into the machine. In my case I send newly spawned tasks back to the back-burner to be processed during the next epoch, or to a separate thread if it's IO (which will create a new task when it's finished), but I could imagine other people would want to have an option to add tasks to the current LIFO queue. So both options would be needed.

Another solution I'm aware of is to "waterfall"; this way tasks get into groups, with each group depending on an ancestor in a tree-like fashion. Kind of like several state machines grouped together. The paradigm isn't the fastest, but it's very easy for programmers to understand and provides a structured flow to the program that is easy to grasp. An example I looked into was Emergent's Gamebryo, which has a system that does this called Floodgate.

Quoting - Raf Schietekat

Sorry for accidentally giving five stars to my own message just above (why does the forum even allow that?), I was just looking around for a way to edit.

Actually, perhaps Robert might instead prefer a more scalable way to spawn lots of tasks at once, by proactively/recursively distributing them among the worker threads eagerly waiting behind the gate, instead of letting the workers steal tasks one by one? But the reality with TBB now is probably that the user deals with such issues by, e.g., running parallel_for on a concurrent_vector, which seems fine for problems with implicit barriers (like a simulation), maybe even better (by amortising allocation overhead), but "suboptimal" otherwise.

But maybe I should not be leading the witness...

Indeed I use parallel_for for most things when I know tasks are of about equal weight; the issue comes when some tasks take longer than others, so simply slicing them up fairly doesn't assure good distribution. So what I do is have several vectors that hold tasks of the same "family": not exactly the same type, but taking roughly the same number of instructions to complete. And when a vector of a task family gets processed, the next one starts (within a single state/stage/phase); when all tasks are finished I move them to the next state/stage/phase, and when all phases are complete I start processing the next epoch.

Anyways the system scales nicely right now, not perfectly linearly, but since each task is independent within the same "phase" it scales with about a 90% slope. The problem is I need one epoch to complete within a timeframe, so I use multiple machines and "hash" new tasks to each machine, based on a function of source id, locality and number of machines. This way most machines get a similar load, and cross communication (locality) is somewhat assured, and since I "hash" I can easily add or remove new machines. But that's grids and not really multi-threading.

And it gets quite complicated :)

I think the problem is many people are beginning to look into threading very complicated systems. So general algorithms aren't going to be a perfect match, but TBB could possibly provide a few more constructs somewhere between the task manager and parallel_for; then I'm sure most programmers could figure out ad-hoc solutions more easily.

"simply slicing them up fairly doesn't assure good distribution" Normally parallel_for should gracefully deal with dissimilar execution times (the power of recursive parallelism, although I have not yet examined auto_partitioner's strategy), so I don't see what good it does to separate tasks by "family" now. Maybe what you want is to have the big grains executed first and the idle workers start processing smaller grains next so that all workers arrive at the phase barrier at approximately the same time, but that seems like just another thing the scheduler doesn't do for you (yet)?

"a few more constructs" Global progress, priorities (the problem above is solved by giving bigger grains a higher priority), ... Any other widely applicable ideas, anyone?

"priorities" Please ignore (for being too vague at this time).

Quoting - Raf Schietekat

"simply slicing them up fairly doesn't assure good distribution" Normally parallel_for should gracefully deal with dissimilar execution times (the power of recursive parallelism, although I have not yet examined auto_partitioner's strategy), so I don't see what good it does to separate tasks by "family" now. Maybe what you want is to have the big grains executed first and the idle workers start processing smaller grains next so that all workers arrive at the phase barrier at approximately the same time, but that seems like just another thing the scheduler doesn't do for you (yet)?

"a few more constructs" Global progress, priorities (the problem above is solved by giving bigger grains a higher priority), ... Any other widely applicable ideas, anyone?

Precisely: big grains/tasks first, and then work their way down to smaller tasks, so hopefully no thread gets to idle at the barrier. And yes, this is something a custom piece-meal scheduler would likely do for me if it could be set up that way.

As for making such a piece-meal task scheduler, I think tbb's pipeline provides a good example for an API, just that instead of having threads working on all stages simultaneously, they would all work together to complete each stage, then move on to the next, and looping back to the first stage would probably occur in a manner similar to how you recycle tasks in the task_manager right now.

A very rough design could be a phased/staged scheduler, to which the user adds classes derived from, say, "tbb::stage" (like tbb::filter), and then strings them together like with pipeline, to define a custom state machine. And, in my case at least, as with pipeline, the compromise of linear execution against complex piping would be acceptable. Now if one really wants to get fancy, the construct could provide the ability to define state machines inside of state machines (a Hierarchical State Machine), probably making it even more user friendly. And inside each stage TBB would be allowed to execute tasks in any order it thinks will provide optimal performance.
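(A rough interface sketch of such a staged scheduler, by analogy with tbb::pipeline and tbb::filter; stage and stage_machine are invented names, not TBB API.)

struct stage {
    virtual ~stage() {}
    // Process one item; within a stage the scheduler may run items in
    // any order (e.g. LIFO) and in parallel across worker threads.
    virtual void process(void* item) = 0;
};

struct stage_machine {
    void add_stage(stage& s);  // stages run strictly one after another
    void run();                // barrier between stages; one call = one cycle
};

// Usage sketch:
//   stage_machine m;
//   m.add_stage(update_positions);   // all threads finish this stage...
//   m.add_stage(resolve_collisions); // ...before any thread starts this one
//   for (;;) m.run();                // each cycle = one epoch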

Robert, as Raf absolutely correctly said, TBB parallel algorithms work very well with ranges that have non-uniform work distribution. I would even go as far as to claim that ensuring good load balance in the case of non-uniform work distribution was the primary design goal of TBB algorithms. Advanced partitioners (auto_partitioner and affinity_partitioner) allow keeping grains coarse enough most of the time (reducing the parallel overhead) and dynamically fall back to finer granularity when it is necessary to maintain the load balance (usually when the last chunks are processed).

Affinity partitioner will be useful when your work fits in the total cache capacity of the machine (of all cores in all sockets) and when two or more parallel algorithms are sequentially applied to the same workload. Both conditions must be met to achieve a gain in comparison with auto partitioner.

Preprocessing data in order to form ranges with more uniform work distribution entails the following negative effects:

  • Preprocessing usually constitutes overhead by itself
  • If it is done serially then Amdahl's law takes its toll on your app. (In your case you probably can do most of it on the fly during the previous parallel stage, or you could use pipelining to alleviate the problem.)
  • In general the resulting ranges are still non-uniform (though in some particular cases, e.g. when there are only 2 or 3 kinds of data elements, one can achieve good uniformity). BTW, even if processing each data element takes an equal number of the same instructions, uniformity may easily be broken by difficult-to-predict-and-account-for cache and memory effects (especially if your data elements are more sophisticated than mundane dense matrices, e.g. involve pointers).
  • The most nocuous one: since the imbalance of parallel_for is born mostly at the end of the range processing, its impact tends to decrease as the range size (amount of work) increases. When you split one big range into several smaller ones, you introduce new serialization points, and each of them contributes some imbalance of its own. Thus the cumulative imbalance you wind up with will normally be higher than what you could get by simply running the parallel algorithm on the original big range.
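(For reference, selecting partitioners with the then-current TBB API; Body and the loop body are placeholders, and an active task_scheduler_init is assumed.)

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct Body {                       // placeholder per-element work
    float* data;
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            data[i] *= 2.0f;
    }
};

void example(float* data, size_t n) {
    Body body = { data };
    // auto_partitioner: coarse grains, splitting further only on demand.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), body,
                      tbb::auto_partitioner());
    // affinity_partitioner: reuse the SAME object across repeated calls,
    // so iterations replay onto the threads whose caches hold the data.
    static tbb::affinity_partitioner ap;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), body, ap);
}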
