Need help with pipeline deadlock


Quoting - Raf Schietekat
I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

We do steal FIFO; who said the opposite? The problem of only supporting finite workloads is different than that.

And we are not in a panic; nor are we bolting off "in the opposite direction". Support for multiple master threads has been in TBB since the very first version; it is not being invented to solve the problem that initiated this thread, as you might think. We do not see a reason why customers should not start TBB algorithms from different threads if they wish, so this mode is supported. The plan Arch proposed is intended to improve this support, because it's not the first time our customers have been surprised that two independent jobs (I will call a set of tasks to solve some problem a job) become "tightly coupled" inside TBB so that they only finish together. The algorithm Dmitriy proposed is also intended to "decouple" what should not be coupled; we are considering something like that as well.

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism: spawn a task for each job, let the task be stolen, and have the thief start the algorithm; this will likely be the solution we implement first of all. However, it does not suit the additional requirement of making simultaneous progress in each of the jobs, because in general it does not provide such a guarantee if there are more jobs than available HW threads.

As I already said, the requirement of making simultaneous progress means that either the jobs are not really independent, or for some reason they require mandatory concurrency (one may argue that it is essentially the same in this case: the jobs are not really independent, and thus require mandatory concurrency, as e.g. in a producer-consumer pattern). Mandatory concurrency does not scale to thousands of jobs. I can't imagine the need to start 100000 pipelines simultaneously and require that every single one constantly make progress. If they are truly independent, it should not matter in which order the execution happens (and then this thing would be easy to convert into nested parallelism). Otherwise, the whole thing had better be reworked to group dependent jobs together.

Quoting - Raf Schietekat

"Please, describe how a "never-ending" story can happen if we apply my proposed patch? (I don't see it)" New master steals task, task spawns children and waits_for_all(), child spawns grandchildren and waits_for_all(), some descendant gets the brilliant idea to run an unbounded client-server operation, ...

Damn! You are 200% right!

Blocking wait for children is so awful that I even forgot about it. In my library I just don't support blocking waits, which solves many problems. And IIRC the original Cilk automatically transforms blocking waits into continuations...

Hmmm... But how will FIFO help here? If a task uses a blocking wait, then you can't return from it until all its child tasks complete. You can't force it anyhow. Your FIFO trick will work only if tasks don't use blocking waits either. No?

Quoting - Raf Schietekat
"NO! Master thread is free to exit the game as soon as he wants (even during "never-ending" story)." How would you abandon a stolen task's execute() before it returns? Are we still talking about 1:1 schedulers?

I can't.

So the only solution I see for now is Arch's proposal about "Mini Concurrency Runtime"...

Quoting - Raf Schietekat
"FIFO-capable scheduler won't help in this situation (IIRC we were already here). Master thread can be executing an infinite sequence of tasks in any order, but if it is not checking the root task's reference counter it will be in an infinite loop!" We already were here, but it seems that some additional effort is required to convince you. :-) With a FIFO-capable scheduler, we wouldn't be using a master thread for running the pipeline in the first place

I think it's irrelevant to scheduling order. A LIFO scheduler can gain from continuation style the same way a FIFO scheduler can.

Quoting - Raf Schietekat
(you can do it all in one thread if need be). It is probably worthwhile repeating that: running 10000 pipelines at the same time, i.e., with overlapping activation periods, does not mean mandatory concurrency in the sense of requiring multithreading (whether they be kernel threads or user threads or a combination) and/or parallelism (multiple cores and/or hyperthreading), and all "context switches" are where the user chooses them to be.

Hmmm... I have to think more on this... but if tasks use blocking waits... I think they can blow up the thread's stack... or be subject to induced deadlocks...

Quoting - Raf Schietekat

"This is a different problem, and I am NOT trying to solve it now." It was meant as a general remark, but you're also thinking threads instead of tasks here.

"If you have 10000 threads with independent work, then why do you need TBB at all?! You already have sufficient parallelism. Just execute each piece of work in its own thread." The thing is that I don't want 10000 threads, only as many worker threads as will keep the hardware busy.

It's you who mentioned threads first. If you don't want 10000 threads then don't start them. Start pipelines one-by-one (or N-by-N). I think it is not a very good idea to start 10000 root tasks simultaneously anyway.

Quoting - Alexey Kukanov (Intel)

The algorithm Dmitriy proposed is also intended to "decouple" what should not be coupled; we are considering something like that as well.

As Raf noted, there is an inherent problem with blocking waits in my proposal. Basically, if a task executes wait_for_all() in the middle of its execute() method, then only the Almighty can force it to return from execute().

I will think more on this, but initially I don't see any potential fixes for my proposal.

Quoting - Alexey Kukanov (Intel)

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism, by spawning a task for each job, let the task be stolen, and an algorithm started by the thief; and this will likely be the solution we will implement first of all.

Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?

Quoting - Alexey Kukanov (Intel)

As I already said, the requirement of making simultaneous progress means that either the jobs are not really independent, or for some reason they require mandatory concurrency (one may argue that it is essentially the same in this case: the jobs are not really independent, and thus require mandatory concurrency, as e.g. in a producer-consumer pattern). Mandatory concurrency does not scale to thousands of jobs. I can't imagine the need to start 100000 pipelines simultaneously and require that every single one constantly make progress. If they are truly independent, it should not matter in which order the execution happens (and then this thing would be easy to convert into nested parallelism). Otherwise, the whole thing had better be reworked to group dependent jobs together.

I also think that the number of simultaneously running root tasks (or jobs) must be determined by the library, because in some sense it's the same as the number of threads.

The interesting area for research in this context is:

1. If there are too many master threads (or just too many executing root tasks, if asynchronous submission of root tasks is supported), temporarily hold up newly arriving master threads - just block at the beginning of spawn_root_and_wait(). The obvious problem here is that obligatory concurrency (which the user may be relying on) is not respected, which can lead to deadlocks - special care is required to prevent them.

2. ... Damn! I've already forgotten it while writing the first one...

Quoting - Dmitriy Vyukov

I also think that the number of simultaneously running root tasks (or jobs) must be determined by the library, because in some sense it's the same as the number of threads.

The interesting area for research in this context is:

1. [...]

2. ... Damn! I've already forgotten it while writing the first one...

Oh Ok, I've remembered.

2. Assume we have N logical processors, and we already have N simultaneously working master threads. Well, we can just turn off all worker threads and let each master thread do its own root task from beginning to end. No stealing, no decrease in locality, no oversubscription, no ..., only perfect utilization of the hardware.

But the process is still under control. I.e. if the number of master threads decreases, or the application changes, or the hardware platform changes, the run-time is ready to take everything into its own hands.

Quoting - Dmitriy Vyukov

Quoting - Alexey Kukanov (Intel)

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism, by spawning a task for each job, let the task be stolen, and an algorithm started by the thief; and this will likely be the solution we will implement first of all.

Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?

It probably does most of it, but for the sake of ease of use it should be wrapped into something with a nice name that accepts lambdas :)

Just a few reactions at this time (it's Friday night for me):

Alexey: "We do steal FIFO; who said the opposite?" Maybe having another look at the code would convince you?

Alexey: "The problem of only supporting finite workloads is different than that." I must apologise here. I have been fretting ever since I wrote this what exactly was the problem with stealing LIFO. Maybe it's about fairness, maybe it's in conjunction with using a queue, but now I wish I had removed it again because I don't remember exactly and I don't want to be thinking about it now. If I don't come up with a rationale at some later point, feel free to rub it in, though. :-)

Alexey: "Support for multiple master threads is in TBB since the very first version" Sure, but I'm referring to it as a weakness here (because TBB simply has no other way to run multiple pipelines), fraught with problems that now need a hurried fix.

Dmitriy: "It's you who mention threads first. If you don't want 10000 threads then don't start them. Start pipelines one-by-one (or N-by-N). I think that is not very good idea to start 10000 root tasks simultaneously anyway." But I didn't start 10000 threads, the scheduler takes care of running those 10000 pipelines for me, whether it has 1 core and 1 thread or 16 cores and 32 threads to play with. Note that this is not something TBB can do at this time.

Dmitriy: "Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?" If you don't mind spending a lot of time waiting...

Quoting - Raf Schietekat
Just a few reactions at this time (it's Friday night for me):

It's Friday night for me too. Actually, already Saturday :)

Quoting - Raf Schietekat
Alexey: "We do steal FIFO; who said the opposite?" Maybe having another look at the code would convince you?

Oh, yeah. At the same level of depth, we steal LIFO, you are right. This is going to be fixed.

Quoting - Raf Schietekat
Alexey: "Support for multiple master threads is in TBB since the very first version" Sure, but I'm referring to it as a weakness here (because TBB simply has no other way to run multiple pipelines), fraught with problems that now need a hurried fix.

As I said, some customers want to run TBB algorithms from separate threads, for whatever reasons. One of the reasons might be, e.g., that they have already applied functional thread-level parallelism and are not going to change it for the moment, while they could exploit inner-level data parallelism in those threads. There could be plenty of other reasons. And TBB was designed to coexist with other threading approaches. You might think of supporting multiple masters as a weakness; I prefer to think of it as support for customer needs.

And I do not understand who is "fraught with problems that now need a hurried fix" - you, or do you mean us? For support of running multiple pipelines, see previous posts and below.

Quoting - Raf Schietekat
But I didn't start 10000 threads, the scheduler takes care of running those 10000 pipelines for me, whether it has 1 core and 1 thread or 16 cores and 32 threads to play with. Note that this is not something TBB can do at this time.

Why not? Run parallel_for(blocked_range&lt;int&gt;(0,10000), MyPipelineStarter()); this won't provide simultaneous progress in each one, but I have already said enough on this.

Quoting - Raf Schietekat
Dmitriy: "Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?" If you don't mind spending a lot of time waiting...

Not waiting, but executing the job that needs to be executed. That's what relaxed sequential execution is about: you need to do some job now, and you do it, but if spare resources are available, they are used as well.

If you want to offload the jobs and do something else, and then come back and check whether the job is done, usually it means that there are some other threads doing the job asynchronously. TBB supports this model as well, though currently on the task level only, if you don't mind waiting (i.e. calling wait_for_all) at the check point, which is the usual case with asynchronous execution. At that point, if the job was not done, the originating thread will do it itself.

But if you do mind waiting at any given moment, now or later, it's called mandatory concurrency. Surprise, TBB supports this as well - via starting a separate tbb_thread :)

What TBB does not support is "let me offload the job and do something else, then maybe do a piece of the job expected to be done by workers, then do something else again, and so on". There is no way back from wait_for_all until all existing work is done. Are you looking for that?

"I prefer to think of it as support of customer needs." TBB supports both task-based concurrency and thread-based concurrency... as long as the customer chooses the latter to run multiple pipelines even if he doesn't want to... but it doesn't actually work anyway.

'And I do not understand who is "fraught with problems that now need a hurried fix" - you, or you think us?' The support for multiple master threads is fraught with problems.

"Why not? Run parallel_for(blocked_range(0,10000), MyPipelineStarter()); this won't provide simultaneous progress with each one, but I already said enough on this." Well, that's what I'm saying: TBB doesn't support simultaneous progress yet (as in overlapping activation periods of optional concurrency as I defined it earlier).

"But if you do mind to wait at any given moment, now or later, it's called mandatory concurrency." No, it's called wasting performance by mandatory idle waiting if the only way to provide simultaneous progress is to loop over spawn_root_and_wait_for_all() in whatever guise. Even if I do the work of retrofitting pipeline with a perform_work() method (like ::CORBA::ORB's, plus its invisible ramifications), and register my 10000 pipelines with a manager, looping over "parallel_for(blocked_range(0,10000), myManager);" may spend much of its time waiting at the gate, unless I pay the price associated with a low-enough grain size (and the more processors I have, the lower the grain size needs to be). And I have to do it that way, because my only alternative is to run each pipeline in its own thread... but then I can't use TBB's pipeline anyway, because 10000 of those would only trip over each other's shoelaces.

It doesn't have to be like that. Please take the next logical step. Proposing thread partitioning is embarrassing.

Quoting - Raf Schietekat
"But if you do mind to wait at any given moment, now or later, it's called mandatory concurrency." No, it's called wasting performance by mandatory idle waiting if the only way to provide simultaneous progress is to loop over spawn_root_and_wait_for_all() in whatever guise.

I did not mean the current TBB state when I wrote the above quoted sentence. I meant the general situation when some job was offloaded and (since there is no desire to wait) is expected to be performed by someone else (another thread, usually).

Also, at the moment I wrote that, I did not fully realize that the key point is the potentially infinite nature of the data source for pipelines, which you probably had in mind from the beginning; hence the misunderstanding. Everything I said before was applicable to pipelines with a priori known finiteness of data.

Quoting - Raf Schietekat
TBB supports both task-based concurrency and thread-based concurrency... as long as the customer chooses the latter to run multiple pipelines even if he doesn't want to... but it doesn't actually work anyway.

Correction: multiple pipelines that never end unless stopped externally, and so are desired to provide simultaneous progress. I agree we do not yet do it right. Still, I think these infinite-work pipelines are rather for a design with asynchronous agents than for a (relaxed) sequential program. And I agree that TBB does not have support for (a big number of) asynchronous agents either.

Quoting - Raf Schietekat
The support for multiple master threads is fraught with problems.

If we had tried to make TBB ideal from the very beginning, either there would still be no TBB, or rather it would have come somewhat later and not been ideal anyway, because we simply cannot foresee all the ways customers try to use it. Though the need to isolate masters from each other might have come to our mind.

We might as well adjust the number of active workers depending on the number of active masters.

By the way, the topic starter did not explain why he starts pipelines from different threads; it might be because there is no alternative, but it might as well be because his program is designed this way for whatever other reasons.

My point is that asynchronous agents, or at least a useful subset of them, including pipelines with overlapping activations, can and should be served by a relaxed sequential scheduler.
