Need help with pipeline deadlock

Hi everyone,

I have a problem concerning proper termination of pipelines and I'm running out of ideas on where the problem might be.

The setup is as follows. I have one class that handles loading and decompressing of video streams. This process is realised via the TBB pipeline mechanism. Once the start() method of my class is called, it begins loading and decompressing frames and delivers them to a follow-up component.

Calling start on the class spawns a thread which initializes the scheduler and calls pipeline.run() (which blocks until the pipeline is done). Calling stop on the class tells the first tbb::filter to stop loading frames and to return NULL (to terminate the pipeline). There are multiple instances of this class running (with different input streams).
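
For concreteness, here is a minimal sketch of that setup; everything except the TBB names is illustrative, and the filter bodies are elided:

#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/atomic.h"

class SourceFilter : public tbb::filter {
public:
    SourceFilter() : tbb::filter(/*is_serial=*/true) { stop_requested = false; }
    void request_stop() { stop_requested = true; }
    /*override*/ void* operator()(void*) {
        if (stop_requested)
            return NULL;                    // returning NULL terminates the pipeline
        return load_and_decompress_frame(); // application-specific
    }
private:
    tbb::atomic<bool> stop_requested;
    void* load_and_decompress_frame();
};

// Body of the thread spawned by start(); stop() sets the flag above.
void pipeline_thread_body(SourceFilter& src) {
    tbb::task_scheduler_init init;  // per-thread scheduler initialization
    tbb::pipeline p;
    p.add_filter(src);
    // ... add the decompression and delivery filters here ...
    p.run(/*max_number_of_live_tokens=*/8); // blocks until src returns NULL
    p.clear();
}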

The problem I have is that once in a while, when calling stop, the first filter returns NULL but the pipeline does not stop, which results in the thread's main method (the one that called pipeline::run()) not returning.

Inspecting the threads, there are a couple of them waiting in:

tbb::internal::GenericScheduler::wait_while_pool_is_empty ()

but none of the threads is hanging inside the instance that blocks.

Any help is appreciated,

- Alex

One additional note: this looks like a problem similar to bug #115 posted on the TBB site, which unfortunately is not yet resolved.

You've said that no other threads hang, but what do they do when you inspect them with the debugger? In particular the thread that called pipeline::run?

You've also said that there are multiple instances of your class running. Do you mean that there are multiple simultaneously running pipelines? If so, are they started each from its own thread?

Quoting - Andrey Marochko

You've said that no other threads hang, but what do they do when you inspect them with the debugger? In particular the thread that called pipeline::run?

You've also said that there are multiple instances of your class running. Do you mean that there are multiple simultaneously running pipelines? If so, are they started each from its own thread?

There are multiple simultaneously running pipelines, each started from its own thread.

After doing a lot of debugging, I found the problem but I don't quite understand the cause.

Debugging in the deadlock situation showed that one pipeline is not running in the thread it should be running in. The following stack trace might make this a bit clearer (I omitted everything not relevant, like code references and memory locations). Basically, the pipeline I started in #7 is not the one I'm in at #3 (TBBSinkFilter is the last filter in the pipeline).

#0 in nanosleep () from /lib64/libc.so.6
#1 in usleep () from /lib64/libc.so.6
#2 in VideoMixerFilter::deliverPacket
#3 in MovieFileSourceFilter::TBBSinkFilter::operator()
#4 in tbb::internal::stage_task::execute ()
#5 in tbb::internal::CustomScheduler::wait_for_all () from /opt/intel/tbb/current/lib/libtbb.so.2
#6 in tbb::pipeline::run () from /opt/intel/tbb/current/lib/libtbb.so.2
#7 in MovieFileSourceFilter::run
#8 in non-virtual thunk to MovieFileSourceFilter::run()
#9 in Thread::run
#10 in Thread::StartThreadST
#11 in start_thread () from /lib64/libpthread.so.0
#12 in clone () from /lib64/libc.so.6

Any idea on how this can happen? Any help is appreciated :-)

- Alex

Quoting - alexbenz

Debugging in the deadlock situation showed that one pipeline is not running in the thread it should be running in. [...] Basically, the pipeline I started in #7 is not the one I'm in at #3 (TBBSinkFilter is the last filter in the pipeline). Any idea on how this can happen?

This just means that the thread has stolen a 'pipeline task' from another thread. Note the 'CustomScheduler::wait_for_all()' entry in the call stack; the wait_for_all() function can steal and execute tasks from other threads.

I think that's OK in itself.

Are you using Win32 or POSIX?

Where exactly are the threads blocked? Are they blocked or spinning?

What is the value of the GenericScheduler::arena->prefix().gate.state variable?

Quoting - Dmitriy V'jukov

Are you using Win32 or POSIX?

Where exactly are the threads blocked? Are they blocked or spinning?

What is the value of the GenericScheduler::arena->prefix().gate.state variable?

I'm using POSIX threads under Red Hat Enterprise Linux 5 (64-bit).

The problem I have is with thread termination. The stack trace above is from the thread that's supposed to terminate, but it isn't, because the pipeline is still working. But the pipeline in the stack trace belongs to a different MovieFileSourceFilter instance and I don't understand why it's showing up in the stack trace of a totally different thread...

How do I query this variable you speak of, and when should I query it?

Quoting - alexbenz

The problem I have is with thread termination. [...] The pipeline in the stack trace belongs to a different MovieFileSourceFilter instance and I don't understand why it's showing up in the stack trace of a totally different thread...

Again, this is the situation:

T1 Source1, Pipeline1
T2 Source2, Pipeline2
T3 Source3, Pipeline3

after a while it looks like this:

T1 Source1, Pipeline1
T2 Source2, Pipeline2
T3 Source3, Pipeline2 !!!

and all I'm doing is stopping and starting the threads which in turn run the pipelines...

Quoting - alexbenz

Again, this is the situation:

T1 Source1, Pipeline1
T2 Source2, Pipeline2
T3 Source3, Pipeline3

after a while it looks like this:

T1 Source1, Pipeline1
T2 Source2, Pipeline2
T3 Source3, Pipeline2 !!!

and all I'm doing is stopping and starting the threads which in turn run the pipelines...

If the situation is really the way I think, then it's funny.

You start a thread to process the pipeline and, when the pipeline terminates, you expect pipeline::run() to return. Right?

It seems that some other thread has stolen a task from the pipeline. Then the current thread starts stealing while waiting for the stolen task to complete. It successfully steals some big 'self-respawning' task (another pipeline's task) and starts processing it to the end.

This way pipeline::run() cannot return while there is work to do anywhere in the whole system.

Quoting - Dmitriy V'jukov

It seems that some other thread has stolen a task from the pipeline. Then the current thread starts stealing while waiting for the stolen task to complete. It successfully steals some big 'self-respawning' task (another pipeline's task) and starts processing it to the end.

This way pipeline::run() cannot return while there is work to do anywhere in the whole system.

Yes, exactly, and I really don't understand how something like this can happen. The pipelines should be autonomous and, as far as I can see, I'm using them as described in the Reference and Tutorial manuals. Is there a way to debug what happens in the scheduler, since this is the only component in my system that knows about all the pipelines?

I'm not totally sure how the scheduler works here. Despite the fact that I have a scheduler_init object in each of my "pipeline run threads", there is a common (static) pool of worker threads, right? So there is no way to make the pipelines totally autonomous?

Let me explain what Dmitriy tried to say in other words.

When you start a pipeline it produces a bunch of tasks, some of which may be stolen and executed by TBB worker threads.

When you start several pipelines, each in a thread of its own, they each produce a bunch of tasks, some of which are stolen and executed by TBB worker threads.

So far so good.

Now you stop one of the pipelines. Its input filter returns zero, so no new tasks are produced. But there are still a few previously created tasks in flight somewhere in the TBB thread pool. And the pipeline::run method cannot return until all the tasks it engendered are finished.

Instead of idly waiting until all its old tasks finish, method run (or actually wait_for_all, which was called inside it) goes and steals some work from other threads. It may be one of its own tasks stolen back (if you are lucky), but it may also be a task of another pipeline (more probable when you have several pipelines).

Since that other pipeline was not cancelled, that task generates new ones, and your original call to, say, Pipeline3's run() method gets trapped under a pile of work generated by the other pipeline. And until this other pipeline is stopped, our Pipeline3 will not return from its run method.

Alex, am I right to understand that your pipelines are endless by themselves, and require an external signal to stop them? And until the first pipeline stops your program does not stop other ones?

Quoting - Andrey Marochko
Alex, am I right to understand that your pipelines are endless by themselves, and require an external signal to stop them? And until the first pipeline stops your program does not stop other ones?

Hmmm... reminds me of... deadlock

;)

Quoting - alexbenz

I'm not totally sure how the scheduler works here. Despite the fact that I have a scheduler_init object in each of my "pipeline run threads", there is a common (static) pool of worker threads, right? So there is no way to make the pipelines totally autonomous?

I don't see how this can be accomplished with the current pipeline API.

To support such usage, the pipeline API could be extended with asynchronous execution:

pipeline p;
... // fill pipeline
HANDLE event = CreateEvent(...);
p.async_run(new(allocate_root()) completion_task(event));
...
WaitForSingleObject(event, INFINITE);

struct completion_task : public tbb::task
{
    HANDLE event;
    ...
    /*override*/ tbb::task* execute()
    {
        SetEvent(event);
        return NULL; // task::execute returns the next task to run, or NULL
    }
};

What do you think? Remember: continuation-based style is always more efficient ;)

Also, such usage eliminates oversubscription of threads, i.e. if the user creates 3 pipelines on a 2-core system, only 2 threads will be working instead of 5.
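
For what it's worth, on POSIX (which is what Alex is using) the same completion signal could be sketched with a mutex and condition variable instead of a Win32 event; async_run is still the hypothetical API proposed above, and completion_event is an illustrative name:

#include <pthread.h>

struct completion_event
{
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool done; // protected by mutex
};

struct completion_task : public tbb::task
{
    completion_event* ev;
    completion_task(completion_event* e) : ev(e) {}
    /*override*/ tbb::task* execute()
    {
        pthread_mutex_lock(&ev->mutex);
        ev->done = true;
        pthread_cond_signal(&ev->cond);
        pthread_mutex_unlock(&ev->mutex);
        return NULL;
    }
};

// Waiting side, after p.async_run(...):
//     pthread_mutex_lock(&ev.mutex);
//     while (!ev.done)
//         pthread_cond_wait(&ev.cond, &ev.mutex);
//     pthread_mutex_unlock(&ev.mutex);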

Let me see... The program requires parallelism (with the current API, anyway). The scheduler supports (optional) parallelism. Twice the word "parallelism": we have a go! :-)

I don't see what this async_run is supposed to do: completion_task cannot be executed when the first filter is exhausted (that would not make a difference), it cannot be executed when the pipeline has been fully drained (that would not make a difference), so what did I miss?

But I think I've advocated a possible solution enough recently; repeating it would only make me sound like a broken record.

Quoting - Raf Schietekat

But I think I've advocated a possible solution enough recently; repeating it would only make me sound like a broken record.

Do you think that FIFO processing will help here?

I think that the problem is not in the order of processing. I believe that all tasks from the "deadlocked" pipeline have been processed, but the run() method just doesn't return, because wait_for_all()'s loop spins processing other tasks and doesn't check the parent task's reference counter.

A possible solution is to move the check of the parent task's reference counter into the inner-most loop in wait_for_all(). Probably this must be done only for non-worker threads... and probably some other problems will emerge.

"Do you think that FIFO processing will help here?" Well, at least global progress, but it probably depends on how you want to balance latency vs. throughput.

"probably some other problems will emerge" That only seems likely if the root cause of the problem is not attacked.

Quoting - Raf Schietekat

"Do you think that FIFO processing will help here?" Well, at least global progress, but it probably depends on how you want to balance latency vs. throughput.

If it won't help here, then how can it be a solution to the problem?..

Quoting - Raf Schietekat

"probably some other problems will emerge" That only seems likely if the root cause of the problem is not attacked.

The root cause of this problem is attacked (hypothesis), but some completely independent problems can arise. And root causes of those problems are, of course, *not* attacked, because nobody is aware of them now.

"If it won't help here, then how it can be a solution to the problem?" Ah, a band-aid... maybe when a pipeline is stopped an alarm can be set, and, if run() doesn't return soon enough, everything is stopped and restarted. Not very elegant, but it may do the trick.

"The root cause of this problem is attacked" No, I meant the real root cause. :-)

Quoting - Raf Schietekat

"If it won't help here, then how it can be a solution to the problem?" Ah, a band-aid... maybe when a pipeline is stopped an alarm can be set, and, if run() doesn't return soon enough, everything is stopped and restarted. Not very elegant, but it may do the trick.

I think there are schemes that (1) are acceptable and (2) solve the problem at the same time. Don't you think so?

Quoting - Raf Schietekat

"The root cause of this problem is attacked" No, I meant the real root cause. :-)

Work-stealing? I don't think that it's possible to create something viable w/o work-stealing. Yeah, there are things like work-distribution and work-requesting... but try to get them to work in real life... Oh, and there is the scheduler based on just a single FIFO queue...

"I think there are schemes that (1) are acceptable and (2) solve the problem at the same time. Don't you think so?" The latter generally implies the former in my book. :-)

"Work-stealing?" I never said that.

Quoting - Raf Schietekat

"I think there are schemes that (1) are acceptable and (2) solve the problem at the same time. Don't you think so?" The latter generally implies the former in my book. :-)

At the moment you've proposed (1) something secret (probably to use FIFO processing, which IMVHO won't solve the problem) and (2) to restart the process, which is not acceptable... Probably you have 2 books...

Quoting - Raf Schietekat

"Work-stealing?" I never said that.

Please, say explicitly what you mean. I would not consider you a broken record.

The band-aid would not restart the process, only the pipelines, which is probably acceptable if it only has to intervene occasionally, for lack of a real solution at this time. "Not very elegant, but it may do the trick."

A solution would use "FIFO" for all unbounded pipelines, well, all unbounded workloads in the program, really (a corollary of a principle I've presented before), with "bounded" probably more a qualitative concept than a quantitative one. Such pipelines may only spawn global-progress tasks for the first filter, preferably with a grainsize larger than one to trade low latency for high throughput (sic?), i.e., with a higher grainsize both latency and throughput will increase. Downstream work would still be spawned on the "LIFO" part of the scheduler, as well as any internal work (dependencies of the filters). A worker thread will therefore always come back to its own pipeline after a bounded amount of time, so it will always "return from run()" (of course, run() requires parallelism if more than one pipeline is started, so an API change would be welcome, but that's a different matter). Note that this is just for global progress, so there can still be annoying delays if a worker thread gets stuck in a slow lane, but it's still a useful first step because traffic will never come to an unbounded halt (you may notice that my daily commute is a useful source of inspiration).

Quoting - Dmitriy V'jukov
Hmmm... reminds of... deadlock

;)

Hmmmm... smells more to me like a livelock. No thread is starved for resources, and it's not really like either one, because progress is being made, just not on the terminated thread.

Quoting - Raf Schietekat

A solution would use "FIFO" for all unbounded pipelines, well, all unbounded workloads in the program, really (a corollary of a principle I've presented before), with "bounded" probably more a qualitative concept than a quantitative one. Such pipelines may only spawn global-progress tasks for the first filter, preferably with a grainsize larger than one to trade low latency for high throughput (sic?), i.e., with a higher grainsize both latency and throughput will increase. Downstream work would still be spawned on the "LIFO" part of the scheduler, as well as any internal work (dependencies of the filters).

I don't see how this would behave differently from the current scheme if the initiating thread of one pipeline ever momentarily runs out of work and steals from another pipeline. How does the FIFO prevent an errant thread from getting stuck in an abundance of work on some other LIFO?

"Hmmmm... smells more to me like a livelock." I would call it starvation: livelock means being very busy doing nothing at all. It happens to be intra-thread instead of inter-thread, but that does not seem essential.

"How does the FIFO prevent an errant thread from getting stuck in an abundance of work on some other LIFO?" Between a worker stealing FIFO (like Cilk?) by using deques instead of singly linked lists, and programmer discipline to only spawn bounded workloads for LIFO scheduling, would there still be an opportunity for a task to be starved forever?

Andrey Marochko, Alexey Kukanov, and I have been discussing this issue. Here is my summary of our thoughts.

The root problem is "may" versus "must" parallelism. The TBB scheduler assumes "may" parallelism, meaning that parallelism is not required for correctness, and thus scheduling can be as unfair as possible. "Must" parallelism, in contrast, requires some degree of fairness. "May" parallelism enables the efficiency properties of Cilk-style scheduling. But "must" parallelism is sometimes necessary for correctness (notably to avoid deadlock!).

In TBB, we call threads created by the user "masters" and threads created by the TBB scheduler "workers". The following policy makes sense: parallelism between masters must be treated as "must" parallelism; threads employed on behalf of masters must respect the "must" parallelism between their employers; all other parallelism is "may" parallelism.

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). The formal rules are:

  1. A thread is either employed by a thread (possibly itself) or unemployed.
  2. A master thread is always employed by itself.
  3. A worker thread is unemployed if it is not running any tasks and its task pool is empty. Otherwise a worker is employed.
  4. A worker can change from unemployed to employed by stealing work. The employer of the thief is the employer of the victim. This rule also covers tasks obtained via advertisements in a thread's mailbox.
  5. For two distinct employers X and Y, a thief employed by X cannot steal a task from a victim employed by Y.

The "employed thief" case arises only in nested steals. That is when a thread X executing a task has to wait for other threads to finish stolen subtasks, and is looking for work to do in the meantime.

I believe the policy would fix the reported problem. We'll have to work out the data structures. My impression is that if there are only a few masters, there is a relatively quick hack to implement the policy with tolerable efficiency. The hack is to implement rule 4 by tagging threads with their employer id, and implementing rule 5 as a rejection rule in the stealing logic, as in the sketch below. It's less than ideal efficiency because with M masters the probability of success for a steal is 1/M.
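
To illustrate the hack, the rejection test could sit in the stealing logic roughly like this (a sketch only; the type and function names are illustrative, not TBB's actual internals):

struct thread_state {
    thread_state* employer; // itself for masters, NULL when unemployed
    // ...
};

// Called by an idle thread; implements rules 4 and 5 from above.
task* try_steal(thread_state& thief, thread_state& victim) {
    thread_state* boss = victim.employer;
    if (boss == NULL)
        return NULL; // an unemployed victim has no work to offer
    if (thief.employer != NULL && thief.employer != boss)
        return NULL; // rule 5: no moonlighting across employers
    task* t = victim.pop_task(); // hypothetical helper
    if (t != NULL && thief.employer == NULL)
        thief.employer = boss; // rule 4: the thief inherits the employer
    return t;
}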

We could use the quick hack to verify that the policy indeed solves the problem, then work on an efficient implementation.

- Arch Robison

My position is that a scheduler should allow the user to specify the "QoS"-related trade-offs to be made, and that global progress should be one of several options. The proposed partitioning skips global progress in the trade-efficiency-for-fairness direction (it will also solve the problem at hand, but it will often be less efficient, unless it is the only acceptable option because of a low tolerance for high latency in the program). How can it be used for serving multiple mouths other than with oversubscription and/or capitulating to old-style multithreading (to which I expected TBB to be the answer)? How will the number of workers at any one time be managed to avoid oversubscription, and how will they be scheduled between the masters? I think that work should instead be devoted to allow the user to avoid the need for more master threads: running multiple pipelines at once does not inherently require parallelism, nor should TBB's implementation.

Quoting - Raf Schietekat
I think that work should instead be devoted to allow the user to avoid the need for more master threads: running multiple pipelines at once does not inherently require parallelism, nor should TBB's implementation.

Running multiple pipelines at once does inherently require parallelism, otherwise why not just run them one by one? Of course there is some valid sequential execution that has the property of making progress in every simultaneous pipeline; as we know, time sharing and preemption make semblance of concurrent execution even on a single core. But if making simultaneous progress is a requirement, then concurrency is mandated, even if an execution engine resolves it via time sharing & preemption.

On the other hand, I agree that users should have an option to choose the behavior they need.

"otherwise why not just run them one by one" Because that would not be simultaneous: even round-robin would be better.

Quoting - Raf Schietekat
"otherwise why not just run them one by one" Because that would not be simultaneous: even round-robin would be better.

Exactly. That's my point: despite existing valid sequential execution (such as round-robin), the requirement on simultaneous progress is nothing else but mandated concurrency.

Surely you're joking, Mr. Kukanov!

No joke: "simultaneous progress" by definition implies "mandatory concurrency".

Now I believe TBB is in trouble, if the chief architect does not see that... unless you thought that I meant that TBB's present API does not require parallelism to operate multiple pipelines at the same time? Indeed, that would have been a mistake, but actually my point is that this may be so now, but also that it should not be so: in addition to run(), there should be a method that means "go register yourself with the nice scheduler so that it will occasionally call you", and then the scheduler would do just that. Any parallelism between pipelines would be entirely optional: on a single-core system, only one thread would run, everybody would be served (no starvation), and things would proceed smoothly (no deadlock). But of course the organisation should be somewhat more sophisticated than round-robin, for a maximally efficient exploitation of concurrency when the opportunity presents itself.
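
To make that concrete, the shape of such an API might be something like the following (purely hypothetical; nothing like this exists in TBB today):

class pipeline {
public:
    void run(size_t max_number_of_live_tokens); // existing: blocks until done

    // Hypothetical additions: register the pipeline with the scheduler
    // and let it be driven incrementally, with no dedicated thread.
    void start(size_t max_number_of_live_tokens);
    bool done() const;
};

// One thread could then serve any number of pipelines:
//     for (int i = 0; i < n; ++i) pipelines[i].start(8);
//     while (!all_done(pipelines, n))
//         scheduler_process_some_work(); // hypothetical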

Quoting - Raf Schietekat

...in addition to run(), there should be a method that means "go register yourself with the nice scheduler so that it will occasionally call you", and then the scheduler would do just that. Any parallelism between pipelines would be entirely optional: on a single-core system, only one thread would run, everybody would be served (no starvation), and things would proceed smoothly (no deadlock). [...]

If a pipeline filter is free to make kernel calls that can block, then a single thread cannot guarantee progress of two pipelines. There has to be at least two software threads, one to serve each pipeline. The concurrency is not optional. Of course the underlying OS is free to context switch a single hardware thread between the software threads, hence there might be no parallelism.

The semantics requested by Alex Benz is entirely reasonable. What we (Alexey, Andrey, and I) are proposing is to delineate "must" versus "may" parallelism by treating user-created threads as "must" parallelism. If we implement this distinction in the scheduler, then the "go register yourself with a nice scheduler" becomes a matter of starting the pipelines on separate threads.

"If a pipeline filter is free to make kernel calls that can block, then a single thread cannot guarantee progress of two pipelines." Until TBB provides specific, targeted support for filters that can block, they should probably not be used (note that this is related to but separate from the issue of running multiple pipelines simultaneously, because starvation is not necessarily related to blocking, I suppose). A workaround seems to be to have a "master thread" detect work from each input and fire up a related pipeline when required, which will close down when no pending input is found; pooling should alleviate some overhead.

"There has to be at least two software threads, one to serve each pipeline. The concurrency is not optional. Of course the underlying OS is free to context switch a single hardware thread between the software threads, hence there might be no parallelism." Multiple threads in this case are mandatory only because of the current limitations of TBB. Blocking in a filter is a "forced error" for lack of appropriate support from TBB (but see workaround above).

"The semantics requested by Alex Benz is entirely reasonable." Absolutely. If only TBB didn't have these limitations that forced him to use old-style multithreading. At this point TBB looks promising, but there's still work ahead.

"What we (Alexey, Andrey, and I) are proposing is to delineate "must" versus "may" parallelism by treating user-created threads as "must" parallelism. If we implement this distinction in the scheduler, then the "go register yourself with a nice scheduler" becomes a matter of starting the pipelines on separate threads." There will always be situations where multiple threads are mandatory, or just inherited from legacy software. But TBB should not unnecessarily impose them on the user because of a lack of support for "proper" task-based optional parallelism.

I wrote 'A workaround seems to be to have a "master thread" detect work from each input and fire up a related pipeline when required, which will close down when no pending input is found; pooling should alleviate some overhead.'

To avoid any misunderstanding: this is only a workaround to avoid the use of blocking input filters, it offers no direct relief to pipelines starved for other reasons, but such an arrangement should improve the odds and at least you'll have the consolation that all worker threads are doing something useful, which may also help against a starvation problem in the first place. Worth a try, perhaps. With a progress-capable scheduler there would be no unbounded starvation if unbounded filters are avoided. Thread partitioning would solve both problems at the same time, but how many more cores are there than active pipelines for this to still be different from old-style multithreading, assuming that TBB then actively manages the number of worker threads to avoid immediate oversubscription problems?
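
A rough sketch of that workaround, with all application-side names hypothetical:

#include <vector>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

// Hypothetical input abstraction with a non-blocking readiness test.
struct Input { bool has_pending_data(); /* ... */ };

void build_filters(tbb::pipeline& p, Input& in); // app-specific
bool shutdown_requested();                       // app-specific

void supervisor_loop(std::vector<Input*>& inputs) {
    tbb::task_scheduler_init init;
    while (!shutdown_requested()) {
        for (size_t i = 0; i < inputs.size(); ++i) {
            if (inputs[i]->has_pending_data()) {
                tbb::pipeline p;
                build_filters(p, *inputs[i]); // input filter returns NULL
                p.run(8);                     // when no pending data remains
                p.clear();
            }
        }
        // possibly sleep briefly when nothing is pending
    }
}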

Quoting - Andrey Marochko

Alex, am I right to understand that your pipelines are endless by themselves, and require an external signal to stop them? And until the first pipeline stops your program does not stop other ones?

Yes, the first part is true. I need to tell the pipeline from a different thread to quit, that is, a flag is set which is queried by the first filter in the pipeline. But I can quit a random filter chain (pipeline) any time I want and restart it, as long as the described situation is not met.

Thanks for giving this topic so much thought. This is really appreciated! I'll give you a short heads-up on where I stand.

To get on with our software, I implemented a simple "alternative" for the pipeline part. It works and is not much slower than the pipeline. BUT it's not nearly as comfortable as using a tbb::pipeline, so keep up with the work :-) The problem for me is that I can't use the pipeline in other parts of our software as long as I can't get around the deadlock issues. I already put a lot of time into this to really find the root cause. Unfortunately I also can't extract the problem into a simple example for you to do some debugging. But from the discussions so far, I guess you're on the right track.

Basically, the software we're developing is an HD a/v playout solution which internally uses a filter graph system similar to the DirectShow framework. The tbb::pipelines are used for basic low-level tasks in the filter chain that need to be done as quickly as possible, like loading and decoding frames (encapsulated in one single high-level filter). I think this should give you a basic idea of what I'm trying to achieve.

- Alex

Quoting - Robert Reed (Intel)

Hmmmm... smells more to me like a livelock. No thread is starved for resources and it's not really like either because progress is being made, just not on the terminated thread.

Think of it as a deadlock on spin-mutexes, with the processing of the other pipeline's tasks as the yield :)

"Think of the deadlock on spin-mutexes with processing of other pipeline's tasks as yield :)" I'm surprised: no observations more substantial than this about a thread that goes to the heart of current and future usability of TBB?

Quoting - Arch Robison (Intel)

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). [...]

Oh, I see, Mini Concurrency Runtime? Will master threads be placing orders for worker threads? :) And since MSVC2010, there will be a Mini Concurrency Runtime on top of the Big Concurrency Runtime :)

It's an interesting idea; probably it will even increase performance in some cases by improving locality of processing... But what do you think about potential induced stalls (an employed worker thread has to stand idle when there is more work to do)?

Quoting - Dmitriy Vyukov

It's an interesting idea; probably it will even increase performance in some cases by improving locality of processing... But what do you think about potential induced stalls (an employed worker thread has to stand idle when there is more work to do)?

We won't know the answer to that until we try it out. If the stall issue turns out to be significant, we may have to create >P software threads on a system with P hardware threads, and somehow block/unblock the software threads so that there are only P running at a time. Or maybe it will turn out that a little occasional oversubscription is not a big problem. There's much to be learned by implementing and trying it out.
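
For illustration only, the block/unblock part could be as crude as a counting semaphore initialized to P; this is a sketch of the idea, not a statement about how TBB would actually do it:

#include <semaphore.h>

sem_t run_slots; // initialized with sem_init(&run_slots, 0, P)

// Wrapped around the scheduler loop of each of the >P software threads:
void gated_worker_loop() {
    for (;;) {
        sem_wait(&run_slots);   // acquire one of the P run slots
        run_scheduler_until_blocked_or_unemployed(); // hypothetical
        sem_post(&run_slots);   // give the slot up while idle/blocked
    }
}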

"There's much to be learned by implementing and trying it out." I think you should delegate such half-cocked exploration to other projects and only do innovation with a clear direction to further the strengths of task-based concurrency, but I guess taking a detour to unblock a situation is better than standing still.

Quoting - Arch Robison (Intel)

We won't know the answer to that until we try it out. If the stall issue turns out to be significant, we may have to create >P software threads on a system with P hardware threads, and somehow block/unblock the software threads so that there are only P running at a time. Or maybe it will turn out that a little occasional oversubscription is not a big problem. There's much to be learned by implementing and trying it out.

Context switches are indeed cheap now, so limited oversubscription can be the answer. But there are other problems - most notably per-thread memory consumption in the scalable allocator (it seems that then you will nevertheless have to add something like scalable_alloc_flush()), and other per-thread resources (stack, maybe some user per-thread resources).

Quoting - Arch Robison (Intel)

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). [...]

And why not just add an additional check into the inner-most master-thread loop? Now it looks like this:

void master_thread_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);
        }
        if (parent_task->reference_counter == 0)
            break;
        steal_and_block_and_etc();
    }
}

The loop must be modified this way:

void master_thread_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);
            if (parent_task->reference_counter == 0)
                break;
        }
        if (parent_task->reference_counter == 0)
            break;
        steal_and_block_and_etc();
    }
    if (my_stack_is_not_empty)
        transfer_my_stack_to_somewhere();
}

I believe this will also fix the problem. But... hmmm... this looks simpler, and the additional overheads in the inner-most loop are negligible. And there will be no induced stalls, no oversubscription, and no additional memory consumption in the scalable allocator.

The master thread is just free to exit the game as soon as it wants.

What do you think?

IIRC I've seen something similar in Doug Lea's Java Fork/Join Framework (which is also a Cilk clone).

Well, worker threads can (must?) also check the parent task's counter more frequently, but only in recursive scheduler loops. And master threads must transfer their uncompleted tasks only in non-recursive scheduler loops. So here is the patched version:

void scheduler_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);
            if ((recursive || master) && parent_task->reference_counter == 0)
                break;
        }
        if (parent_task->reference_counter == 0)
            break;
        steal_and_block_and_etc();
    }
    if (recursive && my_stack_is_not_empty)
        transfer_my_stack_to_somewhere();
}

I've not worked out all the details, but I believe the idea works.

"What do you think?" I think that I don't understand what you're trying to do, but maybe I'm the only one? (I am a bit distracted by very strange problems on the way to floating-point atomics, where I only recently managed to solve a crash in optimised builds by using unions instead of reinterpret_cast(), even before any floating-point type was in sight, so that g++ version I'm using is looking mighty suspicious to me now.)

Quoting - Raf Schietekat

"What do you think?" I think that I don't understand what you're trying to do, but maybe I'm the only one?

I am trying to fix the error described in the first post in this topic.

The master thread doesn't return from spawn_root_and_wait() because it is processing tasks from another pipeline, and is not even checking whether its root task (pipeline) has already finished. If the thread checks whether its root task has already finished, then the deadlock is resolved and the problem is solved.

"If the thread will be checking whether its root task has already finished, then the deadlock is resolved and the problem is solved." I can see how that might sometimes help, but not how this would actually prevent the thread from still sometimes/often getting caught up in a never-ending story. You would need to guarantee that task closures (a task and its descendants) execute in a bounded time, which happens to be the basic requirement to make global progress possible, but then why run all these additional master threads instead of an arguably more efficient "FIFO"-capable scheduler? How will they be reined in from potentially massive oversubscription (how would you serve 10000 pipelines?), making suboptimal context-switch choices, littering the process with stacks, etc.? Isn't this just retracing the earlier history of multithreading?

I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

Quoting - Raf Schietekat

"If the thread will be checking whether its root task has already finished, then the deadlock is resolved and the problem is solved." I can see how that might sometimes help, but not how this would actually prevent the thread from still sometimes/often getting caught up in a never-ending story.

Please describe how a "never-ending" story can happen if we apply my proposed patch? (I don't see it.)

Quoting - Raf Schietekat

You would need to guarantee that...

NO! The master thread is free to exit the game as soon as it wants (even during a "never-ending" story).

Quoting - Raf Schietekat

... instead of an arguably more efficient "FIFO"-capable scheduler?

A FIFO-capable scheduler won't help in this situation (IIRC we were already here). The master thread can be executing an infinite sequence of tasks in any order, but if it is not checking the root task's reference counter, it will be in an infinite loop!

Quoting - Raf Schietekat

How will they be reined in from potentially massive oversubscription (how would you serve 10000 pipelines?), making suboptimal context-switch choices, littering the process with stacks, etc.? Isn't this just retracing the earlier history of multithreading?

I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

This is a different problem, and I am NOT trying to solve it now.

If you have 10000 threads with independent work, then why do you need TBB at all?! You already have sufficient parallelism. Just execute each piece of work in its own thread.

"Please, describe how "never-ending" story can happen if we will apply my proposed patch? (I don't see)" New master steals task, task spawns children and waits_for_all(), child spawns grandchildren and waits_for_all(), some descendant gets the brilliant idea to run an unbounded client-server operation, ...

"NO! Master thread is free to exit the game as soon as he wants (even during "never-ending" story)." How would you abandon a stolen task's execute() before it returns? Are we still talking about 1:1 schedulers?

"FIFO-capable scheduler won't help in this situation (IIRC we already was here). Master thread can be executing infinite sequence of tasks in any order, but if it will be not checking root task's reference counter it will be in infinite loop!" We already were here, but it seems that some additional effort is required to convince you. :-) With a FIFO-capable scheduler, we wouldn't be using a master thread for running the pipeline in the first place (you can do it all in one thread if need be). It is probably worthwhile repeating that: running 10000 pipelines at the same time, i.e., with overlapping activation periods, does not mean mandatory concurrency in the sense of requiring multithreading (whether they be kernel threads or user threads or a combination) and/or parallelism (multiple cores and/or hyperthreading), and all "context switches" are where the user chooses them to be.

"This is different problem, and I an NOT trying to solve it now." It was meant as a general remark, but you're also thinking threads instead of tasks here.

"If you have 10000 threads with independent work, then why you need TBB at all?! You already have sufficient parallelism. Just execute every work in its thread." The thing is that I don't want 10000 threads, only as many worker threads as will keep the hardware busy.
