Pipeline

In tbb21_20081109oss, shouldn't tbb::pipeline::end_of_input be atomic (at least for pedantic reasons), and what's the deal with tbb::pipeline::token_counter (next_token_number() is executed with a lock on stage_task, not pipeline)? Is there a user guide that makes clear how the parts fit together, or could somebody write one?


Here are the sources preprocessed for my walnut-sized brain to have any chance of understanding what's going on (hey, I was watching TV at the same time...). Perhaps they might inspire one of you to study and explain them (I'll need some more energy after I get to understand them myself)? There's a list of changes in pipeline.h.

Attachments: 

pipeline.h (text/x-chdr, 6.81 KB)
pipeline.cpp (text/x-c++src, 12.74 KB)

Quoting - Raf Schietekat

Here are the sources preprocessed for my walnut-sized brain to have any chance of understanding what's going on (hey, I was watching TV at the same time...). Perhaps they might inspire one of you to study and explain them (I'll need some more energy after I get to understand them myself)? There's a list of changes in pipeline.h.

After a cursory look, I can say that some of the changes would break binary compatibility with older TBB versions, and thus they would have to wait until the pipeline is reimplemented within a new set of classes (which will have to carry a version suffix according to TBB versioning rules, e.g. pipeline_v4).

Meanwhile we continue improving the pipeline functionality, though the issue of simultaneous progress on multiple pipelines discussed in another thread is not yet targeted. The last changes introduced support for serial out-of-order filters, and more work is in progress.

Quoting - Raf Schietekat

In tbb21_20081109oss, shouldn't tbb::pipeline::end_of_input be atomic (at least for pedantic reasons), and what's the deal with tbb::pipeline::token_counter (next_token_number() is executed with a lock on stage_task, not pipeline)? Is there a user guide that makes clear how the parts fit together, or could somebody write one?

end_of_input only changes once per pipeline run, from false to true, to indicate there is no more input. For serial input filters, there is no data race between this write and other reads, because another task to take the next input item will not be spawned after the end of input was reached. For parallel input filters, there is a data race, but it is benign, because parallel input filters should be tolerant to latecomers anyway, i.e. the filter might be called a few times even after the end of input was reached.

The token counter is incremented in the very first serial ordered filter in the pipeline. Since the filter is serial, it contains a lock on its internal buffer of pending items. So it makes sense to increase the token number when holding this lock, rather than introduce a special lock in the pipeline. In other words, the lock is there for more reasons than just counting tokens.

The idea of explaining more about how the parts of the pipeline implementation fit together definitely makes sense. As I said in another message, we are working on further improvements in the pipeline, which might involve some refactoring as well. During the course of this work, we will put more comments in place, and might write some formal design documents.

"there is a data race but it is benign" I tend to side with Hans Boehm here ("no such beast"), no matter if the atomic may well make no difference at the machine code level.

"The token counter is incremented in the very first serial ordered filter in the pipeline. Since the filter is serial, it contains a lock on its internal buffer of pending items. So it makes sense to increase the token number when holding this lock, rather than introduce a special lock in the pipeline. In other words, the lock is there for more reasons than just counting tokens." If there's a unique ordered_buffer (not a "stage_task" like I wrote by mistake) whose mutex governs access to the pipeline's token_counter, that would indeed do the trick, but I probably need more input before I'll be able to see the light (it doesn't help that it's well into the night now).

"The idea of explaining more about how the parts of pipeline implementation fit together definitely makes sense. As I said in another message, we are working on further improvements in the pipeline, which might involve some refactoring as well. During the course of this work, we will put more comments in place, and might be write some formal design documents." That may be too late for me. The problem is that it takes too much time and energy to come up with various hypotheses and test them, compared to verifying documentation, before getting to the creative part.

Quoting - Raf Schietekat
"there is a data race but it is benign" I tend to side with Hans Boehm here ("no such beast"), no matter if the atomic may well make no difference at the machine code level.

My argument was not about difference at machine code level, but about no harm due to this race. Its only consequence is that some threads may receive the end_of_input signal somewhat later in case of parallel input, and it's harmless because the parallel input filter must be tolerant to latecomers anyway.

Quoting - Raf Schietekat
If there's a unique ordered_buffer (not a "stage_task" like I wrote by mistake) whose mutex governs access to the pipeline's token_counter, that would indeed do the trick, but I probably need more input before I'll be able to see the light (it doesn't help that it's well into the night now).

The buffers are per serial filter, and the lock is per buffer. Every serial filter uses the lock to serialize buffer operations. The very first serial ordered filter also uses this lock to put tokens in order. As only one filter can be "the first", there is a unique lock governing access to the pipeline's token counter.
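
In sketch form, the invariant is (hypothetical code, not the literal TBB source):

// token_counter lives in the pipeline, but the only code path that touches it
// already holds the array_mutex of the first serial ordered filter's buffer.
Token next_token_number( pipeline& p ) {
    // precondition: the caller holds the array_mutex of the buffer that
    // belongs to the first serial ordered filter - hence a unique lock.
    return p.token_counter++;
}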

Quoting - Raf Schietekat
That may be too late for me. The problem is that it takes too much time and energy to come up with various hypotheses and test them, compared to verifying documentation, before getting to the creative part.

I understand and agree. At this point, however, I see no big sense in formal documentation for the current design, which is being changed; though I must admit that the set of classes will likely remain, even if responsibilities may change.

Let me suggest that you ask more questions here (including your hypotheses), and I will be glad to answer those if it saves your time.

To start, I will briefly describe the classes and their main intent (probably not much news there):

- tbb::pipeline is the main class for the algorithm; it contains a sequence of filters, provides the API, and encapsulates the mechanics of how to form and start the pipeline.

- tbb::filter is the base class for user-defined filters; it represents filter abstraction for the pipeline. Its pure virtual function call operator should be overridden by user-defined descendant classes to implement the logic of data processing on different pipeline stages.

- tbb::internal::ordered_buffer represents the temporary storage for items that can not be processed immediately by a serial filter. Its name reflects the historical fact that serial filters were always ordered until very recently; now serial filters can be out-of-order, but the requirement of processing items one at a time still dictates the need for a buffer. The buffer is dynamically created for every serial filter; it can grow in size during execution. Operations on the buffer are protected by the buffer-specific lock.

- tbb::internal::stage_task is a descendant of tbb::task that represents pipeline operations to the scheduler. Execution of a stage_task applies one filter to one data item, and spawns other stage_tasks for subsequent operations on the data, as well as for fetching new data (as defined by the very first, or input, filter).

"My argument was not about difference at machine code level, but about no harm due to this race." And mine that this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

"The very first serial ordered filter also uses this lock to put tokens in order. As only one filter might be "the first", then there is a unique lock governing access to the pipeline's token counter." That's an example of a high-level constraint that is tiresome to find just by induction.

"Let me suggest you to ask more questions here (including your hypotheses), and I will be glad to answer those if it saves your time." Hmm, OK, thanks, but feel free to volunteer some more must-know stuff about the current implementation (things that are not obvious from the code).

Quoting Raf Schietekat

> this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

Oh, of course, since your reworked atomics have the means to express that.

> That's an example of a high-level constraint that is tiresome to find just by induction.

I agree.

> ... feel free to volunteer some more must-know stuff about the current implementation (things that are not obvious from the code).

Sure, when something related comes to my mind... Though the best way to make this happen is to ask questions :)

Alexey:
"end_of_input only changes once per a pipeline run, from false to true to indicate there is no more input. For serial input filters, there is no data race between this write and other reads, because another task to take the next input item will not be spawn after the end of input was reached. For parallel input filters, there is a data race but it is benign, because parallel input filters should be tolerant to latecomers anyway, i.e. the filter might be called a few times even after the end of input was reached."
Actually, serial input filters in a non-trivial pipeline must also be prepared to be called after returning NULL, because of a race on end_of_input (I'm not really happy about C++0x's limited definition of a race). Maybe you should document this, because it does not seem to be clear to everybody... :-)

Below this paragraph,
[Tutorial] means "Tutorial (Open Source).pdf" revision 1.11, and
[Reference] means "Reference Manual (Open Source).pdf" revision 1.9.

[Tutorial] p. 26:
"// Must remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();"
[Tutorial] p. 27:
"This top-level code also shows the method clear that removes all stages from the pipeline. This call is required if the filters have to be destroyed before the pipeline. The pipeline is a container that holds filters, and as with most containers in C++, it is illegal to destroy an item while it is in the container."
[Reference] p. 37:
"A filter must be removed from the pipeline before destroying it. You can accomplish this by destroying the pipeline first, or calling pipeline::clear()."
In tbb21_20081109oss filters remove themselves from their pipeline before they are destroyed, as specified in [Reference] p. 39, so clear() is merely a few cycles faster and the erroneous statements should be modified.

[Tutorial] mentions "pipeline template" twice, but this is merely confusing because pipeline is not a C++ class template and I don't see a useful other meaning (surely objects designed for use together are not a "template"?).

[Reference] p. 37:
"The parameter max_number_of_live_tokens puts an upper bound on the number of stages that will be run concurrently."
Well, sure, but that's not what it means.

Why does task depth (seem to?) increase (roughly or exactly?) linearly with the number of stages traversed?

In stage_task::execute(), is "recycle_as_continuation(); return this;" (or code to that effect) the same as "goto restart;" (with a restart label at the top)? If not, what's the difference?

Quoting - Raf Schietekat
> Actually, serial input filters in a non-trivial pipeline must also be prepared to be called after returning NULL,
> because of a race on end_of_input (I'm not really happy about C++0x's limited definition of a race).
> Maybe you should document this, because it does not seem to be clear to everybody... :-)

I will try to prove that the data race does not exist for serial input.

In the code executed during the pipeline run, there are 2 places where end_of_input is read, and 2 places where it is written. One of each is for the parallel input filter, which I covered before. So one write (end_of_input is set to true after the input filter returned NULL) and one read (at the very end of stage_task::execute, after an item was processed by every stage, the flag is tested before spawning a new task to start processing another item) are left for consideration. I claim there is no data race between these two, because it is protected by the my_pipeline.input_tokens atomic variable acting as a semaphore:
- input_tokens starts as the user-specified maximum number of items under processing allowed at any given time;
- each time a new item is taken from the input filter, input_tokens is decremented;
- each time an item is processed by the last filter, input_tokens is incremented;
- it is used to decide if a task should be spawned to take a new item from the input.

When the input filter is serial, a new task is spawned right after the input filter returned a valid item. For the parallel input filter, a new task is optimistically spawned before the input filter returned. In both cases, however, the semaphore value is first decreased, and tested; if it reached 0, the new task is not spawned. In this case, a new task is spawned the first time the semaphore is open (i.e. changes from 0 to 1), and this is the place where end_of_input is tested.
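
A rough sketch of this protocol (hypothetical names, condensed from the description above; not the literal TBB code):

// Taking a token before (serial) or while (parallel) fetching an input item:
if( --my_pipeline.input_tokens > 0 )         // semaphore still open afterwards
    spawn( next_input_stage_task() );        // hypothetical helper
// if the decrement reached 0, nothing is spawned here; instead...

// ...at the very end of stage_task::execute(), after the last filter:
if( ++my_pipeline.input_tokens == 1          // semaphore just reopened (0 -> 1)
    && !my_pipeline.end_of_input )           // the only read of the flag here
    spawn( next_input_stage_task() );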

Let's now see how it works for the serial input. At any time, only a single task runs the input filter (see the spawning rules above). Also at any time, the input filter only runs if the input_tokens semaphore is open (>0). When the input ends, the end_of_input flag is set, but the semaphore value is untouched - which means it is still open. Effectively it means that, after end_of_input was set, its value won't ever be read at the end of the execute method. Well, I must admit this conclusion surprised me :) So it seems that for the serial input filter, the flag is excessive. We need to address this during the course of refactoring.

By the way, a little unrelated detail came to my mind: the constructor of stage_task takes either one or two parameters. With one parameter only (the reference to the pipeline), the task will run the input filter. Constructed with the second parameter being a pointer to a filter, it will run that filter.
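
In sketch form (signatures approximate):

stage_task( pipeline& p );             // one argument: the task runs the input filter
stage_task( pipeline& p, filter* f );  // two arguments: the task runs filter *f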

> In tbb21_20081109oss filters remove themselves from their pipeline before they are destroyed, as specified in
> [Reference] p. 39, so clear() is merely a few cycles faster and the erroneous statements should be modified.

The documents on the Web appear outdated; I will ask to update them. I will also check the most current docs for the issues you reported; thanks a lot!

> Why does task depth (seem to?) increase (roughly or exactly?) linearly with the number of stages traversed?

Depth is increased for every next filter in the chain, to prioritize processing of existing items over taking new ones. When a thread has executed a serial filter, it might then spawn a task to process the next awaiting item; this can happen a few times before the end of the pipeline. At the end, a task to take out a new item might be spawned as well, as I described above. Increasing the depth makes items closer to the end of the pipeline get processed earlier. Also, as you remember, stealing is "shallow tasks first out" with regard to depth, and LIFO (bad bad bad :)) on the same level of depth; so the trick with increased depth also makes stealing de facto FIFO (except for input filter tasks, which we _want_ to be stolen earlier).
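
Schematically (a hypothetical fragment, using the depth()/set_depth() task attributes of that era):

// Inside stage_task::execute(), when handing the item on to the next filter:
// one level deeper per filter traversed, so the local "deepest first" order
// drains the tail of the pipeline before new input is fetched.
next_stage_task->set_depth( depth() + 1 );
spawn( *next_stage_task );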

> In stage_task::execute(), is "recycle_as_continuation(); return this;" (or code to that effect)
> the same as "goto restart;" (with a restart label at the top)? If not, what's the difference?

A few additional things are done inside the scheduler, of which the most important one is the check for cancellation; also tracing, gathering statistics, etc. (currently only for developers' use, controlled by compile-time macro switches). There is very little additional overhead that could be scratched out with goto; we considered that and decided it is not worth even duplicating the cancellation checks inside the stage_task itself.

"I will try to prove that the data race does not exist for serial input." And so you did: I stand corrected.

"check for cancellation" Of course...

Thanks a lot, I'll be doing some more staring at the code now before my next question.

I'm having a problem with my changes to pipeline: test_pipeline.exe seems to take forever to execute. I just do "make all", and much later, when I execute "top", I see over 86 minutes in the "TIME+" column. No, wait, it's the original version!

The curious thing is that when I do other things on the system, like switching to a different terminal to run "top", and typing this text, the test seems to get unstuck somehow. And it's not the first time that I've seen this with TBB, either.

P.S.: In the first paragraph, I was in fact verifying that I didn't introduce this particular problem, but I felt like introducing a bit of dramatic effect. Any idea what causes this?

Some more revisited and new questions...

I suppose that ordered filters are only relevant relative to each other, and that if only one occurs in the pipeline it might as well be merely serial? If the first serial filter in the pipeline is not ordered, currently the sequence in which it processes tokens is normally not guaranteed to be reproduced in any ordered filters down the line, except if that first serial filter is also the input filter: intentional or an oversight?

So it appears that a serial input filter will not be invoked after it has returned NULL, but is this going to be documented as part of the contract, or is it just how it happens to work right now but without any guarantees for the future and therefore without a need to preserve this behaviour if the implementation is changed?

Don't overlook the matter of excessive execution times described in the previous posting #11, but I think I may have seen it happen also where a pipeline was not involved.

Why the fences on input_tokens? Doesn't spawn() do a release already (and can you put a timeline on the documentation of how fences are used with tasks, a question that has arisen earlier)? But again, regardless of what spawn() does, why the fences? Is it a mere statement of fact that the current implementation has no unfenced RMW?

At the end of the pipeline, what does set_depth() do that would affect the additional child of the counter task? It only seems meaningful with a recycle_as_continuation() instead.

Quoting - Raf Schietekat

> I'm having a problem with my changes to pipeline: test_pipeline.exe seems to take forever to execute. I just do "make all", and much later, when I execute "top", I see over 86 minutes in the "TIME+" column. No, wait, it's the original version!

> The curious thing is that when I do other things on the system, like switching to a different terminal to run "top", and typing this text, the test seems to get unstuck somehow. And it's not the first time that I've seen this with TBB, either.

> P.S.: In the first paragraph, I was in fact verifying that I didn't introduce this particular problem, but I felt like introducing a bit of dramatic effect. Any idea what causes this?

So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI?)

So far, I have no idea what might cause this.

Alexey: "So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI?)"

x86 (dual core)/Ubuntu 6.06 (which uses Gnome)/g++ 4.0.3/tbb21_20081109oss; it doesn't reliably get stuck, but this has happened a few times if I leave it alone, and starting another "make all" tends to get things unstuck. If you think you're onto something (what exactly?), I might look for a way to log in without the GUI, or open a remote shell, but it was over 86 minutes of "CPU Time" measured by "top", not a frozen display. It may have happened with earlier versions and other TBB programs (I've actually mentioned it before and Dmitriy wrote at the time that there might be variability in benchmarking), but maybe I just haven't left the program alone for so long yet. Then again, I have no idea why my system doesn't show the button to upgrade the O.S. without doing a fresh install, so maybe it's just earth rays...

>>
So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI?)

So far, I have no idea what might cause this.
<<

Alexey,

(hypothetical postulation here)

One potential cause for the peculiarity of the lock-up being "fixed" by the introduction of an additional load on the system could be a situation where multiple mutexes or other synchronization objects are required. In this particular case, two threads are competing for the multiple mutex/synch objects, each thread located on a separate hardware thread (dual-core system), and the retry code is optimized to use SwitchToThread(); since during the lock-up condition there are no other pending threads on either core, SwitchToThread immediately returns without a context switch. Further, the particular sequencing of obtaining the multiple mutex/synch objects is such that the dependent task has faster access to them (butting in line, so to speak), so that the other task is inhibited from advancing the state (of the pipeline in this case). The "fix" is observed/occurs when an additional thread is injected into the scenario, thus causing SwitchToThread to introduce a delay into the mix.

Jim Dempsey


How can I spawn the current task again and return NULL? recycle_as_continuation() causes a complaint about ready, and recycle_to_reexecute() causes a complaint about NULL. Should I use a hack using recycle_to_reexecute() and return a dummy task that does recycle_as_continuation() for reuse as a dummy or something?

(Added) Or can the assert with recycle_to_reexecute() simply be relaxed to allow NULL?

Quoting - Raf Schietekat

> How can I spawn the current task again and return NULL? recycle_as_continuation() causes a complaint about ready, and recycle_to_reexecute() causes a complaint about NULL. Should I use a hack using recycle_to_reexecute() and return a dummy task that does recycle_as_continuation() for reuse as a dummy or something?

> (Added) Or can the assert with recycle_to_reexecute() simply be relaxed to allow NULL?

Spawning the task that is currently executing is dangerous, because of a possible execution race if it is stolen while the current execute() has not yet completed.
recycle_to_reexecute() complains about NULL returned because in this case it is highly likely that the same task will be taken for execution right away. Usually, a task is recycled to repeat execution after some other tasks got executed, and this means some other task could be returned from execute() to immediately start with.

What do you want to achieve by spawning the currently executed task from itself? In particular, why won't recycle_to_reexecute work for you? It causes the task to get spawned right after its execution completes.
After understanding your needs, I might be able to suggest the way to achieve it. If nothing else works, adding another recycling method is possible.

P.S. I also owe you an answer on your post #12; it requires more thinking and writing, while I am still on holidays and not yet back to working mood :)

"Spawning the task that is currently executed is dangerous, because of possible execution race if it is stolen while the current execute() not yet completed." Sorry for the inaccurate formulation, I did mean having it spawned again after execute() returns, as in recycle_to_reexecute().

[...]

It seems that relaxing the assert to allow NULL works fine, so "refurbish(); recycle_to_reexecute(); return NULL;" is a presumably significant improvement over the alternative.
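
In context, the pattern is roughly the following, refurbish() being my own hypothetical helper that resets the task's state for another round:

struct my_task: tbb::task {
    tbb::task* execute() {
        process_one_item();           // hypothetical per-round work
        if( more_items_pending() ) {  // hypothetical test
            refurbish();              // reset state for reuse
            recycle_to_reexecute();   // respawned after execute() returns
            return NULL;              // allowed once the assert is relaxed
        }
        return NULL;
    }
};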

"P.S. I also owe you an answer on your post #12; it requires more thinking and writing, while I am still on holidays and not yet back to working mood :)" OK!

Another question: Roughly how much do allocation and destruction of tasks contribute to their overhead? Am I chasing a red herring trying to recycle them?

Quoting - Raf Schietekat
It seems that relaxing the assert to allow NULL works fine, so "refurbish(); recycle_to_reexecute(); return NULL;" is a presumably significant improvement over the alternative.

Another question: Roughly how much do allocation and destruction of tasks contribute to their overhead? Am I chasing a red herring trying to recycle them?

Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap. Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble.

It may be time to weaken the assertion to allow NULL. The assertion was motivated by two concerns:

  • We had a narrow use case of recycle_to_reexecute() in mind, specifically parallel_while (now parallel_do). When in doubt, we start out with conservative assertions, because it's always easier to weaken an assertion than to strengthen it and break existing code.
  • There was a nannyish intent to avoid inefficient code. If all the code is doing is re-executing the task without returning another task to execute, it might as well loop within the task, or use recycle_as_continuation as described below.

But I have found other nannies (notably Microsoft's /W4!) to be repulsive, so I think we should stop being nannyish here and weaken the assertion.

Note, however, that there is a way to accomplish reexecution of a task within the current TBB. If a task is to be re-executed immediately, it is essentially a continuation of itself with no children. The direct way to say this in code is for method execute() to call recycle_as_continuation() and return a pointer to the task. Below is an example.

#include <cstdio>
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"

class loop: public tbb::task {
    int i;
public:
    loop() { i = 0; }
    tbb::task* execute() {
        std::printf("i=%d\n", i);
        if( ++i != 1000 ) {
            recycle_as_continuation();
            return this;
        }
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;
    tbb::task* t = new( tbb::task::allocate_root() ) loop;
    tbb::task::spawn_root_and_wait(*t);
    return 0;
}

"Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap." Roughly how cheap, as a fraction of total task overhead? It obviously can't all be local, e.g., there's a (potentially costly) little shake to the shared parent's reference count that refurbishing would avoid. Perhaps no urgent concern for normal applications, but still something a toolkit would do well to at least take into consideration.

"Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble." So it's all just to avoid using a pointer to external state? Really?

pipeline.cpp already uses "recycle_as_continuation(); return this;", but "recycle_to_reexecute(); return NULL;" will also yield to higher-priority tasks (at greater depths in the local array), as required for (local) progress.

Quoting - Raf Schietekat
"Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap." Roughly how cheap, as a fraction of total task overhead? It obviously can't all be local, e.g., there's a (potentially costly) little shake to the shared parent's reference count that refurbishing would avoid. Perhaps no urgent concern for normal applications, but still something a toolkit would do well to at least take into consideration.

"Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble." So it's all just to avoid using a pointer to external state? Really?

pipeline.cpp already uses "recycle_as_continuation(); return this;", but "recycle_to_reexecute(); return NULL;" will also yield to higher-priority tasks (at greater depths in the local array), as required for (local) progress.

Right. Recycling avoids the reference count modification, which could easily dominate the time for the rest of the task allocation/deallocation bookkeeping. I've never measured the times.

The priority issue is a good point, more so because we plan to deprecate the "depth()" attribute and use a deque like Cilk does. If so, what are the implications for the pipeline implementation? My impression was that there was not a problem as long as we chose the order of spawning (pushing on the deque) carefully, so that higher priority tasks were pushed last.

It seems a radical change you're contemplating (not merely replacing the array of stacks with an array of deques but really only having one deque? what would happen to depth-selective stealing?)... one that could do with a bit of explanation (from Cilk to something entirely different and back again: the people, their motives and their adventures along the way).

Meditate on this, I will.

Lesson learnt last week: RTFM (again and again). I made a change to have a task reexecuted at the later of two events: return from execute() (obviously), and notification from another task. To simulate this as yet nonexistent functionality I used a dummy child empty_task just for reference count administration, but for the notification I just destroy()'ed this dummy task. Unfortunately, failure only happened after an annoyingly long time of running the test code, so it was not obvious that it should be something so simple. After instead spawn()'ing the dummy task (because destroy() doesn't take any action when the reference count reaches zero, for reasons that are not clear to me), things worked fine... except that that other problem mentioned earlier was still occurring, so I have to run, e.g., another TBB "make all" to get to the end (could it perhaps be something in the test code and its simulated waiting?).

So now I have a pipeline that (conceptually) only recycles tasks instead of spawning new ones and throwing old ones away, which should not make it any more sluggish, I suppose. I would now recommend some tweaks to the scheduler to make this official and optimised, but I'll have to clean up the code first. It also seemed nice to let pipelines with parallel input filters start up in O(log n) time instead of O(n), so to say.

There's one more change (that I remember without having the code at hand) that seems appropriate before embarking on the more interesting stuff of achieving global progress (this was just to get acquainted): a serial_out_of_order filter somewhere between two serial_in_order filters should pick the oldest buffered token, because otherwise the token that it chooses based on local "first come, first served" is likely to have to wait in the downstream serial_in_order filter's input buffer until the older token finally gets processed. Well, it may not make much of a difference most of the time, but it's too obvious to leave as-is.

I don't have any other changes in the pipeline (pun intended), just a remark at this time.

In all but the first filter, the object value is ignored, but the source code has a comment "/** Returns NULL if filter is a sink. */", and [Reference] p. 40 says "The first filter in a pipeline should return NULL if there are no more items to process. The result of the last filter in a pipeline is ignored.", which doesn't seem quite accurate. I suppose that the source code comment can be removed and that the second sentence in the reference should read "The result of other filters is ignored."?

I still have to add task::release_ref() and do some cleanup before I can show the code.

Quoting - Raf Schietekat
It seems a radical change you're contemplating (not merely replacing the array of stacks with an array of deques but really only having one deque? what would happen to depth-selective stealing?)

I've recently finally got the idea of depth-based stealing (thanx to Alexey Kukanov), and I think that the absence of depth-based stealing is not as bad as one might think. Java Fork/Join uses a plain deque, AFAIK .NET TPL uses a plain deque, Cilk uses a plain deque. And I am going to investigate the Cilk++ source today; probably it also uses a plain deque (yes, it's already available for downloading for free: http://www.cilk.com/home/get-cilk).
If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it... And all one has to do is rewrite the application w/o that already doomed synchronous waiting for children. Btw, what about deprecating synchronous waiting at once? ;)


Quoting - Alexey Kukanov (Intel)

Quoting Raf Schietekat

> this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

Oh, of course, since your reworked atomics have the means to express that.

Relaxed atomics are also required for formal verification tools - which may otherwise bark at such places - and also serve as very useful documentation; otherwise it would be completely obscured that the variable is used from several threads concurrently. So I personally consider them a very important thing. Actually I want as much syntactic noise around accesses to shared variables as possible :)


"If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it..." Why?

Quoting - Raf Schietekat

"If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it..." Why?

(1) First of all, it uses blocking wait for children instead of continuations; that already can be considered a crime :)
(2) Then, other people's programs run on flat deques, so why is his program unable to run?
Hmmm... I think it's enough to justify a statement starting with "probably" :)


Hmm, I'm one of those guys who think that a toolkit should work for him instead of the other way around (even though I have a disturbing weakness for tinkering with toolkits). I'm not very happy with continuations as a requirement.

Maybe it's a bit like distributed programming... the idea that you can hide object locations and pretend that everything is local turned out to be a doomed approach, however appealing initially.

Still, why be so quick to throw in the towel this time? And specifically for TBB, why not have an array of deques?

Quoting - Raf Schietekat
Hmm, I'm one of those guys who think that a toolkit should work for him instead of the other way around (even though I have a disturbing weakness for tinkering with toolkits). I'm not very happy with continuations as a requirement.

Maybe it's a bit like distributed programming... the idea that you can hide object locations and pretend that everything is local turned out to be a doomed approach, however appealing initially.

Still, why be so quick to throw in the towel this time? And specifically for TBB, why not have an array of deques?

Ok, I flew into a passion. Nothing is wrong with one's application. Especially taking into account that stack overflow with a flat deque can occur even with an application that structures its work DAG into an exemplary balanced binary tree.

Just some other means for preventing stack overflow must be employed.

Why not have an array of deques? I think that Arch must have some good answer ;) IMVHO Single plain deque is simpler and potentially faster.


"Why not have an array of deques? I think that Arch must have some good answer ;)" I'm not so optimistic.

"IMVHO Single plain deque is simpler and potentially faster." That may be the case, but simplicity in the application should not be traded away for simplicity in the toolkit (unless you plan to only have a few applications).

I have attached a snapshot of some changes (to be patched into tbb21_20081109oss). Anyone for benchmarking them?


Quoting - Raf Schietekat
"Why not have an array of deques? I think that Arch must have some good answer ;)" I'm not so optimistic.

"IMVHO Single plain deque is simpler and potentially faster." That may be the case, but simplicity in the application should not be traded away for simplicity in the toolkit (unless you plan to only have a few applications).

None of the other algorithm templates require multiple deques. Having multiple deques would tax users of all other algorithms for the sake of a particular pipeline implementation. We think that any of the significant benefits of the depth parameter can be similarly achieved by carefully choosing the order in which items are pushed onto the deque.

Of course, in some cases we do tax everyone for the sake of a few. A notable example is the affinity support, which is currently exploited by only a few of the algorithm templates. However, it inflicts only a few branches on everyone, and gives a major scalability boost for some common idioms. The benefit has been measured at >5x for some cases. It's inevitably a judgement call which taxes are worth making everyone pay.

"None of the other algorithm templates require multiple deques."

That's beside the point: a pipeline would be only too happy with a deque to provide global progress, so much so that it probably would not mind if later data units bump into earlier ones that have stalled because they don't have a higher priority anymore (the only reason I see at this time for having more than two depths in pipeline).

The point is what a worker will do while waiting for a task's children to complete. Are you going to require the use of continuations, as Dmitriy indicates? That would be like throwing in the towel just because the opponent isn't showing you a sunny smile. Or is there another solution to this (I think I have an idea, but I'd like to hear yours first)? An inquiring mind wants to know...

So who would like to benchmark the changes I made against the original? I think there are problems with the test code that make it unsuitable for benchmarking, but you might have a real example that you can use. I just want to know that it's not a totally insignificant change or worse.

Due to some forum issues, I can't reply to your last post and had to reply to the thread.

The test_pipeline is definitely not good to serve as a benchmark. But have you tried the "text_filter" example, or even better the "square" example, shipped with TBB? That could give at least some digits.

Also, regarding your earlier questions that I promised to answer: pardon me that I did not answer at that time. The discussion has moved quite far since then. Are there still questions you'd like to know the answer for? If there are, could you please repeat them, or point me to them, so that I could help you the best way?

"Are there still questions you'd like to know the answer for" Thanks, but I think I've come to know pipeline quite well now. I would still want to know about this one deque per thread, but otherwise you might just comb the posts for feedback information and ignore the questions. I think I might try to partially restore order in a sandwiched non-ordered filter (except that I'm wondering whether and how much cache locality plays a role here), and look into the possibility of making the input buffers lock-free, and then submit the changes (or should I do that already?).

I am quite sure that if an unordered serial filter directly follows an ordered serial filter, its execution will likely be ordered: the previous filter outputs tokens in order, and the unordered serial filter processes tokens in FIFO order as far as I can tell, thus it will also be ordered. Consequently, the next serial filter will receive tokens in order as well. It looks like ordering will only change after a parallel filter is met. So I wouldn't bother with any optimizations for the "sandwich" case.

Making the buffers lock-free sounds interesting, though off the top of my head I do not know how to combine it with on-demand dynamic growth of the buffers.

Regarding #37 "So I wouldn't bother with any optimizations for the "sandwich" case.": Yes, you would need an ordered filter, a parallel filter, a non-ordered filter, an ordered filter, in that order, between any others, so unless anybody has a compelling use case...

"combine it with on-demand dynamic growth of the buffers" It's a hybrid approach. Most often, completely lock-free algorithms are probably too risky and too cumbersome for what they deliver (if anything), but if there's low-hanging fruit to be picked by using a lock-free "pre-input buffer", then why not give it a try... I'm adding tasks to a multi-producer singly linked list (compare_and_swap()), which at certain points is cleared out (fetch_and_store(NULL)), to be fed into the existing code; one of those points is if the code finds out that before the addition the token was not yet equal to the one being served (otherwise it would be returned from execute()), but after the addition it is, because in that case the addition and the notification may have crossed each other.

One thing occurred to me after my posting yesterday: I should give wrap-around tasks that won't replicate priority over tasks that will (based on their m_spawn value), otherwise I'm creating overpopulation that might undo any benefits of recycling and not touching a shared reference count, or worse. Isn't the current pipeline supposed to work well with an unlimited value for tokens in flight (disregarding external considerations like use of a circular buffer)?

Riddle me this: how do you determine whether a task was stolen if all tasks are children of a single abstract parent and get recycled over and over, making is_stolen_task() as currently specified meaningless?

Answer: hack the scheduler some more...

Rationale: have your pipeline cake and eat it too. The only way I see to let pipeline start up in logarithmic time and not cause a task replication runaway chain reaction even with an excessive or unlimited maximum number of tasks in flight is to keep replication-capable tasks at a separate low priority/depth. When such a task is stolen, it performs meiosis, with each task getting half the replication potential. When not stolen, it performs mitosis, mobilising a task for pipeline duty and then going to sleep again (tasks on pipeline duty continuously traverse through the pipeline and wrap around to the beginning until pipeline shutdown). To complete the picture, meiosis would imply mitosis as an obvious optimisation, and the original zygote would be twofold because it would be silly for the first thread to have to go steal back another replication task after a single theft instead of two.

Any obvious holes in the argument (and I don't mean the imagery)? Any pointers on the hacking?

Call me obsessed, but I keep having doubts about memory semantics guarantees in TBB. Specifically, I cannot make out whether a serial filter is supposed to have state that is automatically released and acquired from one data unit to the next, even though it seems to me that it should. Am I mistaken? It's probably too much work to plug into some nifty validation tool, though... but I'm still hoping for an official statement.

Quoting - Raf Schietekat

Call me obsessed, but I keep having doubts about memory semantics guarantees in TBB. Specifically, I cannot make out whether a serial filter is supposed to have state that is automatically released and acquired from one data unit to the next, even though it seems to me that it should. Am I mistaken? It's probably too much work to plug into some nifty validation tool, though... but I'm still hoping for an official statement.

Isn't ordered_buffer::low_token the state you are looking for? More precisely, with regard to the memory semantics it's actually ordered_buffer::array_mutex.

Added: but if you think of replacing the aforementioned mutex with something else, then yes, there should be some special memory fences built into a serial stage, at least if the current design of the pipeline is generally kept. Consider a parallel stage preceding a serial one. A few workers can process the parallel stage tasks simultaneously; after completion, each one would try to advance its current item to the serial stage. The serial stage tasks would be either scheduled to execute right away, or spawned to local pools; in both cases, they do not have any synchronization that would ensure the "one item at a time" constraint required for a serial filter.

Of course this reply does not pretend to document memory barriers inside the TBB scheduler.

"Isn't ordered_buffer::low_token the state you are looking for? More precisely, with regard to the memory semantics it's actually ordered_buffer::array_mutex." No, by state I mean something like a member variable of a tbb::filter subclass. Yes, there will be a release of array_mutex in note_done(), but where is the corresponding acquire before the next data unit comes along? It doesn't seem at all unlikely (but I don't really know yet) that the next task will stumble into some fences of its own, but, even if it is a given, wouldn't whether that is relevant depend on the as-yet unofficial particularities of x86 (I'm not familiar enough with Itanium to say anything about even the other Intel architecture)?

It would be nice to have some proactive formal guarantees, which would inspire more confidence than the mere observation that, so far, it seems to just work, somehow, as far as we can tell, anyway.

Quoting - Raf Schietekat

No, by state I mean something like a member variable of a tbb::filter subclass. Yes, there will be a release of array_mutex in note_done(), but where is the corresponding acquire before the next data unit comes along? It doesn't seem at all unlikely (but I don't really know yet) that the next task will stumble into some fences of its own, but, even if it is a given, wouldn't whether that is relevant depend on the as-yet unofficial particularities of x86 (I'm not familiar enough with Itanium to say anything about even the other Intel architecture)?

I probably did not get it. Let's look at the code (somewhat reduced for clarity). Here is what's basically going on in filter processing:

my_object = (*my_filter)(my_object); // filter execution
if( ordered_buffer* input_buffer = my_filter->input_buffer )
    input_buffer->note_done(my_token, *this); // (1) signal that the next token might proceed
my_filter = my_filter->next_filter_in_pipeline; 
if( my_filter ) // there is next filter
    if( ordered_buffer* input_buffer = my_filter->input_buffer ) // the next filter is serial
        input_buffer->put_token(*this ); // (2) check whether the current token might proceed

So before execution of a serial filter, the thread executes put_token on its buffer, and after execution of a serial filter, the thread executes note_done on its buffer. If you look there, both methods are basically protected by the same mutex. This mutex provides both acquire and release fences. Isn't it enough?

It would be nice to have some proactive formal guarantees, which would inspire more confidence than the mere observation that, so far, it seems to just work, somehow, as far as we can tell, anyway.

I raised the request internally.

"So before execution of a serial filter, the thread executes put_token on its buffer, and after execution of a serial filter, the thread executes note_done on its buffer. If you look there, both methods are basically protected by the same mutex. This mutex provides both acquire and release fences. Isn't it enough?" The put_token() in this sequence is immaterial because it pertains to the same thread executing on the next filter, not to another thread on the same filter. If the next task/token was already in the buffer, there is no relevant put_token(). Or maybe one thread calls note_done(), a second thread calls put_token(), and a third thread executes the task. Or even a combination.

P.S.: I'm actually just trying to distract from the fact that my own changes aren't ready yet. :-)

Quoting - Raf Schietekat

The put_token() in this sequence is immaterial because it pertains to the same thread executing on the next filter, not to another thread on the same filter.

It equally pertains to another thread on the same filter. And it does not matter for the same thread on another filter.

If the next task/token was already in the buffer, there is no relevant put_token(). Or maybe one thread calls note_done(), a second thread calls put_token(), and a third thread executes the task. Or even a combination.

Let's clarify what we need fences for.
In your post #40, you asked about a state in a serial filter that changes from one data unit to the next (to ensure serial execution, I assume). I claim this state is the per-filter instance of ordered_buffer (accessible via filter::input_buffer), and in particular its low_token member, together with high_token for an unordered serial filter. Both low_token and high_token are only accessed when the array_mutex is held, except for filter initialization and destruction. Thus array_mutex provides necessary fences for reading & changing the state.

A different perspective is how a (pointer to a) data item moves from one filter to another. It is returned by the previous filter and stored to stage_task::my_object. In note_done(), it is not accessed, since note_done() is about completion with the previous filter, while my_object is input for the next filter. In put_token(), my_object can be copied into the buffer (see stage_task::put_task_info()), and after that the task returns NULL and is destroyed; if not put into the buffer, the data will be processed by the same thread, so no fences are necessary. From the buffer, the pointer can be read by another thread; array_mutex provides the necessary fences between that read and the preceding write. This thread then copies the pointer into a new instance of stage_task, and spawns the task; spawning executes (at least) a release fence on the task pool. If the task is then taken out for execution by the same thread, no fence is necessary. If the task is stolen, stealing executes (at least) an acquire fence on the task pool. So from this perspective, the necessary fences are also in place.

"It equally pertains to another thread on the same filter." We disagree on this. "And it does not matter for the same thread on another filter." We agree on that.

"Thus array_mutex provides necessary fences for reading & changing the state." I beg to differ.

"So from this perspective, the necessary fences are also in place." Here I only have an issue with the word "also". :-)

If anyone else has an opinion on this; feel free to chime in.

Would you mind describing in more detail why you think the array_mutex fences do not provide enough guarantees, including possible analysis of corner cases where you think they are not enough?

Well, the lock on the mutex in note_done() is not necessarily followed by a lock on the mutex in put_token(), because the next task to visit the filter may already be in the buffer waiting, and then it is simply spawned.

But please disregard this mistake I made in #44: "Or maybe one thread calls note_done(), a second thread calls put_token(), and a third thread executes the task. Or even a combination.", because if put_token() happens later, the thread that called it also invokes the filter, because the scheduler is bypassed.

Maybe it matters only on some architectures whether synchronisation is through matching memory locations instead of a release related to the mutex on the one hand and an acquire related to something in the scheduler on the other hand, and not on x86 (although now I'm confused about that again), but why not simply acquire a lock on the filter while invoking it, just for the fences because there will be no competition?

(Added) Make that a very remote "maybe" in the third paragraph, really a request to confirm that the answer is "no". And yes, I mean one mutex on the buffer and another mutex on the filter (although they could technically both be member variables of the buffer).

(Added) Or, e.g., low_token could be atomic to signal state transfer from one task to the next.

One more, last attempt to show you that the necessary fences are in place :)

We might consider that a token arrives into a filter with put_token(), and departs from the filter with note_done(). If the next token arrives after the previous one departed, the array_mutex used in both methods provides the necessary fences (on the same memory location), and we seem to agree on that.

Now, your concern is what happens if the next token arrives before its predecessor departs, and whether there are fences then. Let's look at it in detail.

While thread T1 applies a serial filter to the token i1, thread T2 arrives with i2. T2 actions:
- enter put_token();
- acquire array_mutex;
- check low_token and find out that i2 can't yet be processed;
- put i2 into the buffer;
- release array_mutex;
- return from put_token().
So i2 is in the buffer, no task processes it. Meanwhile, T1 completed processing of i1 and is going to check it out. T1 actions:
- apply the serial filter to i1 (just completed);
- enter note_done();
- acquire array_mutex;
- increase low_token;
- find out there is token i2 in the buffer, which can be processed;
Note the pair of release/acquire fences on the same location between accesses to i2 and low_token. T1 continues:
- get i2 out of the buffer;
- release array_mutex;
- spawn a task to process i2, which roughly is:
  - acquire own task pool P1;
  - put the task into the pool;
  - release pool P1;
- return from note_done().
So now processing of i1 really completed, and meanwhile T1 executed at least two release fences on different locations after execution of filter itself.
Now T1 applies subsequent filters to i1. But what happens with i2? The task to process it is in task pool P1. Let's assume there is an idle thread T3 that is about to steal some work from P1. T3 actions:
- select T1 as a victim for stealing;
- steal a task:
  - acquire pool P1;
  - take the task to process i2;
  - release pool P1;
- execute the stolen task:
  - apply the serial filter to i2;
  - execute note_done(); details omitted.
Again, you see there is a release/acquire pair of fences on the same location (pool P1) between T1 applying the filter to i1 and T3 applying the filter to i2.

If this still does not convince you in the ordering guarantees, I must give up :)

Now we've got it (I think): if i1 wraps up before i2 is buffered, state is signalled through array_mutex, otherwise state is signalled through T1's task pool (array_mutex plays no role here). Right? Thanks for clearing up the "otherwise" part (even though it seemed you were hedging your bets by still mentioning array_mutex)!

Now I'm getting worried that this should have been obvious. Well, maybe not. :-)
