Pipeline

The array_mutex is important in the "otherwise" case too, because it guarantees that i2 has arrived in the buffer in full before i1's departure creates a task that will process i2. As you mentioned earlier (and as I used in the example), the thread that puts an item i2 into the buffer and the thread that takes it out of the buffer to spawn it for processing are likely to be different; that's why I mentioned this pair of fences as well.
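
A minimal sketch of the ordering argument (illustrative only, not the actual TBB source; stage_buffer and its methods are invented names): if the producer that deposits i2 and the consumer that removes i1 and spawns the follow-up task hold the same mutex, the producer's unlock and the consumer's later lock form the release/acquire pair, so i2 is fully visible before any task for it can be created.

#include <deque>
#include <mutex>

// Hypothetical buffer between two pipeline stages; names are illustrative.
template<typename Item>
class stage_buffer {
    std::mutex array_mutex;      // plays the role of the array_mutex discussed above
    std::deque<Item*> items;
public:
    // Producer side: i2 is completely written to the buffer while the lock is
    // held, so the unlock (a release) orders it before any later acquire.
    void put(Item* i2) {
        std::lock_guard<std::mutex> lock(array_mutex);
        items.push_back(i2);
    }
    // Consumer side: the lock (an acquire) synchronizes with the producer's
    // unlock, so by the time we decide to spawn a task for the next item,
    // that item is guaranteed to be fully in the buffer.
    Item* take_next_or_null() {
        std::lock_guard<std::mutex> lock(array_mutex);
        if (items.empty()) return nullptr;
        Item* next = items.front();
        items.pop_front();
        return next;             // the caller spawns the processing task
    }
};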

"that's why I mentioned this pair of fences as well" Ah, OK.

Well, I guess I'm throwing in the towel. If you want to have a laugh, get these changes relative to tbb21_20081109oss (a development release), replace the corresponding files, and watch it all grind to a halt.

I dropped some bread crumbs in "Modification variable in operator() method", and Dmitriy Vyukov responded: "I see the attachment in the last post, and I have skimmed over the thread, but I don't get what problem the attachment tries to solve, or what it tries to improve. What is it all about?"

This is an intermediate stage toward (hopefully) implementing global progress. Instead of having a global atomic to keep track of the number of live tokens, which is often just given an expensive nudge, I let the tasks wrap around to the beginning. With a serial input filter, the replication potential is still shared; otherwise, tasks with non-zero replication potential are kept in a separate level that is only tapped when no other work is available, giving such a pipeline the potential to start up in logarithmic time without the system getting flooded with surplus tasks. I'm not sure what the effect would be in the current implementation, but I had the impression that the global atomic was essential; in my version, the parameter for run() defaults to std::numeric_limits::max(). Also, buffers for serial filters get a lock-free inbox, which seemed like a nice thing to have (if it works) in an oversubscribed system. I thought about asking who would want to benchmark this, but it doesn't work yet, and I discovered that I need to take a break.
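
For the "lock-free inbox" mentioned above, here is a minimal sketch of a multi-producer/single-consumer queue of the kind such a buffer could use (my own illustration and naming, not the code in the attached patch; T must be default-constructible here):

#include <atomic>
#include <utility>

// Any thread may push; only the (serial) owner of the filter pops.
template<typename T>
class mpsc_inbox {
    struct node {
        std::atomic<node*> next{nullptr};
        T value{};
    };
    std::atomic<node*> tail;   // producers swing this
    node* head;                // consumer-private
public:
    mpsc_inbox() {
        node* stub = new node;
        tail.store(stub, std::memory_order_relaxed);
        head = stub;
    }
    ~mpsc_inbox() {
        while (head) { node* n = head; head = n->next.load(std::memory_order_relaxed); delete n; }
    }
    void push(T v) {                                        // wait-free for producers
        node* n = new node;
        n->value = std::move(v);
        node* prev = tail.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);     // publish the link
    }
    bool pop(T& out) {                                      // single consumer only
        node* next = head->next.load(std::memory_order_acquire);
        if (!next) return false;                            // empty, or link not yet visible
        out = std::move(next->value);
        node* old = head;
        head = next;                                        // next becomes the new stub
        delete old;
        return true;
    }
};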

Quoting - Raf Schietekat

This is an intermediate stage toward (hopefully) implementing global progress. Instead of having a global atomic to keep track of the number of live tokens, which is often just given an expensive nudge, I let the tasks wrap around to the beginning. With a serial input filter, the replication potential is still shared; otherwise, tasks with non-zero replication potential are kept in a separate level that is only tapped when no other work is available, giving such a pipeline the potential to start up in logarithmic time without the system getting flooded with surplus tasks. I'm not sure what the effect would be in the current implementation, but I had the impression that the global atomic was essential; in my version, the parameter for run() defaults to std::numeric_limits::max(). Also, buffers for serial filters get a lock-free inbox, which seemed like a nice thing to have (if it works) in an oversubscribed system. I thought about asking who would want to benchmark this, but it doesn't work yet, and I discovered that I need to take a break.

I think it will be easier for you not to ask me for comments :) because I don't understand what "tasks wrap around to the beginning" means, what "replication potential" is, what "replication potential is shared" means, what a "separate level" is, what "start up in logarithmic time" means...
But, in general, I still hold the opinion that a scheduler must target either (1) a finite computational load or (2) a potentially infinite general load, but not both. Given (1), the scheduler must strongly prefer cohort scheduling over fairness; the cohort may be with respect to stages/agents, or to items/messages, or both. Given (2), the scheduler must prefer very strong fairness guarantees.
Probably I will soon change my mind, because I will have to design a scheduler that supports both kinds of load...

Quoting - Dmitriy Vyukov
But, in general, I still hold the opinion that a scheduler must target either (1) a finite computational load or (2) a potentially infinite general load, but not both. Given (1), the scheduler must strongly prefer cohort scheduling over fairness; the cohort may be with respect to stages/agents, or to items/messages, or both. Given (2), the scheduler must prefer very strong fairness guarantees.
Probably I will soon change my mind, because I will have to design a scheduler that supports both kinds of load...

It's interesting how this is done in ConcRT; they have quite a solid approach. A user can create several schedulers, and for each scheduler the user can specify a SchedulerPolicy. SchedulerPolicy includes several parameters, among them FairnessType = {EnhanceCacheLocality, EnhanceFairness}, ThreadsMultiplier (an oversubscription coefficient), ResourcePriority, and RogueChoreThreshold (if a chore executes too long, the scheduler will create additional threads, i.e. oversubscribe).
PPL (the tasking library) uses the EnhanceCacheLocality scheduler, which executes chores in LIFO order and polls chore groups in LIFO order too. AAL (the agents library) uses the EnhanceFairness scheduler, which executes chores in FIFO order and polls chore groups in round-robin order (switching chore groups after each chore).
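
As a rough illustration of the behavioural difference just described (all names are invented for the sketch and are not the ConcRT API; the attached headers have the real policy keys):

#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

enum class FairnessType { EnhanceCacheLocality, EnhanceFairness };

using Chore = std::function<void()>;
using ChoreGroup = std::deque<Chore>;

void run_all(std::vector<ChoreGroup>& groups, FairnessType policy) {
    if (policy == FairnessType::EnhanceCacheLocality) {
        // PPL-style: stay on the most recently touched group, newest chore first (LIFO).
        while (!groups.empty()) {
            ChoreGroup& g = groups.back();
            if (g.empty()) { groups.pop_back(); continue; }
            Chore c = std::move(g.back());
            g.pop_back();
            c();
        }
    } else {
        // AAL-style: oldest chore first (FIFO), round-robin over groups,
        // switching to the next chore group after every chore.
        std::size_t i = 0;
        while (!groups.empty()) {
            if (i >= groups.size()) i = 0;
            ChoreGroup& g = groups[i];
            if (g.empty()) { groups.erase(groups.begin() + i); continue; }
            Chore c = std::move(g.front());
            g.pop_front();
            c();
            ++i;
        }
    }
}

int main() {
    std::vector<ChoreGroup> groups(2);
    for (int g = 0; g < 2; ++g)
        for (int c = 0; c < 2; ++c)
            groups[g].push_back([g, c]{ std::printf("group %d chore %d\n", g, c); });
    run_all(groups, FairnessType::EnhanceFairness);   // prints the chores interleaved
}
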
I've attached ConcRT header files.

Attachment: msvc2010.zip (41.62 KB)

"tasks wrap around to the beginning" Currently a new task is allocated at the input filter to escort each new data unit through the pipeline, and then it is destroyed. Maybe the memory allocator-related cost is limited, but this way of doing things involves needless writing to a shared location, so refurbishing the tasks instead may improve performance for that reason alone.

"replication potential" In my version, the argument to run() is administered from a central location (shared) only with a serial input filter, otherwise it is distributed for recursive parallelism.

"separate level" As explained in the source, tasks that can self-replicate all live at m_parent.depth()+1, and the rest at greater depth/higher priority.

"start up in logarithmic time" Recursive parallelism means exponential growth as time progresses, or logarithmic time to reach all cores. For massively multi-core processors, that might be significant for short-lived pipelines. Well, for that to really happen, tasks shouldn't all be siblings of course, but it's a work in progress.

Anyway, it's all fairly trivial (the most interesting change is probably that the pipeline tunes itself), and I'm confident I'll get around to fixing the problems myself after taking a break.

Quoting - Arch Robison (Intel)

None of the other algorithm templates require multiple deques.

As I see it, most parallel algorithms can run out of stack in the worst case, given uncontrolled stealing.
Arch, am I missing something?

Quoting - Dmitriy Vyukov

As I see it, most parallel algorithms can run out of stack in the worst case, given uncontrolled stealing.
Arch, am I missing something?

What I am thinking of is the following simple strategy.
On every recursive call the scheduler probes stack usage: if stack usage is below a threshold, the thread is allowed to steal whatever it wants; if stack usage is above the threshold, stealing is disabled.
The idea is that in 99% of cases stack usage will be below the threshold, so we keep the advantage of a single simple deque and improved parallelism (arbitrary stealing is allowed). For example, let's say that the total stack size is 2 MB and the threshold is 1 MB.
Do you see any caveats?
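
A rough sketch of the probe (assuming a downward-growing stack and an invented worker_thread type; a real scheduler would get the actual stack bounds from the OS):

#include <cstddef>

struct worker_thread {
    char* stack_base;                                        // recorded when the thread starts
    static const std::size_t stack_size = 2 * 1024 * 1024;   // 2 MB total
    static const std::size_t threshold  = 1 * 1024 * 1024;   // steal only below 1 MB used

    // Estimate current stack usage by comparing the address of a local
    // variable against the recorded base (assumes a downward-growing stack).
    std::size_t stack_used() const {
        char probe;
        return static_cast<std::size_t>(stack_base - &probe);
    }

    // Called on every recursive scheduler entry: arbitrary stealing is allowed
    // while usage is below the threshold, and disabled above it.
    bool may_steal() const { return stack_used() < threshold; }
};

int main() {
    char base;
    worker_thread w{&base};
    return w.may_steal() ? 0 : 1;   // near the top of the stack: stealing allowed
}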

Dmitriy Vyukov "Do you see any caveats?" Unemployed worker thread, does nothing against job starvation?

Meanwhile, I think I've fixed the problems I had earlier, so here's a new version, based on tbb21_20081109oss (the most recent development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3; see README.pipeline for applying the patch. Anyone for testing this in a realistic situation, to validate run() without a parameter (this version should be self-tuning) and to compare the performance against the original?

Quoting - Raf Schietekat
Dmitriy Vyukov "Do you see any caveats?" Unemployed worker thread, does nothing against job starvation?

I've described only the "work-stealing core", so to say. I think the scheme can be combined with a thread-employment and starvation-prevention mechanism, so this should not be a problem.

Ah... maybe.

Please, for now, ignore all references I made to "self-tuning" (I should check my facts again in both the original and the changed versions, and maybe run() should not have a default parameter anyway), but I'm still hoping for better performance by avoiding shared token-count handling.
