Emulating an OpenMP parallel for with TBB

I was wondering if it is possible to have a construct similar to an OpenMP parallel for in TBB. For example, in OpenMP I can do the following:

void foo(void) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < 10000; ++i) {
        int tid = omp_get_thread_num();  // current thread id
        int nth = omp_get_num_threads(); // current number of threads
        // do sth - even if there are dependencies
    }
}

I am aware of the existence of the parallel_* algorithms, but they make no guarantee that the tasks generated will execute in parallel (and it is explicitly mentioned that dependencies between them can cause a deadlock). The question is whether I can express a fork-join model in TBB and have the guarantee that it will fork as much as the resources allow. I believe it should be possible if I had access to information on how many threads are already executing something.


The question is why would you do that except to make the program run slower than it should?

There may be a more appropriate way to translate those dependencies than simple transliteration, so to say.

I know that TBB was not made with this in mind and that there could be scalability issues with a data parallel approach, but I would still like to experiment with fork-join in TBB. I tried to implement it with a task_list and have a scalable fork; however, I'm still lacking the information on how many threads are readily available.

TBB's message is that required concurrency (as opposed to optional parallelism) is bad for you(r performance), so you'll have to forgive it for not handing you the rope with which to go hang yourself. If you just want to experiment, stick with OMP or employ user threads for the required concurrency, and perhaps you'll still have some opportunity to add features from TBB to help you with the performance (hybrids are possible, but it only makes sense if all those user threads aren't already hogging the machine).

Or you could still tell us what you want to do instead of how you want to do it and let us suggest a better approach.

I understand what TBB's ideology is - and I do not disagree with what you say. All I want is to execute a work function SPMD-style on N threads - and to know for sure that I have N threads executing it; which equals a fork-join. However, I do not wish to oversubscribe the system in the presence of other executing tasks (in that case, if I try to fork, it will just use the currently executing thread). I am aware of the "dangers" of the data parallel model and I do not suggest adding it to TBB; however, I'm not your typical user and I would like to test a data parallel approach through TBB and merge it with task parallelism. In that case I can compare it with OpenMP, which offers both (even though OpenMP tasks are severely limited). But for that I need at least a way to know, for example, that a parallel_for will certainly create N tasks.

No matter whether you use parallel_for or spawn tasks, in TBB you do not know how many threads will work on it. Threads may join and leave work sharing arena at any time.

You may spawn exactly as many tasks as you need (or expect) threads, each waiting on the same barrier. When you have no other work for TBB in the app/test, the threads will eventually arrive at the barrier. But any comparison of this hack with OpenMP does not make sense, because OpenMP is designed and optimized for this team-and-barrier programming style (and has compiler support to do it efficiently) while TBB is not.
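
For illustration only, a minimal sketch of this hack (spmd_hack and the spin barrier are illustrative, not a TBB facility; assumes C++11 atomics and tbb::task_group):

#include <tbb/task_group.h>
#include <atomic>
#include <thread>

// WARNING: deadlocks if TBB never supplies nth concurrent workers -
// exactly the hazard discussed in this thread.
void spmd_hack(int nth, void (*f)(int tid, int nth)) {
    std::atomic<int> arrived(0);
    tbb::task_group tg;
    for (int tid = 0; tid < nth; ++tid) {
        tg.run([=, &arrived] {
            ++arrived;
            while (arrived.load() < nth)   // crude spin barrier: wait until
                std::this_thread::yield(); // all nth tasks are running
            f(tid, nth);
        });
    }
    tg.wait();
}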

By the way, this team-and-barrier style has little relation to data parallelism IMHO. What you want is not parallelism but mandated concurrency.

I believe this would cause a deadlock, especially if the number of tasks is more than the number of threads available to TBB. Why do you call it mandated concurrency? I never said I want N threads; I only want to know how many threads are working on my stuff. I am basically asking if something like a parallel for is possible in TBB. If TBB cannot do such things ever, that's OK.

If you want to play with data parallelism then maybe you should code up the data parallel portion of your algorithm in Intel Array Building Blocks (part of the interoperable Intel Parallel Building Blocks), a package designed to exploit data parallelism. If that doesn't further your purpose, then perhaps what you really are (unintentionally) talking about is mandated concurrency, or at least an assurance that a piece of code has been executed by some number of distinct threads.

Intel Threading Building Blocks doesn't work that way. You can think of it as a decentralized scheduler. There's no master controller that knows what all the threads are doing. There is no repository for collecting such statistics (at least in production code) because the cache thrashing required to collect it would kill performance.

You could have a kernel that keeps a tally, marking some slate as each unique thread passes through the code and then determining after the fact how many threads were engaged, but I get the feeling that's not what you want. And even that doesn't guarantee that all the threads that touched it were there simultaneously for long enough to be called concurrent.

And even if you could get all the available threads to play concurrently, it still probably wouldn't look much like an OMP for-loop. The scheduling will be radically different. If you just use the OMP static scheduler, it will divide your loop of 10,000 into as many pieces as there are available threads (typically either all or one) and give each piece to a thread, which will execute in the specified direction until it's done with its piece, then go to the barrier. Whereas in Intel TBB, one thread will split the list into two pieces of 5,000, split one of those into two pieces of 2,500 and so on until it gets down to a threshold where it executes the piece it has left; concurrently, idle threads will look for these left-behind pieces to steal and start the same splitting process upon them. The number of threads executing upon the 10,000 will then grow to some maximal concurrency as threads finish their pieces of the 10,000 and go off to look for other work. At some point that work will start to become scarce, and the available threads will go off to look for other work and eventually idle themselves. There's no even division of the work across the list, and if you perturb the threads enough to observe them, it just looks like chaos.
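
For comparison, a sketch of roughly what that 10,000-iteration loop looks like in Intel TBB (assuming the lambda-based form of parallel_for):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void foo(void) {
    // TBB splits the range recursively as described above; there is no
    // equivalent of omp_get_thread_num()/omp_get_num_threads() here.
    tbb::parallel_for(tbb::blocked_range<int>(0, 10000),
        [](const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                // do sth - independent iterations only
            }
        });
}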

So, not your typical user, can you be more specific on the types of data parallelism with which you want to experiment?

I do not wish to use Array Building Blocks. I am developing a framework that allows the user to run SPMD functions. The user only says: spmd_execute(Nthreads, f, args...); and the framework scalably invokes f() Nthreads times with the given arguments. The user does not necessarily have to specify Nthreads. In OpenMP, a rough abstract implementation of spmd_execute() is:

spmd_execute(Nthreads, f, args...) {
    if (omp_get_level() == 0) {
        int max_nth = min(omp_get_max_threads(), Nthreads);
        #pragma omp parallel for nowait schedule(static, 1)
        for (int i = 0; i < max_nth; ++i) {
            f(args...);
        }
    } else {
        f(args...);
    }
}

f() might do computations, have OpenMP calls, create nested OpenMP parallel regions and/or call spmd_execute() recursively. Because f() is executed SPMD, I can know how many threads execute it and what each thread id is. This is crucial for my framework, as is supporting nested parallelism without oversubscribing. However, specifying exactly Nthreads to spmd_execute() is not a big deal. I am trying to do the same with TBB (in which case I can have my SPMD function f() call TBB inside). I do not want to use raw threads, as that would oversubscribe the system in case f() is calling TBB.

But what are those dependencies if you don't obey the user's Nthreads argument in the first place? You're not telling us the ultimate goal, and even the mandated how is getting blurry now...

The user can, if she wants to, specify how many threads are going to execute f(). If she doesn't, then the framework picks a suitable number of threads. Which means:
1) if no other parallel computation exists, fork as much as possible
2) if another parallel computation exists (i.e. some threads are currently working), use the rest to fork
3) if it is a nested spmd_execute() call, then run on one thread - the one that called you
I do not know the dependencies in f() - consider this user code. All I can provide is the number of threads and the current thread id. It is an arbitrary use case; I have no released software or users, I just want to see if it can be done with TBB (as I showed, it is doable with OpenMP).

So it sounds like it is about using a parallel framework as a mere thread pool manager that keeps control of the overall number of active execution contexts and so tries to prevent over- and undersubscription.

Yes, you can do that with OpenMP and cannot do that with TBB, whose motto is "think in terms of work (i.e. parallelism in your problem) and not in terms of workers (threads)". TBB helps express the parallelism, in the form of tasks or generic parallel algorithms. The notion of threads or a parallel region is not exposed, so you really cannot do what you want without dangerous hacks.

I think you and TBB will just have to agree to disagree and go your separate ways.

More or less yes - again, what I gave is an oversimplified version.

In this case, when I do task_scheduler_init init(p) and I know that nobody else uses TBB, do I have any guarantees that if I call parallel_for(0, p, f), f() will be executed on exactly p threads?
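
For concreteness, here is a sketch of what I mean (assuming TBB's compact integer-range overload of parallel_for; run and f are just illustrative names):

#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

void f(int i);  // my SPMD work function, with i acting as a thread id

void run(int p) {
    tbb::task_scheduler_init init(p);  // request a pool of p threads
    tbb::parallel_for(0, p, f);        // p tasks - but p threads?
}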

And if I do spawn_root_and_wait() from a task, when the spawned task is done, will TBB return me to the original or might it try to execute other pending tasks?

Will you get p threads if you're the only Intel TBB player and requested a pool of p threads? Maybe. No guarantees, as we said before. Consider my previous parallel_for example. If we're talking 10,000 items and, say, 16 threads, it's likely that all 16 will be employed, but not guaranteed. If there are only 100 items, it's less likely that all 16 threads will be employed. A case I have in mind is where all the free work gets stolen before all the threads in the pool become engaged. You might have 15 threads engage in the work while the 16th vainly rolls a die to find work but never does. The one thing you can guarantee here is that it won't be more than p threads.

I'd try to answer your other question, but I'm not sure I understand it. "return me to the original?" task? If there are other tasks that have been scheduled, the threads in the TBB pool will seek them out. You don't own any threads in TBB, only tasks.

ipapadop,

In your second example (spmd_execute), consider for the sake of argument that you enter spmd_execute with a thread at level 0 (the main OpenMP thread of the process). This establishes (or enlists) the max_nth thread pool to perform the work (as an n-way fork). Immediately after this, all the threads will be busy chewing away at function f(args).

Assume also that the work to be performed in f(args) and/or the latency through f(args) is unequal. Note f(args) could perform I/O, have page faults, or get pre-empted by other processes on the system. What happens in this situation is some threads complete earlier than others. No problem here (supposedly), because you have nowait on the #pragma omp. Now let's see what can supposedly happen...

Under this circumstance, when the threads that are not the main thread complete, those threads go idle (and may also burn time in KMP_BLOCKTIME). Should the main thread complete before the other threads, the main thread exits spmd_execute and presumably runs into code that calls spmd_execute again (with the same or a different f(args)). On this second call it will detect that it is called from level 0 and will assume all threads are available. It is quite possible that some of the threads from the first call are still busy, but your code is executing as if they are available.

In the former case you have idle threads; in the latter case you have quasi over-subscription of threads on calls from the outermost level and under-subscription of potentially soon-to-become-available threads. So, except for an application that has only one nesting (recursion) level, your spmd_execute would likely exhibit ineffective use of available cores.

Jim Dempsey

www.quickthreadprogramming.com

It still would be nice to know for certain what this is all about for this discussion to be meaningful: a problem that could be refactored to shine on top of TBB, or an essentially different paradigm with requirements like, e.g., actor-based concurrency's. With performance and fairness typically (and perhaps essentially) in opposition, and TBB deliberately choosing the side of performance, an essential trade-off must be made, which is hard to do without knowing the ultimate goal. And even if everything is clearly understood and doable, there is the cost-benefit question between making the necessary changes (hopefully without adverse effects elsewhere or burdensome maintenance efforts afterwards) and being able to use them in other areas as well. Hence my previous response.

@Robert Reed: I am basically asking in the case that I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP or will it execute other pending tasks as well? In the latter case, will it work-steal as well?

@Jim Dempsey: Any decent runtime should be able to detect how many worker threads are available and divide the work accordingly, right?

At least the OpenMP runtimes I have tried do that - if you specify the correct policy, they will only fork as much as possible, without oversubscribing. And remember, I am building a framework, so if the user does not specify how many threads she wants, I'm deciding how many I am going to use.

@Raf Schietekat: I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue).

I could allocate my container using the scalable allocator and let TBB do its magic as regards affinity, but most of the time this is not possible for various reasons.
In OpenMP this thing works, but I cannot interface correctly with TBB since 1) it does not allow me to say "I want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff. I am also coming to the conclusion that, because of the lack of this support (which is a design decision which I am not debating), I cannot have it my way.

"I am basically asking in the case that I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP or will it execute other pending tasks as well? In the latter case, will it work-steal as well?"
See Reference Manual/Task Scheduler/Scheduling Algorithm: returning from spawn_root_and_wait() has the highest priority unless you explicitly override it, but there is no telling how long a thread may be otherwise occupied once it has stolen a task.

"Any decent runtime should be able to detect how many worker threads are available and divide the work accordingly right?"
Wrong. TBB deals more efficiently with the opportunity cost of parallel execution.

"At least the OpenMP runtimes I have tried do that - if you specify the correct policy, they will only fork as much as possible, without oversubscribing. And remember, I am building a framework, so if the user does not specify how many threads she wants, I'm deciding how many I am going to use."
I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)

"I am playing with distributed containers if that helps you. I am partitioning my container in n blocks and there are many different partitioning schemes. I can then use the spmd_execute() to apply algorithms on the distributed container. My framework decides how many threads to spawn for executing the algorithm, based on locality, number of available threads etc (yes, I do use knowledge regarding locality so it is not an issue)."
I still see no inherent required concurrency.

"I could allocate my container using the scalable allocator and let TBB do its magic as regards to affinity, but most of the times this is not possible for various reasons."
That sounds mysterious enough for me to withhold comment.

"In OpenMP this thing works but I cannot interface correctly with TBB since 1) it does not allow me to say "i want n threads to execute this thing" and 2) it does not take into account how many threads are currently busy doing OpenMP stuff."
With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear.

"I am also coming to the conclusion that because of the lack of this support (which is a design decision which I am not debating) that I cannot have it my way."
I refer back to my rope analogy (less is more), and wish you good luck.

"Wrong. TBB deals more efficiently with the opportunity cost of parallel execution."That's completely orthogonal.

"I do not see why female users would be more willing to sacrifice performance for even division of work... on a machine. :-)"Gender-neutral pronoun...

"I still see no inherent required concurrency."Bottom line is: I cannot express the algorithms using only the TBB provided structures; not because I do not want to, because I can't. They are written in a data parallel way and use distributed containers, and rewriting them is not desirable. I have to do the best I can while at the same time, use task parallelism - and then when speedups are there, I can convince users to change their algorithm. And I do not see why the answer to my question "Is this possible?" has essentially to be "Your question is wrong"."With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear."
Good - at least that's a good lead that I didn't know.

A runtime system cannot predict when a busy thread will become idle. Your code though, could have heuristics or have pre-determined tuning parameters (e.g. in OpenMP setting number of threads for next parallel region).

As for partitioning based upon currently idle threads: while TBB doesn't have it, QuickThread (www.quickthreadprogramming.com) does have a means to partition based upon the availability of threads:

parallel_for( Waiting$, aFunctor, iBegin, iEnd, otherArgsHere);

If you want to restrict scheduling to the current thread plus idle threads, but only to threads in the same processor socket as the thread issuing the parallel_... call:

parallel_for( Waiting$+L3$, aFunctor, iBegin, iEnd, otherArgsHere);

There are several other scheduling hints you can use as well, e.g. one thread per socket:

parallel_for( OneEach$+L3$, aFunctor, iBegin, iEnd, otherArgsHere);

Or, if you know the thread will block on I/O or an event:

parallel_task( IO$, aFunctor, otherArgsHere);

QuickThread has two classes of threads: computational only, and I/O.
In its parallel_pipeline you can mix threads of different classes along different stages of the pipeline. So your original problem of your f(args) having sections with computation only and other sections with I/O or event waiting could easily be handled with this type of parallel_pipeline.

Back to TBB for a moment.

You can use an atomic class variable to count the number of busy (or idle) threads in your application. This will take some work on your part but could possibly be done by inserting a class at the front of each task (kind of like profiling). Based on the availability of threads you could then use TBB's parallel_invoke:
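
For example, a rough sketch of such a tally (BusyGuard, g_busy and countAvailable are illustrative names, not TBB facilities; nAvailableThreads below would be the result of countAvailable):

#include <tbb/atomic.h>

tbb::atomic<int> g_busy;  // zero-initialized at static storage

struct BusyGuard {        // construct one at the top of every task body
    BusyGuard()  { ++g_busy; }
    ~BusyGuard() { --g_busy; }
};

// Only a momentary snapshot - threads may come or go immediately after.
int countAvailable(int pool_size) {
    int busy = g_busy;
    return busy < pool_size ? pool_size - busy : 0;
}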

switch (nAvailableThreads)
{
case 0: // count is buggered
case 1:
    f(args);
    break;
case 2:
    argsSplit(args1, args2);
    parallel_invoke(
        [&](){ f(args1); },
        [&](){ f(args2); });
    break;
case 3:
    ...
default:
    // split max ways here
} // switch

The above is your "traditional" fork and join

Jim Dempsey

www.quickthreadprogramming.com

"You can use an atomic class variable to count the number of busy (or idel) threads in your application. This will take some work on your part but could possibly be done by inserting a class in the front of each task (kind of like profiling). Based on the availability of threads you could use TBB's parallel_invoke"This is actually my current approach (taken from in the TBB reference manual in structured_task_group section) - however, this will break in case TBB is used somewhere in user-code and I'm unaware of it.

parallel_invoke() will try to instantiate n copies of my function but only some of them may execute in parallel, depending on resources. My function is SPMD, so it may contain collective operations that are going to cause a deadlock if all instances are not running.

Rewriting everything using continuations, although it is highly appealing and solves a lot of problems, is not possible.

Quoting ipapadop:
"@Robert Reed: I am basically asking in the case that I call spawn_root_and_wait() from a task T1, what will the scheduling policy be? Will TBB execute the spawned tasks from T1 and return me to T1 ASAP or will it execute other pending tasks as well? In the latter case, will it work-steal as well?"

The thread will return to T1 ASAP, but it's allowed to take other pending tasks and even steal if the return is not yet possible (which means part of the spawned job was stolen and not yet completed by the thieves).

""Wrong. TBB deals more efficiently with the opportunity cost of parallel execution."
That's completely orthogonal."
How does the program adapt to incorrect estimations of the workloads at launch time and/or different progress at run time (otherwise they don't finish together and some processing units sit idle)? How does it adapt to new processing elements becoming available while the work is in progress, or conversely to new high-level parallel slack elsewhere in the program? How does work distribution scale? Etc. Jim, I'd like to hear your take on that: did you include this as a feature (to be used with due caution), or are you less apprehensive about its use than I was taught to be?

"Gender-neutral pronoun..."
Actually, "he" is understood to be gender-neutral, and "she" is politically correct in overdrive (for the sake of future linguists studying old archives).

""I still see no inherent required concurrency."
Bottom line is: I cannot express the algorithms using only the TBB-provided structures; not because I do not want to, but because I can't. They are written in a data parallel way and use distributed containers, and rewriting them is not desirable. I have to do the best I can while at the same time using task parallelism - and then, when the speedups are there, I can convince users to change their algorithm. And I do not see why the answer to my question "Is this possible?" has essentially to be "Your question is wrong"."
Are you saying that you have to deal with legacy code written for something like a grid that cannot be economically adapted to nimbler SMP? I can understand that as a real-world external constraint (although it would have been more productive to reveal this earlier in the discussion), and if it were prevalent and important enough I wouldn't be surprised if TBB would add support for it for that reason alone. But I'd be interested to watch a race between the two approaches (how did "race" become a bad thing?).

Quoting ipapadop"With Intel's compiler you can have OpenMP and TBB automagically coordinate thread use, I hear."
Good - at least that's a good lead that I didn't know.

Not completely automagically. There is a special component for this, and OpenMP usage should follow some rules for the coordination to have effect. Write me an email if you want to learn more.

>>Jim, I'd like to hear your take on that: did you include this as a feature (to be used with due caution), or are you less apprehensive about its use than I was taught to be?

In a parallel task paradigm, such as TBB and my QuickThread (singular, not QuickThreads plural), it is relatively easy for the task scheduler to know when threads become idle (this is part of its job). What is difficult though (without program hints) is how much compute time is remaining in any given task. None of the threading toolkits offers this capability, but someone could add it to their application (progress indication). When it is known that you have n idle threads, you could potentially schedule n+1 threads for the parallel_for or other parallel construct, assuming your threading toolkit has this capability. OpenMP does, as does QuickThread. QuickThread extended this capability by introducing filters to the "count the threads in set xxx" query (e.g. cache level or socket). In addition to the count, the scheduling can optionally restrict the threads to the cache level or socket of interest. This does not guarantee that what you see available at the time of the scheduling decision will be immediately available after you make the decision. Nor does it guarantee that, after you make the scheduling selection, an additional thread in that set becomes free.

QuickThread does have a feature whereby, when you process a parallel_list, the skip list automatically reconfigures as new threads become available during execution of the list.

Also, since QuickThread uses (for example):

parallel_for(OptionalSelector$, functor, iBegin, iEnd, argsHere);

To enqueue

void functor(intptr_t iBegin, intptr_t iEnd, arg_t arg, ...)
{
    for (intptr_t i = iBegin; i < iEnd; ++i)
    {
        ...
        doSomething();
        ...
    }
}

Of particular interest here is that, unlike TBB, the iteration space is visible to the function of the functor.
What this means is that, after scheduling, during execution of the task loop, you can observe whether additional threads become available. If they do, and conditions are acceptable, you can split the work.

Example of a conditional 2-way fork

void functor(intptr_t iBegin, intptr_t iEnd, arg_t arg, ...)
{
    for (intptr_t i = iBegin; i < iEnd; ++i)
    {
        if ((iEnd - iBegin > threshold) && qt_CountOfWaitingThreads())
        {
            intptr_t half = (iEnd - iBegin) / 2;
            parallel_invoke(
                [&]() { functor(iBegin, iBegin + half, arg, ...); },
                [&]() { functor(iBegin + half, iEnd, arg, ...); });
            return;
        }
        ...
        doSomething();
        ...
    }
}

or insert an n-way fork if you prefer.

If you get tired of doing that then write a template that does this for you.

There are different ways to qualify which threads are waiting.

Jim Dempsey

www.quickthreadprogramming.com

But what, if any (other than legacy requirements), would be the circumstances that would entice you as a programmer to use this feature (even the one sentence you quoted from that paragraph was really a question of "why", not "what")? How far away a thread in the NUMA hierarchy would you allow to come and collaborate potentially low in the dependency graph (it would seem wasteful to invite a thread that does not share the same cache and should pick another high-level task instead), and wouldn't NUMA awareness be a precondition for doing that? I don't see such a restriction in the use of qt_CountOfWaitingThreads() in the example, though.

Raf,

To check for a waiting thread on the same socket, and to schedule to the same socket, modify the prior code to use:

if ((half > threshold) && get_qtControl()->CountOfWaitingThreads(L3$))

This gets the count of waiting threads within L3 distance of the current thread.

Then

parallel_invoke(
L3$, // restrict to within L3
[&](){ functor(iBegin, iBegin+half, arg1, ...); }
...

If you have n sockets per NUMA node (n > 1) and more than one NUMA node, then consider replacing L3$ above with M0$ (within 0 NUMA hops of the current CPU), assuming that is what you want.

QED

Jim Dempsey

www.quickthreadprogramming.com

That's a lot of control, very nice! Now, would you agree that this selectivity is essential for performance? And it seems a bit strange to have to insist, if you'll forgive me for that, because isn't that why you would have built in this level of controllability, e.g., because if it stands to reason that stealing low-level tasks is bad if the NUMA distance is considerable, then why would one willingly invite such a thief instead of one sharing an already warmed-up cache (if that is the appropriate term)? And, in the absence of external requirements to have mandatory concurrency or a way to select threads, wouldn't it then be better to not ask for a particular level of parallelism than to let any thread join in?

>>Would you agree that this selectivity is essential for performance?

Sometimes yes, sometimes no. You have the ability to select when it makes a difference.

An additional thing to consider is:

a) parallel_for(...);
b) parallel_for(L3$, ...);
c) parallel_for(Waiting$+L3$, ...);
d) parallel_for(ExcludeMyself$+L3$,...);

a) "standard" parallel for - all threads
b) all threads within the socket of the current thread, regardless of busy or not
c) current thread plus any waiting threads in the current socket
d) all threads in the current socket except the current thread

In QuickThread you can choose if the current thread participates or not and/or blocks or not.
The above sketches show some of the flexibility of selecting the degree of "hot in cache".
(locality filters: L0$, L1$, L2$, L3$, M0$, M1$, M2$, M3$)

Now what about "cold in cache"? E.g. you know the data is not in cache and you would like to push the work off to a socket that has the most available threads:

parallel_for(NotInCache$+L3$, ...);

>>And, in the absence of external requirements to have mandatory concurrency or a way to select threads, wouldn't it then be better to not ask for a particular level of parallelism than to let any thread join in?

In those situations, don't add a qualifier.

There are several other things you can do quite easily using QuickThread that are difficult (or not possible) to do with other threading paradigms. Consider creating a major parallel task, scheduling this task, then indicating that when the parallel task completes you wish to do something else, i.e. a task completion routine. For controllable tasks in QuickThread you declare and use a task control structure:

qtControl yourTaskControl;

this can be static or scoped. Then

// parallel for in current socket without current thread
// and without blocking
parallel_for(
    ExcludeMyself$+L3$,
    &yourTaskControl,
    aFunctor, iBegin, iEnd, args here...);
// immediate return from parallel_for
// Now enqueue a completion routine
// to the above task control such that
// when the above parallel_for completes
// your completion routine runs
parallel_task( OnDone$,
    &yourTaskControl,
    yourCompletionRoutine,
    itsArgsHere);
// main thread still running here
...
// if you want, then some time later
yourTaskControl.waitTillDone();

(scoped task controls have an implicit waitTillDone() in the dtor of qtControl)

Also, it is not unusual for an application to have tasks that perform I/O. QuickThread has two classes of threads: compute and I/O.

Compute class threads are the default. The I/O class is indicated by inclusion of the qualifier IO$, or, in the case of an I/O completion routine for a compute routine, IOOnDone$.

In a parallel paradigm like OpenMP and TBB, all threads are equal. Should you schedule a thread that blocks for I/O, it removes a thread from your compute thread pool. In OpenMP or TBB, to correct for this you may be inclined to add a thread (or two). Doing this produces oversubscription when not waiting for I/O. Not including additional threads produces undersubscription when a thread is waiting for I/O. In QuickThread you have none of this, because you have two classes of threads (assuming you take advantage of this and program accordingly).

Jim Dempsey

www.quickthreadprogramming.com

I'm getting the impression (please correct me if I'm mistaken) that this is not about my question about the original matter of a low-level task actively grabbing threads that could otherwise have chosen to steal a high-level task.

Task level/priority is a bit different in QuickThread than it is in TBB, OpenMP, Cilk, ...

Each thread has its own queue.
Each thread knows the relationships as to which threads share various cache levels and NUMA nodes

(n producers n consumers)

parallel_invoke dispatched tasks have the highest priority within the restrictions placed upon them; there are 4 levels per thread; these tend to get processed LIFO, but some enqueuing sequences can reorder one or more prior parallel_invoke enqueued tasks (to that thread).

parallel_...(other) has two types of queues: an n-slot queue (currently 4 slots), similar to the parallel_invoke queue but of lesser priority and with a higher degree of filtering restrictions.

The second type of parallel_...(other) is a FIFO linked list. This has the lowest priority.
You can also enqueue task completion nodes in yet another FIFO linked list (one list per task control)

The parallel_...(other) can enqueue either to the (predominantly) FIFO n-slot list, which has higher priority than the unlimited FIFO list, or directly to the lowest-priority unlimited FIFO list.

If a deadlock is detected, then the affected queues are reordered.
Under an exceptionally unusual circumstance (low memory), there is yet another queue.

***

Task stealing has optional filtering capability:

no restrictions as to who steals task
restricted to within a specified cache or NUMA level distance from the owning thread
restricted to specific thread or pre-defined list of threads (specified at enqueue time)

Due to filtering, dequeue order can vary from that of an unfiltered dequeue.

Also, task stealing takes priority of examining queues by order of closeness of cache/NUMA level.
IOW thread performs

check its own queue
check queues of threads sharing L1 (e.g. HT sibling)
check queues of threads sharing L2
... L3
... same NUMA node
... one NUMA hop
... two NUMA hops
... three NUMA hops

(above with filtering restrictions)

Task stealing, when it occurs (and it does occur), steals not only by way of FIFO/LIFO but then subsequently by cache-level proximity (closest chosen first). Although this sounds like a lot of work, I spent a lot of time getting this optimized.

This provides a rather large (complex) choice of enqueuing/dequeuing flexibility. If you do not need it, do not use the features. When you do need it, it is there.

BTW, my parallel_pipeline takes advantage of several of these features; it can really fly.

Jim Dempsey

www.quickthreadprogramming.com
