Inner parallelization calls outer

According to the documentation, the task scheduler should basically go DFS in a thread's own task pool but steal from other tasks near the root. To my understanding, this implies that when you have two nested parallel loops, a thread that arrives at the inner loop should never decide to postpone processing of the inner loop or continuing after the inner loop but decide to start a new task from the outer loop.

However, this is what we observed after several days of bug-chasing in our program, in some rare and non-reproducible cases.

I was able to write a minimal example that demonstrates this behaviour more reliably. On the two computers where we tested the example, the program below eventually prints "inner parallelization called outer". When you set a breakpoint on that line and analyze the callstack, you see that indeed the thread is in a task from the outer loop, had arrived at the inner loop and started a task from the outer loop.

We suspect that this behaviour might be a bug in TBB. We are using tbb40_20120613 on Windows 7. I am running 8 threads (4 physical cores with hyperthreading).

#include <iostream>
#include <tbb/parallel_for.h>

volatile bool first_task;

__declspec(thread) bool thread_is_running_inner_parallelization = false;

int g_i;

class F
{
public:
    void operator() (int p_task_id) const;
};

void F::operator() (int p_task_id) const
{
    // first task has inner parallelization
    if (first_task)
    {
        first_task = false;
        int sum = 0;
        thread_is_running_inner_parallelization = true;
        for (int i=0; i<10000; ++i)
        {
            g_i = i;
            tbb::parallel_for (0, 2, [&](int k)
            {
                // we came from another thread to help with this task
                // (which is OK but rarely happens and the bug always occurs shortly after this)
                if (!thread_is_running_inner_parallelization)
                    std::cout << "stolen " << i << "/" << k << " ";
                // task must do some work, otherwise the problem will not occur
                for (int j=0; j<100000; ++j)
                    sum += (i*j)%7;
            });
        }
        thread_is_running_inner_parallelization = false;
        // print the result of the computation to avoid optimizing the work away
        // and to see when first task is finished
        std::cout << sum;
    }
    // other tasks basically do nothing
    else
    {
        // print heartbeat
        if (p_task_id%1000000==0)
            std::cerr << ".";
        // here is the bug
        if (thread_is_running_inner_parallelization)
            std::cout << "inner parallelization called outer " << g_i << " " << std::endl;
    }
}

int main ()
{
    // set global variable
    first_task = true;
    // create very many tasks on the outer parallelization,
    // to make sure we won't run out of them early.
    tbb::parallel_for (0, 2000000000, F());
}
(EDIT: Syntax highlighting improved.)

Raf Schietekat:

"According to the documentation, the task scheduler should basically go
DFS in a thread's own task pool but steal from other tasks near the
root."
Tasks are stolen, not being stolen from.

"To my understanding, this implies that when you have two nested
parallel loops, a thread that arrives at the inner loop should never
decide to postpone processing of the inner loop or continuing after the
inner loop but decide to start a new task from the outer loop."
That's the tendency, but it stopped being an implicit guarantee around 2.1 or so (I found "20090313 open-source release" in CHANGES.txt, and 2.1 is right above that). I don't recall whether it was ever an explicit guarantee.

"However this is what we observed after several days of bug-chasing in our program in some rare and non-reproducable cases.

I
was able to write a minimal example that demonstrates this behaviour
more reliably. On the two computers where we tested the example, the
program below eventually prints "inner parallelization called outer".
When you set a breakpoint on that line and analyze the callstack, you
see that indeed the thread is in a task from the outer loop, had
arrived at the inner loop and started a task from the outer loop.

We
suspect that this behaviour might be a bug in TBB. We are using
tbb40_20120613 on Windows 7. I am running 8 threads (4 physical cores
with hyperthreading)."

It is not a bug, that's exactly what you should expect because of stealing from another thread's depth-agnostic deque-implemented pool. What high-level problem do you want to solve?

Tasks are stolen, not being stolen from.

Typo. I wanted to write that a thread steals from other threads near the root.

It is not a bug, that's exactly what you should expect because of
stealing from another thread's depth-agnostic deque-implemented pool.

Can you explain what happens here and why that is OK?

I understand that it's OK that tasks get stolen from the thread running first_task (let's call it first_thread). Then it can happen that first_thread is idle, but one task from the inner for-loop is still running (in another thread, which I'll call slave).

I do not understand why first_thread is stealing from other threads. I would find it slightly better (but still poor) if first_thread just sat idle and waited for slave to return. A really good solution would steal, but only tasks that are direct or indirect children of the node where first_thread is waiting. (It may also be possible to express where to steal in terms of threads.) In any case, I want first_thread to be ready when slave finishes, so first_thread can continue first_task. The stealing policy above guarantees that. When first_thread has stolen a task from the outer loop, it is most likely not ready in time.

What high-level problem do you want to solve?

Disregarding our high-level problem, we think that no more outer-loop tasks should run simultaneously than there are threads. Current TBB behaviour results in abandoning an outer-loop task when it arrives at an inner parallelization and starting a new outer-loop task. This leads to problems such as unnecessary memory consumption or later average availability of results from the tasks (e.g. to display to the user).

Sure, when the outer parallelization uses all threads, the inner is useless. But removing the inner parallelization is not an option because there are fewer outer tasks on other instances or more CPUs on other machines.

For our actual high-level problem we know the solution, but we are not completely satisfied with it. We have the inner parallelization in a lazy initialization. TBB abandons constructing the object when it arrives at the inner parallelization. We had a mutex implementation, so the new outer task waited for the initialization to complete (deadlock). This may be a case of holding the mutex too long. But starting the lazy initialization again (as in your lazy pattern) isn't good either. One might need a lot of attempts before one initialization completes!

The solution we want to implement is to initialize the objects before the outer parallelization.

Raf Schietekat:

"Can you explain what happens here and why that is OK?"
If all chunks of an inner loop are being executed, the first thread that finishes will have nothing to do but to steal from another thread, and it will tend to grab a task related to the outer loop.

"The solution we want to implement is to initialize the objects before the outer parallelization."
You should probably express the shared dependencies on lower-level initialisations at the TBB level, by maintaining a shared collection of followers in the object being initialised, and releasing each of them when initialisation is complete. A client of such an object would then set a reference count, register itself with the object, and wait for it to finish.
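
A minimal, untested sketch of the registration mechanics I mean, with hypothetical names (Resource, register_waiter, release_waiters) and with explicit memory semantics left out:

#include <vector>
#include <tbb/task.h>
#include <tbb/spin_mutex.h>

class Resource
{
private:
    tbb::spin_mutex d_waiters_mutex;
    std::vector<tbb::task*> d_waiters;   // the "followers" registered so far
    bool d_ready;
public:
    Resource () : d_ready(false) {}

    // Called by a client that finds the resource not yet initialised.
    // Returns false if the resource turned out to be ready after all.
    bool register_waiter (tbb::task& w)
    {
        tbb::spin_mutex::scoped_lock lock(d_waiters_mutex);
        if (d_ready) return false;
        d_waiters.push_back(&w);
        return true;
    }

    // Called exactly once by whoever performed the initialisation.
    void release_waiters ()
    {
        tbb::spin_mutex::scoped_lock lock(d_waiters_mutex);
        d_ready = true;
        for (size_t i = 0; i < d_waiters.size(); ++i)
            d_waiters[i]->decrement_ref_count();   // lets that waiter's wait_for_all() return
        d_waiters.clear();
    }
};

// Client side: set a reference count, register with the object, and wait.
void wait_for (Resource& r)
{
    tbb::empty_task& dummy = *new(tbb::task::allocate_root()) tbb::empty_task;
    dummy.set_ref_count(2);                // 1 for the wait itself, 1 owned by the resource
    if (r.register_waiter(dummy))
        dummy.wait_for_all();              // returns once release_waiters() decrements the count
    else
        dummy.decrement_ref_count();       // already initialised, balance the count ourselves
    tbb::task::destroy(dummy);
}

The thread that performs the initialisation would call release_waiters() as its last step; the rest is just bookkeeping of who is waiting.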

If all chunks of an inner loop are being executed, the first thread that
finishes will have nothing to do but to steal from another thread, and
it will tend to grab a task related to the outer loop.

Well, we agree that this happens. We disagree on whether this is desirable behaviour when the first thread that finishes is the thread that spawned the inner tasks.

I don't get your suggestion how to solve the problem with lazy initialization. Are you talking of something like a future class? How does the solution make sure that the thread that does the initialization is not blocked waiting for the initialization (after it has dropped out of the initialization due to inner parallelization)? Even if it polls for work while waiting, it has an incomplete outer task in its callstack and therefore cannot complete the initialization.

Raf Schietekat:

"I don't get your suggestion how to solve the problem with lazy
initialization."
I was assuming that you want to apply lazy initialisation from the first thread that uses a resource through a shared handle.

"Are you talking of something like a future class?"
A non-TBB future would block a thread and cause undersubscription, so this would be a way to expose the future to the scheduler.

"How
does the solution make sure that the thread that does the initialization
is not blocked waiting for the initialization (after it has dropped out of
the initialization due to inner parallelization)?"
That brings me to the limitations of this solution. There have been discussions about futures and agents etc. earlier, and they would seem to cause issues with the scheduler, but I don't recall all the details. I don't see a problem if the "TBB future" (to coin a name for the pattern) is a nonblocking leaf task (no dependencies of its own) that would then either become merged with the first task visiting it (to avoid it getting lost in the deque somehow) or become a dependency of any other task visiting it while it is already being processed. While this is not TBB's normal bread and butter (only tree-shaped graphs of tasks are directly supported in the syntax), the scheduler is fully capable of handling directed acyclic graphs of tasks, and this would ensure that topology. Does this apply to your high-level goal? Let's see... thread not first at the scene will be "blocked", sure, and able to steal an outer-loop task in the meantime, but the first thread that encountered the resource uninitialised would already be busy executing it and so nobody would be blocked forever waiting for the initialisation to finish. I don't see what "after it has dropped out of the initialization due to inner parallelization" could possibly mean?

"Even if it polls work
while waiting, it has an incomplete outer task in its callstack. And
therefore cannot complete the initialization."
Maybe there would be more stealing in this scenario (wild guess), which is not the optimal thing for performance to say the least, but it may well still be better than if everybody just blocked on mutexes (a subject for study, or do I just need a coffee at this point?). But I don't see why the initialisation would not be able to complete?

I do concede that working with dags is somewhat of a drag in TBB, but you could probably encapsulate that.

Does that come close?

A non-TBB future would block a thread and cause undersubscription, so this would be a way to expose the future to the scheduler.

OK, so a TBB future is better. Not blocking a thread is nice to have. However, the essence of the problem is not maximizing the number of working threads, but getting the solution to work at all. The pros and cons of blocking the threads in mutexes, non-TBB futures or TBB futures are not what bothers me. The problem is that any of these solutions deadlocks if there is a parallelization in the initialization.

I don't see a problem if the "TBB future" [...] is a nonblocking leaf task

Everything is fine if the TBB future doesn't spawn new tasks.

thread not first at the scene will be "blocked", sure, and able to steal
an outer-loop task in the meantime, but the first thread that
encountered the resource uninitialised would already be busy executing
it

When a "thread not first at the scene" (call it N) is "blocked", it may also steal a task spawned by the first thread (call it F). The stolen task contributes to building the ressource. E.g. the stolen task may be one index in an inner parallel_for. F is currently executing another index of the same parallel_for. Now F happens to finish before N. There are no more indices left. So F steals an outer-loop task (that's what I called dropping out - that's what causes the problem and has to be avoided) and gets "blocked" waiting for the ressource. N finishes, but F still cannot continue with the initalization, because it is "busy" with the outer loop task. Deadlock. Even if F is still polling work while waiting: It cannot resume the initialization task, it can only start new tasks.

To sum up, any lazy initialization only works when there is no parallelization in the initialization. :-(

I don't see what "after it has dropped out of the initialization due to inner parallelization" could possibly mean?

To visualize that, put a breakpoint on the line "inner parallelization called outer" and look at the callstack. Imagine that the "if (first_task)" scope is the initialization and sum is the resource. The only difference between lazy initialization and my example code is that in my example the other threads are not waiting for the computation of sum to complete.

Raf Schietekat:

"The problem is that any these solution deadlocks if there is a parallelization in the initialization."
Did I miss that? Even this should not be a deal breaker if you can at least guarantee that the initialisation-task graph is entirely separate from the rest of the program, has no potential cycles, and can therefore be composed with the rest of the program to a bigger graph that is at all times acyclic even if not always connected (as far as TBB is concerned). My initial proposal was simpler, but based on stricter assumptions for higher convincing power. :-)

"The stolen task contributes to building the ressource."
Please explain, because this could imply breaking the assumption of a dag.

"N finishes, but F still cannot continue with the initalization, because it is "busy" with the outer loop task. Deadlock."
F doesn't continue with the initialisation, it just needs to get the go-ahead to return from the wait, which it will eventually do as its other work winds down. It might happen that there is nothing to be stolen at some point even though another way of scheduling would allow an idle thread to resume from a wait now buried somewhere on its stack, but this is not a permanent condition, and how bad this is for performance is a matter for experimentation (unless you can do this in cerebro), not pessimism.

"To sum up, any lazy initialization only works when there is no parallelization in the initialization. :-("
See above.

"The stolen task contributes to building the ressource."
Please explain, because this could imply breaking the assumption of a dag.

Instead of repeating myself I'll post code that demonstrates the deadlock. I hope that this answers your questions. The initialization-task graph is entirely separate from the rest of the program. Our overall task graph is a dag.

Please ignore the variable first_thread in the example. It is not in the production code that we are debugging. The purpose of this variable is to make sure that only the first thread depends on the initialization while tasks processed by other threads finish quickly. Thus it maximizes the probability that the other threads steal from the first thread (which does the initialization). So the variable first_thread is a device that dramatically increases the probability that the bug occurs. In principle, the bug can occur without this variable (and actually did in some runs of our production code).

With mutex_type=tbb the code runs into an assertion when the first thread has dropped out of the initialization and wants to grab the mutex a second time. With mutex_type=boost, it just deadlocks because the first thread waits for the mutex which it already holds.

Note that the problem cannot be solved by replacing the mutex with some code that makes threads poll for other work when they find the resource creation in progress. If you state something else, could you please post code to prove that?

#include <iostream>
#include <boost/thread/mutex.hpp>
//#include <tbb/mutex.h>            // when using the tbb mutex instead
#include <tbb/parallel_for.h>

//namespace mutex_type = tbb;
namespace mutex_type = boost;

__declspec(thread) bool first_thread = false;
mutex_type::mutex d_mutex;

class Lazy
{
private:
    int* d_ressource;
    Lazy (const Lazy&);
    void operator= (const Lazy&);
public:
    Lazy ();
    int& operator* ();
};

Lazy::Lazy ()
    : d_ressource(nullptr)
{
}

int& Lazy::operator* ()
{
    if (d_ressource)
        return *d_ressource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_ressource)
        return *d_ressource;
    int sum = 0;
    for (int i=0; i<10000; ++i)
    {
        tbb::parallel_for (0, 2, [&](int k)
        {
            if (!first_thread)
                std::cout << "stolen" << std::endl;
            for (int j=0; j<100000; ++j)
                sum += (i*j)%7;
        });
    }
    d_ressource = new int (sum);
    return *d_ressource;
}

class F
{
private:
    Lazy* d_lazy;
public:
    F (Lazy*);
    void operator() (int) const;
};

F::F (Lazy* p_lazy)
    : d_lazy(p_lazy)
{
}

void F::operator() (int) const
{
    if (first_thread)
        std::cout << **d_lazy << std::endl;
}

int main ()
{
    first_thread = true;
    Lazy ressource;
    tbb::parallel_for (0, 2000000000, F(&ressource));
}

Raf Schietekat:

"Instead of repeating myself I'll post code that demonstrates the
deadlock. I hope that this answers your questions. The
initialization-task graph is entirely separate from the rest of the
program. Our overall task graph is a dag."
BTW, be careful with that double-checked locking: you should use an atomic there.

"Note that the problem cannot
be solved by replacing the mutex with some code that makes threads poll
for other work when they find the resource creation in progress. If you
state something else, could you please post code to prove that?"
Sure it can, even if I don't have the time right now to write that code for you. The first task to visit a resource should attach the resource's initialisation task to the TBB graph by creating it as its child, and any subsequent task should add itself as an additional parent c/o an explicit collection in the resource or its handle (you'll have to be careful how to do that in a thread-safe manner).

BTW, be careful with that double-checked locking: you should use an atomic there.

Yes, I've heard that. But so far the lock never caused problems. If you replace the lock by an atomic, the code would not deadlock, but instead start creation of the resource again and again.

Sure it can, even if I don't have the time right now to write that code
for you. The first task to visit a resource should attach the resource's
initialisation task to the TBB graph by creating it as its child, and
any subsequent task should add itself as an additional parent c/o an
explicit collection in the resource or its handle (you'll have to be
careful how to do that in a thread-safe manner).

This would be interesting for us. However I don't understand how to implement that in a way that makes an essential difference from my code. This is why I asked for code. As you don't have the time to write it, I have some questions about how to do that myself.

The first task creates the initialization task and then does spawn_and_wait_for_all on it?
Other tasks attach themselves and then do wait_for_all?

This way, the threads do not just wait for the mutex, but look for other tasks. Good.

But what's the essential point that prevents the deadlock? Which of the following steps does not happen?
- The initialization task (run by thread F) still spawns subtasks in the inner parallelization.
- Some of these subtasks might still be stolen by another thread N.
- F might still finish its own subtask before N.
- F then polls for work.
- F might start another outer task.
- The outer task attaches itself to wait for the initialization.
- Now F polls for work again. It may or may not find work. But even if N finished in the meantime and F does not find work, F still has the outer task on top of its callstack. Therefore F cannot continue with the initialization. N cannot continue with the initialization either because it does not have the beginning of the initialization in its callstack.

Raf Schietekat:

"Yes, I've heard that. But so far the lock never caused problems. If you
replace the lock by an atomic, the code would not deadlock, but instead
start creation of the resource again and again."
For portability (and presentability), even with the mutex (that's not the issue), d_ressource should really be redeclared as atomic and used with acquire or release semantics.

I'll have a look at the rest later.

Raf Schietekat:

Thank you for an interesting question, and sorry for just taking a swing at it before. I try to keep a good batting average here, but occasionally I do strike out. (If I should be messing up even my metaphors, please note that I've never even seen a game on TV.)

I've been a sceptic about depth-agnostic deque-based scheduling myself at some point. It seemed vulnerable to stack overruns, against which the countermeasure is not to steal past a certain threshold (traditional sequential stack overruns are still a hazard of course). This implementation is capable of scheduling dags of tasks, but that does not apply here, since the parents of the shared resource-initialisation task are already being executed. That was one part of my mistake. The other part was that, while recognising futures as cyclic dependency-creating devices, I thought that if only this was being avoided by policy you would get the above-mentioned safe dag and things would just work.

On the other hand, even considering task depth, it would still be possible for F to start initialising the resource, go out stealing a lower-level task somewhere else (let's say that levels grow downward), and still find itself needing the same resource, similarly resulting in deadlock. This may be something that was not considered in the original analysis around outer and inner loops, so hopefully it is a helpful new insight here.

Could the solution be to apply continuations? That would seem to be a necessary part of getting around frames being buried inside a stack, but how much of the implementation would then need to be in terms of continuations? At first sight it would need to be all of the resource-initialising code, but then what do you do when you would like to use a parallel_for, which is only offered as a blocking call?

Declaring the resource-initialising thread unauthorised for stealing (using API to be added for this purpose), much as when its stack level is past the above-mentioned threshold, is probably still insufficient if its own tasks may be stolen by threads that are not dedicated to the effort by some arena-related mechanism. And of course there's the recurring idea of using fibers between acts of theft so that no frame is ever artificially buried on the stack because of the scheduler alone.

I hope I have rectified some of my errors in this posting, but unfortunately I was only able to come up with largely hypothetical solutions. If anybody can improve on those, please do chime in at this point!

Thank you for an interesting question, and sorry for just taking a swing
at it before. I try to keep a good batting average here, but
occasionally I do strike out.

Thank you for admitting that honestly.

Could the solution be to apply continuations?

Rewriting the initialization in terms of continuations is not an option. In our real application, the initialization is the full-time job of one of my colleagues, who has been working on it for several years. We cannot refactor all of this.

Declaring the resource-initialising thread unauthorised for stealing [...] is probably still insufficient if
its own tasks may be stolen by threads that are not dedicated to the
effort by some arena-related mechanism.

Indeed. Moreover, it would be nice if the resource-initializing thread stole back stolen tasks. Look at the implementation of parallel_for. Say the range is 0..7. F starts to work at 0 and pushes 4..7, 2..3, 1 into its deque (or similar, I have not studied the TBB implementation that intensively). Now 4..7 is stolen by N. N works at 4, pushing 6..7, 5 into N's deque. Suppose 2..3 and 1 compute quickly or are stolen by other threads. Then after finishing with 0, F should steal back 6..7 or 5. Also, if subtasks of N are stolen by N' and then N steals an outer task, we again have a deadlock.

And of course there's the recurring idea of using fibers between acts of
theft so that no frame is ever artificially buried on the stack because
of the scheduler alone.

That's beyond my current understanding of TBB.

Thank you for your replies.

Raf Schietekat:

"That's beyond my current understanding of TBB."
It's not something currently in TBB, but it has been mentioned before on this forum.

Maybe the upcoming direct control of arenas might support a solution here, with the following twist: each additional thread arriving at a resource and needing to wait for it to be initialised could temporarily authorise the scheduler to add a new thread to the arena (from a pool of course), after the first thread had set it up and either created a new thread or reused itself; when the work on the resource is done, the arena is destroyed, the authorisations forgotten, and the original threads would be able to continue. Not all resources would be lucky enough to have multiple worker threads in their respective arenas, but perhaps there's a good chance that multiple interest is met with commensurate response, so to say.

(Added 2012-07-14) Oh yes, the other threads arriving at the initialisation code could perhaps set affinity for the threads to be admitted to the arena, or, why not, reuse themselves much like the first thread, as long as they are similarly prevented from wandering out of the arena until they are finished with its work.

(Fixed typo 2012-07-16)

each additional thread arriving at a resource and needing to wait for it
to be initialised could temporarily authorise the scheduler to add a new
thread to the arena (from a pool of course), after the first thread had
set it up and either created a new thread or reused itself

I don't understand that. Maybe due to the fact that I don't know what precisely an arena is.

If you consider a solution with new threads: It's too late for thread creation when an additional thread arrives at the resource. When F arrives at the resource for the second time, you are already in a losing position. It doesn't help to create a new thread then.

Basically, there are 2 approaches IMHO:

Either make sure that threads waiting for stolen child tasks to complete only steal tasks that are directly or indirectly spawned by those tasks. I implemented that solution some time ago in chess-playing software (without TBB, just Windows threads or pthreads respectively). It worked fine because every node in a game tree has potential for parallelism, so a thread can easily find tasks it is allowed to steal, even when restricted in where to steal. It might perform poorly (leaving threads sitting idle because they don't find tasks they may legally steal) if there is less parallelism in the program. This way, tasks are completed as soon as possible.

A considerable (but untested) alternative is to have more threads. You have three thread pools. Any thread is in exactly one of them.
- the usual pool P, limited in size by the constructor parameter of task scheduler (which usually is one thread per CPU core)
- a pool of "clean" threads C. C is initially empty. Threads are put there rather than destroyed when they are done with all of their work. This way they can be reused instead of creating new threads. By taking a thread from C in case C is empty, I mean creating a new thread.
- a heap of blocked threads B

Scheduling:
- When a thread finds that it cannot continue with its current task (either waiting for a resource or waiting for stolen tasks to complete), it does not steal, but goes to B. The other threads in B are checked to see whether they can continue now. If so, one of them returns to P. Otherwise, a thread from C goes to P and starts stealing.
- When a thread t completes its outermost task, it checks whether one of the threads in B can continue. If so, that thread goes to P and t goes to C. Otherwise t remains in P.

This way, you usually have more threads than CPU cores, but P is limited to the constructor parameter, so no more threads than that are ever working simultaneously. Tasks do not strictly complete as soon as possible (once in B, a thread can resume only if another thread finishes or gets blocked), but at least there are no deadlocks any more. Also, there might be better CPU usage than when restricting where to steal. Memory consumption for the stacks of the threads might be a problem.
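
To make the bookkeeping concrete, here is a rough sketch of just the pool transitions (hypothetical names, no TBB, and without the synchronization or the actual stealing/execution machinery):

#include <deque>
#include <set>

struct WorkerThread
{
    bool blocked_condition_resolved;    // set when the event this thread waits for has occurred
    WorkerThread () : blocked_condition_resolved(false) {}
    bool can_continue () const { return blocked_condition_resolved; }
};

struct ThreePoolScheduler
{
    std::set<WorkerThread*>   P;   // running threads, limited to the usual thread count
    std::deque<WorkerThread*> C;   // idle "clean" threads, reused instead of creating new ones
    std::set<WorkerThread*>   B;   // threads blocked on a resource or on stolen child tasks

    WorkerThread* take_clean ()
    {
        if (C.empty()) return new WorkerThread;          // create a thread only if C is empty
        WorkerThread* t = C.front(); C.pop_front(); return t;
    }

    // A thread in P finds that it cannot continue with its current task.
    void on_block (WorkerThread* t)
    {
        P.erase(t); B.insert(t);
        for (std::set<WorkerThread*>::iterator it = B.begin(); it != B.end(); ++it)
            if ((*it)->can_continue()) { P.insert(*it); B.erase(it); return; }
        P.insert(take_clean());                          // keep P at full strength
    }

    // A thread in P completes its outermost task.
    void on_outermost_done (WorkerThread* t)
    {
        for (std::set<WorkerThread*>::iterator it = B.begin(); it != B.end(); ++it)
            if ((*it)->can_continue())
            {
                P.insert(*it); B.erase(it);              // the unblocked thread takes t's slot
                P.erase(t); C.push_back(t);
                return;
            }
        // otherwise t remains in P and goes stealing
    }
};

The hard part is of course not this bookkeeping but performing these transitions safely from inside a scheduler.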

Raf Schietekat:

"I don't understand that. Maybe due the fact that I don't know what precisely an arena is."
So far it's only an implementation construct (see "TBB 3.0 task scheduler improves composability of TBB based solutions. Part 1."), but they will soon be exposed in the API.

"If you consider a solution with new threads: It's too late for thread
creation when an additional thread arrives at the resource. When F
arrives at the resource for the second time, you are already in a
losing position. It doesn't help to create a new thread then."
I dare hope that my blindness was only intermittent... In what I just proposed, F (or its stand-in) would be constrained to working on initialising the resource. Use of the arena would implement roughly what you're also proposing, so I hope that it will come with the necessary buttons and levers for such a purpose (or I hereby propose that they will). If the other threads arriving at the resource were to reuse themselves, there would also be much less of a penalty than in case of a context switch, and there would still be emphasis on finishing work on the arena/initialisation as quickly as possible, so perhaps that's even better than merely preventing deadlocks. If anything, there might be too many threads in the arena by default if there is not enough parallel slack at that level, but probably not for too long, so it might not be worthwhile to entangle everything for fear of the possibility of a little bit of idle time before at least gaining some experience with this solution.

"Memory consuption for the stacks of the threads might be a problem."
How, exactly?

In what I just proposed, F (or its stand-in) would be constrained to working on initialising the resource.

OK, that would solve the problem. You would have to impose that constraint on F as soon as F starts the initialization. I was confused because you didn't write that explicitly. You only wrote what would happen for "each additional thread arriving at a resource".

So do I understand you correctly?
1. When a thread (F) starts to initialize the resource, it creates a new arena and stays in the new arena until the initialization is complete.
2. The other threads stay in the main arena. Only if one of them also waits for the resource does it enter the new arena.
("each additional thread arriving at a resource and needing to wait for it
to be initialised could temporarily authorise the scheduler to add a new
thread to the arena"
)

@2: Why do you think you should restrict the number of threads that enter the new arena? Why not allow any thread that polls for work to enter the new arena (especially in case there are no tasks in the main arena left)? In any case, a more or less static assignment of threads to arenas may limit the performance.

I don't know what methods/features are provided by arenas and don't have the time to dig into the internals of TBB that far. When a thread hasn't found tasks in new arena and it doesn't have partially completed tasks from the arena in the callstack, the thread can safely leave the arena. When you enhance your solution by that, that might solve the problem. Especially nice that it avoids context switches.

"Memory consuption for the stacks of the threads might be a problem."
How, exactly?

I suggested to have more threads. To my understanding, each of them needs its own stack, which is reserved even if not currently used by the thread. So more threads need more memory.

see "TBB 3.0 task scheduler improves composability of TBB based solutions. Part 1."

Thank you for that link. In the second part, I find this code:

task& r1 = *new( task::allocate_root() ) empty_task;
r1.set_ref_count(2);
task& t1 = *new( r1.allocate_child() ) BackgroundTask;
task::spawn(t1);

task& r2 = *new( task::allocate_root() ) empty_task;
r2.set_ref_count(2);
task& t2 = *new( r2.allocate_child() ) ForegroundTask;
task::spawn(t2);

r2.wait_for_all();
task::destroy(r2);

r1.wait_for_all();
task::destroy(r1);
I wonder if the problem that I reported can occur there in a similar way:
Let F work at ForegroundTask.
If a thread N happens to steal a child of ForegroundTask, F might have to wait for N and steal a child of BackgroundTask in the meantime.
With F being temporarily trapped in BackgroundTask, ForegroundTask cannot complete.
F will eventually finish, so this isn't a deadlock. But still ForegroundTask (e.g. GUI) will not respond for some time.

Raf Schietekat:

"So do I understand you correctly?"
Yes.

"@2: Why do you think you should restrict the number of threads that
enter the new arena? Why not allow any thread that polls for work to enter
the new arena (especially in case there are no tasks in the main arena
left)? In any case, a more or less static assignment of threads to
arenas may limit the performance."
The main thing is that the existing thread reuses itself (my favourite) or taps a thread from a pool to step in (back-up solution, perhaps also based on current stack level), to avoid oversubscription. I'm not entirely sure that the arena would always be able to absorb even every waiting thread, e.g., when many threads need a shared resource with not much inherent parallelism, let alone additional ones, but that could also just be a matter of arena configuration (inherit the count from the arena of the thread that created the arena, or use the maximum from the arenas of the other threads, unless explicitly overridden?). Something like that?

"I don't know what methods/features are provided by areneas and don't
have the time to dig into the internals of TBB that far."
I think we're building a wish list for the new API here...

"When a thread
hasn't found tasks in new arena and it doesn't have partially completed
tasks from the arena in the callstack, the thread can safely leave the
arena. When you enhance your solution by that, that might solve the
problem."
That might be a good generalisation, even though it wouldn't apply here (waiting for resource initialisation to finish).

"I suggested to have more threads. To my understanding, each of them
needs its own stack, which is reserved even if not currently used by the
thread. So more threads need more memory."
OK, I thought you meant memory consumption per stack.

Well, now would probably be a good moment for somebody from the TBB team to let us know what they think about this, whether it was already part of the plan or whether it will now be added?

Well, now would probably be a good moment for somebody from the TBB team
to let us know what they think about this, whether it was already part
of the plan or whether it will now be added?

Indeed, I agree with that.

In the meantime, I created an example that does not depend on lucky or unlucky timing. I do all timing myself with a Timer class. The code requires some obvious modifications to run without our internal framework. In particular, you will need a Timer class.

(EDIT: atomic added, base class introduced)

// Timer and Platform are classes from our internal framework (not shown here).
#include <iostream>
#include <iomanip>
#include <typeinfo>
#include <boost/thread/mutex.hpp>
#include <boost/test/unit_test.hpp>
#include <tbb/atomic.h>
#include <tbb/parallel_invoke.h>
#include <tbb/task_scheduler_init.h>

//namespace smp = Concurrency;
namespace smp = tbb;

//namespace mutex_type = tbb;
namespace mutex_type = boost;

Timer g_timer;
mutex_type::mutex g_mutex;
int g_count = 0;

class Task
{
private:
    virtual void run () const = 0;
public:
    void operator() () const
    {
        {
            mutex_type::mutex::scoped_lock lock(g_mutex);
            static const char* name[] = {"F","N","M"};
            if (Platform::thread_name().empty() && g_count<3)
            {
                Platform::set_thread_name (name[g_count]);
                ++g_count;
            }
            std::cout << std::left;
            std::cout << "thread " << std::setw(10) << Platform::thread_id_and_name() << ":"
                      << std::setw(13) << typeid(*this).name()
                      << " started  at " << g_timer.user_time() << std::endl;
        }
        run ();
        {
            mutex_type::mutex::scoped_lock lock(g_mutex);
            std::cout << "thread " << std::setw(10) << Platform::thread_id_and_name() << ":"
                      << std::setw(13) << typeid(*this).name()
                      << " finished at " << g_timer.user_time() << std::endl;
        }
    }
};

class Task11 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<4);
    }
};

class Task12 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<5);
    }
};

class Lazy
{
private:
    mutex_type::mutex d_mutex;
    tbb::atomic<int*> d_resource;
    Lazy (const Lazy&);
    void operator= (const Lazy&);
public:
    Lazy ();
    int& operator* ();
};

Lazy::Lazy ()
    : d_resource()
{
}

int& Lazy::operator* ()
{
    if (d_resource)
        return *d_resource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_resource)
        return *d_resource;
    while (g_timer.user_time()<1);
    smp::parallel_invoke (Task11(), Task12());
    d_resource = new int (42);
    return *d_resource;
}

Lazy g_lazy;

class Task1 : public Task
{
public:
    void run () const
    {
        std::cout << *g_lazy << std::endl;
    }
};

class Task21 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<2);
    }
};

class Task221 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<6);
    }
};

class Task222 : public Task
{
public:
    void run () const
    {
        std::cout << *g_lazy << std::endl;
    }
};

class Task22 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<3);
        smp::parallel_invoke (Task221(), Task222());
    }
};

class Task2 : public Task
{
public:
    void run () const
    {
        smp::parallel_invoke (Task21(), Task22());
    }
};

BOOST_AUTO_TEST_SUITE( TbbT );

BOOST_AUTO_TEST_CASE( inner_parallelization_calls_outer )
{
    tbb::task_scheduler_init scheduler_init(3);
    g_timer.start();
    smp::parallel_invoke (Task1(), Task2());
}

BOOST_AUTO_TEST_SUITE_END();

Expected behaviour:
(EDIT: We don't know the exact output; there are degrees of freedom.)
Program should print 42 twice (plus some debug output) and terminate normally after 6 seconds.

Actual behaviour:
(EDIT: Thread names added)
Program deadlocks after printing:

thread [6008] F  :class Task1   started  at 0
thread [4956] N  :class Task2   started  at 0
thread [4956] N  :class Task21  started  at 0
thread [4300] M  :class Task22  started  at 0
thread [6008] F  :class Task11  started  at 1.01401
thread [4956] N  :class Task21  finished at 2.04361
thread [4956] N  :class Task12  started  at 2.04361
thread [4300] M  :class Task221 started  at 3.01082
thread [6008] F  :class Task11  finished at 4.00923
thread [6008] F  :class Task222 started  at 4.00923
thread [4956] N  :class Task12  finished at 5.00763
thread [4300] M  :class Task221 finished at 6.02164

Raf Schietekat:

Nice to already have a reproducer.

Please do make the resource (single s) pointer an atomic. BTW, what does d_ mean?

Please do make the resource (single s) pointer an atomic

Done (in original post)

BTW, what does d_ mean?

We introduced prefixes for variables more than a decade ago and stick to the letters we chose in those days. As m_ was already taken by mutable, we use d_ for member variables, following some outdated book on software design.

Raf Schietekat:

"Done (in original post)"
Perhaps it's mostly pedantic (especially in this simple situation), but the intent of the atomic may be made clearer (through explicit memory semantics), and some machine cycles saved, by using the following code instead (not tested):

int& Lazy::operator* ()
{
    if (int* l_resource = d_resource.load())
        return *l_resource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (int* l_resource = d_resource.load())
        return *l_resource;
    while (g_timer.user_time()<1);
    smp::parallel_invoke (Task11(), Task12());
    int* l_resource = new int (42);
    d_resource.store(l_resource);
    return *l_resource;
}

Raf Schietekat:

#19 "Well, now would probably be a good moment for somebody from the TBB team
to let us know what they think about this, whether it was already part
of the plan or whether it will now be added?"
Are there any thoughts about this yet?

Are there any thoughts about this yet?

As far as we are concerned: we have replaced the lazy initialization with an early initialization. This is quite nasty because the logic that decides whether the resource is needed is quite complicated. We now have to keep the conditions for resource usage in various modules in sync with the conditions for resource initialization. Moreover, we decided to simplify the logic when doing the initialization: we check necessary conditions for initialization, which means that we occasionally do the initialization even though it is not needed.

Because of this problem, we have put replacing TBB with Microsoft PPL on our mid-term to-do list. We have already checked that PPL does not have the same problem. However, the replacement would affect the whole project and would require some evaluation of PPL beforehand.

Anton Malakhov (Intel):

Ok, after a few attempts to understand what's going on in this forum thread, I think I can answer something relevant. As far as I understand, you try to call TBB scheduling (parallel_invoke) while holding a lock which can be requested on the outer level as well. Didn't we warn against such a design anti-pattern? So, the bad news is that we don't directly support it yet. The good news is that we understand that sometimes, for complex modular programs, there is no other way but to support it. Raf is right saying that the new CPF feature task_arena coming in the next TBB release will help to isolate the work under the lock so that *no* external tasks can be stolen. Actually, we already have arenas which are separated between and tied to user threads, as described in the mentioned blog. Thus, we can probably claim we already support this anti-pattern via an ugly workaround :) involving an additional user thread.
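
Roughly, the usage could look like this (an untested sketch, reusing the Lazy members from comment #20; the exact task_arena interface may still change before the release):

#include <tbb/task_arena.h>      // community preview header; may require TBB_PREVIEW_TASK_ARENA

int& Lazy::operator* ()
{
    if (d_resource)
        return *d_resource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_resource)
        return *d_resource;
    tbb::task_arena init_arena;  // a separate arena just for the work done under the lock
    init_arena.execute([]
    {
        // while waiting inside this arena, the thread can only pick up tasks
        // spawned here, never outer tasks that might request the same lock
        while (g_timer.user_time()<1);
        smp::parallel_invoke (Task11(), Task12());
    });
    d_resource = new int (42);
    return *d_resource;
}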

Raf Schietekat:

"Thus, we can probably claim we already support this anti-pattern via ugly work-araund :)involving additional user thread"
Any prospects for task_arena to include the necessary details so that the user doesn't have to deal with thread pools but can instead reuse all the threads that would be blocked anyway, etc.? Note that this means that the arena should dynamically increase its cardinality as workers arrive at its boundary. But I confess that I should probably have insisted that a partial workaround is already possible without this (by using a thread from a pool and accepting temporary undersubscription).

Dear Anton, thank you for your reply.

As far as I understand, you try to call TBB scheduling (parallel_invoke) while holding a lock which can be requested on the outer level as well. Didn't we warn against such a design anti-pattern?

Indeed, somewhere in the documentation there is a hint that it is no good idea to hold locks for too long. However, lazy initialization itself is not an anti-pattern. Plus, parallelization inside the initialization is necessary for performance reasons. We are looking for a way to do that with TBB. (Getting the hold-lock-too-long anti-pattern to work is one of several possible ways. If you can show me another way, I'll be happy with that.)

I wasn't able to turn the information from the blog into actually working code. Can you give some hints?

When is the next TBB release with support for task_arena supposed to be released?

Anton Malakhov (Intel):

Hm.. I found no direct warning regarding parallel work under a lock in the Reference; we should describe it more explicitly, thanks! The release is expected in a few weeks. But 'CPF' actually means 'backward-incompatible'. So, you should be careful deciding on usage of a CPF feature in a product. The workaround is to use a separate (e.g. std::) thread which starts the parallel processing under the lock. In this case, the work will be separated between your main thread and this additional initialization thread. If starting a thread under the lock is too expensive, please consider creating the thread in advance and putting it to sleep until the lazy initialization starts; then it can start the parallel processing, and when it finishes it should release the lock and unblock the main thread.
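
For example, something like this (an untested sketch, again reusing the Lazy members from comment #20 and assuming a C++11 std::thread; a boost::thread would do as well):

#include <thread>

int& Lazy::operator* ()
{
    if (d_resource)
        return *d_resource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_resource)
        return *d_resource;
    // The parallel work is started on a separate master thread, which gets its
    // own implicit arena, so its tasks are separated from the outer work.
    std::thread init([]
    {
        while (g_timer.user_time()<1);
        smp::parallel_invoke (Task11(), Task12());
    });
    init.join();                 // the lock holder blocks here; the lock is released on return
    d_resource = new int (42);
    return *d_resource;
}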

Raf Schietekat:

Why would it break compatibility if arenas were never exposed before?

And could you address #27 (workaround suboptimal because parallelism probably not realised in initialisation code, specific features needed)?

Anton Malakhov (Intel):

I meant that we don't promise backward-compatibility and we do not do versioning for community preview features (CPF). In particular, if you link your app with tbb_preview.dll of one version, it is not guaranteed it will work with tbb_preview.dll of the next version without recompilation (as it is for tbb.dll). Worse, on Linux and other platforms with weak symbols, if e.g. two modules are compiled with CPF of different versions, it can easily break the program because the loader can find two symbols with the same name but different layout/implementation and choose one that is incompatible with a context where it is used. Sorry, can you elaborate what your question is (re #27..)?

The workaround is to use a separate (e.g. std::) thread which starts the
parallel processing under the lock. In this case, the work will be
separated between your main thread and this additional initialization
thread. If starting a thread under the lock is too expensive,

It will not be too expensive to start the thread under the lock.

We already considered this approach, but we could not turn it into a complete design. How exactly is the work separated? I assume the std::thread for initialization has its own arena, so I can trick TBB into having two arenas without explicitly accessing arenas, right? On the one hand, tasks are collected in per-thread task pools. On the other hand, tasks belong to an arena. What's the relationship between those per-thread task pools and arenas? Does every thread belong to an arena? (If so, how is the assignment done?) Does every thread have one task pool per arena?

Note that it is not enough to prevent the second master (the std::thread for initialization) from stealing outer tasks: The example from comment #20 is based on the initialization thread stealing an outer task. But a deadlock would also occur like this: Thread T from the pool steals work from the initialization thread F, and then another thread T' from the pool steals work from T. At some point, T might have to wait for T', while still processing a subtask of the initialization. If T then starts another outer task, it might end up waiting for the lock. T' finishes, but T is trapped waiting for the lock and cannot continue. However, F will eventually wait for T and cannot complete the initialization and release the lock.

So: Does TBB prevent any threads that are processing any initialization-subtask (like T above) from stealing outer tasks before they complete their initialization-subtask?

On the other hand, I always want to have the maximum number of threads running. When the first thread arrives at the resource, it falls asleep waiting for the resource. But it starts the initialization thread, so everything is fine. If another thread arrives at the resource, it should either add a new thread to the thread pool or look for other work itself to keep the maximum number of threads working. How can I do that?

Raf Schietekat:

"I meant that we don't promise backward-compatibility and we do not do
versioning for community preview features (CPF)."
Ah, of course... I'm afraid "backward-incompatibility" may not be the best term for that.

"Sorry, can you eleborate what is your question (re #27..)"
The suggested workaround would allow initialisation with one thread per resource (manually taken from a pool), but TBB thinks the other threads are all occupied, making that an arena of one master thread without workers, whereas some threads are just waiting for the initialisation to finish, leading to undersubscription in the presence of lots of parallel work to be done as part of that initialisation. So proper support for this scenario would either allow as many threads in as are currently blocked, or would reuse the threads waiting at the boundary but prevent them from stealing outside the arena associated with the resource being initialised, and it would dynamically adjust that number as additional threads arrive at the arena waiting for the resource.

(Added) Well, there may be some workers in the additional arena, but there would still be undersubscription.

(Added) Leaving the floor to Anton to answer #32...

Anton Malakhov (Intel):

Quoting onnog
We already considered this approach, but we could not turn it into a complete design. How exactly is the work separated? I assume the std::thread for initialization has its own arena, so I can trick TBB into having two arenas without explicitly accessing arenas, right? [AM] Right
On the one hand, tasks are collected in per-thread task pools. On the other hand, tasks belong to an arena. What's the relationship between those per-thread task pools and arenas? Does every thread belong to an arena? (If so, how is the assignment done?) Does every thread have one task pool per arena?
[AM] Think of it as if an arena has a fixed number of task pools. Each master thread (main or std::thread) which schedules work or calls task_scheduler_init creates its own implicit arena. So, an (implicit) arena is tied to a thread, and a task is tied to the arena where it was spawned. Worker threads can migrate between arenas, but only when they have completed all the work in their local task pool.
Note that it is not enough to prevent the second master (the std::thread for initialization) from stealing outer tasks: The example from comment #20 is based on the initialization thread stealing an outer task. But a deadlock would also occur like this: Thread T from the pool steals work from the initialization thread F, and then another thread T' from the pool steals work from T. At some point, T might have to wait for T', while still processing a subtask of the initialization. If T then starts another outer task, it might end up waiting for the lock. T' finishes, but T is trapped waiting for the lock and cannot continue. However, F will eventually wait for T and cannot complete the initialization and release the lock.

So: Does TBB prevent any threads that are processing any initialization-subtask (like T above) from stealing outer tasks before they complete their initialization-subtask?
[AM] The tasks in arenas are separated in both directions.
On the other hand, I always want to have the maximum number of threads running. When the first thread arrives at the resource, it falls asleep waiting for the resource. But it starts the initialization thread, so everything is fine. If another thread arrives at the resource, it should either add a new thread to the thread pool or look for other work itself to keep the maximum number of threads working. How can I do that?
[AM] Please try tbb::task_group which starts a single initialization task, which creates new std::thread/arena to process the lazy initialization and waits until it finishes, and then any thread can wait on this task_group effectively executing remaining outer work (except a thread blocked on execution of the initialization task - it is necessary to avoid oversubscription.. on the other hand a temporary undersubscription is much worse than temporary oversubscription...).
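
Roughly (an untested sketch with hypothetical names; it assumes, as suggested here, that several threads may wait on the same task_group):

#include <thread>
#include <tbb/task_group.h>
#include <tbb/mutex.h>
#include <tbb/atomic.h>

tbb::task_group  g_init_group;    // every thread needing the resource waits on this group
tbb::mutex       g_start_mutex;   // guards only the *start* of the initialization
bool             g_started = false;
tbb::atomic<int> g_done;          // zero-initialized; set to 1 when the resource is ready

void ensure_initialized ()
{
    if (g_done)
        return;
    {
        tbb::mutex::scoped_lock lock(g_start_mutex);
        if (!g_started)
        {
            g_started = true;
            // the single initialization task: it creates the std::thread/arena,
            // blocks until the parallel initialization has finished, and flags completion
            g_init_group.run([]
            {
                std::thread worker([]
                {
                    // ... parallel initialization of the resource goes here ...
                });
                worker.join();
                g_done = 1;
            });
        }
    }   // the lock is released before waiting, so nobody blocks on a mutex here
    // any thread may wait on the group; while waiting it executes remaining outer work
    g_init_group.wait();
}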

Raf Schietekat:

"[AM] Please try tbb::task_group which starts a single initialization
task, which creates new std::thread/arena to process the lazy
initialization and waits until it finishes, and then any thread can wait
on this task_group effectively executing remaining outer work (except a
thread blocked on execution of the initialization task - it is
necessary to avoid oversubscription.. on the other hand a temporary
undersubscription is much worse than temporary oversubscription...)."
It would be nice for the documentation to explicitly state that multiple threads can wait() for a single task_group, instead of requiring the reader to read on about structured_task_group and interpret its restrictions as implicit capabilities of task_group. I wouldn't have guessed so, until now; did I guess right?

And of course it remains to be seen how much work can be done without idling anyway before the underpowered initialisation work finishes, because intuitively a lot of the code will depend on this initialisation. Even then, a lot of (suboptimal) stealing may be involved.

Anton Malakhov (Intel):

Quoting Raf Schietekat
"Sorry, can you eleborate what is your question (re #27..)"
The suggested workaround would allow initialisation with one thread per resource (manually taken from a pool), but TBB thinks the other threads are all occupied, making that an arena of one master thread without workers, whereas some threads are just waiting for the initialisation to finish, leading to undersubscription in the presence of lots of parallel work to be done as part of that initialisation. So proper support for this scenario would either allow as many threads in as are currently blocked, or would reuse the threads waiting at the boundary but prevent them from stealing outside the arena associated with the resource being initialised, and it would dynamically adjust that number as additional threads arrive at the arena waiting for the resource.

(Added) Well, there may be some workers in the additional arena, but there would still be undersubscription.

I hope I answered this question in previous message by suggesting to use task_group. To be more specific, waiting on a task from a different arena does not mean joining that different arena. So, you are still able to wait and execute tasks from current ('outer') arena.

And of course it remains to be seen how much work can be done without
idling anyway before the underpowered initialisation work finishes,
because intuitively a lot of the code will depend on this
initialisation.

In our case, we are quite certain that if a task arriving at the resource can only take other outer tasks, for some instances eventually all outer tasks will be finished (if they don't depend on the resource) or waiting for the resource. It's necessary that threads from the outer arena that wait for the initialization can go into the initialization arena (but not the other way round).

Before a solution for that problem is in sight, we will not try to implement the suggested solution, but rather stick to early initialization or eventually evaluate PPL.

Anton Malakhov (Intel):

Argh.. I see what your concern is now. Yes.. waiting on a task_group will not help the initialization because it spins in the outer arena.
But if such a task can recycle itself instead of waiting on initialization, the task priorities can help. Initialization thread must start work with high priority, it effectively marks initialization arena with high-priority and assign all the available resources to it, then all the workers can switch to initialization as soon as they finish processing a current task. But it will not help the main thread which should just wait when initialization completes (or it can spin on recycling if it is acceptable). Initialization task which starts the std::thread can recycle as well (instead of what I suggested before) to unblock the worker and allow it migrating into initialization arena. With new feature, the design could be simplified by calling initialization arena directly but the first implementation does not actually support participating of multiple threads explicitly joined arena to actually do the work.. we are not yet sure it is really desired feature (let's call it 'multiple masters for task_arena')

Raf Schietekat's picture

"Arhg.. I see what is your concern now. yes.. waiting on a task_goup will
not help the initialization because it spins in outer arena."
That's confusing: does a task blocked on task_group::wait() steal (as you wrote earlier) or spin (as you write now)?

"But if
such a task can recycle itself instead of waiting on initialization, the
task priorities can help. Initialization thread must start work with
high priority, it effectively marks initialization arena with
high-priority and assign all the available resources to it, then all the
workers can switch to initialization as soon as they finish processing a
current task."
I'm not sure that higher priority is the primary concern.

"But it will not help the main thread which should just
wait when initialization completes (or it can spin on recycling if it is
acceptable). Initialization task which starts the std::thread can
recycle as well (instead of what I suggested before) to unblock the
worker and allow it migrating into initialization arena."
I assume you mean that a "continuation" is used (which could be the recycled task itself or a new task)? But how does this free up the entire thread that was running the original task, unless this was a root task or everything is organised with continuations? Otherwise I'm afraid this is not a sufficient solution in that regard...
"With
new feature, the design could be simplified by calling initialization
arena directly but the first implementation does not actually support
participating of multiple threads explicitly joined arena to actually
do the work.. we are not yet sure it is really desired feature (let's
call it 'multiple masters for task_arena')"
So the feature has to be widely desired as well as really cool? What a high burden! :-)

How difficult could it be to temporarily reassign a thread to a different arena until its work there is finished?

Anton Malakhov (Intel)'s picture

Quoting Raf Schietekat
"Arhg.. I see what is your concern now. yes.. waiting on a task_goup will
not help the initialization because it spins in outer arena."
That's confusing: does a task blocked on task_group::wait() steal (as you wrote earlier) or spin (as you write now)?
[AM] It spins stealing and executing the tasks but it just spins when there are no tasks to steal. :)
"But if
such a task can recycle itself instead of waiting on initialization, the
task priorities can help. Initialization thread must start work with
high priority, it effectively marks initialization arena with
high-priority and assign all the available resources to it, then all the
workers can switch to initialization as soon as they finish processing a
current task."
I'm not sure that higher priority is the primary concern.
[AM] Are you sure you are not mixing this up with thread priority? Task priorities help transfer workers to the high-priority (initialization) arena. That is exactly what you (both) are asking for, AFAIU. Without that, half of the workers will keep stealing outer tasks.
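
For illustration only, a minimal sketch (mine; do_init_chunk is a placeholder) of running the initialization work in a high-priority task_group_context, which is the mechanism referred to here:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task.h>

void do_init_chunk(int i);   // hypothetical per-chunk initialization work

void run_initialization(int n)
{
    tbb::task_group_context ctx;
    ctx.set_priority(tbb::priority_high);   // mark this subtree as high priority

    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                do_init_chunk(i);
        },
        tbb::auto_partitioner(), ctx);
}
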
"But it will not help the main thread which should just
wait when initialization completes (or it can spin on recycling if it is
acceptable). Initialization task which starts the std::thread can
recycle as well (instead of what I suggested before) to unblock the
worker and allow it migrating into initialization arena."
I assume you mean that a "continuation" is used (which could be the recycled task itself or a new task)? But how does this free up the entire thread that was running the original task, unless this was a root task or everything is organised with continuations? Otherwise I'm afraid this is not a sufficient solution in that regard...
[AM] When a task comes to a resource which is not yet initialized, it should be able to terminate itself (by recycling itself as a continuation or creating a new continuation - it doesn't matter which) to allow the scheduler to transfer the thread to the high-priority arena.
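
As a rough sketch of this "terminate instead of blocking" idea (my own approximation: it re-enqueues a fresh task rather than recycling, which is only one of the variants mentioned above, and resource_ready()/use_resource() are placeholders):

#include <tbb/task.h>

bool resource_ready();   // hypothetical: has the shared resource been initialized?
void use_resource();     // hypothetical: the work that needs the resource

class UseResourceTask : public tbb::task {
    tbb::task* execute()
    {
        if (!resource_ready()) {
            // Instead of blocking, re-submit the work and let this task finish,
            // so the worker thread returns to the scheduler and becomes free
            // to migrate to the (high-priority) initialization arena.
            tbb::task::enqueue(*new(tbb::task::allocate_root()) UseResourceTask);
            return NULL;
        }
        use_resource();
        return NULL;
    }
};
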
"With
new feature, the design could be simplified by calling initialization
arena directly but the first implementation does not actually support
participating of multiple threads explicitly joined arena to actually
do the work.. we are not yet sure it is really desired feature (let's
call it 'multiple masters for task_arena')"
So the feature has to be widely desired as well as really cool? What a high burden! :-)
[AM] Well :) multiple masters per arena require (further) refactoring of the scheduler as a whole. And if there are no use cases, why bother? This discussion shows there are.. priorities will not help if a task cannot be terminated but needs to block instead. One of the ideas for the CPF is to gather feedback on the usage model.. well, it is not out yet but is already starting to get hot :)
How difficult could it be to temporarily reassign a thread to a different arena until its work there is finished?

[AM] It is exactly what this new CPF does.. and it was not easy work (and it still should be improved).

Raf Schietekat's picture

"Are you sure you are not mixing it with a thread priority? task
priorities helps transfer workers to high-priority (initialization)
arena. It is exactly what you (both) are asking for AFAIU. Without that,
a half of workers will keep stealing outer tasks."
Since the threads predominantly won't actually be free to migrate (unless everything is rewritten in terms of continuations), priorities won't be of much, if any, use.

"When a task comes to a resource which is not yet initialized, it should
be able to terminate itself (by recycling itself as a continuation or
creating a new continuation - doesn't matter) to allow the scheduler to
transfer the thread to high-priority arena."
But that task is not everything the thread is currently doing (exceptions like root tasks notwithstanding), so ad-hoc continuations at the arena boundary are not sufficient.

"Well :) multiple masters per arena require (further) refactoring of
the scheduler as a whole. And if there is no use-cases, why to bother?
This discussion shows there are.. priorities will not help if task
cannot be terminated but need to block instead."
Priorities will never be the solution: requiring a rewrite of the whole application in terms of continuations is not realistic. If it were, the problem wouldn't occur in the first place, as this forum thread has already established.
"One of the idea for CPF is to gather feedbacks on the usage model.. well, it is not out yet but starts getting hot already :)"
Actually I don't have any idea yet about what else it might be good for! :-)

"It is exactly what this new CPF does.. and it was not an easy work (and still should be improved)."
The initial work may well have been the hardest part; now the question is how hard it would be to bolt on this new idea: a thread arrives at an arena, changes allegiance until the arena's work is finished, then resumes where it was "interrupted", all typical stack-based operation.

Of course it would be best to first meditate a bit about possible generalisations: can there be multiple initialisations with dag-like interdependencies, are there other applications where a thread is free to leave when only its own task pool is emptied out instead of all the task pools in the arena, ... But not for too long!

Anton Malakhov (Intel)'s picture

Quoting Raf Schietekat
"When a task comes to a resource which is not yet initialized, it should
be able to terminate itself (by recycling itself as a continuation or
creating a new continuation - doesn't matter) to allow the scheduler to
transfer the thread to high-priority arena."
But that task is not everything the thread is currently doing (exceptions like root tasks notwithstanding), so ad-hoc continuations at the arena boundary are not sufficient.
[AM] Do you mean if nested parallelism (i.e. wait_for_all inside a task) is used? Yes, priorities will not help there either. But if a task finishes on the outermost level (i.e. no nesting), the worker can offload all its remaining tasks and move to the high-priority arena.

"It is exactly what this new CPF does.. and it was not an easy work (and still should be improved)."
The initial work may well have been the hardest part; now the question is how hard it would be to bolt on this new idea: a thread arrives at an arena, changes allegiance until the arena's work is finished, then resumes where it was "interrupted", all typical stack-based operation.
[AM] That is exactly the 'multiple masters' problem.. an explicit join is different from how workers behave; it is like a master. But there is currently only one slot for a master (and it is distinguished from workers by slot number). Slot selection code would have to be added to the task_arena implementation, and the scheduler would have to be refactored to extend its assumptions and differentiate user threads from TBB pool threads in another way.. that is the minimum, at least..
Of course it would be best to first meditate a bit about possible generalisations: can there be multiple initialisations with dag-like interdependencies, are there other applications where a thread is free to leave when only its own task pool is emptied out instead of all the task pools in the arena, ... But not for too long!
[AM] There is an idea of how to avoid the deadlock on nested parallelism without dealing with separate arenas at all.. similar to how PPL does it (I guess with great help from ConcRT). But I'm not sure when we can get to implementing it.

Raf Schietekat's picture

"Do you mean if nested parallelism (i.e. wait_for_all inside a task) is
used? Yes, priorities will not help as well. But if a task finishes on
the outermost level (i.e. no nesting), the worker can offload all its
remaining tasks and move to high-priority arena."
The remaining tasks on the stack will be ancestors of that continuation, or will likely be constrained in some other way. The only way to expect that worker to be able to offload its remaining tasks is if there were no remaining tasks to offload in the first place, i.e., no blocking calls ever and everything in terms of continuations. That is not a realistic assumption.

[Anton] "When a task comes to a resource which is not yet initialized, it should
be able to terminate itself (by recycling itself as a continuation or
creating a new continuation - doesn't matter) to allow the scheduler to
transfer the thread to high-priority arena."
[Raf] But that task is not
everything the thread is currently doing (exceptions like root tasks
notwithstanding), so ad-hoc continuations at the arena boundary are not
sufficient.

Also, the task's own data lives on the call stack. This would get lost if the task terminates on finding an uninitialized resource. At the very least, it would mean extremely heavy refactoring to write the task in a way that lets it terminate without data loss when it finds an uninitialized resource. I find that highly undesirable.

If I got you correctly, a thread that is waiting on task_group (e.g. for the resource) is free to take other tasks from its current arena, but not from other arenas. Right? So, would it be possible to (manually?) migrate the thread that waits for the resource to the initialization arena and take tasks from there? It could leave the task that waits for the initialization in the callstack. To me, that seems to solve the problem.

EDIT: When the initialization arena is closed, the thread should fall back to the outer arena.
EDIT: Crap at end of post removed.

Hm.. I found no direct warning regarding parallel work under a lock in the Reference; we should describe it more explicitly. Thanks!

What I had in mind when I said you did warn against holding locks too long:
TBB lazy class
You are right, this does not warn against doing parallel work under the lock. It's based on more general considerations.

Note that the improved TBB lazy class from the document will not deadlock with parallel work in the initialization, but might restart the initialization several times (which is almost as bad).
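
For reference, a sketch (my reconstruction, not the document's exact code) of that lock-free style of lazy initialization: no lock is held while constructing, so parallel work inside the constructor cannot deadlock, but several threads may each run the expensive construction and all but one result is discarded.

#include <tbb/atomic.h>
#include <cstddef>

template<typename T>
class lazy {
    tbb::atomic<T*> value;
public:
    lazy() { value = NULL; }
    ~lazy() { T* p = value; delete p; }
    T& get() {
        T* p = value;                      // atomic load
        if (!p) {
            T* candidate = new T();        // possibly constructed by several threads at once
            p = value.compare_and_swap(candidate, (T*)NULL);
            if (p == NULL)
                p = candidate;             // we won the race; our object is published
            else
                delete candidate;          // another thread won; discard our copy
        }
        return *p;
    }
};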

[AM] Think of it as if an arena has a fixed number of task pools. Each master thread (main or std::thread) which schedules work or calls task_scheduler_init creates its own implicit arena. So an (implicit) arena is tied to a thread, and a task is tied to the arena where it was spawned. Worker threads can migrate between arenas, but only when they have completed all the work from their local task pool.
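
As a minimal illustration of that description (my own example): the main thread and a std::thread each start a parallel loop, so each acts as a master with its own implicit arena.

#include <tbb/parallel_for.h>
#include <thread>

int main()
{
    // Second master thread: gets its own implicit arena for the tasks it spawns.
    std::thread other([]{
        tbb::parallel_for(0, 1000, [](int){ /* work in the second master's arena */ });
    });

    // Main thread: master of the first implicit arena.
    tbb::parallel_for(0, 1000, [](int){ /* work in the main thread's arena */ });

    other.join();
}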

So when a worker thread goes to an arena, it takes one of the arena's task pools, thereby making it its local task pool? Before, my idea was that every thread has a task pool and that every task pool belongs to a thread. That is just a simplification describing the basic case where there is one arena and this arena has by default the same number of task pools as there are threads, right?

Does the condition "when it
complete[s] all the work from its local task-pool"
imply that there is no unfinished task being processed (e.g. waiting for stolen subtasks to finish)?

What I need would be:
1. if task in local pool, take it
2. if no task in local pool, but currently processing unfinished task, steal from other pools in current arena only. If no tasks left in current arena, spin/idle.
3. if no task in local pool and not processing anything, either steal from other pools in current arena or migrate to another arena.

Is that how TBB works?

4. Have a way to manually migrate a thread that is waiting for the initialization to the initialization arena, overriding rule 2.

Anton Malakhov (Intel)'s picture

Quoting onnog
If I got you correctly, a thread that is waiting on task_group (e.g. for the resource) is free to take other tasks from its current arena, but not from other arenas. Right?
[AM] Right.
So, would it be possible to (manually?) migrate the thread that waits for the resource to the initialization arena and take tasks from there? It could leave the task that waits for the initialization in the callstack. To me, that seems to solve the problem.
[AM] It is not yet possible, but it could be in the future with multiple masters joining the same explicit task_arena object.
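
For what it is worth, a speculative sketch of what such usage might eventually look like with the preview task_arena interface (header and macro names as in the 4.1 community preview; the multiple-masters support discussed here does not exist yet, so this is only an illustration of the desired direction):

#define TBB_PREVIEW_TASK_ARENA 1
#include <tbb/task_arena.h>
#include <tbb/parallel_for.h>

tbb::task_arena init_arena;   // dedicated arena for the resource initialization

void help_with_initialization()
{
    // A thread that needs the resource would join the initialization arena
    // and work on its tasks instead of stealing unrelated outer tasks.
    auto body = []{
        tbb::parallel_for(0, 1000, [](int i){ /* initialization chunk i */ });
    };
    init_arena.execute(body);
}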

Anton Malakhov (Intel)'s picture

Quoting onnog
So when a worker thread goes to an arena, it takes one of the arena's task pools, thereby making it its local task pool? Before, my idea was that every thread has a task pool and that every task pool belongs to a thread.
[AM] That was the implementation prior to 4.1.. but it doesn't matter who owns the task pool (thread or slot), because workers always migrate with an empty task pool.
That is just a simplification to describe the basic case that there is one arena and this arena has by default the same number of task pools as there are threads, right?
[AM] Yes, it's a simplification.
Does the condition "when it
complete[s] all the work from its local task-pool"
imply that there is no unfinished task being processed (e.g. waiting for stolen subtasks to finish)?
[AM] Does 'unfinished task' mean nested parallelism, e.g. calls to wait_for_all() or an algorithm invoked from inside a task's execution? In that case, a thread cannot migrate anyway until the task finishes the wait and exits.

What I need would be:
1. if task in local pool, take it
2. if no task in local pool, but currently processing unfinished task, steal from other pools in current arena only. If no tasks left in current arena, spin/idle.
3. if no task in local pool and not processing anything, either steal from other pools in current arena or migrate to another arena.

Is that how TBB works?
[AM] Yes.

4. Have a way to manually migrate a thread that is waiting for the initialization to the initialization arena, overriding rule 2.
[AM] OK, we will consider implementing multiple masters for task_arena.
