Future Class

How might you write a Future class using TBB? E.g.:

template<typename T>
class tbb_future
{
public:
   tbb_future(T (*f)()); //call the passed function
   T get_value(); //block until the computation is finished, then return the value
};

Ideally get_value() will not just spin, and it can be called by multiple threads...

Mike
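For concreteness, the intended usage would look something like this (my example, not part of the original question):

int slow_computation(); //some expensive function

tbb_future<int> fut(&slow_computation); //starts computing in the background
//... do other useful work ...
int result = fut.get_value(); //blocks (without spinning) until the value is ready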


It should be fairly straightforward to use tasks and dependencies, if your program can tolerate, performance-wise, that later callers of get_value() will be blocked until the result has been evaluated (not being able to do anything useful in the meantime), unless somebody could remind me how to simulate diamond dependencies with the current state of TBB. Do you want to try that without knowing in advance whether it will give a satisfying result?

Quoting - Raf Schietekat

It should be fairly straightforward to use tasks and dependencies, if your program can tolerate, performance-wise, that later callers of get_value() will be blocked until the result has been evaluated (not being able to do anything useful in the meantime), unless somebody could remind me how to simulate diamond dependencies with the current state of TBB. Do you want to try that without knowing in advance whether it will give a satisfying result?

It's actually harder (at least for me) than it initially appears. For example, the following deadlocks if too many tasks call get_value() before the computation has completed. Any better ideas than those below?

template<typename T>
class tbb_future
{
public:
   tbb_future(T (*f)())
   {
      tbb::task* t = new(tbb::task::allocate_root()) tbb::empty_task;

      root = t;

      //one reference for the child, one for wait_for_all()
      root->set_ref_count(2);

      tbb::task* future = new(root->allocate_child()) future_task(f, &value);

      root->spawn(*future);
   }

   T get_value()
   {
      if(root)
      {
         tbb::spin_mutex::scoped_lock lock(myMutex);
         if(root) //double-checked: only the first caller waits
         {
            root->wait_for_all();
            root = 0;
            lock.release();
         }
      }
      return value;
   }
private:
   tbb::atomic<tbb::task*> root;
   T value;
   tbb::spin_mutex myMutex;

   class future_task : public tbb::task
   {
   public:
      future_task(T (*f)(), T* val) : func(f), value(val)
      {
      }

      virtual tbb::task* execute()
      {
         *value = func();
         return 0;
      }

      virtual ~future_task() {}
   private:
      T (*func)();
      T* value;
   };
   friend class future_task;
};

This is the only implementation I could come up with that works:

template<typename T>
class tbb_future
{
public:
   tbb_future(T (*f)())
   {
      done = false;
      tbb::task* future = new(tbb::task::allocate_root()) future_task(f, &queue);
      future->spawn(*future);
   }

   T get_value()
   {
      if(done)
         return value;

      T t;
      queue.pop(t); //blocks until future_task pushes the result
      if(!done)
      {
         value = t;
         done = true;
      }
      queue.push(t); //re-push so later callers' pop() also returns
      return value;
   }
private:
   tbb::concurrent_queue<T> queue;
   tbb::atomic<bool> done;
   T value;

   class future_task : public tbb::task
   {
   public:
      future_task(T (*f)(), tbb::concurrent_queue<T>* q) : func(f), queue(q)
      {
      }

      virtual tbb::task* execute()
      {
         queue->push(func());
         return 0;
      }

      virtual ~future_task() {}
   private:
      T (*func)();
      tbb::concurrent_queue<T>* queue;
   };
   friend class future_task;
};

The first implementation looks like what I had in mind. I would not use a spin_mutex, though, because it is assumed that the future is heavy enough that a task-based future will be beneficial, so tbb::mutex seems more appropriate. It looks like root is leaked just after wait_for_all(); lock.release() is merely a distraction. What happens if you destroy root afterwards, or instead use spawn_root_and_wait_for_all(root)? I haven't looked at the second implementation yet.
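For instance, the leak could be plugged roughly like this (a sketch of the first suggestion, also swapping in tbb::mutex; destroy() is the documented way to reclaim a task that is no longer waited on):

   T get_value()
   {
      if(root)
      {
         tbb::mutex::scoped_lock lock(myMutex); //blocking mutex: waiters may sleep a long time
         if(root)
         {
            root->wait_for_all();
            root->destroy(*root); //reclaim the empty_task instead of leaking it
            root = 0;
         }
      }
      return value;
   }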

It also occurred to me (only just now, sorry for that) that the code might be calling get_value() at a greater depth than where future_task was spawned, which might cause a deadlock (anyone from the TBB team?). Maybe you can trace some depth values and see whether that provides a clue, before that possibility is explored any further.

Doesn't the version using a concurrent_queue deadlock if there are no worker threads? E.g., the scheduler is initialized with "tbb::task_scheduler_init init(1);". In general, there have to be calls to one of the "wait" methods of class task to ensure progress in the single-threaded case.
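To make that concrete (an illustration, not from the original post): with a single thread, the spawned future_task can only ever run inside one of the task "wait" methods, so a blocking pop() starves it:

   tbb::task_scheduler_init init(1); //master thread only, no workers
   tbb_future<int> fut(func); //future_task is spawned but not yet executed
   int v = fut.get_value(); //queue.pop(t) blocks the only thread;
                            //future_task never runs, so the push never happens: deadlock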

We looked at implementing futures in the original version of TBB. The problem was that all the efficient implementations deadlocked for interesting cases where futures used other futures. The root problem is that futures, as typically specified, leave it to the system to figure out an evaluation order, and there are subtle conflicts between "figure it out" and the way task stealing can trap a partially evaluated task on a thread's stack.

Interesting... now what if get_value() could steal a future_task's work while that hasn't started executing, i.e., not the wimpish "stealing" of a task that has actually been carefully made available by the "victim", but really stealing something that both parties want to get to first? The work wouldn't be passed to the future_task when it is created, but the future_task would have to go back to the future :-) to get its assignment, if it is the first, otherwise some other code calls get_value() first and that way starts working on the assignment, which it stole from under the nose of the future_task. Wouldn't that dynamic spawning at least significantly decrease the opportunities for deadlock? What obstacles would remain (barring situations where futures are caught in a deadly embrace)?

Quoting - Arch Robison (Intel)

Doesn't the version using a concurrent_queue deadlock if there are no worker threads? E.g., the scheduler is initialized with "tbb::task_scheduler_init init(1);". In general, there have to be calls to one of the "wait" methods of class task to ensure progress in the single-threaded case.

We looked at implementing futures in the original version of TBB. The problem was that all the efficient implementations deadlocked for interesting cases where futures used other futures. The root problem is that futures, as typically specified, leave it to the system to figure out an evaluation order, and there are subtle conflicts between "figure it out" and the way task stealing can trap a partially evaluated task on a thread's stack.

Hmmm, interesting. I'm not sure I understand exactly: why does the system have to figure out the evaluation order?

Here's a solution that I thought would work, based on the Two Mouths example from the TBB book. But it claims that deadlock is detected if two tasks call get_value() when I do a tbb::task_scheduler_init(1). And it sometimes deadlocks at wait_for_all() when I do a tbb::task_scheduler_init(2), and future_task::execute() never executes! Do you know what's going on here? I must have some misconception about TBB... I've appreciated the comments so far. Thanks!

template<typename T>
class tbb_future
{
public:
   tbb_future(T (*f)())
   {
      done = false;
      future = new(tbb::task::allocate_root()) future_task(f, this);
      future->spawn(*future);
   }

   T get_value()
   {
      if(!done)
      {
         cout << "acquiring" << endl;
         tbb::spin_mutex::scoped_lock lock(mutex);
         cout << "acquired" << endl;
         tbb::task* temp_root = 0;
         if(!done)
         {
            temp_root = new(tbb::task::allocate_root()) tbb::empty_task;
            tbb::empty_task* child = new(temp_root->allocate_child()) tbb::empty_task;

            temp_root->set_ref_count(2);

            //future_task will spawn child when it finishes
            future->add_task(child);
         }
         lock.release();
         if(temp_root)
         {
            //what happens if child task is spawned before this line is run?
            cout << "waiting" << endl;
            temp_root->wait_for_all();
         }
      }
      return value;
   }
private:

   void finish(const T& v)
   {
      tbb::spin_mutex::scoped_lock lock(mutex);
      value = v;
      done = true;
   }

   class future_task : public tbb::task
   {
   public:
      future_task(T (*f)(), tbb_future* fut) : func(f), future(fut)
      {
      }

      virtual tbb::task* execute()
      {
         cout << "executing" << endl;
         future->finish(func());

         cout << "finished!" << endl;

         //release everyone who parked an empty child behind us
         for(size_t i = 0; i < dependent_tasks.size(); ++i)
            spawn(*dependent_tasks[i]);

         return 0;
      }

      void add_task(tbb::task* t)
      {
         dependent_tasks.push_back(t);
      }

      virtual ~future_task() {}
   private:
      T (*func)();
      tbb_future* future;
      tbb::concurrent_vector<tbb::task*> dependent_tasks;
   };
   tbb::atomic<bool> done;
   T value;
   tbb::spin_mutex mutex;
   tbb::atomic<future_task*> future;

   friend class future_task;
};

Here's how I run it:

class get_future : public tbb::task
{
public:
   get_future(tbb_future<int>& f) : fut(f)
   {
   }

   tbb::task* execute()
   {
      cout << "getting value" << endl;

      cout << "got value: " << fut.get_value() << endl;
      return 0;
   }

   tbb_future<int>& fut;
};

int func()
{
   cout << "sleeping" << endl;
   Sleep(2000);
   cout << "sleeping" << endl;
   Sleep(2000);

   return 4;
}

int main()
{
   tbb::task_scheduler_init init;
   tbb::task* s;

   tbb_future<int> fut(func);
   s = new(tbb::task::allocate_root()) get_future(fut); s->spawn(*s);
   s = new(tbb::task::allocate_root()) get_future(fut); s->spawn(*s);
   s = new(tbb::task::allocate_root()) get_future(fut); s->spawn(*s);
   s = new(tbb::task::allocate_root()) get_future(fut); s->spawn(*s);
   s = new(tbb::task::allocate_root()) get_future(fut); s->spawn(*s);

   Sleep(100);

   int val = fut.get_value();

   std::cout << endl << val << endl;

   return 0;
}

Thanks!

Again, I see a task being leaked (temp_root). The name "future" for a pointer to a "future_task" is a bit confusing (I would use m_future_task myself). I didn't think of this possibility for a worker to be able to do useful work while waiting for a future's value, but I still believe that you may run into problems by pinning all your hope on the future_task, which may be caught up in an unfortunate entanglement (see my earlier suggestion).

(Added) (Meanwhile, I saw and responded to Gunjan Rawal's request.)

Quoting - Raf Schietekat

Again, I see a task being leaked (temp_root). The name "future" for a pointer to a "future_task" is a bit confusing (I would use m_future_task myself). I didn't think of this possibility for a worker to be able to do useful work while waiting for a future's value, but I still believe that you may run into problems by pinning all your hope on the future_task, which may be caught up in an unfortunate entanglement (see my earlier suggestion).

(Added) (Meanwhile, I saw and responded to Gunjan Rawal's request.)

Hmm, you're right about the task leaking. What suggestion are you referring to, checking the task depth? If I understand correctly, that shouldn't matter in this case at all? Shouldn't wait_for_all() be able to execute any task with a ref_count of 0? What am I missing? Thanks!

Mike

If you create a future in task A and task A subsequently spawns B which does get_value() before the future_task has started executing, then you'll have a deadlock because the thread that's waiting will not consider the future_task as an option for keeping busy (maybe you'll need some more distance, but you get the picture). This situation would require another worker thread (and TBB is about optional concurrency, so everything should work with just one thread), that isn't caught up in work at a greater depth than future_task either. So there are many ways for this to cause problems, and this seems clear enough now not to need verification anymore.

Hence my suggestion to give a future_task a chance to create concurrency, but to take the assignment away from it if it hasn't started work by get_value() time. It's like a workaround for not being able to dynamically update a directed acyclic graph of dependencies (more than one place in the code may call get_value()) and get the future_task respawned at a more appropriate depth (if that concept then still holds).

I don't remember the exact sequence that caused the hang problems we saw, but I do remember the example. The example was computing the binomial coefficient B(n,k) recursively; i.e., by Pascal's triangle.

  1. Construct a future for each B(i,j) required for the recursive computation of B(n,k). It works out to a rectangular subregion of Pascal's triangle. Don't worry about arithmetic wrap-around; e.g., actually compute B(n,k) mod 2^32.
  2. Link up the futures so that evaluation of B(i,j) computes 1 if j==0 or i==j. Otherwise compute it by spawning the futures for B(i-1,j-1) and B(i-1,j) and adding their results. The key point is that there should be only one future created and evaluated for each B(i,j). The dependence graph becomes a rectangular grid (see the sketch below).
  3. Start the future for B(n,n/2) for a suitably high value of n.
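For concreteness, here is how that grid might be expressed with C++11 std::shared_future (my re-expression of the experiment, not the original TBB code; it only illustrates the dependence structure):

#include <cstdint>
#include <future>
#include <vector>

//Grid of futures for B(i,j) mod 2^32, 0 <= j <= i <= n. Each cell is created
//exactly once; an interior cell reads two cells from the row above, so the
//dependence graph is a rectangular grid.
std::uint32_t binomial(int n, int k)
{
   std::vector<std::vector<std::shared_future<std::uint32_t>>> b(n + 1);
   for(int i = 0; i <= n; ++i)
      for(int j = 0; j <= i; ++j)
         b[i].push_back(std::async(std::launch::deferred,
            [&b, i, j]() -> std::uint32_t {
               if(j == 0 || i == j) return 1;
               return b[i-1][j-1].get() + b[i-1][j].get(); //wraps mod 2^32
            }).share());
   return b[n][k].get();
}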

I recall the hang problem was related to the way that a thread working on a future X could end up waiting for another future Y to complete, and, to keep itself busy, steal another future Z that ended up needing the result of X.

This is not a situation I would have considered. But doesn't TBB preclude stealing tasks at the same depth or shallower? So wouldn't the computation proceed in the order dictated by the order of spawning, except for a window the size of the number of workers, and wouldn't this either result in all dependencies being resolved or soon resolved, or hopelessly stuck waiting for futures that have not started executing, in which case my suggestion could come to the rescue? (Sorry if there any obvious holes in this reasoning, I'm stuck in fact-finding mode for now.)

Quoting - Raf Schietekat

If you create a future in task A and task A subsequently spawns B which does get_value() before the future_task has started executing, then you'll have a deadlock because the thread that's waiting will not consider the future_task as an option for keeping busy (maybe you'll need some more distance, but you get the picture). This situation would require another worker thread (and TBB is about optional concurrency, so everything should work with just one thread), that isn't caught up in work at a greater depth than future_task either. So there are many ways for this to cause problems, and this seems clear enough now not to need verification anymore.

No, I don't understand. Where is this information in the TBB documents? The documentation for the method wait_for_all() says nothing about not being able to steal tasks from other threads, nor can I find anything that suggests these conditions do not allow task stealing from another thread. Thanks again; I have another idea I'm working on, we'll see...

Quoting - Raf Schietekat

This is not a situation I would have considered. But doesn't TBB preclude stealing tasks at the same depth or shallower? So wouldn't the computation proceed in the order dictated by the order of spawning, except for a window the size of the number of workers, and wouldn't this either result in all dependencies being resolved or soon resolved, or hopelessly stuck waiting for futures that have not started executing, in which case my suggestion could come to the rescue? (Sorry if there any obvious holes in this reasoning, I'm stuck in fact-finding mode for now.)

Thanks for your patience. I stole your idea and have the future start a new task if one hasn't started yet. I also fixed the task leaking... This appears to work, though I'm not sure about the case that Arch mentions, nor would I know how to force a test of that case. Thanks for any comments!

template<typename T>
class tbb_future
{
public:
   tbb_future(T (*f)()) : func(f)
   {
      started = false;
      done = false;
      my_future = new(tbb::task::allocate_root()) future_task(f, this);
      my_future->spawn(*my_future);
   }

   T get_value()
   {
      if(!done)
      {
         tbb::spin_mutex::scoped_lock lock(mutex);
         tbb::task* temp_root = 0;
         if(!done)
         {
            if(started)
            {
               temp_root = new(tbb::task::allocate_root()) tbb::empty_task;
               tbb::empty_task* child = new(temp_root->allocate_child()) tbb::empty_task;

               temp_root->set_ref_count(2);

               my_future->add_task(child);
               lock.release();
            }
            //it's never even started, but may have a different owner and never be able to.
            //recreate a new task and wait on it
            else
            {
               my_future = new(tbb::task::allocate_root()) future_task(func, this);
               lock.release();
               tbb::task::spawn_root_and_wait(*my_future);
            }
         }
         if(temp_root)
         {
            //what happens if child task is spawned before this line is run?
            temp_root->wait_for_all();
            temp_root->destroy(*temp_root);
         }
      }
      return value;
   }
private:

   //returns the previous value of started; the first caller claims the work
   bool start()
   {
      tbb::spin_mutex::scoped_lock lock(mutex);
      bool ret = started;
      started = true;
      return ret;
   }

   void finish(const T& v)
   {
      tbb::spin_mutex::scoped_lock lock(mutex);
      value = v;
      done = true;
   }

   class future_task : public tbb::task
   {
   public:
      future_task(T (*f)(), tbb_future* fut) : func(f), my_future(fut)
      {
      }

      virtual tbb::task* execute()
      {
         //if another task already did / is doing this computation, just return
         if(!my_future->start())
         {
            my_future->finish(func());

            for(size_t i = 0; i < dependent_tasks.size(); ++i)
               spawn(*dependent_tasks[i]);
         }

         return 0;
      }

      void add_task(tbb::task* t)
      {
         dependent_tasks.push_back(t);
      }

      virtual ~future_task() {}
   private:
      T (*func)();
      tbb_future* my_future;
      tbb::concurrent_vector<tbb::task*> dependent_tasks;
   };
   tbb::atomic<bool> done;
   tbb::atomic<bool> started;
   T value;
   tbb::spin_mutex mutex;
   tbb::atomic<future_task*> my_future;

   T (*func)();
   friend class future_task;
};

How about direct invocation instead of the extra future_task in get_value()?

Quoting - Raf Schietekat

How about direct invocation instead of the extra future_task in get_value()?

Hmmm, good suggestion, but I think this is necessary so that subsequent callers of get_value() can add a dependent_task to know when the computation is done.

So this future class is guaranteed to work except when it depends on another future class, where it could (due to task stealing) be stuck waiting for itself? This seems to be a very subtle potential for race conditions in a task-stealing scheduler that I would never have considered...

"Hmmm, good suggestion, but I think this is necessary so that subsequent callers of get_value() can add a dependent_task to know when the computation is done." Wouldn't that still work if you directly call execute()?

"stuck waiting for itself" I don't know... The example provided by Arch didn't seem to apply to TBB as it is now, or does it? Maybe it's not too difficult to formulate and follow some workable rules to make a future a useful Threading Building Block?



Quoting - Raf Schietekat
"stuck waiting for itself" I don't know... The example provided by Arch didn't seem to apply to TBB as it is now, or does it? Maybe it's not too difficult to formulate and follow some workable rules to make a future a useful Threading Building Block? this is necessary so that subsequent callers of get_value() can add a dependent_task to know when the computation is done.

Ahh, I thought you meant calling func() directly... yeah, calling execute() should work. Good idea, I hadn't considered that.

I really don't know whether it will work with TBB as it is now... that's why I was hoping to find out from the forum; but it's also important that it works with TBB as it will be...

Quoting - mwhenson

Ahh, I thought you meant calling func() directly... yeah, calling execute() should work. Good idea, I hadn't considered that.

I really don't know whether it will work with TBB as it is now... that's why I was hoping to find out from the forum; but it's also important that it works with TBB as it will be...

Also, .NET has (will have) the Task Parallel Library which is a work stealing library with futures. So it must be possible to implement futures with a work stealing scheduler. Anyone know how they do it?

Mike

What we really need is a way to create a task that is guaranteed not to steal... i.e., a task constructor with some do_not_steal option that, if set to true, would never steal in a wait_for_xxx method. This would make my code correct (future_task would call the constructor with this option), though we wouldn't be able to use Raf's direct-invocation optimization. This may lower some potential parallelism, but if the alternative is potential deadlock, it would be worth it. What do you think?

"calling execute should work, good idea" I actually consider it mandatory to avoid deadlock that the caller take responsibility for executing the code itself, or verifying that it is being executed, not just spawned.

"So it must be possible to implement futures with a work stealing scheduler." Elementary, my dear Henson! Surely everybody knows the Coffman conditions for deadlock... well, mutual exclusion corresponds to get_value() waiting until somebody else comes up with the answer, so if instead you just go ahead anyway (or wait only after verifying that some other code is actually working on the future), not caring that you might be doing redundant work or sitting idle, you will avoid inducing deadlock.

"What we really need is a way to create a task that is guarenteed not to steal..." Not really: TBB will only steal deeper tasks, so that takes care of one of the other Coffman conditions.

Well, I'm not 100% sure yet, but it seems plausible.

P.S.: Actually, Holmes never said that.

Quoting - Raf Schietekat

...waiting until somebody else comes up with the answer, so if instead you just go ahead anyway (or wait only after verifying that some other code is actually working on the future), not caring that you might be doing redundant work or sitting idle, you will avoid inducing deadlock.

"What we really need is a way to create a task that is guaranteed not to steal..." Not really: TBB will only steal deeper tasks, so that takes care of one of the other Coffman conditions.

It's possible I'm completely misunderstanding something, but guaranteeing that a task has started does not seem sufficient to prevent deadlock. Let's say that task X stole another task, Z, in an unrelated wait_for_xxx call. This should be able to happen if both X and Z are future_tasks, since they are roots with the same depth. But the future Z then calls X's corresponding get_value() method, and X is already running (deeper in the same call stack), so work cannot continue. This is what Arch was talking about, if I understand correctly. Am I completely misunderstanding something? And if I'm correct, could this be solved by not allowing a task to steal?

Mike

"But guaranteeing that a task has started does not seem sufficient to prevent deadlock." I dropped the ball there, sorry (also for the silly joke): only "go ahead anyway" goes (waiting after verification only works with a self-contained future or so), but wouldn't that be too heavy a requirement on a future? Disallowing stealing does seem plausible because it goes after the cycles, but I need to think about this some more.

"But guaranteeing that a task has started does not seem sufficient to prevent deadlock." I dropped the ball there, sorry (also for the silly joke): only "go ahead anyway" goes (waiting after verification only works with a self-contained future or so), but wouldn't that be too heavy a requirement on a future? Disallowing stealing does seem plausible because it goes after the cycles, but I need to think about this some more.

I don't think this is a heavy requirement at all. This is a potential problem when any future wants to wait on another one...

"I don't think this is a heavy requirement at all." Well then, no problem... Make sure to let us know how much redundant effort is typically spent in your programs, and whether the futures are beneficial to performance at all. Just one question at this time: are you using the futures because you don't know yet whether they will be executed, or because you want to create concurrency (always a good thing if not too fine-grained), or something else (but I don't suppose it has anything to do with reducing latency in a distributed system)?

(Added) Although... an affinity-related future might not even be a bad idea?

Quoting - Raf Schietekat
"I don't think this is a heavy requirement at all." Well then, no problem... Make sure to let us know how much redundant effort is typically spent in your programs, and whether the futures are beneficial to performance at all. Just one question at this time: are you using the futures because you don't know yet whether they will be executed, or because you want to create concurrency (always a good thing if not too fine-grained), or something else (but I don't suppose it has anything to do with reducing latency in a distributed system)?

(Added) Although... an affinity-related future might not even be a bad idea?

My idea is as follows: I create a huge task DAG where each individual task will write its resulting value to some global memory location. Then some event occurs and I need to create another huge task DAG where some of the individual tasks need to read those global memory locations. I envision that the futures will commonly be done executing by the time someone needs the value, but of course there's no way to guarantee that. What I will probably have to do right now is: create a new task that is identical to the current one, place it in the dependency list of the future, then cancel / return from the current task. This will work in my case, but it's not completely general, as the task must be guaranteed not to have any global side effects before cancelling.
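A sketch of that pattern (hypothetical names throughout; the task parks a clone of itself behind the future instead of blocking in get_value()):

class consumer_task : public tbb::task
{
public:
   consumer_task(tbb_future<int>* f) : fut(f) {}

   tbb::task* execute()
   {
      if(!fut->is_done()) //hypothetical non-blocking readiness check
      {
         //clone ourselves, park the clone behind the future, and bail out
         //before producing any side effects
         consumer_task* retry = new(tbb::task::allocate_root()) consumer_task(fut);
         fut->add_dependent(retry); //hypothetical: spawned by future_task on finish
         return 0;
      }
      use(fut->get_value()); //value is ready, so this cannot block
      return 0;
   }
private:
   tbb_future<int>* fut;
};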

Quoting - mwhenson

Also, .NET has (will have) the Task Parallel Library which is a work stealing library with futures. So it must be possible to implement futures with a work stealing scheduler. Anyone know how they do it?

That's closed source, and it seems that they are not very willing to disclose any internal details...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

This is not a situation I would have considered. But doesn't TBB preclude stealing tasks at the same depth or shallower? So wouldn't the computation proceed in the order dictated by the order of spawning, except for a window the size of the number of workers, and wouldn't this either result in all dependencies being resolved or soon resolved, or hopelessly stuck waiting for futures that have not started executing, in which case my suggestion could come to the rescue? (Sorry if there any obvious holes in this reasoning, I'm stuck in fact-finding mode for now.)

Hmmm... But futures differ from normal parent-child task relations. Isn't it possible that one task constructs a future and then passes it to another task (with lesser depth), and that task executes the future's wait()? So you will know the future's "depth" only on wait() invocation, but by then it can be already too late...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

That's closed source, and it seems that they are not very willing to disclose any internal details...

Heh probably, can't hurt to ask though. You think they do something like I suggested, where they have a task option that prevents stealing?

Quoting - Dmitriy Vyukov

Hmmm... But futures differ from normal parent-child task relations. Isn't it possible that one task constructs a future and then passes it to another task (with lesser depth), and that task executes the future's wait()? So you will know the future's "depth" only on wait() invocation, but by then it can be already too late...

It's the other way around that poses a problem: futures constructed at lesser depth but invoked at greater depth, because then you're going against the tree/dag. TBB worker threads steal the shallowest tasks they can find from their random victim (unless I'm mistaken, because at first sight it seems like a good idea to select a random victim from a preselected group that all have the shallowest potential loot, but maybe that's too costly to compute), but only tasks at a depth greater than the thief's current depth. Futures may subvert that and potentially create cycles.

Quoting - mwhenson

Heh probably, can't hurt to ask though. You think they do something like I suggested, where they have a task option that prevents stealing?

At MS Tech Ed EMEA, a presenter who talked about PPL (which is their template library for C++, similar to TBB) said to the audience that "future is really called task, believe me". I think he meant that it will be a regular task that can hold the result of its function, but won't support the semantics that you'd expect from a future. But this is just my speculation; we'll need to see what they end up with (in both PPL and TPL).

"It's the other way around that poses a problem:" On the other hand, it's when a shallower task tries to get the value from a future that was created at greater depth that it can steal something that ultimately depends on the future. Maybe things can go wrong either way? Now I've managed to confuse myself again...

I often get the impression that it would be nice to juggle multiple stacks per worker thread: if you're following a dependency, then use the same stack, if you're stealing, pick another stack, and that way you won't get stuck. Mutually exclusive threads. Coroutines. Fibers. Solaris threads (needing a LWP to execute). Apparently not a novel idea, so what does it mean that it is not mainstream?

Quoting - Raf Schietekat
"It's the other way around that poses a problem:" On the other hand, it's when a shallower task tries to get the value from a future that was created at greater depth that it can steal something that ultimately depends on the future. Maybe things can go wrong either way? Now I've managed to confuse myself again...

I often get the impression that it would be nice to juggle multiple stacks per worker thread: if you're following a dependency, then use the same stack, if you're stealing, pick another stack, and that way you won't get stuck. Mutually exclusive threads. Coroutines. Fibers. Solaris threads (needing a LWP to execute). Apparently not a novel idea, so what does it mean that it is not mainstream?

Can we do this using tbb::tbb_thread? And if so, how expensive would that be?

Quoting - mwhenson

Quoting - Raf Schietekat
"It's the other way around that poses a problem:" On the other hand, it's when a shallower task tries to get the value from a future that was created at greater depth that it can steal something that ultimately depends on the future. Maybe things can go wrong either way? Now I've managed to confuse myself again...

I often get the impression that it would be nice to juggle multiple stacks per worker thread: if you're following a dependency, then use the same stack, if you're stealing, pick another stack, and that way you won't get stuck. Mutually exclusive threads. Coroutines. Fibers. Solaris threads (needing a LWP to execute). Apparently not a novel idea, so what does it mean that it is not mainstream?

Can we do this using tbb::tbb_thread? And if so, how expensive would that be?

In the past, we've kicked around the idea of doing something like this. If anyone tries such an approach, I'd like to hear about the results. Even for the restricted DAGs typical of fork-join programs, the worst case for a P-processor system seems to be O(P^2) stacks, even though the space is only O(P*S), where S is the serial execution space. In other words, the worst case for the restricted DAG situation is many tiny stacks. But that's the worst case. It may turn out that in the common case the number of stacks is reasonable. On the other hand, if the intent is to implement futures and avoid deadlock, then the worst case must be handled.

Cilk essentially does something similar to the coroutine approach, by putting procedure frames in the heap instead of on the stack, so the number of stacks becomes moot. Alas, the Cilk approach requires compilers with a different calling convention. It's sad that the jump to 64-bit processors preceded the jump to parallelism. The industry got locked into calling conventions optimized for serial execution. The jump to 64 bits was the reset opportunity for calling conventions.

Quoting - Arch Robison (Intel)

In the past, we've kicked around the idea of doing something like this. If anyone tries such an approach, I'd like to hear about the results. Even for the restricted DAGs typical of fork-join programs, the worst case for a P-processor system seems to be O(P^2) stacks, even though the space is only O(P*S), where S is the serial execution space. In other words, the worst case for the restricted DAG situation is many tiny stacks. But that's the worst case. It may turn out that in the common case the number of stacks is reasonable. On the other hand, if the intent is to implement futures and avoid deadlock, then the worst case must be handled.

Cilk essentially does something similar to the coroutine approach, by putting procedure frames in the heap instead of on the stack, so the number of stacks becomes moot. Alas, the Cilk approach requires compilers with a different calling convention. It's sad that the jump to 64-bit processors preceded the jump to parallelism. The industry got locked into calling conventions optimized for serial execution. The jump to 64 bits was the reset opportunity for calling conventions.

Thanks for the interesting thoughts. What might this new calling convention look like that could solve this problem?

As for the original problem, I'm not convinced this is a completely generic solution: even if a future X has its own stack, it could still potentially call a wait_for_xxx and steal a task Y that depends on X, correct? Or am I missing an obvious solution?

Also, I don't have a good intuition about the overhead of tbb_thread. The TBB documentation says that threads are heavyweight and should be avoided if possible... What is the overhead like compared to a task, and how would one minimize it?
