Task level mutexes and locks

Task level mutexes and locks

Hi,

I'm a TBB newbie, but am familiar with concurrent programming etc... I've looked over the docs and read a few threads in the forum, so I am aware that I'm not the first to ask questions about DAGs.

I'm looking to use TBB to evaluate a complex directed graph of operations. I've seen a few suggestions on how to do this (eg: the white/grey/black edge colouring) which look reasonable, but involve quite a bit of book keeping to implement. I was wondering if there is anything equivalent to a task level mutex/lock which would simplify the implementation, and if not, is it a reasonable feature request?

For my purposes, we have a graph of operations. How I see it is that a TTB task is generated for each node in our graph, computes whatever it needs to do and shoves the result back into the graph node. So if we have a simple graph, where node A depends on nodes B and C, this would be fairly straight forward. We recurse up backwards through the dependency chain from A, first generating task for A, which generates tasks for B and C and waits for them (or rather a continuation task does). Tasks B and C do what they need to do then task A runs. All happy and easy enough with TBB.

However if we have another node D, which both B and C depend on it gets complex. We have a diamond dependency, so B and C's tasks will attempt to generate a task for D. Boo, what do we do here? We could...
- compute D twice and waste a thread and associated resources,
- block one thread and wait for D to be computed and waste a thread,
- implement the edge colouring idea to traverse through D, which would be quite complex to implement the glue for.

My suggestion is that there be TBB task level mutexes and locks, with them it would be comparatively trivial to implement this traversal. When a task gets to D it locks such a mutex, checks to see if the value is there, if not, it generates the task for D, waits for it to complete, then unlocks the mutex. So the first tasks computes D, the second one stops. However, being a task level lock, instead of blocking the entire thread, it blocks the calling task. The thread running that task suspends it at the point of the lock, goes away and attempts to do other work, every so often checking to see if the lock has now been gained and then resumes the task at that point. So if there were any parallel tasks needed to produce D, our blocked thread 'teleports' around the block and continue to do useful work on other tasks.

I can't imagine you could implement such a thing client side at all, as it needs to play with the TBB scheduler. Without peering into the source, I can't see this as being to hard to implement (always optimistic me), with a fairly simple lock class a few tweaks to the task queues. Admittedly this is effectively voluntary preemption, which goes against the design philosophy of TBB. If that is out of the question, something similar could be done with a 'lock continuation' task. Either way, the key point is not to have the entire thread waiting on a mutex, but only to have a task wait.

Anyone else see this at all a practical idea, or see some better way of implementing this in TBB?

tx

b

8 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

The problem with task locks is that while a task is blocked waiting on the lock, it is occupying stack space. Unfortunately, current calling conventions require declaring stack sizes ahead of time, usually conservatively. So creating many blocked tasks requires creating many stacks. Some operating systems like Linux have support for letting a thread switch stacks, but alas not all OSes have this support. Even if they did, the possibility of creating an unbounded number of stacks seems expensive if not risky.

As you suggest, there is a way to create a lock-like class where a blocked task hangs a continuation task on the lock, and the continuation task is executed when the lock unblocks. I coded one a few years ago, but alas did not keep the code around.

Somewhat related is the example in TBB/2.1/examples/parallel_do/parallel_preorder, which performs a preorder traversal of a graph.

Quoting - Arch Robison (Intel)

As you suggest, there is a way to create a lock-like class where a blocked task hangs a continuation task on the lock, and the continuation task is executed when the lock unblocks. I coded one a few years ago, but alas did not keep the code around.

Hi Arch,

thanks for the reply. Were you able to do this client side, or did you have to go into the library source to do it?

bruno

Best Reply

I believe I did it all on the client side. Off the top of my head, there are three pieces of state to maintain:
A tbb::task_list can be used to keep the waiting list of tasks.
A tbb::spin_mutex to use for some brief internal locking (not locking the region of interest).
A flag to indicate whether a task is working on the locked region.
The logic for the ACQUIRE/RELEASE operations look something like the following, which assumes thatthe task trying to acquire the lock supplies a continuation task that will process the critical section and release the lock.

ACQUIRE:
    acquire internal spin lock
    if critical section is busy then 
        put continuation task on list
    else
        mark critical section as busy
        spawn continuation task
    end
    release internal spin lock

RELEASE:
    acquire internal spin lock
    if list is not empty then
        pop task from list and spawn it
    else
        mark critical section as not busy
    release internal spin lock


Bruno,

This may look like a shameless plug for vaporware. In the threading tool I am building and nearing completion to the point of WhatIf-like alpha testing your problem could be handled by:

// placed in object or make static
qtControl qtControl_D;

void TaskD()
{
  ... D code ...
}

void DoD()
{
  if(InterlockedIncrement(&qtControl_D.Count) == 1)
  {
     TaskD(); // call direct
     InterlockedDecrement(&qtControl_D.Count)
  }
  parallel_wait(&qtControl_D);
}

void TaskB()
{
  ... prolog ...
  DoD();
  ... epelog ...
}

void TaskC()
{
  ... prolog ...
  DoD();
  ... epelog ...
}

 --- or ---

void DoD()
{
  if(InterlockedIncrement(&qtControl_D.Count) == 1)
  {
     parallel_task(&TaskD); // enqueue as task
     InterlockedDecrement(&qtControl_D.Count)
  }
  parallel_wait(&qtControl_D);
}

The second method (though not detailed in the sample code) would permit the TaskD to run on a core that is not the core that issued the task enqueue of TaskD. The reason for this might be you wish to run TaskD on a different NUMA node than TaskC or TaskB. The thread run restrictions are setable in the qtControl structure.

Constructing diamond like dependencies is trivial given the right threading toolkit.

Making posts like this might provide the TBB programmers some talking points.

Jim Dempsey

www.quickthreadprogramming.com

Hi Arch,

That makes sense and looks fairly straight forward to programme. Thanks very much for the help.

Would it not be worthwhile adding something like this into TBB itself?

Quoting - jimdempseyatthecove
[SNIP]

Making posts like this might provide the TBB programmers some talking points.

Jim Dempsey

Thanks for the ideas Jim. My current approach is based around an explicit task graph that is generated by my main graph. This is separately walked by a set of boost::threads manipulating lists and running non dependent tasks. The main problem with it is the cost of explicit task generation. Walking one graph to make another graph to do things.

Task generation has its overhead, there are some specialized techniques you can use to reduce the overhead.

One technique is a series of ring buffer queues. Using non-interlocked write operations and with defensive code to detect packet loss, you can have relatively fast enqueue/dequeue operations as long as you can live with out-of-order processing. And an occasional long latency occurring during lost packet recovery operations.

I haven't written the code but I think it can be done using 6 non-interlocked writes (to two different locations) as opposed to two interlocked operations. It should run faster as the interlocked operations are very long (relatively speaking).

Jim

www.quickthreadprogramming.com

Leave a Comment

Please sign in to add a comment. Not a member? Join today