Abstracting Thread Local Storage

[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]

Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense if given one, you can easily build the other. For example, thread local storage can be implemented as a one-to-one map from thread ids to pieces of storage. And vice versa, the address of a variable in thread-local storage can be used as a thread id.

TBB by design has no thread id or thread-local storage. TBB is based on task-based parallelism, where the programmer breaks work up into tasks, and the task scheduler is free to map tasks to hardware threads. Furthermore, our OpenMP run-time group strongly recommended that we avoid explicit thread ids because of problems with nested parallelism and dealing with a dynamically growing or shrinking team of threads. For example, the nature of the OpenMP thread id interface implies that the number of threads in a thread team is fixed for the duration of the team.

However, thread local storage does have its uses. Don Webber posted an excellent use case for thread local storage, which involves updating a sparse matrix. The problem involves doing many updates of the form *p += value, in parallel, where some updates might update the same location. Assuming that += is commutative and associative, one way to implement this is to have each thread sum its own updates privately, and then merge the sums. As Don notes, the alternative of locking *p on each update is prohibitively expensive. Using an atomic +=, even if available on current hardware, would likewise be prohibitively expensive, because cache lines would ping-pong severely.

I'd like to see TBB extended to provide the power of thread local storage without opening up a Pandora's box of raw thread ids. I think the solution needs to cleanly separate a high-level portion from a low-level portion, like we recently did for cache affinity. (Note: the type task::affinity_id might appear to have opened the box, but did not, because it is a hint, not a commandment.)

TBB's template parallel_reduce in TBB partially deals with the cited use case, because it is lazy about fork/joins. The user defines how to fork/join state information for the reduction. The template recursively decomposes the range and applies the users fork/join operations. The laziness is that fork/join pairs are only used when task stealing occurs. For example, if there are P threads and N leaf subranges, it does not do the obvious N-1 fork operations, but instead does just enough to keep the threads busy. Specifically, it does a fork/join pair only when the two children would be processed by different threads.

However, parallel_reduce is not lazy enough. At the high level, the problem is that parallel_reduce cannot assume that the reduction operation is commutative. For a non-commutative reduction operation, the current implementation is close to optimal (maybe a factor of 2 off in the worse case) with respect to the number of fork/join pairs. If TBB added a reduction template that could assume a commutative reduction operation (e.g.parallel_unordered_reduce), then at most P-1 fork/join pairs would be necessary.

The good thing about using the hypothetical parallel_unordered_reduce instead of exposing thread local storage is that it keeps the abstraction at a high level. Explicitly using thread local storage would introduce irrelevant low-level details. For example, a typical implementation based on thread local storage can be sketched as:

forall updates (p,value)  

  do  *p += value // *p points to thread-local partial sum

for each thread-local partial sum do

  update global-sum+= thread-local partial sum

This level exposes issues such as "where are the thread-local partial sums that were generated in the first loop?" Since threads can come and go during execution of the first loop, iterating across ids of currently running threads is not enough. Some of the partial sums might outlive their threads, or some threads might come into existence after the partial sums were generated.  We'll need a container to hold the partial sums, and a means of iterating over the sums.

The interface for such a container seems straightforward. Define it as a sequence of T that has iterator capability. Add TBB-style ranges too, so that reductions over the container can be done in parallel. Add a special method "mine()" that returns reference to  the element that is owned by the invoking thread. If the element is not present, "mine()" would insert one and default-construct it. 

Method mine() would be most likely implemented by hashing a thread id, so it's not going to be cheap, but probably inexpensive enough if the user hoists calls to it. 

There is an interesting alternative that weakens guarantees, with the intent of expressing intent at a little higher level. It's somewhat an object-oriented extension of a semaphore that combines the semaphore with the resource it is protecting. It would work as follows. The method "mine() could be replaced by two methods "acquire()" and "release()" [possibly sugared with RAII] such that: 

    • "acquire()" would grant access to an instance of T that is not being accessed by any other thread

    • "release()" would release access

This interface permits an implementation to keep the limit the number of "thread local" copies of T to what is actually necessary for concurrency, not what is necessary for one-per-thread. If T is really big, this could be advantageous. There's perhaps an issue with cache affinity.  However, a sufficiently clever implementation could bias towards grabbing the instance of T that the thread most recently had before. Of course, a conforming implementation could just use thread-local storage for each copy of T.