Abstracting Thread Local Storage

[Disclaimer: I'm sketching possibilities here. There is no commitment from the TBB group to implement any of this.]

Threading packages often have some notion of a thread id or thread local storage. The two are equivalent in the sense that, given one, you can easily build the other. For example, thread local storage can be implemented as a one-to-one map from thread ids to pieces of storage. And vice versa, the address of a variable in thread-local storage can be used as a thread id.

TBB by design has no thread id or thread-local storage. TBB is based on task-based parallelism, where the programmer breaks work up into tasks, and the task scheduler is free to map tasks to hardware threads. Furthermore, our OpenMP run-time group strongly recommended that we avoid explicit thread ids because of problems with nested parallelism and dealing with a dynamically growing or shrinking team of threads. For example, the nature of the OpenMP thread id interface implies that the number of threads in a thread team is fixed for the duration of the team.

However, thread local storage does have its uses. Don Webber posted an excellent use case for thread local storage, which involves updating a sparse matrix. The problem is to perform many updates of the form *p += value in parallel, where some updates might target the same location. Assuming that += is commutative and associative, one way to implement this is to have each thread sum its own updates privately, and then merge the sums. As Don notes, the alternative of locking *p on each update is prohibitively expensive. Using an atomic +=, even where available on current hardware, would likewise be prohibitively expensive, because cache lines would ping-pong severely.

I'd like to see TBB extended to provide the power of thread local storage without opening up a Pandora's box of raw thread ids. I think the solution needs to cleanly separate a high-level portion from a low-level portion, like we recently did for cache affinity. (Note: the type task::affinity_id might appear to have opened the box, but did not, because it is a hint, not a commandment.)

TBB's parallel_reduce template partially deals with the cited use case, because it is lazy about forks and joins. The user defines how to fork and join state information for the reduction. The template recursively decomposes the range and applies the user's fork/join operations. The laziness is that fork/join pairs are used only when task stealing occurs. For example, if there are P threads and N leaf subranges, it does not do the obvious N-1 fork operations, but instead does just enough to keep the threads busy. Specifically, it does a fork/join pair only when the two children would be processed by different threads.

However, parallel_reduce is not lazy enough. At a high level, the problem is that parallel_reduce cannot assume that the reduction operation is commutative. For a non-commutative reduction operation, the current implementation is close to optimal (maybe a factor of 2 off in the worst case) with respect to the number of fork/join pairs. If TBB added a reduction template that could assume a commutative reduction operation (e.g., parallel_unordered_reduce), then at most P-1 fork/join pairs would be necessary.
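
For concreteness, here is roughly how Don's use case maps onto today's parallel_reduce. The Update record and the array layout are hypothetical stand-ins for the real sparse-matrix data:

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include <cstddef>
#include <vector>

// Hypothetical update record: add `value` to the matrix entry at `index`.
struct Update { size_t index; double value; };

// Body for parallel_reduce. Each body instance accumulates into a private
// partial-sum array; join() merges two arrays. The scheduler creates a
// fork/join pair only when stealing actually occurs.
class SumUpdates {
    const Update* my_updates;
public:
    std::vector<double> sum;                       // one slot per matrix entry
    SumUpdates( const Update* updates, size_t n )
        : my_updates(updates), sum(n, 0.0) {}
    SumUpdates( SumUpdates& other, tbb::split )    // "fork": fresh zeroed sums
        : my_updates(other.my_updates), sum(other.sum.size(), 0.0) {}
    void operator()( const tbb::blocked_range<size_t>& r ) {
        for( size_t i = r.begin(); i != r.end(); ++i )
            sum[my_updates[i].index] += my_updates[i].value;
    }
    void join( SumUpdates& rhs ) {                 // "join": merge partial sums
        for( size_t i = 0; i < sum.size(); ++i )
            sum[i] += rhs.sum[i];
    }
};

// Usage: SumUpdates body( updates, n );
//        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, count), body );
//        // body.sum now holds the merged result.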

The good thing about using the hypothetical parallel_unordered_reduce instead of exposing thread local storage is that it keeps the abstraction at a high level. Explicitly using thread local storage would introduce irrelevant low-level details. For example, a typical implementation based on thread local storage can be sketched as:


forall updates (p, value) do
    *p += value                          // *p points to a thread-local partial sum
for each thread-local partial sum s do
    global_sum += s


This level exposes issues such as "where are the thread-local partial sums that were generated in the first loop?" Since threads can come and go during execution of the first loop, iterating across the ids of currently running threads is not enough. Some of the partial sums might outlive their threads, or some threads might come into existence after the partial sums were generated. We'll need a container to hold the partial sums, and a means of iterating over the sums.

The interface for such a container seems straightforward. Define it as a sequence of T with iterator capability. Add TBB-style ranges too, so that reductions over the container can be done in parallel. Add a special method "mine()" that returns a reference to the element owned by the invoking thread. If the element is not present, "mine()" would insert a default-constructed one.

Method mine() would most likely be implemented by hashing a thread id, so it is not going to be cheap, but it is probably inexpensive enough if the user hoists calls to it out of inner loops.
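
To make this concrete, here is a rough sketch of such a container, implemented naively with a lock around a map. The name thread_specific_seq and all other details are hypothetical, not a committed interface:

#include <map>
#include <mutex>
#include <thread>

// Hypothetical container: one element of T per thread, created lazily.
// Elements outlive their threads, so iteration after the parallel phase
// sees every partial sum ever created.
template<typename T>
class thread_specific_seq {
    std::map<std::thread::id, T> elements;
    std::mutex mutex;
public:
    // Reference to the invoking thread's element, default-constructed on
    // first use. A real implementation would hash the thread id without a
    // global lock; this sketch just locks around a map lookup.
    T& mine() {
        std::lock_guard<std::mutex> lock(mutex);
        return elements[std::this_thread::get_id()];  // map references stay valid
    }
    // Iteration over all elements; safe once no thread is calling mine().
    typename std::map<std::thread::id, T>::iterator begin() { return elements.begin(); }
    typename std::map<std::thread::id, T>::iterator end()   { return elements.end(); }
};

// Since mine() is not free, hoist it out of inner loops:
//   double& local = sums.mine();   // one lookup per task
//   for( ... ) local += ...;       // hot loop touches only `local`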

There is an interesting alternative that weakens the guarantees in order to express intent at a slightly higher level. It is something like an object-oriented extension of a semaphore that combines the semaphore with the resource it protects. It would work as follows. The method "mine()" could be replaced by two methods "acquire()" and "release()" [possibly sugared with RAII] such that:

    • "acquire()" would grant access to an instance of T that is not being accessed by any other thread

    • "release()" would release access



This interface permits an implementation to limit the number of "thread local" copies of T to what is actually necessary for concurrency, rather than one per thread. If T is really big, this could be advantageous. There is perhaps an issue with cache affinity. However, a sufficiently clever implementation could bias towards handing a thread the instance of T it most recently held. Of course, a conforming implementation could just use thread-local storage for each copy of T.
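
Here is a rough sketch of how such a container and its RAII sugar might look; all names and details are hypothetical:

#include <deque>
#include <mutex>

// Hypothetical container that hands out exclusive access to *some* T not
// currently in use, creating a new one only when none is free. The number
// of copies is thus bounded by actual concurrency, not by thread count.
template<typename T>
class exclusive_pool {
    std::deque<T> storage;      // deque: growth never invalidates references
    std::deque<T*> free_list;
    std::mutex mutex;
public:
    T& acquire() {
        std::lock_guard<std::mutex> lock(mutex);
        if( free_list.empty() ) {               // no idle copy: make another
            storage.emplace_back();
            return storage.back();
        }
        T* t = free_list.back();
        free_list.pop_back();
        return *t;
    }
    void release( T& t ) {
        std::lock_guard<std::mutex> lock(mutex);
        free_list.push_back(&t);
    }
};

// RAII sugar: acquire in the constructor, release in the destructor.
template<typename T>
class scoped_access {
    exclusive_pool<T>& pool;
    T& item;
public:
    explicit scoped_access( exclusive_pool<T>& p ) : pool(p), item(p.acquire()) {}
    ~scoped_access() { pool.release(item); }
    T& operator*() { return item; }
};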

Comments?


Comments


Hi Arch
That looks like it would do the trick. I haven't been using the open source version (please make a version of this that compiles as an MSVC project; I don't want to have to install all kinds of Unix-like things on my development machine).

Any idea when that feature is likely to make it into the commercial release?

Thanks

Mick


We recently added a feature to the open source version of TBB that might be just what you are looking for: task_scheduler_observer.

If you download the Reference Manual for the open source version from http://threadingbuildingblocks.org/documentation.php , and go to section 8.6, you'll find a description of task_scheduler_observer. It lets you define an object that gets notification when TBB creates or destroys a thread. If you create the object after TBB has already created a thread, you get notification before the thread does anything on behalf of the next parallel loop. Section 8.6.5 gives this guarantee in a somewhat oblique way.
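
A minimal sketch of how one might use it for per-thread setup and teardown (the class name and bodies are just illustrative):

#include "tbb/task_scheduler_observer.h"

// Observes TBB worker threads entering and leaving the scheduler, e.g., to
// initialize per-thread storage before any task runs on a given thread.
class my_observer : public tbb::task_scheduler_observer {
public:
    my_observer() { observe(true); }     // start receiving notifications
    /*override*/ void on_scheduler_entry( bool is_worker ) {
        // Runs in the context of the entering thread, before it executes
        // any tasks on behalf of a parallel loop.
    }
    /*override*/ void on_scheduler_exit( bool is_worker ) {
        // Runs as the thread leaves the scheduler; clean up here.
    }
};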

If this solves your case, I'd like to hear about it, because we've been looking for example use cases for task_scheduler_observer.

Arch


I have another case where I want access to a thread id.
My application is database-like, with user requests distributed to worker threads that manage their own large chunks of memory, so that when a user task finishes, all memory associated with that task is wiped immediately, with no long-term fragmentation issues. This user task is now to be parallelised with TBB, so I need to make sure that memory allocation by a TBB task uses only the memory dedicated to the user task, while also using thread-specific allocations to reduce locking overhead.
To reduce this to its simplest form I need a 2D thread specific memory allocator. I currently have 1D over my user task threads and the allocator provided with TBB is 1D over all threads.
The only way I can think of to do this is for the TBB threads to have an index that can be used to provide the 2nd dimension.
I can implement this with TLS (I'm Windows only), but I then have to check my TLS index at the beginning of each task just in case the current thread is new in some way.
What would be neat is if each TBB thread made a replaceable constructor/destructor call. Then all appropriate initialisation could be done in that call, rather than constantly having to worry about the possibility of threads being constructed or terminated.
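
To illustrate the workaround I have now, the per-task check looks something like this on Windows (create_heap_for_this_thread is a hypothetical stand-in for the real allocator setup):

#include <windows.h>

void* create_heap_for_this_thread();     // hypothetical: carves memory from
                                         // the current user task's chunk

static DWORD g_tls_slot = TlsAlloc();    // one slot holds the 2nd dimension

// Every TBB task must call this first, in case it is running on a TBB
// thread that has never been seen before.
void* get_task_local_heap() {
    void* heap = TlsGetValue(g_tls_slot);
    if( heap == NULL ) {                 // new thread: set up its heap
        heap = create_heap_for_this_thread();
        TlsSetValue(g_tls_slot, heap);
    }
    return heap;
}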

Mick


After reading more of these comments, maybe it's better to have a way for TBB to return a CoreID, ThreadID, or any other kind of identifier, and just let the programmer hash that as necessary for his or her own use. I'm taking back some of my earlier comments about trying to abstract everything; sometimes you do need the low-level access. That's the beauty of C++: you get everything in C for free, plus more. Maybe TBB could be the same way: you get all the power of threads for free, plus more.


Another idea is to offer specialized versions of the parallel algorithms (parallel_for, etc.) which, when called, take an object (a thread-storage prototype) as an argument. This object is duplicated (into a thread-storage object) by each thread used by the algorithm's internal scheduler. When the programmer's functor/task is called for a specific range, the thread's own thread-storage object is also handed to the functor and can be used as a thread-specific cache, for example for counting.
This would only work if tasks have thread affinity, or if the thread-storage object is wrapped by a proxy that re-routes accesses to the thread-storage object of the actual thread (and I am not sure this could be done efficiently).
In a certain respect this is like parallel_reduce, but because of the thread-storage objects only a minimal amount of copying, merging, and duplication is needed. Another difference is that no reduction takes place. After the algorithm finishes, the programmer could ask for a container of thread-storage objects (or such a container is returned, or the thread-storage objects are filled into a container handed to the algorithm when calling it) and then iterate over it sequentially or in parallel.

While such a design wouldn't be as powerful and flexible as the one drafted above, it might be easier to use - though duplicating the algorithm interfaces for it might be unpleasant.

Code example:

// Sketch of the proposed extension (not an existing TBB interface);
// TS, TSC, in, and out are placeholders:
struct task {
    void operator()( const tbb::blocked_range<int>& r, TS& thread_storage ) const {
        // Use thread_storage to cache data that is only needed locally to the
        // thread calling this task. Thread affinity would guarantee that the
        // task isn't switched to another thread while running.
    }
}; // struct task

TS thread_storage_prototype;
TSC thread_storage_container;

parallel_for( range, task( in, out ), thread_storage_prototype, thread_storage_container );

// Look into thread_storage_container and use whatever is stored in the
// contained thread-storage objects.

Well, the longer I think about the acquire/release container, the more I like the idea - mainly because it can be made to work correctly, and if thread affinity comes into play it could be made very efficient without locking. Then it would also be easy to program a thread-specific cache as described above, but without the need for algorithm or task interface changes. I am looking forward to it ;-)

Cheers,
Bjoern


The notion of core affinity might actually work here.

With the mine() kind of interface, core affinity does not work, because TBB has no control over when the OS switches a core from one thread to another. E.g., suppose threads A and B are both executing X = X+1 on core-local storage. We would not want to be in a situation where:

1. core 0 is running thread A
2. thread A is running X = X+1, but has only done the load so far
3. core 0 switches to thread B
4. thread B does X = X+1
5. core 0 switches to thread A
6. thread A does its store into X

The net effect would be X = X+1 instead of the intended X = X+2.

But with the acquire()/release() kind of interface, the notion of mapping to cores is intriguing. Call it an "affinitized_bag<T>". It would be a collection of T, initially empty. The acquire() operation would be guaranteed to return exclusive access to an object in the bag, and release() would release rights to it. Call the acquire/release pair an "access". Many conforming implementations would be possible:

    • An implementation that has a single element and serializes all accesses to it.
    • An implementation that lazily creates an element per core and serializes accesses to the same element.
    • An implementation that lazily creates an element per thread and serializes accesses to the same element.
    • Any of the above, but changed to resolve conflicts by creating multiple elements instead of serializing.
    • Random combinations of the above strategies, to shake loose programmer violations of the contract :-)
The big win from core affinity would be in situations where cache affinity is important. The neat part is that it would address cache affinity without handing programmers explicit core ids.
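
For instance, the first implementation in the list above could be as simple as this sketch (hypothetical code, just to show how weak a conforming implementation may be):

#include <mutex>

// Trivially conforming affinitized_bag: a single element, with all accesses
// serialized. Richer implementations could lazily create an element per
// core or per thread and avoid the serialization entirely.
template<typename T>
class affinitized_bag {
    T item;
    std::mutex mutex;
public:
    T& acquire() { mutex.lock(); return item; }   // exclusive access granted
    void release( T& ) { mutex.unlock(); }        // same thread must release
};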


Disclaimer: I'm fairly new to the world of parallel programming, so my comments may be worthless.

I think we should keep the idea of threads as far away from the user as possible. The beauty of TBB (in my mind) is the idea of "task-based programming" rather than "thread-based programming". If I'm understanding the mine() idea right, I think that would be the way to go. When writing a task, the programmer doesn't need to know anything about the thread it's running on. mine() just gets a pointer to the chunk of memory that belongs to the current thread the task is running on, and afterwards you can iterate over all existing instances and pull out the data there.

Or would it make sense to go even one step lower and have something like a ProcessorID or CoreID (i.e., having the mine() idea work with a hash of a ProcessorID rather than a ThreadID)? Really, you're only ever going to have N threads running for N processors (cores), so you would only want N chunks of local storage. So even if you have M > N threads, only N are going to be active at any time, so you just have each thread update the local storage that belongs to the processor it's running on. Wouldn't this eliminate the cache issues? May have to watch out for deadlocks.