Thread-local storage vs. non-blocking data structures

Thread-local storage vs. non-blocking data structures

Hello,

I have some event-based simulation code, with processes subscribing to events, and being triggered accordingly. The simulation proceeds in cycles, which means that if a process asks to subscribe to an event, or if an event asks to wake up a process, this is not need to be reflected in the current cycle. Each cycle processes are executed in parallel using TBB, than the simulator does some book-keeping serially.
This looks like a good case for thread-local storage: each thread will log all subscribe and trigger (wake-up) requests locally, and then the simulator will process the logs serially in its book-keeping phase.
Alternatively, non-blocking datastructures could be used to process the requests in parallel section "on-the-fly".

What alternative should I choose?
If I go for thread-local storage, how expensive is the overhead of enumerable_tread_specific container? That is, I could have one enumerable_tread_specific the_storage, where T would contain everything a thread needs, and when some object needs data it will make a long way to the_storage, or I could have small enumerable_tread_specific container in each such object that works with the data.

Thanks,

Daniel

5 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi,
Sorry for the late response. Before answering this question, I'd like to know more about how you would go about implementing non-blocking data structures (if you chose to go with it). Is it possible to throw more light on non-blocking data structures? I still think they might need some hardware support.

There are so many variables involved, it's hard to guess which of the several approaches you suggest might best serve your application, but there are a few things I can say. You mention alternative "non-blocking" data structures that you suggest could be used to process the requests (by which I hope you mean accumulated for processing in the next generation). Those data structures may be non-blocking but they're probably not contention free--serialization in the accumulation process will force all the threads through a single pipe--whereas the ETS objects are each local to a thread and can be added to by that thread without contending with any other thread.

However, there is some overhead associated with setting up the ETS and a serial step at the end to harvest all the data from the collected ETS objects. So the question boils down to this: can the performance benefits while accumulating overcome the overhead of setting it up? I suspect the answer this will be qualified: it depends on how many triggers and subscription reqeuests the code is trying to accumulate and then process. Typically there's some threshold below which the overhead is too great (compared to the overhead of the alternatives) and above which it's not.

So here at the bottom I suggest the only way to know is to prototype it and see. That way you can test with the problem sizes you expect to encounter and can see where the thresholds lie. I am a little confused, though, by your final statement, "...or I could have small enumerable_tread_specific container in each such object...." In each object? ETS is hung off of Thread Local Storage and the componentsare owned by threads, not by objects. I don't know how to create an ETS container "...in each such object" but I think I can see how using ETS as a refillable per-thread accumulator for events and requests might work to avoid contention in your simulator.

Daniel,

Consider this:

Create your list of event objects (these may be all one type or different types)

Use std::vector or tbb:: concurrent vector to create a primary vector to hold the objects or pointersto these even objects.

Have a secondary std::vectoror tbb:: concurrent vector for purpose of containing pointers (or indexes, or references) to objects within the std::vectoror tbb:: concurrent vector of all event objects in signaled state.

Then (pseudo code)

while(!done)
// advance state
parallel_for on secondary vector ofsignaled object (initially empty)
... // your serial inter state code here
// prepare for next state advancement
clear secondary vector of all signaled event objects
for each event object in primary vector
if event object signaled
push ontovector of signaled objects
endif
end for
end while

Note,when the number of events is small, the serial code can push the signaled events onto the secondary list faster than a parallel_for can.
Whenthe number of events is large, you may want to parallel-ize the picking of signaled events and use the TBB concurrent vector for the signal objects vector.

You might wish to experiment using parallel_for_each when processing your signaled object list when the work per signaled event object is disproportionate.

The above method "converts" your concept of process into task.

The event object would contain the address of a function (or use virtual function) to perform the work of the task (former event).

Jim Dempsey

www.quickthreadprogramming.com

@all,Sorry for some mistakes. I've managed with the help of ETS for logging all communication requests and processing them later serially, because cycle-based semantics of the simulator permits this.

Login to leave a comment.