How are tasks allocated? Cache aligned or not?

TBB uses an overloaded operator new to allocate tasks. Does it allocate on cache-line boundaries? Can I make it do so?

Thank you,
renorm.

It could be false sharing or something was optimized out in my toy program. The difference became exactly zero once I switched to a realistic setup.

What is the best way to create TLS objects distinct for each thread?

I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?

The overloaded operator new eventually uses an internal function called NFS_Allocate, the same function that is used by cache_aligned_allocator. So allocated tasks are aligned to the cache line size.

However, there is another aspect. The space for a task is divided between the task object itself (i.e. an instance of a class derived from tbb::task) and the task prefix, an instance of an internal class. As its name suggests, the task prefix precedes the task object. So the user-visible task object (i.e. the address returned by operator new) has a weaker alignment of 16 bytes.
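
As a quick sanity check of that alignment claim, here is a minimal sketch (assuming a 64-byte cache line) that inspects an address returned by cache_aligned_allocator, which goes through the same NFS_Allocate path:

#include <tbb/cache_aligned_allocator.h>
#include <cstddef>
#include <cstdio>

int main() {
    // The returned address should sit on a cache-line boundary.
    tbb::cache_aligned_allocator<char> alloc;
    char* p = alloc.allocate(1);
    std::printf("offset within cache line: %u\n", (unsigned)((std::size_t)p % 64)); // expect 0
    alloc.deallocate(p, 1);
    return 0;
}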

Let's say I want to run parallel_for with the Body defined below. Because tasks are cache aligned, I don't have to copy the member variable counter onto the function stack in order to avoid false sharing. Do I understand this right?

struct Body {
    void operator()(const blocked_range<int>&) const {
        // Copy counter into stack. Do I need to do it?
        int counter_copy = counter;
        
        // Here counter_copy is used very heavily and then copied back
        [...]
        counter = counter_copy;
    }

    int counter;
};

Thank you,
renorm.

With the Body being a member of a tbb::task subclass, you wouldn't have false sharing of counter, that's correct. Just be sure to initialise it. :-)

Right, false sharing is not an issue. I would, however, check whether the compiler is able to optimize all the heavy work using a register.
And also, since operator() is the only call executed for Body (not counting construction and destruction), how does the class member differ from a local variable in the given example? It seems to me that the counter value is useless after operator() finishes, because right after that the completed task is destroyed and so is the Body object it contains.

I was wondering what the counter value was for, too. Maybe this is something for parallel_reduce, instead?

Thanks for the replies. The code fragment I posted is deliberately very simple. There is more to the body I use in my project. Each instance of Body initializes counter from a global atomic variable and then uses counter very heavily. The call operator can invoke other class methods; those methods use counter as well, but not heavily. Once all the work is done, counter is added to the global atomic variable. It is good to hear that TBB uses cache-aligned allocation for its tasks. Can we say the same about Boost threads?

Regards,
renorm.

That sounds very suspicious: you read an atomic into "counter", modify it, and then add the final value back to the atomic? I see no meaningful result from doing something like that, because the initial value of "counter" depends on the timing of the scheduler, and part of what is added to the atomic is an earlier value of that same atomic. How can you explain that what you are doing is correct?

Could it be because my code isn't very detailed? Here is the explanation of how the atomic variable is used.

Think of the atomic variable as some sort of checking account. We withdraw some amount from the account, use all of it and go back for more, until the account is exhausted or we have nothing else to purchase. If somehow we don't spend all the money in our pocket, we deposit it back into the account. Each task checks the account and, if it is not empty, the task withdraws some small amount for its work and stores it in the variable counter. Once the work is complete, the task dumps the results into a concurrent vector and goes back for more money.

So the atomic is changed when "counter" is set, that's new information. Seems like a serious scalability issue, though.

Why is that? The atomic is a bank account (sort of); it must be synchronized. The other alternative is to use a mutex. The atomic is changed once before the heavy work starts and once after it. This way I can create only as many threads as there are available hardware threads and keep the amount of work between updates of the atomic relatively small. It should be faster than creating many tasks using parallel_for, because the total amount of work isn't known.

If it must be synchronised with the tasks for some business reason, then so be it, but that doesn't make it scalable. If you "keep the amount of work between updates of the atomic relatively small", that means that there are relatively many updates, and they imply contention for access to the atomic. Others should be able to provide benchmark data, but recursive parallelism is a central idea in TBB for a reason.

If this turns out to be problematic, I would perhaps try something with parallel_reduce (distributing cash in the splitting constructor), and if the cash runs out for a particular subrange push that subrange onto a concurrent_vector for a subsequent round, cook until done. Just a random idea, though, not something I've thought through.

(I removed a part that had to do with the required vs. available memory semantics for read-modify-write of the atomic, because I don't know the business requirements anyway.)

This is a sketch of the pattern I am using in my project.

atomic<int> totalCount; // total amount of work
int grainSize;          // how much work to do before requesting more
int threads;            // how many threads to use

// each thread gets a worker. Worker is expensive to construct and copy.
vector<Worker> workers(threads);

concurrent_vector<Data> concurrent_storage; // store results here

class Body {
public:
    void operator()(const blocked_range<int>& r) const {
        while (true) {
            int count = totalCount.fetch_and_add(-grainSize);
            if (count <= 0)
                return; // nothing to do, bail out

            counter = grainSize;
            Worker& worker = workers[r.begin()];  // get this thread's worker
            Data data;
            for (; counter != 0; --counter) {
                // use worker to generate data
            }

            concurrent_storage.push_back(data);
        }
    }

private:
    mutable int counter; // parallel_for requires a const operator(), hence mutable
};

// run parallel tasks
parallel_for(blocked_range<int>(0, threads, 1), Body(), simple_partitioner());

What is the minimum amount of work that can be inside the for loop without hurting scalability? With a lightweight Worker, parallel_for is OK to use as long as grainSize is large enough. An unused task would simply return without doing anything.

Hmm, using task-oriented constructs just to manage threads, that doesn't bode well. Apparently you allow more work to be done than totalCount indicates, unless totalCount is a multiple of grainSize. Why not instead run a parallel_for over totalCount, with the expensive Worker objects referenced from thread-local storage?

Hello, it might be an interesting comparison to run, actually. It seems like you're trying to substitute TBB overhead (task stealing; taking a task out of a local queue; managing parent tasks and continuations) with a centralized atomic-based mechanism. The problem I see with your approach is that the one atomic integer you're using will become highly contended (on larger machines especially) and it'll affect the overall application performance. If grainSize represents some significant amount of work, then the contention will be less noticeable, but there'll be a chance of limiting parallel slackness and running into a load-imbalance problem, hurting the overall performance again. Anyway, I reworked your code to follow the TBB paradigm; if you can compare the performance of these two and post it here, I'd be interested to learn the results.

const int totalCount = 1000; // total amount of work
const int grainSize = 10;  // how much work to do before requesting more
//int threads;    // use automatic number of threads

// each thread gets a worker. Worker is expensive to construct and copy.
// enumerable_thread_specific uses cache_aligned_allocator by default
enumerable_thread_specific<Worker> workers;

concurrent_vector<Data> concurrent_storage; // store results here

class Body {
public:
    void operator()(const blocked_range<int>& r) const {
        int count = 0;

        Worker& worker = workers.local();  // get a worker
        Data data;

        for(int it = r.begin(); it < r.end(); ++it) {
            // use worker to generate data
            buffer[count++] = data;
        }

        std::copy(buffer, buffer + count, concurrent_storage.grow_by(count));
    }
private:
    mutable Data buffer[grainSize];
};  

// run parallel tasks
parallel_for(blocked_range<int>(0, totalCount, grainSize), Body(), simple_partitioner());

Please note that I changed the way you access concurrent_storage. Pushing back each element from every thread is quite expensive; the elements should be buffered instead.

I would make buffer local to operator(), do away with simple_partitioner (that was for the threads-related code), and process buffer-sized batches from blocked_range r until done (fewer tasks that way). I'm not sure how the size of buffer would relate to the grainsize in the parallel_for over totalCount, though.
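
A sketch of that suggestion, reusing the Worker/Data declarations and the workers/concurrent_storage globals from the previous snippet (BufferSize is a hypothetical tuning constant, and <algorithm> is needed for std::copy):

class Body {
public:
    void operator()(const blocked_range<int>& r) const {
        const int BufferSize = 64;                 // assumed batch size, tune as needed
        Data buffer[BufferSize];                   // lives on this task's stack
        int count = 0;
        Worker& worker = workers.local();          // per-thread Worker from TLS
        for (int it = r.begin(); it != r.end(); ++it) {
            Data data;
            // use worker to generate data
            buffer[count++] = data;
            if (count == BufferSize) {             // flush a full batch
                std::copy(buffer, buffer + count, concurrent_storage.grow_by(count));
                count = 0;
            }
        }
        if (count > 0)                             // flush the remainder
            std::copy(buffer, buffer + count, concurrent_storage.grow_by(count));
    }
};

// no simple_partitioner and no explicit grainsize: auto_partitioner is the default
parallel_for(blocked_range<int>(0, totalCount), Body());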

(Added) Actually, why not keep the results local as well until the end... It's not going to do much for performance to avoid sharing the atomic only to then share access to the result. But I think the main benefit would be better structure and maintainability: the idea about performance just arose because of how the information became available to us, and I don't really want to promise anything from this change. But you would get a better chance this way.

We may assume that totalCount = n*grainSize. I thought about TLS, but the pattern I use looks simple (at least to me). As a general rule, I don't place anything into TLS unless there is no other way around it. Btw, what are the performance implications of placing workers into TLS? The Worker object does very heavy-duty calculations and uses plenty of memory.

Anton,
I will try your pattern and see what happens.

I tested it on a Core2 Duo by gradually decreasing grainSize. As the grainSize became smaller than 100 (each granule is something like 5-10 floating point operations), the TLS version lost more performance than the pattern with the atomic variable. With a large grain size the TLS version was slightly slower.

And now with my additional suggestions in #15? :-)

It means giving each thread its own entry in a concurrent_vector of vector for the results, index registered in TLS. The auto_partitioner instead of the simple_partitioner will avoid too many accesses to TLS, and then it's all local (as far as we've been told) until the post-join merge of the results. No tuning required, i.e., no grainsize involved.
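
A sketch of keeping the results thread-local until the end. This variant simply stores each thread's result vector in enumerable_thread_specific instead of registering an index into a concurrent_vector, which is a simplification of what is described above; Worker, Data and workers are assumed from the earlier snippets:

enumerable_thread_specific< std::vector<Data> > local_results;

struct Body {
    void operator()(const blocked_range<int>& r) const {
        std::vector<Data>& out = local_results.local();   // this thread's private results
        Worker& worker = workers.local();
        for (int it = r.begin(); it != r.end(); ++it) {
            Data data;
            // use worker to generate data
            out.push_back(data);
        }
    }
};

// auto_partitioner is the default; afterwards iterate local_results
// (for example via flatten2d) to merge everything once, after the join.
parallel_for(blocked_range<int>(0, totalCount), Body());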

Hi, so if I understand you correctly, for small grainsizes the TLS version is better, but for larger grainsizes it's a little slower? Could you specify the real 'totalCount' you're using and the amount of work it represents (assuming the app is executed single-threaded, without TBB) in seconds? And the grainSizes you're using for the experiments? I'm just trying to figure out the task sizes and the number of tasks. And another thing: how many cores does the machine you're using for the experiments have?

The actual grain size doesn't matter. What matters is the amount of work each task performs, or the amount of work between successive writes to the atomic variable. I will post a compilable fragment sometime later.
On a 2-core machine the TLS version slowed down more as the grain size decreased. The TLS version replaces each write to the atomic with a task creation. How does task creation compare to atomic writes (assuming tasks are lightweight)?

@Raf
I do roughly the same thing as you suggested in #15. Each thread buffers all data into a vector and puts the vector into a concurrent_vector of vectors using the swap method. It avoids moving data around and minimizes access to the concurrent_vector.
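
For clarity, a small sketch of that swap step (the flush helper is hypothetical; Data is as in the earlier snippets):

concurrent_vector< std::vector<Data> > all_results;

// grow_by(1) returns an iterator to the newly added, empty vector;
// std::vector::swap then exchanges buffers without copying the elements.
void flush(std::vector<Data>& local_results) {
    all_results.grow_by(1)->swap(local_results);   // local_results is left empty
}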

"I do roughly the same thing as you suggested in #15."
But only part of it... :-)

"Each threads buffers all data into a vector and puts the vector into concurrent_vector of vector using swap method."
You don't need to swap at all, because concurrent_vector entries stay put right where they are by design.

"It avoids moving data around and minimizes access to the concurent_vector."
A reference or pointer stays valid as well, if you want to avoid evaluating a concurrent_vector index (how expensive is that, anyway?).

This thread got me to rethink/redesign some parts of my code. So here is another basic question. Does it make any sense to cache align a dynamically allocated instance of tbb::atomic? In general, the atomic shouldn't be heavily contended. But what if writes are rare and reads are frequent? Can it hurt scalability?

A location in memory can easily be shared among several caches, each with an identical copy. Only when somebody wants to write is it necessary to acquire unique ownership by telling the other caches to invalidate their copy (details vary), which takes a bit of time.

Alignment can be useful to avoid false sharing between unrelated items if at least one of them is written often. If it's mostly reads, or if anybody writing something is going to write the other items also, there's no need to keep them apart.
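
For the case where a frequently written atomic does need to be kept apart, a minimal padding sketch (assuming a 64-byte cache line):

#include <tbb/atomic.h>

// Give a heavily written counter its own cache line so readers of neighbouring
// data are not invalidated by every update.
struct PaddedCounter {
    tbb::atomic<int> value;
    char pad[64 - sizeof(tbb::atomic<int>)];
};
// Allocate instances with tbb::cache_aligned_allocator (or place them in a
// cache-aligned container) so the struct itself also starts on a line boundary.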

Well, that's my understanding, feel free to point out any mistakes or omissions.

That is my understanding too.
How does tbb::cache_aligned_allocator manage its memory pool? Does it keep deallocated memory for quick recycling? I noticed that after cache-aligned containers go out of scope, the process' memory usage doesn't drop. The same thing doesn't happen with the STL allocator.

Last time I looked the scalable memory allocator kept all memory it ever used at the high-water mark, and I think aligned memory sits on top of that. If you reassign new and delete to use the scalable memory allocator for better performance, you'll see the same thing occur with STL.

(That wasn't very clear without the addition.)
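
One common way to do that reassignment, sketched below; linking against the tbbmalloc_proxy library achieves the same effect without code changes. The array and nothrow overloads would need the same treatment:

#include <tbb/scalable_allocator.h>
#include <new>

// Route global new/delete through the TBB scalable allocator.
void* operator new(std::size_t size) throw(std::bad_alloc) {
    if (void* p = scalable_malloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) throw() {
    if (p)
        scalable_free(p);
}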

OK, I redesigned my code. Now the pattern looks something like this.

// Bundle together Worker and expensive mutable variables
struct mutable_state;

// Body is cheap to copy
struct Body {
    shared_ptr<enumerable_thread_specific<mutable_state> > tlsState;

    void operator()(const blocked_range<int>&) const {
        mutable_state& state = tlsState->local();
        // use TLS here...
    }
};

// auto_partitioner is used by default
parallel_for(blocked_range<int>(0, TotalWork /*can be a big number*/), Body());

The default copy constructor is OK, and parallel_for can create a large number of copies for better work balancing.

The shared_ptr looks weird (why not just a reference or pointer, and how could tlsState.local() compile?).

tbb::flattened2d seems nice to process the results.

Do tell us what you find out about performance.

You are right. It won't compile. The pointer must be dereferenced. It is fixed now.
auto_partitioner was slightly slower.
shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr, the shared state variables would have to be managed manually.

"auto_partitioner was slightly slower."
That's difficult to believe.

"shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr shared state variables must be managed manually."
The instance must be the same across Body instances, sure, but that doesn't mean it has to be dynamically allocated instead of on the stack.

auto_partitioner was indeed slightly slower. Compiler optimization could be the reason. With auto_partitioner loop sizes are not known at compile time. I see no other reason.

The difference is about 5% or less in simplified tests. With any realistic load it should all go away.

P.S. I did test the scalability on a hyperthreaded 8-core system and the parallel version ran 9.4 times faster regardless of which parallel pattern I used. The speedup on a dual-core PC is exactly 100%.

"auto_partitioner was indeed slightly slower. Compiler optimization could be the reason. With auto_partitioner loop sizes are not known at compile time. I see no other reason."
Hmm, if that were so, you would be able to nest compiler-optimisable loops inside an outer loop, and surely loops don't need to be of known size for the compiler to be able to optimise anything? Maybe some loop limit should be hoisted out of the loop into a constant, maybe something is still being shared... But it's difficult to say much without details.
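
A trivial sketch of the hoisting idea (whether it helps depends entirely on the compiler and the surrounding code):

// Cache the loop limit in a local so the compiler can keep it in a register,
// independent of which partitioner produced the subrange.
void operator()(const blocked_range<int>& r) const {
    const int end = r.end();                  // hoisted out of the loop condition
    for (int it = r.begin(); it != end; ++it) {
        // heavy work on it
    }
}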

"What is the best way to create TLS objects distinct for each thread?"
If there are problems with enumerable_thread_specific, I'll have to defer to others.

"I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?"
Isn't that what enumerable_thread_specific::local() does? The reference for flattened2d has an example, I think.
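
A sketch of what iterating over all thread-local elements with flattened2d can look like (hypothetical names; here the elements are simply summed):

#include <tbb/enumerable_thread_specific.h>
#include <numeric>
#include <vector>

typedef tbb::enumerable_thread_specific< std::vector<int> > TLSVectors;
TLSVectors per_thread;

int sum_all() {
    // flatten2d presents all per-thread vectors as one flat sequence, without copying.
    tbb::flattened2d<TLSVectors> flat = tbb::flatten2d(per_thread);
    return std::accumulate(flat.begin(), flat.end(), 0);
}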

Hi, I'm not sure what you mean by "What is the best way to create TLS objects distinct for each thread?" If this refers to enumerable_thread_specific, then an object of this class should really be viewed as a container of elements, where each element is tied to one particular thread and there are as many elements as there are threads accessing the enumerable_thread_specific object.

Yes, I am using enumerable_thread_specific. I can't iterate through the container before all threads call local(). What I want is the same as using a distinct exemplar for each thread. One way to do it is to trigger lazy creation of all thread-local copies and then overwrite all default-constructed elements by iterating through the container.

There is a constructor for enumerable_thread_specific that takes a functor finit, which is used to generate the local copies. That could be used to create a distinct exemplar for each thread, even though initialization is lazy.

I see. Does finit have to be re-entrant? Do I need to synchronize the internals of finit, or are the calls from enumerable_thread_specific synchronized?

Another relevant question. Does TLS work with OpenMP and Boost threads?

Thank you,
renorm.

Hello, yes, it should be safe to evaluate finit concurrently, since it can be executed by several threads during the initialization of their corresponding TLS elements. enumerable_thread_specific will work for all threads created in the application; it doesn't matter whether those are OpenMP threads or native threads you created using OS calls. Once a thread accesses its element of enumerable_thread_specific by calling the local() member function, the element for that thread will be created.

Hi Anton,
Can finit have mutable static state? finit is passed by value and all copies will generate the same result unless they use some shared state. To be more specific, is this Finit OK?

struct Finit {
    static int count; // initially = 0
    int operator()() {
        return count++;
    }
};

Hi, well, this is exactly what we call a non-thread-safe function :). Although the copies of finit used by different threads to evaluate an element in their TLS will be different, all of them will reference the same value in static memory, count. If you want this to be thread-safe, make it a:

static tbb::atomic<int> count;
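
Put together, a thread-safe variant of the Finit above might look like this (sketch; the out-of-class definition of the static member is required):

#include <tbb/atomic.h>
#include <tbb/enumerable_thread_specific.h>

struct Finit {
    static tbb::atomic<int> count;
    int operator()() const {
        return count.fetch_and_add(1);   // atomically hand out the next index
    }
};
tbb::atomic<int> Finit::count;           // definition; zero-initialized

// usage: pass the functor to the constructor; initialization stays lazy, but each
// lazily created element receives a distinct value even if several threads call
// local() for the first time concurrently.
Finit f;
tbb::enumerable_thread_specific<int> tls(f);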
