Scalable hash map algorithm

I've attached an implementation of my new concurrent hash map algorithm. The implementation is NOT PRODUCTION READY, it's just an illustration of the algorithm. It can be compiled only with MSVC/Win32, but the algorithm is fairly portable: only single-word atomic RMW operations are required, and very little OS-specific stuff.

Here are benchmark results on a Q6600.
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)

My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)

Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)

My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)

Element count = 256, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)

My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)

Element count = 65536, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)

My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)

On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map. On a mixed workload it is up to 30x faster. And the scaling is substantially better, so on an 8-core system the speedup can be 100x and higher.

Memory consumption is 25% lower compared with the TBB hash map.

A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (implementation included in the archive). This allows read-only transactions to be purely read-only, i.e. no modifications of shared state.
Read-only access (the find operation) uses a kind of sequence mutex (the so-called SeqLock) to ensure consistency. But I made some modifications to the SeqLock so that read access is obstruction-free, as opposed to the classical SeqLock, which actually blocks when a reader conflicts with a writer.
Mutators (insert/remove) use fine-grained locking wrt other mutators, and atomic modifications wrt readers.
The data layout is cache-conscious, so most operations rarely touch (either write or read) more than one cache line.
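To make the read path concrete, here is a minimal sketch of the classical SeqLock scheme it builds on (C++11 atomics; the types, the 3-element cell and find_in_cell are illustrative, not the attached code). The retry on an odd sequence number is exactly the blocking step that the modification mentioned above removes to make readers obstruction-free:

#include <atomic>

typedef int key_t;   // placeholder types for this sketch
typedef int value_t;

struct cell_t
{
    std::atomic<unsigned> seq; // even: cell stable, odd: writer inside
    key_t keys[3];
    value_t vals[3];

    bool find_in_cell(key_t k, value_t& out) const
    {
        for (int i = 0; i != 3; ++i)
            if (keys[i] == k) { out = vals[i]; return true; }
        return false;
    }

    bool find(key_t k, value_t& out) const
    {
        for (;;) {
            unsigned s0 = seq.load(std::memory_order_acquire);
            if (s0 & 1) continue;              // classical SeqLock blocks here
            bool found = find_in_cell(k, out); // pure reads, no shared writes
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq.load(std::memory_order_relaxed) == s0)
                return found;                  // no writer interfered
        }
    }
};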

Attachment: hash_map.zip (11.16 KB)
All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I would like to hear Arch Robison's opinion on this.
In the TBB roadmap:
http://softwareblogs.intel.com/2007/10/12/threading-building-blocks-prod...
there is a point:
"Container class additions and improvements
The TBB development team is working on new container classes, and
interface improvements and enhanced performance for existing classes."

There are synchronization algorithms which are up to 100x faster on modern "moderate multi-core" systems than traditional lock-based ones, and they have near-linear scaling, so they are more appropriate for the "many-core future". In my opinion this is the way to bridle many-core. What do you think?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

References to the literature on the 100x faster synchronization algorithms would be appreciated.

Your concurrent_hash_map does appear to be aggressive about performance. I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.

I have some concerns about deadlock, because methods like find_and_lock return while holding a lock on a granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say in lexicographic order), it risks deadlock if the keys are in the same cell. Also, method grow appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?

Given the speed improvements, maybe an alternative usage model is called for, such as "do not lock more than one cell at once". An alternative interface, driven by C++0x lambdas, might be to replace the find_and_lock/update_and_unlock methods with a method "find_and_update(key,lambdaExpression)" where lambdaExpression does the update.
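To make the suggestion concrete, here is a minimal sketch of that interface shape (find_and_update is hypothetical, and a single mutex stands in for the real per-cell locking). Because the functor runs inside the call and the lock never escapes it, the two-key deadlock described above cannot arise:

#include <mutex>
#include <unordered_map>

template<typename Key, typename Value>
class locked_map
{
    std::mutex mtx_;
    std::unordered_map<Key, Value> map_;
public:
    template<typename F>
    bool find_and_update(const Key& k, F update)
    {
        std::lock_guard<std::mutex> guard(mtx_);
        typename std::unordered_map<Key, Value>::iterator it = map_.find(k);
        if (it == map_.end()) return false;
        update(it->second); // functor runs under the lock
        return true;
    }
    void insert(const Key& k, const Value& v)
    {
        std::lock_guard<std::mutex> guard(mtx_);
        map_[k] = v;
    }
};

// usage with a lambda: m.find_and_update(42, [](int& v) { ++v; });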

MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

I was referring to this algorithm running on an eight-core system :)

If this algorithm still has linear scaling there, and the TBB algorithm continues to degrade, then the difference will be >100x on an eight-core system. Unfortunately I don't have access to such a machine to test. Also, if we are talking about 2 processors each with 4 cores, then communication between processor packages must be even costlier than communication between cores in the same package. So an algorithm which issues heavy cache-coherence traffic will degrade in spurts there.

And note that such systems are currently present in the server segment.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

Your concurrent_hash_map does appear to be aggressive about performance.

I think it's not bad for a low-level basic building block in the context of a library which is all about performance and scalability :)

MADadrobiso:

I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.

A cell is designed to fully occupy one cache line. Because the whole block is properly aligned, a cell is guaranteed to start at a cache-line boundary. So if a thread executes a read-only operation, it has to fetch at most one cache line. If a thread executes a write operation, it gets the cache line in Modified status after executing the CAS that locks the cell; no further cache-line transfers or status changes are required. Btw, currently I am investigating the possibility of locking the cell with XCHG (instead of CMPXCHG); this can improve scalability on write workloads to some degree.
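For illustration, here is a sketch of the two cell-locking variants, using C++11 atomics in place of the raw instructions (the lock word and spin loops are illustrative, not the attached code). The point is that an exchange always completes in one shot, while a CAS can fail and retry under contention:

#include <atomic>

struct cell_lock_t
{
    std::atomic<unsigned> locked; // 0 = free, 1 = held

    void lock_with_cas() // compiles to a CMPXCHG loop
    {
        unsigned expected = 0;
        while (!locked.compare_exchange_weak(expected, 1, std::memory_order_acquire))
            expected = 0; // CAS failed: lock held (or spurious failure), retry
    }

    void lock_with_xchg() // compiles to an XCHG loop
    {
        while (locked.exchange(1, std::memory_order_acquire) == 1)
            ; // previous value was 1: someone else holds the lock, spin
    }

    void unlock()
    {
        locked.store(0, std::memory_order_release);
    }
};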
Keys and values are separated within the cell, in order not to waste memory on alignment if, for example, the key is 8 bytes (int64) and the value is 4 bytes (a pointer to some structure).
All structures which reside in arrays are padded to occupy 2^N bytes, to avoid wasting time on division when accessing an array element.
Because of the careful layout and the absence of dynamic memory allocation, the data structure is very memory efficient. On some workloads I've seen it use 25% less memory than the TBB hash map.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

I have some concerns about deadlock, because methods like find_and_lock
return while holding a lock on a granularity larger than an individual
record. For example, if a thread tries to acquire locks on two
separate keys (say in lexicographic order), it risks deadlock if the keys
are in the same cell. Also, method grow appears to lock all the cells
before proceeding. So if a thread has a lock on an item and tries to
insert another key that causes grow() to be called, does deadlock ensue?

Given the speed improvements, maybe an alternative usage model is
called for, such as "do not lock more than one cell at once". An
alternative interface, driven by C++0x lambdas, might be to replace
the find_and_lock/update_and_unlock methods with a method
"find_and_update(key,lambdaExpression)" where lambdaExpression does the
update.

You are 100% right here. Currently deadlocks are possible. I just didn't have time to investigate and document/resolve all problems...

As you noticed, 2 ways are possible. Some deadlock possibilities can be
eliminated. And some usage patterns which can lead to deadlock can simply
be prohibited (maybe with some asserts in the debug version).

For example, as for the possible deadlock when a thread tries to lock 2 items:
the hash map can provide a method which accepts an array of keys:

void find_and_lock_several_items(item_array_t* a, size_t count);

where

struct item_array_t
{
    key_t key;
    value_t placeholder_for_return_value;
    bool is_found_and_locked;
};

Or the list of all locked locations can be stored in TLS, or in the handle_t object, so that find_and_lock (and all other functions) can check for recursion.

Lambdas are a good variant too. And this can be emulated with "functors" in current C++.
For example, the functor can have the following signature:
bool update_item(key_in_t k, value_in_t old_value, value_out_t new_value); // the return value determines whether to update or not

So basically, one has to decide what *interface* and what *semantics* one wants in the first place. And then, I believe, all the problems can be solved.
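For example, here is a hypothetical functor matching the update_item signature above; the map would invoke it under the cell lock and store new_value only when it returns true:

struct increment_if_small
{
    bool operator()(int key, int old_value, int& new_value) const
    {
        (void)key;                          // not needed for this update
        if (old_value >= 100) return false; // leave the item unchanged
        new_value = old_value + 1;
        return true;                        // commit new_value
    }
};

// usage (hypothetical method): map.find_and_update(k, increment_if_small());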

Btw, the explicit proxy-collector calls 'th.quescent()' can be merged into TBB's accessors. I.e. in the constructor the accessor will acquire a reference to the proxy, and in the destructor it will release the reference. This way there will be no user-visible calls to the proxy-collector API. BUT there is a caveat. The user must not block a thread for a long time while holding any references to proxies, AND the user has to make an explicit call to the proxy-collector API before blocking a thread for a long time. Otherwise, system-wide memory reclamation will be blocked.
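A sketch of such an accessor (the pcx_thread stub and the acquire_proxy/release_proxy names are assumptions, not the actual API from the archive):

struct pcx_thread // minimal stub for the sketch
{
    void acquire_proxy() { /* pin the current proxy: reclamation is deferred */ }
    void release_proxy() { /* unpin: must happen before blocking for long */ }
};

class const_accessor
{
    pcx_thread& th_;
public:
    explicit const_accessor(pcx_thread& th) : th_(th) { th_.acquire_proxy(); }
    ~const_accessor() { th_.release_proxy(); }
private:
    const_accessor(const const_accessor&);            // non-copyable,
    const_accessor& operator=(const const_accessor&); // pre-C++0x style
};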

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Some people say that mutexes are the way to automatically resolve contention, and that other synchronization techniques (usually they call it 'lock-free') are the way to maximize contention.

Some people say that 'wait-free programming' (yeah, sometimes they call it 'wait-free') incurs nothing but additional overhead.

Some people are saying that I am hereby prohibited to be that much faster.

JUST download the code and RUN some benchmarks!

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"Some people are saying that I am hereby prohibited to be that much faster."

Not "prohibited", "forbidden". But I hope the irony didn't get lost in translation (it was a joke)?

So how much faster is this than my hash map contribution (the last version)?

Raf_Schietekat:
"Some people are saying that I am hereby prohibited to be that much faster."
Not "prohibited", "forbidden". But I hope the irony didn't get lost in translation (it was a joke)?

Up to this moment I was not sure. So it was a joke. Ok. No problem :)

Raf_Schietekat:
So how much faster is this than my hash map contribution (the last version)?

I was testing against the latest stable release (2.1). If you know how much faster your version is than the last stable release, then we can calculate the unknown quantity.
I briefly skimmed through the code (your contribution); well, it still uses a mutex for read access, so it modifies shared state, so it issues cache-coherence traffic...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"If you know how much your version is faster than the last stable release, then we can calculate the unknown quantity." Well, I was hoping to do that calculation the other way around, but it would seem to be irrelevant now, anyway...

randomizer:
MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

I was referring to this algorithm running on an eight-core system :)

Oh! Did you mean references to some academic articles, and not references to hardcore brain-damaging code?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from this one?

MADamalakho: Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from this one?

No, I was not aware of that algorithm. It looks very interesting. Thank you for the link.

I briefly skimmed through the code. Well, it looks similar to my algorithm to some degree. They also use a kind of SeqLock to protect readers.
But they "forget" about memory reclamation, i.e. their map can't grow. My map can grow, thanks to the proxy-collector.
Further, you can't store pointers in their hash map. Assume your value type is my_entity_t*. You insert an object into the map, then remove it from the map. You are not allowed to delete it! It must live forever now! Because there can be concurrent readers which are still reading the object.

I have to read more. And make some benchmarks against their implementation.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADamalakho: Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from this one?

Ok, I've read the paper and the source code.
As for the 'synchronization part' of the algorithm, their implementation is very close to mine. Read operations use timestamping, i.e. they don't modify shared data, but can be retried several times. Write operations use fine-grained locking, but their implementation uses substantially coarser-grained locking, because their algorithm requires bigger buckets. My algorithm uses a lock per cache line or even per 1/2 of a cache line, i.e. 3 or 4 elements at maximum.
But they omit memory reclamation. And as I understand it's intentional. They say something like 'we use open hashing because it doesn't require GC, and GC is additional overhead in a multi-threaded environment'. So basically you are not allowed to store pointers in the table; you can store only embedded data.
But then they write 'in benchmarks each bucket encompasses pointers to data and value (satellite key and data). This scheme is thus a general hash-map'. Well, it seems that they just preallocate all keys and data, and don't delete any objects while benchmarking. In my opinion, this can't be called a 'general hash-map'... at least in C/C++... unless I am missing something.
As for the hashing algorithm itself, it differs substantially. The only similarity is that both algorithms try to optimize cache-line usage. It's difficult to say which algorithm is better until we've done exhaustive testing in different situations.

I am still going to organize a little shootout between them, at least on a synthetic benchmark. It will be interesting. I think that in favorable conditions my algorithm will be faster. In not so favorable conditions... it's difficult to say ahead of time. However, the hashing algorithms differ substantially.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

Your concurrent_hash_map does appear to be aggressive about performance. I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.

What do you think about incorporating such techniques into TBB?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

randomizer:

Oh! Did you mean references to some academic articles, and not references to hardcore brain-damaging code?

Yes, I meant academic or trade journal articles, not code. Maybe you should write one. You do seem to have the knack for exploiting x86 memory ordering rules for all they can yield.

randomizer:

What do you think about incorporating such techniques into TBB?

I think eventually we'll try to incorporate such elaborate packing. Right now, the key issue with TBB hash tables is figuring out the concurrency semantics for the improved hash table. We're looking into trying to parameterize the locking mechanism so users can define their own accessor model, and thus not be tied down to the reader-writer model we have in tbb::concurrent_hash_map.

One of the sticking points is whether non-blocking erasure should be allowed; i.e., erase operations that return before the item is actually deleted. There seem to be pros and cons to this, depending on the use case.

It is perplexing that there are so many ways to implement concurrent hash tables, and each has various merits. The differences seem much more severe than for sequential hash tables. I'm starting to wonder if a "one size fits most" concurrent table is unrealistic.

MADadrobiso:

I think eventually we'll try to incorporate such elaborate packing.
Right now, the key issue with TBB hash tables is figuring out the
concurrency semantics for the improved hash table. We're looking into
trying to parameterize the locking mechanism so users can define
their own accessor model, and thus not be tied down to the
reader-writer model we have in tbb::concurrent_hash_map.

Can you elaborate here a bit more, please? What is an 'own accessor model'? What could it look like?

MADadrobiso:

One of the sticking points is whether non-blocking erasure should be
allowed; i.e., erase operations that return before the item is actually
deleted. There seem to be pros and cons to this, depending on the use
case.

My personal opinion is that Yes, non-blocking erasure should be allowed.

But the behavior should (1) provide sequential self-consistency, i.e. if a
thread removes an item, then subsequent operations of *that* thread must not
see the item in the table, and (2) respect causal relations between threads,
i.e. thread 1 removes an item from the table, then sends a message to
thread 2, then thread 2 receives the message, then thread 2 searches for
that item - thread 2 must not see the item in the table.

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

My reasoning is that a total order is not defined across all operations
of all threads. So the term 'before' is not defined across all operations. So reasoning based on a wall clock makes no sense in a multi-threaded environment. Only causal relations make sense.

This is how my hash map works. When a thread removes an item, it only marks the item as removed. And if the key or value is a pointer, then the thread also queues the actual key/data for deletion with the pcx_thread::defer() function. So all other threads can still access this item as long as they want. Exactly this is the underlying reason for the high performance - *asynchronous* processing, i.e. each thread operates at its own speed.
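A sketch of that remove path (all names here are illustrative, not the attached code):

#include <atomic>

struct entity_t { int data; };

void pcx_defer(entity_t*) { /* stub: the real collector deletes it later */ }

struct slot_t
{
    std::atomic<unsigned> key; // 0 means empty/dead in this sketch
    entity_t* value;
};

void remove_item(slot_t& s) // cell lock assumed held by this mutator
{
    entity_t* victim = s.value;
    s.key.store(0, std::memory_order_release); // new readers now miss the item
    pcx_defer(victim); // old readers may still be dereferencing it
}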

And this is exactly how hardware memory models work. I.e. there is no enforcement based on a wall clock; only causal relations are respected.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
MADadrobiso:
I think eventually we'll try to incorporate such elaborate packing.
Right now, the key issue with TBB hash tables is figuring out the
concurrency semantics for the improved hash table. We're looking into
trying to parameterize the locking mechanism so users can define
their own accessor model, and thus not be tied down to the
reader-writer model we have in tbb::concurrent_hash_map.

Can you elaborate here a bit more, please? What is an 'own accessor model'? What could it look like?

The proposal is to separate out the accessors from the container and make them flexible enough to enable several ways of data accessing and lifetime management:

  • "classical" RW protected access
  • direct access for atomic or const values.
  • deferred item deletion - similar to reference counters.
  • support for garbage-collected items.
  • support to make erase() methods blocking.

My prototype shows that it is feasible.
I gave your implementation only a quick look, and I also find reads with timestamps useful.
But we'll definitely consider it more closely in the next few months while implementing the new hash containers.

randomizer:

MADadrobiso:
One of the sticking points is whether non-blocking erasure should be
allowed; i.e., erase operations that return before the item is actually
deleted. There seem to be pros and cons to this, depending on the use
case.

My personal opinion is that Yes, non-blocking erasure should be allowed.

But the behavior should (1) provide sequential self-consistency, i.e. if a
thread removes an item, then subsequent operations of *that* thread must not
see the item in the table, and (2) respect causal relations between threads,
i.e. thread 1 removes an item from the table, then sends a message to
thread 2, then thread 2 receives the message, then thread 2 searches for
that item - thread 2 must not see the item in the table.

Thank you for this argumentation. It seems like what we were looking for.
My arguments - that blocking semantics is only a subset of non-blocking semantics (i.e. no lock acquired for the item being deleted), and that there are applications like web services which don't require blocking - didn't have such a convincing effect :)

randomizer: the behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then subsequent operations of *that* thread must not see the item in the table, and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2, then thread 2 receives the message, then thread 2 searches for that item - thread 2 must not see the item in the table.

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

This is a good description of the guarantees the container can provide; thanks a lot.

There is at least one questionable point though (the "other" situation, if you want). What if one and the same thread looks up a key in a hash table, finds an element and keeps a reference to it (in the terms of tbb::concurrent_hash_map, a const_accessor; well, it's a bit stronger than just a reference - it's more of a reader lock), and then repeats the search for the same key - should it find exactly the same element? And if not (which is perfectly possible with the "non-blocking erasure"), is this behavior sequentially self-consistent?

MADamalakho:
The proposal is to separate out the accessors from the container and make them flexible enough to enable several ways of data accessing and lifetime management:
  • "classical" RW protected access
  • direct access for atomic or const values.
  • deferred item deletion - similar to reference counters.
  • support for garbage-collected items.
  • support to make erase() methods blocking.

My prototype shows that it is feasible.

Ok, I see.
But I'm not sure that it's possible to implement all types of access for all data structures w/o imposing huge overheads on all types of access. I.e. it would destroy the 'pay as you go' principle.
For example, for my hash map it's difficult to implement reactive detection of the moment when an item is actually deleted. In order to implement this feature we would have to impose huge overheads on readers, no matter whether the user uses blocking erase() or not.
So basically we anyway end up having only some restricted set of accessors for every particular data structure.
Or am I completely missing something?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

randomizer:
So basically we anyway end up having only some restricted set of accessors for every particular data structure.
Or am I completely missing something?

You are not missing anything. The hard decision is how to balance potential generality against imposing costs on everyone. Our notion is that there ought to be a way to templatize the table (or operations on it) with respect to the locking discipline. For example, consider having a "Synchronizer" type as a template parameter to the table, and an "Accessor" type as a template parameter to method "find" or "insert". Then consider what the minimal set of signatures on Synchronizer and Accessor is that allows a table to operate sanely. Here's a crude early stab I took at coming up with signatures. A is an Accessor; S is a Synchronizer, and T is the type of a table item. There is assumed to be an instance of synchronizer S wrapped around each table item, and A is a template parameter to the method find or insert. E.g., different calls might have a reader-lock type for A or a writer-lock type for A, or no locking at all.

S(const T& x)
    Construct copy of x with access protection. [Should add rvalue-reference form for C++0x.]

~S()
    Destructor.

T& S::unprotected_ref();
    Return reference to underlying type.

bool A::try_acquire(S& s);
    Attempt to acquire an access right. Returns true if successful; false otherwise.

void S::wait_while_acquired();
    Wait until all access rights have been released.

With this much support, I can choose (by choosing how the operations on A and S are implemented) whether to support only reader locking (reference counting), or full reader-writer locking, or no locking at all (which is useful if concurrent erase is not to be supported). Anton has some ideas for extending it to handle more cases.

You could see how far you might templatize your hash map and its operations to open up more flexibility (perhaps in a very different direction from the above).
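For instance, one possible (purely illustrative, not a TBB proposal) instantiation of these signatures is reader counting only, where try_acquire never blocks and wait_while_acquired lets an eraser wait for readers to drain; the release() method is an addition that the sketch above leaves implicit:

#include <atomic>

template<typename T>
class counted_sync // the S of the interface above
{
    T item_;
    std::atomic<int> readers_;
public:
    explicit counted_sync(const T& x) : item_(x), readers_(0) {}
    T& unprotected_ref() { return item_; }
    bool try_acquire() { ++readers_; return true; } // readers never block
    void release() { --readers_; }
    void wait_while_acquired()
    {
        while (readers_.load(std::memory_order_acquire) != 0)
            ; // spin until every reader has released (no backoff in this sketch)
    }
};

template<typename T>
class reader_accessor // the A of the interface above
{
    counted_sync<T>* s_;
public:
    reader_accessor() : s_(0) {}
    bool try_acquire(counted_sync<T>& s) { s_ = &s; return s.try_acquire(); }
    ~reader_accessor() { if (s_) s_->release(); }
};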

randomizer:
But I'm not sure that it's possible to implement all types of access for all data structures w/o imposing huge overheads on all types of access. I.e. it would destroy the 'pay as you go' principle.

This principle must be kept. My initial proposal was to introduce a fixed set of accessor types that are suitable to be used both within concurrent_hash_map and independently. I'm not sure whether it can handle "all cases in the world", but for those listed above there is no significant overhead (while the compiler can optimize out empty inline methods and const values :) )

MADadrobiso:

Yes, I meant academic or trade journal articles, not code. Maybe you should write one. You do seem to have the knack for exploiting x86 memory ordering rules for all they can yield.

I am just coding, I am not writing articles. Maybe 20 years later I will be writing articles... :)
Actually there are no such articles. At all. At least I am not aware of any. And I am trying to be aware.

My algorithms are based on the work of very few people. What I can provide are some links to interesting comp.programming.threads discussions:
http://groups.google.com/group/comp.programming.threads/browse_frm/threa...
http://groups.google.com/group/comp.programming.threads/browse_frm/threa...
http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

And also there are some advanced lock-free libraries:
http://atomic-ptr-plus.sourceforge.net/
http://webpages.charter.net/appcore/

And also there is my mini-newsgroup, where I post all my algorithms:
http://groups.google.com/group/lock-free

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I am about to finish the resize; it should be published soon.

When you are using C++ with its current standards, i.e. no GC, there are commonly known ways to support it without changing the code.

Since the released C++ code uses templates, it is possible to use something like a smart pointer.

Good references:

1) The C++ Programming Language by Bjarne Stroustrup, Addison-Wesley

2) Design Patterns by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Addison-Wesley

The idea is that you can implement a pointer with a reference counter, and then use it as the Data or Key template parameter.

Especially read 11.10 Dereferencing from the first reference.

I'll post an example of how to create such a smart reference pointer in C++.

Please let me know if you have further questions.

Thanks for the input,

Moran Tzafrir.

MADakukanov:

randomizer: the behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then subsequent operations of *that* thread must not see the item in the table, and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2, then thread 2 receives the message, then thread 2 searches for that item - thread 2 must not see the item in the table.

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

This is a good description of the guarantees the container can provide; thanks a lot.

There is at least one questionable point though (the "other" situation, if you want). What if one and the same thread looks up a key in a hash table, finds an element and keeps a reference to it (in the terms of tbb::concurrent_hash_map, a const_accessor; well, it's a bit stronger than just a reference - it's more of a reader lock), and then repeats the search for the same key - should it find exactly the same element? And if not (which is perfectly possible with the "non-blocking erasure"), is this behavior sequentially self-consistent?

Whether it's sequentially self-consistent behaviour or not depends on the definition of an accessor. If an accessor is defined only as a thing which provides access to an object, then yes, it's self-consistent behaviour. If an accessor is defined as a thing which provides access to an object and also gives some guarantees about the state of the hash map (for example, that while you are holding an accessor to an element, that element can't be removed from the hash map), then no, it's not self-consistent behaviour.
And I'm not sure how accessors are currently defined in TBB, whether they only provide access to an element, or also provide some additional guarantees about the state of the hash map.

If we must choose only one variant, then personally I would prefer the former, i.e. the accessor only provides access to the element, just because this gives more freedom to the implementor and allows substantially higher performance.

But it seems that in the context of the proposal for different types of accessors:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/...
we cannot choose any more. Accessor const_accessor will only provide access to the element (it can use low-overhead obstruction-free timestamp-based validation), while accessor const_accessor_and_hash_map_state_locker will provide access to the object and also some guarantees about the state of the hash map (it will use read locking of a reader-writer mutex).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADamalakho:
This principle must be kept. My initial proposal was to introduce a fixed set of accessor types that are suitable to be used both within concurrent_hash_map and independently. I'm not sure whether it can handle "all cases in the world", but for those listed above there is no significant overhead (while the compiler can optimize out empty inline methods and const values :) )

For example, to support reference-counted object lifetime management ('deferred item deletion - similar to reference counters') an additional level of indirection is required. I.e. a separate proxy object must be allocated, and this proxy object will hold the user key/data along with the reference counter. If reference counting is not used, then the additional level of indirection and the additional memory allocations/deallocations are not required.
And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

E.g., different calls might have a reader-lock type for A or a writer-lock type for A, or no locking at all.

S(const T& x)
    Construct copy of x with access protection. [Should add rvalue-reference form for C++0x.]

~S()
    Destructor.

T& S::unprotected_ref();
    Return reference to underlying type.

bool A::try_acquire(S& s);
    Attempt to acquire an access right. Returns true if successful; false otherwise.

void S::wait_while_acquired();
    Wait until all access rights have been released.

As a proof of concept I've tried to apply this interface to my hash map.
In my hash map read access is protected by a kind of SeqLock which allows 'retries'. I.e. the find function looks like this:

value_t hash_map::find(key_t k)
{
    for (;;) {
        handle_t h = lock.begin_read_transaction();
        value_t v = find_impl(k);
        if (lock.commit_read_transaction(h))
            return v;
    }
}

And I think the same situation will arise if we use a kind of STM.
And this pattern is not supported by the Synchronizer/Accessor interfaces.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Just make sure that TBB doesn't end up with just a lightning-fast hash map that makes me have to turn off the TV to be able to get it to reliably do what I want. (Seriously: TBB should primarily aim to be the best enabling toolkit for Joe Programmer, with best overall a close second.)

Raf_Schietekat:Just make sure that TBB doesn't end up with just a lightning-fast hash map that makes me have to turn off the TV to be able to get it to reliably do what I want. (Seriously: TBB should primarily aim to be the best enabling toolkit for Joe Programmer, with best overall a close second.)

Joe Programmer is better off enjoying single-threaded programming.
You may say "And what about taking advantage of modern multi-core processors?". Do you realize that the current TBB hash map is substantially (more than 10x) slower on a quad-core than a basic single-threaded hash map on a single core? And the more cores we have, the slower the TBB hash map will get.
After all, who said that raw multi-threaded multi-core programming can be easy? Well, some higher-level programming models (like TBB tasks) can provide a good ratio between simplicity and scalability. But that's definitely not raw multi-threaded multi-core programming.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I was not aware of any such performance problems with the hash map.

I'm sure that achieving ultimate performance will never be easy, but it's not just about that (anymore), or so I hear.

Raf_Schietekat:
I was not aware of any such performance problems with the hash map.
I'm sure that achieving ultimate performance will never be easy, but it's not just about that (anymore), or so I hear.

A good sanity check for multi-threaded code is to check it against the
best single-threaded code. If it's not faster, then, well, what's the
point? Yeah, all 1000 cores are 100% busy, but performance is worse than single-threaded code.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

randomizer:
A good sanity check for multi-threaded code is to check it against the best single-threaded code.

That's true when the code is the application. It gets trickier to evaluate when looking at only one piece of an application, like a data structure. Consider a table that supports fetch-and-update. Using a single lock might be faster than locking each item separately, but if most of the time is spent in the update calculation (and not the search/lock itself), we are better off with the slower implementation that permits more concurrency (as long as it is not too much slower).

That said, there is much room for improving the performance of TBB's concurrent_hash_map, though I never expect it to be as fast as a single-threaded hash map. That is, unless someone comes up with zero-overhead transactional memory.

randomizer:
And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?

Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So, it's known which accessor(s) will be used with the container.

randomizer:
As a proof of concept I've tried to apply this interface to my hash map.
In my hash map read access is protected by a kind of SeqLock which allows 'retries'. I.e. the find function looks like this:
...[skipped]...
And I think the same situation will arise if we use a kind of STM.
And this pattern is not supported by the Synchronizer/Accessor interfaces.

I believe that this is part of the container's algorithm rather than something related to the access abstraction. You could rewrite it using an accessor which stores a data copy inside. And then read retries become part of the container logic.
But it is a good example to consider, thank you.

MADamalakho:
randomizer:
And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?

Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So, it's known which accessor(s) will be used with the container.

It's known which accessors can *potentially* be used with the container. But it's *not* known which accessors *really* will be used with the container. So it's impossible to keep the 'pay as you go' principle. Everybody has to pay for the possibility of reference counting with an additional level of indirection and additional malloc/free calls.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
You may say "And what about taking advantage of modern multi-core processors?". Do you realize that the current TBB hash map is substantially (more than 10x) slower on a quad-core than a basic single-threaded hash map on a single core? And the more cores we have, the slower the TBB hash map will get.
After all, who said that raw multi-threaded multi-core programming can be easy? Well, some higher-level programming models (like TBB tasks) can provide a good ratio between simplicity and scalability. But that's definitely not raw multi-threaded multi-core programming.

I'm not sure about the measurements you provide. I compared the current version of concurrent_hash_map with a locked std::map (since a hash map is non-standard), and it looks and scales well for sizes above a few thousand items. And we will be able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.

randomizer:
MADamalakho:
Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So, it's known which accessor(s) will be used with the container.

It's known which accessors can *potentially* be used with the container. But it's *not* known which accessors *really* will be used with the container. So it's impossible to keep the 'pay as you go' principle. Everybody has to pay for the possibility of reference counting with an additional level of indirection and additional malloc/free calls.

Consider this code:
template<typename T>
class Synchronizer {
    class Accessor {};
};
template<typename T, typename S>
class hash_map;

An Accessor of a certain type is defined within the particular Synchronizer class that is used to instantiate the container. So, you pay *only when you really need* some complicated stuff like GC.
The default Synchronizer with deferred deletion does not necessarily require heavy reference counting.
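For example (plain_sync and refcount_sync are made-up names, purely to illustrate the idea):

// Each Synchronizer bundles its own Accessor types, so a map specialized
// with a plain synchronizer never pays for reference counting or GC hooks.
hash_map<int, plain_sync<int> > fast_map;   // no per-item overhead
hash_map<int, refcount_sync<int> > gc_map;  // pays for deferred deletion only here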

MADamalakho:
I'm not sure about the measurements you provide. I compared the current version of concurrent_hash_map with a locked std::map (since a hash map is non-standard), and it looks and scales well for sizes above a few thousand items. And we will be able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.

On what hardware did you run the tests?
Note that I was talking about the following situation: the multi-threaded container is tested on a multi-core machine with many threads; the single-threaded container is tested with one thread, and it is not protected by a mutex.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
Note that I was talking about the following situation: the multi-threaded container is tested on a multi-core machine with many threads; the single-threaded container is tested with one thread, and it is not protected by a mutex.

That is of purely academic interest. In a real application, concurrent containers should be used just as a synchronization object - a crossing point of separate threads/tasks which process data independently from each other.

MADamalakho:
And we will be able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.

A few comments on this.
My map (at least in the current version) is not simple for users. It has some caveats with memory reclamation, and supports only types for which reading an inconsistent state is safe. Basically, it is not C++ type-safe.
Hopscotch (in its current version) is even worse. In addition, it doesn't support satellite data.
How can one explain those caveats to Joe Programmer? I don't know.
And the other papers are targeted at Java, where life is much easier with GC.

But some improvements are definitely possible. To what point? I don't know. I'm constantly trying to merge high performance with user friendliness in the context of C++, but the results are never comforting. My current point of view is that only 2 solutions are possible: (1) low-level primitives usable only by experts and (2) a higher-level programming model usable by all. Low-level primitives usable by all are not possible.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Raf, adrobiso, amalakho

I understand that a synchronization primitive in the common case can't be as fast as a single-threaded primitive (I got a bit carried away here). I also perfectly understand your concern with convenience and simplicity for the end user. I am not pushing such solutions into TBB; I've just shown what, I think, may be interesting to you. The final choice is up to you.

I agree with Anton that in a well-designed application "concurrent containers should be used just as a synchronization object - a crossing point of separate threads/tasks which process data independently from each other". But there are still some aspects of synchronization primitive performance which affect the end user. First, the performance of synchronization primitives affects the notion of what a "well designed application" is. For example, if the overhead per task execution in TBB is around 500 cycles, then I MUST design my application so that the payload of a task is not less than 5000 cycles. And if the overhead per task is around 50 cycles, then I CAN design my application so that the payload of a task is between 500 and 5000 cycles too. In some situations high overhead per task execution can force me to abandon a simple and straightforward design and use some roundabout design instead.
Also the performance of synchronization primitives can have a great impact on 'not so well designed applications'. And if TBB is a library for Joe Programmer too, then it must account for 'not so well designed applications' too.

I see several ways for TBB in this context:
1. Concentrate on simplicity for end user, and get performance which can be obtained with this approach.
2. Concentrate on performance, and get simplicity which can be obtained with this approach.
3. Try to incorporate some of high-performance techniques, not sacrificing simplicity for end user.
4. Sacrifice a bit simplicity for end user for the sake of performance.
5. End up with 2 versions of primitives, first is for simplicity, and second is for performance.
6. Concentrate only on higher-level programming models (tasks for HPC, and agents for general-purpose computing), incorporating advanced synchronization techniques internally on low level.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

At this point I want to apologise to all programmers named Joe for dragging them into this.

I would prefer getting a solid kick without too much worrying involved (no surprises allowed), with additional (up to ultimate) performance available if and when I'm prepared to deal with the associated complexity (it's not just about being an "expert" or not).

Raf_Schietekat:
I would prefer getting a solid kick without too much worrying involved (no surprises allowed), with additional (up to ultimate) performance available if and when I'm prepared to deal with the associated complexity (it's not just about being an "expert" or not).

Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as an end user, not as a developer?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
Raf_Schietekat:
I would prefer getting a solid kick without too much worrying involved (no surprises allowed), with additional (up to ultimate) performance available if and when I'm prepared to deal with the associated complexity (it's not just about being an "expert" or not).

Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as an end user, not as a developer?

Mutex-based solutions have associated complexity for the end user. The best known is the possibility of deadlock. The less well known is spurious and unexpected performance destruction when transitioning from single-core to multi-core.
So usage of mutex-based synchronization instead of... let's call it advanced synchronization can be considered at least as trading bad for worse.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
I see several ways for TBB in this context:
1. Concentrate on simplicity for end user, and get performance which can be obtained with this approach.
2. Concentrate on performance, and get simplicity which can be obtained with this approach.
3. Try to incorporate some of high-performance techniques, not sacrificing simplicity for end user.
4. Sacrifice a bit simplicity for end user for the sake of performance.
5. End up with 2 versions of primitives, first is for simplicity, and second is for performance.
6. Concentrate only on higher-level programming models (tasks for HPC, and agents for general-purpose computing), incorporating advanced synchronization techniques internally on low level.

We'll work hard to make the fusion of simplicity and performance as deep as possible. So, I like point 3 and a bit of 4.

But I'm on vacation for a couple of weeks to get married :-) So, I'll be able to continue this work and discussion when I'm back.

"Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as end user, not as developer?"

I was thinking about the difficulties associated with departing from sequential consistency, and in the present context whether it should be allowed to have N>1 different accessors to an object with key K in one map M all refer to private copies because "delete" doesn't want to wait anymore. Unfortunately, deadlock can be the cruel punishment for using one simple approach; you mentioned "performance destruction" as another adverse outcome (I have to admit that I don't know what this is about). Still, the primary goal seems to be to achieve correctness, which would be easier to deal with if one needs to actively release a security handle before "strange" things can start to occur. It's a design challenge...

MADamalakho:But I'm on vacation for a couple of weeks to get married :-) So, I'll be able to continue this work and discussion when I'm back.

Congratulations. Best of luck.

Raf_Schietekat:
I was thinking about the difficulties associated with departing from sequential consistency, and in the present context whether it should be allowed to have N>1 different accessors to an object with key K in one map M all refer to private copies because "delete" doesn't want to wait anymore.

Ok. What problems do you see here? Can you provide an example where this will result in counter-intuitive behavior for the end user?

Raf_Schietekat:
you mentioned "performance destruction" as another adverse outcome (I have to admit that I don't know what this is about).

See the benchmark results in the first post. I call 100x performance degradation 'performance destruction'.

Raf_Schietekat:
Still, the primary goal seems to be to achieve correctness

Nobody is saying the opposite. The problem here is: what is 'correctness'?
At least I can say with confidence that the semantics provided by my hash map are usable too.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

That said, there is much room for improving the performance of TBB's concurrent_hash_map, though I never expect it to be as fast as a single-threaded hash map. That is, unless someone comes up with zero-overhead transactional memory.

The cost of the find operation in my hash map is about 30 cycles, which is roughly equal to that of a sequential hash map. But modifying operations are substantially heavier than in a sequential hash map.

As for transactional memory: it's highly probable that hardware transactional memory will provide performance equal to single-threaded on short read-only transactions. And it seems that we will see HTM much sooner than one might expect. Sun is very close to HTM:
http://blogs.sun.com/dave/resource/transact08-dice.pdf
And AMD is working hard on HTM:
http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

But there are some things to mention.
First of all, there is no guarantee that a resize operation will succeed before the next Ice Age. It can just meet constant contention from short modifying transactions (insert, update, remove).
Second, in order to achieve sequential consistency one has to combine the find operation and the user processing into one transaction. So read-only transactions won't be short and fast.
Oh, and I forgot to mention: it's disallowed to issue any 'external' actions inside transactions. I.e. one can't write to files, can't send/recv on a socket, etc.

I'm not saying that HTM will be useless. It can be very powerful. But in the hands of a thoughtless user, it will be roughly the same as mutex-based programming. Just a different set of caveats. There are no deadlocks, but there are livelocks. One can still saturate the interconnect. One can still totally degrade performance by bloating the read- and write-sets of transactions (remember that you have to include user processing in the hash map's find operation).
So no silver bullet there!

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer: As for transactional memory: it's highly probable that hardware transactional memory will provide performance equal to single-threaded on short read-only transactions. And it seems that we will see HTM much sooner than one might expect. Sun is very close to HTM... And AMD is working hard on HTM... So no silver bullet there!

I agree that initial support for hardware-based transactional memory will be best for short transactions, but nobody is talking about making them read-only. For example, Sun's Rock supports a small number of writes per transaction using the processor's store buffer. It allows much larger read sets by using "transactionally read" bits in the L1 data cache tags. From what I've heard, AMD is planning to support a bounded TM that guarantees support for a small fixed number, say 16, of memory operations (reads or writes).

Transactional memory isn't a silver bullet for all of multicore programming. However, it is going to be immensely helpful in writing the sort of lock-free and low-lock data structures being discussed in this thread. TM has similar benefits (such as avoiding the lock acquire and providing fine-grained synchronization) with fewer tricky issues regarding memory consistency, compiler transformations, or subtle races.

Yet, if I were designing a chip today, I would add support for "speculative locking", not transactional memory. The mechanisms look a lot like transactional memory, but they target the speculative execution of critical sections. For example, a single-lock queue combined with hardware speculative locking would give the concurrency of a two-lock queue. The key advantages of such speculative locking are that (1) the system can always just fall back on acquiring the lock and (2) it can be applied to existing programs to make them faster or scale better.
