Scalable hash map algorithm

milom:
I agree that initial support for hardware-based transactional memory will be best for short transactions, but nobody is talking about making them read-only.

I am talking about making them read-only. If nobody else is talking about that... well... what can I do? I can only say that they will be as non-scalable as current mutex-based or lock-free algorithms... well, maybe just a bit simpler to implement.

milom:
For example, Sun's Rock supports a small number of writes per transaction using the processor's store buffer. It allows much larger read sets by using "transactionally read" bits in the L1 data cache tags. From what I've heard, AMD is planning to support a bounded TM that guarantees support for a small fixed number, say 16, of memory operations (reads or writes).

Definitely TM must support writes! I am not saying that TM must not support writes, I am saying that TM will be fast/scalable only for short read-only transactions.

milom:

Transactional memory isn't a silver bullet for all of multicore programming. However, it is going to be immensely helpful in writing the sort of lock-free and low-lock data structures being discussed in this thread. TM has similar benefits (such as avoiding the lock acquire and providing fine-grained synchronization) without the tricky issues regarding memory consistency, compiler transformations, or subtle races.

Yet, if I were designing a chip today, I would add support for "speculative locking", not transactional memory. The mechanisms look a lot like transactional memory, but it targets the speculative execution of critical sections. For example, a single-lock queue combined with hardware speculative locking would give the concurrency of a two-lock queue. The key advantages of such speculative locking are that (1) the system can always just fall back on acquiring the lock and (2) it can be applied to existing programs to make them faster or scale better.

Can you elaborate more on "speculative locking"? What API and semantics will it have?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:Can you elaborate more on "speculative locking"? What API and semantics will it have?

The semantics are basically the same as a lock. Yet, the implementation speculates past the lock, executes the critical section speculatively, and then commits if no conflicting accesses from other processors occurred. To do this, it obtains read permission to the lock variable to act as a "guard", but in the common case the lock variable isn't written, so no coherence traffic is generated. Otherwise, it has much in common with the subsequent transactional memory proposals.

A paper that describes the idea:

http://pages.cs.wisc.edu/~rajwar/papers/tlr_ieeemicro03.pdf
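A minimal sketch of how such lock elision can look in code, assuming RTM-style intrinsics (_xbegin/_xend/_xabort from <immintrin.h>, which only shipped later with Intel TSX) and a hypothetical spin_mutex; this is an illustration of the idea, not code from the paper:

#include <immintrin.h> // _xbegin/_xend/_xabort (requires RTM-capable hardware)
#include <atomic>

struct spin_mutex
{
    std::atomic<int> state {0};
    bool is_locked() const { return state.load(std::memory_order_relaxed) != 0; }
    void lock() { while (state.exchange(1, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { state.store(0, std::memory_order_release); }
};

void speculative_lock(spin_mutex& m)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        // Reading the lock variable puts it into the transaction's read set:
        // a thread that really acquires the lock will abort us.
        if (!m.is_locked())
            return; // execute the critical section speculatively
        _xabort(0xff);
    }
    m.lock(); // abort or unsupported: fall back to actually acquiring the lock
}

void speculative_unlock(spin_mutex& m)
{
    if (!m.is_locked())
        _xend(); // commit the speculative critical section
    else
        m.unlock(); // we took the fallback path
}

Note that on the speculative path the lock variable is only read, never written, so concurrent speculating threads share the cache line and generate no coherence traffic until a real writer shows up.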

Dmitriy wrote: "Ok. What problems do you see here? Can you provide example where this will result in counter-intuitive behavior for end user?" I don't have a specific example, really, you'll have to ask my friend Murphy to come up with one (I'm sure he will, but probably not right now either).

"See benchmark results in first post. I call 100x performance degradation 'performance destruction'." Of course... Still, if manipulating the data structure is incidental (and Amdahl rules out significant benefits/penalties) but correctness is crucial, it may not be the worst thing to locally destroy performance and globally preserve correctness. Dare I say the words "premature optimisation"? Of course, where needed, a faster version had better be available, too.

"Correctness" is about being confident that you can use this stuff and not lose a million dollars here and there if your programmers didn't all "have a PhD". Or about making all software multicore and only have to think really hard where it really matters for performance.

But I don't know how it will turn out in the real world, I was just pointing out that different people have different priorities, and not everybody is equally intimately familiar with the subject matter. Can we leave it at that? I would love to be mistaken, so that puts me in a lose-lose position here...

Raf_Schietekat:
Dmitriy wrote: "Ok. What problems do you see here? Can you provide example where this will result in counter-intuitive behavior for end user?" I don't have a specific example, really, you'll have to ask my friend Murphy to come up with one (I'm sure he will, but probably not right now either).

"See benchmark results in first post. I call 100x performance degradation 'performance destruction'." Of course... Still, if manipulating the data structure is incidental (and Amdahl rules out significant benefits/penalties) but correctness is crucial, it may not be the worst thing to locally destroy performance and globally preserve correctness. Dare I say the words "premature optimisation"? Of course, where needed, a faster version had better be available, too.

Well, for the time being I don't see how my map is not correct. For now I am considering it as just faster.
Yes, in a perfect world threads access shared containers really rarely. But the performance of shared containers can still affect real-world programs. Here I posted some of my thoughts on this:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/...
And note that there is not only a single shared hash map in the program; there are also memory allocation, message queueing, object lifetime management (reference counting), logging, global application settings, database caches, statistics and so on and on. And all those things are shared, so the actual frequency of accesses to shared objects can be much higher than one thinks.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
Raf_Schietekat:
"Correctness" is about being confident that you can use this stuff and not lose a million dollars here and there if your programmers didn't all "have a PhD". Or about making all software multicore and only have to think really hard where it really matters for performance.

But I don't know how it will turn out in the real world, I was just pointing out that different people have different priorities, and not everybody is equally intimately familiar with the subject matter. Can we leave it at that? I would love to be mistaken, so that puts me in a lose-lose position here...

It's not fair. You are implicitly asserting that my hash map is not correct, and you take this as an axiom. First show some evidence.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Sorry, I never meant to imply that the map itself was not correct, only that it might be more difficult not to make mistakes dealing with some less intuitive properties. That's why it may make sense to have a choice between a forgiving implementation that happens to be rather slow, and also one that requires more attention but with big rewards where it matters. The first hurdle is getting multithreaded applications to run correctly, the second hurdle is to make them as fast as they can be but without breaking them. Correct me if I'm mistaken, but I had the impression that the first hurdle was still a significant one, even if it is nice to believe that all "future" programmers will just absorb all this as a natural part of their activity and merely wonder what the fuss was all about.

Raf_Schietekat:Sorry, I never meant to imply that the map itself was not correct, only that it might be more difficult not to make mistakes dealing with some less intuitive properties. That's why it may make sense to have a choice between a forgiving implementation that happens to be rather slow, and also one that requires more attention but with big rewards where it matters. The first hurdle is getting multithreaded applications to run correctly, the second hurdle is to make them as fast as they can be but without breaking them. Correct me if I'm mistaken, but I had the impression that the first hurdle was still a significant one, even if it is nice to believe that all "future" programmers will just absorb all this as a natural part of their activity and merely wonder what the fuss was all about.

I can't conclude whether it's a significant one or not. I'm a bit biased here. This is why I'm asking for examples.
I can only say that currently I don't see where deferred deletion can cause problems.
My hash map provides 2 basic properties: (1) sequential self-consistency, (2) cause-and-effect relation. My personal opinion is that these properties are enough to write correct programs.

Consider an example:

void some_thread()
{
    const_accessor a;
    if (hash_map.find(key, a))
    {
        // work with the item here; we really aren't concerned with when
        // exactly the item will be removed from the map and when deleted
    }
}

What problems can there be here if the item is removed from the collection while this thread is still processing it?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I do not see a problem with Dmitriy's semantics for removing a key. Note that in Java, clients of ConcurrentHashMap can operate freely on a value even after it is removed from the map, so the practice has a major precedent.

MADadrobiso:

I do not see a problem with Dmitriy's semantics for removing a key. Note that in Java, clients of ConcurrentHashMap can operate freely on a value even after it is removed from the map, so the practice has a major precedent.

Great point!
I believe that .NET's hash map provides the same semantics. And I think that most Java/.NET programmers don't even meditate on this point, i.e. they may not be aware of the semantics; they just write their programs, and the programs just work out-of-the-box.

Also note that my hash map can be easily extended to also support the current TBB hash map semantics. The reader just has to lock the bucket. It will significantly degrade performance, but if the user needs exactly such semantics, then he can get them. And if the user is satisfied with just access to the item, then he can use read access as it's currently implemented in my hash map.
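A hypothetical interface sketch of the two read modes just described (the names and signatures are illustrative, not the actual implementation):

template<typename K, typename V>
class hash_map
{
public:
    class const_accessor; // grants read access to a found item

    // Purely read-only find: no shared memory is written; the item may be
    // removed from the map while the accessor still references it.
    bool find(const K& key, const_accessor& a);

    // TBB-style find: read-locks the bucket, so the item cannot be removed
    // while the accessor is alive - stronger semantics, worse scalability.
    bool find_locked(const K& key, const_accessor& a);
};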

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Two against one, that's so like a TBB scheduler (unfair), so I'll yield (it's not a perfect comparison...). Well, not without dismissing Java as a precedent: it does not have synchronous destructor handling, so emulating C++ accessors would involve requiring close() to be called a lot, and it also doesn't store values, only pointers to garbage-collected objects.

(Added clarification and smiley.)

Raf_Schietekat:
it does not have synchronous destructor handling, so emulating C++ accessors would involve requiring close() to be called a lot

Java/C# don't need accessors on the read side at all. GC will do all the work.

Raf_Schietekat:
, and it also doesn't store values, only pointers to garbage-collected objects.

It's easy to specialize a C++ template class only for pointers. So for pointers there can be one implementation, and for "fat embedded" objects another.
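A minimal sketch of that specialization approach (names illustrative):

// Generic version: stores "fat" V objects embedded in the table, so readers
// may need to lock or copy to obtain a consistent value.
template<typename K, typename V>
class concurrent_map
{
    // ... embedded-value implementation ...
};

// Partial specialization for pointer values: only the pointer is stored, a
// reader loads it atomically, and reclamation of the pointee is deferred
// (GC in Java/C#, PDR or similar in C++).
template<typename K, typename V>
class concurrent_map<K, V*>
{
    // ... pointer-based implementation ...
};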

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"Java/C# don't need accessors on read side at all. GC will do all the work." Exactly, but that's also why nobody would expect any synchronous locking semantics. If all the object provides is, e.g., a snapshot, then that's what it should be called, so that the user knows what he's dealing with; changing the semantics of an "accessor" is just not cool. It's like training somebody on Microsoft Windows, and then letting him use Unix without first telling him all about unlink(): this will almost never cause a problem... until it does.

Raf_Schietekat:
"Java/C# don't need accessors on read side at all. GC will do all the work." Exactly, but that's also why nobody would expect any synchronous locking semantics.

I don't see the connection between those things.

Raf_Schietekat:
If all the object provides is, e.g., a snapshot, then that's what it should be called, so that the user knows what he's dealing with; changing the semantics of an "accessor" is just not cool.

This is the question I was asking before in this topic: what are the *official* semantics of TBB's accessors now? Do they have to provide more than just access to the object? Well, at least their name suggests that they must provide access to the object, just like a smart pointer or a Java reference.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"I don't see connection between those things."

An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.

Raf_Schietekat:"I don't see connection between those things."

An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.

Ok. I see your point.
But if there is some object on the stack, it does not necessarily follow that the item in the hash map is locked. It only means that there are possibly some 'scoped semantics' for something, and maybe not.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I've attached an updated implementation of the hash map. A bunch of errors is fixed; now it at least looks like a working implementation (a heavy test works for several minutes w/o crashes, hangs, asserts, leaks, etc on dual core and quad core).
Also I've added support for satellite data. The archive contains an example with strings used as keys and values. The lifetime of satellite data is managed with the PDR (Partial-copy-on-write Deferred Reclamation) system, so that the find operation remains purely read-only (no shared memory is modified).
I will resubmit it as a contribution.

Attachment: hash_map.zip (11.16 KB)
All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Anton Malakhov (Intel)

The proposal is to separate out the accessors from the container and make them flexible enough to enable several ways of data accessing and lifetime management:

  • "classical" RW protected access
  • direct access for atomic or const values.




Direct access for atomics and consts would be great. I spent the whole day yesterday trying to figure out how to do this with the current tools, to no avail.

As for garbage collection, it seems like too many features to me, since C++ isn't a GC language anyway. If GC was handled through policies, so I'm not forced to pay for it unless I need it, I guess it'd be nice for newcomers.

Quoting - robert.jay.gould


Direct access for atomics and consts would be great. I spent the whole day yesterday trying to figure out how to do this with the current tools, to no avail.

Thanks for expressing the interest. We are working on the new interface specification of concurrent_unordered_map. But it turned out to be more difficult and longer work than I expected. So, it won't come out soon, unfortunately.

Quoting - robert.jay.gould


As for garbage collection, it seems like too many features to me, since C++ isn't a GC language anyway. If GC was handled through policies, so I'm not forced to pay for it unless I need it, I guess it'd be nice for newcomers.

Sure, the principle "pay as you go" should be kept.

Quoting - Anton Malakhov (Intel)
Quoting - robert.jay.gould


As for garbage collection, it seems like too many features to me, since C++ isn't a GC language anyway. If GC was handled through policies, so I'm not forced to pay for it unless I need it, I guess it'd be nice for newcomers.

Sure, the principle "pay as you go" should be kept.

Both variants require payment. Deferred garbage collection increases memory footprint and introduces latency. Prompt reclamation inevitably degrades performance and scalability.
So one will pay whether he goes or marks time :)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Both variants require payment. Deferred garbage collection increases memory footprint and introduces latency. Prompt reclamation inevitably degrades performance and scalability.
So one will pay whether he goes or marks time :)

True but I'd rather pay cash than credit.

Quoting - robert.jay.gould
True but I'd rather pay cash than credit.

Btw, there are deferred reclamation schemes (I think it's a better name than GC in this context) which do provide guarantees regarding memory consumption (bounded memory consumption). Safe Memory Reclamation (SMR) is the best example; it guarantees that there will be no more than N*NUMBER_OF_THREADS (N is very small, probably 1) deferred objects. A proxy-collector based on differential reference counting (like the one I used in the hash map) can be modified to allow no more than N deferred objects. Such a solution provides a very good compromise between scalability and memory pressure.
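A minimal sketch of the SMR (hazard pointer) idea, with one hazard pointer per thread (N = 1); hypothetical code using C++11 atomics for brevity:

#include <atomic>

const int MAX_THREADS = 64;

// Each thread publishes the pointer it is about to dereference:
std::atomic<void*> hazard[MAX_THREADS];

// Reader side: protect a node before using it.
void* protect(int tid, std::atomic<void*>& src)
{
    void* p;
    do
    {
        p = src.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst);
        // re-check: the node may have been retired between load and publish
    }
    while (p != src.load(std::memory_order_acquire));
    return p;
}

// Reclaimer side: a retired node may be freed only when no thread has it
// published, which bounds the number of deferred nodes by the thread count.
bool can_reclaim(void* node)
{
    for (int i = 0; i != MAX_THREADS; ++i)
        if (hazard[i].load(std::memory_order_seq_cst) == node)
            return false;
    return true;
}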

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Btw, there are deferred reclamation schemes (I think it's a better name than GC in this context) which do provide guarantees regarding memory consumption (bounded memory consumption). Safe Memory Reclamation (SMR) is the best example; it guarantees that there will be no more than N*NUMBER_OF_THREADS (N is very small, probably 1) deferred objects. A proxy-collector based on differential reference counting (like the one I used in the hash map) can be modified to allow no more than N deferred objects. Such a solution provides a very good compromise between scalability and memory pressure.

I guess what I mean is I like concurrent_vector's compact() idea. I can use the vector in a tight loop, let it grow as needed, and then compact it once I'm finished, so I have control of the exact stage at which reclamation can occur.

If a container would do some automatic reclamation, I'd hope it had a function that allowed me to turn reclamation on/off. And in the automatic reclamation case, I'd be happy even if it's "on" by default, doing its job as it deems appropriate, as long as it gives me the option to take control when I need to.

Quoting - Dmitriy Vyukov
I would like to hear the opinion of Arch Robison on this.
In TBB roadmap:
http://softwareblogs.intel.com/2007/10/12/threading-building-blocks-prod...
there is a point:
"Container class additions and improvements
The TBB development team is working on new container classes, and interface improvements and enhanced performance for existing classes."

There are synchronization algorithms which are up to 100x faster on modern "moderate multi-core" than traditional lock-based ones, and they have near-linear scaling, so they are more appropriate for the "many-core future". In my opinion it's the way to bridle many-core. What do you think?

How do you measure CPU cycles so precisely? On my Linux/XEON the resolution is 4000...

Quoting - softarts

How do you measure CPU cycles so precisely? On my Linux/XEON the resolution is 4000...

Resolution of what?

I used RDTSC. I measure the time of 10^6 operations, then divide by 10^6.
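A minimal sketch of that kind of measurement, assuming x86 and the __rdtsc() intrinsic (<x86intrin.h> with GCC, <intrin.h> with MSVC); operation_under_test is a placeholder:

#include <x86intrin.h> // __rdtsc()
#include <cstdio>

typedef unsigned long long u64;

void operation_under_test() { /* e.g. one hash map find */ }

int main()
{
    const u64 iters = 1000000; // 10^6 operations
    u64 start = __rdtsc();
    for (u64 i = 0; i != iters; ++i)
        operation_under_test();
    u64 end = __rdtsc();
    std::printf("%llu cycles/op\n", (end - start) / iters);
}

Averaging over 10^6 iterations hides the coarse per-read overhead of the timestamp counter, which is why the effective resolution is much finer than a single timer read.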

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Resolution of what?

I used RDTSC. I measure the time of 10^6 operations, then divide by 10^6.

Could you give some advice about "Memory Pool" implementation? I saw some implementations based on a freelist & lock protection.

But locks such as spin_mutex and the other "CAS"-based implementations are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm like your concurrent_hash_map?
Mitigate the lock usage, load-balance the freelist write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.

I am studying your code & the TBB concurrent data structures and trying to find some help.

Quoting - softarts

Could you give some advice about "Memory Pool" implementation? I saw some implementations based on a freelist & lock protection.

But locks such as spin_mutex and the other "CAS"-based implementations are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm like your concurrent_hash_map?
Mitigate the lock usage, load-balance the freelist write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.

I am studying your code & the TBB concurrent data structures and trying to find some help.

I tend to prefer headerless slab allocators with per-thread caches, along the lines of the one you can find in TBB (scalable_malloc). The basic algorithm is as follows. Each thread has a bunch of caches for various block sizes (usually powers of 2). Allocation requests are served from those caches and are very cheap:

void* alloc(size_t sz)
{
    // pick this thread's cache for the nearest size class (e.g. power of 2)
    cache* c = thread_caches[get_nearest_cache_idx(sz)];
    return c->pop();
}

During deallocation, the thread checks whether the block came from its own cache or a remote one and acts accordingly:

void free(void* p)
{
    // headerless: the owning cache is derived from the block's address
    cache* c = get_cache_from_block(p);
    if (is_own(c))
    {
        c->push(p); // fast path: plain, non-atomic push
    }
    else
    {
        c->remote_push(p); // remote free: atomic push onto the owner's remote list
    }
}

Here is a sketch of the cache implementation:

struct cache
{
    void* local_head;  // private freelist, touched only by the owner thread
    void* remote_head; // blocks freed by other threads, accessed atomically

    void* pop()
    {
retry:
        if (local_head)
        {
            // fast path: non-atomic pop from the private freelist
            void* p = local_head;
            local_head = *(void**)p;
            return p;
        }
        else if (remote_head)
        {
            // grab the whole remote list with a single atomic exchange
            local_head = ATOMIC_XCHG(&remote_head, 0);
            goto retry;
        }
        else
        {
            local_head = alloc_more_blocks_from_system();
            goto retry;
        }
    }

    void push(void* p)
    {
        // non-atomic push: only the owner thread touches local_head
        *(void**)p = local_head;
        local_head = p;
    }

    void remote_push(void* p)
    {
        // lock-free push onto the remote list; ATOMIC_CAS is assumed to
        // return true on success and to reload cmp on failure (as C++11
        // compare_exchange does), hence the negation in the loop condition
        void* cmp = remote_head;
        do
        {
            *(void**)p = cmp;
        }
        while (!ATOMIC_CAS(&remote_head, &cmp, p));
    }
};

As you can see, the algorithm is naturally distributed and prefers locality (if a thread allocates and frees a block, the operations are atomic-free and do not touch any remote memory).

There are a lot of improvements on the basic algorithm: batching of remote frees; privatization of freed blocks -> this makes both alloc and free completely atomic-free; double-buffering of remote blocks; etc.
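For illustration, a rough sketch of the first improvement (batching of remote frees), reusing the cache layout and ATOMIC_CAS convention from the sketch above; the remote_batch helper is hypothetical:

// Blocks destined for a remote cache are first collected in a private list,
// then spliced onto the owner's remote list with a single CAS per batch.
struct remote_batch
{
    void* head;
    void* tail;
    size_t count;

    remote_batch() : head(0), tail(0), count(0) {}

    void add(void* p)
    {
        *(void**)p = head;
        if (head == 0)
            tail = p;
        head = p;
        ++count;
    }
};

void flush(cache* c, remote_batch& b)
{
    if (b.count == 0)
        return;
    void* cmp = c->remote_head;
    do
    {
        *(void**)b.tail = cmp; // link the old list behind the batch tail
    }
    while (!ATOMIC_CAS(&c->remote_head, &cmp, b.head));
    b = remote_batch(); // one CAS amortized over 'count' frees
}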

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Hi,

Does anyone know the performance status of Intel TBB 2.2 Update 3's concurrent_hash_map? I've noticed that in the What's New for Intel TBB 2.2 (released after Dmitriy's original post) it mentions "Improved performance of the concurrent hash map".

Thanks!

It is implemented using a different algorithm than TBB <= 2.1, which eliminates one locking operation on the hot path. But it is not the algorithm Dmitriy suggested. Dmitriy's algorithm doesn't fit into the interface of concurrent_hash_map.

Awesome!
Is this update Production Ready?
Does it resolve the deadlock issues raised by Arch Robison?
Does it compile and run as 64 bit (MSVC/x64)?

BTW, the zip files you have posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.

>Is this update Production Ready?

No

>Does it resolve the deadlock issues raised by Arch Robison?

This discussion is almost 2 years old and contains 8 pages; which issue exactly do you mean?

> Does it compile and run as 64 bit (MSVC/x64)?

Don't remember. Most likely No.

> BTW, the zip files you have posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.

Yeah, I know. Do you know some appropriate and reliable place for them?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Thanks. Here is the deadlock concern I was talking about.

I have some concerns about deadlock, because methods like find_and_lock return while holding a lock on a granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say in lexicographic order), it risks deadlock if the keys are in the same cell. Also, method grow appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?

Ah, that. Nothing has changed since that time.
The deadlock scenario outlined is impossible provided correct usage of the hash map, which is to not lock more than one item at a time. The idea was just to split the update method into two parts to go without lambdas, but logically it's a single operation which must not be "preempted" by other operations from the same thread.
And there are ways to deal with potential deadlocks, as outlined in post #5.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
