Modern programming practices and computer languages (like .NET) tend to dynamically create and destroy objects at run-time, but how does it correlate with multi-core-enabled programming? A parallel program may need to synchronize both lifetime of- and access to- an object in shared memory.
Known methods suffer from either limitation of scalability or additional synchronization overhead. For example, the most popular implementation for erasing an item from a concurrent container is to exclusively lock the item before or after cutting it out from a list. So, no one can see (find) and access it anymore thus it’s safe to destroy it completely. It involves fairly small overhead if there is no contention but otherwise it leads to blocking of a thread destroying the object and so imposes scalability degradation. And in some situations, it can cause a deadlock.
To unlock scalability and to enable multiple references, a reference counter is usually used in combination with mutual exclusion. Each owner thread increments the counter to acquire access. To release the rights or to remove the object, the counter is decremented. The thread which reaches zero destroys the object. This method involves overhead of two (!) additional atomic operations (per turn) modifying the counter.
The new algorithm was developed (but not yet released) to overcome described issues while combining theirs strengths in concurrent_hash_map container distributed within Intel® Threading Building Blocks. But it is general enough to be used widely in similar situations and with any other container or collection of objects that accessed and destroyed concurrently.
Usually, a mutual exclusion and a reference counter are represented by separate synchronization objects because their functions are considered as orthogonal. It means separate synchronization. But used in combination, all status fields can be read at once without explicit synchronization and some joint operations can be optimized to only one atomic RMW operation which performs synchronization only once reducing the overhead.
The algorithm enables this optimization by combining spin lock and reference counter inside one machine word of one synchronization object. It which provides the same functionality as its parts while also enabling extra functionality such as full and recurrent lifetime cycle of a dynamic object with stages of construction and destruction, which are protected by mutual exclusion. Of course, recycling is useful only for user objects that are not deallocated after destruction.
Separate operations that don’t benefit from the combination can still be executed separately by addressing only necessary parts of the synchronization object and thus reducing the synchronization effort.
Another optimization which becomes possible due to one atomic read of all fields is considering the acquired mutex to be a reference to the object without an explicit increment of the reference counter. In other words, the lock operation does not touch the reference counter at all, while still keeping the object marked as “in use”. There are also other tricks improving scalability and fail-safety guarantees that may be interesting to experts.
However, users interested in improving scalability and performance of theirs synchronization may need some new high-level usage model to apply this synchronization algorithm. A prototype of such an interface may be like std::shared_ptr but with some additional semantics to manage also access rights to an object.
If you are interested in this idea, I will probably provide more details in the next blog. And perhaps then, your feedback will help us to finally release the code (which already 2 years waits its turn) as part of TBB.
Update: Thanks for comments! Let me clarify the purpose of this technique. It is a small-sized synchronization primitive that is suitable to individually protect each object from a swarm of similar objects. And so, it is *not* supposed to handle heavy contended case with a lot of threads trying to lock/attach to the same object nor to support thousands of references (yes, 16 bit ref counter is enough, otherwise it is not the primitive you need). Here, under scalability I meant that it allows to unlock a list (bucket) which contains up to several objects while desired object is locked by a busy thread. Thus, many threads accessing many different objects which can share the same list finally meet less contention - it is what reference counter is about. It is probably a subject for separate blog where I can show the use case in pseudo-code.