transforming single-threaded allocators...

Chris M. Thomasson

My name is Chris M. Thomasson and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scalable multi-threaded allocator. This invention allows a programmer to just create an allocator algorithm without worrying about any multi-threading issues. Once you're done with your single-threaded allocator, well, you apply my non-blocking algorithm, and PRESTO! It's now a scalable multi-threaded allocator! Is that cool or what?

Okay, here is an initial proof of concept: I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it in a multi-threaded application as a general allocator; until now... The algorithm I created is fairly simple; basically, it goes like this (a rough sketch in C follows the outline):
___________________________________________________________________________
A thread-local user allocator is created in every thread.

Each alloc request is forwarded directly to the calling thread's thread-local user allocator.

If a free request comes from the thread that allocated the block, the request is forwarded to that thread-local user allocator.

If a free request comes from another thread, the block is accumulated on a per-thread, stack-based freelist.

Blocks on this freelist are actually freed in batches when the owning thread allocates/deallocates another block.
___________________________________________________________________________
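Below is a rough C sketch of that transform (an illustration only: the identifiers, the per-thread Windows Heap setup, and the omission of reference counting and thread-exit handling are all simplifications of mine, not the actual vzmalloc source):

#include <windows.h>

struct mem_tls;

struct mem_block {
    struct mem_block* next;    /* link for the remote free stack */
    struct mem_tls* owner;     /* thread that allocated this block */
};

struct mem_tls {
    HANDLE heap;                        /* per-thread HEAP_NO_SERIALIZE heap */
    struct mem_block* volatile rfree;   /* blocks freed by other threads */
};

static DWORD g_tls_idx;   /* assume TlsAlloc() was called once at startup */

static struct mem_tls* mem_get_tls(void) {
    struct mem_tls* tls = (struct mem_tls*)TlsGetValue(g_tls_idx);
    if (! tls) {
        tls = (struct mem_tls*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeof(*tls));
        tls->heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
        TlsSetValue(g_tls_idx, tls);
    }
    return tls;
}

/* drain, in one batch, the blocks that other threads pushed onto our stack */
static void mem_reclaim(struct mem_tls* tls) {
    struct mem_block* blk = (struct mem_block*)
        InterlockedExchangePointer((PVOID volatile*)&tls->rfree, NULL);
    while (blk) {
        struct mem_block* next = blk->next;
        HeapFree(tls->heap, HEAP_NO_SERIALIZE, blk);
        blk = next;
    }
}

void* my_malloc(size_t sz) {
    struct mem_tls* tls = mem_get_tls();
    struct mem_block* blk;
    mem_reclaim(tls);   /* remote frees are reclaimed in batches here */
    blk = (struct mem_block*)HeapAlloc(tls->heap, HEAP_NO_SERIALIZE, sz + sizeof(*blk));
    if (! blk) return NULL;
    blk->owner = tls;
    return blk + 1;
}

void my_free(void* mem) {
    struct mem_block* blk;
    struct mem_tls* tls;
    if (! mem) return;
    blk = ((struct mem_block*)mem) - 1;
    tls = mem_get_tls();
    if (blk->owner == tls) {
        /* local free: forward straight to the single-threaded heap */
        HeapFree(tls->heap, HEAP_NO_SERIALIZE, blk);
    } else {
        /* remote free: lock-free push onto the owning thread's freelist */
        struct mem_block* head;
        do {
            head = blk->owner->rfree;
            blk->next = head;
        } while (InterlockedCompareExchangePointer(
                     (PVOID volatile*)&blk->owner->rfree, blk, head) != head);
    }
}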

I mentioned this algorithm a while back:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

Well, I have released a pre-alpha version of it which demonstrates how to transform the Windows Heap object under GPL license; the example source-code can be found here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Can you please run the test as-is, which uses stock malloc impl, and record the output... Then uncomment the following line:

#define USE_MEM_ALLOC

which engages my algorithm; run the test again; and post the differences in timings between the two. I think you will be fairly surprised at the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability when used in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution and clever support algorithm I invented allows for the Windows Heap to scale quite nicely indeed.
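Presumably the toggle works along these lines (an assumption about how such a test harness is typically wired up, not the actual vzmalloc_win test source):

/* #define USE_MEM_ALLOC */

#if defined(USE_MEM_ALLOC)
#  define test_malloc my_malloc   /* per-thread, unserialized heaps */
#  define test_free   my_free
#else
#  define test_malloc malloc      /* stock CRT malloc -> single locked process heap */
#  define test_free   free
#endif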

I am going to tweak the algorithm such that one could plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows thread A to allocate a block and thread B to free it, even though they use completely different allocation algorithms. That's pretty darn cool. Some may think it's even impossible; it's not!

BTW, I am thinking that this algorithm might be usable for transforming other single-threaded things besides allocators... Humm, need to think some more on that.

P.S.

The test portion of the code uses the pthread-win32 library which can be downloaded from here:

http://sourceware.org/pthreads-win32

I am thinking of removing the dependence on it and just using pure Windows primitives.

Any thoughts?

Chris M. Thomasson
I posted the WRONG VERSION! I did not include the reclaim threshold experimental feature I am playing around with. Also, there was a bug in the function `mem_thread_sys_reclaim()'. Here was the original code:

LONG
mem_thread_sys_reclaim(
    union mem_block* const reset
) {
    LONG count = 0;
    union mem_block* mblk = XCHG(&g_mem_tls->rfree, reset);
    while (mblk) {
        union mem_block* const next = mblk->next;
        if (! HeapFree(g_mem_tls->heap, HEAP_NO_SERIALIZE, mblk)) {
            assert(0); abort();
        }
        ++count;
        mblk = next;
    }
    if (! reset) {
        g_mem_tls->refs += count;
    }
    return count;
}

Can you spot the bug? Well, IMVHO, it's kind of easy... See, the reclaim function is supposed to free the blocks that have been deferred by remote threads, and each free is supposed to decrement the owner thread's reference count... I was incrementing it! OUCH! ;^(... Anyway, it's fixed on the web-site. I needed to change this code:

if (! reset) {
    g_mem_tls->refs += count;
}

to:

if (! reset) {
    g_mem_tls->refs -= count;
}

So, you should re-download the source-code! Sorry for any trouble.

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Anyway, I am not sure how the reclaim threshold feature is going to affect performance. Humm... Need to think about this.
Chris M. Thomasson
BTW, in case anybody is interested, here is the first time I made the base non-blocking algorithm itself public: http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69
Chris M. Thomasson
I improved the test application; the web-page is updated. See, before, it was actually calling the `my_malloc/free' functions under the protection of the test mutex. Well, that's no good. So, I moved them outside of the critical-region. Now, it would be an interesting experiment to run both versions of the test: one with the allocation functions in the critical-section, and one without. On my end, my algorithm beats normal malloc/free in either case, but really whips them when they are not protected by the mutex. Can you please post some numbers? I am curious to see how horrible stock malloc/free actually are! ;^D
Dmitry Vyukov

Intel TBB's scalable malloc uses the same technique.

The interesting things about TBB's scalable malloc:

- for large objects (currently bigger than 8064 bytes) global allocator is used (plain malloc, or mmap - chosen by define)

- there is also global cache of superfluous memory superblocks, i.e. if thread has too many superblocks it moves superblocks to global cache

- there is no caching, i.e. if thread constantly frees remote memory it will constantly execute atomic ops (very bad in some scenarios - producer/consumer)

- inter-thread queue organized as lock-free stack (IBM freelist), or as mutex-protected stack

- thread grabs lock-free stack with CAS, not with XCHG
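For the last point, the difference looks roughly like this (illustrative only, using the XCHG/CAS shorthand from this thread for atomic swap and compare-and-swap, with the usual convention that CAS returns nonzero on success):

typedef struct block_t { struct block_t* next; } block_t;

/* detach the whole remote freelist in one shot - cannot fail */
block_t* grab_with_xchg(block_t* volatile* head) {
    return (block_t*)XCHG((void* volatile*)head, NULL);
}

/* grab it with a CAS loop - may retry under contention */
block_t* grab_with_cas(block_t* volatile* head) {
    block_t* old;
    do {
        old = *head;
    } while (old && ! CAS((void* volatile*)head, old, NULL));
    return old;
}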

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - Dmitriy V'jukov

Intel TBB's scalable malloc uses the same technique.

The interesting things about TBB's scalable malloc:

- for large objects (currently bigger than 8064 bytes) global allocator is used (plain malloc, or mmap - chosen by define)

- there is also global cache of superfluous memory superblocks, i.e. if thread has too many superblocks it moves superblocks to global cache

- there is no caching, i.e. if thread constantly frees remote memory it will constantly execute atomic ops (very bad in some scenarios - producer/consumer)

- inter-thread queue organized as lock-free stack (IBM freelist), or as mutex-protected stack

- thread grabs lock-free stack with CAS, not with XCHG

I think that TBB's approach has some advantages over automatically turning purely single-threaded allocator into multi-threaded. Consider points 1 and 2.

Though it's beyond question that your automatic approach is faaaaaaaaar better than just protecting single-threaded allocator with single mutex (or with fine-grained mutexes) - the thing we currently see in many allocators.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Quoting - Dmitriy V'jukov

I think that TBB's approach has some advantages over automatically turning purely single-threaded allocator into multi-threaded. Consider points 1 and 2.

Though it's beyond question that your automatic approach is faaaaaaaaar better than just protecting single-threaded allocator with single mutex (or with fine-grained mutexes) - the thing we currently see in many allocators.

Here are some of my current thoughts:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1689f58f066a5995

It seems like Intel's per-thread allocator instances are probably a heck of a lot more optimized than the Windows Heap API w/ the `HEAP_NO_SERIALIZE' flag. I also like how they move super-blocks out of threads and defer to a global allocator on large allocations. However, I do find it a little strange that they don't use XCHG for reclaiming remote deallocations.

As for my "generic" algorithm, well, with Windows Heaps I simply cannot get it to beat my region allocator:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

Also, I can't get it to beat the allocator in the current vZOOM commercial package. I guess it's good in a sense because it can definitely transform basically any single-threaded allocation technique into a distributed thread-safe entity. IMVHO, it's a good algorithm to study...

Perhaps I am going to try some optimizations... such as limiting the size of per-thread heaps by utilizing the max_size parameter in the HeapCreate function, or messing around with a caching algorithm on remote frees. When `HeapAlloc()' fails, it defers to a global low-frag heap. However, I don't know how much this is going to buy me because it seems like calling into `HeapAlloc()/Free()' is an expensive operation no matter whether `HEAP_NO_SERIALIZE' is set or not.

Although, the "generic" nature of the current algorithm cannot take advantage of header-less blocks. This has major overhead on a per-allocation basis. Since the algorithm does not know anything about the underlying per-thread allocators, well, it basically has to have a nasty header. I am not sure how to overcome this without forcing a deeper interaction between the algorithm and the allocators it's managing. BTW, do you know off hand if the TBB allocator uses header-less blocks?
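A minimal sketch of the capped-per-thread-heap-with-global-fallback idea mentioned above (assuming a per-thread heap created with a non-zero dwMaximumSize; the helper names and the single fallback heap are illustrative, not from vzmalloc):

#include <windows.h>

static HANDLE g_fallback_heap;   /* created once at startup, left serialized */

/* Allocate from a size-capped, unserialized per-thread heap; if that
   heap is exhausted, defer to the shared fallback heap. */
void* alloc_with_fallback(HANDLE thread_heap, size_t sz)
{
    void* mem = HeapAlloc(thread_heap, HEAP_NO_SERIALIZE, sz);
    if (! mem) {
        mem = HeapAlloc(g_fallback_heap, 0, sz);   /* serialized slow path */
    }
    return mem;
}

A real implementation would also have to record, per block, which heap the memory came from so that the matching free can route it correctly.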
Dmitry Vyukov
Quoting - lockfree Although, the "generic" nature of the current algorithm cannot take advantage of header-less blocks. This has major overhead on a per-allocation basis. Since the algorithm does not know anything about the underlying per-thread allocators, well, it basically has to have a nasty header. I am not sure how to overcome this without forcing a deeper interaction between the algorithm and the allocators its managing.

Indeed. Interaction yields fruits.

Quoting - lockfree BTW, do you know off hand if TBB allocator uses header-less blocks?

extern "C" void scalable_free (void *object) {
Block *block;
ThreadId myTid;
FreeObject *objectToFree;

if (!object) {
return;
}

if (isLargeObject(object)) {
freeLargeObject(object);
return;
}

objectToFree = (FreeObject *)object;

myTid = getThreadId();

block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */
MALLOC_ASSERT (block->allocatedCount, ASSERT_TEXT);

if (myTid == block->owner) {

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Quoting - Dmitriy V'jukov

Indeed. Interaction yields fruits.

Quoting - lockfree BTW, do you know off hand if TBB allocator uses header-less blocks?

extern "C" void scalable_free (void *object) {
Block *block;
ThreadId myTid;
FreeObject *objectToFree;

if (!object) {
return;
}

if (isLargeObject(object)) {
freeLargeObject(object);
return;
}

objectToFree = (FreeObject *)object;

myTid = getThreadId();

block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */
MALLOC_ASSERT (block->allocatedCount, ASSERT_TEXT);

if (myTid == block->owner) {

That will definitely do it. I use the same technique in vZOOM and the region allocator. It's basically the only way to accomplish it efficiently. I take it that their `Block' object is a slab of fixed sized objects, and that they don't have support for `realloc()', right? I mean they could do it for sure, if they used copying... As for remote deallocations, well, I am thinking of a way to get it done, but it's going to be a little tricky. This is because blocks from a slab could be scattered across multiple threads instead of always being directed to the owner. Interesting problem indeed! ;^D
Dmitry Vyukov
Quoting - lockfree I take it that their `Block' object is a slab of fixed sized objects, and that don't have support for `realloc()' right? I mean they could do it for sure, if they used copying...

Yes, Block is fixed-size slab allocator.

If the size requested via realloc is smaller than the current size, then there is no copying. This is true both for small fixed-size objects and for big objects.

If the size requested via realloc is bigger than the current size, then, yes, a new block is allocated and the memory is copied.
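In other words, the behavior described above amounts to something like this (a sketch only, not TBB's actual code; `object_size_of()' is a hypothetical size query, and `my_malloc'/`my_free' stand in for the allocator's own entry points):

#include <string.h>

void* my_realloc(void* mem, size_t new_sz) {
    size_t old_sz;
    void* fresh;
    if (! mem) return my_malloc(new_sz);
    old_sz = object_size_of(mem);        /* hypothetical: usable size of the block */
    if (new_sz <= old_sz) return mem;    /* smaller request: reuse in place */
    fresh = my_malloc(new_sz);           /* bigger: allocate, copy, free the old block */
    if (fresh) {
        memcpy(fresh, mem, old_sz);
        my_free(mem);
    }
    return fresh;
}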

Quoting - lockfree As for remote deallocations, well, I am thinking of a way to get it done, but its going to be a little tricky. This is because a blocks from a slab could be scattered across multiple threads instead of always being directed to the owner. Interesting problem indeed!

I don't get you. What do you mean by "blocks from a slab could be scattered across multiple threads instead of always being directed to the owner"?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov

Another interesting moment about TBB's scalable malloc - how it determines large objects. This is exactly what we discussed several days ago:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

/* Does this object start on a 16K boundary + the size of a large object header? */
static inline unsigned int isLargeObject(void *object)
{
return ((uintptr_t)object & (blockSize - 1)) == sizeof(LargeObjectHeader);
}

Hmmm... I am thinking how this can be improved...

For example, we have 64kb superblocks in the slab-allocator. We could refrain from using objects starting at 0kb, 4kb, 8kb, 12kb, ..., 62 kb in the slab-allocator, and place big objects exactly at those offsets. So we will have a bit more freedom as to where we can place big objects.
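A hypothetical variant of the check along those lines (the constants and the reserved-offset rule are assumptions for illustration, not TBB code):

#include <stdint.h>

#define SUPERBLOCK_SIZE (64 * 1024)
#define LARGE_ALIGN     (4 * 1024)

/* The slab allocator agrees never to hand out an object that starts
   exactly on a 4kb sub-boundary; large objects are placed only on such
   boundaries, so the offset alone identifies them. */
static inline unsigned int isLargeObject2(void* object)
{
    return ((uintptr_t)object & (LARGE_ALIGN - 1)) == 0;
}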

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
I forgot to clarify. I was referring to the moment when `realloc()' is invoked from a thread that does not own the slab in which the block resides. AFAICT, this would require creating a new block in the calling thread's slab, copying the original block, then pushing the original block onto its remote owner thread. As for my comment on blocks being scattered, well, I again forgot to clarify another important moment. I was thinking of ways to cache blocks in remote threads. Well, when the owner thread dies, there could be multiple threads which have blocks from its slabs cached. Humm... Am I making sense now? Anyway, sorry about that! :^/
jimdempseyatthecove

>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would require sizeof(Block) to be a power of 2 (not much of a problem) and that the first few bytes of the object reside within the Block. This could introduce problems when (if) aligned allocations are supported, and then would require the size of Block to be at least as large as the largest aligned allocation supported. Then consider that the alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either be the Block* itself, or a byte count from the object pointer to the base of the Block.

Jim Dempsey

www.quickthreadprogramming.com
Dmitry Vyukov
Quoting - jimdempseyatthecove

>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would require sizeof(Block) to be a power of 2 (not much of a problem) and that the first few bytes of the object reside within the Block. This could introduce problems when (if) aligned allocations are supported, and then would require the size of Block to be at least as large as the largest aligned allocation supported. Then consider that the alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either be the Block* itself, or a byte count from the object pointer to the base of the Block.

Allocator ALWAYS knows alignment requirements of the platform. I'm not sure what you mean by "alignment requirements of the application".

What you are proposing is exactly the thing which developers (and Chris) tried to avoid. Header-prefixed allocations can be too expensive for small objects.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Quoting - Dmitriy V'jukov

Another interesting moment about TBB's scalable malloc - how it determines large objects. This is exactly what we discussed several days ago:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

/* Does this object start on a 16K boundary + the size of a large object header? */
static inline unsigned int isLargeObject(void *object)
{
return ((uintptr_t)object & (blockSize - 1)) == sizeof(LargeObjectHeader);
}

Hmmm... I thinking how this can be improved...

For example, we have 64kb superblocks in slab-allocator. We can not use objects starting at 0kb, 4kb, 8kb, 12kb, ..., 62 kb in slab-allocator. And place big objects exactly at those offsets. So we will have a bit more freedom where we can place big objects.

Humm... Very interesting fine-grain approach. I am thinking about how to handle allocations that need to span multiple super-blocks. Well, I guess a possible, perhaps naive, approach would be to align them on a boundary that is greater than the size of a super-block. Then one could tell allocations that are over 64k apart from all the others...
Dmitry Vyukov
Quoting - lockfree As for my comment on blocks being scattered, well, I again forget to clarify another important moment. I was thinking of ways to cache blocks in remote threads. Well, when the owner thread dies, there could be multiple threads which have blocks from its slabs cached. Humm...

Do you mean that they can be cached for way too long (i.e. infinitely)?

Caching for a limited amount of time cannot be a problem, because when a thread dies there still can be blocks owned by that thread in use.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - lockfree Humm... Very interesting fine-grain approach. I am thinking about how to handle allocations that need to span multiple super-blocks. Well, I guess a possible, perhaps naive, approach would be to align them on a boundary that is greater than the size of a super-block. Then one could tell allocations that are over 64k apart from all the others...

Possibly we have some misunderstanding because of terminology.

What is the problem with handling objects bigger than a superblock? One just has to allocate them via ::malloc. They are big, so it must not be a problem. Or they can be allocated directly via mmap()/VirtualAlloc(). This is the approach used in TBB's scalable malloc.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Quoting - jimdempseyatthecove

>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would requre sizeof(Block) to be a power of 2 (not much of a problem)
and that the first few bytes object reside within Block. This could introduce problems when (if) alligned allocations are supported and then would require the size of Block to be at least as large as the largest aligned allocation supported.Then consider thatthe alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either bethe Block* itself, or a -byte count from the object pointer to the base of the Block.

Jim Dempsey

No blocks within the slab would reside anywhere in the header whatsoever; what makes you think that? Consider this trivial, naive layout:
_________________________________________________________________

#define L2_CACHE_LINE 128
#define PAGE_SIZE 8192
#define BLKBUF_SIZE (PAGE_SIZE - L2_CACHE_LINE)

struct header {
    /* [blah, blah]; */
};

union header_l2pad {
    struct header this;
    char l2pad[L2_CACHE_LINE];
};

struct slab {
    union header_l2pad header;
    unsigned char blkbuf[BLKBUF_SIZE];
};

typedef char sassert[
    (sizeof(union header_l2pad) == L2_CACHE_LINE) &&
    (sizeof(struct slab) == PAGE_SIZE) ? 1 : -1
];
_________________________________________________________________

If you ensure that `struct slab' objects reside on a `PAGE_SIZE' boundary, then you can find the header from any address residing in `struct slab::blkbuf' by using the masking technique. The actual buffer space starts out aligned on an L2 cache line boundary. This is a fairly trivial task indeed. If the application needs specific alignment, well, it does what they all do: over-allocate and perform the alignment itself:

http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59c8e98a3

The allocator does not care if the application requests more memory than it needs because of special alignment requirements. If the application needs page alignment, well, in this setup objects larger than `BLKBUF_SIZE' would be considered large and can defer to `malloc/free', or whatever.
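A minimal sketch of that over-allocate-and-align trick (the wrapper names and the stashed base pointer are illustrative assumptions, and plain malloc/free stand in for whatever allocator sits underneath; alignment must be a power of two):

#include <stdint.h>
#include <stdlib.h>

/* Over-allocate, align the returned pointer, and stash the original
   pointer just in front of it so the matching free can recover it. */
void* aligned_malloc_over(size_t size, size_t align)
{
    void* raw = malloc(size + align - 1 + sizeof(void*));
    uintptr_t p;
    if (! raw) return NULL;
    p = ((uintptr_t)raw + sizeof(void*) + align - 1) & ~(uintptr_t)(align - 1);
    ((void**)p)[-1] = raw;   /* remember where the real allocation began */
    return (void*)p;
}

void aligned_free_over(void* mem)
{
    if (mem) free(((void**)mem)[-1]);
}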
Chris M. Thomasson
Quoting - Dmitriy V'jukov

Possibly we have some misunderstanding because of terminology.

What is the problem to handle objects bigger than superblock? One just have to allocate then via ::malloc. They are big so it must not be a problem. Or they can be allocated directly via mmap()/VirtualAlloc(). This is approach used in TBB's scalable malloc.

Nevermind, I was thinking of something else. Sorry. One time I did a region allocator based on reserved address space, and large objects could be created out of multiple contiguous pages. It went something like:
____________________________________________________________________

#define PAGE_SIZE 8192

struct page {
    unsigned char buf[PAGE_SIZE];
};

struct vmregion {
    struct page* base;
    size_t count;
    size_t max_count;
    size_t refs;
};

static pthread_mutex_t g_vmregion_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct vmregion g_vmregion = { 0, 0, 0, 0 };

/* reserve a very large number of pages and store the base in
   `g_vmregion::base' and the max count in `g_vmregion::max_count' */

struct page* vmregion_commit(size_t pages) {
    struct page* p = NULL;
    size_t count;
    pthread_mutex_lock(&g_vmregion_mtx);
    count = g_vmregion.count + pages;
    if (count <= g_vmregion.max_count) {
        p = g_vmregion.base + g_vmregion.count;
        g_vmregion.count = count;
        ++g_vmregion.refs;
    }
    pthread_mutex_unlock(&g_vmregion_mtx);
    if (p) {
        /* commit the pages */
    }
    return p;
}

void vmregion_decommit(struct page* pages, size_t count) {
    /* decommit the pages */
    pthread_mutex_lock(&g_vmregion_mtx);
    --g_vmregion.refs;
    if (! g_vmregion.refs) {
        g_vmregion.count = 0;
    }
    pthread_mutex_unlock(&g_vmregion_mtx);
}
____________________________________________________________________

But, this has all the caveats that any region type allocation does. Oh well.
Chris M. Thomasson
Quoting - Dmitriy V'jukov

Do you mean that they can be cached for a way too long (i.e. infinitely)?

Caching for limited amount of time can not be a problem, because when thread dies there still can be blocks owned by that thread in use.

They can definitely be cached infinitely in certain scenarios indeed... Think of a thread which frees several blocks from several different threads, and caches them... However, from that point onward, the thread simply never interacts with the allocator interface again, and stays alive for the duration of the program. Well, the slabs which hold those blocks will persist for the lifetime of the program. This can be a problem indeed... In my algorithm I use a reference counting trick to ensure that slabs are freed as soon as the last reference to them has been dropped. You can see an example of this moment in the following algorithms:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

All of those algorithms use the same basic reference counting trick. However, it only works because there is no remote caching. Well, there is definitely one way to solve this problem: you could require the application threads to periodically or episodically free a NULL pointer, or call a special allocator function (e.g., mflush()), which would scan the calling thread's cache and perform cleanup as necessary. Humm, what do you think?
Chris M. Thomasson
I have implemented a _significant_ improvement in the way nodes are reclaimed in the following algorithm: http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69 I have it in an experimental next version of the vZOOM commercial library that is not available to anyone yet. However, I think I will post it here. I have to create a proof, and run some more tests, but it definitely makes things work WAY faster when there are a significant number of reclaimed remote blocks. Let's see if you can guess what I did...

Here is a hint: I found an easy way to completely remove the NASTY loop in the `slab_sys_shared_pop()' function... How do you think I accomplished this and still maintained a valid reference count?

:^D

Dmitry Vyukov
Quoting - lockfree I have implemented a _significant_ improvement in the way nodes are reclaimed in the following algorithm: http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69 I have it in an experimental next version of vZOOM commercial library that is not available to anyone yet. However, I think I will post it here. I have to create a proof, and run some more tests, but it definitely makes things work WAY faster when there are a significant number of reclaimed remote blocks. Let's see if you can guess what I did...

Here is a hint, I found out a easy way to completely remove the NASTY loop in the `slab_sys_shared_pop()' function... How do you think I accomplished this and still maintain a valid reference count?

:^D

It looks like a challenge. Ok, let's see how I would implement it in a commercial library, if I had one.

struct block_t
{
    struct block_t *next;
    struct per_thread_t *thread;
};

struct block_anchor_t
{
    block_t *front;
};

struct per_thread_t
{
    block_anchor_t local;
    block_anchor_t shared;
    block_t *blocks;
    size_t block_sz;
    int refs;
};

void* slab_malloc( size_t sz )
{
    per_thread_t *_this = pthread_getspecific(...);
    block_t *block = slab_sys_local_pop( _this );
    return block;
}

void slab_free( void *mem )
{
    block_t *block = ((block_t*)mem) - 1;
    per_thread_t *_this = pthread_getspecific(...);
    if ( block->thread == _this )
    {
        slab_sys_local_push( block );
    }
    else
    {
        slab_sys_shared_push( block );
    }
}

block_t* slab_sys_local_pop( per_thread_t *_this )
{
    block_t *block = _this->local.front;
    if (block)
    {
        _this->local.front = block->next;
        return block + 1;
    }
    return slab_sys_shared_pop (_this);
}

void slab_sys_local_push( block_t *_this)
{
    _this->next = _this->thread->local.front;
    _this->thread->local.front = _this;
}

block_t* slab_sys_shared_pop( per_thread_t *_this)
{
    block_t *block = XCHG( &_this->shared.front, 0 );
    if ( block )
    {
        _this->local.front = block->next;
        return block + 1;
    }
    return 0;
}

void slab_sys_shared_push( block_t *_this )
{
    block_t *cmp;
    do
    {
        cmp = _this->thread->shared.front;
        _this->next = cmp & ~1;
    }
    while ( ! CAS( &_this->thread->shared.front,
                   cmp,
                   ( cmp & 1 ) ? _this | 1 : _this ) );
    /* the reference count lives on the owning thread */
    if ( cmp & 1 && XADD( &_this->thread->refs, -1 ) == 1 )
    {
        /* destroy _this->thread */
    }
}

void slab_sys_tls_dtor( void *state )
{
    per_thread_t *_this = state;
    _this->refs = 1 + BLOCK_COUNT_PER_SUPERBLOCK;
    int count = 1;
    block_t* block = _this->local.front;
    while ( block )
    {
        count += 1;
        block = block->next;
    }
    block = XCHG( &_this->shared.front, 1 );
    while ( block )
    {
        count += 1;
        block = block->next;
    }
    if ( XADD( &_this->refs, -count ) == count )
    {
        /* destroy _this */
    }
}

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Well Dmitriy, you got it right! :^D Bingo: I simply moved the loop/counting to the thread dtor. The tweak works and definitely improves performance. AFAICT, wrt my original algorithm, moving the loop to the dtor was the only way to accomplish this improvement. I can't really think of another way to do it so efficiently. Removing/deferring loops can usually be a major plus to basically any algorithm, IMVHO that is! =^)
Alexey Kukanov (Intel)
Quoting - lockfree I forgot to clarify. I was referring to the moment when `realloc()' is invoked from a thread that does not own the slab in which the block resides. AFAICT, this would require creating a new block in the calling thread's slab, copying the original block, then pushing the original block onto its remote owner thread. As for my comment on blocks being scattered, well, I again forgot to clarify another important moment. I was thinking of ways to cache blocks in remote threads. Well, when the owner thread dies, there could be multiple threads which have blocks from its slabs cached. Humm... Am I making sense now? Anyway, sorry about that! :^/

If the new size is not bigger than the fixed size of the slab, why should the block be reallocated, even if "owned" by another thread? It can simply be reused. If the new size is bigger than can be served by the slab, it should be reallocated, again independently of whether the block is local or remote.

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Dmitry Vyukov
Quoting - Alexey Kukanov (Intel)

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Why? Please clarify.

Well, if the allocator is used with tasks, then, yes, it is impractical. Because stealing is rare, there are few remote deallocations.

But TBB's scalable allocator is a stand-alone thing. It can be used in arbitrary software. And substantial amount of software is relying on producer-consumer pattern in some form, for which caching can be crucial. Because there is *constant* stream of remote deallocations.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Alexey Kukanov (Intel)
Quoting - jimdempseyatthecove >> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would require sizeof(Block) to be a power of 2 (not much of a problem) and that the first few bytes of the object reside within the Block. This could introduce problems when (if) aligned allocations are supported, and then would require the size of Block to be at least as large as the largest aligned allocation supported. Then consider that the alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either be the Block* itself, or a byte count from the object pointer to the base of the Block.

Jim Dempsey

blockSize in the TBB allocator is not the same as sizeof(Block). The former is the total size of any slab in bytes, while the latter is the size of the slab header (yes, Block is in fact the slab header structure). So sizeof(Block) does not formally need to be a power of two. And yes, due to alignment requirements the actual allocation area in the slab might start not immediately after the header; in a sense, it effectively increases the "virtual" size of the header (i.e. memory overhead). But there is no problem with the alignment requirements of an application, because aligned allocations should use a separate set of API functions and explicitly specify the required alignment; this allows selecting the slab with proper alignment, though it may mean even more memory overhead - all for the sake of speed.

Dmitry Vyukov
Quoting - lockfree Well Dmitriy, you got it right! :^D Bingo: I simply moved the loop/counting to the thread dtor. The tweak works and definitely improves performance. AFAICT, wrt my original algorithm, moving the loop to the dtor was the only way to accomplish this improvement. I can't really think of another way to do it so efficiently. Removing/deferring loops can usually be a major plus to basically any algorithm, IMVHO that is! =^)

The only moment that is still unclear to me: how do you cache nodes in caching mode in the vZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If number of threads is constant and known, then it's simple - we create array of lists, we give every thread unique id 0,1,2... Then the task is trivial.

But if the number of threads is unknown and variable, then one has to employ something like a hash-map or... don't know... it must be something very cheap...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - Dmitriy V'jukov

The only moment is still unclear for me. How you cache nodes in caching mode in VZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If number of threads is constant and known, then it's simple - we create array of lists, we give every thread unique id 0,1,2... Then the task is trivial.

But if number of threads is unknown and variable, then one have to employ something like hash-map or... don't know... it must something very cheap...

I've remembered that I was already thinking about it:

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9

I was able to come up with 2 variants: "Cache list + global lock" and "Hashed cache array"

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
jimdempseyatthecove

Dmitriy,

Depending on the operating system, two references could be used to access the cache list. One could be via Thread Local Storage, which depending on the O/S and runtime system may be reduced to the overhead of supplying a segment override prefix, and the second would be your global table of tables for inter-thread allocation/deallocation.

Jim Dempsey

www.quickthreadprogramming.com
Chris M. Thomasson
Quoting - Alexey Kukanov (Intel)

If the new size is not bigger than the fixed size of the slab, why should the block be reallocated, even if "owned" by another thread? It can simply be reused. If the new size is bigger than can be served by the slab, it should be reallocated, again independently of whether the block is local or remote.

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Yes of course; I was not thinking clearly; you're right! As for remote caching of memory, well, I agree with you. It can create odd scenarios. Dmitriy and I discussed this already elsewhere.

Alexey Kukanov (Intel)
Quoting - Dmitriy V'jukov I've remembered that I was already thinking about it:

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9

I was able to come up with 2 variants: "Cache list + global lock" and "Hashed cache array"

Cache list plus global lock does not seem viable. Not only does it add a serialization bottleneck into an otherwise distributed design, but it also makes the thread go through a bunch of different block headers, many of which could be cold in the CPU cache.

The second design looks more appealing, though instead of hash collisions I would probably try a kind of LRU policy to flush some lists - isn't it a cache, after all? :) On the other hand, this design might lead to a scenario where you flush just one or two objects at a time - almost as if you returned the object immediately, except the corresponding block header is already cold in your CPU cache.

The biggest problem with object caching policies is that a general-purpose allocator never knows how many more objects will come from this exact thread (in the case of the TBB allocator, from this exact slab block - the free lists, both local and global, are per slab in the current design), so you do not know whether it makes sense to cache an object in order to return a whole bunch at once.

Also, isn't it you who thinks that sharing is almost the root of all evil in multithreaded programming? ;)) And I tend to agree :) From this point of view, object exchange between threads should be as rare as possible, so contention and atomic operations on the cold path of cross-thread deallocation should be well amortized by the fast hot path of working with the local free list.

Dmitry Vyukov
Quoting - Alexey Kukanov (Intel)

Cache list plus global lock does not seem viable. Not only it adds a serialization bottleneck into otherwise distributed design, but also it makes the thread go through the bunch of different block headers, many of which could be cold in CPU cache.

Yes, I am not fully satisfied with it either. The idea was that remote deallocations are infrequent; further, the thread tries to access the global lock only once per N (N=128, 256...) deallocations, and the thread can use try_lock() and, if the attempt to lock fails, just try next time.

But yes, blocks will be cold in cache with substantial probability. And yes, this scheme requires periodic activity from the thread.

Quoting - Alexey Kukanov (Intel)

The second design looks more appealing, though instead of hash collisions I would probably try a kind of LRU policy to flush some lists - isn't it a cache, after all? :) On the other hand, this design might lead to a scenario where you flush just one or two objects at a time - almost as if you returned the object immediately, except the corresponding block header is already cold in your CPU cache.

Yeah, the problem of most of such designs is that they perfectly fit some situations, but at the same time it's easy to come up with counter-example when such design will suck

Quoting - Alexey Kukanov (Intel)

The biggest problem with object caching policies is that a general-purpose allocator never knows how many more objects will come from this exact thread (in the case of the TBB allocator, from this exact slab block - the free lists, both local and global, are per slab in the current design), so you do not know whether it makes sense to cache an object in order to return a whole bunch at once.

Btw, why is it impossible to create per thread remote freelists, instead of per slab freelists? Because you don't want thread to go through the bunch of different block headers in order to sort them out into different slabs? Right?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - Alexey Kukanov (Intel)

Also isn't it you who think that sharing is almost the root of all evils in multithreaded programming? ;)) And I tend to agree:) From this point of view, object exchange between threads should be as rare as possible, so contention and atomic operations on the cold path of cross-thread deallocation should be well amortized by fast hot path of working with local free list.

Yes, it's me who was saying that sharing is the root of all evil.

Actually I think that I was saying both things: that it's the root of all evil and that it's a necessary thing :) because w/o sharing it's no longer a multi-threaded program, it's just a set of independent processes which are unable to solve a single task.

Well, I think that you are right for a substantial number of well-written multi-threaded programs - i.e. there is no need to optimize remote deallocations.

Although, it seems that a substantial number of multi-threaded programs rely on the producer-consumer pattern in some form (stages, pipelines, etc). And the producer-consumer pattern means that there is a *constant* flow of remote deallocations.

Also, because we are talking about a general-purpose allocator (and not about an allocator for some particular program), we must also consider "not so well-written programs". Isn't it good for a library that a non-expert user can create a not-so-well-written program and it will still behave adequately, because the underlying library uses a bunch of smart distributed and amortized techniques?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Chris M. Thomasson
Quoting - Dmitriy V'jukov

The only moment is still unclear for me. How you cache nodes in caching mode in VZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If number of threads is constant and known, then it's simple - we create array of lists, we give every thread unique id 0,1,2... Then the task is trivial.

But if number of threads is unknown and variable, then one have to employ something like hash-map or... don't know... it must something very cheap...

The caching mode in vZOOM does something "kind of" like this; let me quickly sketch it out here:
______________________________________________________________________

struct block {
    struct block* next;
};

struct slab {
    struct block* local;
    struct block* remote;
    struct thread* owner;
    size_t blksize;
    size_t remote_count;
    unsigned remote_max;
    [...];
};

struct thread {
    struct slab* slabs[16];
};

void free(void* mem) {
    struct block* blk = mem;
    struct slab* slab = get_slab_from_block(blk);
    struct thread* thread = pthread_getspecific(...);
    if (slab->owner == thread) {
        blk->next = slab->local;
        slab->local = blk;
    } else {
        struct slab* lslab = get_slab_from_size(thread, slab->blksize);
        if (lslab->remote_count < lslab->remote_max) {
            blk->next = lslab->remote;
            lslab->remote = blk;
            ++lslab->remote_count;
        } else {
            remote_free_to_owner_slab(slab, blk);
        }
    }
}

void* malloc(size_t size) {
    struct thread* thread = pthread_getspecific(...);
    struct slab* slab = get_slab_from_size(thread, size);
    struct block* blk = slab->remote;
    if (blk) {
        slab->remote = blk->next;
        --slab->remote_count;
    } else {
        blk = slab->local;
        if (! blk) {
            blk = reclaim_remote_blocks_from_slab(slab);
        } else {
            slab->local = blk->next;
        }
    }
    return blk;
}

void sys_slab_flush(struct slab* slab) {
    struct block* blk = slab->remote;
    while (blk) {
        struct block* const next = blk->next;
        struct slab* bslab = get_slab_from_block(blk);
        remote_free_to_owner_slab(bslab, blk);
        blk = next;
    }
    slab->remote = NULL;
    slab->remote_count = 0;
}

void mflush(void) {
    unsigned i;
    struct thread* thread = pthread_getspecific(...);
    for (i = 0; i < 16; ++i) {
        struct slab* slab = thread->slabs[i];
        if (slab) {
            sys_slab_flush(slab);
        }
    }
}

void mcache(size_t size, unsigned max_count) {
    struct thread* thread = pthread_getspecific(...);
    struct slab* slab = get_slab_from_size(thread, size);
    slab->remote_max = max_count;
    if (slab->remote_count > max_count) {
        sys_slab_flush(slab);
    }
}
______________________________________________________________________

Applications can enable caching mode on a per-thread-slab basis. They can set it up for a slab which contains a specific size. For instance, an application can say it wants caching enabled in consumer threads A-D for allocations the size of (sizeof(void*) * 2). There is a caveat... If a thread enables caching of ANY of its slabs, it should call `mflush()' every now and then. Allocations always check the remote list of the slab first...
Dmitry Vyukov
Quoting - lockfree The caching mode in vZOOM does something "kind of" like; let me quickly sketch this out here:

Applications can enable caching mode on a per-thread-slab basis. They can set it up for a slab which contains a specific size. For instance, an application can say it wants caching enabled in consumer threads A-D for allocations the size of (sizeof(void*) * 2). There is a caveat... If a thread enables caching of ANY of its slabs, it should call `mflush()' every now and then. Allocations always check the remote list of the slab first...

I see. It's interesting. Thank you.

I was thinking that you are able to return a bunch of remote blocks to owner at once (with one atomic RMW). So you are able only to reuse cached nodes, or to return them to owner one by one. It's interesting how you sort cached blocks by size, so thread is able to quickly find cached block of needed size. Even such caching can have great positive impact in some situations. For example when 2 threads exchanging messages with roughly equal frequency.

But what bothers me is the following. Assume there is 'asymmetric' message exchange, i.e. only one thread sending messages to another thread. Such caching won't help here, it will only make things worse, because there will be additional bookkeeping, and some blocks will be already cold in cache.

I think it's possible to create general-purpose caching scheme, in which it will be possible to return a bunch of memory blocks to owner with one atomic RMW. It will have N^2 space complexity.
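The "bunch of blocks with one atomic RMW" part can look something like this (an illustrative sketch using the CAS shorthand from this thread; the batch is linked up locally beforehand, so only the final splice onto the owner's lock-free stack is atomic):

typedef struct block_t { struct block_t* next; } block_t;

/* splice a locally prepared batch [first..last] onto the owner's stack */
void return_batch(block_t* volatile* owner_head, block_t* first, block_t* last) {
    block_t* cmp;
    do {
        cmp = *owner_head;
        last->next = cmp;
    } while (! CAS((void* volatile*)owner_head, cmp, first));
}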

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - Dmitriy V'jukov

I think it's possible to create general-purpose caching scheme, in which it will be possible to return a bunch of memory blocks to owner with one atomic RMW. It will have N^2 space complexity.

Something like:

Assign to every thread dense indexes from 0 to N (if a thread dies then its number will be reused by the next created thread). Create in TLS an array of remote thread descriptors. This array is accessed only by its owner, so it's easy to implement growing of the array, i.e. if a thread wants to fetch the descriptor of thread N+1, but the current array size is N, then the thread just allocates a new, bigger array, copies the information over and replaces the pointer to the array.

The rest is pie. A thread caches remote blocks in the corresponding descriptor in its TLS array.

Here is a little problem: when a thread wants to allocate a block it has to check ALL remote thread descriptors. In order to solve this, 2 lists of blocks can be maintained: one per-remote-thread list (needed when we want to return a bunch of blocks to the owner), and one 'global' (i.e. per-thread) list which combines all non-empty per-remote-thread lists (needed when the thread wants to allocate a block from the cache), as sketched below.
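A rough sketch of that bookkeeping (the names, the growth strategy, and the combined cache list are illustrative assumptions, not code from any allocator discussed here):

#include <stdlib.h>

struct block { struct block* next; };

struct remote_desc {
    struct block* batch;   /* blocks waiting to be returned to this owner */
    size_t count;
};

struct per_thread {
    struct remote_desc* remote;   /* indexed by dense owner-thread id */
    size_t remote_len;
    struct block* cache;          /* combined list for fast local reuse */
};

/* Only the owning thread ever touches its own array, so growing it is
   just: allocate bigger, copy, swap the pointer. */
static struct remote_desc* get_desc(struct per_thread* t, size_t owner_id) {
    if (owner_id >= t->remote_len) {
        size_t i, n = owner_id + 1;
        struct remote_desc* a = (struct remote_desc*)calloc(n, sizeof(*a));
        for (i = 0; i < t->remote_len; ++i) a[i] = t->remote[i];
        free(t->remote);
        t->remote = a;
        t->remote_len = n;
    }
    return &t->remote[owner_id];
}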

Although some details must be still worked over...

What do you think?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov
Quoting - Dmitriy V'jukov

Something like:

Assign to every thread dense indexes from 0 to N (if thread dies then it's number will be reused by next created thread). Create in TLS array of remote thread descriptors. This array is accessed only by owner, so it's easy to implement growing of array, i.e. if thread want to fetch descriptor of thread N+1, but current array size is N, then thread just allocates new bigger array, copy information and replace pointer to array.

The rest is pie. Thread caches remote blocks in according descriptor in its TLS array.

Here is little problem. When thread wants to allocate block it has to check ALL remote thread descriptors. In order to solve this 2 lists of blocks can be maintained: one per-remote-thread list (needed when we want to return a bunch of blocks to owner), and one 'global' (i.e. per-thread) list which combines all non-empty per-remote-thread lists (needed when thread want to allocate block from cache).

Although some details must be still worked over...

What do you think?

Maybe it's already not worth it... I'm not sure... What do you think?

On the other hand, if it can be implemented efficiently (especially w/o additional overheads for other usage patterns), then it's always nice to provide efficient support for one more usage pattern...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Alexey Kukanov (Intel)
Quoting - Dmitriy V'jukov Btw, why is it impossible to create per thread remote freelists, instead of per slab freelists? Because you don't want thread to go through the bunch of different block headers in order to sort them out into different slabs? Right?

It isn't impossible, but I think the current design is at least not worse, and might be faster. One of the reasons is exactly as you mentioned - I don't want to go through all different headers for different allocation sizes. The other one is that single remote free list, even non-blocking, is a bottleneck again; per slab lists should have less contention to update. Also the overhead of reclaiming remotely freed objects is distributed more over time than with a single public list, yet it is on demand only. Though it might be interesting to test other schemes.

I also agree that some caching scheme is worth benchmarking at least, and agree that it should return a bunch of memory objects at once. Also, in light of the probable widespread adoption of NUMA systems, I would avoid reusing remote memory - so returning a bunch becomes the only reason to cache.

Chris M. Thomasson
Quoting - Alexey Kukanov (Intel)

It isn't impossible, but I think the current design is at least not worse, and might be faster. One of the reasons is exactly as you mentioned - I don't want to go through all different headers for different allocation sizes. The other one is that single remote free list, even non-blocking, is a bottleneck again; per slab lists should have less contention to update. Also the overhead of reclaiming remotely freed objects is distributed more over time than with a single public list, yet it is on demand only. Though it might be interesting to test other schemes.

I also agree that some caching scheme is worth benchmarking at least, and agree that it should return a bunch of memory objects at once. Also, in the light of probable widespread of NUMA systems, I would avoid reusing remote memory - so returning a bunch becomes the only reason to cache.

I suspect that you already read through my quickly sketched caching algorithm; this was for SMP on an intra NUMA node basis. Anyway, why would you refrain from reusing remote blocks on NUMA? See, what if the memory was being reused on the same node? Let's say that there were 6 processors per node... If memory from P1 on NodeA was passed to P4 on NodeA, why wouldn't P4 be allowed to reuse that memory when the request came from a thread residing on NodeA? IMHO, caching algorithms can be highly NUMA friendly indeed. Passing memory on an inter-node basis is a VERY expensive operation as far as I am concerned. So, I agree with you that caching on remote NUMA nodes is BADDDDD. However, you did not specify that moment, so I can only guess here... vZOOM on NUMA takes advantage of the NUMA APIs of the platform's underlying OS. It takes REALLY remote deallocations into account indeed! When I write "remote deallocation" I am referring to inter-thread deallocations residing on the same NUMA node. When I write "REALLY remote deallocation", well, I am talking about a retarded program which frees memory from CPU1-NodeA to CPU1-NodeB; NOT GOOD!

I have not worked on NUMA for a while now, but I DO know all of the caveats. When I finally decide to purchase my neat SiCortex box (should be in the next few months), well, I will be able to play when it's snowing outside! I live in South Lake Tahoe. We had the first snow of the year two days ago. It was 13 degrees last night. Well, I need something to do!!!!! ;^/
Chris M. Thomasson
Actually, during the `mflush()' function, it correlates and organizes thread ids into cohorts and can free multiple blocks to a remote slab in a single CAS. However, I did not display this moment in the sketched pseudo-code. I defer hashing and organizing from per-free calls to amortized per-mflush calls. See, I expect `mfree()' to be called rather frequently. However, I expect `mflush()' to be called only every now and then... I don't see a problem with deferring sorting until an explicit call to `mflush()' occurs. ;^D
Alexey Kukanov (Intel)
Quoting - lockfree I suspect that you already read through my quickly sketched caching algorithm; this was for SMP on an intra NUMA node basis. Anyway, why would you refrain from reusing remote blocks on NUMA? See, what if the memory was being reused on the same node? Let's say that there were 6 processors per node... If memory from P1 on NodeA was passed to P4 on NodeA, why wouldn't P4 be allowed to reuse that memory when the request came from a thread residing on NodeA? IMHO, caching algorithms can be highly NUMA friendly indeed. Passing memory on an inter-node basis is a VERY expensive operation as far as I am concerned. So, I agree with you that caching on remote NUMA nodes is BADDDDD. However, you did not specify that moment, so I can only guess here...

Well, yes of course "really remote" memory reuse is very bad while intra-node (actually, it's rather intra-socket nowadays) reuse might be fine. But, it adds to bookkeeping complexity and reduces portability. Being NUMA-oblivious on object level is simpler. On the level of VM mapping and slab reuse, however, it makes more sense to me to play with non-portable NUMA APIs - as the VM mapping is non-portable either.

Quoting - lockfree ... well, I will be able to play when it's snowing outside! I live in South Lake Tahoe. We had the first snow of the year two days ago. It was 13 degrees last night. Well, I need something to do!!!!! ;^/

In my region of Russia, we saw the first snow a couple weeks ago, much earlier this year than usual. It did not stay on the ground, of course. But when it does, I'd prefer to go cross-country skiing in my spare time :)

Chris M. Thomasson
Quoting - Dmitriy V'jukov Maybe it's already not worth it... I'm not sure... What do you think?

On the other hand, if it can be implemented efficiently (especially w/o additional overheads for other usage patterns), then it's always nicely to provide efficient support for one more usage pattern...

Humm... I think I understand what you're getting at. I am not sure if it's worth it or not. It's definitely extra bookkeeping. Humm... I really need to stop and think about this scheme. At the moment, I am not really sure how much better it is than the current scheme vZOOM uses. I apologize for not putting the sorting and coalescing of blocks from the same slab in the `mflush()' function. This sorting does have overhead indeed; however, I assume that calls to `mflush()' will be few and far between.
Dmitry Vyukov
Quoting - CHRIS M THOMASSON Quoting - Dmitriy V'jukov Maybe it's already not worth it... I'm not sure... What do you think?

On the other hand, if it can be implemented efficiently (especially w/o additional overheads for other usage patterns), then it's always nicely to provide efficient support for one more usage pattern...

Humm... I think I understand what you're getting at. I am not sure if it's worth it or not. It's definitely extra bookkeeping. Humm... I really need to stop and think about this scheme. At the moment, I am not really sure how much better it is than the current scheme vZOOM uses. I apologize for not putting the sorting and coalescing of blocks from the same slab in the `mflush()' function. This sorting does have overhead indeed; however, I assume that calls to `mflush()' will be few and far between.

I was confused because you didn't display sorting and organization into cohorts. I think it's Ok to sort only in mflush(). I think it's even better than my variant.

Though I think you must employ something like this ("Hashed cache array" variant):

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9?pli=1

And if we would have dense thread ids, it can help with sorting and organizing into cohorts.


All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
