how to free memory allocated by scalable_allocator

Hi!
I have one worker thread that processes requests from other threads. Requests are relatively small objects, ~2K each, allocated by scalable_allocator<>::allocate. The worker thread calls scalable_allocator<>::deallocate once it is done with a request. When I have peak load from client threads and the pending-request queue grows (since the worker thread cannot process requests as fast as they are supplied), scalable_allocator allocates memory for the peak number of requests and will not return it to the OS. AFAIK this is by design, as it assumes the threads will be re-using the memory. Still, in my case this may result in too large a memory consumption. Since peak load is very rare and I know when it can be generated, I'd like to simply make scalable_allocator release its pooled memory.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?


One of the TBB support people may give you a better answer to this...

The TBB allocator allocates within a process (an application). This allocation is a virtual memory association between the application and the system page file, not physical memory. Therefore there is no such concept as returning the memory to the OS (or taking memory from the OS). The only "negative" effect of the TBB allocator not "returning memory" is that your system may require a larger page file. When the TBB memory is returned from the application to the TBB pool, that memory may (eventually) get swapped out to the page file and sit there unused (until the application closes). In the era of 1 TB disks for under $100, a few extra megabytes of disk space should not be of too much concern.

Note, if you are running on Windows, and depending on the version of Windows, you may have to raise the page file size (or limit). For some lame reason MS thought Page File Size == Physical Memory Size was what everyone needs. I could never understand that.

Jim Dempsey

www.quickthreadprogramming.com
Best Reply

Quoting - redcat76
Hi!
I have one worker thread that processes requests from other threads. Requests are relatively small objects, ~2K each, allocated by scalable_allocator<>::allocate. The worker thread calls scalable_allocator<>::deallocate once it is done with a request. When I have peak load from client threads and the pending-request queue grows (since the worker thread cannot process requests as fast as they are supplied), scalable_allocator allocates memory for the peak number of requests and will not return it to the OS. AFAIK this is by design, as it assumes the threads will be re-using the memory. Still, in my case this may result in too large a memory consumption. Since peak load is very rare and I know when it can be generated, I'd like to simply make scalable_allocator release its pooled memory.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?

In addition to what Mr. Dempsey has written, I feel that I should point out the following thread that shows a potential problem with per-thread allocation schemes, such as TBB allocator, in general:

http://software.intel.com/en-us/forums/showthread.php?t=61716

If Thread A allocates a large amount of memory M which is subsequently freed by Thread B, and Thread A does not allocate any more memory, well, all the allocations that make up M are leaked for the duration of A's lifetime. A possible solution is to allow Thread A to periodically or episodically call a flush function that will reclaim memory on its remote-free list.

Hi, Jim! Thank you very much for your reply.

Quoting - jimdempseyatthecove

One of the TBB support people may give you a better answer to this...

The TBB allocator allocates within a process (an application). This allocation is a virtual memory association between the application and the system page file, not physical memory. Therefore there is no such concept as returning the memory to the OS (or taking memory from the OS). The only "negative" effect of the TBB allocator not "returning memory" is that your system may require a larger page file.

Jim Dempsey

Unfortunately the problem is not that imaginary for me, as VM size is also limited: when the thread generating requests is too fast for the worker thread to process them, I see a rapid growth of my process's VM size and soon run out of virtual space. I launch my program under 32-bit Windows XP, so this happens at a program VM size of ~2 GB. Then I get an access violation elsewhere in the program as regular malloc returns 0.

Watching "Mem Usage" and "VM size" deltas in Task manager I calculated that 1 request costs me 5K of VM space. So if I have ~400 000 pending requests ...crash-boom-bang.

Note that with an STL allocator based on the MSVC RT malloc/free I don't have this problem: a call to deallocate returns VM space, so the program survives even peak loads, although it works very much slower.

I introduced an artificial delay for the generating thread if the number of pending requests exceeds a pre-defined limit, but in this case performance drops as well, since the generating threads are waiting. I'd have a chance of fast performance even with peak loads if I could make scalable_allocator release reserved virtual memory (I believe in the Windows implementation that means a call to VirtualFree with MEM_RELEASE or something like that).

Quoting - Chris M. Thomasson

A possible solution is to allow Thread A to periodically or episodically call a flush function that will reclaim memory on its remote-free list.

Thank you, Chris!

Actually I have seen this thread before, but you made me read it more carefully. Still, as far as I could tell, it ended with no decision: such a "flush" function does not yet exist, at least in the "official" scalable_allocator interface. Or am I mistaken?
I can try to compile the source code (that was also described in some thread here) and call some implementation-detail function, but IMHO this is not a good decision.

What you might want to consider is restricting the use of the TBB allocator to objects with a high frequency of allocation/deallocation. Then use new or an STL allocator for the lower-frequency allocation/deallocation. This may get you through the problem.

A second technique would be to create and use persistent objects and an object pool. Then, in place of deleting an object, you retire it to a pool (first to a non-interlocked thread-private pool, then next to an interlocked global pool). This can all be hidden away in your own allocator, and then you have the capability to fine-tune the allocation/deallocation as a trade-off of footprint vs. overhead. And you can slip this into your existing code without too much effort.

In my QuickThread allocator I use a similar technique (but one which is extended for NUMA considerations). The QT allocator is not so sensitive to allocations performed in thread A, with the node passed to thread B, then passed to a thread other than A and deleted. A larger footprint will occur, but excessive allocations/deallocations in this manner are accounted for and handled appropriately.

As to whether this would work for your application, it would be hard to say. I think a similar technique that you roll for yourself should work just fine.

Jim Dempsey

www.quickthreadprogramming.com

Another technique I forgot to mention: when your ~400,000 pending requests result in allocations of differing sizes, try creating a polymorphic object that can be any of the differing-sized objects (which cause the allocation problem). The polymorphic object would always be allocated at the size of the largest encapsulated object. There will be some memory waste when used on smaller objects, but you may make this up by having more re-usability of the objects once deallocated. (A simple way of doing this is with a union and a type ID.)

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

What you might want to consider is restricting the use of the TBB allocator to objects with a high frequency of allocation/deallocation. Then use new or an STL allocator for the lower-frequency allocation/deallocation. This may get you through the problem.

A second technique would be to create and use persistent objects and an object pool. Then, in place of deleting an object, you retire it to a pool (first to a non-interlocked thread-private pool, then next to an interlocked global pool). This can all be hidden away in your own allocator, and then you have the capability to fine-tune the allocation/deallocation as a trade-off of footprint vs. overhead. And you can slip this into your existing code without too much effort.

In my QuickThread allocator I use a similar technique (but one which is extended for NUMA considerations). The QT allocator is not so sensitive to allocations performed in thread A, with the node passed to thread B, then passed to a thread other than A and deleted. A larger footprint will occur, but excessive allocations/deallocations in this manner are accounted for and handled appropriately.

As to whether this would work for your application, it would be hard to say. I think a similar technique that you roll for yourself should work just fine.

Jim Dempsey

Thank you for the proposed solutions!
I'll give the object pool model a try. I just had a feeling from the documentation that tbb::scalable_allocator behaves in a way similar to such an object pool. The only problem is there's no way to shrink its allocated VM space after peak loads...

Quoting - jimdempseyatthecove

Another technique I forgot to mention: when your ~400,000 pending requests result in allocations of differing sizes, try creating a polymorphic object that can be any of the differing-sized objects (which cause the allocation problem). The polymorphic object would always be allocated at the size of the largest encapsulated object. There will be some memory waste when used on smaller objects, but you may make this up by having more re-usability of the objects once deallocated. (A simple way of doing this is with a union and a type ID.)

Jim Dempsey

For me this is not the problem, as all requests have a fixed size of ~2K. In a way they are already polymorphic, just as you suggested.

I described the problem in general, but to be more exact, I use this model for logging: many threads post log records, and one thread persists them to a log file. To avoid dynamic allocations for stream buffers of different sizes, I use a fixed-size buffer of 2K inside each log record object. The problem is that most log records use on average ~5% of the 2K buffer space. So the immediate solution would be to optimize this and place more data in one log record (e.g. several lines of log messages, up to the buffer's maximum space). Thus I hope to reduce the number of log record objects allocated.

Possibly the easiest way to "hack in" a fix is to modify the tail-end TBB task to pass the data into a filter handled by a non-TBB thread. This filter could compact the data (writing when necessary) and return the 2K buffer (or release the 2K buffer). It is not pure TBB, but so what. The compaction could be done on either the TBB side or the ancillary thread. The benefit of using the ancillary thread is that the TBB thread won't block for I/O (since I/O is now done in the ancillary thread).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

In my QuickThread allocator I use a similar technique (but one which is extended for NUMA considerations). The QT allocator is not so sensitive to allocations performed in thread A, with the node passed to thread B, then passed to a thread other than A and deleted. A larger footprint will occur, but excessive allocations/deallocations in this manner are accounted for and handled appropriately.

As to whether this would work for your application, it would be hard to say. I think a similar technique that you roll for yourself should work just fine.

Jim Dempsey

I read the QuickThread PDF. It sounds interesting; where can I find more details?

Quoting - jimdempseyatthecove

Possibly the easiest way to "hack in" a fix is to modify the tail-end TBB task to pass the data into a filter handled by a non-TBB thread. This filter could compact the data (writing when necessary) and return the 2K buffer (or release the 2K buffer). It is not pure TBB, but so what. The compaction could be done on either the TBB side or the ancillary thread. The benefit of using the ancillary thread is that the TBB thread won't block for I/O (since I/O is now done in the ancillary thread).

Jim Dempsey

Thank you, Jim! I'm really impressed by the number of options you gave me to solve this!

For now I used the following simple solution: I know the places in the source code that can generate lots of logging. So for such cases I implemented a stream-based logger that writes all incoming log messages of the same type to an internal fixed-size buffer and creates LogRecord objects using tbb::scalable_allocator in only 3 cases:
- the buffer overflows (then I effectively use all 2K of memory space and do not create extra costly LogRecords);
- the logging thread explicitly calls flush (not used for now);
- the stream object is destroyed (flushes all pending data).
This allowed me to greatly reduce the number of LogRecord allocations, and the resulting VM footprint is now quite low.

Still, IMHO all these are just workarounds. I have a feeling that a generic scalable allocator should definitely help to scale up at the cost of more memory resources, but it should also have an option to scale down in case of peak stress loads. Or have an explicit note in the documentation about possible unbounded VM waste under certain conditions, and a recommendation to "use mostly under 64-bit systems", since, as I have seen in my case, it can quite quickly and rather unexpectedly consume all 2 GB of virtual memory space available to a process on 32-bit Windows systems. Again, I may be wrong; just IMHO.

Anyway, Jim, thanks again! I hope to get your valuable assistance if I have other posts in this forum.

Quoting - Vivek Rajagopalan

I read the Quick Thread PDF. It sounds interesting , where can I find more details ?

Send me your email and I will email a current .doc file and other supporting documents. If after reading you find you are interested in exploring further, I can send you a beta test kit. Note, QuickThread runs on Windows platforms now; when I have time, and if someone can help me figure out Linux (Ubuntu is installed on my system on a separate disk), then I can get to work on a Linux version.

The beta license will restrict you to "evaluation purposes only". Things are a little bit in flux now, and a few more revisions are expected before you ship applications built with it. The more testers I have, the better the shake-down.

Jim Dempsey
jim (dot) (zero) (zero) (dot) dempsey (at) gmail (dot) com
or
jim (underbar) dempsey (at) ameritech (dot) net

www.quickthreadprogramming.com

You may find it a non-option to dynamically tune any scalable allocator. Your only option may be to tune your application or make concessions, one or more of:

- throttle the application down when you get too much of a backlog;
- discard some log messages, if acceptable;
- convert "Warning Will Rogers..." messages into codes (byte, short, word, long, ...);
- devise a compression technique that is fast and can reconstitute the original log;
- use a pipe to push log data to a separate process (on an x32 system);
- use a memory-mapped file in lieu of a pipe to push log data to a separate process (on an x32 system);
- ... there may be a few other tricks you can use too.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - redcat76

Thank you for the proposed solutions!
I'll give the object pool model a try. I just had a feeling from the documentation that tbb::scalable_allocator behaves in a way similar to such an object pool. The only problem is there's no way to shrink its allocated VM space after peak loads...

Indeed, the TBB allocator does have a pool of objects (more exactly, a variety of per-thread pools for objects of different sizes).

I think the real issue is that, due to faster allocations, the peak itself is higher than what the system can tolerate. I.e. the application just does not survive the peak. If VM were returned to the OS, this would slow down both allocation and deallocation, thus lowering the rate of allocations and decreasing the peak load. To me, that sounds more like an artificial workaround, while what you have implemented is the right solution: first, improve memory utilization in the app, and second, check if there is so much data not yet processed that lack of memory might be a problem.

We will consider adding a function to flush unused VM in future versions of the allocator. This is hard to do in the current design, where VM is mapped in rather big pieces (1 MB) and is then distributed across many threads - it is hardly ever possible to have every piece of it freed so that it can all be returned to the OS. Allocation speed in a multi-threaded environment comes with a price [added] - though possibly the price component in question can be reduced.

I agree here with Alexey. It would be questionable whether a slab-based allocator (TBB and QuickThread) could ever return slabs of VM. Alexey might be able to answer this for TBB. It could be possible that, when memory is low and you find an overabundance of returned allocations of some size, and if these former allocations were large enough, those freed allocations could be split in order to satisfy the current allocation problem. If you are tight on memory, you may have to go with malloc/free. Or some mixture.

Have you experimented with using a pipe to a separate process? Note, each process on a 32-bit system is a separate virtual address space. Until you migrate to 64-bit, you can shove some data (or processing) into a separate VM.

Or you could even use OpenMPI on the same system to accomplish the same thing. Place major functions with low communication overhead into separate processes. Then use OpenMPI in place of the pipe/memory-mapped file. Your work effort would then apply for use on larger MPI-based systems later.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Alexey Kukanov (Intel)

Indeed, the TBB allocator does have a pool of objects (more exactly, a variety of per-thread pools for objects of different sizes).

I think the real issue is that, due to faster allocations, the peak itself is higher than what the system can tolerate. I.e. the application just does not survive the peak. If VM were returned to the OS, this would slow down both allocation and deallocation, thus lowering the rate of allocations and decreasing the peak load. To me, that sounds more like an artificial workaround, while what you have implemented is the right solution: first, improve memory utilization in the app, and second, check if there is so much data not yet processed that lack of memory might be a problem.

We will consider adding a function to flush unused VM in future versions of the allocator. This is hard to do in the current design, where VM is mapped in rather big pieces (1 MB) and is then distributed across many threads - it is hardly ever possible to have every piece of it freed so that it can all be returned to the OS. Allocation speed in a multi-threaded environment comes with a price [added] - though possibly the price component in question can be reduced.

Thank you, Alexey! Your arguments sound quite convincing. So I agree it is more of an app-level responsibility to fine-tune such things.

One of the reasons I started this thread, though, is that with the MSVC run-time malloc/free my app survives even the peak load, at the cost of performance. And the memory allocation after the peak is the same as before, as all the memory is returned back to the system. But with the TBB scalable_allocator it crashes. So another option for me is to temporarily switch to a malloc/free-based allocator to process peak loads.

Not sure how it fits into the current design, as I still have not studied the sources, but I think this feature could be added to tbb::scalable_allocator: if the memory allocated by TBB is higher than a threshold set by a call to some interface function (set_max_scalable_memory or something), then TBB passes all allocations through to malloc/free. By default there is no threshold. I suspect TBB may not be aware of the total allocations made by all threads, so it may be a per-thread feature. Then the app programmer can configure this threshold per box based on VM size (32-bit or 64-bit), amount of physical memory, and the results of stress testing.

Quoting - jimdempseyatthecove

I agree here with Alexey. It would be questionable whether a slab-based allocator (TBB and QuickThread) could ever return slabs of VM. Alexey might be able to answer this for TBB. It could be possible that, when memory is low and you find an overabundance of returned allocations of some size, and if these former allocations were large enough, those freed allocations could be split in order to satisfy the current allocation problem. If you are tight on memory, you may have to go with malloc/free. Or some mixture.

Have you experimented with using a pipe to a separate process? Note, each process on a 32-bit system is a separate virtual address space. Until you migrate to 64-bit, you can shove some data (or processing) into a separate VM.

Or you could even use OpenMPI on the same system to accomplish the same thing. Place major functions with low communication overhead into separate processes. Then use OpenMPI in place of the pipe/memory-mapped file. Your work effort would then apply for use on larger MPI-based systems later.

Jim Dempsey

So far I resolved the issue by making a simple stream buffer that makes more effective use of memory. So now lack of VM space is not a problem.

Still, using a separate process for logging may be a good idea. But something tells me that pipes may not be fast enough to use them directly from the logging threads. So I may leave the current design unchanged (logging threads just post requests to the processing thread) and make the processing thread send log requests through a pipe to a separate logging process instead of directly writing to a file on disk. That should be faster, so memory consumption in the main app should be lower. The logging process may have one thread for accepting incoming requests and one for actual disk writes. Otherwise this chain will again work at the speed of disk access.

Quoting - redcat76

So far I resolved the issue by making a simple stream buffer that makes more effective use of memory. So now lack of VM space is not a problem.

Still, using a separate process for logging may be a good idea. But something tells me that pipes may not be fast enough to use them directly from the logging threads. So I may leave the current design unchanged (logging threads just post requests to the processing thread) and make the processing thread send log requests through a pipe to a separate logging process instead of directly writing to a file on disk. That should be faster, so memory consumption in the main app should be lower. The logging process may have one thread for accepting incoming requests and one for actual disk writes. Otherwise this chain will again work at the speed of disk access.

Pipes are fairly fast. However, if you use a memory-mapped file, not as the I/O buffer but as a ring buffer visible to both processes, then what you have is shared memory block(s).

Buffer[(fillPointer++)%bufferSize] = data;

The data will be visible to the other thread in the other process as fast as it takes the cache coherency system to percolate the write. Caution: this buffer may reside at different addresses in each process. Don't pass information using pointers; instead use offsets or values.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - redcat76

Thank you, Alexey! Your arguments sound quite convincing. So I agree it is more app-level responsibility to fine -tune such things.

One of the reasons I started this thread though is that with MSVC run-time mallocfree my app survives even the peak load at the cost of performance. And the memory allocation after the peakis the same as before as all the memory is returned back to the system. But withTBBscalable_allocator itcrashes.So another option for me is to temporary switch to mallocfree-based allocator to process peak loads.

Not sure how it fits into current design as I still did not study the sources, but I think this feature can be added to tbb::scalable_allocator. If the memory allocated by TBB is higher than a threshold set by a call to some interface function (set_max_scalable_memory or smth) than TBB passes all allocations to mallocfree. By default there is no threashold.I suspect TBB may not be aware of the total allocations made by all threads, so it may be a thread-based feature. Then app programmer can configure thisthreshold per box based on VM size (32-bit or 64-bit), amount of physical memory and results of stress testing.

Your suggestion (a threshold) wouldn't work out, because the free memory inside the TBB pool would not be available to the subsequent malloc/free allocations. This would compound the tight memory situation. The best solution, under the tight memory situation, is to use the TBB pool system sparingly, for rapid allocations that do not experience a ballooning effect (e.g. your 2K transactions under peak load). For those types of allocations you would write your own allocator that can and should use a threshold system. Whether it incorporates TBB into the scheme would be up to you.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Pipes are fairly fast. However, if you use a memory-mapped file, not as the I/O buffer but as a ring buffer visible to both processes, then what you have is shared memory block(s).

Buffer[(fillPointer++)%bufferSize] = data;

The data will be visible to the other thread in the other process as fast as it takes the cache coherency system to percolate the write. Caution: this buffer may reside at different addresses in each process. Don't pass information using pointers; instead use offsets or values.

Jim Dempsey

True, shared memory is quite fast, but again, if any logging thread is to access and hence write to it, it requires a low-latency lock for simultaneous writers and, besides that, a lock for the reader (the log-processing app). Since the writers are threads of one process, they can use a critical section object. The reader can be implemented without a lock if the writers maintain the current write position in the shared memory and the reader checks the write position twice, before and after reading, to avoid dirty reads... Just some fast comments, but definitely worth thinking over.

Speaking of memory-mapped files... I have another interesting task. One process writes a stream of data to shared memory that is read by other processes. I implemented non-blocking access to it using a technique similar to the one I described above: the writer updates the write position in the shared memory using an interlocked function. Readers check the write position before and after reads to detect dirty data (the writer has overwritten memory while the reader accessed it). Everything is fine, but I have one problem: efficient notification of readers that there's data available in the shared memory. I don't want the reading processes to constantly poll the memory to see if there's anything to read. Currently I use a separate auto-reset event object for each reader: once the writer appends a new chunk to the shared memory, it sets all registered events. Readers read up to the write position and then wait on their individual events. But calling SetEvent for each registered reader introduces performance spikes in the writer process, which I'd like to avoid. I've squeezed my brains trying to figure out a scheme that would use one synchronization object... But so far no luck. Since this topic is not directly related to TBB, in case you have some ideas to discuss, here's my e-mail: RRedCat@Yandex.Ru.
Again thank you for your suggestions!

Quoting - redcat76
Hi!
I have 1 worker thread that processes requests from other threads. Requests are relatively small objects ~2K each, allocated by scalable_allocator<>::allocate. Worker thread calls scalable_allocator<>::deallocate once it is done with request.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?

Okay, so it's my turn to run into this :-)

If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed? Is this a best practice?

In my case, both these threads host tbb::pipelines. I observed that memory usage climbed up to almost 90% under load, but did not grow significantly after that. It may well be harmless, but I'd like to have some control over it.

Thanks,

"If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed ? Is this a best practice ?"
Refurbishing memory, or returningit to the thread that allocated it,may be useful optimisations, but you know what they say about premature optimisation... If thread 1 continues to allocate more memory, it will soon get around to reusing the memory deallocated from inside another thread, and unless I'm mistaken it is gradually made available for other threads as well (if all its neighbours in a block are also free). Still, the assumption seems to be that there aren't that many threads and that they don't specialise in a role, which makes sense for a task-based system. If thread 1 stops doing anything with the scalable memory allocator, the memory deallocated from another thread may stay in limbo for an extended amount of time, again not very likely with tasks, but if you encounter this problem with a user thread you might just want to close it down to have its memory taken over by another thread (to be confirmed).

"In my case, both these Threads host tbb::pipelines. I observed that memory usage climbed up to almost 90% under load, but did not grow significantly after that. It may well be harmless, but I 'd like to have some control over it."
Difficult to say anything definite without more information...

Quoting - Raf Schietekat

"If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed ? Is this a best practice ?"
Still, the assumption seems to be that there aren't that many threads and that they don't specialise in a role, which makes sense for a task-based system. If thread 1 stops doing anything with the scalable memory allocator, the memory deallocated from another thread may stay in limbo for an extended amount of time, again not very likely with tasks, but if you encounter this problem with a user thread you might just want to close it down to have its memory taken over by another thread (to be confirmed).

Thanks once again Raf,

I am restructuring the code and will update this thread on what I learnt from this.

Quoting - Vivek Rajagopalan

Thanks once again Raf,

I am restructuring the code and will update this thread on what I learnt from this.

I'm curious exactly what you changed, why, and whether it was beneficial.

A clarification to what I wrote above: directly returning memory to the thread that allocated it would require being very careful to ride piggyback on a synchronisation cost that you are paying anyway, I think, otherwise you would have gained nothing, or worse. I don't know if anybody has successfully applied it yet (?), but it remains a theoretical possibility. You would be far more likely to benefit from refurbishing, though, if the opportunity presents itself.

Quoting - Raf Schietekat

I'm curious exactly what you changed, why, and whether it was beneficial.

A clarification to what I wrote above: directly returning memory to the thread that allocated it would require being very careful to ride piggyback on a synchronisation cost that you are paying anyway, I think, otherwise you would have gained nothing, or worse. I don't know if anybody has successfully applied it yet (?), but it remains a theoretical possibility. You would be far more likely to benefit from refurbishing, though, if the opportunity presents itself.

I gave up the refurbishing idea, because it appears to be very difficult to know which thread scalable_malloced a given chunk. I guess this is because the memory was allocated by a parallel tbb::filter, which could be mapped to any available thread by the scheduler.

I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier.

I don't yet know if this has given me any benefits. I am wary of allocating in a serial filter because of my incomplete understanding of how the cache works. If you allocate in a serial filter, does it not mean that the memory is pulled into the same cache every time? This appears to be a waste, because the actual work is done by a series of parallel filters which will pull it into a different cache (of another CPU core) in short order. On the other hand, if I allocate in a parallel filter, the memory will be pulled directly into the cache where the work happens.

Thanks,

"I gave up the refurbishing idea, because it appears to be very difficult to know which thread scalable_malloced a given chunk. I guess this is because the memory was allocated by a parallel tbb::filter, which could be mapped to any available thread by the scheduler."
I meant "refurbish" for another purpose, to entirely avoid a free/malloc detour, as opposed to "returning" to the original thread. Well, maybe only an object deserves use of this word, and memory would be just "reused".

"I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier."
So you weren't using a separate user thread before? Hmm... if the other filters are parallel, the data item flows through the pipeline uninterruptedly (last time I looked at the code, anyway, as this is not guaranteed), and would be freed in the same task execution, which means in the same thread. Maybe that's another thing that changed in the restructuring?

"I dont yet know if this has given me any benefits. I am wary of allocating in a serial filter because of my incomplete understanding of how the cache works. If you allocate in a serial filter, does it not mean that the memory is pulled into the same cache every time ? This appears to be a waste because the actual work is done by a series of parallel filters which will pull it into a different cache (of another CPU core) in short time. On the other hand, if I allocate in parallel filter, the memory will be directly pulled into the cache in which the work happens."
It may seem paradoxical, but looking across time each item in a serial filter is likely to be processed in a different thread, and if you follow the item through the pipeline it stays in the same thread if it only moves to parallel filters. So your reasoning is correct, but the assumption was wrong.

(Added 2009-09-12) See text.

Quoting - Raf Schietekat
"I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier."
So you weren't using a separate user thread before? Hmm... if the other filters are parallel, the data item flows through the pipeline uninterruptedly (last time I looked), and would be freed in the same task, which means in the same thread. Maybe that's another thing that changed in the restructuring?

I was using a separate user thread. I had hosted a pipeline in this thread (a tbb_thread, if that makes a difference), and the memory was being allocated by various parallel filters. The memory was being freed by pipeline stages running in another tbb_thread. I must apologize for not spending enough effort in tracking down the issue after reporting it in this forum. I moved some allocations from the parallel stage to the serial stage and the memory issues seemed to go away.

Here is my setup in all its ugliness :-)

Background :
Actually this is based on an open source project I had launched earlier (not task based) called Trisul Network Metering and Forensics (http://code.google.com/p/trisul/). I completely knocked down the thread-based approach of that project and am rewriting it as task/event based, so the project is dead in its current form. I just could not get over the fact that 3 of my 4 cores were just relaxing while the other core was at 100%. I tried some flow-pinning techniques using pthreads, but did not like the results or the intricate synchronization involved.

TBB Thread 1 : PIPELINE 1

1. Filter 1 (Serial) : Input packets are read off the wire (using special hardware if available, or libpcap) and batched 100-200 at a time until the total payload reaches about 1 MB.

2. Filter 2 (Parallel) : Uses an instance of a framework to decode the packets protocols

3. Filter 3 (Parallel) : Uses an instance of a metering framework to count. The net result of this is a set of messages that update counters for various things like IP addresses, TCP flows, etc, etc. This could be intensive. scalable_malloc happens here.

4. Filter xx (Parallel) : Several filters that perform various deep inspection of the payload.

5. Filter 4 (Parallel) : Figure out if the packet needs to be saved (eg, for forensic purposes). If yes, apply block encryption to the payload. Pass it along

6. Filter 5 (Parallel) : Some more magic that takes the set of messages and compresses them. Outputs the messages to a concurrent_queue.

7. Filter 6 (Serial) : All packets that need stateful handling (marked as such by a parallel stage) get processed here. Examples: IP fragment reassembly, TCP flow construction, VOIP, etc. * This is a known bottleneck because this stage is essentially single threaded.

8. Filter 7 (Serial) : Saves the packets marked as such for forensics purposes. (Could be another bottleneck.)

(After this step, further stages work on the messages not on the packet data. So I decided to use another pipeline that can handle the very different work profile). The only way I could figure out how to do this was to use another tbb_thread and put a pipeline inside it. So there it was, thread 2/ pipe 2.

TBB Thread 2 : PIPELINE 2

1. Filter 1 (Serial) : Reads messages from the concurrent queue. Generates work tokens based on target data structures to be updated. We can figure it out by looking at the messages coming out of the queue.

2. Filter 2 (Parallel) : Carries out the various operations contained in command messages. Most of these actions update data structures. Some of these generate additional messages. ** scalable_free ** the messages.

3. Filter 3 (Parallel) : Carries out some calculations and summarization.

4. Filter 4 (Serial) : Does not do much, actually; it is just a way to keep track of tokens exiting. Occasionally there needs to be a pruning of data structures. When it is time to do that, the input stage (1) will check with this stage to confirm that the pipeline is empty and it is safe to prune. See this thread http://software.intel.com/en-us/forums/showthread.php?t=68114 for more.

"It may seem paradoxical, but looking across time each item in a serial filter is likely to be processed in a different thread, and if you follow the item through the pipeline it stays in the same thread if it only moves to parallel filters. So your reasoning is correct, but the assumption was wrong."

Wow!! That is quite a relief to know. Now that you told me this, it seems logical that a serial filter switches threads too. My new understanding is: "The guarantee TBB provides is that only one instance of a serial filter will execute at a time. TBB does not guarantee that it will execute on any specific thread." Is this correct? I had a wrong mental picture of a factory with lines of conveyor belts and the serial filter sitting on one of them, distributing tokens to its own and other belts.

Really appreciate your help Raf. There are very few experts out there to ask for help.

Quoting - Vivek Rajagopalan
My new understanding is : "The guarantee TBB provides is that only one instance of a serial filter will execute at a time. TBB does not guarantee that it will execute on any specific thread." Is this correct ?

You provide the only filter instances that ever exist (they may be noncopyable). Only one item at a time will be processed by a serial filter, not necessarily on the same thread as its predecessor.

(Added) I myself would try to integrate the pipelines, perhaps with scalable_malloc memory only moving to parallel filters.
