Performance issues with tbbmalloc in case of large memory allocation

Hello,

We're using tbb42_20131118oss for Linux 64-bit (CentOS 5) in our product. Recently we discovered the following issue: in cases with large memory allocation, we noticed significant performance degradation after reaching a certain "critical point" in memory usage. A specific example:

1. The machine has about 1 terabyte of memory, mostly free; only our application was running.

2. Our application runs an algorithm that builds a large data structure. Ideally, we expect to see an approximately constant rate of memory increase over time until the end.

3. Until physical memory usage reaches around 250-300 GB, the algorithm works relatively fast. After this "critical point" it slows down dramatically (i.e., the same portion of the job that completed in 5 minutes before the critical point took about 5 hours after it). It finally finishes at about 465 GB of physical memory, and a runtime profiler shows a clear bottleneck in tbbmalloc.

4. It is worth mentioning again that the machine is otherwise idle: there is plenty of free memory, and our process consumes most of the CPU resources. We also do not see any memory fragmentation in our application.

We've tried not linking tbbmalloc at all (in which case standard malloc is used), but that causes crashes in multi-threaded runs, so it seems we cannot use TBB without linking tbbmalloc.

It is also worth mentioning that with an older version of TBB (2.2) we were not able to finish the run at all; it failed with a St9bad_alloc error after reaching a certain amount of memory (~400 GB). With 4.2 it runs to completion, but with unacceptable runtime, unfortunately.

Could you please advise us?


Vladimir,

I would like to clarify.

  1. You reported physical memory consumption as reported by the OS, right? Could you also report virtual memory consumption at the point where "fast" became "slow"? Do you know how much memory is actually in use by user objects (i.e., the total size of memory that was malloc()-ed but not free()-ed)?
  2. Do you use system malloc replacement (via LD_PRELOAD or by linking with tbbmalloc_proxy) or not? TBB is able to use tbbmalloc without linking; it's enough to have libtbbmalloc.so.2 in the same directory as libtbb.so.2.
  3. What is your allocation workload? I.e., many small objects of different sizes, a single "hot" large object size, etc.? Is allocation done from multiple threads, or does only a single thread allocate? Does the same thread that allocates also release?
  4. What specific part of tbbmalloc is the bottleneck?

Hello Alexandr,

Please see my answers below:

Quote:

Alexandr Konovalov (Intel) wrote:

Vladimir,

I would like to clarify.

  1. You reported physical memory consumption as reported by the OS, right?

Yes, this is correct.

Quote:

Alexandr Konovalov (Intel) wrote:

Could you also report virtual memory consumption at the point where "fast" became "slow"?

Virtual memory is about 30% higher, so around 300-320 GB, approximately. I cannot tell you more exactly, unfortunately, because this is a special machine and I'm not allowed to run my jobs there whenever I want.

Quote:

Alexandr Konovalov (Intel) wrote:

Do you know how much memory is actually in use by user objects (i.e., the total size of memory that was malloc()-ed but not free()-ed)?

Again, I cannot tell exactly; it is hard to measure, since it depends on how often the system reclaims free()-ed memory. But, as I said, we believe there should be no significant memory fragmentation caused by our application. Not sure this answers your question, though.

Quote:

Alexandr Konovalov (Intel) wrote:

  1. Do you use system malloc replacement (via LD_PRELOAD or by linking with tbbmalloc_proxy) or not? TBB is able to use tbbmalloc without linking; it's enough to have libtbbmalloc.so.2 in the same directory as libtbb.so.2.

We link tbbmalloc_proxy explicitly in the makefile.

Quote:

Alexandr Konovalov (Intel) wrote:

  1. What is your allocation workload? I.e., many small objects of different sizes, a single "hot" large object size, etc.?

Well, most of the memory is consumed by a relatively small set (about 50-100 objects) of std::vector<some_object> and std::vector< std::vector<some_object> > containers of various sizes; their total size may be really huge, as I said (~450-500 GB, finally). And we add objects there in small portions, using std::vector::push_back or std::vector::insert.

Quote:

Alexandr Konovalov (Intel) wrote:

  1. Is allocation done from multiple threads, or does only a single thread allocate?

From multiple threads. We try to avoid mutexes and have different threads fill different containers wherever possible, but that is not 100% guaranteed.

Quote:

Alexandr Konovalov (Intel) wrote:

  1. Does the same thread that allocates also release?

No, this is not guaranteed. Moreover, while the builder of this data structure is an essentially multi-threaded algorithm, the data is destroyed after the algorithm is done, during exit, in a single thread.

Quote:

Alexandr Konovalov (Intel) wrote:

  1. What specific part of tbbmalloc is the bottleneck?

We always see the following in the stack trace: operator new() from libtbbmalloc_proxy.so.2.

Please let me know if any further clarification is needed.

Thank you,

Vladimir.

While your system may have about 1 TB of physical memory, it may also have a memory policy whereby no single process can occupy more than a set number of pages. When this number is exceeded, the process may shift into paging mode. 5 minutes to 5 hours looks like a paging issue.

You could test this by allocating, say, 500 GB, then timing a loop that does an easy/fast "diddle" (a trivial write). Repeat a few times so that each page is touched more than once. ** Make sure that the "diddle" isn't optimized out by the compiler. Note: you only need to write once per page (typically 4 KB, but possibly 4 MB with huge pages).

You could modify the "diddle" test to repeat a few times, diddling 100 GB, 200 GB, 300 GB, 400 GB, 500 GB (using the same allocation).

If at some point the loop run time (2nd and later iterations) is not proportionately longer, but 50x or more longer, this indicates a policy issue restricting the amount of RAM your application can consume at any one point in time. If so, you may need to discuss this with the system administrator. Also ask whether they are running an OOM (Out Of Memory) service, and whether it needs to be tuned for your application too.

Jim Dempsey

www.quickthreadprogramming.com
