concurrent_unordered_map crashes in insert() or find()

I have a multithreaded program which uses concurrent_unordered_map (latest TBB version, built from source). It crashes after running for several hours. The core dump is:

(gdb) where
#0 flist_iterator (other=..., this=<synthetic pointer>) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:90
#1 get_bucket (bucket=<optimized out>, this=<optimized out>) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1333
#2 internal_insert (value=..., this=0x7f75a404e500) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1127
#3 insert (value=..., this=0x7f75a404e500) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:860
#4 ngram_segment<20>::segment (this=this@entry=0x7f75a404e4f0, sentence=..., result=..., debug=debug@entry=false) at /usr/local/include/segment/segment.h:319

The code around 319 is:

{
  auto it = dp_cache_.find(sentence);
  if (it == dp_cache_.end()) {
    dp_cache_.insert(make_pair(sentence, result));
  }
}

And the types:

typedef pair<float, vector<int>> segmentation;

concurrent_unordered_map<string, segmentation> dp_cache_;

I tried concurrent_hash_map and different versions of TBB, as well as pre-built binaries. They all crash.

Has anyone seen similar cases?

Any help is appreciated!


I just tried the debug version of the TBB libraries, and now it shows more information, though I still don't understand why.

(gdb) where
#0 0x00007f310ee37425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f310ee3ab8b in __GI_abort () at abort.c:91
#2 0x00007f310a763234 in rml::internal::assertion_failure (filename=0x7f310a76f948 "../../src/tbbmalloc/frontend.cpp", line=1805, expression=0x7f310a76fc68 "allocatedCount < (slabSize-sizeof(Block))/objectSize", comment=0x0)
at ../../src/tbbmalloc/../tbb/tbb_assert_impl.h:84
#3 0x00007f310a76c9da in rml::internal::Block::allocateFromFreeList (this=0x7f310a4a0000) at ../../src/tbbmalloc/frontend.cpp:1805
#4 0x00007f310a76cb09 in rml::internal::Block::allocate (this=0x7f310a4a0000) at ../../src/tbbmalloc/frontend.cpp:1832
#5 0x00007f310a76a76a in rml::internal::internalPoolMalloc (memPool=0x7f310a978ea0 <rml::internal::defaultMemPool_space>, size=56) at ../../src/tbbmalloc/frontend.cpp:2239
#6 0x00007f310a76aae5 in rml::internal::internalMalloc (size=56) at ../../src/tbbmalloc/frontend.cpp:2324
#7 0x00007f310a76b076 in scalable_malloc (size=56) at ../../src/tbbmalloc/frontend.cpp:2566
#8 0x00007f310f9f1adb in tbb::internal::allocate_via_handler_v3 (n=56) at ../../src/tbb/cache_aligned_allocator.cpp:237
#9 0x000000000040a671 in allocate (this=<optimized out>, n=<optimized out>) at /usr/include/tbb/internal/../tbb_allocator.h:109
#10 create_node (order_key=<optimized out>, this=<optimized out>) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:276
#11 insert_dummy (order_key=<optimized out>, it=..., this=0x10c3290) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:510
#12 init_bucket (bucket=1812, this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1289
#13 tbb::interface5::internal::concurrent_unordered_base<tbb::interface5::concurrent_unordered_map_traits<std::string, std::pair<float, std::vector<int, std::allocator<int> > >, tbb::interface5::internal::hash_compare<std::string, tbb::tbb_hash<std::string>, std::equal_to<std::string> >, tbb::tbb_allocator<std::pair<std::string const, std::pair<float, std::vector<int, std::allocator<int> > > > >, false> >::init_bucket (this=0x10c3280, bucket=3860) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1284
#14 0x000000000040b5ba in init_bucket (bucket=7956, this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1284
#15 internal_find (key=..., this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1181
#16 find (key=..., this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:921

Hello, Benyu,

Thanks for the report.  Can you tell me what kind of failure it is?  The traceback is at the copy-construction of an flist_iterator as a return value, which is constructed on the stack, so it doesn't seem to be a heap failure.  And the initialization of an element when inserting into the map should get a segfault if there is no memory available, but it should be in the split_ordered_list.

Do you have an idea of how many items are inserted into the map before the failure?

Regards,
Chris Huson

Hi Christopher,

I don't have the exact number at hand, but I believe it's less than 10K. 

Hi Christopher,

BTW, I tried using the debug version of the TBB libraries. Here is the core dump of another crash (this time in find()). I hope it provides more information.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `build/segment_service -w segment_service/ -P 8003'.
Program terminated with signal 6, Aborted.
#0 0x00007f310ee37425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0 0x00007f310ee37425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f310ee3ab8b in __GI_abort () at abort.c:91
#2 0x00007f310a763234 in rml::internal::assertion_failure (filename=0x7f310a76f948 "../../src/tbbmalloc/frontend.cpp", line=1805, expression=0x7f310a76fc68 "allocatedCount < (slabSize-sizeof(Block))/objectSize", comment=0x0) at ../../src/tbbmalloc/../tbb/tbb_assert_impl.h:84
#3 0x00007f310a76c9da in rml::internal::Block::allocateFromFreeList (this=0x7f310a4a0000) at ../../src/tbbmalloc/frontend.cpp:1805
#4 0x00007f310a76cb09 in rml::internal::Block::allocate (this=0x7f310a4a0000) at ../../src/tbbmalloc/frontend.cpp:1832
#5 0x00007f310a76a76a in rml::internal::internalPoolMalloc (memPool=0x7f310a978ea0 <rml::internal::defaultMemPool_space>, size=56) at ../../src/tbbmalloc/frontend.cpp:2239
#6 0x00007f310a76aae5 in rml::internal::internalMalloc (size=56) at ../../src/tbbmalloc/frontend.cpp:2324
#7 0x00007f310a76b076 in scalable_malloc (size=56) at ../../src/tbbmalloc/frontend.cpp:2566
#8 0x00007f310f9f1adb in tbb::internal::allocate_via_handler_v3 (n=56) at ../../src/tbb/cache_aligned_allocator.cpp:237
#9 0x000000000040a671 in allocate (this=<optimized out>, n=<optimized out>) at /usr/include/tbb/internal/../tbb_allocator.h:109
#10 create_node (order_key=<optimized out>, this=<optimized out>) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:276
#11 insert_dummy (order_key=<optimized out>, it=..., this=0x10c3290) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:510
#12 init_bucket (bucket=1812, this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1289
#13 tbb::interface5::internal::concurrent_unordered_base<tbb::interface5::concurrent_unordered_map_traits<std::string, std::pair<float, std::vector<int, std::allocator<int> > >, tbb::interface5::internal::hash_compare<std::string, tbb::tbb_hash<std::string>, std::equal_to<std::string> >, tbb::tbb_allocator<std::pair<std::string const, std::pair<float, std::vector<int, std::allocator<int> > > > >, false> >::init_bucket (this=0x10c3280, bucket=3860) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1284
#14 0x000000000040b5ba in init_bucket (bucket=7956, this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1284
#15 internal_find (key=..., this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1181
#16 find (key=..., this=0x10c3280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:921
#17 segmentation_in_cache (result=..., sentence=..., this=0x10c3270) at /usr/local/include/segment/segment.h:249

The code is

     auto it = dp_cache_.find(sentence);
     if (it == dp_cache_.end()) {
        return false;
     }
     result = it->second;
     return true;
}

Thanks, Benyu, for the detailed traceback.  It looks like the TBB allocator is failing in an attempt to get a block from the free list.  I will talk to the person responsible for the allocator.

Regards,
Chris

Thanks! Here is the memory configuration.

$ free
             total       used       free     shared    buffers     cached
Mem:      32865628   22334056   10531572          0     433452   12874616
-/+ buffers/cache:             9025988   23839640
Swap:     33443836          0   33443836

Hi, Benyu,

Could you try something for me?  In the directory with the TBB shared library (tbb_debug.dll or libtbb_debug.so) there is also a library for the allocator (tbbmalloc_debug.dll or libtbbmalloc_debug.so).  Could you rename that file to something else so it is not loaded when you run your test, and try it one more time?

Thanks for the help, and regards,
Chris

I cannot run my test:

error while loading shared libraries: libtbb_debug.so.2: cannot open shared object file: No such file or directory

(I am linking my binary with TBB_DEBUG_Libraries; is that the reason?)

Hello, Benyu,

It looks like you may have renamed the TBB library instead of (or in addition to) the tbbmalloc library.  Only rename the allocator library (the TBB library should then fall back to the regular malloc/free, if I remember right).

Thanks for the quick reply, and regards,
Chris

Sorry, you are right. Now running.

It's still running. I guess that was the cause, since it crashed fairly quickly when running with the debug libraries.

Unfortunately, it crashed:

(gdb) where
#0 0x000000000040a4e5 in insert_dummy (order_key=<optimized out>, it=..., this=0x154e290) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:518
#1 tbb::interface5::internal::concurrent_unordered_base<tbb::interface5::concurrent_unordered_map_traits<std::string, std::pair<float, std::vector<int, std::allocator<int> > >, tbb::interface5::internal::hash_compare<std::string, tbb::tbb_hash<std::string>, std::equal_to<std::string> >, tbb::tbb_allocator<std::pair<std::string const, std::pair<float, std::vector<int, std::allocator<int> > > > >, false> >::init_bucket (this=0x154e280, bucket=472) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1289
#2 0x000000000040b5ba in init_bucket (bucket=2520, this=0x154e280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1284
#3 internal_find (key=..., this=0x154e280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:1181
#4 find (key=..., this=0x154e280) at /usr/include/tbb/internal/_concurrent_unordered_impl.h:921

Hello, Benyu,

I'm sorry I haven't got back to you before now.  If it took longer to fail, it was probably because of the global lock in the system malloc/free.  In any case, it looks like there is something going on in concurrent_unordered, whether or not there is a problem in the TBB malloc.

One more question.  You are only adding and fetching items from the table, not erasing, right? 

You've given me a snippet of the code you are using to insert items in the table.  I will try to write a small program that duplicates the problem.

I'll let you know how it goes.  Thanks for your patience.

Regards,
Chris

Hi Christopher,

I do call clear(). Is it unsafe for multithreaded use?

Hello Chris,

I do call clear(). Is it unsafe?

Hi, Benyu,

Yes, clear() should not be called in a concurrent context.  I am not sure what your application is doing, so I don't have a suggestion how to protect the call.

I've been having a problem with the small test case I wrote.  I insert an element into the concurrent_unordered_map, and then try to find it immediately.  The test fails, so I am still working on it.

Regards,
Chris

clear() is not thread-safe.

(Added after seeing previous posting) Inserting an answer into this forum isn't, either... :-)

Hi Chris,

So it seems there are two ongoing issues. I will add a lock to guard clear(), and you will work on the insert/find test case?

Hi, Benyu,

One additional comment on the design side.  The objects in the concurrent_unordered_map are heap-allocated (both the std::string key and the vector<int> component of the value).  The use of malloc/free on these structures will involve a global lock that will serialize their creation, as well as resizing of the vectors.

I would also ask whether the accesses of the items in the concurrent_unordered_map are all distinct (per thread).  The structure allows insertion of items without locking, but accessing the same item from multiple threads will also require a per-item lock (especially in the vector case, if resizes occur), unless we can guarantee algorithmically that only one thread at a time accesses each item.

If you know the likely largest size of the vector, you can reserve() that size and get rid of most reallocations.

I haven't tried using a heap-based object as a key before; I don't know if that is what is causing the insertions to fail in my test or (more likely) there is some oversight in my code.

Regards,
Chris

Hi Chris,

You raised several very good points. I am using the concurrent_unordered_map as a cache of dynamic programming. The sub-problem is the key (string), and the optimal solution for that sub-problem is the value (vector). The sub-problem could be reused within a problem, or across multiple problems, and each thread is working on a different problem. So multiple threads may access the same item at the same time.

And the value (vector) will not resize, so we don't need reserve()?

Re: heap or stack, I feel it's non-trivial to host the sub-problems and optimal solutions in local variables. I could use a memory pool as a class member, but it would still need a lock.

Hi, Benyu,

I have a small program that seems to work, which I am attaching to this note.  The command line is

cumap_demo [ <insertions> [ <value-mask> [ <do-serially> [ <find-first> ] ] ] ]

where

<insertions> - number of total insertions to attempt

<value-mask> - maximum number of distinct values to insert (see below)

<do-serially> - 1 if you want it to execute serially

<find-first> - 1 if you want a find() before an insert()

Random values in the range [1 : value-mask] are used as keys (after being converted to strings) in the concurrent_unordered_map, and the vector in the record keeps track of the indices that had this value.  I added a spin_mutex to the record to serialize operations on the vector.

I don't know if this resembles what you are trying to do.

Regards,
Chris

Attachments: 

cumap-demo.cpp (text/x-c++src, 7.12 KB)

Hi, Benyu,

Just noticed I left some older comments on compiling the demo in the file.  You can ignore the comments on -D values; I made them into command-line values so you don't have to recompile.

Regards,
Chris

Hi Chris,

Thanks for the demo code. It looks good, but in my case different threads may run insert_body() or check_body() at the same time. Would that cause any issues?

Hi Chris,

It seems clear() is what causes the crash. It has run well for a day since I added a guard around clear(). I will keep it running for one more day and see.

Just to make sure: what kind of guard are you using, and how? It doesn't help if the concurrent operations aren't also covered, and it defeats the purpose if the guard is placed around individual, supposedly concurrent, operations...

I used an RWSpinLock to guard the single thread which calls clear() (WriteHolder) and the multiple threads which call the concurrency-safe operations (ReadHolder).

OK, although a kernel-level mutex might be more appropriate in this case; I'll assume you implied that you perform many concurrent operations between acquiring and releasing a read lock.

Yes, your assumption is right. Any suggestion for a kernel-level mutex library/wrapper?
