Intel® Threading Building Blocks

Distributed, Docker*-ized Deep Learning with Intel® Nervana™ technology, neon™, and Pachyderm*

The recent advances in machine learning and artificial intelligence are amazing! It seems like we see something groundbreaking every day, from self-driving cars to AIs learning complex games. Yet, to deliver real value, data scientists must be able to get their models off their laptops and deployed within a company’s data pipelines and infrastructure.

  • Students
  • Artificial Intelligence
  • Python*
  • Intermediate
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® Distribution for Python*
  • Intel® Math Kernel Library
  • Intel® MPI Library
  • Intel® Threading Building Blocks
  • Neon*
  • Pachyderm*
  • neon
  • Big Data
  • Machine Learning

    TBB container free causing segfault


    We are using the Intel® TBB concurrent containers concurrent_unordered_map (as the outer level) and concurrent_vector (as the inner level) to build a hash map that allows concurrent lookup and growth. However, when the map uses a large amount of memory (>500 GB; the Linux machine has ~1 TB of RAM), freeing it causes a segfault as follows:

    Enqueue tasks and wait for the workers' completion



    I have an application in which the main thread spawns a std::thread at the beginning of the program. I define two task_arena objects and two task_group objects that are shared by the two master threads of my application. I want the first thread to use the first arena and first group, and the second thread to use the second arena and second group.

    For the moment my code looks like this:

    Nested parallel_for with mutex hangs

    I have a set of data blocks that I process using a parallel_for loop. These data blocks are held in a pool that may be compressed. The first thread to access a block in the compressed pool triggers an uncompress routine. I have a mutex that ensures the uncompress routine is executed by only one task thread, but the uncompress routine uses its own parallel_for loop to speed up the decompression. When the inner parallel_for loop ends, control doesn't return to the parent task that started the uncompress routine.

    Implementing a Synchronous DataFlow Graph using Intel Flow Graph


    I started investigating Intel TBB recently and was thinking about the possibility of implementing an application specified as a Synchronous DataFlow Graph using function and queue nodes. It seems to me doable in a straightforward manner. Could someone confirm? Any thoughts?

    How to find where my program is spinning?


    I have looked through the forums and other TBB resources. Based on VTune, I can see my program is spending a lot of time spinning, but I have not yet found out where.

    I have Parallel Studio and would appreciate any advice on how to find out where the program is spinning so I can fix it. Overall, it seems my parallelization is not very well balanced, and I am trying to figure out where the problems are.



    I remember reading somewhere that if you link TBBMalloc or use the Scalable Allocator, TBB pre-allocates some amount of memory per thread to avoid implicit synchronization. But I can't find this any more. I thought I had found it in the TBB book, but it looks like it wasn't there.

    Does any per-thread preallocation happen in the Scalable Allocator or in TBBMalloc?

    Many Thanks.
