Intel® Threading Building Blocks

Enqueue tasks and wait for the workers' completion



I have an application whose main thread spawns a std::thread at the beginning of the program. I define two task_arena objects and two task_group objects that are shared by the two master threads of my application. I want the first thread to use the first arena and first group, and the second thread to use the second arena and second group.

For the moment my code looks like this:

Nested parallel_for with mutex hangs

I have a set of data blocks that I process using a parallel_for loop. These data blocks are held in a pool that may be compressed. The first thread to access a block that is in the compressed pool triggers an uncompress routine. I have a mutex that ensures the uncompress routine is executed by only one task thread. However, the uncompress routine uses its own parallel_for loop to speed up the decompression. When the inner parallel_for loop ends, control does not return to the parent task that started the uncompress routine.

Implementing a Synchronous DataFlow Graph using Intel Flow Graph


I started investigating Intel TBB recently and was thinking of the possibility of implementing an application specified as a Synchronous DataFlow Graph using function and queue nodes. It seems to me doable in a straightforward manner. Could someone confirm? Any thoughts?

How to find where my program is spinning?


I have looked through the forums and other TBB resources, and based on VTune I can see my program is spending a lot of time spinning, but I have not yet found out where it is spinning.

I have Parallel Studio and would appreciate any advice on how to find out where the program is spinning so I can fix it. Overall, it seems my parallelization is not very well balanced, and I am trying to figure out where the problems are.



I remember reading somewhere that if you link TBBMalloc or use the Scalable Allocator, TBB will pre-allocate some amount of memory per thread to avoid implicit synchronization. But I can't find this any more. I thought I had found it in the TBB book, but it looks like it wasn't there.

Does any per-thread preallocation happen in the Scalable Allocator or in TBBMalloc?

Many thanks.

flow::graph: graph.wait_for_all() loads one core while doing nothing useful

void graph_test2()
{
	const int NPROC = 20;

	tbb::flow::graph g;

	tbb::flow::broadcast_node< tbb::flow::continue_msg > start( g );

	std::vector< tbb::flow::continue_node< int > > workers;
	workers.reserve( NPROC );

	std::vector<double> SUM( NPROC, 0. );

	for(int i=0; i<NPROC; ++i) {
		// Each worker accumulates into its own slot of SUM.
		auto work = [&, i](const tbb::flow::continue_msg &) -> int
		{
			double & sum = SUM[i];
			for(int k = 0; k < 1000000000; ++ k) {
				sum += k;
			}

			return i;
		};
		// ...