Intel® Threading Building Blocks

Intel TBB: Performance under hyperthreading

At the moment I am testing some parallel scheduling libraries such as TBB, OpenMP, and XKAAPI. To get a quick first impression I implemented a naive matrix-matrix multiplication, first for floating-point entries, then for uint64 entries. I run the tests on a NUMA machine with 4 nodes, each with 8 cores (Intel Xeon CPU E5-4620 0 @ 2.20GHz) and 96 GB of RAM.
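For readers who want something concrete, a naive kernel of the kind described could look like the sketch below. The name naive_matmul, the row-major std::vector layout, and the square-matrix restriction are assumptions made for illustration, not the poster's actual code. It is kept serial on purpose; the outer loop over rows is the one a library such as TBB or OpenMP would typically split across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive O(n^3) product of two n x n matrices stored row-major in flat vectors.
// T can be float, double, std::uint64_t, ... (unsigned overflow simply wraps).
// Hypothetical helper for illustration only.
template <typename T>
std::vector<T> naive_matmul(const std::vector<T>& a,
                            const std::vector<T>& b,
                            std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<T> c(n * n, T{});
    // Iteration i writes only row i of c, so the rows are independent and
    // this outer loop is the natural candidate for parallelization.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            T sum{};
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

Calling naive_matmul(a, b, n) with std::vector<float> and then std::vector<std::uint64_t> inputs covers both element types mentioned in the post.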

Proposing the boost range interface for parallel constructs

I want to know what you guys think about the following additions to the interface of some parallel constructs.

I think almost everybody agrees that the interface of the std:: algorithms is a bit verbose and that the boost::range algorithms, together with their adaptors, have a much cleaner interface.

std::vector<int> vec{3, 2, 1};
boost::for_each(vec, some_lambda);
boost::sort(vec);

We could introduce these interfaces to tbb, too.
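To make the proposal concrete, here is a minimal sketch of what such range overloads could look like. The namespace range_sketch and the function sum_after_sort_example are hypothetical, and the sketch forwards to std::for_each and std::sort so that it compiles without TBB; in a TBB version the inner calls would presumably be the iterator-pair forms tbb::parallel_for_each(first, last, f) and tbb::parallel_sort(first, last).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

namespace range_sketch {  // hypothetical namespace, not part of TBB or Boost

// Range overload: forwards to the iterator-pair algorithm.
// std::for_each is a serial stand-in for a parallel for_each.
template <typename Range, typename Func>
void for_each(Range& r, Func f) {
    std::for_each(std::begin(r), std::end(r), f);
}

// Range overload for sorting; std::sort is a stand-in for a parallel sort.
template <typename Range>
void sort(Range& r) {
    std::sort(std::begin(r), std::end(r));
}

}  // namespace range_sketch

// Usage in the style of the boost::range snippet from the post.
inline int sum_after_sort_example() {
    std::vector<int> vec{3, 2, 1};
    range_sketch::sort(vec);
    assert(std::is_sorted(vec.begin(), vec.end()));
    int sum = 0;
    // Capturing by reference is only safe here because the stand-in is
    // serial; a parallel body would need an atomic or a reduction.
    range_sketch::for_each(vec, [&sum](int x) { sum += x; });
    return sum;
}
```

The wrappers take the range by non-const reference to keep the sketch short; a fuller version would also accept const ranges and temporaries such as adaptor results.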

ERROR: enumerable_thread_specific prematurely deleted

I encountered a strange problem with enumerable_thread_specific: an enumerable_thread_specific object got deleted prematurely when a root task using it was spawned from a variadic template function (which I added for convenience to a base class of the task). See attached code. Tested using gcc 4.7.0 and 4.8 only.

size of task_list

I wonder whether there is a (legal) way to obtain the size of a tbb::task_list.

I would like to write a little template class that implements a continuation task, using CRTP to do the actual work. The simplest way to do this seems to be to get the size of a tbb::task_list in order to set the correct ref_count; see attached sample code.

Btw, wouldn't it be nice if tbb came with little helper classes like this?

Two questions about tbb::memory_pool< tbb::scalable_allocator<char> >

For tbb::memory_pool< tbb::scalable_allocator<char> > shared_memory_pool_: am I correct that it pre-allocates a chunk of memory to avoid calling the system malloc at runtime? For example, after we call shared_memory_pool_.malloc(15000000), would it stop calling the system malloc and just allocate from the pre-allocated memory until a request goes out of range (over 15000000) and the pool size has to be extended?

Possible concurrent_queue improvement

Hi,

I was wondering why the concurrent queue uses compare-and-swap to get the next ticket instead of fetch-and-increment (in concurrent_queue_base_v3::internal_pop_if_present and in concurrent_queue_base_v3::internal_push_if_not_full).

Calling compare and swap may harm the performance of the queue under high contention - the reasons for this are best explained by Dave Dice here: https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs
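The two ways of taking a ticket can be sketched with std::atomic as follows. This is not the TBB queue code; the function names and the 64-bit counter are assumptions for illustration. The third function shows one thing a compare-and-swap loop can do that an unconditional fetch-and-add cannot: decline to take a ticket when a condition fails, which may matter for operations of the "if present" or "if not full" kind.

```cpp
#include <atomic>
#include <cstdint>

// Variant A: compare-and-swap retry loop. Under contention a thread may
// fail and retry several times before it wins a ticket.
inline std::uint64_t next_ticket_cas(std::atomic<std::uint64_t>& counter) {
    std::uint64_t seen = counter.load();
    // On failure compare_exchange_weak stores the current value in `seen`.
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
    }
    return seen;
}

// Variant B: fetch-and-add. Always succeeds in a single atomic operation.
inline std::uint64_t next_ticket_faa(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1);
}

// Conditional variant: take a ticket only while counter < limit.
// Returns false, leaving the counter untouched, once the limit is reached.
inline bool try_next_ticket_cas(std::atomic<std::uint64_t>& counter,
                                std::uint64_t limit,
                                std::uint64_t& ticket) {
    std::uint64_t seen = counter.load();
    while (seen < limit) {
        if (counter.compare_exchange_weak(seen, seen + 1)) {
            ticket = seen;
            return true;
        }
    }
    return false;
}
```

Both unconditional variants hand out the same sequence of tickets; they differ only in how many atomic operations a thread may need when many threads compete for the counter.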
