I am currently testing several parallel scheduling libraries such as TBB, OpenMP, XKAAPI and so on. To get a quick first impression I implemented a naive matrix-matrix multiplication, first for floating-point entries, then for uint64 entries. I run the tests on a NUMA machine with 4 nodes, each with 8 cores (Intel Xeon CPU E5-4620 0 @ 2.20GHz) and 96 GB of RAM.
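For reference, the kind of kernel meant by "naive" here could look roughly like the following serial sketch. It is not my exact benchmark code: the function name `naive_matmul`, the flat row-major storage and the template parameter `T` (standing in for the floating-point and uint64 entry types) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Naive O(m*n*p) product C = A * B without blocking or manual vectorisation.
// A is m x n, B is n x p, C is m x p, all stored row-major in flat vectors.
// T is the entry type, e.g. double or a 64-bit unsigned integer.
template <typename T>
std::vector<T> naive_matmul(const std::vector<T>& A, const std::vector<T>& B,
                            std::size_t m, std::size_t n, std::size_t p) {
    std::vector<T> C(m * p, T{});
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            T sum{};
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
    return C;
}
```

The parallel variants distribute the two outer loops (over `i` and `j`) across threads and leave the inner loop over `k` serial.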
TBB outperforms all of the other libraries up to 32 threads, but libraries like OpenMP and XKAAPI do much better once hyperthreading comes into play. In some examples TBB even gives worse timings with 64 threads than with 32. I run the program with big matrices, so we are talking about computations in the range of an hour on 32 cores. I also vary the shape of the matrices, but the overall behaviour stays the same. I have implemented different variants for all the libraries, mostly 1D and 2D; for TBB these use blocked_range (1D) and blocked_range2d (blocked_range3d gave worse performance overall). I also test with the auto partitioner, the affinity partitioner and the simple partitioner. For TBB, as one might expect for this kind of naive matrix-matrix multiplication, the 2D implementation with the simple partitioner mostly works best. Trying different settings with numactl does not change the performance differences between the libraries either.
Note that the behaviour of Intel TBB described above occurs for exactly one setting: 2D plus simple partitioner. All other implementations scale rather well with hyperthreading. The strange thing is that TBB 2D plus simple partitioner outperforms all other libraries and variants by far up to the number of physical cores, and beyond that it performs worst of all the libraries mentioned.
Looking at the assembly of my code, the compiler (gcc) seems to generate the very same conversion to xmm SSE vectors for multiplication and addition for all the schedulers. I thought this could be a bottleneck, but that does not seem to be the case.
Has anyone had similar experiences? Is there any hint you could give me to improve Intel TBB's behaviour under hyperthreading? Note that the naive matrix-matrix multiplication is just a simple tool for comparing the behaviour of the libraries mentioned above; the final application will be different, but also far more time-consuming to implement. I am thankful for any comments, hints or suggestions.