I'm looking for a TBB implementation of dense 2D matrix-matrix multiplication that scales on ccNUMA (e.g., QuickPath) machines. I tried the cache-oblivious version, but I was surprised to find that it didn't scale much better than the naive O(N^3), cache-unfriendly version. I tried a whole spectrum of base-case sizes (the size of the sequential leaves) for the recursive cache-oblivious algorithm, without success. On a 4x6-core machine I get good scaling up to 6 workers, and then performance flattens out: ~1x with 1 worker, ~5x with 6 workers, and ~7x with 24 workers (with exclusive use of the machine).
I am attempting to pin workers to cores (http://software.intel.com/en-us/forums/topic/365339), but I'm unsure whether that will help substantially even if I succeed.