I had a short window of opportunity to run my CnC++ code on a machine with a large number of processors and cores. Strangely enough, the performance graph was completely flat; the number of threads had no effect at all. The exact same code scales normally under the same Linux OS on our single-socket 8-core machine.
The benchmark results from "Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems" do show scaling on NUMA machines, so I was wondering what I could do to make it run properly.


