I had recently the chance to test my raytracer based on TBB 3.0 on a 48 Core AMD NUMA plattform (so that are 8 Processor organized in 2 boards connected with hypertransport, each board with 4 CPUs). Sadly the results were disastrous.
My first try was to just use parallel-for with a grainsize of 24x24 which gave me the best results on SMP Machines so far. This actually resulted in 48 Cores being actually about 20% slower than 24 Cores.
So my new approach was to just use a small parallel_for loop from 1 to number of cores and maintain a simple stack where I pull blocks to render from (so just 1 atomic increment for each tile, with about 2000 tiles per FullHD Frame and no mutexes whatsoever). The results where a lot better, 24 Cores were about 10% faster, 48 Cores were about 30% faster than before.
Nonetheless: 48 Cores are about 4% faster than 24 Cores which is a littlebit rediculous. Whats even more interesting: Using 24 Cores I get about 50% CPU usage (so exactly 4 of the 8 NUMA nodes run close to 100%just like it should be). Upping to 48 Cores gives me about 60% CPU usage with still 4 NUMA nodes peaking at 100% while the other 4 are more or less idle at 5% maximum. It also doesnt improve if I massively increase the amount of work to be done for each pixel, so I doubt that the atomic operation for pulling from my stack have any influence.
Although the hypertransport will slow down memory access a little ( from what I have read so far it should be 30% slower compared to a direct memory access on the same board), this is nowhere near the performance that should be possible. It actually looks to me like the TBB Scheduler / Windows 2008 Server R2 Scheduler puts 2 threads on each core of one board and leaves the second board pretty much idle.
Does anyone have an Idea what might go wrong?
P.S. By the way, scaling up to 24 Cores is pretty ok, considering there is still some serial part in my tracer:
1 Core: Factor 1.0
2 Cores: Factor 2.0
4 Cores: Factor 3.98
Cores: Factor 7.6
12 Cores: Factor 10.9
16 Cores: Factor 14.2
Cores: Factor 19.5
32 Cores: Factor 23.18
48 Cores: Factor 22.0 <- This is not good