As part of some other research I was doing (for the memory system of our parallel dialect of ML, Manticore), I gathered a bunch of information on the memory bandwidth and latencies on the MTL machines and a 48-core AMD machine using enhanced versions of the STREAM benchmark. It ended up not fitting into the paper, but I thought the results might be useful/interesting to people here:
The biggest punchline is that Intel QPI provides so much bandwidth between processors that it's actually difficult to see any NUMA effect, even with poor placement strategies. You basically have to allocate and use everything on a single processor to see any bad effects. This is not the case on the AMD machine, where it's trivial to see latency spikes even when only part of the data is badly placed.
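For anyone who hasn't looked at STREAM before, here is a minimal sketch of the kind of kernel it times. To be clear, this is not the enhanced version used for the report: it is single-threaded, does nothing to control NUMA placement, and the names `triad` and `triad_bandwidth_gbs` are just illustrative. It only shows how a bandwidth figure falls out of a timed triad loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]. */
static void triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Best-of-`reps` triad bandwidth in GB/s for arrays of n doubles.
 * Returns -1.0 on allocation failure or if the result check fails. */
static double triad_bandwidth_gbs(size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Initialize every element so the pages are actually backed by
     * memory before the timed loop starts. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        triad(a, b, c, 3.0, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs > 0.0) {
            /* Triad touches three arrays: two reads and one write. */
            double gbs = 3.0 * (double)n * sizeof(double) / secs / 1e9;
            if (gbs > best)
                best = gbs;
        }
    }

    /* Read a result back so the work cannot be discarded: 1 + 3*2 = 7. */
    double check = a[n / 2];
    free(a); free(b); free(c);
    return check == 7.0 ? best : -1.0;
}

#ifdef TRIAD_DEMO_MAIN
int main(void)
{
    /* 16M doubles per array (128 MiB each), well beyond cache size. */
    printf("triad: %.2f GB/s\n", triad_bandwidth_gbs((size_t)1 << 24, 10));
    return 0;
}
#endif
```

Build it with something like `cc -O2 -DTRIAD_DEMO_MAIN triad.c -o triad`. To get a rough feel for local versus remote memory on Linux, one option is to run the same binary under `numactl`, for example `numactl --cpunodebind=0 --membind=0 ./triad` against `numactl --cpunodebind=0 --membind=1 ./triad`. A single thread generally won't saturate the interconnect, so treat the numbers as an illustration of the method rather than a reproduction of the results in the report.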
There's quite a bit more data in the little tech report above, though clearly not as much as some of you would like, or as there would be if this were being prepared for more general publication. Hope somebody finds it interesting/useful!