(cross-posted from my blog post on the Intel Software Blog):
Memory access characteristics in manycore NUMA systems are not always obvious to the programmer. A process may see widely varying latency and bandwidth for memory accesses depending on which CPU it is running on and which memory node holds the data.
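To make the CPU-versus-memory-node distinction concrete, here is a minimal sketch (not the harness used for the results in this post) that lists which CPUs belong to which NUMA node on Linux and builds one numactl command line per (CPU node, memory node) combination. It assumes a Linux system that exposes /sys/devices/system/node and has numactl installed. The benchmark name ./membench is a hypothetical placeholder for whatever latency or bandwidth test you want to run, and the helper names are my own.

```python
import itertools
import os

# Linux sysfs directory that exposes one "nodeN" subdirectory per NUMA node.
SYSFS_NODE_DIR = "/sys/devices/system/node"


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(node_dir=SYSFS_NODE_DIR):
    """Return {node_id: [cpu, ...]}; empty dict if node_dir does not exist."""
    topology = {}
    if not os.path.isdir(node_dir):
        return topology
    for entry in os.listdir(node_dir):
        if entry.startswith("node") and entry[4:].isdigit():
            with open(os.path.join(node_dir, entry, "cpulist")) as f:
                topology[int(entry[4:])] = parse_cpulist(f.read())
    return topology


def placement_commands(nodes, benchmark="./membench"):
    """Build one numactl command per (CPU node, memory node) pair.

    'benchmark' is a hypothetical placeholder, not a real tool.
    """
    return [
        f"numactl --cpunodebind={cpu_node} --membind={mem_node} {benchmark}"
        for cpu_node, mem_node in itertools.product(sorted(nodes), repeat=2)
    ]


if __name__ == "__main__":
    topo = numa_topology()
    for node, cpus in sorted(topo.items()):
        print(f"node {node}: CPUs {cpus}")
    for cmd in placement_commands(topo):
        print(cmd)
```

Running each generated command and comparing the results across pairs gives a node-by-node picture of local versus remote access. The script only prints the commands; it does not run or time anything itself.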
My initial results show that the Intel MTL machines exhibit nearly constant memory access latency, with bandwidth varying by up to 2 GB/s. Another architecture, by comparison, exhibits latencies varying by up to 200 ns and memory bandwidth varying by up to 4 GB/s. This speaks well of the Intel design, as latency-sensitive applications may not notice the effects of the NUMA architecture when running on MTL machines. For bandwidth-sensitive applications, however, NUMA still presents a significant design challenge regardless of the architecture.
Further tests are underway to determine how memory access varies between individual cores. Higher-level application benchmarking will follow, to better understand the effects of manycore NUMA designs on high-performance computing applications.






