All about NUMA with Intel Sr. SW Engineer David Ott - Parallel Programming Talk Show #113

Welcome to Parallel Programming Talk Show #113. Intel Senior Software Engineer David Ott is here to talk about NUMA, and we will introduce him in a few minutes.

Don’t forget – if you have comments, questions, suggestions send them to

Here’s the news:

· Multicore Programming Summer School, week of July 25th, at the Universal Parallel Computing Research Center in Urbana-Champaign. If you haven't registered, hurry up; it is almost full. June 24 is the deadline (or until it is full). Clay is teaching. More info: .

· The 24th International Workshop on Languages and Compilers for Parallel Computing (LCPC) will take place at Colorado State University, Fort Collins, Colorado on September 8-10, 2011. The Third Annual Workshop on Concurrent Collections will be co-located with LCPC on September 7, 2011. More info:

· The European event, the International Supercomputing Conference, runs June 19-23 in Hamburg, Germany. Intel will definitely be there.

Guest Segment

A little bit about our guest: David Ott is a Senior Software Engineer with Intel's Software and Services Group. He joined Intel in 2005 as a middleware systems engineer for the Technology and Manufacturing Group. David holds M.S. and Ph.D. degrees in Computer Science from the University of North Carolina at Chapel Hill.

Currently, David works on enterprise server platforms with a focus on performance and power efficiency.

Dave – thanks for being on the show

Q: What is NUMA?

An acronym for "Non-Uniform Memory Access"

· A shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system. A good way to understand it is to compare it to UMA, or "Uniform Memory Access." Under UMA, each processor uses the same shared bus to access memory, so memory access time is uniform across all processors, no matter which memory module contains your data.

Q: So how is NUMA different from UMA?

Each processor has its own LOCAL memory module.
Access to local memory is fast, which is a great performance advantage.
At the same time, each processor can also access the local memory modules of other processors using a shared bus, but at the cost of greater memory access time. So memory access time is NON-UNIFORM: if data is in the local module, access is fast; if data is in a remote module belonging to another processor, access is slower.

Q: How is this different than cache?

Cache is a middle layer. It sits between the processor and main memory, while NUMA describes the main memory architecture itself: where the memory modules are placed relative to the processors.

Q: Do Intel Xeon servers based on Core i7 use NUMA or UMA?

Here's where things get a little more complicated: Cores that are grouped together on the same processor package or "node" share access to memory modules using the UMA shared memory architecture. At the same time, cores can also access memory modules from other nodes using a fast interconnect technology called Intel QuickPath Interconnect (QPI). QPI is a great new technology that mitigates the problem of slower remote memory access, but doesn't eliminate it entirely. So within a single processor node, the model is uniform memory access, but in the multi-node context of the server as a whole, the model is NUMA.

Q: What is the key advantage of NUMA?

The potential to reduce memory access time in the average case. Each processor can access its local memory in parallel and avoid the throughput and contention issues associated with a shared memory bus, but note that there is also a risk here: if data is not in the processor's local memory, then retrieving it from the remote memory of an adjacent processor will be significantly slower. In general, as the DISTANCE from a given processor to the memory module containing the data increases, the cost of accessing memory increases.

So the key issue when working with a NUMA system is DATA PLACEMENT. If you can keep your data in memory local to the processor, you'll exploit the performance benefits of NUMA, but if your data fails to be local, then your performance may suffer from the architecture.

Q: I've heard you use the terms "NUMA-aware" and "NUMA-friendly" with software. What do you mean by that?

Software applications that effectively manage the placement of their data to take advantage of the NUMA architecture are said to be "NUMA-aware," "NUMA-friendly," or "optimized for NUMA." The good news here is that there are several strategies available to realize this.

FIRST, you can make use of PROCESSOR AFFINITY. Today's operating systems support explicit assignment of application threads to a single core or group of cores within the same processor node. The idea is that once the thread has been affinitized, it will always run on that node. The advantage here is that data placed in local memory will remain local to the thread whenever it runs. Without affinity, the OS scheduler may sometimes choose a different processor node to run the thread. Whenever that happens, data access becomes remote and memory performance suffers. Using OS affinity support can be a big aid in working with (and not against) the NUMA characteristics of the platform.
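As a concrete illustration, here is a minimal Python sketch of processor affinity using the standard-library os.sched_setaffinity call. It assumes a Linux system (the call is Linux-only), and the choice of pinning to the lowest-numbered allowed CPU is arbitrary, just for illustration:

```python
# Sketch: pin the current process to a single CPU so that the memory
# it touches tends to stay local to one NUMA node.
# Assumes Linux; os.sched_setaffinity is not available on other OSes.
import os

def pin_to_one_cpu():
    """Pin the calling process to one CPU from its current affinity mask."""
    if not hasattr(os, "sched_setaffinity"):
        return None                       # affinity API not available here
    cpus = os.sched_getaffinity(0)        # CPUs we are currently allowed on
    target = min(cpus)                    # arbitrary pick for illustration
    os.sched_setaffinity(0, {target})     # restrict to that single CPU
    return os.sched_getaffinity(0)        # should now be a one-element set

if __name__ == "__main__":
    print(pin_to_one_cpu())
```

In a C program, the equivalent per-thread pinning would typically go through pthread_setaffinity_np or sched_setaffinity directly.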

Q: What other strategies can a programmer use to make their software NUMA-friendly?

Another strategy to improve memory performance on a NUMA system is managing DATA PLACEMENT. One way to do this is to look for special system APIs that support configuring the location of memory pages. One example is the "libnuma" library for Linux. Some supported operations include:

o Associating particular virtual memory address ranges with a particular node

o Designating a particular node when making a memory allocation call

o Migration of memory pages from one node to another

o Monitoring memory access behavior

These operations are especially helpful when the thread that allocates memory and the thread that accesses it are different.
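For illustration, a couple of the operations above can be exercised from Python through ctypes. This is only a sketch, assuming a Linux system with libnuma installed; numa_available, numa_max_node, numa_alloc_onnode, and numa_free are real libnuma entry points, but error handling here is minimal:

```python
# Sketch: query NUMA availability and allocate memory on a specific
# node via libnuma. Assumes Linux with libnuma present; degrades to a
# message string otherwise.
import ctypes
import ctypes.util

def numa_demo():
    path = ctypes.util.find_library("numa")
    if path is None:
        return "libnuma not found"
    numa = ctypes.CDLL(path)
    if numa.numa_available() < 0:
        return "NUMA not supported on this system"
    max_node = numa.numa_max_node()              # highest node number
    numa.numa_alloc_onnode.restype = ctypes.c_void_p
    numa.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
    numa.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    size = 4096
    buf = numa.numa_alloc_onnode(size, 0)        # one page placed on node 0
    if buf:
        numa.numa_free(buf, size)
    return "nodes 0..%d" % max_node

if __name__ == "__main__":
    print(numa_demo())
```

In a C program you would simply include numa.h and link with -lnuma instead of going through ctypes.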

Q: What if a NUMA system API is not available?

A programmer can study the OS's patterns of memory allocation and work with them smartly. For example, some OSes will assign memory pages to physical memory local to the requester, while others will wait for the first memory access to commit the page assignment (REQUESTER LOCATION vs. FIRST ACCESS). In the former case, a programmer can try to ensure that memory allocation requests are made by the same thread that will later access the data. In the latter case, a programmer can strategically generate preliminary memory accesses to establish page location. In general, multiple threads accessing the same data should be co-located on the same node.
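The FIRST ACCESS (first-touch) idea can be sketched as follows. Note this Python version only illustrates the structure, since CPython touches memory at allocation time; a faithful first-touch demonstration would be written in C, where malloc'd pages are committed lazily. The key point is that the worker thread that will use the data also performs the allocation and the preliminary writes:

```python
# Sketch of the FIRST ACCESS (first-touch) strategy: the thread that
# will use the data allocates it and writes one byte per page first,
# so under a first-touch OS policy the pages land on that thread's node.
import threading

results = {}

def worker():
    # Allocate AND first-touch the buffer on the accessing thread.
    buf = bytearray(1 << 20)             # 1 MiB working buffer
    for i in range(0, len(buf), 4096):   # write one byte per 4 KiB page
        buf[i] = 1
    results["sum"] = sum(buf)            # later accesses stay "local"

t = threading.Thread(target=worker)
t.start()
t.join()
print(results["sum"])   # 256 pages touched, one byte each -> 256
```

The same structure applies to thread pools: hand each worker its own slice of the data and let that worker initialize it, rather than initializing everything from the main thread.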

Q: We’re out of time.  Where can our listeners go for more info?

· UMA and NUMA are widely discussed in the Computer Science literature, including most college level textbooks.

· Computer Architecture textbooks, distributed and parallel computing textbooks

· Intel 64 and IA-32 Architecture reference books

We want to hear from you – do you have questions, suggestions, or ideas for the show? Do you know of an interesting guest? What about you? Clay – what is that email address? Thanks for joining us again. Look for a new, on-demand show every Friday.

“Serial programming really is just for breakfast”
