Learning Experience of NUMA and Intel's Next Generation Xeon Processor I

As a technical engineer, I have recently been focusing on NUMA, and in my spare time I have studied the relevant public materials and Intel's next generation Xeon processor. Here I would like to share my learning experience with you.

As time is limited, I will share what I have learned in several parts.

1 Overview

From the perspective of system architecture, current mainstream enterprise servers fall into three categories: SMP (Symmetric Multi-Processing), NUMA (Non-Uniform Memory Access) and MPP (Massively Parallel Processing). Each architecture has its own characteristics, and I will focus on NUMA.

To put NUMA in context, let me first outline how it differs from the other two architectures.

1) SMP (Symmetric Multi-Processing)

SMP is the most common architecture: multiple processors are connected symmetrically to the system memory and access it equally and uniformly. This uniformity is both an advantage and a disadvantage, because all processors in an SMP system share the system bus, and bus contention escalates dramatically as the number of processors grows. Due to this bus bottleneck, current SMP systems typically scale only to tens of processors.

2) MPP (Massively Parallel Processing)

MPP is a shared-nothing architecture that divides the whole system into multiple nodes, where each processor can access only its local resources. MPP scales extremely well, but data exchange between nodes is difficult and must be implemented in software. Because MPP relies heavily on management software for communication, task distribution and scheduling, it is too complex and inefficient for common enterprise applications.

3) NUMA (Non-Uniform Memory Access)

NUMA combines the features of SMP and MPP in a sense. The whole system is logically divided into multiple nodes, each of which can access both local and remote memory resources; accessing local memory is faster than accessing remote memory. NUMA is easy to manage and to scale, but accessing remote memory is costly.

SMP and NUMA are both widely used in practice. For example, traditional IA (Intel Architecture) servers use SMP, while many mainframes adopt NUMA.

Today, the growing number of processor cores places higher demands on I/O and latency. Accordingly, Intel's next generation Xeon processor adopts the NUMA architecture.

After briefly introducing NUMA, I will talk about the relationship between NUMA and Intel’s next generation Xeon processor.

As shown at IDF, Intel's next generation 45nm Xeon processor will become the mainstream processor across Intel's full product line, from desktops and notebooks to servers. Compared with the previous generation Core™ processor platform, the 45nm Xeon processor comprehensively reworks the micro-architecture while greatly changing the system architecture and memory hierarchy, including:

 > A new core architecture, redesigned from the ground up, with 4 or more cores per processor
 > Simultaneous Multi-Threading (SMT) technology, which allows each core to execute 2 threads; in other words, one 4-core processor can run 8 threads
 > The latest point-to-point direct connect architecture: Intel® QuickPath Interconnect (Intel® QPI) technology
 > Intel® QuickPath Integrated Memory Controller (IMC), which supports DDR3
 > Micro-architecture improvements, such as the new SSE4.2 instruction set
 > Better power-saving features

The four main technologies of next generation Xeon processor are:

 > Intel® QPI

QPI replaces the FSB architecture. It is a packet-based point-to-point interconnect technology with high bandwidth and low latency, running at up to 6.4 GT/s. A bidirectional QPI link can reach a theoretical peak of 25.6 GB/s, much higher than FSB-based data bandwidth.
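The 25.6 GB/s figure follows directly from the link parameters. A minimal sketch of the arithmetic, assuming the commonly cited QPI layout of 16 data bits (2 bytes) transferred per direction:

```python
def qpi_peak_bandwidth_gb_s(transfer_rate_gt_s, bytes_per_transfer, directions=2):
    """Theoretical peak bandwidth of a QPI link.

    6.4 GT/s * 2 bytes per direction = 12.8 GB/s one way;
    doubled for bidirectional transfer = 25.6 GB/s.
    """
    return transfer_rate_gt_s * bytes_per_transfer * directions

print(qpi_peak_bandwidth_gb_s(6.4, 2))  # → 25.6
```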

 >Intel® QuickPath IMC

An independent DDR3 IMC is integrated into each socket as the memory interface. This remarkably increases bandwidth (the peak bandwidth of DDR3-1333 can be up to 32 GB/s, 4-6 times that of previous platforms), reduces memory latency, improves performance, and gives each CPU a fast channel to its local memory. Unlike the previous generation platform, the IMC platform uses the NUMA architecture for memory access, greatly improving the performance of NUMA-aware applications. The DDR3 IMC supports up to 96 GB of DDR3 memory per CPU socket, and even up to 144 GB in the future, providing strong memory support for high-end enterprise computing.
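The 32 GB/s peak also comes out of simple arithmetic, assuming the three DDR3 channels per socket that this platform is usually described with (the channel count here is an assumption for illustration):

```python
def ddr3_peak_bandwidth_gb_s(transfers_mt_s, channels, bytes_per_transfer=8):
    """Peak DRAM bandwidth: transfer rate x 8-byte bus width x channel count."""
    return transfers_mt_s * bytes_per_transfer * channels / 1000.0

# DDR3-1333 across three channels: 1333 MT/s * 8 B * 3 ≈ 32 GB/s
print(round(ddr3_peak_bandwidth_gb_s(1333, channels=3), 1))  # → 32.0
```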

Now NUMA debuts!

 >Improved power supply management

The power management integrated on the chip makes energy control more efficient.

 >SMT

SMT technology allows each core to execute two threads simultaneously, so a quad-core CPU can present up to 8 logical processors per chip.

We have introduced NUMA's cool architecture. How, then, does software support NUMA? Let's turn to the software support stack of the NUMA architecture.

After several decades of development, the software support stack for the NUMA architecture has become fully fledged. Nearly all mainstream products, from operating systems to databases and application servers, support NUMA.

Operating System (OS)

Windows Server 2003, Windows XP 64-bit Edition and Windows XP are all NUMA-aware, and Windows Vista supports NUMA scheduling. All Linux systems with kernel version 2.6 or later, as well as UNIX systems such as Solaris and HP-UX, support NUMA.
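Part of what "NUMA-aware" means at the OS level is letting software keep threads close to their memory. As a minimal Linux-only sketch using only Python's standard library (pinning to CPU 0 is just an illustrative choice; on a real NUMA machine you would pick the CPUs of the node holding the process's memory):

```python
import os

# Query which CPUs this process may currently run on (Linux-specific API).
original = os.sched_getaffinity(0)

# Pin the process to CPU 0.
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# Restore the original affinity mask.
os.sched_setaffinity(0, original)
print(os.sched_getaffinity(0) == original)  # → True
```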


Database

NUMA is supported by Oracle 8i, Oracle 9i, Oracle 10g and Oracle 11g, as well as SQL Server 2005 and SQL Server 2008.

Middleware Server

The typical managed runtimes in the industry today are Java and .NET applications. Since memory allocation and thread scheduling are transparent to the application and handled entirely by the virtual machine, their performance in a NUMA environment depends mainly on how well the virtual machine exploits the OS's NUMA support.

In a word, the whole software stack fully supports the NUMA architecture today. In the following section I will describe how application software can make use of it.

In a traditional SMP system, all CPUs access memory in the same way through a single shared memory controller, which often causes congestion. Moreover, one memory controller can manage only a limited amount of memory, and funneling every access through one hub leads to high latency.

Under the NUMA architecture, however, there is no longer a single memory controller: the whole system is divided into multiple nodes, each with its own processors and memory, and all nodes are interconnected. Therefore, whenever a node is added, the memory capacity and bandwidth of the system grow with it, which improves scalability.
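On Linux, this node layout is exposed under /sys/devices/system/node as directories named node0, node1, and so on. A small illustrative sketch (the helper below is hypothetical and just parses a list of directory entries, so it works the same on any machine):

```python
import re

def parse_numa_nodes(entries):
    """Extract NUMA node IDs from directory names such as 'node0', 'node1'.

    `entries` is a list of names as found under /sys/devices/system/node.
    """
    nodes = []
    for name in entries:
        m = re.fullmatch(r"node(\d+)", name)
        if m:
            nodes.append(int(m.group(1)))
    return sorted(nodes)

# Entries as they might appear on a two-node system:
print(parse_numa_nodes(["node0", "node1", "possible", "online", "has_cpu"]))
# → [0, 1]
```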

Now let’s turn to the memory structure of NUMA.

Each CPU in a NUMA system can access two kinds of memory: local memory and remote memory. Local memory is on the same node as the CPU and has low latency; remote memory belongs to a different node and must be reached over the node interconnect, so its latency is higher than that of local memory.

From the software's perspective, remote memory and local memory are accessed in the same way. In theory, software can treat a NUMA system exactly like an SMP system, without distinguishing local from remote memory. To get the best performance, however, the distinction must be taken into account.

Measurements show that for common memory operations such as memset, memcpy, streaming reads and writes, and pointer chasing, local memory is accessed faster than remote memory.

Because NUMA uses local and remote memory simultaneously, some memory accesses take longer than others. "Local" and "remote" are defined relative to the running thread: local memory is the memory on the same node as the CPU currently running the thread; any other memory is remote. The ratio of the time taken to access remote memory to the time taken to access local memory is called the NUMA ratio. If the NUMA ratio is 1, the machine is effectively SMP; the higher the ratio, the more expensive it is to access memory on other nodes. Applications that are not NUMA-aware can therefore perform poorly on NUMA hardware.
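The NUMA ratio is simple to compute once local and remote latencies are known. A minimal sketch, with purely hypothetical latency figures chosen for illustration (not measured values for any real platform):

```python
def numa_ratio(local_ns, remote_ns):
    """NUMA ratio: remote access latency divided by local access latency."""
    return remote_ns / local_ns

# Hypothetical example latencies, in nanoseconds:
print(numa_ratio(local_ns=60, remote_ns=100))  # remote ~1.67x slower than local
print(numa_ratio(local_ns=60, remote_ns=60))   # ratio of 1 behaves like SMP
```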

Because of the difference between local and remote access times, a NUMA system performs better when threads mostly access local memory.

Thanks for your support, which encourages me to continue the series. Some friends asked about QPI, one of the features of the next generation Xeon processor, so I will give more details.

QPI is the transport channel between CPUs in the same machine, allowing faster data transfer between them. Data in cache can be transferred directly over QPI without going through memory.

The next generation Xeon processor replaces the FSB architecture with QPI, a packet-based point-to-point interconnect technology with high bandwidth and low latency, running at up to 6.4 GT/s, much higher than FSB-based data bandwidth. Of course, the number of QPI links on a specific platform can be scaled flexibly according to the target market and system complexity.

Some friends may wonder how cores within the same CPU exchange data. That is simple: the cores in the next generation Xeon processor share a cache, so they can exchange data directly in the cache instead of going through memory.
