By Michael Voss, Senior Staff Engineer, Intel Developer Products Division and Terry Wilmarth, Software Engineer, Intel Developer Products Division
Multi-socket servers and workstations built on the new generation of Intel® Microarchitecture, codenamed Nehalem, use Intel® QuickPath Technology and are therefore distributed shared memory systems. This paper discusses the performance impact of non-uniform memory access on Intel® Threading Building Blocks (Intel® TBB) applications. First, the concept of non-uniform memory access (NUMA) is defined and its implications for software performance are discussed. Next, the core design of the Intel® TBB library is explored in the context of NUMA platforms, showing that the library has been built from the outset to perform well on these platforms. Following this, a few Intel® TBB features that are particularly important on NUMA platforms are highlighted, including the Intel® TBB scalable memory allocator and the affinity partitioner. Finally, the paper presents scalability studies of benchmarks run on three NUMA systems.
What is a Non-Uniform Memory Access Platform?
Figure 1 presents examples of uniform and non-uniform memory access architectures. On both platforms, there are processors with caches and a main memory. The difference between a symmetric multiprocessor (SMP) and a NUMA system is the time (latency) required to satisfy cache misses.
In a uniform memory access system as shown in Figure 1(a), the time needed for a processor to access a memory block in the main memory does not depend on which processor or block is involved. The path from each processor to each memory block has the same latency.
In a NUMA system as shown in Figure 1(b), memory blocks are distributed across the processors. Therefore, if a processor misses in its last-level cache, the time it takes to satisfy the miss depends on whether the block is found in that processor’s local memory or in a remote memory.
It is important to note that as long as the data stays in the cache, the processor has quick access to it regardless of whether the platform is an SMP or NUMA system. It is only when data must be fetched from main memory that the differing latencies impact performance.
The Implications of NUMA on Software Performance
In some ways, NUMA simply adds another level to consider when tuning applications to make best use of the memory hierarchy. Just as applications may be tuned to preserve data locality in order to take advantage of data caches, applications may also be tuned to satisfy more of their cache misses from local memory than from remote memory.
To tune for NUMA, one must know what data will be accessed by a thread or task. Knowing this, the task can then be scheduled to execute on a processor that is close to the data – or the data can be placed near the processor where the task will execute. Unfortunately, for most non-scientific applications, it is challenging to know exactly what data will be accessed by a task due to the complex control flow, data structures and modularity that are common in general purpose software. Lacking this information, it is nearly impossible to make good task and data placement decisions.
An alternative to strict data and task placement is to rely on locality optimizations for performance. With this approach, data and tasks are not explicitly placed near specific processors, but instead locality optimizations reduce the likelihood of cache misses. It is this approach that is taken by the Intel® TBB library and will be the focus of the remainder of this paper.
Intel® TBB Exploits Data Locality by Design
Intel® TBB is a C++ library that provides generic parallel algorithms, concurrent containers, a task scheduler and other features for multi-core programming. Intel® TBB has been developed and tested on a wide range of platforms, including NUMA systems. The library provides scalable and portable performance by using cache-oblivious algorithms to enhance locality and a work-stealing scheduler to balance load.
Cache-oblivious algorithms are designed to exploit data caches without knowing the details of those caches, such as their sizes, line sizes, or associativity. In practice, these algorithms often involve recursive subdivision of a problem into progressively smaller subproblems that eventually fit within a data cache. In Intel® TBB, many of the loop templates, such as parallel_for, parallel_reduce and parallel_scan, follow this cache-oblivious approach, as illustrated by the sketch below.
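The following minimal sketch shows the pattern (the array, its length, and the scaling operation are placeholders chosen for illustration, not taken from the benchmarks in this paper). The parallel_for template recursively splits the blocked_range into sub-ranges and applies the body to each; nowhere does the code state a cache size, yet the chunks it ultimately operates on are small enough to be cache friendly.

#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Body object applied to each sub-range produced by recursive subdivision.
struct Scale {
    float* a;
    float  factor;
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i )
            a[i] *= factor;   // work on one cache-friendly chunk
    }
};

int main() {
    std::vector<float> data( 1000000, 1.0f );
    Scale body;
    body.a = &data[0];
    body.factor = 2.0f;
    // The range [0, n) is split recursively until the pieces are small;
    // no cache sizes or line sizes appear anywhere in this code.
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, data.size() ), body );
    return 0;
}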
The general design of the underlying Intel® TBB task scheduler is also based on a cache oblivious philosophy. Threads execute tasks that were most recently created in their own local task pool and are likely to be hot in their caches. However, if a thread runs out of work, it steals work from another thread’s task pool (the victim’s pool). To limit impact on the victim’s cache performance, the oldest task in that thread’s pool will be stolen – a task that is less likely to be hot in the victim’s cache.
Since these algorithms do not rely on precise knowledge of the memory hierarchy configuration, they provide good cache reuse while also providing performance that is portable across platforms. The Intel® TBB library is designed to use caches efficiently, thereby reducing the impact of misses (local and remote) by making all types less frequent.
Preserving Locality in Intel® TBB Algorithms
Preserving and enhancing data locality is important for applications running on both NUMA platforms and SMPs. To assist users in preserving locality, Intel® TBB offers two important features: a scalable allocator and an affinity partitioner.
The Intel® TBB Scalable Allocator
On some platforms, the standard allocator maintains a single global heap from which it services requests from all threads. Using a global heap can limit scalability in two ways. First, the allocator may implement thread safety by protecting the global heap with a single lock – preventing threads from concurrently allocating memory. Second, servicing all threads from the same heap can lead to false sharing. False sharing occurs when two different objects are allocated in the same cache line. A write by one processor to its object invalidates the entire cache line, and therefore the other object as well, in the other processors’ caches. In NUMA systems, the cost of false sharing may be amplified because the memory block may be closer to one processor than the other.
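The sketch below illustrates the effect; the counter structure and the iteration count are hypothetical and chosen only to make the interference visible. Two threads each increment their own counter, but because the two counters can occupy the same cache line, every write forces that line to migrate between the processors’ caches (and, on a NUMA system, potentially across sockets). Padding the structure so that each counter occupies its own cache line removes the interference.

#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Two per-thread counters; without padding they typically share one
// 64-byte cache line, so concurrent writes ping-pong the line.
struct Counters {
    long a;                 // written by one thread
    // char pad[64];        // uncomment to place 'b' on its own cache line
    long b;                 // written by another thread
};

struct Bump {
    Counters* c;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i = r.begin(); i != r.end(); ++i )
            for( int k = 0; k < 10000000; ++k )
                ( i == 0 ? c->a : c->b )++;   // each index touches one field
    }
};

int main() {
    Counters c = { 0, 0 };
    Bump body = { &c };
    // Grain size 1 allows the two iterations to run on different threads.
    tbb::parallel_for( tbb::blocked_range<int>( 0, 2, 1 ), body );
    std::printf( "%ld %ld\n", c.a, c.b );
    return 0;
}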
To avoid these issues, Intel® TBB provides a scalable allocator that maintains per-thread heaps. Using per-thread heaps avoids the need for locking and reduces the likelihood of false sharing.
The TBB library exposes the scalable allocator as a C++ class tbb::scalable_allocator and also provides a C interface that includes scalable_malloc, scalable_free, scalable_calloc and scalable_realloc. Beginning with Intel® Threading Building Blocks 2.2, the library also provides support to automatically replace calls to malloc and free with calls to the scalable allocator.
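A minimal sketch of both interfaces is shown below; the container contents and the buffer size are arbitrary.

#include <vector>
#include "tbb/scalable_allocator.h"

int main() {
    // C++ interface: plug the scalable allocator into an STL container.
    std::vector<double, tbb::scalable_allocator<double> > prices( 1024, 0.0 );
    prices[0] = 42.0;

    // C interface: drop-in replacements for malloc/free.
    void* buf = scalable_malloc( 4096 );
    // ... use buf ...
    scalable_free( buf );
    return 0;
}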
Using affinity_partitioner with Intel® TBB algorithms
The affinity partitioner hints that the execution of a loop template should assign iterations to the same processors as another execution of the loop (or another loop) with the same affinity_partitioner object. Using this partitioner with a loop template can increase performance if data accessed by a previous execution of the loop still resides in the cache during a subsequent execution. Under such conditions, performance may improve for both SMP and NUMA systems due to the increased cache hit rates.
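The sketch below illustrates the intended usage pattern; the data array and the number of sweeps are placeholders. The key point is that the same affinity_partitioner object is passed to every execution of the loop, so each sweep replays the previous sweep’s mapping of iterations to threads.

#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/partitioner.h"

struct Update {
    double* a;
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i )
            a[i] += 1.0;
    }
};

int main() {
    std::vector<double> data( 1 << 20, 0.0 );
    Update body;
    body.a = &data[0];

    // Reusing the same affinity_partitioner object across repeated sweeps
    // keeps each chunk of 'data' on the thread whose cache already holds it.
    tbb::affinity_partitioner ap;
    for( int sweep = 0; sweep < 100; ++sweep )
        tbb::parallel_for( tbb::blocked_range<size_t>( 0, data.size() ),
                           body, ap );
    return 0;
}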
For NUMA systems, it may also be possible to use an affinity_partitioner or the affinity hooks in class task to place tasks on the same processors where another loop previously allocated the data associated with the tasks. As mentioned before, however, explicit placement of data and tasks is challenging for general purpose software, and therefore we leave this topic for interested users to explore on their own.
Examples of Intel® Threading Building Blocks 2.2 Scalability
Three benchmarks were run on three NUMA systems to illustrate the scalability of Intel® TBB 2.2. The benchmarks are (1) Tachyon, a parallel 2D ray-tracer and renderer, (2) Primes, a prime number generator, and (3) Black-Scholes, a partial differential equation calculator for portfolio analysis. The NUMA systems are (1) H, an HP DL160 2-socket server with Quad-Core Intel® Xeon® Processor X5550 chips, using Intel® Hyper-Threading Technology for a total of 16 hardware threads, (2) U, a Unisys ES7000/one 32-socket server with Dual-Core Intel® Xeon® Processor 7100 chips, for a total of 64 hardware threads, and (3) S, an SGI Altix 450 32-socket server with Dual-Core Intel® Itanium® 9140 chips, for a total of 64 hardware threads. While all three systems are NUMA, only H is based on the Intel® Microarchitecture, codenamed Nehalem, and uses Intel® QuickPath Technology.
Figure 2 shows the scalability of all three benchmarks on each of the three systems. On H, all benchmarks scale linearly across the eight physical cores and continue to scale with the use of the additional hardware threads. In particular, Black-Scholes achieves a speedup of 11.5 on all 16 hardware threads. Performance of the benchmarks on U and S is similar, with Black-Scholes achieving the best speedup when fully utilizing the entire machine.
To further improve performance on the target NUMA systems, the use of the affinity_partitioner provided by Intel® TBB was explored. Figure 3 shows the effect of using affinity_partitioner with the Black-Scholes benchmark compared to the original Black-Scholes, which used the default partitioner (auto_partitioner). On H, the effect was negligible: the benchmark accesses over 200 MB of data, while the two processors in the HP system have only a combined 16 MB of cache. On the larger U and S systems, the effect was more pronounced, with Black-Scholes achieving near-linear or super-linear speedups. When all nodes are in use, U provides an aggregate cache size of 512 MB and S provides an aggregate cache size of 576 MB; on both platforms the entire data set can therefore fit within cache once a sufficient number of nodes are in use. In particular, on U, Black-Scholes achieved a super-linear speedup of over 67 on the full set of 64 hardware threads.
Although non-uniform memory access latencies can impact application performance, Intel® Threading Building Blocks has been designed and implemented with both SMP and NUMA platforms in mind. The cache-oblivious nature of its high-level algorithms and internal work-stealing task scheduler improve data locality for both SMP and NUMA systems. In addition, using the Intel® TBB scalable allocator and affinity partitioner can further insulate application performance from non-uniform memory access latencies. While scalability is application and input dependent, the examples shown in Figure 2 demonstrate that Intel® TBB applications may scale well when executed on NUMA platforms that support as many as 64 threads.