Improved Linux* SMP Scaling: User-directed Processor Affinity

 

Introduction

By Annie Foong, Jason Fung, and Don Newell

Multi-gigabit-per-second networking traffic is pushing the limits of current symmetric multi-processor (SMP) systems. Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic operating systems (OS) for SMP [7, 9]. This article discusses how to improve network performance through the use of process/thread and interrupt affinity. Specifically, our experiments are based on the Redhat Linux*-2.4.20 stack, where we explore the usage of the sched_setaffinity() series of programming interfaces, officially folded into the mainstream Linux-2.6 kernel [14]. This article also shows that interrupt affinity alone provides a throughput gain of up to 25% for bulk data transfers, and that combined process and interrupt affinity can achieve gains of 30%. Finally, we measured cache behavior and quantified the reduction in the number of level 2 (L2) and level 3 (L3) cache misses when affinity is used.

TCP overheads are well documented [2, 3]. However, the most commonly overlooked overheads are those incurred by scheduling and interrupts [6]. Though not cost-intensive operations in themselves, they have an indirect impact on cache effectiveness. The scheduler in a typical SMP OS always attempts to load balance by moving processes from processors with heavier loads to those with lighter loads. Every time scheduling or an interrupt causes a process migration, there is a price to pay: the migrated process needs to re-warm the various levels of data cache in the processor it has just moved to.

Further, OSes do not attempt to balance interrupts across processors. The I/O Advanced Programmable Interrupt Controller (APIC) is responsible for distributing interrupts from hardware devices to the local APICs in the processors. Depending on the settings programmed into the I/O APIC, interrupts are delivered either to the processor(s) specified in the I/O APIC redirection table or to the processor that is executing the lowest-priority process. Both Windows* NT and Linux default the SMP configuration to operate in the lowest-priority mode. This causes device interrupts to go to CPU0. Under high load, CPU0 saturates before the other processors in the system, leading to a bottleneck. In Linux 2.6, a more intelligent scheme is implemented: the kernel dispatches interrupts to one processor for a short duration before it randomly switches interrupt delivery to a different processor. The random distribution resolves the system bottleneck, while the delayed switching provides a best-effort approach to the affinity issue. However, redistributing interrupts means that the kernel has to update the task priority registers (TPRs) regularly, which increases the number of uncacheable writes.

While previous work on process affinity has shown good potential [11], uncoordinated attempts to distribute interrupts to different processors can also have bad side effects [1]. Interrupt handlers end up being executed on random processors, which creates more contention for shared resources. Further, cache contents (e.g. those holding TCP contexts) are not reused optimally, as packets from devices are sent to different processors on every interrupt.

Our approach is to let the application direct its own processor affinity. We assume that an application knows its own workload best and is in a better position to “place” itself than the OS scheduler. User-directed affinity is a relatively straightforward optimization: general-purpose application programming interfaces (APIs) in Linux [8] and Windows [9] allow applications to statically bind processes/threads and/or interrupts to processors. Our complete research results [4] fully characterize the benefits of affinity. The implementation details and some of the more succinct results are highlighted in this article.


Implementation Details

Figure 1(a) summarizes the configuration of the system under test (SUT) and the clients. Figure 1(b) shows the setup of our small cluster. The ttcp micro-benchmark is used to exercise bulk data transmits (TX) and receives (RX) between two nodes. This micro-benchmark exercises the fast (common) path of the TCP stack. In our tests, one connection (a unique IP address) is owned by one instance of ttcp and serviced by one physical NIC. There are a total of 8 NICs, 8 connections, and 8 ttcp processes running on the SUT.
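For illustration, one connection might be driven as sketched below. This is a hedged example: the port number, buffer size and count, and IP address are hypothetical, and the exact ttcp options used in our tests are not listed here.

    # On the SUT: one ttcp receiver per NIC/IP, e.g. listening on port 5010
    ttcp -r -s -p 5010 &

    # On the client attached to that NIC: stream 8 KB buffers to the SUT address owned by that NIC
    ttcp -t -s -l 8192 -n 100000 -p 5010 192.168.10.1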




Figure 1 (a) System under test and client config (b) Cluster setup

We can statically redirect interrupts from a NIC to a specific processor by manipulating the corresponding bit mask in the /proc filesystem of Linux; details can be found in the Linux source tree documentation [13]. The file /proc/irq/<irq num>/smp_affinity contains the interrupt mask of interrupt number <irq num>. The first task is to figure out the interrupt number assigned to each device by looking at /proc/interrupts. Setting bit <n> of the mask (starting at bit 0) to 1 allows interrupts to be delivered to the corresponding CPU <n>.
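For example, the procedure looks like the following sketch from the command line (run as root; the IRQ number 24 and the NIC name are hypothetical, so read the actual assignment from /proc/interrupts on the SUT):

    # Find the interrupt line assigned to each NIC
    cat /proc/interrupts

    # Suppose (hypothetically) eth1 is on IRQ 24. Direct its interrupts to CPU1:
    # the mask is hexadecimal and bit <n> corresponds to CPU <n>, so CPU1 = 0x2
    echo 2 > /proc/irq/24/smp_affinity

    # Verify the new mask
    cat /proc/irq/24/smp_affinity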

While interrupt affinity has been consistently supported since the Linux-2.4 kernels, support for process affinity has stabilized only recently, as shown in Table 1. In addition, the data type expected for the CPU mask may differ across distributions.

Kernel.org distributions:
  2.4.20             no support
  2.4.21             set_cpus_allowed() only
  2.6.5              set_cpus_allowed(), sched_setaffinity()

Redhat distributions:
  2.4.18-14 (RH8)    set_cpus_allowed() only
  2.4.20-13 (RH9)    set_cpus_allowed(), sched_setaffinity()


Table 1 Sample of various levels of support of process affinity

We modified ttcp to use the following APIs to set process or thread affinity. In user mode, sched_setaffinity() can be used.
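For example, a minimal sketch of the kind of call we added, affinitizing the calling process to CPU1, is shown below. It uses the cpu_set_t interface provided by current glibc headers; as Table 1 notes, the mask's data type differs across older distributions, so the exact prototype should be checked against the target kernel and C library.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);        /* start with an empty CPU mask   */
        CPU_SET(1, &mask);      /* allow execution on CPU1 only   */

        /* pid 0 refers to the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* ... the ttcp transmit/receive loop would run here ... */
        return 0;
    }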


Results and Analysis



Figure 2 TCP CPU utilization and throughput



Figure 3 TCP processing costs

Figure 2 displays the TCP performance comparison of the four affinity models we studied, namely:

I. No affinity.

II. Interrupt-only affinity (IRQ aff). For example, interrupts from NICs 1-4 are directed to CPU0.

III. Process-only affinity (proc aff). For example, ttcp processes 1-4 are bound to CPU0.

IV. Full affinity (full aff). In this model, each ttcp process is affinitized to the same processor as the interrupts coming from the NIC it is assigned to listen on; a configuration sketch for this model follows the list.
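As an illustration of model IV for a single NIC/process pair, the following is a hedged sketch: the IRQ number, port, and CPU choice are hypothetical, and taskset is shown here as a stand-in for the sched_setaffinity() modification to ttcp described earlier.

    # Deliver the NIC's interrupts (hypothetically IRQ 24) to CPU0 only
    echo 1 > /proc/irq/24/smp_affinity

    # Run the ttcp receiver that listens on this NIC's IP on CPU0 as well
    taskset -c 0 ttcp -r -s -p 5010 &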

Process affinity alone does not have any substantial impact on performance. In this mode, CPU0 not only has to service all interrupts but also at least four ttcp processes; any affinity benefits are negated by the more pronounced load imbalance. On the other hand, interrupt affinity alone can improve throughput by as much as 25%. This behavior is a result of the scheduling algorithm: to reduce cache interference, the scheduler tries to schedule a process onto the same processor it was previously running on. By the same token, the “bottom halves”/tasklets of interrupt handlers (that is, work deferred to run at a later time) are usually scheduled on the same processor where their corresponding “top halves” ran. As a result, interrupt affinity indirectly causes process affinity as well, although interrupt and process contexts can still end up on different CPUs at times. The best improvement (up to 29%) is therefore achieved with full affinity. A more illuminating view is to normalize processor cycles by work done, in GHz/Gbps (i.e., cycles per bit transferred). This “cost” metric accounts for both CPU and throughput improvement at the same time (Figure 3). With full affinity, the cost can be reduced by up to 24%.
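To make the cost metric concrete (with hypothetical numbers, not taken from our measurements): two processors clocked at 2.8 GHz running at 50% average utilization while sustaining 4 Gbps cost (2 × 2.8 GHz × 0.5) / 4 Gbps = 0.7 cycles per bit. Affinity lowers this ratio by raising throughput, lowering utilization, or both.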




Figure 4 Two possible permutations of interrupt and process affinity

To understand the reasons behind such gains, we focused on characterizing the two extreme cases of affinity, as shown in Figure 4. The total number of cycles and the total number of cache misses in the two modes were counted. The events available for measurement are those supported by the processor’s hardware event counters [5]. Cycles and cache misses can be measured using tools such as VTune [12] or OProfile [10].
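A typical OProfile session for this kind of measurement is sketched below. The kernel image path, event name, and sample count are illustrative; the cycle and cache-miss events actually available on the SUT's processors can be listed with ophelp.

    # Point OProfile at the uncompressed kernel image and pick an event to count
    opcontrol --vmlinux=/boot/vmlinux --event=GLOBAL_POWER_EVENTS:100000

    # Collect samples while the ttcp workload runs
    opcontrol --start
    # ... run the benchmark ...
    opcontrol --stop

    # Break the samples down by symbol
    opreport --symbols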

Process affinity keeps processes from unnecessarily migrating to other processors, and interrupt affinity further ensures that both the bottom and top halves of interrupt processing are kept on the same processor. When full affinity is enforced, interrupts are serviced on the same processor that will ultimately run the higher layers of the stack as well as the user application. In other words, there is a direct path of execution within the processor. Since code execution and data accesses are confined to the same processor, very high cache locality can be achieved. Figure 5 shows that the improvement in time (i.e., the reduction in cycles) is accompanied by a reduction in cache misses when going from the no-affinity to the full-affinity mode. All numbers displayed in Figure 5 have been normalized to work done. Further, since the cost of an L3 cache miss is much larger than that of an L2 miss, we expect L3 misses to have a larger impact on overall performance.


Conclusion

It is encouraging to see how simple mechanisms, without the need for hardware offloads or major software rewrites, afford good gains in SMP scaling for networking. However, we have investigated only bulk data transfers and affinity in the best possible configuration. While static affinity may not work for non-uniform and dynamically changing applications, dedicated servers (for example, a web server running a known number of worker threads and NICs) may have workloads that can leverage such mechanisms. Two commonly used web servers (Redhat’s TUX and Microsoft’s IIS) are already designed to allow worker threads and processes to be affinitized [15]. Furthermore, future network adapters will be able to look deeper into packets to extract flow information (receive-side scaling) [9] and can dynamically direct interrupts to the most appropriately cached processor. We believe the insights gained here will help in proposing mechanisms that better leverage affinity. The methodology investigated here is applicable to applications beyond networking. The arrival of chip multi-processors (CMP) will bring multiple cores to each processor, effectively turning even single-processor desktops into SMP systems. As we look towards CMP deployment, we believe that affinity, and mechanisms to better manage affinity, will undoubtedly take a central role in future operating systems.


References

  1. V. Anand and B. Hartner. TCP/IP Network Stack Performance in Linux Kernel 2.4 and 2.5. In Proc. of the Linux Symposium, Ottawa, June 2002.
  2. J. Chase, A. Gallatin and K. Yocum. End-System Optimizations for High-Speed TCP. IEEE Communications Magazine, 39(4), April 2001.
  3. A. Foong, T. Huff, H. Hum, J. Patwardhan and G. Regnier. TCP Performance Re-visited. In Proc. of the IEEE Intl. Symposium on Performance Analysis of Systems and Software, Austin, Mar 2003.
  4. A. Foong, J. Fung and D. Newell. An In-depth Analysis of the Impact of Processor Affinity on Network Performance. To appear in Proc. of the IEEE Intl. Conference on Networks, Nov 2004.
  5. IA-32 Intel® Architecture Software Developer’s Manual, Vol. 3: System Programming Guide. Intel Corporation, 2002.
  6. J. Kay and J. Pasquale. The Importance of Non-Data Touching Processing Overheads in TCP/IP. In Proc. of ACM SIGCOMM, San Francisco, 1993.
  7. P. Leroux. Meeting the Bandwidth Challenge: Building Scalable Networking Equipment Using SMP. Dedicated Systems Magazine, 2001.
  8. R. Love. Linux Kernel Development. Sams Publishing, 2004.
  9. Scalable Networking: Eliminating the Receive Processing Bottleneck - Introducing RSS. Microsoft whitepaper, available at http://www.microsoft.com/whdc/*.
  10. OProfile: A System-wide Profiling Tool for Linux. Available at http://oprofile.sourceforge.net*
  11. J. Salehi, J. Kurose and D. Towsley. The Effectiveness of Affinity-Based Scheduling in Multiprocessor Network Protocol Processing. IEEE/ACM Trans. on Networking, 4(4), pp. 516-530, 1996.
  12. VTune Performance Analysis Tool. Available at /en-us/articles/intel-vtune-amplifier-xe/
  13. IRQ Affinity Documentation. <Linux source tree>/Documentation/IRQ-affinity.txt
  14. Cross-referencing Linux, available at http://lxr.linux.no/+trees*
  15. What's New in Internet Information Services 6.0. Microsoft product information, http://www.microsoft.com/windowsserver2003/evaluation/overview/technologies/iis.mspx*, 2003.

 


About the Authors

Annie Foong is a Senior Research Engineer with the Communications Technology Laboratory at Intel R&D. She received her Ph.D. in Electrical and Computer Engineering from the University of Wisconsin-Madison in 1999. Her research interests include new architectures and systems software implementations for network performance. She has worked extensively on the experimental analysis of Linux TCP stacks, and her most recent work involves the use of multiprocessors in asymmetric modes. Before embarking on network research, she worked in a variety of interdisciplinary fields, including echocardiography diagnosis and statistics-based web caching. She is a member of the ACM and the author of more than a dozen technical papers.

Jason Fung is a Senior Research Engineer in the Communications Technology Laboratory at Intel Research and Development. His research interests include data center network performance and scalability, network interfaces, and operating systems. Prior to his research career, he joined the Enterprise Product Group at Intel in 1998 and helped define, develop, and validate two generations of enterprise server chipset products. His work spanned multiple disciplines, including simulation and modeling, validation, architecture, and performance analysis. Jason received his M.S. degree in Electrical and Computer Engineering from Carnegie Mellon University in 1998. Along with various honors, he received two B.S. degrees, in Computer Science with Mathematics and in Electrical and Computer Engineering, from Carnegie Mellon University in 1997. He is a member of the IEEE.

Donald Newell joined Intel in 1994 and is a Principal Engineer in the Communications Technology Lab. He has spent most of his career working on networking and real-time systems, usually on IA-based platforms. He has worked on a number of emerging technologies at Intel, including leading the group that developed Intel's frameworks for media streaming over the Internet and for data broadcast support for DTV. Don chaired the Advanced Television Systems Committee (ATSC) work on data broadcast. Currently, he is working on technologies related to the performance of server I/O. Don has a BS in Computer Science from the University of Oregon.

