PICNIC: Prototyping Tomorrow’s Functionality Using Today’s NICs

October 26, 2009 12:00 AM PDT


by Bryan Veal, Intel Corporation, bryan.e.veal@intel.com &
Annie Foong, Intel Corporation, annie.foong@intel.com

Abstract

The benefits of proposed features for network interface controllers must be experimentally evaluated before undergoing a costly hardware implementation. Simulation does not allow for testing these features with full-scale workloads and full experimental frameworks in wall-clock time. We propose PICNIC as a means to implement NIC features in software on a core borrowed from a multi-core system. On the remaining cores, PICNIC allows running any conceivable workload without modification. The borrowed core is isolated so that performance on the remaining system can be measured independently with existing tools. Within the experimental framework, the NICs and the borrowed core appear as a unified NIC. In this paper we describe PICNIC's design and implementation and we evaluate its utilization and latency overheads. We also present a case study: using PICNIC to demonstrate the performance benefits of receive-side scaling (RSS) with a full-scale Web server workload.

Introduction

Each generation of modern network interface controllers (NICs) supports an increasing number of features to improve the performance of host systems. For example, the Intel® 82575EB Ethernet Controller [4] supports multiple queues for virtual machines (VMDq) [2], direct cache access (DCA) [3], header splitting, checksum offloading, segmentation offloading, receive-side scaling [8], packet filtering, and VLAN tagging. We expect future NICs to become even more capable.

Before going through the expense of a full hardware implementation, the potential benefits of new features must be evaluated. Simulation is one approach, but this has drawbacks. Simulators are much slower than wall-clock time, meaning they are only appropriate for microbenchmarks. Furthermore, the validity of the simulation results is dependent upon how closely the simulator models the real system.

Evaluation on a real system would allow using a full-scale benchmark and measuring performance directly. Implementing a hardware prototype, for instance on an FPGA-enabled NIC, would entail hardware costs and time-consuming implementation work. An ideal solution would allow an easy software implementation on a real host system with no special hardware.

Modern multi-core computing platforms can be leveraged to provide such a solution. By coupling one or more off-the-shelf NICs with a core on the system, we can prototype NIC features in software on these cores. The combination of the NICs and the core effectively becomes a programmable NIC as far as the rest of the system is concerned. The borrowed cores must be isolated from the remaining cores and they must behave like NICs so that performance measured on the remaining cores will be comparable to the performance of the same cores using a physical NIC.

We implemented this idea using the Linux kernel and the Intel PRO/1000 (e1000) driver. The result is called PICNIC, the Programmable Intermediate Core NIC. PICNIC has allowed us to prototype features intended for hardware implementation within the NIC's device driver on the NIC's borrowed cores (Figure 1). We separated the borrowed core from the others by dividing the driver into two parts with separate interrupt service routines (ISRs). The ISR on the borrowed core handles interrupts and data from the physical NIC, executes the new user-defined features, selects a target core, and sends it an inter-processor interrupt (IPI). The ISR on the target core receives the IPI and packet data as if they were coming from a physical NIC. This ISR then sends packets to the network protocol stack as usual, and then the stack processes the packets normally.



Figure 1. PICNIC appearing as a programmable NIC to target cores.

Using PICNIC, we implemented and evaluated a software prototype of receive-side scaling (RSS) [8]. RSS removes network bottlenecks by spreading network flows to multiple cores. Without using an RSS-capable NIC, we measured a 37% increase in throughput when using RSS with PICNIC on a multi-core server running a full Web server benchmark workload. This paper makes the following contributions:
  1. it proposes PICNIC as a general means of prototyping NIC functionality in software;
  2. it describes how PICNIC was implemented as a minimal set of modifications to the kernel and driver;
  3. it describes an API to easily add new functionality to the PICNIC driver;
  4. it quantifies PICNIC's impact on utilization and latency, and its ability to handle high data rates; and
  5. it demonstrates PICNIC's utility within a full-scale experimental framework through a real-system study of the effects of receive-side scaling on Web server performance.

In Section 2, we provide an overview of PICNIC. In Section 3, we present the details of our PICNIC implementation. Section 4 examines CPU utilization and latency overheads and the scalability of PICNIC. We present the implementation and performance of RSS as a case study in Section 5. We conclude in Section 6.

PICNIC Overview

PICNIC operates by adding a step to the NIC device driver, which runs on a reserved core on the system. Packets are classified or modified in this new step and then forwarded to another core. To mirror the actions of a physical NIC, the PICNIC core writes to a descriptor ring and generates an interrupt on the destination core.

2.1 Two Stages of Interrupt Service Routines

PICNIC provides a means of borrowing a system's core to implement new NIC functionality while isolating the borrowed core so that it does not affect the performance of the rest of the system. This is accomplished by splitting the NIC driver's interrupt service routine (ISR) into two routines (Figure 2). The PICNIC ISR runs on the borrowed core, which we call the PICNIC core, and receives interrupts from the physical NICs. Instead of sending packets up the protocol stack, it enters a user-defined function where packets are classified or modified by the new NIC features under evaluation.



Figure 2. PICNIC and target ISRs.

For each packet, the user-defined function chooses a target core, which is one of the "unborrowed" cores on the system (Figure 1). After copying the packet's descriptor from its original descriptor ring to a new descriptor ring (Section 2.2), the PICNIC ISR then sends an inter-processor interrupt (IPI) to the appropriate target core. The IPI is received by the target ISR, which processes packets as if they originated from a NIC.

The ability for a single physical NIC to interrupt multiple cores was introduced by Extended Message Signaled Interrupts (MSI-X) [1]. NICs can use MSI-X to distribute packet and flow processing to more than one core. PICNIC mimics this by using IPIs to distribute processing to multiple target cores.

2.2 Two Stages of Descriptor Rings

To effectively model a system with physical NICs, the target ISR should handle packets from the PICNIC ISR as if they were generated by a NIC. A NIC normally sends packets and descriptors via direct memory access (DMA), and it puts descriptors into a descriptor ring data structure. Just as a normal NIC's ISR reads descriptors from descriptor rings, so should PICNIC's target ISR.

NICs with classification and load balancing features support multiple descriptor rings. To avoid contention between cores, each descriptor ring is processed by a single core chosen by the interrupt. There may be multiple descriptor rings per core, but multiple cores should not contend for any single descriptor ring.

PICNIC's target cores emulate this behavior (Figure 3). The PICNIC ISR receives descriptors from the physical NIC on its PICNIC descriptor ring. One or more target descriptor rings are mapped onto target cores. The user-defined function chooses a target descriptor ring along with the appropriate target core. PICNIC then moves the descriptor from the PICNIC descriptor ring to the target descriptor ring. It then sends the IPI to the target core. The target ISR, running on the target core, reads the descriptors as if they were placed on the target descriptor ring by a physical NIC.



Figure 3. PICNIC and target descriptor rings.
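
The two-stage ring arrangement can be illustrated with a small user-space model. This is only a sketch of the data flow described above, not the driver code: the class and method names (PicnicModel, nic_dma, picnic_isr, target_isr) are hypothetical, Python deques stand in for descriptor rings, and a list of core IDs stands in for the IPIs that would be sent.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """Stand-in for a receive descriptor; skb is a reference, never copied."""
    skb: bytearray
    length: int


@dataclass
class PicnicModel:
    """Toy model: one PICNIC ring feeding one target ring per target core."""
    target_cores: list
    picnic_ring: deque = field(default_factory=deque)
    target_rings: dict = field(init=False, default_factory=dict)
    pending_ipis: list = field(default_factory=list)

    def __post_init__(self):
        self.target_rings = {core: deque() for core in self.target_cores}

    def nic_dma(self, packet: bytes) -> None:
        """Physical NIC side: place a descriptor on the PICNIC ring."""
        self.picnic_ring.append(Descriptor(bytearray(packet), len(packet)))

    def picnic_isr(self, choose_core) -> None:
        """PICNIC core: classify, move the descriptor, then signal the target."""
        while self.picnic_ring:
            desc = self.picnic_ring.popleft()
            core = choose_core(desc)                # user-defined classification
            self.target_rings[core].append(desc)    # reference moves, data stays put
            self.pending_ipis.append(core)          # one IPI per packet (no moderation)

    def target_isr(self, core: int) -> list:
        """Target core: drain its own ring as if a NIC had filled it."""
        ring = self.target_rings[core]
        delivered = []
        while ring:
            delivered.append(ring.popleft())
        return delivered
```

Because each target ring is drained only by its own core, no two target cores contend for the same ring, which mirrors the constraint stated for multi-queue NICs.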

Implementation Details

We implemented PICNIC as modifications to both the Linux kernel version 2.6.17.14 and the Intel PRO/1000 (e1000) driver version 7.0.33.

3.1 Linux Kernel

The Linux kernel was modified to support sending inter-processor interrupts as if they were device interrupts. The original Linux API for device driver ISRs was implemented for normal IRQ or MSI, but not IPI. While Linux provides some specific handlers for IPIs, it does not provide a way to dynamically assign new handlers for IPIs. Thus, we created a new hw_interrupt_type for inter-processor interrupts, so that IPI vectors may be assigned to ISRs using the request_irq() function, just as IRQ or MSI vectors are assigned.

Normally, hardware interrupt vectors are created when scanning physical devices, but this is not possible for PICNIC's IPIs. Instead, we supplied a create_ipi() function which PICNIC calls to generate new IPI vectors. The function simply searches for an unused vector and maps it to the IPI hw interrupt type. Likewise, the destroy_ipi() function reverses this process.
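
The vector search can be pictured with the following user-space model. It is a sketch under stated assumptions rather than the kernel patch itself: the table size, the first usable vector number, and the "ipi" tag are illustrative choices, and the real create_ipi() and destroy_ipi() signatures are not given in this paper.

```python
FIRST_DYNAMIC_VECTOR = 32   # assumption: lower vectors are reserved
NR_VECTORS = 256            # assumption: size of the vector table

vector_table = [None] * NR_VECTORS   # None marks an unused vector


def create_ipi():
    """Find an unused vector, tag it as an IPI, and return its number."""
    for vector in range(FIRST_DYNAMIC_VECTOR, NR_VECTORS):
        if vector_table[vector] is None:
            vector_table[vector] = "ipi"
            return vector
    raise RuntimeError("no free interrupt vector")


def destroy_ipi(vector):
    """Release a vector previously handed out by create_ipi()."""
    if vector_table[vector] != "ipi":
        raise ValueError(f"vector {vector} is not an IPI vector")
    vector_table[vector] = None
```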

3.2 NIC Driver

We modified the NIC driver to split the ISR into physical NIC interrupt and IPI handlers, to set up the target descriptor ring data structures, and to set up IPI mappings.

3.2.1 Initialization

We added an initialization function, picnic_init(), to the NIC driver, and it is called when the driver is loaded. This function sets up the mapping of target descriptor rings to target cores. We used the driver's existing receive descriptor ring as the PICNIC descriptor ring. Our implementation of PICNIC uses a one-to-one mapping of target descriptor rings to cores, but other combinations are possible. Cores which share a last-level cache with the PICNIC core are excluded from the list of target cores. This prevents the PICNIC core from affecting the last-level cache for a target core, which is something that a physical NIC does not normally do.

For each target core, the picnic_init() function allocates and initializes the target descriptor rings using the existing e1000_setup_rx_resources() function. An IPI vector for each target core is created using create_ipi() and they are all assigned to the target ISR using request_irq(). The creation of IPIs and target descriptor rings is avoided on the PICNIC core and any core which shares cache with the PICNIC core (Section 3.2.5). The function PICNIC_USERDEFINED.init() (Section 3.2.4) is called to perform initialization of the user-defined packet processing functionality, if needed.

When the driver is unloaded, the picnic_exit() function reverses the effects of picnic_init(). It flushes any pending descriptors and then deallocates all the target descriptor rings, disassociates the IPI vectors with the target ISRs using free_irq(), and then removes the IPI vectors using destroy_ipi(). The function PICNIC_USERDEFINED.exit() is also called to perform any needed cleanup for the user-defined functionality.

In the original driver, the function e1000_irq_enable() maps the IRQ of each NIC to the driver's ISR. We added the additional task of assigning the affinity of the IRQs to the PICNIC core, which is set to the highest possible processor ID.

3.2.2 PICNIC ISR

The original ISR, e1000_intr(), was divided into two ISRs for PICNIC (Figure 2). The PICNIC ISR (picnic_intr()) receives the interrupt on the PICNIC core. It is unmodified from the original ISR, except that it calls picnic_clean_rx_irq() instead of e1000_clean_rx_irq(). The function picnic_clean_rx_irq() is a modified version of e1000_clean_rx_irq(). Both functions process newly arrived packets on the descriptor ring. The PICNIC version reads each packet from the PICNIC descriptor ring and passes it to the PICNIC_USERDEFINED.process_packet() function which performs all user-defined packet processing, modification, and classification and then returns the desired target core. The picnic_clean_rx_irq() function then moves the descriptor and its metadata to the appropriate target descriptor ring and updates its tail pointer. The descriptor is removed from the PICNIC descriptor ring and its head pointer is updated. The socket buffer (skb) and packet data are not copied or deallocated; pointers to this data simply get carried to the target descriptor ring. Finally, do_picnic_ipi() sends the IPI to the chosen core.

3.2.3 Target ISR

When the target core receives the IPI, it calls the ISR, target_intr(), which is the second part of the split e1000_intr(). This function is simpler than the normal ISR since it does not deal with physical NIC registers or NIC accounting; this is already taken care of in picnic_intr(). This function simply calls e1000_clean_rx_irq(). The function e1000_clean_rx_irq() works the same as in the original driver, except now it runs concurrently on multiple target cores, each with distinct target descriptor rings. The function calls netif_rx() to process each packet normally.

3.2.4 User-Defined Functionality

The picnic_userdefined data structure provides an API for adding the process_packet(), init(), and exit() functions for user-defined packet modification and classification. Adding picnic_userdefined instances as globals allows multiple user-defined functions to coexist. Defining PICNIC_USERDEFINED as the desired instance on the compiler command line lets the user switch between user-defined functions without editing the code. When using the PICNIC driver as a kernel module, one can replace user-defined functions and recompile without rebooting the system.
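
The shape of this API can be modeled outside the kernel as a record of three hooks. The sketch below is an assumption-laden illustration: only the hook names process_packet, init, and exit and the selector name PICNIC_USERDEFINED come from the text, while the argument types, the example classifiers, and the core IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


def _noop() -> None:
    """Default init/exit hook for features that need no setup."""


@dataclass(frozen=True)
class PicnicUserdefined:
    """Model of the three hooks the text names; field types are assumptions."""
    process_packet: Callable[[bytes], int]   # returns the chosen target core
    init: Callable[[], None] = _noop
    exit: Callable[[], None] = _noop


TARGET_CORES = [0, 1, 2]   # hypothetical target core IDs


def first_core(packet: bytes) -> int:
    """Send everything to one core (no spreading)."""
    return TARGET_CORES[0]


def by_last_byte(packet: bytes) -> int:
    """Spread packets using the last byte as a trivial classifier."""
    return TARGET_CORES[packet[-1] % len(TARGET_CORES)]


# Several instances coexist as globals.
single_core = PicnicUserdefined(process_packet=first_core)
spread = PicnicUserdefined(process_packet=by_last_byte)

# Stands in for defining PICNIC_USERDEFINED on the compiler command line.
PICNIC_USERDEFINED = spread
```

Switching experiments then amounts to rebinding PICNIC_USERDEFINED to a different instance, which is the role the compile-time definition plays in the driver.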

3.2.5 PICNIC Core Isolation

Our implementation isolates the PICNIC core and any cores which share a last-level cache with it. This prevents cache effects on target cores that would not otherwise occur when communicating with a physical NIC. In the picnic_init() function, we used the kernel-supplied cpu_coregroup_map() function to locate cores which share last-level cache and skip them when assigning IPIs and target descriptor rings.

We further isolated the PICNIC core and its shared-cache cores by setting the affinity of all non-NIC interrupts listed in /proc/irq/<IRQ number>/smp_affinity only to the target cores. We also used the system call sched_setaffinity() to allow Linux to schedule user processes only on the target cores. Although the PICNIC core is visible to the operating system in our implementation, nothing besides PICNIC runs on it.
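
As a rough illustration of the isolation steps, the sketch below picks target cores and computes the hexadecimal bitmask form that smp_affinity files accept. The function names and the cache topology passed in are hypothetical; on a real system the grouping would come from the kernel rather than a hand-written list.

```python
def pick_target_cores(num_cores, cache_groups):
    """Return (picnic_core, target_cores).

    cache_groups lists the cores that share a last-level cache; the group
    containing the PICNIC core is excluded from the targets.
    """
    picnic_core = num_cores - 1                       # highest processor ID
    excluded = next(set(g) for g in cache_groups if picnic_core in g)
    targets = [c for c in range(num_cores) if c not in excluded]
    return picnic_core, targets


def smp_affinity_mask(cores):
    """Hex bitmask with one bit set per allowed core."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


# Hypothetical 8-core system where adjacent pairs share a cache.
picnic_core, targets = pick_target_cores(8, [[0, 1], [2, 3], [4, 5], [6, 7]])
mask = smp_affinity_mask(targets)
```

On Linux, Python exposes the same scheduling call as os.sched_setaffinity(pid, cpus), so a test harness could restrict a process to the computed target cores in the same spirit.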

3.3 Design Decisions

Our initial goal was to evaluate receive-side scaling using PICNIC. As such, our implementation of PICNIC does not support user-defined functionality for transmit-side packets. Instead, we emulated multiple transmit queues using multiple physical NICs. However, a PICNIC implementation for the transmit path is conceivable. To emulate the DMA to the NIC, the PICNIC core would need to poll for new packets placed on target transmit descriptor rings by the target cores. It would then copy the descriptor to PICNIC transmit descriptor rings and perform the actual DMA. Just as the NIC sends an interrupt to acknowledge that it has copied the packets, the PICNIC core would send an IPI to the transmitting target core.

Additionally, an unmodified driver uses interrupt moderation to reduce the overhead of high interrupt rates. As our original purpose for PICNIC was to evaluate RSS with a transmit-intensive workload (Section 5.2), our implementation does not moderate IPIs, since interrupt rate is not an important factor in RSS behavior. Unlike the unmodified driver, every packet generates an interrupt. For those workloads with high packet rates and small CPU utilization in the application, implementing moderation of PICNIC IPIs should reduce CPU utilization overhead on the target core. If necessary, interrupt moderation could be supported by using a timer on the PICNIC core to generate IPIs instead of generating them after each descriptor copy to a target descriptor ring. However, this would combine the latency effect due to the moderation of both NIC interrupts and IPIs.

Performance Analysis

For PICNIC to be useful to evaluate NIC features as if they were implemented on a physical NIC, packet processing on PICNIC's target cores should perform as similarly as possible to cores handling packets from a physical NIC. Additionally, the packets' detour through the PICNIC core should result in a minimal amount of additional delay. The PICNIC core must also scale to handle large data rates without becoming a bottleneck at full CPU utilization.

4.1 CPU Utilization

Ideally, PICNIC should add no CPU overhead to a target core as compared to a core interacting with an unmodified NIC driver. To compare this overhead, we set up a single bulk TCP receive session on a system with 3GHz Intel Xeon® MP 3.0 processors (Intel NetBurst® microarchitecture) with the NIC receiving at a 1Gbps line rate. We measured CPU utilization both with and without PICNIC.

Figure 4 shows the result. The target core, when using PICNIC, produced 8% higher utilization than the unmodified driver. While this difference is not trivial, it should shrink at lower packet rates. Also, this test uses a trivial TCP receive application, so the relative difference in utilization should shrink when a more realistic application workload is added. Furthermore, PICNIC only affects the receive path, so transmit-intensive workloads with small receive rates should also see smaller overheads. Finally, as explained in Section 3.3, the lack of moderation of IPIs increases overhead for high packet rates.



Figure 4. Utilization with PICNIC versus with an unmodified driver for 1Gbps TCP receive.

As a result of these factors, this increase in utilization should represent the worst case for most workloads.

4.2 Latency

PICNIC has introduced extra steps to the path of a network flow which add latency to incoming packets. Figure 5 shows the extra latency measured when receiving minimum-sized UDP packets at a range of packet rates. The latency increased linearly until the packet rate reached 350 packets/ms, at which point the PICNIC core became fully utilized. When using interrupt moderation on the NIC, there was a delay between the time packets are enqueued and dequeued from the descriptor ring. As packet rates increased, the number of packets in the queue per interrupt also increased. As a result, the average time to dequeue a packet also increased due to the average increase in queue depth.



Figure 5. PICNIC latency overhead.

If these packets were 1,500 bytes, a 1GbE line rate would be reached at roughly 80,000 packets per second. This would correspond to the far left of Figure 5 with about 19 µs of extra latency. This is on the order of the amount of latency added by a network switch, and it would be dwarfed by milliseconds of latency over the Internet.
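
The packet rate quoted above follows from simple arithmetic, sketched here. The 38 bytes of per-frame overhead (Ethernet header, frame check sequence, preamble, and inter-frame gap) are an assumption added for illustration; the text itself does not say which framing overheads were counted.

```python
LINE_RATE_BPS = 1_000_000_000   # 1GbE


def packets_per_second(packet_bytes, overhead_bytes=0):
    """Packets per second needed to fill the link at the given packet size."""
    return LINE_RATE_BPS / ((packet_bytes + overhead_bytes) * 8)


payload_only = packets_per_second(1500)        # ignoring framing overhead
with_framing = packets_per_second(1500, 38)    # assumed 38 bytes of framing
per_millisecond = payload_only / 1000          # same units as the 350 packets/ms knee
```

Either way the result is a little over 80 packets/ms, well below the 350 packets/ms at which the PICNIC core saturated with minimum-sized packets.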

4.3 Scaling

Since it is "borrowed" from the system, the PICNIC core is not a concern as long as it is not over-utilized. To determine how much load the PICNIC core can withstand, we set up a system containing two Intel Xeon X5355 quad-core processors (Intel Core™ microarchitecture) for a total of eight cores. We set up one core as the PICNIC core and reserved the core which shared last-level cache with the PICNIC core. We used up to five of the remaining cores as target cores. We ran line-rate TCP receive flows on each of up to five physical NICs all using a single PICNIC core. Using a simple classifier function, each flow was redirected to its own target core. Figure 6 shows that PICNIC was able to handle all five NICs without the PICNIC core becoming a bottleneck. On a single core, PICNIC was able to handle at least 5Gbps of incoming 1500-byte packets.


Figure 6. PICNIC core scaling to five 1GbE NICs.

Now that the performance characteristics have been presented, the next section describes the implementation and performance of an actual NIC feature that we have prototyped using PICNIC.

Evaluating Receive-Side Scaling

The throughput capacity of a single NIC port continues to grow beyond 10Gbps. At the same time, the growth of compute capacity of a system is achieved through the increase in number of cores. As such, the ability of a NIC to distribute network flows to cores will help the system scale performance with compute resources. One method to achieve this is receive-side scaling (RSS) [8] from Microsoft, which aims to relieve the bottleneck of processing all network flows on the same core by spreading the flows to more cores. It avoids sending packets of the same flow to different cores to keep flow-specific state valid in a core's cache, to prevent reordering of packets within a flow, and to reduce locking for flow state shared between packets. Independent flows, however, can be distributed without contention to exploit flow- or connection-level parallelism.

We used PICNIC to implement RSS and evaluate its effectiveness. Although RSS-capable NICs are now available, they were unavailable at the time our experiments were done. We wanted to evaluate whether the operating system scheduler, when handling a multithreaded server application on a multi-core system, would help or hinder performance when RSS was used. PICNIC was the only available method of testing RSS with a realistic, full-scale server workload and operating system on a real server platform. Even if RSS were available on NICs, PICNIC provides the flexibility to quickly explore changes or alternatives to RSS. This would not be possible with a hardware implementation.

An RSS-enabled NIC maps flows into multiple input queues where each queue signals one CPU core to process all its packets. RSS considers a flow to consist of either packets common to a TCP session, or otherwise, IP packets with common source and destination addresses. Figure 7 shows how flows are distributed by RSS. When a packet arrives, RSS obtains its flow ID, which uniquely identifies the flow to which the packet belongs. The flow ID is fed to the hash function, which produces the result h, which corresponds to one of the n indirection table entries. In practice, the hash function returns the result of the Toeplitz hash algorithm [8], modulo n. The lookup for entry h returns the ID of a core, c_h. RSS directs the packet, descriptor, and interrupt to core c_h, as shown in the indirection function in Figure 7.



Figure 7. Receive-side scaling.
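
The lookup path in Figure 7 can be sketched in a few lines. This is a minimal model, assuming a round-robin fill of the indirection table and using CRC-32 from the standard library purely as a stand-in hash; it is neither Toeplitz nor JHash, and the function names are hypothetical.

```python
import zlib

NUM_ENTRIES = 128   # n, the number of indirection table entries


def build_indirection_table(target_cores, n=NUM_ENTRIES):
    """Spread the n table entries round-robin over the target cores."""
    return [target_cores[i % len(target_cores)] for i in range(n)]


def rss_select_core(flow_id: bytes, table, hash_fn=zlib.crc32):
    """Hash the flow ID, reduce it modulo n, and look up the core c_h."""
    h = hash_fn(flow_id) % len(table)
    return table[h]


table = build_indirection_table([0, 1, 2, 3, 4, 5])
core = rss_select_core(b"example-flow-id", table)
```

Since the hash depends only on the flow ID, every packet of a flow lands on the same core, while distinct flows spread across the table entries.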

5.1 RSS Implementation

In PICNIC, RSS is implemented in software in the user-defined phase of packet processing which runs on the PICNIC core. We chose 128 indirection table entries, which is the maximum allowed by RSS and is supported by current NICs.

We do not depend on the OS to adjust load by changing the core IDs in the indirection table. Since Microsoft has not provided details on RSS's load balance adaptation scheme, we have no basis for implementation. Instead, indirection table entries are simply distributed uniformly to each target core.

Due to concerns that the PICNIC core may become a bottleneck, we replaced RSS's standard Toeplitz hash algorithm with Bob Jenkins' JHash [5] algorithm. While Toeplitz can be implemented to run quickly using fine-grained parallelism in hardware, in software it must compute a sequence of XOR and bit shift operations for each bit of the flow ID, resulting in 96 iterations for each TCP packet. To increase the maximum performance of the PICNIC core, we chose JHash, which has been shown to produce a fast result compared to other software hash algorithms [9].
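
To make the per-bit cost concrete, here is a straightforward software Toeplitz hash. It is a sketch for illustration only: the 40-byte key is an arbitrary placeholder rather than any standard verification key, the helper names are hypothetical, and the flow ID layout (source address, destination address, source port, destination port) is an assumption about field order.

```python
import socket
import struct

ILLUSTRATIVE_KEY = bytes(range(1, 41))   # arbitrary 40-byte key, not a standard one


def tcp_ipv4_flow_id(src_ip, dst_ip, src_port, dst_port):
    """12-byte flow ID: two 4-byte addresses followed by two 2-byte ports."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """XOR together the 32-bit key window aligned with each set input bit."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):               # one iteration per input bit
        if data[i // 8] & (0x80 >> (i % 8)):
            window = (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
            result ^= window
    return result


flow_id = tcp_ipv4_flow_id("10.0.0.1", "10.0.0.2", 1234, 80)
entry = toeplitz_hash(ILLUSTRATIVE_KEY, flow_id) % 128
```

The loop runs once per bit of the flow ID, which is where the 96 iterations for a 12-byte TCP flow ID come from.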

RSS in its intended hardware implementation will use Toeplitz and not JHash. Toeplitz has been shown to produce a uniform and independent hash result [11]. For JHash to be allowed as an appropriate substitute for Toeplitz in a software prototype, it must also effectively produce results which are indistinguishable from a uniform and independent series. To verify this, we used a pseudorandom number sequence test program, Ent [12], which computes five different tests for uniformity and independence for a sequence of bytes. We captured packet traces for all three SPECweb2005 [10] Web server workloads and used the flow ID for each TCP SYN packet as input to the hash functions. Although RSS uses 128 indirection table entries and thus 7 bits of the hash result, Ent only works with full bytes, so we used the last 8 bits for the tests.

Table 1. Percent error for random sequence tests of Toeplitz and JHash with SPECweb workloads.

| Test | Banking Toeplitz | Banking JHash | E-Commerce Toeplitz | E-Commerce JHash | Support Toeplitz | Support JHash |
|---|---|---|---|---|---|---|
| entropy | 0.12% | 0.51% | 0.17% | 0.18% | 0.23% | 0.20% |
| arithmetic mean | 0.04% | 0.22% | 0.08% | 0.01% | 0.66% | 0.20% |
| Monte Carlo value for π | 1.04% | 1.09% | 0.62% | 1.08% | 1.16% | 0.14% |
| serial correlation coefficient | 0.71% | 0.08% | 1.22% | 0.92% | 0.00% | 1.29% |


Table 1 shows the test results. For the entropy, arithmetic mean, Monte Carlo value of π, and serial correlation coefficient tests, all results for JHash fall within a 1.29% error and are similar to the results produced by Toeplitz. For the χ² test (not shown in Table 1), the documentation for Ent considers sequences producing values greater than 90% or less than 10% as "suspect" of being non-random. Our results fall in the "non-suspect" range. Traces from all of the SPECweb2005 workloads pass all five pseudorandom sequence tests, confirming that JHash is viable as a substitute for Toeplitz for evaluating RSS.
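
Two of the simpler statistics can be reproduced in a few lines to show what the percentages in Table 1 measure. This sketch is not Ent itself, and the percent-error definition (relative deviation from the ideal value for random bytes: 8 bits of entropy per byte and a mean of 127.5) is an assumption about how the table was derived.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte."""
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def arithmetic_mean(data: bytes) -> float:
    """Mean byte value; uniformly random bytes average 127.5."""
    return sum(data) / len(data)


def percent_error(measured: float, ideal: float) -> float:
    """Relative deviation from the ideal value, as a percentage."""
    return abs(measured - ideal) / ideal * 100.0


sample = bytes(range(256)) * 4          # perfectly uniform byte histogram
entropy_error = percent_error(entropy_bits_per_byte(sample), 8.0)
mean_error = percent_error(arithmetic_mean(sample), 127.5)
```

A perfectly uniform histogram gives zero error on both measures; hash outputs from real traces would be fed in as the low byte of each hash result.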

5.2 Test Workload

Today's Web servers handle up to tens of thousands of concurrent flows, making RSS potentially useful to spread the network load. Web servers generally send much more data than they receive. It may seem that, since RSS operates on incoming packets, it would have little impact on transmit-intensive workloads. However, incoming TCP acknowledgments (ACKs) are handled by RSS just as any other TCP packet. When ACKs arrive at a core, if data is waiting in the transmit buffer, it is transmitted immediately from the same core. As a result, distributing incoming ACKs also distributes outgoing data [6].

We set up a Web server to handle clients running the SPECweb2005 Support workload. We focused only on the Support workload since it produces more network traffic than the others. We used Apache and PHP to serve Web content. The system under test contained two Intel Xeon X5355 quad-core processors for a total of eight cores. One core on the system was reserved for the PICNIC core. Since pairs of cores share an L2 cache, we also reserved the core that shares cache with the PICNIC core. Apache and PHP ran on the other six target cores.

Four other systems were set up as Web clients. One of them also doubled as the required backend simulator. The system under test communicated with all these clients using four 1Gbps Intel Pro/1000 NICs that were aggregated by PICNIC to appear as a single NIC. Flows on the single emulated NIC were redistributed to the cores by the software RSS implementation. While modern Intel NICs support RSS and MSI-X, the revision of NICs we used did not support RSS, multiple queues, or the ability to interrupt more than one core. As a precaution, we also made sure that neither the client systems nor the PICNIC core became a CPU utilization bottleneck.

5.3 Results

Figure 8 shows the payload throughput (not including headers) of the Web workload using PICNIC's RSS implementation to spread incoming packets to different numbers of cores. The Web server software ran on all six target cores in each case, even when RSS used fewer cores. With all packets going to a single core, network processing became a bottleneck. Spreading packets to two cores produced a 29% increase over one core. Recalling that the SPECweb2005 Support workload is transmit-intensive and that RSS affects incoming packets, by spreading incoming HTTP requests and TCP acknowledgments, RSS relieved a transmit-intensive bottleneck.



Figure 8. SPECweb2005 Support throughput for 6 cores and 1-6 PICNIC RSS queues.

Further spreading the network load up to six cores produced a 6% improvement over spreading flows to two cores. A possible explanation is that by spreading the network load to more cores, the OS more effectively scheduled the Web server threads on the same core as their network protocol processing. With fewer cores used for networking, the available cycles for the Web server were greater on target cores without a network load. To make use of these available cycles, the network data must move between the cores receiving packets and the cores running Web server threads. By scheduling networking on all available cores, we avoided the migration of network flow data and state, and the system was able to use its cache more effectively.

Overall, RSS produced a 37% increase in throughput over the single-core case by spreading network flows to all six cores. This result resembles the improvement shown by the RSS-like method known as KNET, which was implemented on a physical NIC [6]. KNET produced a 1.3× speedup in connection rate on four processors by relieving a network processing bottleneck in a Web server workload.

Our insights also resemble a study that showed an improvement in bulk transmit throughput using an Intel Pro/10GbE NIC supporting an actual hardware RSS implementation [7]. They used a simple application transmitting 1MB files in 32 concurrent flows on a system with two four-core Intel Xeon 5355 processors running Microsoft Windows Server 2008. Using a 4kB send() buffer size, they demonstrated a throughput of 5,090Mb/s, 6,023Mb/s, and 6,274Mb/s with 1, 2, and 4 RSS queues, respectively. This study validates our conclusions from PICNIC by showing that RSS can improve throughput by spreading the work to more cores and relieving the transmit-side processing bottleneck.
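
The percentages in this section can be cross-checked with a short calculation. The sketch below only rearranges numbers already quoted in the text; the helper name percent_gain is hypothetical.

```python
def percent_gain(new, base):
    """Relative improvement of new over base, as a percentage."""
    return (new / base - 1.0) * 100.0


# PICNIC RSS: a 29% gain from one to two cores, then 6% more up to six cores.
combined_gain = (1.29 * 1.06 - 1.0) * 100.0

# Hardware RSS study [7]: throughput in Mb/s with 1, 2, and 4 RSS queues.
gain_two_queues = percent_gain(6023, 5090)
gain_four_queues = percent_gain(6274, 5090)
```

Compounding the two PICNIC steps gives about 37%, matching the overall figure reported above, while the hardware study shows gains of roughly 18% and 23% with two and four queues for its bulk transmit workload.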

Conclusions

We have introduced PICNIC as a tool to prototype NIC features in software and allow direct measurement of the performance impact of these features. PICNIC supports testing on actual systems within full-scale experimental frameworks and with full operating systems, applications, and workloads all running in wall-clock time. We demonstrated these capabilities by implementing RSS and measuring the performance improvement using a full SPECweb2005 testbed and Web server workload. We evaluated RSS performance without using RSS-capable NICs, without simulation, and without modifying any hardware. We were also able to aggregate multiple 1GbE NICs to emulate a single, faster NIC which had a single RSS classification function.

We provided details on how PICNIC may be implemented with only minimal changes to the NIC driver and operating system. In addition to the RSS case study, the results of our latency, utilization, and scalability evaluation show that PICNIC can be used as an effective model.

We have shown how we can borrow and isolate a core on a multi-core system to use software to test functionality intended for hardware components. This concept is not necessarily tied to NICs; it could be used to prototype features for other I/O devices or other hardware components. As with PICNIC, this could allow avoiding simulations which prevent using experimental frameworks with full-scale applications. It could also help avoid the cost of using a hardware prototype to implement and evaluate each potential change in features.

Acknowledgments

We thank Erik Johnson for his support in PICNIC's design and development. We also thank Arun Raghunath and Prasanna Mulgaonkar for reviewing drafts of this paper and offering excellent suggestions.

References

[1] PCI Local Bus Specification, 3.0 edition, 2004.

[2] S. Chinni and R. Hiremane. Virtual Machine Device Queues White Paper. Intel, 2007.

[3] R. Huggahalli, R. Iyer, and S. Tetrick. Direct cache access for high bandwidth network I/O. In ISCA, pages 50-59, 2005.

[4] Intel. Intel 82575EB Gigabit Ethernet Controller Product Brief, 2007.

[5] B. Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal, 22(9):107-109, 115-116, Sept. 1997.

[6] É. Lemoine, C. Pham, and L. Lefèvre. Packet classification in the NIC for improved SMP-based Internet servers. In ICN. IEEE, 2004.

[7] Y. Li. RSS performance for bulk TCP workloads. Private correspondence, 2008.

[8] Microsoft Corporation. Scalable Networking with RSS, Apr. 2005.

[9] M. Molina, S. Niccolini, and N. Duffield. A comparative experimental study of hash functions applied to packet sampling. In ITC-19, 2005.

[10] SPEC. SPECweb2005 Release 1.10 Benchmark Design Document, Apr. 2006.

[11] B. Veal and A. Foong. Performance scalability of a multi-core web server. In ANCS, pages 57-66, 2007.

[12] J. Walker. Pseudorandom Number Sequence Test Program. Fourmilab, Oct. 1998.

i Presented to the Workshop on Tools, Infrastructures and Methodologies for the Evaluation of Research Systems (TIMERS-1) in April, 2008.
ii Other names and brands may be claimed as the property of others.
iii Our unofficial SPECweb2005 results are for research purposes only.