Critical Analysis of the Need For Parallelizing Network Stacks

Submit New Article

Published On :   July 30, 2008 12:00 AM PDT
Rate
 


by Annie Foong, Network Research Scientist
Erik J. Johnson, Network Software Engineer

Introduction

In the new era of “parallelize or paralyze”, the rush to thread software is on. In this article, we make a case that it pays to critically analyze which parts of our software to parallelize first. We detailed a case-study on networking stacks and the applications that use those stacks.  When faced with the tradeoff of parallelizing a networking stack or the application, we derived simple equations that helped us decide what we should parallelize first, and the results were surprising. More importantly, these are general equations and apply to any workload. They precisely quantify the impact the parallelization of a particular component of our software has on systems of today and the future.

Background

The general perception in the networking realm is while the many networking applications are well parallelized (threaded), the network stacks (including OS overheads) are not. As such, many efforts exist to parallelize network stacks. However, such parallelization represents a major undertaking, involving modification of large amounts of kernel code with potentially sanity-wrecking consequences.

Therefore, we wanted to be absolutely sure such an effort was warranted. We performed preliminary work before diving deep into code development. This involved carrying out a detailed analysis of a networking stack to determine the extent to which it is already threaded (a subject for another article), as well as quantifying the performance improvement of networked applications if we were to fully parallelize the stack (the subject of this article). We have chosen Linux-2.6 as our reference implementation.

We will discuss the math derivation in our “Theory” section and Appendices. We felt that the derivation is important to show, so that others can further generalize the equations. We have made the section “Applying the Theory to Our Case Study” to be fairly standalone, and insights from that section can be applied directly to workloads that mirror ours.

Theory

Throughout the rest of this article we’ll refer to the symbols found in Table 1.

Table 1 - Symbols and their meanings

Symbol

Meaning

Tn Processing requirement of the networking stack
Ta Processing requirement of the application
p Total number of cores on the system
n Number of cores the networking stack can concurrently execute on. n <= p
a Number of cores the application can concurrently execute on
For a time-shared symmetric OS: a = p
For a dedicated processing asymmetric OS: a = (p – n)
tn Run time of the networking stack
ta Run time of the application
tn/a Run time of the entire workload when the network can concurrently execute on n cores, and the application can concurrently execute on a cores

In approaching our problem of estimating the speed up of a parallel networking application and single threaded networking stack, we first turned to Amdahl’s Law. Conceptually, Amdahl’s law can be thought of as the one-dimensional illustration shown in Figure 1. If a system’s original run-time is composed of the networking stack execution time (tn) followed by the application’s execution time (ta), and ta can be sped up by a factor of two, the new runtime is tn plus half of ta. Intuitively, if ta is a small percentage of the overall runtime, speeding up ta is of little value.

Figure 1 A one-dimensional illustration of Amdahl's Law

One of the beautiful properties of Amdahl’s law – a property we want to keep in our analysis - is that only readily-obtained information about the original application is needed. Specifically, if we know the ratio of the processing requirement of the networking stack and application, and the estimated speedup on either the application or stack, we can apply Amdahl’s law. For example, we readily obtained an estimation of such ratios of several networking workloads (Table 2).

Table 2 Processing requirements of a sample of networking workloads

Sample Workloads Ratio of Application:Network stack processing requirements (Ta/Tn)
IP forwarding 1:64
SNORT (Intrusion Detection) 4:1 [Kunze06]
Web server (file download) 4:1 [Veal07]
Web server (SSL) 8:1 [Patwardhan04]

With the knowledge of common application-to-networking stack ratios, we were ready to apply Amdahl’s law. F irst, we assumed we had p cores. [1] Furthermore, we assumed the application can be parallelized perfectly, but the networking stack cannot. A naïve application of Amdahl’s law led us to the conclusion that t­a is sped up by a factor of and tn is not sped up. That is, the application executes on (p-1) cores while the networking stack executed on the one remaining core. The result is conceptually illustrated in Figure 2.

Figure 2 A naïve application of Amdahl's Law to our parallel application

Unfortunately Figure 2 is not correct. Consider the situation where the parallel application executes faster than the networking stack. In this case, the runtime should simply be the runtime of the networking stack. Moreover, when the networking stack runs faster than the parallelized application, after the networking stack completes executing, the application should be allowed to execute on p cores (not just p-1). Neither of these subtleties is borne out in the one-dimensional view of Amdahl’s law.

A Two-dimensional View of Amdahl’s Law

To resolve these issues, we added a second dimension to the intuitive illustration of Amdahl’s law; this second dimension represents cores. With this two-dimensional conceptual model, the total processing requirement (cycles) of either the application or stack is represented by an area, as opposed to the linear representation in the one-dimensional view.

For example, the processing requirement the un-parallelized network stack is represented by a one-unit high bar of the same length as the original network stack (area = 1 (core) x tn (runtime)). The area for the parallelized application is represented by a bar of height p when the networking stack is not executing (Figure 3(a)); and a bar of height (p-1) when the networking stack is executing (Figure 3(b)).

They show a simple case where the networking stack can only be executed on 1 core and the application can execute on all p cores. In Figure 3(a), the parallelized application takes longer to execute than the networking stack and hence the application utilizes p cores after the networking stack completes. In Figure 3(b), the network stack takes longer to complete.

Figure 3 A two-dimensional view of Amdahl's Law

It is straightforward to extend this model to apply to more general cases (see Appendix A for details). The result is equation (1) that gives us the runtime of the networking workload that comprised of two parts, Ta and Tn. In the time-shared OS, Ta can be parallelized to all p cores, Tn can be parallelized to n (where n < p) cores

Equation (1)


tn/p = otherwise

 

The total area of the two-dimensional chart represents the total amount of processing capacity (cycles). (Ta + Tn) represents the total cycles consumed. Each part of the application consumes this area as best it can, based on its ability to be parallelized. When possible, the application and network stack can be executed “in parallel” if enough cycles are available.

Applying the Theory to Our Case Study

So, how do we put the theory to use?  For a single core system, the total runtime of the workload would be (ta + tn). Incidentally, for a single core, (ta + tn) = (Ta + Tn). We use this as our baseline performance. The maximum speedup possible is therefore given by:

Equation (2)

Speedup                 = (ta + tn) / tn/p       where tn/p is given in equation (1)

                              = (Ta + Tn) / tn/p

 

                                  =

 Figure 4 Performance scaling if the network stack cannot be parallelized

Figure 4 is the graphical representation of equation (2) in the case where we cannot parallelize the networking stack, when compared to the baseline.

The y-axis shows speedup possible. Ideal speedup is achieved when speedup = p. For example, for a packet-forwarding workload (application:network ratio=1:64), we see that the maximum possible speedup is pegged at 1, no matter how many cores we have in the system. On the other hand, for a SSL web server workloads (application:network ratio=8:1), the parallelization of the networking stack has absolutely no impact on performance until the system reaches a core count of 8 (Such a core count is only recently available in general-purpose CPUs).

This simple insight seems obvious in hindsight and yet its implications on our work were enormous. For example, if we had a SSL web server workload on a 4-core system, and we focused solely on optimizing the networking stack performance, not only would we never have seen an improvement in performance, we would have also changed the application:network ratio to something larger than 8:1, making further optimizations on the network stack even less desirable.



Now that we did all the hard work to come up with generic equations, we derived insights that are less obvious (Figure 5). In this figure, we varied the number of cores the networking stack could scale to (i.e., n) while fixing the application to stack execution time ratio. The top chart shows the impact of networking stack parallelization for our SSL web server workload, and the bottom chart shows the impact of network stack parallelization for our packet forwarding workload. Consider the point n=6, which indicates that network stack could scale to 6 cores. In this case, our web server has perfect speedup (assuming the application sc aled to any number of cores) all the way up to and including 32 cores; Our packet forwarder will not scale beyond 6 cores.

 

Figure 5  Speedup as a function of how much we can parallelize the network stack

What we have discussed so far assumes a time-shared OS scheduler, of a workload with 2 independent functional parts. The detailed derivation in Appendix A can easily be extended to model processing that has n independently executing functional parts. Finally, the same derivation is also applied to a dedicated processing paradigm, where some number cores are sequestered to perform a specific task. It is actually a much easier model to work out and is provided in Appendix B.      

Conclusion

For our specific case, we come to many insights about parallelizing network stacks. On one hand, we know that for packet-forwarding workloads, it pays tremendously to keep optimizing the parallelism of network stacks, while such efforts must be undertaken with more care when applied to web server domains.  It is important to keep in mind that for server workloads such as SSL web serving, we should not expect to see any difference in performance on current systems (with less than or equal to 8 cores) no matter how we parallelize the network stack. This is not to say that we stop further efforts in this area. On the contrary, we advocate continuous vigilance, as network parallelism [2] will be hugely important as we continue to push systems with more cores.

In addition, we have derived equations to quantify the benefits possible when dedicating cores to the network stack (Appendix B). Dedicating cores to specific tasks can provide huge opportunities to highly optimize processing of each sequestered domain [Foong05, Kunze06]. However, such schemes imply that hardware resources are statically assigned. In other words, one and only one optimal point of performance exists for a given application to network stack ratio for that given system. When the application:network ratio changes for whatever reasons (e.g. change in driver code), the optimal performance will no longer be valid. For workloads where the application:network ratio is expected to be reasonably constant (e.g. network infrastructure and storage appliances), this dedicated model can be rigorously used whenever practical. However, for other workloads (e.g. generic end systems), the benefits derived from dedicating cores must be carefully weighed against the potential departure from optimum when implementing such systems.

Arriving at such insights required only simple profiling (e.g. Intel® VTune Performance Analyzer, Oprofile) of our favorite workloads, and applying a few simple equations.  What we gained is a quantification of expected performance upon which to base our engineering choices. We believe that applying these models in our daily work is worth the effort. More importantly, we should not presume that parallelizing the most difficult parts of the code is the best way to go. To this the end, combining good coding techniques with critical analysis will make us become far more effective engineers.

For a spreadsheet that implements the models discussed in the article, please email the authors .

Acknowledgements

We are very grateful to our technical reviewers John Wiegert, Charles Congdon and Dave Dunning for their careful reviews and immensely useful comments. Their feedback, together with the editing done by our literary reviewers, definitively improved the quality of this work.

About the authors

Annie Foong is a Network Research Scientist with the Communications Technology Laboratory at Intel, where for the last 7 years, had been involved in optimizing networking performance. She began life at Intel by implementing interfaces that enabled unmodified commercial applications to run on VI Architecture hardware. She had since been working continuously at advancing generic IA platforms as the highly-efficient architecture for TCP processing. She has authored more than a dozen technical papers in these areas. She especially enjoys abstracting complex system issues to their simplest forms. She is currently looking at furthering networking capabilities on end systems such that end to end performance over any network can be improved.

Erik J. Johnson is a network software engineer in the Networking Technology Lab, which is part of the Corporate Technology Group at Intel. After receiving a B.S. and an M.S. both in computer and electrical engineering from the University of Arizona, Erik programmed Tektronix digital oscilloscopes from 1996 through 1999. After joining Intel in 1999, Erik spent four years programming the IXP family of network processors, where he co-authored two books on the subject, the most recent being “IXP2400/2800 Programming.” Currently Erik is researching packet processing technologies for Intel Architecture-based chip-multiprocessors, and platform security technologies for hardening online games against hacking.

References

[Foong05] “An Architecture for Software-based iSCSI on Multiprocessor Servers,” Annie Foong, Gary McAlpine, Dave Minturn, Greg Regnier, and Vikram Saletore, CAC 2005.

[Kunze06] “Symerton: Using virtualization to accelerate packet processing,” Aaron Kunze, Stephen Goglin and Erik Johnson, Proc of ANCS, 2006.

[Padwathan04] Communication Breakdown: Analyzing CPU Usage in Commercial Web Workloads, Jaidev Patwardhan, Alvin Lebeck, and Daniel Sorin, IEEE ISPASS 2004.

[Veal07] SpecWeb2005 Support workload characterization, private communications, 2006

Appendix A Derivation for a time-sliced model

Figure 6  Processing in a time-shared OS

In this appendix, we describe in detail how we derive equation 1. Figure 6 shows the processing that may go on in a system with 2 cores. The “combined area” concept (Figure 7) makes derivation easier because we can visualize network processing filling in the time slices where the application does not use, independent of what core they actually execute on, and the sequence of execution.

Let

Ta=total processing requirement of apps

Tn=total processing requirement of networking

p= total number of cores

To complete any transaction, the application and network stack will b e done in time-slices until completion.  

Assumptions made

1. Application and networking processing can proceed independently of each other

2. Application processing scales up to p (executes concurrently on p cores)

3. Network processing scales up to n (executes concurrently on n cores)

4. Load can be scheduled and balanced across cores without penalty (in reality, this is not true, and should be considered as part of network processing overheads and parallelism)

 With these assumptions, we can represent the total processing requirement of each independent component as:

 

 

To simplify, we say that each core has 1 unit of processing capacity

 

 

In all cases (for any given n or p)

Time taken to do network processing =                 

Time taken to do application processing =                 

 

Figure 7  Performance scaling of n out of p cores

 

Network processing is the bottleneck when total network processing time is larger than application processing time

i.e.                                                     

When networking is the bottleneck (Figure 7a), total processing time is given by:

When networking is not the bottleneck (Figure 7b), that processing time becomes:

It helps to visualize total processing time as the network processing part plus the “spill-over” from the application processing part.

 

In the case where neither is the bottleneck, using either of these will lead us to:

 

Combining everything we discussed, we arrive at Equation 1 discussed in this article.

Let tn/p: Run time when n out of p cores can perform network processing concurrently; and p out of p cores can perform application processing concurrently.

 


tn/p = otherwise

  

Appendix B: Derivation for a dedicated processing model

In this section, we derive similar scaling equations for the dedicated processing model. Equation 2 gives us the total processing time where n cores are dedicated to network processing, and (p-n) cores dedicated to application processing.

Following the steps in Appendix A, we derive the following:

 



tn/p-n =                

                  Otherwise

 

Figure 8 n cores dedicated to the network stack, (p-n) cores dedicated to the application

Baseline selection is slightly trickier here. Technically, this is a dedicated model, and processing on 1 core of 2 separate parts does not make sense. But a baseline is simply something we compare performance against. There is no hard and fast rule. Hence, we have chosen to also use the time-shared OS baseline here. Once we establish what our baseline is, derivation of speedup is as before:

 

Speedup                 =                 (Ta +Tn) / tn/p-n                        

 

                                  =

 

Figure 9 Performance scaling possible if 6 cores are dedicated to networking

As an example, figure 9 shows the performance possible when we dedicated 6 out of the p cores on the system, solely to network processing. We picked the number 6 arbitrarily. What we are more interested in calling out is that fact that there is one and only one optimal point of speedup for any combination of application:network ratio and number of cores. E.g. with 128 cores in the system (Tag A), the optimal speedup occurs when the application:network ratio is around 32:1. With 16 cores (Tag B), the optimal speedup occurs when that ratio is around 2:1. Performance in any other combinations is substantially compromised.


[1] Or hardware threads.

[2] As it turned out, we found Linux’s implementation of network stacks to be highly p arallelized.