Using Intel® TBB in network applications: Network Router emulator

Introduction

Intel® Threading Building Blocks (Intel® TBB) is used in a wide range of applications. When performance matters and the target is a multi-core platform, TBB is a natural addition to a C++ program. Network applications are typically heavily loaded: they process huge volumes of traffic under tight processing-time constraints. This article shows how TBB can be used in network packet processing software to improve its throughput and processing time.

For a sample project I created a simplified Network Router emulator. A network router is a device that routes and forwards IP (Internet Protocol) packets in a local area network (LAN). It connects several PCs and gives them access to the Internet and the internal network. The device has several internal network interfaces and one external interface.

The sample project emulates Network Router logic. It provides the following functionality:

  • Input packets from file - the application is only a model, so there is no need for a real connection to a network interface. Reading from a file emulates reading from a network interface.
  • NAT - Network Address Translation. The router has only one external IP address, but packets must be delivered to the several internal devices behind it. NAT maps ports and IP addresses from external to internal and vice versa.
  • IP routing - delivering packets to the appropriate router NIC (Network Interface Controller) according to the destination IP.
  • Bandwidth management - some traffic is real-time, and it is critical to deliver those packets as quickly as possible (e.g. Voice over IP). VoIP protocols carry telephone conversations, and delays would degrade quality. The router can prioritize these critical packets so that they are processed sooner.

I created two versions of the Network Router: serial and parallel. The latter uses Intel® Threading Building Blocks. Below I describe how TBB was used in the project and provide performance results for the parallelized program.

Network Router implementation

The Network Router emulator reads packets from a file and processes them. Packet processing includes bandwidth management, NAT translation and IP routing. Packets pass through several program modules, ordered sequentially like stations on an assembly line - a common structure for packet processing applications. The input is a text file in which each line represents one IP packet. A separate thread reads packets in large chunks.

Intel® TBB provides the tbb::pipeline class and the tbb::parallel_pipeline() function, a high-level framework for exactly this kind of program structure. A pipeline consists of filters, each of which processes packets at one stage. Every packet goes through the pipeline and is processed step by step by its filters: from the first filter to the second, then the third, and so on. A single packet is processed sequentially, but the processing of one packet is independent of any other, so the filters can operate in parallel.


Network Router scheme



Main function:

#include <iostream> 
#include <sstream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <ittnotify.h>
#include <tbb/pipeline.h>
#include <tbb/concurrent_hash_map.h>
#include <tbb/atomic.h>
#include <tbb/concurrent_queue.h>
#include <tbb/compat/thread>
// Redirects calls to "new" and "delete" to TBB thread safe allocators
#include <tbb/tbbmalloc_proxy.h>

using namespace tbb;
using namespace std;

class bandwidth_manager_t;
class network_adress_translator_t;
class ip_router_t;
class compute_t;
typedef vector<packet_trace_t> packet_chunk_t;

int chunk_size = 1600;
concurrent_queue<packet_chunk_t> chunk_queue;
atomic<bool> stop_flag;

int main(int argc, char* argv[])
{
	ip_addr_t external_ip;
	nic_t external_nic;	
	nat_table_t nat_table;	// NAT table   
	ip_config_t ip_config;	// Router network configuration 					
	int ntokens = 24;	
	
	get_args (argc, argv);	
    ifstream config_file (config_file_name);

    if (!config_file) {
        cerr << "Cannot open config file " << config_file_name << "\n";
        exit (1);
    }		
	if (! initialize_router (external_ip, external_nic, 
                            ip_config, config_file)) exit (1);	
	
	thread input_thread(input_function);

	// packet processing objects
	bandwidth_manager_t bwm;	
	network_adress_translator_t nat(external_ip, external_nic, nat_table);
	ip_router_t ip_router(external_ip, external_nic, ip_config);		

__itt_resume();
	bool stop_pipeline = false;	
	
	parallel_pipeline(ntokens,		
		make_filter<void, packet_chunk_t*>(		// Input filter
			filter::parallel,
			[&](flow_control& fc)-> packet_chunk_t*{				
				
				if (stop_pipeline){					
					fc.stop();
					return NULL;	// don't leak a chunk once the pipeline has stopped
				}				
				packet_chunk_t* packet_chunk = new packet_chunk_t(chunk_size);
					
				if(!chunk_queue.try_pop(*packet_chunk)){				
					if (stop_flag) {
						stop_pipeline = true;
					}
				}				
				return packet_chunk;
			}
		)&	// Bandwidth manager filter
		make_filter<packet_chunk_t*, packet_chunk_t*>(		
			filter::parallel,
			[&](packet_chunk_t* packet_chunk)-> packet_chunk_t*{								
				
				for(int i=0; i<packet_chunk->size(); i++){
					packet_trace_t packet;
					packet = (*packet_chunk)[i];				
					
					if (packet.nic == empty){
						break;
					}
					else{
						bwm.prioritize(packet);									
						compute_t compute;
						compute.work();						
					}										
				}
				std::sort(packet_chunk->begin(), packet_chunk->end(),
							packet_comparator);
				return packet_chunk;	
			}
		)&	// NAT filter
		make_filter<packet_chunk_t*, packet_chunk_t*>(	
			filter::parallel,
			[&](packet_chunk_t* packet_chunk)-> packet_chunk_t*{

				for(int i=0; i<packet_chunk->size(); i++){	
					packet_trace_t packet;

					packet = (*packet_chunk)[i];					
					if (packet.nic == empty)
						break;
					else{				
						nat.map(packet);
						compute_t compute;
						compute.work();	
					}
				}				
				return packet_chunk;
			}
		)&	// IP routing filter
		make_filter<packet_chunk_t*, packet_chunk_t*>(		
			filter::parallel,
			[&](packet_chunk_t* packet_chunk)-> packet_chunk_t*{			

				for(int i=0; i<packet_chunk->size(); i++){						
					packet_trace_t packet;
					packet = (*packet_chunk)[i];
					
					if (packet.nic == empty)
						break;
					else{				
						ip_router.route(packet);
						compute_t compute;
						compute.work();	
					}
				}				
				return packet_chunk;
			}
		)&	// Output filter
		make_filter<packet_chunk_t*, void>(	
			filter::parallel,
			[&](packet_chunk_t* packet_chunk){														
				
				for(int i=0; i<packet_chunk->size(); i++){						
					packet_trace_t packet;
					packet = (*packet_chunk)[i];	
					compute_t compute;
					compute.work();	

					if (packet.nic == empty)
						break;
				}	
				// No output is required , just drop packets
				delete packet_chunk; 
			}
		)
	);	
__itt_pause();

	input_thread.join();	// wait for the reader thread before exiting

	cout << "\nAll packets are processed\n\n";		
	return 0;
}

The first part is "preparation": creating objects, reading the command line, opening files and initializing. The configuration file contains information about the router's interfaces. The objects bwm, nat and ip_router are the packet processing objects. They use the containers nat_table and ip_config to store the NAT and IP tables.

The core component of the Network Router is the pipeline. It is implemented with the tbb::parallel_pipeline() function, which takes the number of tokens and a list of filters as arguments. The unit of work passed through the pipeline is a pointer to a packet_chunk_t. The ntokens parameter limits the number of chunks processed concurrently. It is set to 24 because the project was tested on a 24-core machine, and a larger value would have no effect.

Each pipeline filter performs part of the work - in this application, one packet processing stage. Filters can be serial or parallel; the mode is controlled by a filter parameter, which is filter::parallel for all filters here. This means any filter can process several chunks at the same time.

The first filter extracts a packet chunk from chunk_queue and passes it to the second filter. The second filter performs bandwidth management on each packet in the chunk: the bwm module assigns a priority to each packet according to its protocol, and the packets in the chunk are then sorted by priority. This lets critical traffic be processed as early as possible. The subsequent filters perform NAT mapping and IP routing. The last filter is the output stage, but for simplicity no real output is done - the packets are simply dropped.

A packet chunk is used as the pipeline token because it is big enough to amortize overhead. If single packets were passed through the pipeline, there would be too many transitions between threads, and the overhead would outweigh the benefit.

The __itt_resume() and __itt_pause() functions belong to the API of Intel® VTune™ Amplifier XE, which was used for performance measurements. They mark the beginning and the end of the area of interest.

The compute object of type compute_t creates a workload for the CPU. It performs additional computations to simulate the processing done in real systems: the application does not do all the work required to process and route packets in real network equipment - it is only a model framework - so by itself it would not use enough CPU. The method compute_t::work() runs the "N Queens" algorithm.
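The article does not show compute_t itself. A minimal sketch of what its work() method might look like, assuming plain bitmask backtracking over a small board (the class and method names are from the article; everything else is an assumption):

```cpp
// Hypothetical stand-in for the article's compute_t: burn CPU time by
// counting N-Queens solutions with bitmask backtracking.
class compute_t {
public:
    // Count solutions for an n x n board (n <= 31 for the bitmask trick).
    int work(int n = 8) {
        count_ = 0;
        place(0, n, 0u, 0u, 0u);
        return count_;
    }
private:
    int count_;
    // cols/diag1/diag2 are bitmasks of squares attacked in the current row.
    void place(int row, int n, unsigned cols, unsigned diag1, unsigned diag2) {
        if (row == n) { ++count_; return; }
        unsigned free_cells = ~(cols | diag1 | diag2) & ((1u << n) - 1u);
        while (free_cells) {
            unsigned bit = free_cells & (0u - free_cells); // lowest free column
            free_cells ^= bit;
            place(row + 1, n, cols | bit,
                  (diag1 | bit) << 1, (diag2 | bit) >> 1);
        }
    }
};
```

Any CPU-bound routine would do here; the point is only that each filter invocation carries a realistic amount of computation.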

Opening and reading the input file is the job of a separate thread. It is created with the std::thread class, part of the upcoming C++11 standard (here provided by tbb/compat/thread).
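The hand-off between the reader thread and the pipeline is a classic producer-consumer pattern. As a rough illustration of what tbb::concurrent_queue provides for free, here is a standard-library-only sketch of the same hand-off; the names are illustrative and chunk contents are simplified to ints:

```cpp
#include <queue>
#include <vector>
#include <mutex>
#include <thread>

// A mutex-guarded queue of chunks: roughly what tbb::concurrent_queue
// replaces (the TBB container needs no explicit locking by the caller).
struct chunk_queue_t {
    std::queue<std::vector<int> > q;
    std::mutex m;

    // Producer side: the reader thread pushes a finished chunk.
    void push(std::vector<int> chunk) {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(chunk));
    }

    // Consumer side: the pipeline's input filter pops a chunk if one
    // is available; returns false instead of blocking on an empty queue.
    bool try_pop(std::vector<int>& chunk) {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return false;
        chunk = std::move(q.front());
        q.pop();
        return true;
    }
};
```

With tbb::concurrent_queue the mutex disappears and push/try_pop scale better under contention, which is why the project uses it directly.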

Serial implementation

To understand the effect of parallelization, a serial version was created. It has a similar structure; the only difference is that parallel_pipeline is replaced with a simple while loop.

Network router serial scheme

While loop (replacing parallel_pipeline):

__itt_resume();
	bool stop = false;

	while (!stop){
		packet_chunk_t packet_chunk(chunk_size);
		
		if(!chunk_queue.try_pop(packet_chunk)){				
			if (stop_flag) {
				stop = true;
			}
		}		
		
		for(int i=0; i < packet_chunk.size(); i++){
			packet_trace_t packet = packet_chunk[i];			
			bwm.prioritize(packet);	
			compute_t compute;
			compute.work();									
		}
		std::sort(packet_chunk.begin(), packet_chunk.end(), packet_comparator);
		for(int i=0; i < packet_chunk.size(); i++){
			packet_trace_t packet = packet_chunk[i];				
			nat.map(packet);
			compute_t compute;
			compute.work();		
			ip_router.route(packet);				
			compute.work();							
			compute.work();								
		}
	}
__itt_pause();


There are four calls to compute.work() per packet - the same number as in the TBB version. This is the most CPU-time-consuming function, so it is only fair for both versions to call it the same number of times.

Data structures

Input file has the following format:

eth3 104.44.44.10 10.230.30.03 4003 5003 ftp
eth3 104.44.44.10 10.230.30.03 4003 5003 rtp
eth0 134.77.77.30 104.44.44.10 2004 4003 sip
eth3 104.44.44.10 10.230.30.03 4003 5003 http

Each line represents one packet: network interface, source and destination IP addresses, source and destination ports, and protocol. A packet is stored in the packet_trace_t structure:

typedef struct {
	nic_t nic;			// network interface where packet arrived
	ip_addr_t destIp;		// destination IP
	ip_addr_t srcIp;		// source IP
	port_t destPort;		// destination port
	port_t srcPort;		// source port 
	protocol_t protocol;	// protocol type (rtp, ftp, http, sip, etc)
	int priority;			// packet priority
} packet_trace_t;


The NAT table and the IP configuration table are stored in tbb::concurrent_hash_map. A packet chunk is stored in std::vector, and the chunk queue is of type tbb::concurrent_queue:

typedef concurrent_hash_map<port_t, address*, string_comparator> nat_table_t; 
typedef concurrent_hash_map<ip_addr_t, nic_t, string_comparator> ip_config_t; 
typedef vector<packet_trace_t> packet_chunk_t;
concurrent_queue<packet_chunk_t> chunk_queue;
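The mapping logic behind nat_table_t can be illustrated without TBB. In this sketch std::map stands in for tbb::concurrent_hash_map so the example builds anywhere; in the router the concurrent container lets the parallel filters share the table without extra locking. The function and value types are assumptions, not taken from the project:

```cpp
#include <map>
#include <string>

typedef int port_t;
typedef std::string ip_addr_t;
// Stand-in for the project's nat_table_t, which maps an external port to
// the internal address behind the router.
typedef std::map<port_t, ip_addr_t> nat_table_t;

// Translate an inbound packet's destination: the external port selects
// the internal host. Returns false when no mapping exists (the router
// would then drop or reject the packet).
bool nat_translate(const nat_table_t& table, port_t ext_port, ip_addr_t& dest) {
    nat_table_t::const_iterator it = table.find(ext_port);
    if (it == table.end()) return false;
    dest = it->second;
    return true;
}
```

With tbb::concurrent_hash_map the lookup would instead use a const_accessor, which holds a read lock on the entry for the duration of the access.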


The input file is read by a separate thread that executes input_function. It opens the file and reads it in chunks, which are pushed to the chunk queue. TBB containers are thread-safe, so the main thread can pop from the chunk queue at the same time without any additional manual synchronization. The input thread function:

void input_function(){	
    ifstream in_file (in_file_name);
    if (!in_file) {
        cerr << "Cannot open input file " << in_file_name << "\n";
        exit (1);
    }
	stop_flag = false;	
	
	while(in_file.good()){			
		packet_chunk_t packet_chunk(chunk_size);
								
		for(int i=0; i<chunk_size; i++){
			packet_trace_t packet;
			in_file >> packet;					
			packet_chunk[i] = packet;			
		}
		chunk_queue.push(packet_chunk);			
	}
	stop_flag = true;
}

Performance measurements

The goals of this project were to achieve good performance and scalability using TBB. The following setup was used for the measurements:

CPU: 4x Intel® Xeon® X7460 processors, 2.66 GHz, 24 physical cores total
RAM: 16 GB
OS: Microsoft Windows Server® Enterprise 2008 SP2
Workload: input file of 113,405 packets (5.1 MB)
Measurement tool: Intel® VTune™ Amplifier XE 2011
Analysis type: Concurrency with default settings

Two tests were performed: one for the serial version and one for the parallel version. Below are the summaries of the two analyses; the left one is for the serial version, the right one for the TBB version:

The CPU time is similar in both: this is the sum of CPU time across all cores of the system. But the elapsed time - the wall-clock time the application takes - is very different. In the serial version it is close to the overall CPU time; in the TBB version it is 19 times smaller. In other words, the application ran 19 times faster.

CPU usage for serial version:


CPU usage for TBB version:



The average number of utilized cores in the TBB version is 20.5, and for most of the processing time all 24 cores were in use. This demonstrates that the application is scalable and can use almost all cores of a multi-core system.

The bottom-up view of the serial application shows that almost all the time is spent in the computing module that simulates the real workload:



In the TBB version the picture is very similar: the main hotspot is the same compute_t::do_work method. However, it is mostly shown in green, which indicates good CPU utilization. There are also more functions in the list because of the TBB machinery:



The results show good performance for the TBB-based application. However, keep the following conditions in mind:

1) The Amplifier XE API functions __itt_resume() and __itt_pause() bound the measured area. The results therefore reflect the performance of tbb::parallel_pipeline in the TBB version and of the while loop in the serial version. Measuring the whole application would give slightly different results.

2) A simulated job was used to load the CPU: the compute_t class solves the "N Queens" problem. Real processing is different. If there were not enough work for the CPU, file input would consume relatively more time, so in a real application the scalability and performance gain may be smaller.

Conclusion

This sample project demonstrates the use of TBB in network packet processing applications and the applicability of tbb::pipeline. The same approach can be applied in IP routing switches, telecommunication servers (VoIP telephony, video conferencing), various gateways and proxies, etc. Like any heavily loaded application, network software can benefit from multi-threading, and Intel® Threading Building Blocks is a simple and effective way to manage parallelism in your application.

The full project source code: