AES-GCM Encryption Performance on Intel® Xeon® E5 v3 Processors

This case study examines the architectural improvements made to the Intel® Xeon® E5 v3 processor family in order to improve the performance of the Galois/Counter Mode of AES block encryption. It looks at the impact of these improvements on the nginx* web server when backed by the OpenSSL* SSL/TLS library. With this new generation of Xeon processors, web servers can obtain significant increases in maximum throughput by switching from AES in CBC mode with HMAC+SHA1 digests to AES-GCM.

Background

The goal of this case study is to examine the impact of the microarchitecture improvements made in the Intel Xeon v3 line of processors on the performance of an SSL web server. Two significant enhancements relating to encryption performance were latency reductions in the Intel® AES New Instructions (Intel® AES-NI) and a latency reduction in the PCLMULQDQ instruction. These changes were designed specifically to increase the performance of the Galois/Counter Mode of AES, commonly referred to as AES-GCM.

One of the key features of AES-GCM is that the Galois field multiplication that is used for message authentication can be computed in parallel with the block encryption. This permits a much higher level of parallelization than is possible with chaining modes of AES, such as the popular Cipher Block Chaining (CBC) mode. The performance gain of AES-GCM over AES-CBC with HMAC+SHA1 digests was significant even on older generation CPUs such as the Xeon v2 family, but the architectural improvements to the Xeon v3 family further widen the performance gap.
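To make the parallelism concrete, the following is a minimal Python sketch of the Galois field arithmetic behind GCM's authentication hash. It is written for clarity, not speed, and is not how OpenSSL implements it (production code uses PCLMULQDQ). The function names are hypothetical. It shows that the usual serial (Horner) evaluation gives the same result as a sum of independent products, each of which can be computed without waiting for the others.

```python
# Illustrative sketch only: GHASH-style arithmetic in GF(2^128).
# Function names are hypothetical and chosen for this example.

R = 0xE1 << 120  # reduction constant for the GCM polynomial, in GCM bit order


def gf_mult(x, y):
    """Multiply two 128-bit field elements using GCM's bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # bit i of x, counting from the MSB
            z ^= v
        # Multiply v by the field variable, reducing when a bit falls off
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_serial(h, blocks):
    """Horner form: every step depends on the previous result."""
    y = 0
    for block in blocks:
        y = gf_mult(y ^ block, h)
    return y


def ghash_aggregated(h, blocks):
    """Same value as a sum of independent products block_i * H^(n-i)."""
    n = len(blocks)
    powers = [h]                       # powers[k] holds H^(k+1)
    for _ in range(n - 1):
        powers.append(gf_mult(powers[-1], h))
    acc = 0
    for i, block in enumerate(blocks):
        acc ^= gf_mult(block, powers[n - 1 - i])  # products are independent
    return acc


if __name__ == "__main__":
    h = 0x66E94BD4EF8A2C3B884CFA59CA342B2E   # arbitrary example hash key
    blocks = [0x0123456789ABCDEF << 64, 0xFEDCBA9876543210, 1 << 127]
    print(ghash_serial(h, blocks) == ghash_aggregated(h, blocks))
```

Because the products in the aggregated form do not depend on one another, they can be interleaved with counter-mode block encryption, which itself has no block-to-block dependency. CBC encryption, by contrast, must finish each block before starting the next.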

Figure 1 shows the throughput gains realized from OpenSSL’s speed tests by choosing the aes-128-gcm EVP over aes-128-cbc-hmac-sha1 on both Xeon E5 v2 and Xeon E5 v3 systems. The hardware and software configuration behind this test is given in Table 1. What this shows is that AES-GCM outperforms AES-CBC with HMAC+SHA1 on Xeon E5 v2 by as much as 2.5x, but on Xeon E5 v3 that jumps to nearly 4.5x. The performance gap between GCM and CBC nearly doubles from Xeon E5 v2 to v3.
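The "nearly doubles" statement follows from the two ratios quoted above. A short check, treating the quoted "as much as 2.5x" and "nearly 4.5x" as approximate peak values rather than exact measurements:

```python
# Approximate peak GCM-over-CBC speedups quoted in the text
gcm_vs_cbc_v2 = 2.5   # Xeon E5 v2
gcm_vs_cbc_v3 = 4.5   # Xeon E5 v3

# How much the GCM/CBC gap grows from v2 to v3
gap_growth = gcm_vs_cbc_v3 / gcm_vs_cbc_v2
print(f"{gap_growth:.2f}x")   # prints "1.80x"
```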


Table 1. Hardware and software configurations for OpenSSL speed tests.

In order to assess how this OpenSSL raw performance translates to SSL web server throughput, this case study looks at the maximum throughput achievable by the nginx web server when using these two encryption ciphers.


Figure 1. Relative OpenSSL 1.0.2a speed results for the aes-128-gcm and aes-128-cbc-hmac-sha1 EVPs on Xeon E5 v2 and v3 processors

The Test Environment

The performance limits of nginx were tested for the two ciphers by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that nginx could achieve for the duration without connection latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.
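The selection procedure described above can be sketched as follows. This is a hedged illustration, not the harness used in the study: `run_load` is a hypothetical callable standing in for one two-minute run at a given number of simultaneous connections, and it is assumed to return the measured throughput and the maximum observed latency.

```python
def find_max_throughput(run_load, concurrency_levels, latency_limit_s=2.0):
    """Return (concurrency, throughput) for the best run within the latency limit.

    run_load(concurrency) is a hypothetical stand-in for one timed test run;
    it must return (throughput_gbps, max_latency_s).
    """
    best = None
    for concurrency in concurrency_levels:
        throughput_gbps, max_latency_s = run_load(concurrency)
        if max_latency_s > latency_limit_s:
            continue                      # run violated the 2-second limit
        if best is None or throughput_gbps > best[1]:
            best = (concurrency, throughput_gbps)
    return best


if __name__ == "__main__":
    # Toy model for demonstration only; these are not measured values.
    def fake_run(concurrency):
        return min(concurrency / 20, 12.0), concurrency / 100

    print(find_max_throughput(fake_run, [50, 100, 200, 400]))
```

In the toy model the 400-connection run has the highest throughput but exceeds the latency limit, so the 200-connection run is selected.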

Nginx was installed on a pre-production, two-socket Intel Xeon server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz with Turbo on and Hyper-Threading off. The system was running Ubuntu* Server 13.10. Each E5 processor had 14 cores for a total of 28 hardware threads. Total system RAM was 64 GB.

The SSL capabilities for nginx were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general-purpose cryptographic functions, and its 1.0.2 branch is optimized for the Intel Xeon v3 processor. More information on OpenSSL can be found at http://www.openssl.org/. The tests in this case study used 1.0.2-beta3 because the production release was not yet available when the tests were run.

The server load was generated by up to six client systems as needed; a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the nginx server with multiple 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.
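As a quick sanity check on the link budget, the aggregate capacity on each side can be totaled. This sketch assumes that each "single 10 Gbit NIC" contributes exactly one 10 Gbit port, which the text does not state explicitly.

```python
PORT_GBIT = 10

# Server: two 4-port cards and two 2-port cards
server_ports = 2 * 4 + 2 * 2

# Clients: two with 4-port cards, four with one port each (assumption)
client_ports = 2 * 4 + 4 * 1

server_gbit = server_ports * PORT_GBIT
client_gbit = client_ports * PORT_GBIT
print(server_gbit, client_gbit)   # prints "120 120"
```

Under that assumption the server and the clients each have 120 Gbit of aggregate link capacity, so every server port can be paired with one client port.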

The network diagram for the test environment is shown in Figure 2.


Figure 2. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an Open Source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results.
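The wrapper itself is not reproduced in this case study. The sketch below only illustrates the two ends of such a wrapper under stated assumptions: building one ab command line per network interface, and merging per-instance results. The helper names, the result dictionary layout, and the example URL and addresses are hypothetical. The ab options used (-t, -n, -c, -Z, -B) are standard ab options, but which ones the original wrapper passed is an assumption. The network synchronization between hosts is not shown.

```python
def build_ab_command(url, concurrency, cipher, bind_addr, seconds=120):
    """Build an argv list for one ab instance bound to one local address."""
    return [
        "ab",
        "-t", str(seconds),      # time limit for the run
        "-n", "100000000",       # placed after -t, which otherwise caps requests
        "-c", str(concurrency),  # simultaneous connections for this instance
        "-Z", cipher,            # OpenSSL cipher suite name
        "-B", bind_addr,         # local address, i.e. which interface to use
        url,
    ]


def collate(results):
    """Combine per-instance results (hypothetical dict layout)."""
    return {
        "total_gbps": sum(r["gbps"] for r in results),
        "max_latency_s": max(r["max_latency_s"] for r in results),
    }


if __name__ == "__main__":
    # Example values only; the host name and addresses are placeholders.
    for addr in ("192.0.2.10", "192.0.2.11"):
        print(" ".join(build_ab_command(
            "https://server.example/file.bin", 64, "AES128-GCM-SHA256", addr)))
    print(collate([
        {"gbps": 4.0, "max_latency_s": 1.2},
        {"gbps": 3.5, "max_latency_s": 1.7},
    ]))
```

Throughput is summed across instances, while latency is reduced with a maximum, because the acceptance criterion in this study applies to the slowest request across all instances.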

The Test Plan

The goal of the tests was to determine the maximum throughput that nginx could sustain over 2 minutes of repeated, incoming connection requests for a target file, and to compare the results for the AES128-SHA cipher to those of the AES128-GCM-SHA256 cipher on the Xeon E5-2697 v3 platform. Note that in the GCM cipher suites, the _SHA suffix refers to the SHA hashing function used as the Pseudo Random Function algorithm in the cipher, in this case SHA-256.
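One way to inspect how a local OpenSSL build describes these two cipher names is through Python's standard ssl module, as in the sketch below. The helper name is hypothetical, and whether a given name is available depends on the OpenSSL build and its security policy, so the output will vary between systems.

```python
import ssl


def openssl_entries(cipher_name):
    """Return the local OpenSSL entries matching one cipher suite name.

    Returns an empty list if the local build rejects the name.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        ctx.set_ciphers(cipher_name)
    except ssl.SSLError:
        return []
    # get_ciphers() may also list suites that set_ciphers() does not control,
    # so filter on the requested name.
    return [c for c in ctx.get_ciphers() if c["name"] == cipher_name]


if __name__ == "__main__":
    for name in ("AES128-SHA", "AES128-GCM-SHA256"):
        for entry in openssl_entries(name):
            print(entry["name"], "|", entry["description"])
```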


Table 2. Selected TLS Ciphers

Each test was repeated for a fixed target file size, starting at 1 MB and increasing by factors of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of files 1 MB and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.
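The resulting set of target file sizes can be enumerated directly from that description, using the binary units defined above:

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Start at 1 MB and multiply by four until 4 GB is reached
sizes = []
size = MB
while size <= 4 * GB:
    sizes.append(size)
    size *= 4

print([s // MB for s in sizes])   # sizes in MB
```

This yields seven sizes: 1, 4, 16, 64, 256, 1024, and 4096 MB.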

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)

Hyper-Threading was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that nginx performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as the projected performance of a system with many cores.

The many-core runs test the scalability of the system, and also introduce the possibility of system resource limits beyond just CPU utilization.
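A per-core figure leads to a simple linear projection, sketched below. The function name is hypothetical, and the 8 Gbps/core input is the approximate GCM figure reported in the Results section. The projected values are idealized CPU-limited ceilings, not measurements.

```python
def projected_gbps(cores, per_core_gbps):
    """Idealized CPU-limited throughput, assuming perfect linear scaling."""
    return cores * per_core_gbps


PER_CORE_GCM_GBPS = 8.0   # approximate figure from the Results section

for cores in (2, 4, 8, 16):
    print(cores, "cores:", projected_gbps(cores, PER_CORE_GCM_GBPS), "Gbps")
```

Comparing such a projection against the measured many-core throughput is what exposes limits other than the CPU: where the measured value falls well short of the projection, some other system resource is the bottleneck.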

System Configuration and Tuning

Nginx was configured to operate in multi-process mode, with one worker for each physical thread on the system.

An excerpt from the configuration file, nginx.conf, is shown in Figure 3.

worker_processes 16; # Adjust this to match the core count

events {
        worker_connections 8192;
        multi_accept on;
}

Figure 3. Excerpt from nginx configuration

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

*               soft    nofile          150000
*               hard    nofile          180000

Figure 4. Excerpt from /etc/security/limits.conf
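Whether the raised limits are actually in effect for a given process can be checked with Python's standard resource module (Unix only). This is a minimal sketch; the helper name is hypothetical, and the default targets simply mirror the values in Figure 4. The limits reported are those of the process running the check, which may differ from those of the nginx workers depending on how nginx is started.

```python
import resource


def meets_target(soft, hard, want_soft=150000, want_hard=180000):
    """Return True if both limits are at least the values from Figure 4."""
    def at_least(value, target):
        # An unlimited setting satisfies any target
        return value == resource.RLIM_INFINITY or value >= target
    return at_least(soft, want_soft) and at_least(hard, want_hard)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft:", soft, "hard:", hard, "ok:", meets_target(soft, hard))
```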

Several kernel parameters were also adjusted (some of these settings are more relevant to bulk encryption):

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Increase system IP port limits to allow for more connections
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1

# maximum number of pending connection requests (SYNs) to queue
# before the kernel starts dropping them
net.ipv4.tcp_max_syn_backlog = 3240000

# increase the maximum number of sockets held in TIME_WAIT
net.ipv4.tcp_max_tw_buckets = 1440000

# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216

Figure 5. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated SSL/TLS web server.
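To confirm that settings such as those in Figure 5 are live on a running Linux system, the file can be parsed and compared against the corresponding entries under /proc/sys. The sketch below uses hypothetical helper names and assumes the simple "key = value" layout shown in the excerpt.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = " ".join(value.split())  # normalize spacing
    return settings


def proc_path(key):
    """Map a sysctl key such as net.core.rmem_max to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def live_value(key):
    """Return the running kernel's value, or None if it cannot be read."""
    try:
        return " ".join(proc_path(key).read_text().split())
    except OSError:
        return None


if __name__ == "__main__":
    wanted = parse_sysctl(
        "net.ipv4.tcp_tw_reuse = 1\nnet.ipv4.ip_local_port_range = 2000 65535\n"
    )
    for key, value in wanted.items():
        print(key, "wanted:", value, "live:", live_value(key))
```

Whitespace is normalized on both sides because the kernel reports multi-value settings separated by tabs.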

No other adjustments were made to the stock Ubuntu 13.10 server image.

Results

The maximum throughput in Gbps achieved for each cipher by file size is shown in Figure 6. At the smallest file size, 1 MB, the differences between the GCM and CBC ciphers are modest because the SSL handshake dominates the transaction, but for the larger file sizes the GCM cipher outperforms the CBC cipher by 2 to 2.4x. Raw GCM performance is roughly 8 Gbps/core. This holds true up until 8 cores, when the maximum throughput is no longer CPU-limited. This is the point where other system limitations prevent the web server from achieving higher transfer rates, which is more dramatically revealed in the 16-core case. Here, both ciphers see only a modest increase in throughput, though the CBC cipher realizes a larger benefit.

This is more clearly illustrated in Figure 7, which plots the maximum CPU utilization of nginx during the 2-minute run for each case. In the 2- and 4-core cases, %CPU for both ciphers is in the high 90s, and in the 8-core case it ranges from 80% for large files to 98% for smaller ones.


Figure 6. Maximum nginx throughput by file size for given core counts

It is the 16-core case where system resource limits begin to show significantly, along with large differences in the performance of the ciphers themselves. Here, the total throughput has only increased incrementally from the 8-core case, and it’s apparent that this is because the additional cores simply cannot be put to use. The GCM cipher is using only 50 to 70% of the available CPU. It’s also clear that the GCM cipher is doing more—specifically, providing a great deal more throughput than the CBC cipher—with significantly less compute power.


Figure 7. Max %CPU utilization at maximum nginx throughput

Conclusions

The architectural changes to the Xeon v3 family of processors have a significant impact on the performance of the AES-GCM cipher, and they provide a very compelling argument for choosing it over AES-CBC with HMAC+SHA1 digests for SSL/TLS web servers.

In the raw OpenSSL speed tests, the performance gap between GCM and CBC nearly doubles from the Xeon E5 v2 family. In the web server tests, the use of the AES-GCM cipher led to roughly 2 to 2.4x the throughput of the AES-CBC cipher, and absolute data rates of about 8 Gbps/core. The many-core configurations are able to achieve total data transfer rates in excess of 50 Gbps before hitting system limits. This level of throughput was achieved on an off-the-shelf Linux installation with very minimal system tuning.

It may be necessary to continue to support AES-CBC with HMAC+SHA1 digests due to the large number of clients that cannot take advantage of AES-GCM, but AES-GCM should certainly be enabled on web servers running on Xeon v3 family processors in order to provide the best possible performance, not to mention the added security, that this cipher offers.

 
