Examining the Impact of the AVX2 instructions on HAProxy* Performance
One of the key components of a large datacenter or cloud deployment is the load balancer. Providing a service with high availability requires multiple redundant servers, transparent failover, the ability to distribute load evenly across them, and of course the appearance of being a single server to the outside world, especially when negotiating SSL sessions. This is the role that the SSL-terminating load balancer is designed to fill, and it is a demanding one: every incoming session must be accepted, SSL-terminated, and transparently handed off to a back-end server system as quickly as possible, since the load balancer is a concentration point and potential bottleneck for incoming traffic. This case study examines the impact of the Intel® Xeon® E5 v3 processor family and the Advanced Vector Extensions 2 (AVX2) instructions on the SSL handshake, and how the AVX2-optimized algorithms inside OpenSSL* can significantly increase the load capacity of the open source load balancer, HAProxy*.
The goal of this case study is to examine the impact of code optimized for the Intel Xeon v3 line of processors on the performance of the HAProxy load balancer.
One of the new features introduced with this processor family is the AVX2 instruction set, which expands the AVX integer instructions to 256 bits. This is relevant to HAProxy in SSL mode because public key cryptography algorithms make heavy use of large integer arithmetic, and the larger registers allow for more efficient execution since they can store larger values. Accelerating this arithmetic on the server directly impacts the performance of the SSL handshake: the faster it can be performed, the more handshakes the server can handle, and the more connections per second can be SSL-terminated and handed off to back-end servers.
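To make the link between operand width and arithmetic cost concrete, the following minimal sketch multiplies two 2048-bit integers limb by limb and counts the limb multiplications for two limb widths. It is an illustration only: the function names are hypothetical, and OpenSSL's AVX2 code paths are hand-written assembly organized quite differently. The point is simply that the amount of work depends on how many pieces a large integer must be split into.

```python
def to_limbs(value, limb_bits):
    """Split a non-negative integer into little-endian limbs of limb_bits bits."""
    mask = (1 << limb_bits) - 1
    limbs = []
    while value:
        limbs.append(value & mask)
        value >>= limb_bits
    return limbs or [0]


def schoolbook_multiply(a, b, limb_bits):
    """Multiply limb by limb; return (product, number of limb multiplications)."""
    a_limbs = to_limbs(a, limb_bits)
    b_limbs = to_limbs(b, limb_bits)
    product = 0
    multiplications = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # Each partial product is shifted to its position and accumulated.
            product += (x * y) << (limb_bits * (i + j))
            multiplications += 1
    return product, multiplications


if __name__ == "__main__":
    a = (1 << 2047) + 12345  # two arbitrary 2048-bit operands
    b = (1 << 2047) + 67890
    for bits in (32, 64):
        result, count = schoolbook_multiply(a, b, bits)
        assert result == a * b  # cross-check against Python's built-in big integers
        print(f"{bits}-bit limbs: {count} limb multiplications")
```

Doubling the limb width halves the number of limbs per operand and so cuts the number of partial products in this simple scheme to a quarter.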
The performance limits of HAProxy were tested for various TLS cipher suites by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined, as was the resulting connection rate that was sent to the HAProxy server. The number of simultaneous connections was adjusted between runs to find the maximum connection rate that HAProxy could sustain for the duration without session latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.
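The selection rule described above can be sketched in a few lines. The `Run` record and the function name are hypothetical, and the values in the usage note are made up for illustration; only the two-second ceiling comes from the study.

```python
from dataclasses import dataclass

LATENCY_LIMIT_S = 2.0  # latency ceiling used in the study


@dataclass
class Run:
    connections: int      # simultaneous connections used for the run
    rate_cps: float       # measured connections per second
    max_latency_s: float  # worst latency seen across all requests


def max_sustainable_rate(runs, limit_s=LATENCY_LIMIT_S):
    """Return the highest-rate run whose worst latency stays within the limit."""
    acceptable = [r for r in runs if r.max_latency_s <= limit_s]
    return max(acceptable, key=lambda r: r.rate_cps, default=None)
```

For example, given runs of `Run(500, 3000.0, 0.8)`, `Run(1000, 5200.0, 1.7)`, and `Run(1500, 5600.0, 2.9)`, the function returns the second run: the third reaches a higher rate but breaks the latency limit.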
HAProxy v1.5.8 was installed on a pre-production, two-socket Intel Xeon server system populated with two pre-production E5-2697 v3 processors clocked at 2.60 GHz with Turbo on, running Ubuntu* Server 13.10. Each E5 processor had 14 cores, for a total of 28 cores (28 hardware threads with Hyper-Threading off), and with Hyper-Threading enabled the system was capable of 56 threads in total. Total system RAM was 64 GB.
HAProxy is a popular, feature-rich, and high-performance open source load balancer and reverse proxy for TCP applications, with specific features designed for handling HTTP sessions. Beginning with version 1.5, HAProxy includes native SSL support on both sides of the proxy. More information on HAProxy can be found at http://www.haproxy.org/.
The SSL capabilities for HAProxy were provided by the OpenSSL library. OpenSSL is an open source library that implements the SSL and TLS protocols in addition to general purpose cryptographic functions. The 1.0.2 branch, in beta as of this writing, is enabled for the Intel Xeon v3 processor and supports the MULX instruction in many of its public key cryptographic algorithms. More information on OpenSSL can be found at http://www.openssl.org/.
The server load was generated by up to six client systems as needed, a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the HAProxy server with one or more 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.
The network diagram for the test environment is shown in Figure 1.
Figure 1. Test network diagram.
The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.
Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results.
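The study's wrapper is not published here, so the following is only a minimal sketch of the two pieces such a wrapper needs: building the command line for one ab instance and collating the summaries from several instances. The function names are hypothetical. The flags used (`-t`, `-n`, `-c`, `-Z`) are standard ab options; the regular expressions assume ab's stock summary format. Launching the processes across hosts and synchronizing their start is left out.

```python
import re

# Lines from ab's summary output; the patterns assume the stock format.
RATE_RE = re.compile(r"^Requests per second:\s+([\d.]+)", re.MULTILINE)
LONGEST_RE = re.compile(r"^\s*100%\s+(\d+)\s+\(longest request\)", re.MULTILINE)


def build_ab_command(url, concurrency, seconds=120, cipher=None):
    """Build an argv list for one ab instance (no -k, so keep-alive stays off)."""
    # -t implies a 50000-request cap; a later -n raises it so the time limit governs.
    cmd = ["ab", "-t", str(seconds), "-n", "100000000", "-c", str(concurrency)]
    if cipher:
        cmd += ["-Z", cipher]  # restrict the TLS cipher suite offered by ab
    cmd.append(url)
    return cmd


def collate(outputs):
    """Sum per-instance rates and keep the worst latency (ms) across instances."""
    rates, longest = [], []
    for text in outputs:
        rate = RATE_RE.search(text)
        worst = LONGEST_RE.search(text)
        if rate is None or worst is None:
            raise ValueError("output does not look like an ab summary")
        rates.append(float(rate.group(1)))
        longest.append(int(worst.group(1)))
    return sum(rates), max(longest)
```

Summing the per-instance rates is a simplification that assumes all instances ran over the same time window, which is what a synchronization server is meant to ensure.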
The goal of the test was to determine the maximum load in connections per second that HAProxy could sustain over two minutes of repeated, incoming connection requests, and to compare the Xeon v3 optimized code against previous generation code that does not contain the enhancements. For this purpose, two versions of HAProxy were built: one against the optimized OpenSSL 1.0.2-beta3 release, and one against the unoptimized OpenSSL 1.0.1g release.
To eliminate as many outside variables as possible, all incoming requests to HAProxy were for its internal status page, as configured by the monitor-uri parameter in its configuration file. This meant HAProxy did not have to depend on any external servers, networks or processes to handle the client requests. This also resulted in very small page fetches so that the TLS handshake dominated the session time.
To further stress the server, the keep-alive function was left off in Apache Benchmark, forcing all requests to establish a new connection to the server and negotiate their own sessions.
The key exchange algorithms that were tested are given in Table 1.
Table 1. Selected key exchange algorithms
|Key Exchange|Certificate Type|
|ECDHE-ECDSA|ECC, NIST P-256|
Since the bulk encryption and cryptographic signing were not a significant part of the session, these were fixed at AES with a 128-bit key and SHA-1, respectively. Varying AES key size, AES encryption mode, or SHA hashing scheme would not have an impact on the results.
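As a rough guide to how the key exchanges discussed in this study map onto OpenSSL cipher suite strings with AES-128 and SHA-1 held fixed, consider the sketch below. The study does not list the exact suite strings or the AES mode used, so the CBC-mode names here are an assumption, and the helper function is hypothetical.

```python
# Key exchange name (as used in this study) -> OpenSSL cipher suite string.
# ASSUMPTION: CBC-mode suites; the study only fixes AES-128 and SHA-1.
CIPHER_SUITES = {
    "RSA": "AES128-SHA",
    "DHE-RSA": "DHE-RSA-AES128-SHA",
    "ECDHE-RSA": "ECDHE-RSA-AES128-SHA",
    "ECDHE-ECDSA": "ECDHE-ECDSA-AES128-SHA",
}


def certificate_type(key_exchange):
    """Return the certificate family the key exchange needs: 'ECC' or 'RSA'."""
    if key_exchange not in CIPHER_SUITES:
        raise KeyError(f"unknown key exchange: {key_exchange}")
    # Only the ECDSA-authenticated suite needs an elliptic curve certificate.
    return "ECC" if key_exchange.endswith("ECDSA") else "RSA"
```

Only the authentication half of each name determines the certificate: ECDHE-RSA still uses an RSA certificate, while ECDHE-ECDSA requires an ECC one.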
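Tests for each cipher were run for the following hardware configurations: two active cores (one per socket) and all 28 cores, each with Hyper-Threading off and with Hyper-Threading on.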
Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that HAProxy performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the performance of a system with many cores.
The all-core runs test the full capabilities of the system, show how well the performance scales to a many-core system, and also introduce the possibility of system resource limits beyond just CPU utilization.
HAProxy was configured to operate in multi-process mode, with one worker for each hardware thread on the system. Because ECDHE can be used with either RSA or ECDSA, it was configured to use one or the other as needed.
An excerpt from the configuration file, haproxy.conf, is shown in Figure 2.
global
    daemon
    pidfile /var/run/haproxy.pid
    user haproxy
    group haproxy
    crt-base /etc/haproxy/crt
    # Adjust to match the physical number of threads
    # including threads available via Hyper-Threading
    nbproc 56
    tune.ssl.default-dh-param 2048

defaults
    mode http
    timeout connect 10000ms
    timeout client 30000ms
    timeout server 30000ms

frontend http-in
    # Uncomment one or the other to choose your certificate type
    #bind :443 ssl crt rsa/combined-rsa.crt
    bind :443 ssl crt ecc/combined-ecc.crt
    monitor-uri /test
    default_backend servers
Figure 2. Excerpt from HAProxy configuration
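The `nbproc` value in the excerpt has to be adjusted by hand to match the thread count. As a small convenience, the value could be derived on the target machine; the helper below is a hypothetical sketch that relies on Python's `os.cpu_count()`, which reports logical CPUs and therefore includes Hyper-Threading threads when they are enabled.

```python
import os


def nbproc_line(thread_count=None):
    """Return an HAProxy 'nbproc' line for the given number of hardware threads."""
    if thread_count is None:
        # cpu_count() may return None when the count cannot be determined.
        thread_count = os.cpu_count() or 1
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    return f"nbproc {thread_count}"
```

On the test system with Hyper-Threading enabled, `nbproc_line(56)` yields the `nbproc 56` line shown in the excerpt.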
To support the large number of simultaneous connections, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:
* soft nofile 150000
* hard nofile 180000
Figure 3. Excerpt from /etc/security/limits.conf
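Changes to limits.conf only apply to new login sessions, so it can be worth confirming that the process launching HAProxy actually sees the raised limits. A minimal check, assuming a Unix-like system where Python's standard `resource` module is available (the helper name is hypothetical):

```python
import resource

WANT_SOFT = 150000  # soft nofile value from the limits.conf excerpt
WANT_HARD = 180000  # hard nofile value from the limits.conf excerpt


def at_least(limit, wanted):
    """True if a reported rlimit is unlimited or at least the wanted value."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} ok={at_least(soft, WANT_SOFT)}")
    print(f"hard={hard} ok={at_least(hard, WANT_HARD)}")
```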
And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
# Increase system IP port limits to allow for more connections
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1
# number of packets to keep in backlog before the kernel starts
# dropping them
net.ipv4.tcp_max_syn_backlog = 3240000
# increase socket listen backlog
net.ipv4.tcp_max_tw_buckets = 1440000
# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216
Figure 4. Excerpt from /etc/sysctl.conf
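To verify that settings like these are actually in effect on a running system, the file can be parsed and compared against the live values under /proc/sys. The sketch below uses hypothetical helper names and assumes the usual Linux convention that a key such as `net.ipv4.tcp_tw_reuse` maps to the path `/proc/sys/net/ipv4/tcp_tw_reuse`.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse sysctl.conf-style text into a {key: value} dict, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        # Collapse runs of whitespace so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path("/proc/sys") / key.replace(".", "/")


def live_value(key):
    """Return the running kernel's value for key, or None if it cannot be read."""
    try:
        return " ".join(proc_path(key).read_text().split())
    except OSError:
        return None


def mismatches(text):
    """Return {key: (wanted, live)} for settings that differ from the live system."""
    wanted = parse_sysctl(text)
    return {k: (v, live_value(k)) for k, v in wanted.items() if live_value(k) != v}
```

Calling `mismatches(Path("/etc/sysctl.conf").read_text())` after running `sysctl -p` should return an empty dict if every setting was applied.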
Some of these parameters are very aggressive, but the assumption is that this system is a dedicated load-balancer and SSL/TLS terminator.
No other adjustments were made to the stock Ubuntu 13.10 server image.
Results for the 2-core and 28-core runs follow. Because all tests were run on the same hardware, all performance improvements are due solely to the algorithmic optimizations for the Xeon v3 processor.
The raw two-core results are shown in Figure 5 and the performance deltas are in Figure 6.
Figure 5. HAProxy performance on a 2-core system with Hyper-Threading off
Figure 6. Performance gains due to Xeon v3 optimizations
These results show significant improvements across all ciphers tested, ranging from 26% to nearly 255%. They also make a compelling argument for using ECC ciphers on Xeon E5 v3 class systems: with a Xeon v3-enabled version of OpenSSL, just two cores are able to handle a staggering 6,500 connections per second using an ECDHE-ECDSA key exchange, which is more than 2.6x the performance of a straight RSA key exchange. ECDHE-RSA performs roughly on par with straight RSA. Both ECDHE ciphers offer the cryptographic property of perfect forward secrecy.
Figure 7. Performance gains from enabling Hyper-Threading
Gains from enabling Hyper-Threading were very modest. The optimized algorithms are structured to use the execution resources as much as possible, which does not leave much room for the additional threads.
The 28-core tests look at the scalability of these performance gains to a many-core deployment. In an ideal world the values would scale linearly with the core count such that a 28-core system would have 14x the performance of a 2-core system.
The raw results are shown in Figure 8 and Figure 9. The ECDHE-ECDSA handshake once again leads the pack, with HAProxy achieving over 38,000 connections per second. Of note, though, is that this maximum occurred with only a 67% average CPU utilization on the server, implying that the performance tests ran up against software or system resource limits that were not CPU-bound. For all other ciphers, the performance tests were able to achieve an average CPU utilization above 98%.
This is clearly shown in Table 2, where the performance scaling was between 8.0 and 8.8 for all handshake protocols except ECDHE-ECDSA, which was only 5.8.
Table 2. Performance scaling of the 28-core system over the 2-core system
Figure 8. HAProxy performance on a 28-core system with Hyper-Threading off
Figure 9. Performance gains due to Xeon v3 optimizations
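The scaling figures quoted above can be related to the ideal 14x with a little arithmetic. The sketch below uses only numbers stated in the text (roughly 6,500 and 38,000 connections per second for ECDHE-ECDSA, and scaling factors of 8.0, 8.8, and 5.8); the function names are hypothetical, and the throughput inputs are the rounded values from the prose rather than the raw measurements.

```python
def scaling_factor(rate_many_cores, rate_few_cores):
    """Throughput ratio between the larger and the smaller configuration."""
    return rate_many_cores / rate_few_cores


def scaling_efficiency(factor, cores_many=28, cores_few=2):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    ideal = cores_many / cores_few  # 14x for 28 cores versus 2 cores
    return factor / ideal


if __name__ == "__main__":
    ecdsa = scaling_factor(38000, 6500)
    print(f"ECDHE-ECDSA scaling: {ecdsa:.1f}x")
    for label, factor in (("low", 8.0), ("high", 8.8), ("ECDHE-ECDSA", 5.8)):
        print(f"{label}: {scaling_efficiency(factor):.0%} of linear")
```

Under these inputs the CPU-bound ciphers reach roughly 57% to 63% of linear scaling, while ECDHE-ECDSA reaches about 41%, consistent with its lower CPU utilization.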
Effects of Hyper-Threading were more dramatic in the 28-core case. While the percentage gains were a little higher for the RSA and ECDHE-RSA key exchanges, DHE-RSA and ECDHE-ECDSA actually saw a performance penalty, with the ECDHE-ECDSA loss being rather significant. This is likely related to the same software or resource limit that impacts the non-Hyper-Threading case. In this run, the ECDHE-ECDSA cipher was only able to achieve an average CPU utilization of 41%. For all other ciphers, average CPU utilization was above 95%.
Figure 10. Performance impact from enabling Hyper-Threading
The optimizations for the Xeon E5 v3 processor result in significant performance gains for the HAProxy load balancer using the selected ciphers. Each key exchange algorithm realized some benefit, ranging from 26% to nearly 255%. While these percentages are impressive, it is probably the absolute performance figures relative to a straight RSA key exchange that are of greatest interest.
While straight RSA benefits from the Xeon v3 optimizations, the elliptic curve Diffie-Hellman algorithms see much larger gains. ECDHE with an RSA certificate performs roughly on par with RSA, but with ECDHE and ECDSA signing, HAProxy can handle over 2.6 times as many connections per second.
Since ECDHE ciphers provide perfect forward secrecy, there is simply no reason for a Xeon E5 v3 server to use a straight RSA key exchange. The ECDHE ciphers not only offer this added security, but the ECDHE-ECDSA cipher adds significantly higher performance. This does come at the cost of added load on the client, but the key exchange in TLS only takes place at session setup time so this is not a significant burden for the client to bear.
On a massively parallel installation with Hyper-Threading enabled, HAProxy can maintain connection rates exceeding 53,000 connections/second using the ECDHE-ECDSA cipher, and do so without fully utilizing the CPU. This is on an out-of-the-box Linux distribution with only minimal system and kernel tuning, and it is conceivable that even higher connection rates could be achieved if the system could be optimized to remove the non-CPU bottlenecks. This task was beyond the scope of this study.