Overview of OpenSSL
What are SSL/TLS
What is OpenSSL
Goals of OpenSSL 1.0.2 Cryptographic Improvements
Key Components of OpenSSL 1.0.2
Applying Multi-Buffer to OpenSSL
System Configuration and Experimental Setup
Public Key Cryptography Results
The SSL/TLS protocols involve two compute-intensive cryptographic phases: session initiation and bulk data transfer. OpenSSL 1.0.2 introduces a comprehensive set of enhancements to cryptographic functions such as AES in different modes and the SHA-1, SHA-256, and SHA-512 hash functions (for bulk data transfer), as well as public key cryptography such as RSA, DSA, and ECC (for session initiation). The optimizations target Intel® Core™ processors and Intel® Atom™ processors running in 32-bit and 64-bit modes.
OpenSSL is one of the leading open-source implementations of cryptographic functions and the go-to library for applications requiring the SSL/TLS protocols. This paper presents results for the cryptographic functions commonly used during the SSL/TLS session initiation/handshake and bulk data transfer phases.
TLS (Transport Layer Security) and its predecessor, SSL (Secure Sockets Layer), are cryptographic protocols that are used to provide secure communications over networks.
These protocols allow applications to communicate over the network while preventing eavesdropping and tampering. That is, third parties cannot read the content being transferred and cannot modify that content without the receiver detecting it.
These protocols operate in two phases. In the first phase, a session is initiated. The server and client negotiate to select a cipher-suite for encryption and authentication and a shared secret key. In the second phase, the bulk data is transferred. The protocols use encryption of the data packets to ensure that third parties cannot read the contents of the data packets. They use a message authentication code (MAC), based on a cryptographic hash of the data, to ensure that the data is not modified in transit.
During session initiation, before a shared secret key has been generated, the client must communicate private messages to the server using a public key encryption method. The most popular such method is RSA, which is based on modular exponentiation. Modular exponentiation is a compute-intensive operation, which accounts for the majority of the session initiation cycles. A faster modular exponentiation implementation directly translates to a lower session initiation cost.
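The modular exponentiation at the heart of RSA can be illustrated with Python's built-in three-argument pow(). The tiny primes below are purely illustrative, chosen so the arithmetic is easy to follow; real RSA uses moduli of 1024 bits or more.

```python
# Toy RSA key setup and round trip via modular exponentiation.
# The small primes are illustrative only, NOT secure parameters.
p, q = 61, 53
n = p * q                   # public modulus
phi = (p - 1) * (q - 1)     # Euler totient of n
e = 17                      # public exponent, coprime to phi
d = pow(e, -1, phi)         # private exponent: modular inverse of e (Python 3.8+)

m = 65                      # message encoded as an integer < n
c = pow(m, e, n)            # encrypt: c = m^e mod n
assert pow(c, d, n) == m    # decrypt: m = c^d mod n
```

The server performs the private-key operation pow(c, d, n) on every handshake, which is why a faster modular exponentiation directly lowers the session initiation cost.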
Under SSL, the bulk data being transferred is broken into records with a maximum size of 16 KBytes (for SSLv3 and TLSv1).
Figure 1: SSL Computations of Cipher and MAC
A header is added, and a message authentication code (MAC) is computed over the header and data using a cryptographic hash function. The MAC is appended to the end of the message, and the message is padded. Then everything other than the header is encrypted with the chosen cipher.
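The record-formation steps above can be sketched as follows, using Python's standard hmac module. This is an illustrative outline of the MAC-then-pad layout only; the final encryption step is left as a comment, since the actual cipher depends on the negotiated suite.

```python
import hmac, hashlib

def build_record(header: bytes, data: bytes, mac_key: bytes,
                 block_size: int = 16) -> bytes:
    """Illustrative sketch of SSL/TLS record formation (MAC-then-encrypt).
    Returns the plaintext record layout; the cipher step is not applied."""
    # MAC is computed over the header and the data.
    mac = hmac.new(mac_key, header + data, hashlib.sha1).digest()
    body = data + mac
    # CBC-style padding: pad_len bytes, each holding the value pad_len - 1.
    pad_len = block_size - (len(body) % block_size)
    body += bytes([pad_len - 1]) * pad_len
    # In the real protocol, `body` is now encrypted with the chosen cipher;
    # only the header stays in the clear.
    return header + body
```

Note that every record passes through both the hash function and the cipher, which is what makes the encryption and authentication passes candidates for stitching.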
The key point here is that all of the bulk data buffers have two algorithms applied to them: encryption and authentication. In many cases, these two algorithms can be stitched together to increase the overall performance. Some cipher suites such as GCM define combined encryption+authentication modes; in these cases, stitching the computations is easier.
OpenSSL is an open-source implementation of the SSL and TLS protocols, used by many applications and large companies.
For these companies, the most interesting aspect of OpenSSL’s implementation is the number of connections that a server can handle (per second), as this translates directly to the number of servers needed to service their client base. The way to maximize the number of connections is to minimize the cost of each connection, which can be done by minimizing the cost of initiating a session and by minimizing the cost of transferring the data for that session.
Some of the OpenSSL Project’s goals for the cryptographic optimizations were:
- Augment the OpenSSL software architecture to support multi-buffer processing techniques to extract maximum performance from the processor’s SIMD architecture.
- Deliver market-leading SSL/TLS performance using highly optimized stitched algorithms.
- Extend SIMD utilization in the crypto space (e.g., Intel® Streaming SIMD Extensions (Intel® SSE)-based SHA2 implementations).
- Utilize Intel® Advanced Vector Extensions 2 (Intel® AVX2) for a wide range of crypto algorithms like RSA and SHA.
- Wherever possible, extract maximum algorithmic performance using the new instructions MULX, ADCX, ADOX, RORX, and RDSEED.
- Favor payloads smaller than ~1400 bytes in SSL/TLS payload-processing performance tradeoffs.
- Integrate all the functionality into the OpenSSL 1.0.x and future 1.1.x codelines in a manner that allows their automatic use by applications using existing OpenSSL interfaces, without any additional required initializations.
Some of the key cryptographic optimizations in OpenSSL 1.0.2 include:
- Multi-buffer support for AES [128|256] CBC encryption
- Multi-buffer support for [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2-BMI2]
- Single-buffer support for “Stitched” AES [128|256] CBC [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2]
- Single-buffer support for “stitched” AES [128|256] GCM
- Single-buffer SHA-1 performance enhancements utilizing Intel AVX2 and BMI2
- Single-buffer SHA-2 suite SHA[224|256|384|512] performance enhancements utilizing [Intel SSE | Intel AVX | Intel AVX2-BMI2] 
- RSA and DSA (Key size >= 1024) support using [legacy | MULX | ADCX – ADOX] instructions 
- ECC – ECDH and ECDSA [MULX | ADCX – ADOX]
- Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) new instructions 
The RSA/DSA/ECC optimizations target the session initiation phase; the rest improve the performance of the bulk data transfer phase. Multi-buffer implementations provide the largest speedup but are currently designed to work only for encryption flows.
Pairs of algorithms to implement via function stitching were chosen based on the most commonly used cipher-suites today and in the near future. For scenarios that cannot be covered with function stitching, the singular encryption or authentication functions were optimized.
Function stitching is a technique used to optimize two algorithms that typically run in combination yet sequentially, finely stitching their operations together to maximize compute resources. This section presents just a brief overview of stitching. A more detailed description can be found in the Function Stitching paper listed in the references.
Function stitching is the fine-grained interleaving of the instructions from each algorithm so that both algorithms are executed simultaneously. The advantage of doing this is that execution units that would otherwise be idle when executing a single algorithm (due to either data dependencies or instruction latencies) can be used to execute instructions from the other algorithm, and vice versa.
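The idea can be sketched conceptually: a stitched loop carries two independent dependency chains in a single loop body, which is what allows a superscalar core to overlap their instructions. The round functions below are trivial stand-ins, not real AES or SHA rounds, and Python itself gains nothing from the interleaving; the sketch only shows that the stitched form computes the same results as the two sequential passes.

```python
def step_cipher(s, x):
    return (s ^ (x * 0x9E3779B1)) & 0xFFFFFFFF   # stand-in "cipher round"

def step_hash(s, x):
    return (s * 31 + x) & 0xFFFFFFFF             # stand-in "hash round"

def sequential(blocks):
    """Two separate passes: cipher chain first, then hash chain."""
    c = h = 0
    for x in blocks:
        c = step_cipher(c, x)
    for x in blocks:
        h = step_hash(h, x)
    return c, h

def stitched(blocks):
    """One pass: both chains advance in the same loop body, so the two
    independent updates can occupy execution units concurrently."""
    c = h = 0
    for x in blocks:
        c = step_cipher(c, x)
        h = step_hash(h, x)
    return c, h
```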
Multi-buffer is an efficient method to process multiple independent data buffers in parallel for cryptographic algorithms, such as hashing and encryption. Processing multiple buffers at the same time can result in significant performance improvements—both for the case where the code can take advantage of SIMD (Intel SSE/Intel AVX) instructions (e.g., Intel SHA Extensions), and even in some cases where it can’t (e.g., AES CBC encrypt using Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI)).
Figure 2: Multi-Buffer Processing
Multi-buffer generally requires a scheduler that can process multiple data buffers of different sizes with minimal performance overhead; good solutions exist for this. Integrating multi-buffer into serially designed synchronous applications/frameworks, however, can be challenging and was one of the key problems when applying multi-buffer to OpenSSL. We solved it by breaking records up during encryption into smaller, equal-sized sub-records. This solution, however, does not apply to decryption flows.
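A minimal sketch of the sub-record idea, assuming a simple ceiling-division split into one chunk per SIMD lane (the sizing policy of OpenSSL's actual scheduler is more involved):

```python
def split_record(record: bytes, lanes: int = 8) -> list:
    """Break one record into near-equal sub-records so `lanes` independent
    buffers can be processed in parallel. Illustrative only: the last chunk
    may be shorter when the length is not a multiple of `lanes`."""
    size = -(-len(record) // lanes)   # ceiling division: per-lane chunk size
    return [record[i:i + size] for i in range(0, len(record), size)]
```

Each sub-record is then encrypted as an independent buffer, which is what exposes enough parallelism for the multi-buffer code paths even when the application submits one record at a time.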
The performance results provided in this section were measured on 3 Intel Core processors and 2 Intel Atom processors. The systems were:
- Intel® Core™ i7-3770 processor @ 3.4 GHz (codenamed Ivy Bridge (IVB))
- Intel® Core™ i5-4250U processor @ 1.30 GHz (codenamed Haswell (HSW))
- Intel® Core™ i5-5200U processor @ 2.20 GHz (codenamed Broadwell (BDW))
- Intel® Atom™ processor N450 @ 1.66 GHz (codenamed Bonnell (BNL))
- Intel® Atom™ processor N2810 @ 2.00 GHz (codenamed Silvermont (SLM))
The tests were run on a single core with Intel® Turbo Boost Technology off and, for the three Intel Core processors, with Intel® Hyper-Threading Technology (Intel® HT Technology) on. Note that the Intel Core i5-5200U processor defaulted to "power saving" mode at boot and ran at 800 MHz for these tests. However, all test results are given in cycles in order to provide an accurate representation of each microarchitecture’s capabilities and to eliminate any frequency discrepancies.
The OpenSSL ‘speed’ benchmark was used for the performance tests. Some example command lines are:
./bin/64/openssl speed -evp aes-128-gcm
./bin/64/openssl speed -decrypt -evp aes-128-gcm
./bin/64/openssl speed -evp aes-128-cbc-hmac-sha1
./bin/64/openssl speed -decrypt -evp aes-128-cbc-hmac-sha1
./bin/64/openssl speed -mb -evp aes-128-cbc-hmac-sha1
Note that the “-mb” switch is new and has been added to enable running multi-buffer performance tests.
Results are normalized and in most cases converted to ‘Cycles per Byte’ (CPB) of processed data. CPB is the standard metric for cryptographic algorithm efficiency.
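For reference, converting a measured throughput into CPB is straightforward. The numbers in the example below are illustrative round figures, not results from this paper's runs:

```python
def cycles_per_byte(throughput_mb_s: float, freq_ghz: float) -> float:
    """Convert a measured throughput (MB/s) at a given core frequency (GHz)
    into Cycles per Byte. Lower is better."""
    bytes_per_second = throughput_mb_s * 1e6
    cycles_per_second = freq_ghz * 1e9
    return cycles_per_second / bytes_per_second

# e.g., 3400 MB/s sustained on a 3.4 GHz core works out to 1.0 CPB
assert abs(cycles_per_byte(3400, 3.4) - 1.0) < 1e-9
```

Because CPB divides out the clock frequency, it lets results from the 800 MHz Broadwell runs be compared directly against the other systems.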
The following graphs show the performance for 32 and 64-bit code.
Note: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, Go to: http://www.intel.com/performance/resources/benchmark_limitations.htm
Figure 3: AES Encrypt (Intel® Core™ processors)
AES-CBC encryption gains from IVB to HSW are due to a 1-cycle latency reduction in the AESENC[LAST] and AESDEC[LAST] instructions.
AES-GCM performance gains from IVB to HSW are due to Intel AVX and PCLMULQDQ microarchitecture enhancements; gains from HSW to BDW are due to further PCLMULQDQ microarchitecture enhancements.
Figure 4: AES Decrypt (Intel® Core™ processors)
Most of the popular AES decrypt modes are throughput limited rather than latency limited. We implemented parallel AES-CBC decrypt, processing 6 blocks at a time.
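The reason CBC decrypt parallelizes is visible in its data flow: each block-cipher invocation D(C[i]) is independent, and only the final XOR consumes the previous ciphertext block, whereas CBC encrypt must feed each output into the next input. The sketch below uses a trivial XOR stand-in for the block cipher (not AES) purely to keep the dependency structure self-contained and checkable.

```python
KEY = 0xA5  # toy key for the stand-in "block cipher"

def E(block):
    return bytes(b ^ KEY for b in block)   # stand-in encrypt, NOT AES

def D(block):
    return bytes(b ^ KEY for b in block)   # stand-in decrypt, NOT AES

def cbc_encrypt(blocks, iv):
    """Inherently serial: each E() depends on the previous ciphertext."""
    out, prev = [], iv
    for p in blocks:
        prev = E(bytes(a ^ b for a, b in zip(p, prev)))
        out.append(prev)
    return out

def cbc_decrypt(blocks, iv):
    """Parallelizable: every D() call is independent (e.g., 6 at a time);
    only the cheap final XOR uses the prior ciphertext block."""
    decs = [D(c) for c in blocks]          # independent block-cipher calls
    prevs = [iv] + blocks[:-1]
    return [bytes(a ^ b for a, b in zip(d, pv)) for d, pv in zip(decs, prevs)]
```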
Figure 5: AES Encrypt (Intel® Atom™ processors)
SLM introduces the AES and PCLMULQDQ instructions, resulting in a huge speedup for both CBC and GCM modes.
Figure 6: AES Decrypt (Intel® Atom™ processors)
Figure 7: Public Key Cryptography (Intel® Core™ processors)
IVB gains on RSA are due to algorithmic optimizations.
HSW RSA2048 is a special case where some of the gain is due to an Intel AVX2 implementation. All the rest of the gains are due to scalar code tuning/algorithmic improvements.
On BDW the addition of MULX/ADOX/ADCX (LIA instructions) results in large performance gains over HSW.
We added generic code in the Montgomery multiply function so it scales across all RSA sizes, DSA, DH, and ECDH.
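Montgomery multiplication replaces the expensive division in modular reduction with shifts and masks by working in a residue representation. A textbook sketch of the reduction step (REDC) is shown below for orientation; OpenSSL's implementation is word-sliced assembly exploiting MULX/ADCX/ADOX, not this big-integer version.

```python
def mont_redc(t, n, n_prime, r_bits):
    """Montgomery reduction: returns t * R^-1 mod n without any division,
    where R = 2**r_bits and n_prime = -n^-1 mod R. Requires t < n*R."""
    r_mask = (1 << r_bits) - 1
    m = ((t & r_mask) * n_prime) & r_mask   # make t + m*n divisible by R
    u = (t + m * n) >> r_bits               # exact shift, no remainder
    return u - n if u >= n else u           # single conditional subtract

# Tiny worked example (illustrative modulus, not an RSA-sized one).
n = 97
r_bits = 8                                  # R = 256 > n
R = 1 << r_bits
n_prime = (-pow(n, -1, R)) % R

a, b = 42, 17
aR, bR = (a * R) % n, (b * R) % n           # enter the Montgomery domain
abR = mont_redc(aR * bR, n, n_prime, r_bits)   # = a*b*R mod n
assert mont_redc(abR, n, n_prime, r_bits) == (a * b) % n   # leave the domain
```

Because the same reduction core serves any modulus size, one generic routine can back RSA of all key sizes as well as DSA, DH, and ECDH, as described above.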
Figure 8: Public Key Cryptography (Intel® Atom™ processors)
On SLM, architectural scalar improvements are due to out-of-order execution.
Figure 9: AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte
Figure 10: AES-CBC-HMAC-SHA (Decrypt) Cycles/Byte
IVB to HSW performance gains are due to Intel AVX2 code.
AES instruction latency improvements do not yield much performance gains in the case of Decrypt, as the results become SHA bound.
The stitched ciphers are only available in 64-bit implementations due to the expanded register set. In v1.0.1, stitched ciphers supported only encrypt.
Figure 11: Multi-Buffer AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte
Figure 12: Multi-Buffer Speedup over Stitched
Vinodh Gopal, Sean Gulley, and Wajdi Feghali are architects in the Data Center Group, specializing in software and hardware features relating to cryptography and compression.
Ilya Albrekht and Dan Zimmerman are Application Engineers driving enabling and performance optimization efforts for cryptographic projects and libraries.
We thank Andy Polyakov and Steve Marquess of the OpenSSL Software Foundation and Max Locktyukhin, John Mechalas, and Shay Gueron from Intel for their contributions.
This paper illustrates the goals and main features in OpenSSL 1.0.2 for improved cryptographic performance. By leveraging architectural features in the processors such as SIMD and new instructions, and combining innovative software techniques such as function stitching and Multi-Buffer, large performance gains are possible (e.g., ~3X for Multi-Buffer).
- OpenSSL: http://www.openssl.org/
- The TLS Protocol: http://www.ietf.org/rfc/rfc2246.txt
- Fast Cryptographic Computation on Intel® Architecture Processors via Function Stitching: http://www.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/cryptographic-computation-architecture-function-stitching-paper.html
- Processing Multiple Buffers in Parallel to Increase Performance on Intel® Architecture Processors: http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
- Fast SHA-256 Implementations on Intel® Architecture Processors: https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html
- New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors
- Intel® SHA Extensions New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors