Improving OpenSSL Performance

Contents

Abstract
Overview of OpenSSL
      What are SSL and TLS
      What is OpenSSL
      Goals of OpenSSL 1.0.2 Cryptographic Improvements
Key Components of OpenSSL 1.0.2
      Function Stitching
      Applying Multi-Buffer to OpenSSL
      System Configuration and Test Setup
      Speed tests
Performance
      AES Results
      Public Key Cryptography Results
      Stitching Results
      Multi-Buffer Results
Authors
Contributors
Conclusion
References

 

Abstract 


The SSL/TLS protocols involve two compute-intensive cryptographic phases: session initiation and bulk data transfer. OpenSSL 1.0.2 introduces a comprehensive set of enhancements to cryptographic functions such as AES in different modes and the SHA-1, SHA-256, and SHA-512 hash functions (for bulk data transfer), as well as public key cryptography such as RSA, DSA, and ECC (for session initiation). The optimizations target Intel® Core™ processors and Intel® Atom™ processors running in 32-bit and 64-bit modes.

OpenSSL [1] is one of the leading open source implementations of cryptographic functions and the go-to library for applications requiring the SSL/TLS [2] protocols. This paper presents results for the cryptographic functions commonly used during the SSL/TLS session initiation/handshake and bulk data transfer phases.

 

Overview of OpenSSL 


What are SSL and TLS 

TLS (Transport Layer Security) [2] and its predecessor, SSL (Secure Sockets Layer), are cryptographic protocols that are used to provide secure communications over networks.

These protocols allow applications to communicate over the network while preventing eavesdropping and tampering. That is, third parties cannot read the content being transferred and cannot modify that content without the receiver detecting it.

These protocols operate in two phases. In the first phase, a session is initiated. The server and client negotiate to select a cipher suite for encryption and authentication and a shared secret key. In the second phase, the bulk data is transferred. The protocols use encryption of the data packets to ensure that third parties cannot read the contents of the data packets. They use a message authentication code (MAC), based on a cryptographic hash of the data, to ensure that the data is not modified in transit.

During session initiation, before a shared secret key has been generated, the client must communicate private messages to the server using a public key encryption method. The most popular such method is RSA, which is based on modular exponentiation. Modular exponentiation is a compute-intensive operation, which accounts for the majority of the session initiation cycles. A faster modular exponentiation implementation directly translates to a lower session initiation cost.
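To make that cost concrete, the toy sketch below (not OpenSSL code) shows the classic square-and-multiply method: one modular squaring per exponent bit, plus a modular multiply for each set bit. It uses 64-bit operands and a GCC/Clang 128-bit intermediate purely for illustration; real RSA operates on 1024- to 4096-bit integers with Montgomery arithmetic, which is what makes session initiation expensive.

#include <stdint.h>

/* Toy square-and-multiply modular exponentiation: returns base^exp mod m.
 * Illustrative only; real RSA uses multi-thousand-bit operands and
 * Montgomery multiplication. Requires a compiler with __int128 support. */
static uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t m)
{
    unsigned __int128 result = 1;
    unsigned __int128 b = base % m;

    while (exp) {
        if (exp & 1)
            result = (result * b) % m;   /* multiply step for each set bit */
        b = (b * b) % m;                 /* squaring step for every bit    */
        exp >>= 1;
    }
    return (uint64_t)result;
}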

Under SSL, the bulk data being transferred is broken into records with a maximum size of 16 KB (for SSLv3 and TLSv1).

Figure 1: SSL Computations of Cipher and MAC

A header is added, and a message authentication code (MAC) is computed over the header and data using a cryptographic hash function. The MAC is appended to the end of the message, and the message is padded. Then everything other than the header is encrypted with the chosen cipher.

The key point here is that all of the bulk data buffers have two algorithms applied to them: encryption and authentication. In many cases, these two algorithms can be stitched [3] together to increase the overall performance. Some cipher suites, such as those based on AES-GCM, define combined encryption+authentication modes; in these cases, stitching the computations is easier.

What is OpenSSL 

OpenSSL [1] is an open-source implementation of the SSL and TLS protocols, used by many applications and large companies.

For these companies, the most interesting aspect of OpenSSL’s implementation is the number of connections that a server can handle (per second), as this translates directly to the number of servers needed to service their client base. The way to maximize the number of connections is to minimize the cost of each connection, which can be done by minimizing the cost of initiating a session and by minimizing the cost of transferring the data for that session.

Goals of OpenSSL 1.0.2 Cryptographic Improvements 

Some of the OpenSSL Project’s goals for the cryptographic optimizations were:

  1. Augment the OpenSSL software architecture to support multi-buffer processing techniques to extract maximum performance from the processor’s SIMD architecture.
  2. Deliver market-leading SSL/TLS performance using highly optimized stitched algorithms.
  3. Extend SIMD utilization in the crypto space (e.g., Intel® Streaming SIMD Extensions (Intel® SSE)-based SHA2 implementations).
  4. Utilize Intel® Advanced Vector Extensions 2 (Intel® AVX2) for a wide range of crypto algorithms like RSA and SHA.
  5. Wherever possible, extract maximum algorithmic performance using the new instructions MULX, ADCX, ADOX, RORX, and RDSEED.
  6. SSL/TLS payload processing performance tradeoffs should favor payloads that are less than ~1400 bytes.
  7. Integrate all the functionality into the OpenSSL 1.0.x and future 1.1.x codelines in a manner that allows its automatic use by applications through existing OpenSSL interfaces, without any additional required initializations (see the sketch after this list).
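As an illustration of the last point, the sketch below is hypothetical application code (not taken from OpenSSL) that uses only the standard EVP envelope interface. An application written this way automatically picks up the optimized AES-NI-based implementation on processors that support it when linked against OpenSSL 1.0.2, with no crypto-specific initialization calls.

#include <openssl/evp.h>

/* Encrypt a buffer with AES-128-CBC through the generic EVP interface.
 * OpenSSL selects the optimized implementation internally.
 * 'out' must have room for in_len plus one cipher block of padding. */
int encrypt_buffer(const unsigned char key[16], const unsigned char iv[16],
                   const unsigned char *in, int in_len,
                   unsigned char *out, int *out_len)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len = 0, total = 0;

    if (!ctx)
        return 0;
    if (!EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv))
        goto err;
    if (!EVP_EncryptUpdate(ctx, out, &len, in, in_len))
        goto err;
    total = len;
    if (!EVP_EncryptFinal_ex(ctx, out + total, &len))
        goto err;
    *out_len = total + len;
    EVP_CIPHER_CTX_free(ctx);
    return 1;
err:
    EVP_CIPHER_CTX_free(ctx);
    return 0;
}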

 

Key Components of OpenSSL 1.0.2 


Some of the key cryptographic optimizations in OpenSSL 1.0.2 include:

  • Multi-buffer [4] support for AES [128|256] CBC encryption
  • Multi-buffer support for [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2-BMI2]
  • Single-buffer support for “Stitched” AES [128|256] CBC [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2]
    • AES-128-CBC-Encrypt-SHA-1-AVX2-BMI2
    • AES-256-CBC-Encrypt-SHA-1-AVX2-BMI2
    • AES-128-CBC-Encrypt-SHA-256-SSE
    • AES-256-CBC-Encrypt-SHA-256-SSE
    • AES-128-CBC-Encrypt-SHA-256-AVX
    • AES-256-CBC-Encrypt-SHA-256-AVX
    • AES-128-CBC-Encrypt-SHA-256-AVX2-BMI2
    • AES-256-CBC-Encrypt-SHA-256-AVX2-BMI2
  • Single-buffer support for “stitched” AES [128|256] GCM
  • Single-buffer SHA-1 performance enhancements utilizing Intel AVX2 and BMI2
  • Single-buffer SHA-2 suite SHA[224|256|384|512] performance enhancements utilizing [Intel SSE | Intel AVX | Intel AVX2-BMI2] [5]
  • RSA and DSA (Key size >= 1024) support using [legacy | MULX | ADCX – ADOX] instructions [6]
  • ECC – ECDH and ECDSA [MULX | ADCX – ADOX]
  • Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) new instructions [7]

The RSA/DSA/ECC optimizations target the session initiation phase; the rest improve the performance of the bulk data transfer phase. Multi-buffer implementations provide the largest speedup but are currently designed to work only for encryption flows.

Pairs of algorithms to implement via function stitching were chosen based on the cipher suites most commonly used today and expected in the near future. For scenarios that cannot be covered with function stitching, the individual encryption or authentication functions were optimized.

Function Stitching 

Function stitching is a technique for optimizing two algorithms that typically run in combination but sequentially, by finely interleaving their operations to make maximum use of the processor's compute resources. This section presents just a brief overview of stitching; a more detailed description can be found in [3].

Function stitching is the fine-grained interleaving of the instructions from each algorithm so that both algorithms are executed simultaneously. The advantage of doing this is that execution units that would otherwise be idle when executing a single algorithm (due to either data dependencies or instruction latencies) can be used to execute instructions from the other algorithm, and vice versa [3].
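The toy sketch below conveys the idea in C (the real OpenSSL routines are hand-tuned assembly, and the stand-in "cipher" and "MAC" chains here are invented for illustration): two independent dependency chains are interleaved in a single loop, so the out-of-order core can overlap the latency of one chain with useful work from the other.

#include <stddef.h>
#include <stdint.h>

/* Toy illustration of function stitching. The multiply-xor chain stands in
 * for a cipher and the rotate-add chain for a MAC; because the two chains
 * share no data, interleaving them in one loop lets the core execute them
 * concurrently, whereas running two separate loops would leave execution
 * units idle on each chain's latency. */
void stitched_toy(const uint64_t *buf, size_t n,
                  uint64_t *cipher_state, uint64_t *mac_state)
{
    uint64_t c = *cipher_state, m = *mac_state;

    for (size_t i = 0; i < n; i++) {
        uint64_t w = buf[i];
        c = (c ^ w) * 0x9E3779B97F4A7C15ULL;   /* "cipher" dependency chain */
        m = ((m << 13) | (m >> 51)) + w;       /* "MAC" dependency chain    */
    }
    *cipher_state = c;
    *mac_state = m;
}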

Applying Multi-Buffer to OpenSSL 

Multi-buffer [4] is an efficient method to process multiple independent data buffers in parallel for cryptographic algorithms, such as hashing and encryption. Processing multiple buffers at the same time can result in significant performance improvements—both for the case where the code can take advantage of SIMD (Intel SSE/Intel AVX) instructions (e.g., Intel SHA Extensions), and even in some cases where it can’t (e.g., AES CBC encrypt using Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI)).
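The following toy sketch (illustrative only; the real implementations are hand-written SIMD assembly) shows the multi-buffer idea: the same round function is applied to several independent buffers in lockstep, so one vector instruction can advance all lanes at once.

#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* e.g., four 32-bit lanes in one SSE register */

/* Toy multi-buffer processing: each of LANES independent buffers advances
 * through the same (invented) round function in lockstep, so the inner
 * loop maps naturally onto SIMD lanes. */
void multibuffer_toy(const uint32_t *bufs[LANES], size_t words,
                     uint32_t state[LANES])
{
    for (size_t w = 0; w < words; w++)
        for (int lane = 0; lane < LANES; lane++)   /* vectorizable across lanes */
            state[lane] = (state[lane] ^ bufs[lane][w]) * 2654435761u;
}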

Figure 2: Multi-Buffer Processing

Multi-buffer generally requires a scheduler that can process multiple data buffers of different sizes with minimal performance overhead, for which we have found good solutions. Integrating multi-buffer into serially designed synchronous applications/frameworks, however, can be challenging and was one of the key problems in applying multi-buffer to OpenSSL. We solved it by breaking up records during encryption into smaller, equal-sized sub-records. This solution, however, does not apply to decryption flows.
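A simplified sketch of the record-splitting idea is shown below; the lane count and function name are hypothetical, and the real OpenSSL code must also produce a separate header, IV, padding, and MAC for each sub-record.

#include <stddef.h>

#define MB_LANES 4  /* hypothetical number of parallel multi-buffer lanes */

/* Hypothetical sketch: divide one large payload into MB_LANES roughly
 * equal sub-records so each lane of a multi-buffer implementation gets an
 * independent buffer to process. */
static void split_into_subrecords(size_t payload_len, size_t sub_len[MB_LANES])
{
    size_t base = payload_len / MB_LANES;
    size_t rem  = payload_len % MB_LANES;

    for (int i = 0; i < MB_LANES; i++)
        sub_len[i] = base + (i < (int)rem ? 1 : 0);
}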

 

System Configuration and Test Setup 


The performance results provided in this section were measured on 3 Intel Core processors and 2 Intel Atom processors. The systems were:

  1. Intel® Core™ i7-3770 processor @ 3.4 GHz         (codenamed Ivy Bridge (IVB))
  2. Intel® Core™ i5-4250U processor @ 1.30 GHz      (codenamed Haswell (HSW))
  3. Intel® Core™ i5-5200U processor @ 2.20 GHz      (codenamed Broadwell (BDW))
  4. Intel® Atom™ processor N450 @ 1.66GHz           (codenamed Bonnell (BNL))
  5. Intel® Atom™ processor N2810 @ 2.00GHz         (codenamed Silvermont (SLM))

The tests were run on a single core with Intel® Turbo Boost Technology off, and with Intel® Hyper-Threading Technology (Intel® HT Technology) on for the three Intel Core processors. Note that the Intel Core i5-5200U processor was defaulting to "power saving" mode at boot and was running at 800 MHz for these tests. However, all test results are given in terms of cycles in order to provide an accurate representation of the microarchitecture’s capabilities and to eliminate any frequency discrepancies.

Speed tests 

The OpenSSL 'speed' benchmark was used for the performance tests. Some example command lines are:

./bin/64/openssl speed -evp aes-128-gcm

./bin/64/openssl speed -decrypt -evp aes-128-gcm

./bin/64/openssl speed -evp aes-128-cbc-hmac-sha1

./bin/64/openssl speed -decrypt -evp aes-128-cbc-hmac-sha1

./bin/64/openssl speed -mb -evp aes-128-cbc-hmac-sha1

Note that the “-mb” switch is new and has been added to enable running multi-buffer performance tests.

 

Performance 


Results are normalized and in most cases converted to 'Cycles per Byte' (CPB) of processed data. CPB is the standard metric for cryptographic algorithm efficiency.
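As a rough guide to interpreting the numbers (a back-of-the-envelope conversion, not a measured result): single-core throughput is approximately the core frequency divided by the CPB figure, so a routine running at 1 CPB on a 3.4 GHz core processes on the order of 3.4 GB of data per second.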

The following graphs show the performance for 32-bit and 64-bit code.

Note: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm

AES Results 

Figure 3: AES Encrypt (Intel® Core™ processors)

AES-CBC encryption gains from IVB to HSW are due to a 1-cycle latency reduction in the AESENC[LAST] and AESDEC[LAST] instructions.

AES-GCM performance gains from IVB to HSW are due to Intel AVX and PCLMULQDQ microarchitecture enhancements; the gains from HSW to BDW are due to further PCLMULQDQ microarchitecture enhancements.

Figure 4: AES Decrypt (Intel® Core™ processors)

Most of the popular AES decrypt modes are throughput-limited rather than latency-limited. We implemented parallel AES-CBC decryption that processes six blocks at a time.
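The reason CBC decryption parallelizes, unlike CBC encryption, is that each plaintext block depends only on two ciphertext blocks: pt[i] = Decrypt(ct[i]) XOR ct[i-1]. The portable sketch below shows that dependence structure using OpenSSL's low-level AES_decrypt() call rather than the optimized AES-NI path; the optimized code exploits the same independence by keeping six AESDEC pipelines in flight. The caller is assumed to have prepared the decryption key schedule with AES_set_decrypt_key().

#include <stddef.h>
#include <openssl/aes.h>

/* Portable CBC-decrypt sketch: the block decryptions are mutually
 * independent, and the chaining step only reads the previous ciphertext
 * block, so independent blocks can be decrypted in flight simultaneously. */
void cbc_decrypt_sketch(const unsigned char *ct, unsigned char *pt,
                        size_t nblocks, const unsigned char iv[16],
                        const AES_KEY *dkey)
{
    unsigned char tmp[16];

    for (size_t i = 0; i < nblocks; i++) {
        const unsigned char *prev = i ? ct + (i - 1) * 16 : iv;

        AES_decrypt(ct + i * 16, tmp, dkey);     /* independent per block */
        for (int j = 0; j < 16; j++)
            pt[i * 16 + j] = tmp[j] ^ prev[j];   /* chaining is read-only */
    }
}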

Figure 5: AES Encrypt (Intel® Atom™ processors)

SLM introduces the AES-NI and PCLMULQDQ instructions, resulting in a large speedup for both CBC and GCM modes.

Figure 6: AES Decrypt (Intel® Atom™ processors)

Public Key Cryptography Results 

Figure 7: Public Key Cryptography (Intel® Core™ processors)

IVB gains on RSA are due to algorithmic optimizations.

HSW RSA2048 is a special case where some of the gain is due to an Intel AVX2 implementation. All the rest of the gains are due to scalar code tuning/algorithmic improvements.

On BDW, the addition of the MULX, ADCX, and ADOX (large integer arithmetic, or LIA) instructions results in large performance gains over HSW.

We added generic code in the Montgomery multiply function so that it scales across all RSA sizes, DSA, DH, and ECDH.
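For readers unfamiliar with Montgomery multiplication, the toy single-word sketch below shows the reduction idea. It is not the multi-word OpenSSL implementation (which, on BDW, uses MULX/ADCX/ADOX to run two carry chains in parallel); it assumes an odd modulus below 2^63, a GCC/Clang 128-bit integer type, and a precomputed n0inv = -n^-1 mod 2^64.

#include <stdint.h>

/* Toy single-word Montgomery multiplication: returns a*b*R^-1 mod n,
 * with R = 2^64 and a, b < n. The final sum is divisible by R by
 * construction, and n < 2^63 keeps the 128-bit intermediate from
 * overflowing, so one conditional subtraction suffices. */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t n0inv)
{
    unsigned __int128 t = (unsigned __int128)a * b;
    uint64_t m = (uint64_t)t * n0inv;                   /* t*(-n^-1) mod R */
    unsigned __int128 u = (t + (unsigned __int128)m * n) >> 64;

    return (u >= n) ? (uint64_t)(u - n) : (uint64_t)u;
}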

Figure 8: Public Key Cryptography (Intel® Atom™ processors), 32-bit and 64-bit

On SLM, the scalar performance improvements are due to the architecture's out-of-order execution.

Stitching Results 

Figure 9: AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte

Figure 10: AES-CBC-HMAC-SHA (Decrypt) Cycles/Byte

IVB to HSW performance gains are due to Intel AVX2 code.

AES instruction latency improvements do not yield much performance gain in the case of decryption, as the results become SHA-bound.

The stitched ciphers are available only in 64-bit implementations due to the expanded register set. In OpenSSL 1.0.1, stitched ciphers supported only encryption.

Multi-Buffer Results 

Figure 11: Multi-Buffer AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte

Figure 12: Multi-Buffer Speedup over Stitched

 

Authors 


Vinodh Gopal, Sean Gulley, and Wajdi Feghali are architects in the Data Center Group, specializing in software and hardware features relating to cryptography and compression.

Ilya Albrekht and Dan Zimmerman are Application Engineers driving enabling and performance optimization efforts for cryptographic projects and libraries.

Contributors 


We thank Andy Polyakov and Steve Marquess of the OpenSSL Software Foundation and Max Locktyukhin, John Mechalas, and Shay Gueron from Intel for their contributions.

Conclusion 


This paper illustrates the goals and main features in OpenSSL 1.0.2 for improved cryptographic performance. By leveraging architectural features in the processors such as SIMD and new instructions, and combining innovative software techniques such as function stitching and Multi-Buffer, large performance gains are possible (e.g., ~3X for Multi-Buffer).

References 


[1] OpenSSL: http://www.openssl.org/

[2] The TLS Protocol http://www.ietf.org/rfc/rfc2246.txt

[3] Fast Cryptographic computation on Intel® Architecture processors via Function Stitching http://www.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/cryptographic-computation-architecture-function-stitching-paper.html

[4] Processing Multiple Buffers in Parallel to Increase Performance on Intel® Architecture Processors - http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html

[5] Fast SHA-256 Implementations on Intel® Architecture Processors - https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html

[6] New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

[7] Intel® SHA Extensions New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors

https://software.intel.com/en-us/articles/intel-sha-extensions 
