Intel® AES-NI Performance Enhancements: HyTrust DataControl Case Study

 

Intel® AES-NI Performance Enhancements on 4th Gen Intel® Processors

 

Version 1.8
September 4th, 2014
Steve Pate, Kelvin Pryse
spate@hytrust.com
kpryse@hytrust.com

Executive Summary

Intel proposed a new set of instructions for encryption and decryption, Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI), in 2008, and has continued to enhance AES-NI performance across its processor families since then. With the introduction of the 4th generation Intel® Xeon® processor (code name Haswell), Intel has raised the bar again, particularly for AES-XTS. The latency of the AES instructions has been reduced from 8 cycles on the 2nd gen and 3rd gen Intel® Xeon® microarchitectures (code names Sandy Bridge and Ivy Bridge) to 7 cycles on Haswell, a 12.5% reduction in latency, or roughly a 14% speedup for a dependent chain of AES instructions. The reduction in latency helps serial modes of AES operation, such as CBC encryption. Additionally, throughput has been improved by reducing the number of micro-ops, which helps parallel modes of AES operation.

HyTrust performed a number of tests with their DataControl encryption and key management solution on both 3rd gen Intel® Xeon® processor (code name Ivy Bridge) and Haswell systems running Red Hat Enterprise Linux* and verified the performance gains expected with Haswell over the Ivy Bridge processors.

In this paper we discuss our test results and why device-level encryption offers the best performance within the operating system stack. We also discuss the use of Intel's random number generation instruction (RDRAND) and how it works on virtualized systems. Finally, we talk about the effects of operating system and application caching and what choices organizations can make regarding key sizes with respect to security and performance.

An Overview of HyTrust and HyTrust DataControl

HyTrust solves real-world problems for customers across a wide spectrum of industry verticals: retailers striving to achieve PCI compliance, government agencies seeking to protect confidential records, multinational conglomerates running massive multi-tenant private clouds, and more. One of HyTrust's products is HyTrust DataControl, which provides encryption and key management for virtualized and cloud infrastructures. HyTrust DataControl supports all hypervisor platforms and all IaaS (Infrastructure as a Service) platforms, such as Amazon EC2* and Microsoft Azure*. As shown in Figure 1, typical deployments use HyTrust DataControl in the public cloud while customers retain control of the encryption keys by holding HyTrust KeyControl in their own data center.

Figure 1 - HyTrust KeyControl and DataControl

HyTrust KeyControl/DataControl's benefits include:

  • Highly-available (active-active) key management cluster
  • Key management in the data center or in the cloud
  • Transparent encryption so no changes to applications
  • Support for secure disk migration and cloned VMs for backup/replication
  • Dynamic rekey, which significantly reduces downtime
  • Encrypt all OS partitions including root and swap
  • Support for physical servers, virtual servers, and all hypervisor platforms
  • Support for all IaaS cloud platforms
  • Support for data geo-fencing using Intel® Trusted Execution Technology (Intel® TXT)

Key Management

Anyone familiar with symmetric encryption knows that it has an encryption algorithm (such as AES, Blowfish, or 3DES) and an encryption key. Encryption keys come in different sizes. For example, AES keys can be 128, 192, or 256 bits long; XTS mode, described below, combines two AES keys, which is why XTS key sizes of 256 and 512 bits appear later in this paper. There has been much discussion as to whether larger key sizes should be used. But are larger key sizes necessary, and what is the tradeoff in moving to them? In this section, we describe the options available.

A Note on AES-CBC Mode versus AES-XTS Mode

In our experiments we used the following block cipher modes of AES:

  • CBC – Cipher-Block Chaining. CBC is a common encryption mode in which the previous block's cipher-text is xor'ed with the current block's plaintext before encryption takes place. CBC has been around for several decades.
  • XTS – XEX-based tweaked codebook mode with cipher-text stealing. XTS-AES was designed specifically for encrypting data on hard disks and was approved by NIST in 2010. It is supported by many open source encryption solutions. XTS-AES uses two different keys, typically by splitting the symmetric key in half. Thus, if you want AES 256 and AES 128 encryption, you need to choose XTS key sizes of 512 bits and 256 bits, respectively.

For disk encryption, such as is deployed in HyTrust DataControl, XTS-AES mode is the preferred option due to a larger gain in performance.
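
The chaining described for CBC above is what makes CBC encryption a serial operation, while CBC decryption can proceed block by block. The following is a minimal Python sketch of that data dependency. It assumes a toy stand-in for the AES block cipher (`toy_block_encrypt` and `toy_block_decrypt` are hypothetical helpers, are not secure, and are not part of HyTrust DataControl) so that it runs with the standard library alone.

```python
BLOCK = 16  # AES block size in bytes


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Hypothetical stand-in for AES: XOR with the key, then reverse the bytes.
    # NOT secure; it only keeps the sketch free of third-party dependencies.
    return xor(block, key)[::-1]


def toy_block_decrypt(key: bytes, block: bytes) -> bytes:
    # Inverse of toy_block_encrypt.
    return xor(block[::-1], key)


def split_blocks(data: bytes) -> list[bytes]:
    if len(data) % BLOCK:
        raise ValueError("data must be a multiple of the block size")
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    prev, out = iv, []
    for block in split_blocks(plaintext):
        # Each block needs the previous ciphertext block, so this loop
        # cannot be parallelized: instruction latency dominates.
        prev = toy_block_encrypt(key, xor(block, prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    blocks = split_blocks(ciphertext)
    previous = [iv] + blocks[:-1]
    # Every input is known up front, so each block could be decrypted
    # independently of the others (and therefore in parallel).
    return b"".join(
        xor(toy_block_decrypt(key, c), p) for c, p in zip(blocks, previous)
    )


key = bytes(range(16))
iv = bytes(16)
message = b"sixteen byte blk" * 3
assert cbc_decrypt(key, iv, cbc_encrypt(key, iv, message)) == message
```

This dependency is consistent with the cryptsetup figures later in the paper, where CBC decryption is several times faster than CBC encryption on both processors.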

Should I use the Largest Key Size?

It is assumed that any attacker will know the encryption algorithm being used. This is known as Kerckhoffs' principle—"only secrecy of the key provides security". Therefore, to prevent an attacker from guessing the key, the key needs to be generated truly randomly and must contain sufficient entropy. As such, the larger the key size, the harder it is to "brute force" the key.
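
To put the brute-force argument in numbers, the following Python sketch estimates how long an exhaustive search would take on average (half of the key space). The guess rate of 10^12 keys per second is a purely hypothetical assumption chosen for illustration, not a figure from our tests.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_search_half(key_bits: int, guesses_per_second: float) -> float:
    """Average time, in years, to find a key by trying half of all 2**key_bits keys."""
    return (2 ** key_bits / 2) / guesses_per_second / SECONDS_PER_YEAR


# Hypothetical attacker speed: one trillion key guesses per second.
ASSUMED_RATE = 1e12

for bits in (128, 256):
    years = years_to_search_half(bits, ASSUMED_RATE)
    print(f"{bits}-bit key: about {years:.1e} years")
```

Each additional key bit doubles the search time, so moving from 128-bit to 256-bit keys multiplies the estimate by 2^128.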

Choosing the largest key size is not always a priority for many organizations, and performance is usually the most important factor they consider.

In "Recommendation for Key Management" (NIST Special Publication 800-57), NIST states that 128-bit keys are good for now and the foreseeable future. At the time that AES was chosen by NIST, the NSA recommended that all "TOP SECRET" data be encrypted with AES 256-bit keys. Why? During the competition to replace DES (which AES won), the NSA stated in open meetings that it was for defense against quantum computing. At the time of writing, quantum computing is still not practical.

We ran a number of tests to measure raw crypto performance, from which we derived some observations. These are enumerated in the 'I/O Testing and Results' section of this paper.

Since it is not feasible to mix and match modes for encryption and decryption, and XTS was the fastest mode for encryption in our tests, the remaining choice is between key sizes. 256-bit keys are generally regarded as the default choice for encryption solutions and are recommended by many.

Random Number Generation

Much has been written about the lack of randomness in virtual machines. When special hardware is not available to help with random number generation, Linux relies on sources such as the keyboard or mouse clicks, network activity, and interrupts to seed its entropy pool. In today's cloud environments, keyboard and mouse activity are minimal, which makes random number generation more difficult; Ristenpart and Yilek [2010] describe some of the mechanisms by which random number generation can fail in virtual machines.

Traditional key managers have been shipped on physical hardware, often with HSM (Hardware Security Modules) that provide true, hardware-based sources of random number generation. Today, with most servers being virtualized, we see many key managers running as virtual appliances. Since key management is the critical component within an encryption system, it is imperative that random number generation be effective.

Introduced in Ivy Bridge, the RDRAND instruction can be used for returning random numbers from an Intel® on-chip hardware random number generator. This "digital random number generator" uses an on-processor entropy source. Thus in virtualized environments, with the instruction made available through the hypervisor to the virtual machine, we are now able to effectively produce random numbers from a hardware source. Libraries such as OpenSSL* have been modified to take advantage of the RDRAND instruction.
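
Whether a hypervisor actually exposes these instructions to a guest can be checked from inside the virtual machine. The following Python sketch parses the `flags` line in the format Linux uses in /proc/cpuinfo and looks for the `aes` and `rdrand` flags. The helper names and the sample text are hypothetical and for illustration only.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags\t\t: fpu vme ... aes ... rdrand"
            return set(line.split(":", 1)[1].split())
    return set()


def crypto_support(cpuinfo_text: str) -> dict[str, bool]:
    """Report whether AES-NI and RDRAND are visible to this (virtual) CPU."""
    flags = cpu_flags(cpuinfo_text)
    return {"aes-ni": "aes" in flags, "rdrand": "rdrand" in flags}


# Shortened, made-up sample; on a real Linux guest, pass the contents of
# /proc/cpuinfo instead.
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme sse2 aes avx rdrand\n"
print(crypto_support(SAMPLE))
```

On a Linux guest the same function can be called with `open("/proc/cpuinfo").read()`. If either flag is missing, the hypervisor or the underlying CPU is not exposing the instruction to the virtual machine.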

Disk Encryption

In this section, we show the results of various I/O tests performed on Haswell and Ivy Bridge and explain how encryption can be layered into the I/O stack and how the effect of caching can further reduce the impact of encryption and decryption.

Filesystem and Database Caching

Understanding how encryption is used in the application stack can help us determine the impact that encryption mode will have on performance. Consider Figure 2, which shows the placement of the HyTrust DataControl filter driver in the operating stack.

Figure 2 - Encryption and the effects of OS/Application caching

There are multiple places to encrypt in the application/OS stack including the application level as well as the filesystem level. HyTrust DataControl performs encryption/decryption between the filesystem and the device driver layers. When a read operation occurs, the data comes through the device driver; it is decrypted by HyTrust and passed to the filesystem (or to the application if raw I/O is used). When a write occurs, the filesystem passes the block(s) to the HyTrust driver where it is encrypted before being passed to the device driver to write it to storage. Thus by encrypting/decrypting at the device layer, as opposed to the filesystem layer, we are not changing the effect of caching at either the application or the filesystem layer. Just as applications run unmodified, the same amount of memory is available as a cache with or without the encryption/decryption filter driver.

As time goes by and more reads occur, more application data occupies the operating system page cache and buffer cache, which reduces the need to "go to disk." Similarly, if a database is in use, reads will fill the database cache, also reducing the need to go to disk.

Disk writes are a little different. Certainly, asynchronous operations will only go to the filesystem cache, but at some point they must be written to storage. We tend to see I/O-bound applications due to writes rather than reads. So if we think about encryption and decryption, we can conclude the following:

  • Encryption has higher precedence since we must write data to the disk as fast as possible.
  • Decryption has lower precedence since read operations are typically satisfied by the filesystem or database cache.
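
The effect of caching on decryption can be illustrated with a toy model. The Python sketch below places a small LRU cache above a simulated encrypted device and counts how many reads actually reach the decryption layer. The class name, cache size, and access pattern are hypothetical; this is not the HyTrust driver or the Linux page cache, only an illustration of why cached reads never incur decryption.

```python
from collections import OrderedDict


class CachedEncryptedDevice:
    """Toy model: an LRU cache above a device-level decryption layer."""

    def __init__(self, cache_blocks: int) -> None:
        self.cache_blocks = cache_blocks
        self.cache: OrderedDict[int, str] = OrderedDict()
        self.reads = 0
        self.decrypt_calls = 0

    def read(self, block_no: int) -> str:
        self.reads += 1
        if block_no in self.cache:
            # Cache hit: the request never reaches the device or the
            # decryption layer.
            self.cache.move_to_end(block_no)
            return self.cache[block_no]
        # Cache miss: go to the device and decrypt (simulated here).
        self.decrypt_calls += 1
        data = f"plaintext-{block_no}"
        self.cache[block_no] = data
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict least recently used
        return data


dev = CachedEncryptedDevice(cache_blocks=8)
for _ in range(10):
    for block_no in range(4):  # a small, repeatedly read working set
        dev.read(block_no)
print(f"{dev.reads} reads, {dev.decrypt_calls} decryptions")
```

In this model only the first read of each block is decrypted. Writes have no equivalent shortcut, since every dirty block must eventually be encrypted and written to storage.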

I/O Testing and Results*

Key sizes and raw crypto performance

To measure raw crypto performance and how it relates to key sizes, we ran "cryptsetup benchmark" on the Haswell and Ivy Bridge CPUs. It reports, in MiB per second, how much data each CPU can encrypt and decrypt in memory.

On Ivy Bridge, here are the raw numbers for both Cipher-Block Chaining (CBC) and XEX-based tweaked-codebook mode with ciphertext stealing (XTS), each with 128- and 256-bit AES keys. Note that XTS splits its key in half, so the aes-xts 512b row uses 256-bit AES keys and the aes-xts 256b row uses 128-bit AES keys.

# Tests are approximate using memory only (no storage IO).
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   581.3 MiB/s  1961.8 MiB/s
     aes-cbc   256b   431.4 MiB/s  1503.1 MiB/s
     aes-xts   256b  1665.6 MiB/s  1642.3 MiB/s
     aes-xts   512b  1318.3 MiB/s  1282.1 MiB/s

And for Haswell:

# Tests are approximate using memory only (no storage IO).
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   663.8 MiB/s  2486.8 MiB/s
     aes-cbc   256b   493.9 MiB/s  2043.6 MiB/s
     aes-xts   256b  2265.2 MiB/s  2261.1 MiB/s
     aes-xts   512b  1778.0 MiB/s  1778.7 MiB/s

We made the following observations:

  • For CBC encryption, 128-bit keys are roughly 35% faster than 256-bit keys on both processors.
  • For XTS encryption, 256-bit keys are roughly 26 to 27% faster than 512-bit keys.
  • For CBC decryption, 128-bit keys are roughly 22% (Haswell) to 31% (Ivy Bridge) faster than 256-bit keys.
  • For XTS decryption, 256-bit XTS keys are roughly 27 to 28% faster than 512-bit keys.
  • When comparing XTS against CBC for encryption, XTS is faster for every key size on both processors. With 128-bit AES keys (aes-xts 256b versus aes-cbc 128b), XTS encryption is almost 3x faster on Ivy Bridge and about 3.4x faster on Haswell.
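
The relative differences can be recomputed directly from the two cryptsetup tables above. The following Python sketch does so; the data values are copied from those tables, while the dictionary layout and the `pct_gain` helper are our own illustrative choices.

```python
# (encryption, decryption) throughput in MiB/s, copied from the tables above.
RESULTS = {
    "Ivy Bridge": {
        ("cbc", 128): (581.3, 1961.8),
        ("cbc", 256): (431.4, 1503.1),
        ("xts", 256): (1665.6, 1642.3),
        ("xts", 512): (1318.3, 1282.1),
    },
    "Haswell": {
        ("cbc", 128): (663.8, 2486.8),
        ("cbc", 256): (493.9, 2043.6),
        ("xts", 256): (2265.2, 2261.1),
        ("xts", 512): (1778.0, 1778.7),
    },
}


def pct_gain(faster: float, slower: float) -> float:
    """Percentage by which `faster` exceeds `slower`."""
    return (faster / slower - 1.0) * 100.0


# Smaller key versus larger key, per CPU, mode, and operation.
for cpu, rows in RESULTS.items():
    for mode, small, large in (("cbc", 128, 256), ("xts", 256, 512)):
        for idx, op in enumerate(("encrypt", "decrypt")):
            gain = pct_gain(rows[(mode, small)][idx], rows[(mode, large)][idx])
            print(f"{cpu:10} {mode} {small}b vs {large}b {op}: {gain:.0f}%")

# Haswell versus Ivy Bridge for the same mode and key size.
for key, (ivb_enc, ivb_dec) in RESULTS["Ivy Bridge"].items():
    hsw_enc, hsw_dec = RESULTS["Haswell"][key]
    print(
        f"Haswell vs Ivy Bridge {key[0]} {key[1]}b: "
        f"encrypt {pct_gain(hsw_enc, ivb_enc):.0f}%, "
        f"decrypt {pct_gain(hsw_dec, ivb_dec):.0f}%"
    )
```

Because these are memory-only measurements, they represent an upper bound on what the storage tests in the following sections can show.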

As mentioned earlier in the paper, XTS mode is the faster cipher-block mode for encryption; when choosing key sizes, 256-bit keys are generally regarded as the default choice for encryption solutions.

We ran a number of I/O operations with and without AES-NI support, with different key sizes, and with AES-CBC and AES-XTS modes. We ran two basic sets of tests:

  1. dd of encrypted and unencrypted raw devices
  2. iozone with a single, 4, and 8 threads (filesystem on top of encrypted devices)

We issued multiple runs of each test and took an average of each set of runs. All tests were performed using direct I/O.

We have provided sample output of the different tests.

The systems used were as follows:

Haswell – Intel® Xeon® processor E5-2697 v3 @ 2.60GHz, 37.5MB L3 cache, 14 core pre-production system. Intel® SSD DC P3700 Series @ 400GB, 128GB memory (8x16GB DDR4-2133MHz), BIOS by Intel Corporation Version: GRNDSDP1.86B.0036.R05.1407140519, running Red Hat Enterprise Linux* 7 (kernel 3.10.0-123.el7.x86_64)

Ivy Bridge – Intel® Xeon® processor E5-2697 v2 @ 2.7GHz, 30MB L3 cache, 12 core pre-production system with Intel® DC S3700 800GB SSD, 64GB memory @ DDR3 1866MHz, BIOS by Intel Corporation Version: SE5C600.86B.02.01.0001.080620131246, running Red Hat Enterprise Linux* 7 (kernel 3.10.0-123.el7.x86_64)

Basic "dd" Tests

These were very basic, single-threaded tests run as follows:

	# dd if=/dev/zero of=/dev/<md10|sda3> oflag=direct \
	     bs=1048576 count=10000
	# dd of=/dev/null if=/dev/<md10|sda3> iflag=direct \
	     bs=1048576 count=10000

using an Intel®  SSD DC P3700 Series disk.

Here are the basic results for Haswell, which show the effects of not having AES-NI support. They also show the difference between CBC and XTS modes.

Figure 3 – Simple "dd" tests and effects of different encryption modes

In the table below we show the difference between the AES encryption modes across both Ivy Bridge and Haswell.

CPU                                     | AES mode     | Encryption (MB/sec) | Decryption (MB/sec)
Intel® Xeon® E5-2600 v2 product family  | CBC          | 323                 | 564
                                        | XTS          | 535                 | 564
                                        | XTS % better | 66%                 | Similar
Intel® Xeon® E5-2600 v3 product family  | CBC          | 362                 | 631
                                        | XTS          | 680                 | 640
                                        | XTS % better | 88%                 | Similar

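In these dd tests, Haswell delivers roughly 27% higher XTS encryption throughput than Ivy Bridge (680 versus 535 MB/sec) and roughly 13% higher XTS decryption throughput (640 versus 564 MB/sec). This confirms the AES improvements made with Haswell.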

Iozone Single-Thread Tests

The next sets of tests were single-threaded iozone runs on top of a filesystem that was created on an encrypted device. We ran the following iozone command:

iozone -a -I -i 0 -i 1 -s 1024000

multiple times and took averages.

In the graph shown below, the y-axis represents I/O in MB/sec and the x-axis represents the record length (up to 16 KB) that iozone uses to read and write. We display the results in graph form to show the effect of increasing the I/O size. As the I/O size increases toward 16 KB, throughput rises.

The results with AES-NI support and XTS-256 are shown in Figure 4.

Figure 4 - AES-NI with XTS mode

As expected, encryption (writes) is faster than decryption (reads). At higher I/O sizes, the overhead of encryption becomes less significant.

As a point of reference, we note the performance of encryption/decryption with no AES-NI support to be approximately 25% of the baseline runs.

Multi-threaded iozone Tests

We have already established that XTS mode performs significantly better than CBC so we continue our testing with XTS-AES and with multiple threads.

A number of iozone tests were run with 4 and 8 threads as follows:

iozone -l 4/8 -I -i 0 -i 1 -s 1024000

For 4 threads, the numbers are shown in Figure 5. The "y" axis shows KB/sec.

Figure 5 – iozone test with 4 threads

For 8 threads, the numbers are shown in Figure 6. The "y" axis shows KB/sec:

Figure 6 – iozone test with 8 threads

As we increase the number of threads, the overhead of encryption and decryption becomes minimal. In the case of encryption (writes), we see a degradation of 1 to 4% depending on key size.

Read operations (decryption) show less than 10% degradation for both key sizes. We expect the effects of application/operating system caching to reduce the overhead to a negligible amount.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information go to http://www.intel.com/performance

Summary

Performance is always a concern when deploying encryption. Many organizations, especially when using encryption in the cloud, can add extra processing power at minimal costs to allow for any drop in performance.

As the results described in this paper show, the introduction of AES-NI in Intel processors has produced a dramatic increase in encryption/decryption performance. With the introduction of Haswell, Intel has increased the performance of AES-NI again, especially when running in XTS mode. The drop in performance, especially for multi-threaded and therefore more real-world applications, is negligible so organizations can feel comfortable protecting their data with encryption, with minimal overhead.

With AES-NI instructions available through the hypervisor and with the introduction of RDRAND to aid random number generation, we continue to see significant improvements in encryption and key management, which are becoming more commonplace, especially as organizations move to the public cloud where encrypting data is seen as a necessity.

Acknowledgements

We would like to thank the various teams at Intel who have assisted in this project from setting up systems, advising on the capabilities of Haswell and Ivy Bridge, to assisting with the production of this paper.

About the Authors

Steve Pate is HyTrust’s Chief Architect. Prior to HyTrust, Steve was co-founder and CTO of HighCloud Security (acquired by HyTrust in November 2013). He has more than 25 years of experience in designing, building, and delivering file system, operating system, and security technologies, with a proven history of converting market-changing ideas into enterprise-ready products.
Before HighCloud Security, he built and led teams at ICL, SCO, VERITAS, Vormetric, and others.
Steve has published two books on UNIX kernel internals and UNIX file systems, and has been published in numerous blogs and industry publications. He earned his bachelor’s in computer science from the University of Leeds in the UK.

Kelvin Pryse is Principal Engineer/manager for the HyTrust DataControl engineering team. Kelvin has over 20 years of experience building operating system, storage, and security solutions.
Prior to HyTrust, Kelvin was a Principal Engineer at HighCloud Security, developed encryption and key management technologies at Vormetric, and has been a long-time contributor to the VERITAS Volume Manager across multiple versions of UNIX and Linux. Previous companies included HAL and Amdahl.
Kelvin earned his bachelor’s in computer science from California State University Bakersfield and his master's in computer science from the University of California Santa Barbara.

References

Thompson, Christopher J., Ian J. De Silva, Marie D. Manner, Michael T. Foley, and Paul E. Baxter, Randomness Exposed – An Attack on Hosted Virtual Machines

Ristenpart, Thomas and Scott Yilek [2010], When Good Randomness Goes Bad: Virtual Machine Reset Vulnerabilities and Hedging Deployed Cryptography
