by Muneesh Nagpal, Gururaj Nagendra, and Alexey Omelchenko
This simple optimization walk-through improves an already-optimized sample OpenSSL application's performance by 35% using Intel® cryptography library functions.
With the increase in e-commerce and other transactions in enterprise applications, the demand for higher-performing, secure, and scalable communications is on the rise. From a hardware perspective, as the communication load increases, load balancing is typically accomplished by adding more processors.
From a software standpoint, securing transactions using Secure Socket Layer (SSL) is very compute-intensive and can slow down the performance of the system, which in turn can have a negative effect on scalability. Organizations need cost-effective and flexible hardware solutions that meet their demands, and application developers need a robust cryptography library implementation that is easy to use for creating secure, high-performing applications.
The 64-bit Intel® Itanium® processor offers excellent price/performance and scalability for deploying secure enterprise-scale applications. The Itanium-based platform has superior built-in hardware security features that benefit all operating-system installations. To increase the value-add to the software developer, Intel® Integrated Performance Primitives (Intel® IPP) version 4.0 introduced the cryptography function domain.
Intel Integrated Performance Primitives (Intel IPP) Cryptography Functions
The Intel IPP cryptography function domain is a suite of pre-built public-key, symmetric, and hashing functions that conform to the US Government's National Institute of Standards and Technology (NIST) Federal Information Processing Standards (FIPS) specifications*. It enables fast and robust development of security software solutions that provide authentication, ensure data confidentiality, and maintain data integrity.
These functions are optimized for performance on the Itanium processor family and are engineered to make best use of the platform’s features. The functions are also optimized for the Intel® Xeon® processor, Pentium® 4 processor, Pentium® M processor, and Intel® Personal Internet Client Architecture (Intel® PCA). The Intel® IPP Cryptography addition package is a stand-alone installation that contains the binaries and header files as part of the Intel IPP 4.0 package.
The Intel IPP cryptography functionality supports the following main categories of algorithms (see Appendix A of this document for a complete list of algorithms supported under these categories):
- Symmetric Algorithms
- Hash Algorithms and Data Authentication
- Public Key Algorithms
Advantages of Using the Intel IPP Cryptography Functions
There are numerous advantages to using Intel IPP cryptography functions to generate secure applications:
- Low coding turn-around time due to ease of use – Intel IPP cryptography functions can be incorporated very easily into existing code streams with minimal effort. The library contains easy-to-understand APIs that are intuitive to developers. The resulting code is clean and can be documented with ease. Documentation is provided as part of the Intel IPP product package.
- Substantial reduction in code size – Using Intel IPP cryptography API calls greatly reduces the amount of coding required to implement a given set of functionality. Complex functions can be replaced by very simple calls to Intel IPP routines, substantially reducing code size.
- Optimal execution on Itanium-based platform – Intel® engineers who have unparalleled insight into the Itanium microarchitecture have hand-coded the Intel IPP cryptography routines, incorporating numerous low-level architecture-specific optimizations. Applications can seamlessly take advantage of features such as 64-bit arithmetic, software pipelining, and instruction scheduling. This capability relieves developers from spending their time writing custom code to optimize performance.
- Security enhancement – Intel IPP cryptographic primitives operate exclusively on user-supplied buffers to maintain domain separation for various operations involving key generation, encryption, and decryption. In addition, each function validates buffer usage and size to avoid buffer overflows.
- Public cryptography key-size scalability – Intel IPP cryptographic primitives for public-key cipher engines offer a flexible user interface to scale the key size up to 4,000 bits.
- Multiplatform support – Intel IPP includes a CPU-specific code dispatcher, which dynamically invokes the function versions that provide the best performance on the target processor. This capability guarantees seamless optimal performance across platforms with no code changes.
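The dispatching idea in the last item above can be illustrated with a minimal sketch. It is not Intel IPP's dispatcher: the `cpu_kind` values, `select_sum`, and both summation routines are hypothetical names invented for this illustration, and a real dispatcher would query the processor rather than take the CPU class as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU classes; a real dispatcher would detect the processor. */
typedef enum { CPU_GENERIC, CPU_WIDE64 } cpu_kind;

/* All variants share one signature so callers never change. */
typedef uint32_t (*sum_fn)(const uint32_t *v, size_t n);

/* Portable fallback: one element per iteration. */
uint32_t sum_generic(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Stand-in for a tuned variant: two elements per iteration. */
uint32_t sum_unrolled(const uint32_t *v, size_t n) {
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += v[i]; s1 += v[i + 1]; }
    if (i < n) s0 += v[i];   /* odd element left over */
    return s0 + s1;
}

/* Choose an implementation once; callers use the returned pointer. */
sum_fn select_sum(cpu_kind kind) {
    sum_fn fn = (kind == CPU_WIDE64) ? sum_unrolled : sum_generic;
    assert(fn != NULL);
    return fn;
}
```

Because every variant has the same signature and produces the same result, application code calls through the selected pointer and needs no changes when it runs on a different processor.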
Intel IPP Cryptography at Work - OpenSSL Code Sample
The OpenSSL Project is a collaborative effort to develop a robust, commercial-grade, full-featured, open-source toolkit that implements the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) protocols. The toolkit also includes a general-purpose cryptography library. SSL is a layer that sits between application protocols such as HTTP and the underlying transport, ensuring that connections are opened and closed in a secure, authenticated, and authorized manner.
The layer is also responsible for Public and Private Key exchanges, as well as data encryption and decryption. The OpenSSL project is managed by a worldwide community of volunteers that use the Internet to communicate, plan, and develop the OpenSSL toolkit and its related documentation. OpenSSL is based on the SSLeay library developed by Eric A. Young and Tim J. Hudson from Cryptsoft Inc.
In the code sample used in this article, the SSL source was modified so that certain Big Number functions are replaced by calls to Intel IPP routines. These routines are part of the Public Key Algorithm support provided by the Intel IPP cryptography libraries. A varying number of SSL connections were then attempted to compare the performance of the initial version of the code with that of the version optimized using Intel IPP functions.
Figure 1 and Table 1 describe the averaged results of five runs. In every run, 32, 64, 96, 128, 160, 192, 224, and 256 concurrent OpenSSL 512-bit connections were attempted. The batch file executing a run included a pass with the original binaries and then a pass with the Intel IPP-optimized binaries. The runs returned numbers of clock ticks taken by the Itanium-based system to process the connection requests, which were averaged, tabulated, and graphed.
Figure 1: OpenSSL Performance with Intel IPP Cryptography Library [see Appendix B for Details]
Table 1: OpenSSL Performance with Intel IPP Cryptography Library [see Appendix B for details]
Intel IPP-based code provided an average gain of more than 35% over the already-optimized off-the-shelf OpenSSL code with minimal code changes. The resulting source code is easy to understand and document, and it is substantially smaller in size. See Appendix B for details about hardware and software configurations and clock-tick counts for each of the five individual runs.
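The averaging and gain arithmetic behind these figures can be sketched as follows. This assumes the gain is computed as the relative reduction in clock ticks against the baseline without Intel IPP; the article does not spell out the formula, and the function names `mean_ticks` and `percent_gain` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n clock-tick samples, for example the five runs at one
 * connection count. n must be positive. */
double mean_ticks(const double *ticks, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += ticks[i];
    return sum / (double)n;
}

/* Assumed definition of gain: percentage reduction in clock ticks of the
 * optimized build relative to the baseline build,
 * 100 * (baseline - optimized) / baseline. */
double percent_gain(double baseline_ticks, double optimized_ticks) {
    assert(baseline_ticks > 0.0);
    return 100.0 * (baseline_ticks - optimized_ticks) / baseline_ticks;
}
```

Under this definition, a run that needs 65 ticks where the baseline needs 100 corresponds to a 35% gain. The measured values themselves are in the tables of Appendix B and are not reproduced here.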
The application code that accompanies this article is downloadable here. The following steps allow you to run it on an Itanium®-based system under Windows* Server 2003:
- Install Windows Advanced Server 2003 SP1.
- Intel IPP version 4.0 is available for download. Register for the product to download the Intel IPP 4.0 packages.
- Download and install the Intel IPP v4.0 package and then Cryptography processing for Intel IPP v4.0, in that order.
- Download and unzip the code sample above on an Itanium-based system running Windows Server 2003.
- Go to the directory OpenSSLBin and rename the file ssltest._xe to ssltest.exe.
- In the OpenSSL folder, run the batch file run_test.bat.
- A file named test.log will be created in the OpenSSL folder that contains the results of the test.
The test.log file displays eight sets of ‘New session’ results (one set of which is shown in Table 2) for binaries that do and do not use Intel IPP. The eight sets of results represent clock ticks for 32, 64, 96, 128, 160, 192, 224, and 256 concurrent SSL connections, respectively. The total number of clock ticks (as shown in Table 2) is the value of interest in each run, with each run invoking a different number of concurrent SSL connections. These values can then be tabulated and graphed to gauge performance.
Table 2: Result format of the sample code execution
Understanding the OpenSSL Source Changes
To view the source files of interest, go to folder OpenSSLC_code. The file bn_asm_old.c is the original source file without Intel IPP functions. The file bn_asm.c is the modified, equivalent file that achieves the same functionality using Intel IPP functions.
Consider the first function, bn_mul_add_words, as an example. This function multiplies the unsigned big number pointed to by ap by a 32-bit integer constant and adds the product to the big number pointed to by rp; the computed result is stored in the location pointed to by rp. As a return value, the function also returns the carry, which within the function is held in the variable cl.
Table 3: Intel IPP Implementation of bn_mul_add_words function
This is a typical scenario. Complicated while and if structures are replaced by a single Intel IPP function that accepts pointers to predefined data types, which are buffered within the Intel IPP routine during execution. The Intel IPP routine names are indicative of their functionality; for instance, ippsMACOne_BNU_I represents an Intel IPP security routine for a Multiply and Accumulate operation using one MAC unit on Big Number Unsigned Int operands.
From a developer’s perspective, there is minimal coding overhead involved in these Intel IPP optimizations. The extent of the coding involves instantiating the appropriate structures and passing pointers to those structures into the Intel IPP routine. The function takes care of the rest. No customization is required to pass the arguments and invoke the functions. Ease of use of the Intel IPP routines adds significant value to the developer and is a primary design goal of Intel IPP.
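For reference, the multiply-and-accumulate operation that bn_mul_add_words performs can be written in portable C as below. This is a minimal sketch assuming 32-bit words, not the code from bn_asm_old.c or bn_asm.c, and it does not call Intel IPP; the name `mul_add_words_ref` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Computes rp[i] += ap[i] * w for i in [0, num), propagating the carry
 * from word to word. Returns the carry out of the most significant word. */
uint32_t mul_add_words_ref(uint32_t *rp, const uint32_t *ap, int num, uint32_t w) {
    assert(num >= 0);
    uint32_t carry = 0;
    for (int i = 0; i < num; i++) {
        /* A 32x32-bit product plus two 32-bit addends always fits in
         * 64 bits, so this sum cannot overflow. */
        uint64_t t = (uint64_t)ap[i] * w + rp[i] + carry;
        rp[i] = (uint32_t)t;           /* low word is stored back */
        carry = (uint32_t)(t >> 32);   /* high word carries forward */
    }
    return carry;
}
```

A routine like this is a useful oracle when checking that a replacement built on a library call returns the same words and the same final carry as the original.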
As another example, consider the bn_div_words big number routine. This routine performs an unsigned divide of a 64-bit big number by a 32-bit divisor. The 64-bit dividend is split into two 32-bit halves, h (high) and l (low), which are passed as parameters.
Table 4: Intel IPP Implementation of bn_div_words function
This example clearly illustrates the advantages of using Intel IPP to reduce code bloat and to create clean, easy-to-document interfaces. The function bn_div_words, implemented using the Intel IPP routine ippsDiv_64u32u, is one-third the size of its predecessor and much easier to understand. The result is displayed in big- or little-endian mode, based on whether or not the directive L_ENDIAN is defined. The quotient is returned in variable r and the carry in variable carry. Array a is initialized with the h and l portions of the dividend, and d is the divisor.
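The arithmetic that bn_div_words performs can be sketched in portable C as follows. This is a reference sketch only, not the sample's code and not a call to ippsDiv_64u32u. The name `div_words_ref` is hypothetical, and the precondition h < d (so that the quotient fits in 32 bits) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Divides the 64-bit value formed by h (high word) and l (low word)
 * by the 32-bit divisor d and returns the quotient.
 * Assumes d != 0 and h < d, so the quotient fits in 32 bits. */
uint32_t div_words_ref(uint32_t h, uint32_t l, uint32_t d) {
    assert(d != 0 && h < d);
    uint64_t dividend = ((uint64_t)h << 32) | l;  /* rebuild the 64-bit value */
    return (uint32_t)(dividend / d);
}
```

For instance, with h = 1, l = 0, and d = 2, the dividend is 2^32 and the quotient is 2^31.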
Though two function code examples are described here, the interested reader should look through the rest of the functions in bn_asm_old.c (original source) and bn_asm.c (Intel IPP optimized source) to see other function-implementation comparisons.
The Intel IPP cryptography functions allow the deployment of high-performing, secure applications on the Itanium-based platform with minimum developer effort. Reduced coding time and API-like interfaces that are easy to understand and document help achieve quick development, testing, and deployment.
With the increasing number and complexity of server applications, securing data and optimizing performance on servers is a balancing act and a big challenge. The Intel Itanium processor and Intel IPP cryptography software library provide security building blocks to create robust, high-performing, and highly scalable security applications.
Download code sample
About the Authors
Muneesh Nagpal, Server Applications Engineer, Core Software Division, Intel® Corporation, was the Itanium Technical Marketing Engineer representing Intel Engineering in the Intel platform decision team. He is currently an Applications Engineer on the Intel®/IBM® DB2 team, working on TPC-based industry-standard benchmarks.
Gururaj Nagendra, Senior Software Engineer and Architect, Software Products Division, SSG, has been working on the Intel IPP team for more than two years, enabling new functional domains for Intel IPP, a library product. His primary focus is to enable library products for new technologies such as XML and managed runtime environments. He holds an M.S. in Computer Engineering and a B.E. in Computer Science and Engineering.
Alexey Omelchenko is a Software Engineer in the Software Enabling Division, Intel Corporation. From 2001 to 2003, Alexey worked on optimizing crypto, video, and small-matrix processing algorithms in the corresponding domains of Intel® IPP, and on competitive benchmarking and performance analysis at different optimization levels on Intel® Pentium® 4 and Itanium® processors for the Intel® C++ compiler 8.0 launch.
Appendix A: Supported Algorithms
- Symmetric Algorithms
  - Data Encryption Standard algorithm functions
  - Triple Data Encryption functions
  - Rijndael functions
  - Blowfish algorithm functions
  - Twofish algorithm functions
- Hash Algorithms and Data Authentication
  - MD5 algorithm function
  - SHA512 algorithm functions
  - Keyed-hash based message authentication code functions performing under the HMAC SHA1, HMAC SHA256, HMAC384, and HMAC512 schemes
  - Data authentication functions performing under DAA DES/TDES and DAA Rijndael schemes
- Public Key Algorithms
  - Big number arithmetic functions
  - Montgomery reduction scheme functions
  - Pseudorandom number generation functions
  - Prime number generation functions
  - RSA algorithm functions
  - DSA algorithm functions
Appendix B: System Configuration and Results
Hardware System Configuration:
- 4 x Intel® Itanium® processor, 1.5 GHz, 6 MB L3 cache
- RAM: 8 GB (DDR)
- Front Side Bus: 200 MHz, dual-pumped (400 MHz effective)
- Hard Drive(s): 34 GB
  - 17 GB OS
  - 17 GB Application
Software System Configuration:
- Operating System: Windows* Enterprise Server 2003
- Windows* Enterprise Server 2003 SP1
- Intel® IPP 4.0
- Cryptography Processing for the Intel IPP v4.0 for Windows* package
Table 5: Clock Ticks for Five Runs (Without Intel IPP as baseline)
Table 6: Average Performance Improvement for Five Runs (Without Intel IPP runs used as baseline)