Last Modified On: May 6, 2008 7:12 PM PDT
by Muneesh Nagpal, Gururaj Nagendra, and Alexey Omelchenko
This simple optimization walk-through improves an already-optimized sample OpenSSL application's performance by 35% using Intel® cryptography library functions.
With the increase in e-commerce and other transactions in enterprise applications, the demand for higher-performing, secure, and scalable communications is on the rise. From a hardware perspective, as the communication load increases, load balancing is typically accomplished by adding more processors.
From a software standpoint, securing transactions using Secure Socket Layer (SSL) is very compute-intensive and can slow down the performance of the system, which in turn can have a negative effect on scalability. Organizations need cost-effective and flexible hardware solutions that meet their demands, and application developers need a robust cryptography library implementation that is easy to use for creating secure, high-performing applications.
The 64-bit Intel® Itanium® processor offers excellent price/performance and scalability for deploying secure enterprise-scale applications. The Itanium-based platform has superior built-in hardware security features that benefit all operating-system installations. To increase the value-add to the software developer, Intel® Integrated Performance Primitives (Intel® IPP) version 4.0 introduced the cryptography function domain.
The Intel IPP cryptography function domain is a suite of pre-built public-key, symmetric, and hashing functions that conform to the US Government's National Institute of Standards and Technology (NIST) Federal Information Processing Standards (FIPS) specifications*. It enables fast and robust development of security software solutions that provide authentication, data confidentiality, and data integrity.
These functions are optimized for performance on the Itanium processor family and are engineered to make best use of the platform’s features. The functions are also optimized for the Intel® Xeon® processor, Pentium® 4 processor, Pentium® M processor, and Intel® Personal Internet Client Architecture (Intel® PCA). The Intel® IPP Cryptography addition package is a stand-alone installation that contains the binaries and header files as part of the Intel IPP 4.0 package.
The Intel IPP cryptography functionality supports the following main categories of algorithms (see Appendix A of this document for a complete list of algorithms supported under these categories):
There are numerous advantages to using Intel IPP cryptography functions to generate secure applications:
The OpenSSL Project is a collaborative effort to develop a robust, commercial-grade, full-featured, open-source toolkit that implements the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) protocols. The toolkit also includes a general-purpose cryptography library. SSL is a layer that sits between the transport protocol (TCP) and application protocols such as HTTP, ensuring that connections are opened and closed in a secure, authenticated, and authorized manner.
The layer is also responsible for Public and Private Key exchanges, as well as data encryption and decryption. The OpenSSL project is managed by a worldwide community of volunteers that use the Internet to communicate, plan, and develop the OpenSSL toolkit and its related documentation. OpenSSL is based on the SSLeay library developed by Eric A. Young and Tim J. Hudson from Cryptsoft Inc.
In the code sample used in this article, the SSL source was modified so that certain Big Number functions are replaced by calls to Intel IPP routines. These Intel IPP routines are part of the Public Key Algorithm support provided by the Intel IPP cryptography libraries. A varying number of SSL connections were then attempted to measure the difference in performance between the initial version of the code and the version optimized using Intel IPP functions.
Figure 1 and Table 1 describe the averaged results of five runs. In every run, 32, 64, 96, 128, 160, 192, 224, and 256 concurrent OpenSSL 512-bit connections were attempted. The batch file executing a run included a pass with the original binaries and then a pass with the Intel IPP-optimized binaries. The runs returned numbers of clock ticks taken by the Itanium-based system to process the connection requests, which were averaged, tabulated, and graphed.
Figure 1: OpenSSL Performance with Intel IPP Cryptography Library [see Appendix B for Details]
Table 1: OpenSSL Performance with Intel IPP Cryptography Library [see Appendix B for details]
Intel IPP-based code provided an average gain of more than 35% over the already-optimized off-the-shelf OpenSSL code with minimal code changes. The resulting source code is easy to understand and document, and it is substantially smaller in size. See Appendix B for details about hardware and software configurations and clock-tick counts for each of the five individual runs.
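As a rough illustration of how such a gain figure can be derived from the logged clock ticks, the following minimal sketch averages tick counts over several runs and compares two averages. The function names are hypothetical, and the definition of gain used here (the reduction in clock ticks relative to the baseline without Intel IPP) is an assumption; the article does not state the exact formula.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: arithmetic mean of n clock-tick counts (n > 0). */
double mean_ticks(const uint64_t *ticks, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double)ticks[i];
    }
    return sum / (double)n;
}

/*
 * Hypothetical helper: percentage reduction in clock ticks relative to
 * the baseline (assumed definition of "gain"; baseline must be nonzero).
 */
double percent_gain(double baseline_ticks, double optimized_ticks)
{
    return 100.0 * (baseline_ticks - optimized_ticks) / baseline_ticks;
}
```

In use, the five baseline totals and the five Intel IPP totals for a given connection count would each be passed to mean_ticks, and the two means passed to percent_gain.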
The application code that accompanies this article is downloadable here. The following steps allow you to run it on an Itanium®-based system under Windows* Server 2003:
The test.log file displays eight sets of ‘New session’ results (one set of which is shown in Table 2) for binaries that do and do not use Intel IPP. The eight sets of results represent clock ticks for 32, 64, 96, 128, 160, 192, 224, and 256 concurrent SSL connections, respectively. The total number of clock ticks (as shown in Table 2) is the value of interest in each run, with each run invoking a different number of concurrent SSL connections. These values can then be tabulated and graphed to gauge performance.
Table 2: Result format of the sample code execution
To view the source files of interest, go to folder OpenSSLC_code. The file bn_asm_old.c is the original source file without Intel IPP functions. The file bn_asm.c is the modified, equivalent file that achieves the same functionality using Intel IPP functions.
Consider the first function, bn_mul_add_words, as an example. This function multiplies the unsigned big-number integer pointed to by ap by a 32-bit integer constant and adds the product to the big number pointed to by rp, which is also where the computed result is stored. The function returns the carry, which within the function is held in the variable cl.
Table 3: Intel IPP Implementation of bn_mul_add_words function
This is a typical scenario. Complicated while and if structures are replaced by a single Intel IPP function that accepts pointers to predefined data types; any buffering needed during execution is handled inside the Intel IPP routine. The Intel IPP routine names are indicative of their functionality; for instance, ippsMACOne_BNU_I represents an Intel IPP security routine for a Multiply and Accumulate operation using one MAC unit on Big Number Unsigned Int operands.
From a developer’s perspective, there is minimal coding overhead involved in these Intel IPP optimizations. The extent of the coding involves instantiating the appropriate structures and passing pointers to those structures into the Intel IPP routine. The function takes care of the rest. No customization is required to pass the arguments and invoke the functions. Ease of use of the Intel IPP routines adds significant value to the developer and is a primary design goal of Intel IPP.
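To make the operation behind bn_mul_add_words concrete, here is a minimal portable sketch of a multiply-and-accumulate over 32-bit words. It assumes the conventional semantics (each word of ap is multiplied by a single word w, added into the corresponding word of rp, and the carry is propagated); the function name, the parameter w, and the use of a 64-bit intermediate are illustrative assumptions. It is neither the original bn_asm_old.c code nor the Intel IPP-based version, and it deliberately does not call ippsMACOne_BNU_I, whose exact signature is not given in this article.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name): for i in [0, num),
 * computes rp[i] += ap[i] * w with carry propagation and returns the
 * final carry word. Assumes 32-bit big-number words.
 */
uint32_t mul_add_words_sketch(uint32_t *rp, const uint32_t *ap,
                              size_t num, uint32_t w)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < num; i++) {
        /* A 32x32-bit product plus two 32-bit addends always fits in 64 bits. */
        uint64_t t = (uint64_t)ap[i] * w + rp[i] + carry;
        rp[i] = (uint32_t)t;          /* low 32 bits stay in place */
        carry = (uint32_t)(t >> 32);  /* high 32 bits carry into the next word */
    }
    return carry;
}
```

The explicit loop and carry handling shown here are the kind of logic that a single library call replaces in the optimized source.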
As another example, consider the bn_div_words big number routine. This routine divides a 64-bit unsigned big number by a 32-bit divisor. The 64-bit dividend is split into two 32-bit halves, h (high) and l (low), that are passed as parameters.
Table 4: Intel IPP Implementation of bn_div_words function
This example clearly illustrates the advantages of using Intel IPP to reduce code bloat and to create clean, easy-to-document interfaces. The function bn_div_words, implemented using the Intel IPP routine ippsDiv_64u32u, is one-third the size of its predecessor and much easier to understand. The result is displayed in Big- or Little-Endian mode, based on whether or not the directive L_ENDIAN is defined. The quotient is returned in variable r and the carry in variable carry. Array a[] is initialized with the h and l portions of the dividend, and d is the divisor.
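The division itself can be sketched in portable C as follows. This is a minimal illustration assuming a compiler with a native 64-bit integer type. The function name and the rem output parameter are hypothetical, and the sketch does not reproduce the ippsDiv_64u32u call or the L_ENDIAN handling from bn_asm.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name): divides the 64-bit value
 * formed from h (high word) and l (low word) by the 32-bit divisor d.
 * Assumes d != 0 and h < d, so that the quotient fits in 32 bits.
 * If rem is non-NULL, the remainder is stored through it.
 */
uint32_t div_words_sketch(uint32_t h, uint32_t l, uint32_t d, uint32_t *rem)
{
    uint64_t n = ((uint64_t)h << 32) | l;  /* reassemble the 64-bit dividend */
    if (rem != NULL) {
        *rem = (uint32_t)(n % d);
    }
    return (uint32_t)(n / d);
}
```

The precondition h < d matters: without it the true quotient would need more than 32 bits and the cast would silently truncate it.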
Though two function code examples are described here, the interested reader should look through the rest of the functions in bn_asm_old.c (original source) and bn_asm.c (Intel IPP optimized source) to see other function-implementation comparisons.
The Intel IPP cryptography functions allow the deployment of high-performing, secure applications on the Itanium-based platform with minimum developer effort. Reduced coding time and API-like interfaces that are easy to understand and document help achieve quick development, testing, and deployment.
With the increasing number and complexity of server applications, securing data and optimizing performance on servers is a balancing act and a big challenge. The Intel Itanium processor and Intel IPP cryptography software library provide security building blocks to create robust, high-performing, and highly scalable security applications.
Sample code:
Download code sample
Muneesh Nagpal, Server Applications Engineer, Core Software Division, Intel® Corporation, was the Itanium Technical Marketing Engineer representing Intel Engineering in the Intel platform decision team. He is currently an Applications Engineer on the Intel®/IBM® DB2 team, working on TPC-based industry-standard benchmarks.
Gururaj Nagendra, Senior Software Engineer and Architect, Software Products Division, SSG, has been working on the Intel IPP team for more than 2 years, enabling new functional domains for Intel IPP, a library product. His primary focus is to enable library products for new technologies such as XML and managed runtime environments. He holds an M.S. in Computer Engineering and a B.E. in Computer Science and Engineering.
Alexey Omelchenko is a Software Engineer in the Software Enabling Division, Intel Corporation. From 2001 to 2003, Alexey worked on optimizing cryptography, video, and small-matrix processing algorithms in the corresponding Intel® IPP domains, and on competitive benchmarking and performance analysis at different optimization levels on Intel® Pentium® 4 and Itanium® processors for the Intel® C++ Compiler 8.0 launch.
Software System Configuration:
Table 5: Clock Ticks for Five Runs (Without Intel IPP as baseline)
Table 6: Average Performance Improvement for Five Runs (Without Intel IPP runs used as baseline)
