Optimizing Performance of the AES Algorithm for the Intel® Pentium® 4 Processor

This paper describes how to optimize the Advanced Encryption Standard (AES) algorithm for the Intel® Pentium® 4 processor. A brief overview of the original AES code precedes a description of how to improve the performance of AES using the Intel® C++ Compiler. Lastly, potential drawbacks of coding AES in assembly language are discussed.

Introduction

The Data Encryption Standard (DES) had served as an important cryptographic algorithm for over two decades. However, the growth of computing power during that time had compromised the security of that algorithm. There was considerable evidence that it was time to replace DES with a new standard. In October 2000, the National Institute of Standards and Technology (NIST) selected the Rijndael algorithm as the new encryption standard to replace the current DES algorithm.

For further information on the selection process and AES itself see: csrc.nist.gov/CryptoToolkit/aes/*.

Overview of Rijndael Encryption

Antoon Bosselaers and Vincent Rijmen, who proposed Rijndael as the AES, wrote the following implementation of the Rijndael algorithm in the C programming language. Since this was the only publicly available implementation at the time, the author chose it for optimization on the Pentium® 4 processor.

 ```1 int rijndaelEncrypt (word8 a[16], word8 b[16], word8 rk[MAXROUNDS+1][4][4]) { int r; word8 temp[4][4]; *((w32*)temp[0]) = *((w32*)a) ^ *((w32*)rk[0][0]); *((w32*)temp[1]) = *((w32*)(a+4)) ^ *((w32*)rk[0][1]); *((w32*)temp[2]) = *((w32*)(a+8)) ^ *((w32*)rk[0][2]); *((w32*)temp[3]) = *((w32*)(a+12)) ^ *((w32*)rk[0][3]); *((w32*)b)= *((w32*)T1[temp[0][0]]) ^ *((w32*)T2[temp[1][1]]) ^ *((w32*)T3[temp[2][2]]) ^ *((w32*)T4[temp[3][3]]); *((w32*)(b+4))= *((w32*)T1[temp[1][0]]) ^ *((w32*)T2[temp[2][1]]) ^ *((w32*)T3[temp[3][2]]) ^ *((w32*)T4[temp[0][3]]); *((w32*)(b+8))= *((w32*)T1[temp[2][0]]) ^ *((w32*)T2[temp[3][1]]) ^ *((w32*)T3[temp[0][2]]) ^ *((w32*)T4[temp[1][3]]); *((w32*)(b+12))=*((w32*)T1[temp[3][0]]) ^ *((w32*)T2[temp[0][1]]) ^ *((w32*)T3[temp[1][2]]) ^ *((w32*)T4[temp[2][3]]); for(r = 1; r < ROUNDS-1; r++) { *((w32*)temp[0]) = *((w32*)b) ^ *((w32*)rk[r][0]); *((w32*)temp[1]) = *((w32*)(b+4)) ^ *((w32*)rk[r][1]); *((w32*)temp[2]) = *((w32*)(b+8)) ^ *((w32*)rk[r][2]); *((w32*)temp[3]) = *((w32*)(b+12)) ^ *((w32*)rk[r][3]); *((w32*)b)= *((w32*)T1[temp[0][0]]) ^ *((w32*)T2[temp[1][1]]) ^ *((w32*)T3[temp[2][2]]) ^ *((w32*)T4[temp[3][3]]); *((w32*)(b+4)) = *((w32*)T1[temp[1][0]]) ^ *((w32*)T2[temp[2][1]]) ^ *((w32*)T3[temp[3][2]]) ^ *((w32*)T4[temp[0][3]]); *((w32*)(b+8)) = *((w32*)T1[temp[2][0]]) ^ *((w32*)T2[temp[3][1]]) ^ *((w32*)T3[temp[0][2]]) ^ *((w32*)T4[temp[1][3]]); *((w32*)(b+12))= *((w32*)T1[temp[3][0]]) ^ *((w32*)T2[temp[0][1]]) ^ *((w32*)T3[temp[1][2]]) ^ *((w32*)T4[temp[2][3]]); } } ```

Figure 1: Antoon Bosselaer's implementation of Rijndael

The array a contains 16 bytes of input text, the array b will contain the output and the array rk contains 16 bytes of key for each round of encryption. For 128-bit Rijndael encryption, the value of MAXROUNDS and ROUNDS will be 14 and 10, respectively.

The array temp is the state table containing indexes into four fixed sized, constant tables.

1. T1
2. T2
3. T3
4. T4
• Lines 7-10 calculate the initial value of the state table.
• Lines 12-26 perform the table lookups and the XOR operations to calculate the encrypted text for the first iteration, which also becomes the input to the next iteration.
• The state table for next iteration is calculated in lines 29-32, which is used in lines 34-49 to calculate the encrypted text for that round.

Optimizing Rijndael in 'C'

Rearrange the Code and Hide Dependencies

We will start optimizing the original C code shown in Figure 1 by rearranging the code. We will move lines 12-15 inside the 'for' loop as shown in Figure 2. This reduces the number of assignment statements, avoiding assignment to the 'temp' array through the 'b' array. Instead we use the 'temp' array to communicate between iterations of the loop.

Since the Pentium 4 processor can execute several instructions in parallel and out of the order they appear in the code, it is often useful to rearrange the code to let the processor start a long computation so that the result will be ready for later calculations that depend on it. This is often referred to as 'hiding the latency' of an instruction. The first four bytes of the encrypted text are available after the calculation in line 12. In order to hide the latency of instructions, we can start the operations of line 29 at line 12. In this way we can avoid building a dependency chain that can interfere with the out-of-order and parallel execution in the processor. The modified code in Figure 2 includes changes to start computing the state table as soon as possible. It also partially unrolls the 'for' loop.

 ```int rijndaelEncrypt (word8 a[16], word8 b[16], word8 rk[MAXROUNDS+1][4][4]) { register int r; word8 temp0[4][4], temp1[4][4]; *((w32*)temp0[0]) = *((w32*)a) ^ *((w32*)rk[0][0]); *((w32*)temp0[1]) = *((w32*)(a+4)) ^ *((w32*)rk[0][1]); *((w32*)temp0[2]) = *((w32*)(a+8)) ^ *((w32*)rk[0][2]); *((w32*)temp0[3]) = *((w32*)(a+12)) ^ *((w32*)rk[0][3]); for (r = 1; r< (ROUNDS-1); r= r+2) { *((w32*)temp1[0]) = *((w32*)rk[r][0]) ^ *((w32*)T1[temp0[0][0]]) ^ *((w32*)T2[temp0[1][1]]) ^ *((w32*)T3[temp0[2][2]]) ^ *((w32*)T4[temp0[3][3]]); *((w32*)temp1[1]) = *((w32*)rk[r][1]) ^ *((w32*)T1[temp0[1][0]]) ^ *((w32*)T2[temp0[2][1]]) ^ *((w32*)T3[temp0[3][2]]) ^ *((w32*)T4[temp0[0][3]]); *((w32*)temp1[2]) = *((w32*)rk[r][2]) ^ *((w32*)T1[temp0[2][0]]) ^ *((w32*)T2[temp0[3][1]]) ^ *((w32*)T3[temp0[0][2]]) ^ *((w32*)T4[temp0[1][3]]); *((w32*)temp1[3]) = *((w32*)rk[r][3]) ^ *((w32*)T1[temp0[3][0]]) ^ *((w32*)T2[temp0[0][1]]) ^ *((w32*)T3[temp0[1][2]]) ^ *((w32*)T4[temp0[2][3]]); *((w32*)temp0[0]) = *((w32*)rk[r+1][0]) ^ *((w32*)T1[temp1[0][0]]) ^ *((w32*)T2[temp1[1][1]]) ^ *((w32*)T3[temp1[2][2]]) ^ *((w32*)T4[temp1[3][3]]); *((w32*)temp0[1]) = *((w32*)rk[r+1][1]) ^ *((w32*)T1[temp1[1][0]]) ^ *((w32*)T2[temp1[2][1]]) ^ *((w32*)T3[temp1[3][2]]) ^ *((w32*)T4[temp1[0][3]]); *((w32*)temp0[2]) = *((w32*)rk[r+1][2]) ^ *((w32*)T1[temp1[2][0]]) ^ *((w32*)T2[temp1[3][1]]) ^ *((w32*)T3[temp1[0][2]]) ^ *((w32*)T4[temp1[1][3]]); *((w32*)temp0[3]) = *((w32*)rk[r+1][3]) ^ *((w32*)T1[temp1[3][0]]) ^ *((w32*)T2[temp1[0][1]]) ^ *((w32*)T3[temp1[1][2]]) ^ *((w32*)T4[temp1[2][3]]); } //Code for Last round -1 //Code for Last round ```

Figure 2: Rearranged code with partially unrolled 'for' loop

Last Round of Rijndael

The last round of Rijndael encryption is different from its previous rounds. Figure 3 shows the code for the last round from Bosselaers's original implementation.

 ```b[0] = T1[temp1[0][0]][1]; b[1] = T1[temp1[1][1]][1]; b[2] = T1[temp1[2][2]][1]; b[3] = T1[temp1[3][3]][1]; . . . b[11] = T1[temp1[1][3]][1]; b[12] = T1[temp1[3][0]][1]; b[13] = T1[temp1[0][1]][1]; b[14] = T1[temp1[1][2]][1]; b[15] = T1[temp1[2][3]][1]; ```

Figure 3: Code for last round in 'C'

In the last round, we are always accessing column one of table T1. By making a separate table from the elements of column one, we avoid the extra operations to go to column one each time; thereby eliminating these operations.

The optimizations described thus far can result in approximately a 15% improvement over the original code.

Assembly Analysis of C

Figure 4 shows a section of the assembly code generated by the Intel C++ Compiler V5.0 from the original C implementation. Notice that lines 3, 7, 13, 16, all have the same problem. Load instructions from data addresses are not four-byte aligned, which is the primary source of performance degradation in this code. For the Pentium® 4 processor, it is always better to read four-byte aligned data portions that completely contain the data and then shift/mask the data as necessary, since the penalty for not doing this is much higher than the cost of the shifts.

 ``` movzx eax, BYTE PTR [esp+9] xor edx, DWORD PTR _T2[0+eax*4] movzx eax, BYTE PTR [esp+14] xor edx, DWORD PTR _T3[0+eax*4] movzx eax, BYTE PTR [esp+3] xor edx, DWORD PTR _T4[0+eax*4] mov DWORD PTR [ebx+4], edx movzx eax, BYTE PTR [esp+8] mov edx, DWORD PTR _T1[0+eax*4] movzx eax, BYTE PTR [esp+13] xor edx, DWORD PTR _T2[0+eax*4] movzx eax, BYTE PTR [esp+2] xor edx, DWORD PTR _T3[0+eax*4] movzx eax, BYTE PTR [esp+7] xor edx, DWORD PTR _T4[0+eax*4] ```

Figure 4: Assembly generated by the Intel C++ Compiler

To solve the problem of reading from odd addresses, we will change the state table of bytes to an array of four 32-bit double words. We will also make the same change to array a, array b and array rk. The Intel C++ Compiler with (-O2) option insures that there are no unaligned data accesses and generates simple instructions that will execute best on the Pentium 4 processor. On the Pentium 4 processor, simple arithmetic and logic instructions make use of the Rapid Execution Engine that operates at twice the core frequency. Figure 5 shows how to calculate the first four bytes of the encrypted text with modified state table.

 ```Temp[0] = rkey[0] ^ aes_table[0][ (unsigned char) stateTable[0]] ^ aes_table[1][ (unsigned char)(stateTable[1] >> 8)] ^ aes_table[2][ (unsigned char)(stateTable[2] >> 16)] ^ aes_table[3][ (unsigned char)(stateTable[3] >> 24)]; ```

Figure 5: Temp[0] contains encrypted text

Potential Drawbacks

Pitfalls of Programming Rijndael in Assembly Language

When performance is the primary concern, programming in assembly languag e may seem to be the obvious choice, since coding in assembly provides control over the processor. However, it is very easy to use the wrong mix of instructions and the performance is often worse than code generated by a good optimizing compiler. Here are three approaches to computing the first four bytes of the first round of encryption (Lines 12-15 in the original code). Each approach shows how handwritten assembly can unintentionally result in slower performance.

Approach 1: Use XMM Registers to Save State Table

It is possible to keep the 16 bytes of the state table in a 128-bit XMM register. In fact, all the 'C' code shown below in Figure 6 can be done by only two MOV instructions and a PXOR instruction.

 ```*((w32*)temp[0]) = *((w32*)a) ^ *((w32*)rk[0][0]); *((w32*)temp[1]) = *((w32*)(a+4)) ^ *((w32*)rk[0][1]); *((w32*)temp[2]) = *((w32*)(a+8)) ^ *((w32*)rk[0][2]); *((w32*)temp[3]) = *((w32*)(a+12)) ^ *((w32*)rk[0][3]); ```

Figure 6: Calculation of state table

However, to extract the index, we will need to copy a byte from an XMM register to one of the general-purpose registers. There is no instruction available for that purpose. PEXTR.W can extract word (16-bit value) into a general-purpose register, but then it will require extra shift and mask operations. The overall cost of PEXTR.W and shift-mask leads to a non-optimal solution.

Approach 2: Store State Table in Memory

Another way to extract bytes to get the first four bytes of the first round of encrypted text is shown in Figure 7. In this example, the state table is kept in memory instead of registers and 'temp' is the pointer to the state table. Register ECX holds the pointer to the array of constant tables. Therefore, the address of the second table will be ECX+1024 and third table will be ECX+2048. MMX Register MM0 contains the final result.

 ```mov al, temp movd mm0, [ecx + eax * 4] mov bl, [temp + 5] pxor mm0, mmword ptr [ ecx + ebx * 4 + 1024] mov al, [temp + 10] pxor mm0, mmword ptr [ ecx + eax * 4 + 2048] mov bl, [temp + 15] pxor mm0, mmword ptr [ ecx + ebx * 4 + 3072] ```

Figure 7: Calculation of 4 bytes of encrypted text

This looks like a very simple way of doing table lookups and XORs. However, there are some performance issues with this implementation. One is the partial register stall, which one can avoid by using a MOVZX. Another is loads from odd addresses for e.g., (temp + 5) and (temp + 15). As discussed earlier, it is much faster to do a 32-bit load from a four-byte aligned address and then shift/mask the data as necessary, than to load from an odd address.

Approach 3: Resizing Constant Tables

Figure 8 displays a different approach that avoids unaligned accesses and partial register stalls. However, this will require changing our array of constant table of bytes to 16-bit words with high-order byte as zero.

 ```pextrw edi, mm1, 1 pxor mm5, QWORD PTR [ebp+eax*8] pextrw eax, mm0, 2 pxor mm5, QWORD PTR [ebp+edx*8] pextrw edx, mm3, 3 pxor mm7, QWORD PTR [ebp+ecx*8] pextrw ecx, mm1, 0 Pxor mm7, QWORD PTR [ebp+edi*8] ```

Figure 8: Implementation using MMX instructions for Intel® Pentium® III processor

This optimized code provided a 40% performance improvement over the original C code when measured on the Pentium III processor. However, the assembly code did not perform as well on the Pentium 4 processor because its PEXTR.W instruction consumes more clock cycles than on the Pentium III processor.

Conclusion

In this paper, the author has shown how to realize a 125% performance improvement over the original Rijndael 'C' encryption code shown in Figure 1 by:

• Taking advantage of compiler optimizations
• Restructuring the code to hide dependencies
• Creating a separate table for last round, and most importantly

The Intel C++ Compiler can be purchased through Intel® Software Development Products

Yamir Yunus is a Senior Software Engineer in Architecture Performance Engineering at Intel Corporation. He joined Intel in 1999 after four years at Honeywell, Inc. At Intel, he has worked closely with RSA, Inc. on performance optimization of cryptographic algorithms

His work includes optimization of the RSA public key algorithm, SHA-1, RC4 and Rijndael algorithm on the Intel® Itanium® processor. He is the recipient of an Intel Achievement Award for the optimization of security algorithms on Intel processors. He earned his BSE in Computer Systems Engineering and MCS in Computer Science from Arizona State University in Tempe.

Categories:
Tags:
For more complete information about compiler optimizations, see our Optimization Notice.

Yamir , impressive coding and optimization. Enjoy reading it.
I&#39;m using a MPC8248 that runs AES256 and I would like it to talk to an Intel based PC.
Between the MPC8248 devices the AES256 works great and I have no issue. However I have dificulies in trying to write the AES on the Intel PC that will communicate with the MPC8248 device.
Do you have any idea?

Thank you very much for that code.

I hope you can help me with my problem to optimize the SHA-256 for intel processors as they are very week compared to the even weeker via C7.

I am using the bitcoin original client along with an example CPU miner to perform the SHA-265 i will send the code if you have time .. thank you very much in advance

contact walidzohair@gmail.com as I might forget to check here again and again