| Last Modified On : | August 19, 2009 4:49 PM PDT |
Rate |
|
This white paper provides details on Intel’s carry-less multiplication instruction, PCLMULQDQ, that computes the carry-less multiplication of two 64-bit operands. The paper explains algorithms for using this instruction for computing the Galois Hash, which is the underlying computation of the Galois Counter Mode (GCM).
An important usage model is AES-GCM, where in this case the AES encryption/decryption part could be efficiently implemented by the AES instructions that are also being introduced into the ISA.
Usage Models of Carry-less Multiplication
Carry-less multiplication is the mathematical operation of computing the (carry-less) product of two operands without the generation or propagation of carry values. It is relatively a time consuming operation when implemented with the current ISA of the IA architecture. For example, software implementation of carry-less multiplication that uses the best-known method (found at the OpenSSL source code distribution, www.openssl.org) computes a 64 by 64 bit carry-less product in about 100 cycles.
Carry-less multiplication is an essential processing component of several cryptographic systems and standards. Hence, accelerating carry-less multiplication can significantly contribute to achieving high speed secure computing and communication. Carry-less multiplication is especially important for implementing the Galois Counter Mode (GCM), which is a recently defined mode of operation of block ciphers [4, 6, 7, 8, 14, 15]. The GCM mode was endorsed by the US government in April 2006 and is used together with AES, which is part of the NSA Suite B. It is also defined in IEEE 802.1ae standard, where it is recommended for forwarding rates higher than 10 Gbps. Other usage models of GCM include IPsec (IPsec RFC 4106), the storage standard P1619 and security protocols over fiber channel (ISO-T11 standard).
GCM requires the carry-less multiplication of 128-bit operands, producing a 256-bit product. This is the first step of computing a ‘Galois hash’, which is part of the GCM mode. The PCLMUQDQ instruction computes the 128-bit product of two 64-bit operands. It can be used by software as a building block for generating the 256-bit result required for GCM.
The other step in GCM is reduction modulo of a pentanomial x128 + x7 + x2 + x + 1. In this document, we describe a new efficient algorithm for performing this reduction in the Intel SSE domain (using PSRLD, PSLLD PSHUFD instructions). The combination of the PCLMUQDQ instruction, together with this algorithm speeds up the GCM mode.
Carry-less multiplication is also in the core computation in Elliptic Curve Cryptography (ECC) over binary fields [2] and in Cyclic Redundancy Check (CRC) computations. The carry-less multiplication instruction PCLMUQDQ can speedup the computation of CRC with polynomials other than the iSCSI polynomial, for which there is already a dedicated instruction in the ISA (namely, CRC32 that is part of the Intel SSE4 set).
Carry-Less Multiplication - Definition
Carry-less multiplication is the operation of mult iplying two operands without generating and propagating carries. It is formally defined as follows. Let the two operands A, B, be defined by the following n-bit array notation
|
|
(1) |
and let the carry-less multiplication result be the following 2n bit array:
|
|
(2) |
The bits of the output C are defined as the following logic functions of the bits of the inputs A and B as follows:
|
|
(3) |
for 0 = i = n-1, and
|
|
(4) |
for n = i = 2n-1 (note that c2n-1=0).
One can see that the logic functions of Equations (3) and (4) are somehow analogous to integer multiplication in the following sense. In integer multiplication, the first operand is shifted as many times as the positions of bits equal to “1” in the second operand. The integer multiplication is obtained by adding the shifted versions of the first operand with each other. The same procedure is followed in carry-less multiplication, but the ``additions’’ do not generate or propagate carry, and are equivalent to the exclusive OR (XOR) logical operation.
Hereafter, carry-less multiplication is denoted by the symbol “·”.
Carry-less Multiplication and Galois Field Multiplication
Typically, carry-less multiplication is used as the first step of multiplications in finite fields (aka Galois Fields) of characteristic 2 [10, 11, 12, 13].
A Galois Field is a finite set of elements where the operations addition ‘+’ and multiplication ‘·’ are defined. The set is closed under these operations where they satisfy the following properties: associativity, commutativity, existence of a neutral element (for addition it is called “0” and for multiplication it is called “1”), existence of additive inverse, existence of multiplicative inverse for each element except for zero, and a distributive law for multiplication over addition (i.e., a·(b+c) = a·b+a·c).
The number of elements of a finite field must be a power of some prime p, where in such case the field is denoted by GF(pk). A binary field (field with characteristic 2) is the case where p=2, and is denoted by GF (2k).
In general, the elements in GF(pk) can be viewed as polynomials of degree k where the operations are defined using some irreducible polynomial of degree k: addition of two elements is defined as polynomial addition (adding the corresponding coefficients modulo p). Multiplication is defined by polynomial multiplication, which is subsequently reduced modulo the irreducible polynomial that defines the finite field.
Typically, there are multiple different irreducible polynomials of degree k. Using them to define the finite field results in isomorphic representations of the field.
For binary fields GF (2n), one can view the elements as n-bit strings, where each bit represents the corresponding coefficient of the polynomial. In such cases, addition is equivalent to the bitwise XOR of the two strings. Multiplication consists of two steps. The first step is carry-less multiplication of the two operands. The second step is the reduction of this carry-less product modulo the polynomial that defines that field.
For example, consider the field GF(24) (i.e., n=4) defined by the reduction polynomial x4+x+1. Let A = [1110] and B = [1011] be two elements in that field. Then, their product in this field (i.e., their Galois Field multiplication) is [1000]. This result is obtained as follows: computing the carry-less multiplication C = A · B = [01100010], followed by reducing C modulo x4+x+1. The reduction result is obtained by finding that [01100010] = [0110] · [10011] XOR [1000].
PCLMULQDQ instruction performs carry-less multiplication of two 64-bit quadwords which are selected from the first and the second operands according to the immediate byte value.
|
Instruction format: PCLMULQDQ xmm1, xmm2/m128, imm8 |
|
Description: Carry-less multiplication of one quadword (8-byte) of xmm1 by one quadword (8-byte) of xmm2/m128, returning double quadwords (16 bytes). The immediate byte is used for determining which quadwords of xmm1 and xmm2/m128 should be used. Opcode: 66 0f 3a 44 The presence of PCLMULQDQ is indicated by the CPUID leaf 1 ECX[1]. Operating systems that support the handling of Intel SSE state will also support applications that use AES extensions and the PCLMULQDQ instruction. This is the same requirement for Intel SSE2, Intel SSE3, Intel SSSE3, and Intel SSE4. |
The immediate byte values are used as follows.
|
imm[7:0] |
Operation |
|
0x00 |
xmm2/m128[63:0] • xmm1[63:0] |
|
0x01 |
xmm2/m128[63:0] • xmm1[127:64] |
|
0x10 |
xmm2/m128[127:64] • xmm1[63:0] |
|
0x11 |
xmm2/m128[127:64] • xmm1[127:64] |
NOTES:
The pseudo code for defining the operation is as follow.
IF imm8[0] == 0 THEN
Temp1 = xmm1[63:0]
ELSE
Temp1 = xmm1[127:64]
ENDIF
IF imm8[1] == 0 THEN
Temp2 = xmm2/m128[63:0]
ELSE
Temp2 = xmm2/m128[127:64]
ENDIF
FOR i = 0 TO 63
TempB [i] := (Temp1[0] AND Temp2[i]);
FOR j = 1 TO i, 1
TempB [i] := TempB [i] XOR (Temp1[j] AND Temp2[i -j])
NEXT j
Dest[i] := TempB[i];
NEXT i
FOR i = 64 TO 126, 1
TempB [i] := (Temp1[ i-63] AND Temp2[63]);
FOR j = i-62 TO 63, 1
TempB [i] := TempB [i] XOR (Temp1[j] AND Temp2[i -j])
NEXT j
Dest[i] := TempB [i];
NEXT i
Dest[127] := 0;
Identifying PCLMULQDQ Support by the Processor
Before an application attempts to use the PCLMULQDQ instruction it should check for its availability, which is indicated if CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1.
Operating systems that support handling Intel® SSE state support also applications that use PCLMULQDQ. This is the same requirement for Intel SSE2, Intel SSE3, Intel SSSE3, and Intel SSE4.
This section described the Galois Counter Mode and its current table lookup based software implementation (see for example [5] and www.openssl.org).
The Definition of GCM
The Galois Counter Mode is illustrated in Figure 1. This mode produces a message digest, called “Galois Hash’”, from the encrypted data. This Galois Hash is used for high performance message authentication. In each step of the mode, the previous Galois Hash value is XOR-ed with the current ciphertext block. The result is then multiplied in GF(2128) with a hash key value. GCM uses GF(2128) defined by the irreducible polynomial g = g(x) = x128 + x7 + x2 + x + 1.
Figure 1. The Galois Counter Mode
The multiplication in GF(2128) involves carry-less multiplication of the two 128-bit operands, to generate a 256-bit result, followed by and reduction modulo the irreducible polynomial g.
Current Software Implementation of GCM
Current software implementations of the GCM mode use a table lookup based algorithm [5], shown in Figure 2 (for AES-GCM). This algorithm consists of two phases:
Preprocessing phase: generation of 16 lookup tables. Each table has 256 128-bit entries where entry j of table Ti stores the value (j * hash key * 28i ) mod g for j =0, 1, …255, and i = 0, 1, …, 15
Run time phase: The algorithm takes the next ciphertext block and XOR-s it with the current value of the Galois Hash. The result (dynamic value) is multiplied with the Hash Key (fixed value) in GF(2128). The GF(2128) multiplication is carried out as follows: the value of the result is segmented into 16 8-bit slices. Subsequently, 16 table lookups are performed, using the slices, for indexing the tables. The results from the table lookups are XOR-ed with each other.
This algorithm performs operations on a per-byte basis: each 128-bit block involve 16 table lookups and 16 128-bit XOR operations.
This algorithm is not very efficient in software due to the cost of table lookups. It also suffers from potential side channel leakage based on memory access patterns (the accessed cache lines for the table lookup are data dependent).
Figure 2. Table Lookup-Based Implementation of AES-GCM
// Code snippet illustrating AES-GCM (AES-128) |
This section describes several new methods for computing the Galois Counter Mode. This method is an extension of the algorithm described in [6, 7].
The method described here uses the 64-bit PCLMULQDQ instruction in place of the table lookup method, for computing 128-bit-by-128-bit carry-less products, and an efficient novel method for reducing the result modulo the irreducible polynomial g of the finite field GF(2128). The proposed algorithms are carried out in two steps: carry-less multiplication and reduction modulo g = x128 + x7 + x2 + x + 1.
Generating Carry-less Multiplication of 128-bit Operands Using PCLMULQDQ
Denote the input operands by [A1:A0] and [B1:B0], where A0, A1, B0 and B1 are 64 bit long each. The proposed algorithm is the following:
The following algorithm can be viewed as “one iteration carry-less schoolbook’” multiplication.
Algorithm 1
Step 1: multiply carry-less the following operands: A0 with B0, A1 with B1, A0 with B1, and A1 with B0. Let the results of the above four multiplications be:
Step 2: construct the 256-bit output of the multiplication [A1:A0] • [B1:B0] as follows:
|
|
(5) |
An alternative technique trades-off one multiplication for additional XOR operations. It can be viewed as “one iteration carry-less Karatsuba” multiplication [7, 9].
Algorithm 2
Step 1: multiply carry-less the following operands: A1 with B1, A0 with B0, and A0 ? A1 with B0 ? B1. Let the results of the above three multiplications be: [C1:C0], [D1:D0] and [E1:E0], respectively.
Step 2: construct the 256-bit output of the multiplication [A1:A0] * [B1:B0] as follows:
|
|
(6) |
Efficient Reduction Algorithm
To reduce a 256-bit carry-less product modulo a polynomial g of degree 128, we first split it into two 128-bit halves. The least significant half is simply XOR-ed with the final remainder (since the degree of g is 128).
For the most significant part, we develop an algorithm that realizes division via two multiplications. This algorithm can be seen as an extension of the Barrett reduction algorithm [1] to modulo-2 arithmetic, or as an extension of the Feldmeier CRC generation algorithm [3] to dividends and divisors of arbitrary size.
Since we do not need to take into account the least significant half of the input (see above), we investigate the efficient generation of a remainder p(x) defined as follows:
|
|
(7) |
Where,
For the polynomials p(x), c(x), and g(x) we write:
|
|
(8) |
Hereafter, we use the notation Lu(v) to denote the coefficients of the u least significant terms of the polynomial v and Mu(v) to denote the coefficients of its u most significant terms. The polynomial p(x) can be expressed as:
|
|
(9) |
where q(x) is a polynomial of degree s-1 equal to the quotient from the division of c(x)?xt with g. The intuition behind equation (9) is that the t least significant terms of the dividend c(x)?xt equal zero. Further, the dividend c(x)?xt can be expressed as the sum of the polynomials g?q and p:
|
|
(10) |
Where operator ‘+’ means XOR (‘?’). From equation (10) one can expect that the t least significant terms of the polynomial g?q are equal to the terms of the polynomial p. Only if these terms are equal to each other, the result of the XOR operation g?q ? p is zero for its t least significant terms. Hence:
|
|
(11) |
Now we define
|
|
(12) |
The polynomial g* represents the t least significant terms of the polynomial g. Obviously,
|
|
(13) |
However, the t least significant terms of the polynomial q?gt?xt are zero. Therefore,
|
|
(14) |
From Equation (14) it follows that in order to compute the remainder p we need to know the value of the quotient q. The quotient can be calculated in a similar manner as in the Barrett reduction algorithm:
|
|
(15) |
Let
|
|
(16) |
where q+ is an s-degree polynomial equal to the quotient from the division of xt+s with g and p+ is the remainder from this division. The degree of the polynomial p+ is t-1. From equations (15) and (16) we get
|
|
(17) |
and
|
|
(18) |
One can see that the polynomials c?g?q+ and g?q?xs are of degree t+2?s-1 the polynomial c?p+ is of degree t+s-2, and the polynomial p?xs is of degree t+s-1. As a result the s most significant terms of the polynomials in the left and right hand side of equation (18) are not affected by the polynomials c?p+ and p?xs. Hence,
|
|
(19) |
Next, we observe that the s most significant terms of the polynomial c?g?q+ equal to the s most significant terms of the polynomial g?Ms(c?q+)?xs. The polynomial Ms(c?q+)?xs results from c?q+ by replacing the s least significant terms of this polynomial with zeros. The intuition behind this observation is the following: the s most significant terms of the polynomial c?g?q+ are calculated by adding the s most significant terms of the polynomial c?q+ with each other in as many offset positions as defined by the terms of the polynomial g. Thus, the s most significant terms of c?g?q+ do not depend on the s least significant terms of c?q+, and consequently,
|
|
(20) |
Equation (20) is satisfied for q given by
|
|
(21) |
Since there is a unique quotient q satisfying equation (10) one can show that there is a unique quotient q satisfying equation (20). As a result this quotient q must be equal to M s(c(x)?q+(x)).
It follows that the polynomial p is found by
|
|
(22) |
Equation (22) indicates the algorithm for computing the polynomial p.
Algorithm 3:
Preprocessing: For the given irreducible polynomial g the polynomials g* and q+ are computed first. The polynomial g* is of degree t-1 consisting of the t least significant terms of g, whereas the polynomial q+ is of degree s and is equal to the quotient of the division of xt+s with the polynomial g.
Calculation of the remainder polynomial:
Step 1: The input c is multiplied with q+. The result is a polynomial of degree 2s-1.
Step 2: The s most significant terms of the polynomial resulting from step 1 are multiplied with g*. The result is a polynomial of degree t+s-2.
Step 3: The algorithm returns the t least significant terms of the polynomial resulting from step 2. This is the desired remainder.
Application of the method for reduction modulo x128 + x7 + x2 + x + 1
One can see that the quotient from the division of x256 with g is g itself. The polynomial g = g(x) = x128 + x7 + x2 + x + 1 contains only 5 non-zero coefficients (therefore also called “pentanomial”). This polynomial can be represented as the bit sequence [1:<120 zeros>:10000111]. Multiplying this carry-less with a 128-bit value and keeping the 128 most significant bit can be obtained by: (i) Shifting the 64 most significant bits of the input by 63, 62 and 57-bit positions to the right. (ii) XOR-ing these shifted copies with the 64 least significant bits of the input. Next, we carry-less multiply this 128-bit result with g, and keep the 128 least significant bits. This can be done by: (i) shifting the 128-bit input by 1, 2 and 7 positions to the left. (ii) XOR-ing the results. Algorithm 4 provides a detailed description of the reduction algorithm.
Algorithm 4
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: shift X3 by 63, 62 and 57-bit positions to the rig ht. Compute the following numbers:
|
|
(23) |
Step 2: We XOR A, B, and C with X2. We compute a number D as follows:
|
|
(24) |
Step 3: shift [X3:D] by 1, 2 and 7 bit positions to the left. Compute the following numbers:
|
|
(25) |
Step 4: XOR [E1:E0], [F1:F0], and [G1:G0] with each other and [X3:D]. Compute a number [H1:H0] as follows:
|
|
(26) |
Return [X1?H1: X0?H0].
Bit Reflection Peculiarity of GCM
Special peculiarity should be taken into account when implementing the GCM mode, because the standard specifies that the bits inside their 128-bit double-quad-words are reflected. That is, the bit corresponding to the least significant coefficient of the polynomial representation of the entities which are multiplied is bit number 127 rather than bit number 0. This also implies that the order of bits in the reduction polynomial is [11100001:<120 zeros>:1] as opposed to [1:<120 zeros>:10000111]. Note that this property is not merely the difference between Little Endian and Big Endian notations.
To handle this peculiarity, we point out the following fundamental property of carry-less multiplication, namely
|
|
(27) |
Therefore, the PCLMULQDQ instruction can be used for performing multiplication in the finite field GF(2128) seamlessly, regardless on the representation of the input and the output operands.
Algorithm 5 outlines the required modification of the reduction algorithm that accommodates bit reflection of the inputs and the o utputs in the GCM mode.
Algorithm 5
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: compute
|
|
(28) |
Step 2: shift X0 by 63, 62 and 57 bit positions to the left. We compute the following numbers:
|
|
(29) |
Step 2: XOR A, B, and C with X1. Compute a number D as follows:
|
|
(30) |
Step 3: shift [D:X0] by 1, 2 and 7 bit positions to the right. Compute the following numbers:
|
|
(31) |
Step 4: XOR [E1:E0], [F1:F0], and [G1:G0] with each other and [D:X0]. Compute a number [H1:H0] as follows:
|
|
(32) |
Return [X3?H1: X2?H0].
Implementation Using Linear Folding
Linear folding is the mathematical operation of replacing a number of most significant bits of a quantity with the product of these bits times a constant during a reduction. Folding helps with speeding up reduction because it decreases the length of the quantity which is being reduced at the expense of the number of multiplications needed. In what follows we assume that all operations that take place are carry-less, i.e., in GF(2) arithmetic.
Suppose that the quantity to be reduced can be expressed as the sum of two polynomials:
|
|
(33) |
Where,
c(x) is a polynomial of degree s-1 with coefficients in GF(2), representing the most significant bits of the quantity to be reduced
t-1 is the length of the degree of d(x)
g(x) is the irreducible polynomial defining the field (for GCM, g = g(x) = x128 + x7 + x2 + x + 1).
For the polynomial p(x) we write:
|
|
(34) |
The quantity xt mod g(x) depends only on the reduction polynomial. Hence it can be treated as a constant. Equation (34) indicates a method for performing reduction which is called linear folding and works as follows:
Step 1: The polynomial c(x) is multiplied carry-less with the constant xt mod g(x).
Step 2: The result of the multiplication is XOR-ed with d(x).
Step 3: The reduction proceeds using any known technique.
The remainder from the division of xt with g(x) = x128 + x7 + x2 + x + 1 is the bit sequence <10000111: t-128 zeros>. Since t+s = 255 one can see that the length of the carry-less multiplication of c(x) with xt mod g(x) is 134 and is independent of the choice of t and s.
In GCM all operations are bit reflected, so folding needs to be implemented in a bit reflected manner too. The designer of an algorithm based on linear folding has several degrees of freedom that depend on the choice of t and s. If the length of the folding quantity c(x) spans a single 64-bit word then the cost of multiplication with xt mod g(x) can be potentially small equal to 1 or 2 64-bit carry-less multiplication operations. On the other hand the cost of further reducing the given polynomial after folding may be higher.
Another issue related to the design of a folding algorithm has to do with the fact that the reflected version of xt mod g(x) may span one or multiple words. Theoretically one can multiply c(x) not with xt mod g(x) but with an appropriately shifted version of the bit sequence <11100001> so that the second operand of the multiplication spans exactly one 64 bit word. The result can then be corrected with further shift operations. There is one case for which the second operand of the multiplication spans one word and no further shifts are required: this is for t = 193. For this case, the subsequent reduction steps after folding requiring reducing a 193 bit quantity.
In what follows we describe four representative algorithms two for s = 120 and two for s = 64 bits. In each pair the second operand of the folding multiplication spans 1 or 2 words.
Algorithm 6: Folding length s = 120, two multiplications version
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: Compute [C1:C0] =[X1:X0] AND 0x00ffffffffffffffffffffffffffff
Step 2: Compute [H2:H1:H0] =[C1:C0] · 0xe100000000000000
Step 3: Shift [ H2:H1:H0] by one bit position to the left
Step 4: XOR [H2:H1:H0:0] with [X3:X2:X1:X0] and replace [X3:X2:X1:X0]
Step 5: Replace X1 with the result of logical AND between X1 and 0xff00000000000000
Step 6: Compute A = X1 >> 63, B = X1 << 1, C = (X1 & 0x7f00000000000000) >> 1, D = (X1 & 0x7f00000000000000) >> 6, E = X1 & 0x7f00000000000000.
Step 7: Shift [X3:X2:A] by one bit position to the left
Return [X3? B? C? D? E:X2].
An assembly implementation of this algorithm is shown in Figure 3.
Algorithm 7: Folding length s = 120, four multiplications version
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: Compute [C1:C0] =[X1:X0] AND 0x00ffffffffffffffffffffffffffff
Step 2: Compute [H2:H1:H0] =[C1:C0] · 0x1c200000000000000
Step 3: XOR [H2:H1:H0:0] with [X3:X2:X1:X0] and replace [X3:X2:X1:X0]
Step 4: Replace X1 with the result of logical AND between X1 and 0xff00000000000000
Step 5: Compute A = X1 >> 63, B = X1 << 1, C = (X1 & 0x7f00000000000000) >> 1, D = (X1 & 0x7f00000000000000) >> 6, E = X1 & 0x7f00000000000000.
Step 6: Shift [X3:X2:A] by one bit position to the left
Return [X3? B? C? D? E:X2].
Algorithm 8: Folding length s = 64, single multiplication version
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: Compute [H1:H0] =X0 · 0xe100000000000000
Step 2: Shift [H1:H0] by one bit position to the left
Step 3: XOR [0:H1:H0:0] with [X3:X2:X1:X0] and replace [X3:X2:X1:X0]
Step 4: Compute A = X1 >> 63, B = X1 << 1, C1 = (X1 & 0x7f00000000000000) >> 1, C0 = X1<<63, D1 = (X1 & 0x7f00000000000000) >> 6, D0 = X1<<58, E = X1 & 0x7f00000000000000,
Step 5: Shift [X3:X2:A] by one bit position to the left
Return [X3? B? C1 ? D1? E:X2? D0? C0].
Algorithm 9: Folding length s = 64, two multiplications version
Denote the input operand by [X3:X2:X1:X0] where X3, X2, X1 and X0 are 64 bit long each.
Step 1: Compute [H2:H1:H0] =X0 · 0x1c200000000000000
Step 2: XOR [H2:H1:H0:0] with [X3:X2:X1:X0] and replace [X3:X2:X1:X0]
Step 3: Compute A = X1 >> 63, B = X1 << 1, C1 = (X1 & 0x7f00000000000000) >> 1, C0 = X1<<63, D1 = (X1 & 0x7f00000000000000) >> 6, D0 = X1<<58, E = X1 & 0x7f00000000000000,
Step 4: Shift [X3:X2:A] by one bit position to the left
Return [X3? B? C1 ? D1? E:X2? D0? C0].
The following assembler code snippets illustrate the implementation of the Galois Field multiplication portion of GCM using the algorithms described in Efficient Algorithms for Computing GCM.
These examples use the Intel SSE version of PCLMULQDQ.
Figure 3. Code Sample for Performing GCM Using the Schoolbook Method
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; |
Figure 4. Code Sample for Performing GCM Using the Karatsuba Method
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; |
Figure 5. Code Sample for Performing GCM Using Linear Folding
|
We present Intel’s PCLMULQDQ, a new Single Instruction Multiple Data (SIMD) instruction that will be introduced in the next generation of Intel® processor, as of 2009. PCLMULQDQ performs carry-less multiplication of two 64-bit operands. This paper provides information on using this instruction for computing the Galois Counter Mode. Several algorithms are described that take into account the fact that the irreducible polynomial specified by the binary field of GCM is sparse. Carry-less multiplication is performed using the instruction PCLMULQDQ as well as the schoolbook and Karatsuba algorithms. Reduction is performed using a small number of shift and XOR operations as well as linear folding. The bit reflection peculiarity of GCM is taken into account.
This appendix provides a few test vectors. Little Endian notation is used unless specified otherwise.
PCLMULQDQ Test Vectors
All 4 test vectors have the same xmm1 and xmm2 input, and vary only in the value of the immediate byte. This way, the result of all of the four combinations, High/Low from xmm1 and xmm2, are generates. Only bits [4] and [0] of the immediate byte are used for selecting the multiplicands.
Input to the PCLMULQDQ instruction
Step 1: xmm1:= 7b5b54657374566563746f725d53475d
Step 2: xmm2:= 48692853686179295b477565726f6e5d
The corresponding 64-bit Low/High halves are
Step 1: xmm1High:= 7b5b546573745665
Step 2: xmm1Low:= 63746f725d53475d
Step 3: xmm2High:= 4869285368617929
Step 4: xmm2Low:= 5b477565726f6e5d
Test Vector 1:
Step 1: immediate = 0x00; carry-less multiply xmm2Low by xmm1Low
Step 2: PCLMULQDQ result:= 1d4d84c85c3440c0929633d5d36f0451
Test Vector 2:
Step 1: immediate = 0x10; carry-less multiply xmm2High by xmm1Low
Step 2: PCLMULQDQ result:= 1bd17c8d556ab5a17fa540ac2a281315
Test Vector 3:
Step 1: immediate = 0x01; carry-less multiply xmm2Low by xmm1High
Step 2: PCLMULQDQ result:= 1a2bf6db3a30862fbabf262df4b7d5c9
Test Vector 4:
Step 1: immediate = 0x11; carry-less multiply xmm2High by xmm1High
Step 2: PCLMULQDQ result:= 1d1e1f2c592e7c45d66ee03e410fd4ed
Multiplication in GF(2128) Defined by the Polynomial x128 + x7 + x2 + x + 1
Input to the PCLMULQDQ instruction
Step 1: xmm1 := 7b5b54657374566563746f725d53475d
Step 2: xmm2 := 48692853686179295b477565726f6e5d
Step 3: Product := 40229a09a5ed12e7e4e10da323506d2
Step 4: (Product in GF(2128) defined by the reduction polynomial g = g(x) = x128 + x7 + x2 + x + 1)
GCM Test Vector
The following test vector takes into account the special peculiarity of GCM which specifies that the bits of the input operands (data and hash keys) and the output operand should be reflected
Step 1: Data: 0x952b2a56a5604ac0b32b6656a05b40b6
Step 2: Hash Key: 0xdfa6bf4ded81db03ffcaff95f830f061
Step 3: Multiplication Result: 0xda53eb0ad2c55bb64fc4802cc3feda60
Shay Gueron is a Security Architect in the Mobility Group at Intel Corporation, working at the Israel Design Center. His interests include applied security, cryptography, and algorithms. He holds a Ph.D. in applied mathematics from Technion—Israel Institute of Technology, and is also an Associate Professor at the Department of Mathematics, Faculty of Science and Science Education, at the University of Haifa, Israel.
Michael E. Kounavis is a Senior Research Scientist at Intel Corporation. He joined Intel in 2003. Since then he has worked on developing novel algorithms and solutions for line rate packet processing, security and data integrity. His current research focuses on accelerating cryptographic algorithms and protocols. Prior to joining Intel, he worked on programming network architectures at Columbia University. He has published over 30 papers in leading technical journals and conferences and he is a regular program committee member of the IEEE ISCC, SPECTS and IWDDS conferences.
| November 1, 2009 3:34 PM PST
heinz
|
Thanks to publish this paper, really great stuff, I like it, especially the part of linear folding. I linked it to our Lunatics developer forum http://lunatics.kwsn.net/5-windows/ffts-and-other-important-.....l#msg17325 in the closed area as « Reply #9 on: 03 May 2009, 04:26:54 pm » I voted: ***** |

English | 中文 | Русский | Français
Shay Gueron (Intel)
| ||
| Michael Kounavis (Intel) |
John