Intel® Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode - Rev 2.02

Introduction

The Intel® PCLMULQDQ instruction is a new instruction available beginning with the 2010 Intel® Core™ processor family, based on the 32nm Intel® microarchitecture codenamed Westmere. The PCLMULQDQ instruction performs carry-less multiplication of two 64-bit operands.

This paper provides information on the instruction and its usage for computing the Galois Hash. It also provides code examples that use PCLMULQDQ together with the new AES instructions (introduced alongside PCLMULQDQ) for an efficient implementation of AES in Galois Counter Mode (AES-GCM).

This version of the paper also provides high performance code examples for AES-GCM, and discloses, for the first time, their measured performance numbers.

[Revisions history: Rev. 1.0 in 4/2008; Rev. 2.0 in 5/2009; Rev. 2.01 in 9/2012; Rev. 2.02 in 4/2014]

Download Article: clmul-wp-rev-2.02-2014-04-20.pdf (970.71 KB)

Comments
brainhub's picture

Hello Intel.

I am disappointed with the performance numbers I am getting for the PCLMUL instruction.

I compared the performance of 3 independent PCLMUL instructions with 10 dependent AESENC instructions and the 10 AESENC instructions are 1.42 times faster than the 3 PCLMULs.

These 10 AESENC instructions are exactly an AES-128 encryption (of a 16-byte block), while at least 3 PCLMUL instructions are needed to perform the GF(2) 128x128-->256 polynomial multiplication (using the Karatsuba algorithm) before the modular reduction. The actual GF(2^128) operation will be slower still, since my timing omits the reduction step.

One can observe from the above that if we consider two instructions that operate on 16 bytes: one AESENC and one PCLMUL, the PCLMUL is over 10 times slower. This turned out to be a big surprise for me, as I was expecting them to be roughly equivalent.

As a result of this poor performance, the usefulness of the PCLMUL instruction is limited. It seems that the GF(2^128) multiplication operation cannot be made faster than AES-NI encryption on Intel processors, something I was counting on.

Is this correct? Are there plans to bring the performance of the PCLMUL in line with AESENC?

Thank you.

Code with the pclmul instruction (I added two XORs for simplicity, but they are immaterial here; more work would be needed to accomplish the full GF(2^128) operation anyway):
0x0000000000428da0 <+0>: movdqa %xmm0,%xmm3
0x0000000000428da4 <+4>: movdqa %xmm0,%xmm2
0x0000000000428da8 <+8>: pclmullqhqdq %xmm1,%xmm0
0x0000000000428dae <+14>: pclmullqlqdq %xmm1,%xmm3
0x0000000000428db4 <+20>: pclmulhqlqdq %xmm1,%xmm2
0x0000000000428dbb <+27>: pxor %xmm3,%xmm2
0x0000000000428dbf <+31>: pxor %xmm2,%xmm0

AESNI encryption:
0x0000000000428eb0 <+0>: pxor %xmm1,%xmm0
0x0000000000428eb4 <+4>: aesenc %xmm1,%xmm0
0x0000000000428eb9 <+9>: aesenc %xmm1,%xmm0
0x0000000000428ebe <+14>: aesenc %xmm1,%xmm0
0x0000000000428ec3 <+19>: aesenc %xmm1,%xmm0
0x0000000000428ec8 <+24>: aesenc %xmm1,%xmm0
0x0000000000428ecd <+29>: aesenc %xmm1,%xmm0
0x0000000000428ed2 <+34>: aesenc %xmm1,%xmm0
0x0000000000428ed7 <+39>: aesenc %xmm1,%xmm0
0x0000000000428edc <+44>: aesenc %xmm1,%xmm0
0x0000000000428ee1 <+49>: aesenclast %xmm1,%xmm0

PCLMUL code achieves 1600 MB/sec.
AESNI code achieves 2857.14 MB/sec.
Both are run on the same 16-byte block, processing 8Gb of data for at least 4 seconds. All other factors are identical between the two tests. Please don't be sidetracked by the usefulness of these measurements in absolute terms; their purpose is primarily to expose the relative performance difference.

This is on 64-bit Linux. Here is the CPU information:
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
microcode : 0x25
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6784.20
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

montyw47's picture

Too bad it isn't available across the full 2nd gen. processor line, only in the most expensive CPUs!

anonymous's picture

This addition has been awaited for a very long time, not only for cryptography but also for many other applications using GF computations, such as convolutional codes, turbo codes, CRC, and others. Several years ago we suggested the addition of this instruction to the ARM ISA, but ARM Ltd. did not find it worth the effort. We then made a custom processor containing this instruction and could reach significant improvements in many applications in which it can be used (most importantly, computing CRC codes). Another application for this instruction is fast-forwarding shift registers (such as Gold code generators).

anonymous's picture

Could I get the whole AVX-instruction set?

Cryptographer's picture

Do you have a subroutine emulating _mm_clmulepi64_si128() to develop and test gfmul/ghash on an older processor?
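One way to answer this question: a software stand-in for `_mm_clmulepi64_si128()` can be written as a bitwise shift-and-XOR loop. The sketch below is my own (the `xmm_t` struct and function names are illustrative, not Intel's); it models the `__m128i` as a pair of 64-bit halves and follows the documented imm8 convention, where bit 0 selects the low or high qword of the first operand and bit 4 selects the qword of the second. It is meant only for functional testing of gfmul/ghash logic on processors without PCLMULQDQ, not for performance.

```c
#include <stdint.h>

/* Software model of an XMM register as a pair of 64-bit halves.
   (Illustrative stand-in for __m128i; names are not Intel's.) */
typedef struct { uint64_t lo, hi; } xmm_t;

/* Carry-less multiply of two 64-bit values into a 128-bit product:
   each set bit of b XORs in a shifted copy of a, with no carries. */
static xmm_t clmul64(uint64_t a, uint64_t b)
{
    xmm_t r = { 0, 0 };
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i)                      /* avoid shifting by 64 (UB in C) */
                r.hi ^= a >> (64 - i);
        }
    return r;
}

/* Emulates _mm_clmulepi64_si128(a, b, imm8): imm8 bit 0 selects the
   low/high qword of a; imm8 bit 4 selects the low/high qword of b. */
static xmm_t clmulepi64_emul(xmm_t a, xmm_t b, int imm8)
{
    uint64_t x = (imm8 & 0x01) ? a.hi : a.lo;
    uint64_t y = (imm8 & 0x10) ? b.hi : b.lo;
    return clmul64(x, y);
}
```

For example, `clmulepi64_emul(a, b, 0x00)` multiplies the two low qwords and `0x11` the two high qwords, so the three products the comment above computes with pclmul would map to immediates 0x00, 0x01, and 0x10. Results can then be cross-checked against real PCLMULQDQ output once hardware is available.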
