Q&A: Speeding up a CRC32 Algorithm

Q&A: Speeding up a CRC32 Algorithm

Here is a question received by Intel Software Network Support, along with the response provided by our Application Engineers:

Q. I'm trying to build in some assembler code into a C++ code to speed up an CRC32 algorithm. I currently have the lookup table generated and Im trying to optimize the following function:

DWORD crc32(LPCSTR *s, unsigned int len, DWORD dwLastCRC)
DWORD dwCRC32 = dwLastCRC;
unsigned int i = 0;

for (i = 0; i < len; i ++)
dwCRC32 = crc32_tab[ (dwCRC32 ^ s[i]) & 0xff ] ^ (dwCRC32 >>

return -( 0xFFFFFFFFL ^ ( dwCRC32 - 1 ) ) ;

crc32_tab is just a lookup table defined as:
static DWORD crc32_tab[] = { ... };

the data type LPCSTR is an array of 256 DWORD values. Like the following:

static DWORD crc32_tab[] = {
0x000000000h, 0x077073096h, 0x0EE0E612Ch, 0x0990951BAh
0x0076DC419h, 0x0706AF48Fh, 0x0E963A535h, 0x09E6495A3h
0x00EDB8832h, 0x079DCB8A4h, 0x0E0D5E91Eh, 0x097D2D988h
0x009B64C2Bh, 0x07EB17CBDh, 0x0E7B82D07h, 0x090BF1D91h

LPCSTR = const char*
len is variable and means the length of the buffers in bytes.So, the function is:

DWORD crc32(const char *s, unsigned int len, DWORD dwLastCRC)

If it helps, this function calculates the CRC32 value of the string s of length len.

Could you help me out on an assembler version of the code thats in the for...loop?

A. We would recommend optimizing a little more than just the statement inside the for loop. This will allow you to set things up optimally for the loop. Without a way to test this,we can't guarantee 100% accuracy, but something like this should work:

font color="#669900" size="2">pushebx
moveax, DWORD PTR dwLastCRC// load initial parameter (dwLastCRC)
movedx, DWORD PTR s// load initial parameter pointer
xorecx, ecx// clear loop counter

// for (i = 0; i < len; i ++)
addecx, 1// increment loop counter
cmpecx, DWORD PTR len// check for ending condition
jgSHORT $L279// jump out if done
// dwCRC32 = crc32_tab[ (dwCRC32 ^ s[i]) & 0xff ] ^ (dwCRC32 >> 8)
movebx, DWORD PTR [edx + ecx*4]// load parameter value (s[i])
xorebx, eax// perform exclusive OR (dwCRC32 ^ s[i])
andebx, 0xff// clear upper 24 bits
shreax, 8// shift dwCRC32 value (divide by 256)
xoreax, DWORD PTR [crc32_tab + ebx*4]// exclusive OR with look up table
jmpSHORT $L278

This removes all loading and storing of intermediate values. By holding the intermediate values in the registers, you should be able to save quite a few cycles in the calculation. Good luck!

Lexi S.

IntelSoftware NetworkSupport


Contact us

5 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

A developer wrote to us recently about this thread, saying:

The provided assembler code is wrong. The assembler code treats s as an array of 32-bit values, whereas in the C code s is an array of bytes. The assembler code does not compute the correct CRC32 function.

Since we no longer have a way to contact our original source, does anyone have suggestions as to how this code might be rewritten?


Contact us

That is not the only error. Final subtraction, xor and negation are omitted as well.

__declspec(naked) DWORD crc32(LPCSTR *s, unsigned int len, DWORD dwLastCRC)
	__asm	{
		push	ebx
		mov	edx, DWORD PTR [esp + 8]		; edx = s
		mov	ecx, DWORD PTR [esp + 12]		; ecx = len
		mov	eax, DWORD PTR [esp + 16]		; eax = dwLastCRC
		add	edx, ecx				; edx = s + len
		neg	ecx					; ecx = -len
		jnl	crc_loop_end
		movzx	ebx, BYTE PTR [edx + ecx]		; ebx = s[i]
		xor	ebx, eax				; ebx = s[i] ^ dwCRC32
		and	ebx, 0FFh				; ebx = (s[i] ^ dwCRC32) & 0xFF
		shr	eax, 8					; eax = dwCRC32 >> 8
		xor	eax, DWORD PTR [crc32_tab + ebx * 4]	; eax = (dwCRC32 >> 8) ^ crc32_tab[(s[i] ^ dwCRC32) & 0xFF]
		add	ecx, 1
		js	short crc_loop
		mov	ecx, 0FFFFFFFFh
		add	eax, ecx				; eax = dwCRC32 - 1
		xor	eax, ecx				; eax = (dwCRC32 - 1) ^ 0xFFFFFFFF
		neg	eax					; eax = -((dwCRC32 - 1) ^ 0xFFFFFFFF)
		pop	ebx

This code should work correctly, but please bear in mind that it has been written in a hurry (meaning I didn't test it), and that just like the original C function does not perform parameter validation.

Just a minor thing, why not inc ecx instead of add ecx, 1?

Because Intel manual advises against use of INC/DEC. Reasons for that are twofold:

  1. On Pentium IV Northwood ADD/SUB take only 0.5 clock cycles while INC/DEC take 1 clock cycle. On Pentium IV Prescott, Core and Core 2 both ADD/SUB and INC/DEC take 1 clock cycle so there is no difference performance-wise.
  2. INC/DEC do not update CF (carry flag) and if not used carefully can cause partial flags stall.

Here is a code example of partial flags stall:

	cmp	eax, ebx	; writes all flags
	inc	ecx		; updates all flags except CF
	jbe	label		; must wait for both instructions to retire

So, while it may be perfectly OK to use INC in this particular case, there is no benefit of using it on architectures newer than Pentium III, and there is a also risk of penalty if not used correctly. Hope that answers your question?

Leave a Comment

Please sign in to add a comment. Not a member? Join today