Array Clipping

by Igor M. Levicki, Software Engineer


Introduction

Are you an advanced programmer into digital signal processing? If so, and you use the Intel® C/C++ Compiler 8.0 on a regular basis, you know that it can vectorize almost anything you throw at it and squeeze maximum performance from your code. Sometimes however, you are not getting the results you expect. Consider this example:

void ClipArray(short *src, DWORD n, short lo, short hi)

{

for (int i = 0; i < n; i++) {

if (src[i] < lo) {

src[i] = lo;

} else if (src[i] > hi) {

src[i] = hi;

}

}

}

 

 

void ClipArray(unsigned short *src, DWORD n, unsigned short lo, unsigned short hi)

{

for (int i = 0; i < n; i++) {

if (src[i] < lo) {

src[i] = lo;

} else if (src[i] > hi) {

src[i] = hi;

}

}

}

 

 

.B2.12:

movdqa	xmm4, xmm2

movdqa	xmm5, xmm1

movdqa	xmm3, XMMWORD PTR [eax + edx * 2]

pcmpgtw	xmm4, xmm3

pand		xmm5, xmm4

pminsw	xmm3, xmm0

pandn		xmm4, xmm3

por		xmm5, xmm4

movdqa	XMMWORD PTR [eax + edx * 2], xmm5

add		edx, 8

cmp		edx, esi

jb		.B2.12

 

As you can see from the above code snippet (note that your code may use different registers), the compiler uses a very neat trick to clip signed values to an arbitrary range. However, it is only applicable when you deal with signed data. Your boss is breathing down your neck, the application release date is getting closer and you still do not have that disambiguating performance level required to raise you above the competition. What to do?

The solution seems simple, but you have to write some assembler code. Problem is that you have learned to rely on the compiler so heavily that you’ve almost forgotten how to use the assembler. You browse through the manuals, you search for examples and you even find some, but they are not complete and you are out of time. Do not fret; Intel Developer Services comes to your rescue!


Getting your hands dirty

 

void ClipArray(unsigned short *src, DWORD n, unsigned short lo, unsigned short hi)

{

__asm	{

mov		esi, dword ptr [src]


movzx		eax, word ptr [lo]

movd		xmm0, eax

punpcklwd	xmm0, xmm0

pshufd	xmm0, xmm0, 0	; low


movzx		edx, word ptr [hi]

movd		xmm1, edx

punpcklwd	xmm1, xmm1

pshufd	xmm1, xmm1, 0	; high


mov		ecx, dword ptr [n]

mov		edi, ecx

and		edi, 0x1F

shr		ecx, 5


pcmpeqw	xmm2, xmm2		; 65535


psubusw	xmm2, xmm1		; 65535 - high


movdqa	xmm3, xmm2

paddusw	xmm3, xmm0		; 65535 - high + low

loop_01:

test		ecx, ecx

jz		loop_01_tail


movdqa	xmm4, xmmword ptr [esi]

movdqa	xmm5, xmmword ptr [esi + 16]

movdqa	xmm6, xmmword ptr [esi + 32]

movdqa	xmm7, xmmword ptr [esi + 48]


paddusw	xmm4, xmm2		; val = val + (65535 - hi)

paddusw	xmm5, xmm2

paddusw	xmm6, xmm2

paddusw	xmm7, xmm2


psubusw	xmm4, xmm3		; val = val - (65535 - hi + lo)

psubusw	xmm5, xmm3

psubusw	xmm6, xmm3

psubusw	xmm7, xmm3


paddw		xmm4, xmm0		; val = val + lo

paddw		xmm5, xmm0

paddw		xmm6, xmm0

paddw		xmm7, xmm0


movdqa	xmmword ptr [esi], xmm4

movdqa	xmmword ptr [esi + 16], xmm5

movdqa	xmmword ptr [esi + 32], xmm6

movdqa	xmmword ptr [esi + 48], xmm7


add		esi, 64

sub		ecx, 1

jmp		loop_01

loop_01_tail:

mov		ecx, edi

loop_02:

test		ecx, ecx

jz		loop_02_end


movzx		edi, word ptr [esi]

cmp		edi, eax

cmovb		edi, eax

cmp		edi, edx

cmova		edi, edx

mov		word ptr [esi], di


add		esi, 2

sub		ecx, 1

jmp		loop_02

loop_02_end:

}

}

 


How does it work?

Following these instructions sets all eight elements in a register to the value 65535 (0xFFFF). It works because the comparison of a register with itself is always true. It is a handy way to generate constant instead of loading it from the memory.

pcmpeqw	xmm2, xmm2		; 65535


 

 

paddusw	xmm4, xmm2		; val = val + (65535 - hi)

 

 

psubusw	xmm4, xmm3		; val = val - (65535 - hi + lo)

 

 

paddw		xmm4, xmm0		; val = val + lo

 

The second loop deserves its own explanation. It is the same code that compiler generates for our original function but we use it only if we have less than 32 elements to process or if the number of elements is not divisible by 32 without remainder.

loop_02:

test		ecx, ecx

jz		loop_02_end


movzx		edi, word ptr [esi]

cmp		edi, eax

cmovb		edi, eax

cmp		edi, edx

cmova		edi, edx

mov		word ptr [esi], di


add		esi, 2

sub		ecx, 1

jmp		loop_02

 

As you can see, we use conditional move instruction (MOVcc) in order to eliminate conditional branches. Why do we want them eliminated when the CPU has advanced branch prediction logic?

It seems counter-intuitive at first but let me underline the importance of doing it:

  • An incorrectly predicted branch incurs large penalty due to the long instruction pipeline, which must be flushed. That penalty is twice as big for Pentium 4 processor when compared to Pentium III processor. It is even bigger for some more recently released members of the Pentium processor family since even longer pipelines are being introduced in order to run at higher clock speeds.
  • The number of jumps inside the loop body is not a measure for itself. You should multiply it with the loop trip-count before you try to judge how much performance impact they can have on your code.
  • How often the CPU can predict the branch outcome also depends on the dataset. If data is not uniform, then no matter how advanced it is branch prediction algorithm cannot predict the correct outcome.

 

Therefore, whenever you have an opportunity, do not hesitate to remove them! At the very least you should try to use profiling on your application. It can sometimes improve the code performance by rearranging conditional jumps based on run-time statistics it collects.


Conclusion

Relying on advanced developer tools can relieve us of a lot of hard work. The tools we work with today are so advanced that the need for assembly programming is diminishing fast. How fast? Well, maybe even faster than it should. You c an consider this lesson successful if you remember two key points:

  1. You know more about the nature of your data than the compiler. Use that fact to your advantage whenever you can.
  2. No matter how advanced, the compiler cannot understand what you want to accomplish. Simplify your logic expressions and conditionals and even remove them if possible.

 


Related Links

If you would like to learn more about code optimization check the following documentation, it should be enough to get you started:

 


About the Author

Igor M. Levicki is a software engineer from Belgrade, Serbia & Montenegro, Europe. He has in depth knowledge of C, C++ and assembler for IA-32 and IA-32e including MMX, SSE, SSE2 and SSE3 instruction sets. Over last two years he has worked on the optimization of 3D medical imaging software for an American company and thanks to some of his optimizations, their software running on the Intel Pentium 4 processor is currently faster than dedicated hardware solutions. You can contact him with questions regarding this article at levicki@sezampro.yu.


For more complete information about compiler optimizations, see our Optimization Notice.

Comments

Igor Levicki's picture

This article has been butchered by incompetent editing. I would like to see it revised because it leaves bad impression regarding my own writing skills. Please contact me so we can discuss fixing it.

-- Regards, Igor Levicki If you find my post helpfull, please rate it and/or select it as a best answer where applies. Thank you.