Array Clipping

Submit New Article

January 15, 2009 12:00 AM PST


by Igor M. Levicki, Software Engineer


Introduction

Are you an advanced programmer into digital signal processing? If so, and you use the Intel® C/C++ Compiler 8.0 on a regular basis, you know that it can vectorize almost anything you throw at it and squeeze maximum performance from your code. Sometimes however, you are not getting the results you expect. Consider this example:

void ClipArray(short *src, DWORD n, short lo, short hi)
{
for (int i = 0; i < n; i++) {
if (src[i] < lo) {
src[i] = lo;
} else if (src[i] > hi) {
src[i] = hi;
}
}
}

 

 

void ClipArray(unsigned short *src, DWORD n, unsigned short lo, unsigned short hi)
{
for (int i = 0; i < n; i++) {
if (src[i] < lo) {
src[i] = lo;
} else if (src[i] > hi) {
src[i] = hi;
}
}
}

 

 

.B2.12:
movdqa xmm4, xmm2
movdqa xmm5, xmm1
movdqa xmm3, XMMWORD PTR [eax + edx * 2]
pcmpgtw xmm4, xmm3
pand xmm5, xmm4
pminsw xmm3, xmm0
pandn xmm4, xmm3
por xmm5, xmm4
movdqa XMMWORD PTR [eax + edx * 2], xmm5
add edx, 8
cmp edx, esi
jb .B2.12

 

As you can see from the above code snippet (note that your code may use different registers), the compiler uses a very neat trick to clip signed values to an arbitrary range. However, it is only applicable when you deal with signed data. Your boss is breathing down your neck, the application release date is getting closer and you still do not have that disambiguating performance level required to raise you above the competition. What to do?

The solution seems simple, but you have to write some assembler code. Problem is that you have learned to rely on the compiler so heavily that you’ve almost forgotten how to use the assembler. You browse through the manuals, you search for examples and you even find some, but they are not complete and you are out of time. Do not fret; Intel Developer Services comes to your rescue!


Getting your hands dirty

 

void ClipArray(unsigned short *src, DWORD n, unsigned short lo, unsigned short hi)
{
__asm {
mov esi, dword ptr [src]

movzx eax, word ptr [lo]
movd xmm0, eax
punpcklwd xmm0, xmm0
pshufd xmm0, xmm0, 0 ; low

movzx edx, word ptr [hi]
movd xmm1, edx
punpcklwd xmm1, xmm1
pshufd xmm1, xmm1, 0 ; high

mov ecx, dword ptr [n]
mov edi, ecx
and edi, 0x1F
shr ecx, 5

pcmpeqw xmm2, xmm2 ; 65535

psubusw xmm2, xmm1 ; 65535 - high

movdqa xmm3, xmm2
paddusw xmm3, xmm0 ; 65535 - high + low
loop_01:
test ecx, ecx
jz loop_01_tail

movdqa xmm4, xmmword ptr [esi]
movdqa xmm5, xmmword ptr [esi + 16]
movdqa xmm6, xmmword ptr [esi + 32]
movdqa xmm7, xmmword ptr [esi + 48]

paddusw xmm4, xmm2 ; val = val + (65535 - hi)
paddusw xmm5, xmm2
paddusw xmm6, xmm2
paddusw xmm7, xmm2

psubusw xmm4, xmm3 ; val = val - (65535 - hi + lo)
psubusw xmm5, xmm3
psubusw xmm6, xmm3
psubusw xmm7, xmm3

paddw xmm4, xmm0 ; val = val + lo
paddw xmm5, xmm0
paddw xmm6, xmm0
paddw xmm7, xmm0

movdqa xmmword ptr [esi], xmm4
movdqa xmmword ptr [esi + 16], xmm5
movdqa xmmword ptr [esi + 32], xmm6
movdqa xmmword ptr [esi + 48], xmm7

add esi, 64
sub ecx, 1
jmp loop_01
loop_01_tail:
mov ecx, edi
loop_02:
test ecx, ecx
jz loop_02_end

movzx edi, word ptr [esi]
cmp edi, eax
cmovb edi, eax
cmp edi, edx
cmova edi, edx
mov word ptr [esi], di

add esi, 2
sub ecx, 1
jmp loop_02
loop_02_end:
}
}

 


How does it work?

Following these instructions sets all eight elements in a register to the value 65535 (0xFFFF). It works because the comparison of a register with itself is always true. It is a handy way to generate constant instead of loading it from the memory.

pcmpeqw	xmm2, xmm2		; 65535

 

 

paddusw	xmm4, xmm2		; val = val + (65535 - hi)

 

 

psubusw	xmm4, xmm3		; val = val - (65535 - hi + lo)

 

 

paddw		xmm4, xmm0		; val = val + lo

 

The second loop deserves its own explanation. It is the same code that compiler generates for our original function but we use it only if we have less than 32 elements to process or if the number of elements is not divisible by 32 without remainder.

loop_02:
test ecx, ecx
jz loop_02_end

movzx edi, word ptr [esi]
cmp edi, eax
cmovb edi, eax
cmp edi, edx
cmova edi, edx
mov word ptr [esi], di

add esi, 2
sub ecx, 1
jmp loop_02

 

As you can see, we use conditional move instruction (MOVcc) in order to eliminate conditional branches. Why do we want them eliminated when the CPU has advanced branch prediction logic?

It seems counter-intuitive at first but let me underline the importance of doing it:

  • An incorrectly predicted branch incurs large penalty due to the long instruction pipeline, which must be flushed. That penalty is twice as big for Pentium 4 processor when compared to Pentium III processor. It is even bigger for some more recently released members of the Pentium processor family since even longer pipelines are being introduced in order to run at higher clock speeds.
  • The number of jumps inside the loop body is not a measure for itself. You should multiply it with the loop trip-count before you try to judge how much performance impact they can have on your code.
  • How often the CPU can predict the branch outcome also depends on the dataset. If data is not uniform, then no matter how advanced it is branch prediction algorithm cannot predict the correct outcome.

 

Therefore, whenever you have an opportunity, do not hesitate to remove them! At the very least you should try to use profiling on your application. It can sometimes improve the code performance by rearranging conditional jumps based on run-time statistics it collects.


Conclusion

Relying on advanced developer tools can relieve us of a lot of hard work. The tools we work with today are so advanced that the need for assembly programming is diminishing fast. How fast? Well, maybe even faster than it should. You c an consider this lesson successful if you remember two key points:

  1. You know more about the nature of your data than the compiler. Use that fact to your advantage whenever you can.
  2. No matter how advanced, the compiler cannot understand what you want to accomplish. Simplify your logic expressions and conditionals and even remove them if possible.

 


Related Links

If you would like to learn more about code optimization check the following documentation, it should be enough to get you started:

 


About the Author

Igor M. Levicki is a software engineer from Belgrade, Serbia & Montenegro, Europe. He has in depth knowledge of C, C++ and assembler for IA-32 and IA-32e including MMX, SSE, SSE2 and SSE3 instruction sets. Over last two years he has worked on the optimization of 3D medical imaging software for an American company and thanks to some of his optimizations, their software running on the Intel Pentium 4 processor is currently faster than dedicated hardware solutions. You can contact him with questions regarding this article at levicki@sezampro.yu.