by Igor M. Levicki, Software Engineer
Introduction
Are you an advanced programmer into digital signal processing? If so, and you use the Intel® C/C++ Compiler 8.0 on a regular basis, you know that it can vectorize almost anything you throw at it and squeeze maximum performance from your code. Sometimes however, you are not getting the results you expect. Consider this example:
|
|
.B2.12: movdqa xmm4, xmm2 movdqa xmm5, xmm1 movdqa xmm3, XMMWORD PTR [eax + edx * 2] pcmpgtw xmm4, xmm3 pand xmm5, xmm4 pminsw xmm3, xmm0 pandn xmm4, xmm3 por xmm5, xmm4 movdqa XMMWORD PTR [eax + edx * 2], xmm5 add edx, 8 cmp edx, esi jb .B2.12 |
As you can see from the above code snippet (note that your code may use different registers), the compiler uses a very neat trick to clip signed values to an arbitrary range. However, it is only applicable when you deal with signed data. Your boss is breathing down your neck, the application release date is getting closer and you still do not have that disambiguating performance level required to raise you above the competition. What to do?
The solution seems simple, but you have to write some assembler code. Problem is that you have learned to rely on the compiler so heavily that you’ve almost forgotten how to use the assembler. You browse through the manuals, you search for examples and you even find some, but they are not complete and you are out of time. Do not fret; Intel Developer Services comes to your rescue!
Getting your hands dirty
|
How does it work?
Following these instructions sets all eight elements in a register to the value 65535 (0xFFFF). It works because the comparison of a register with itself is always true. It is a handy way to generate constant instead of loading it from the memory.
pcmpeqw xmm2, xmm2 ; 65535 |
paddusw xmm4, xmm2 ; val = val + (65535 - hi) |
psubusw xmm4, xmm3 ; val = val - (65535 - hi + lo) |
paddw xmm4, xmm0 ; val = val + lo |
The second loop deserves its own explanation. It is the same code that compiler generates for our original function but we use it only if we have less than 32 elements to process or if the number of elements is not divisible by 32 without remainder.
loop_02: test ecx, ecx jz loop_02_end movzx edi, word ptr [esi] cmp edi, eax cmovb edi, eax cmp edi, edx cmova edi, edx mov word ptr [esi], di add esi, 2 sub ecx, 1 jmp loop_02 |
As you can see, we use conditional move instruction (MOVcc) in order to eliminate conditional branches. Why do we want them eliminated when the CPU has advanced branch prediction logic?
It seems counter-intuitive at first but let me underline the importance of doing it:
- An incorrectly predicted branch incurs large penalty due to the long instruction pipeline, which must be flushed. That penalty is twice as big for Pentium 4 processor when compared to Pentium III processor. It is even bigger for some more recently released members of the Pentium processor family since even longer pipelines are being introduced in order to run at higher clock speeds.
- The number of jumps inside the loop body is not a measure for itself. You should multiply it with the loop trip-count before you try to judge how much performance impact they can have on your code.
- How often the CPU can predict the branch outcome also depends on the dataset. If data is not uniform, then no matter how advanced it is branch prediction algorithm cannot predict the correct outcome.
Therefore, whenever you have an opportunity, do not hesitate to remove them! At the very least you should try to use profiling on your application. It can sometimes improve the code performance by rearranging conditional jumps based on run-time statistics it collects.
Conclusion
Relying on advanced developer tools can relieve us of a lot of hard work. The tools we work with today are so advanced that the need for assembly programming is diminishing fast. How fast? Well, maybe even faster than it should. You c an consider this lesson successful if you remember two key points:
- You know more about the nature of your data than the compiler. Use that fact to your advantage whenever you can.
- No matter how advanced, the compiler cannot understand what you want to accomplish. Simplify your logic expressions and conditionals and even remove them if possible.
Related Links
If you would like to learn more about code optimization check the following documentation, it should be enough to get you started:
- Intel® 64 and IA-32 Architectures Software Developer's Manuals
- The Software Optimization Cookbook by Richard Gerber
- Software Optimization Using Hardware Synthesis Techniques* by Bret Victor (PDF 100KB)
- Array Clipping Code sample
About the Author
Igor M. Levicki is a software engineer from Belgrade, Serbia & Montenegro, Europe. He has in depth knowledge of C, C++ and assembler for IA-32 and IA-32e including MMX, SSE, SSE2 and SSE3 instruction sets. Over last two years he has worked on the optimization of 3D medical imaging software for an American company and thanks to some of his optimizations, their software running on the Intel Pentium 4 processor is currently faster than dedicated hardware solutions. You can contact him with questions regarding this article at levicki@sezampro.yu.

Comments
This article has been butchered by incompetent editing. I would like to see it revised because it leaves bad impression regarding my own writing skills. Please contact me so we can discuss fixing it.