Q&A: Assembler question

Here is a question received by Intel Software Network Support, along with the answers provided by our Application Engineers:

Q. I have the following function in C++:

#define low16(x) ((x) & 0xffff)
unsigned short f(unsigned short a, unsigned short b)
{
    if( a )
    {
        if( b )
        {
            unsigned int p = (unsigned int) a * b;

            b = ( low16(p) );
            a = ( p >> 16 );
            return b - a + (b < a);
        }
        else
            return 1 - a;
    }
    else
        return 1 - b;
}

This function computes f(a,b) = a * b mod (2^16 + 1) and is one of the hot spots in the application I'm working on. So, I've tried the following optimization using assembler:

unsigned short f(unsigned short a, unsigned short b)
{
    if( a )
    {
        if( b )
        {
            unsigned int p = (unsigned int) a * b;

            // The result is left in eax (ax), which serves as the return value.
            __asm {
                mov eax,p
                mov ecx,p
                shr ecx,16
                sub ax,cx
                movzx eax,ax
                adc eax,0
            }
        }
        else
            return 1 - a;
    }
    else
        return 1 - b;
}

and I get slower performance. I've tried declaring a function for the assembler part, using __fastcall modifiers and register variables, with no good results. Still, the assembler function is slower. I'm using Intel Compiler v8.0. Any comments?

A. Our AEs have been looking over your code, but have been unable to duplicate the assertion that the assembly code is slower. On our machine (with a 3.06 GHz Pentium 4 processor), the assembly code is slightly faster than the source code. Can you let us know your hardware and software configuration (processor speed, OS, compiler, etc.)?

Q. Well, I did find that the function by itself is a little faster than the C++ code. But that function is called repeatedly in one of our processes, and if I run the tests repeatedly, I mean several hundred times, I get slower results with the assembler version. I'm using Intel Compiler v8.0 with some OpenMP parallel sections and HT on a 2.4 GHz Pentium 4 processor with 500 MB of RAM. I really don't know if there could be some context-switching problems, or if I'm killing the pipeline with the assembler version. But, by the way, do you think it's worth it? In your measurements, was the gain noticeable, say more than 10%?

A. We were able to duplicate the slowdown for multiple passes with the partial assembly code implementation. Looking further into the procedure itself, there are quite a few unnecessary memory accesses, as the same value is stored and reloaded due to the inefficiencies of compiled code. We rewrote the entire procedure in assembly, removing unnecessary memory accesses, and the result appears to be about 25% faster than the original source code implementation.

By way of explanation, the core functionality is straightforward. There is very little we can do to optimize the integer multiply or the shifting. Therefore, we need to look at the surrounding functionality to see if we can achieve improvement there. In this case, the compiler explicitly loads each variable every time the source code references it. In addition, it stores the product of the integer multiplication (p) to memory, even though there is no future reference to that value. Therefore, we can load a and b exactly once, and keep those values in registers until they are no longer needed. This allows us to remove several later memory loads. In addition, we can remove the store of p and the subsequent reloading of the value of p. Because the core functionality is so efficient (integer multiply and shifting), memory references (even to the L1 cache) can account for a significant part of the overall execution time.

The code we used to achieve the 25% speedup follows. If you have any further questions, please let us know.

unsigned short f_new (unsigned short a, unsigned short b)
{
    _asm {
        // if (a)
        // {
        movzx edx, WORD PTR a
        test edx, edx
        je SHORT TAG_A_ZERO
        // if (b)
        // {
        movzx eax, WORD PTR b
        test eax, eax
        je SHORT TAG_B_ZERO
        // unsigned int p = (unsigned int) a * b;
        imul edx, eax
        // __asm {
        mov eax, edx
        shr edx, 16
        sub ax, dx
        movzx eax, ax
        adc eax, 0
        jmp SHORT TAG_RETURN
        // }
        // }
        // else
        // return 1 - a;
    TAG_B_ZERO:
        mov eax, 1
        sub eax, edx
        jmp SHORT TAG_RETURN
    TAG_A_ZERO:
        // }
        // else
        // return 1 - b;
        movzx edx, WORD PTR b
        mov eax, 1
        sub eax, edx
    TAG_RETURN:
    }
}

==
Lexi S.

Intel Software Network Support

http://www.intel.com/software


Message Edited by intel.software.network.support on 12-07-2005 05:00 PM


Here is an additional followup:

Q. Do you know if this code works on any PC, or is it exclusive to Intel processors? Pentium 4 or above? Our product works on any machine, so I must take care with this kind of correction. Also, does this code have some impact on an HT-enabled machine? Please let me know.


A. The code should work on all 32-bit x86 processors, from the 80386 onward (including the i486). The instructions it uses, such as movzx and the two-operand form of imul, have been available since the 80386. The syntax of the assembly code may cause problems with gas (the GNU Assembler) because it is in the Intel format. There may also be some Microsoft-specific directives embedded. There should not be any specific impact on HT-enabled systems.

==
Lexi S.


Message Edited by intel.software.network.support on 12-07-2005 04:15 PM
