Difference using SSE on Intel and AMD processors (?)

Hi, I have the following problem: I use SSE in a database system for selection (the SQL SELECT command). I wrote the code on an AMD Duron 1800 MHz processor, where it works fine. But when I tested it on an Intel Pentium 4 and a Pentium D, it gave wrong results. In C, the selection could be written in a simplified way as follows (for the SQL command "select * from TABLE where VALUE > key"):

----------------------------------
float *input; // address to input data stored in array
// (data page with 820 entries of type float)
float key; // find key
...
for (i = 0; i < get_values_count_in_column(); i++)
{
if (input[i] > key)
add_row_to_output_table(i);
}
----------------------------------

My SSE code is:

----------------------------------
float *input;
float *key; // address of the search key
...
__asm
{
mov esi, input
mov edi, key
xor ecx, ecx // counter
xor edx, edx
mov ebx, values_count_in_column

movss xmm1, [edi] // xmm1 <- key
shufps xmm1, xmm1, 0 // broadcast

prefetchnta [esi+32]

START_LOOP:
movaps xmm0, [esi] // xmm0 <- input

// The following line is problematic. On the AMD processor it is
// not needed, while on the Intel processor, without this line, XMM1
// loses its contents after the first call of the procedure
// sse_add_row (see below)
movaps xmm1, [edi] // xmm1 <- key

cmpnleps xmm0, xmm1 // compare input > key
movmskps edx, xmm0 // store mask to edx

// for testing purposes, we show the xmm1 register (see below)
push eax
push ecx
push edx
call show_xmm1
pop edx
pop ecx
pop eax

test edx, edx // if nothing found, skip testing bits
jz NOT_FOUND_3

FOUND:
test edx, 1 // test bit 0
jz NOT_FOUND_0 // if not set, jump to test bit 1

// bit is set, we have to store data into output
// selection table - it is done by function sse_add_row
push eax
push ecx
push edx
call sse_add_row // sse_add_row stores the entry with
// offset in ecx to the output table in the DBS
pop edx
pop ecx
pop eax

NOT_FOUND_0:
test edx, 2 // test bit 1
jz NOT_FOUND_1 // if not set, jump to test bit 2

push eax
push ecx
push edx
add ecx, 1
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_1:
test edx, 4 // test bit 2
jz NOT_FOUND_2 // if not set, jump to test bit 3

push eax
push ecx
push edx
add ecx, 2
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_2:
test edx, 8 // test bit 3
jz NOT_FOUND_3 // if not set, jump to end of bit testing

push eax
push ecx
push edx
add ecx, 3
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_3:
add esi, 16
add ecx, 4
cmp ecx, ebx
jne START_LOOP
}
...

// write entry to output table of the DBS
// sse_ecx is offset of found entry
void __fastcall sse_add_row(register int sse_ecx)
{
Row *row = algebra -> generateRow(table, page, sse_ecx);
algebra -> syscat -> addRowData (output_table, row);
delete row;
}

// print the contents of XMM1 (for testing purposes only)
void __fastcall show_xmm1(register int sse_ecx)
{
float *o = (float *)malloc(4 * sizeof(float));
__asm
{
mov edi, o
movups [edi], xmm1
}

printf("%d: %f %f %f %f\n", sse_ecx, o[0], o[1], o[2], o[3]);
free(o);
}
----------------------------------

On the AMD Duron 1800 MHz processor the line marked as problematic above is not needed, because XMM1 is already loaded (movss and broadcast) and its contents stay constant. But on Intel, its contents are constant only until the procedure sse_add_row is called. After the first call the contents of XMM1 are changed: they are overwritten with the components 0.00000 2.90625 0.00000 0.00000 and then stay constant with these values.

I don't understand which part of the code is wrong or is producing some strange side effect, why it runs fine on AMD, and why the new content of XMM1 is exactly 0.00000 2.90625 0.00000 0.00000. I studied the instruction manuals and the function calling conventions, but I didn't find what could modify the contents of XMM1, or why it would happen only on Intel processors.

I have now tested it on an AMD Turion, and it behaves the same way as on Intel: the contents of XMM1 are overwritten. So my program runs correctly only on the AMD Duron 1800 MHz.

Can somebody find the clue? Thanks in advance.
Jozef

I think the calling convention for the mmx/xmm registers requires that they be "caller saved" which means that they could be modified by called functions. You have to save them before calls and restore them afterwards. Check the following links:

Windows:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/Kernel_d/hh/Kernel_d/64bitamd_6848c803-89d3-4f19-82b2-6fae5e63ec13.xml.asp

Intel compiler changes:

http://www.intel.com/support/performancetools/c/windows/sb/cs-020438.htm

Of passing interest:

SysV AMD64 ABI:

http://www.x86-64.org/documentation/abi-0.96.pdf#search=%22linux%20IA32%20ABI%20SSE%22

Can you step into the function calls and see if some instruction is explicitly writing to the XMM1 register?

Another response to the original question, forwarded to us by engineering:

Which compiler are you using? XMM registers in 32-bit mode are volatile (not preserved across calls), and the question appears to assume they are preserved. It is very likely that the compiler is calling an optimized memory routine in the Intel case.

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Jozef,

Some comments on your code:

First

movss xmm1, [edi] // xmm1 <- key
shufps xmm1, xmm1, 0 // broadcast

is not equivalent to

movaps xmm1, [edi] // xmm1 <- key

unless edi points to 4 identical single-precision FP values (I assume it does).
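The difference between the two loads can be checked with SSE intrinsics. The following is only a minimal sketch: the helper names broadcast_key, load_four and equal_lanes are made up for illustration, and _mm_loadu_ps is used so the sketch does not depend on alignment (movaps additionally requires a 16-byte aligned address).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Broadcast one float to all four lanes: the intrinsic counterpart of
   movss followed by shufps xmm1, xmm1, 0. Only 4 bytes are read. */
static __m128 broadcast_key(const float *key)
{
    assert(key != NULL);
    return _mm_set1_ps(*key);
}

/* Plain 16-byte load: reads four consecutive floats, as movaps does.
   The unaligned variant is used here; movaps also needs alignment. */
static __m128 load_four(const float *four_values)
{
    assert(four_values != NULL);
    return _mm_loadu_ps(four_values);
}

/* Returns a 4-bit mask with bit i set when lane i of a equals lane i of b. */
static int equal_lanes(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b));
}
```

With key = 2.5f, comparing broadcast_key(&key) against load_four of {2.5, 2.5, 2.5, 2.5} gives the mask 0xF, while load_four of {2.5, 1.0, 2.5, 3.0} gives 0x5: the two registers agree only in lanes 0 and 2.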

Second, as per Lexi's suggestion, step through your code. You will most likely find that the code called by your sse_add_row is modifying XMM1 (it is the caller's responsibility to preserve/restore XMM registers). If you find this to be the case, then insert the

movaps xmm1, [edi] // xmm1 <- key

following each call to sse_add_row. In this manner the overhead only occurs when needed. (Remove what you thought was the unnecessary movaps.)
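An alternative to reloading XMM1 by hand is to write the loop with SSE intrinsics, so the compiler itself keeps or reloads the broadcast key across calls. This is only a sketch under assumptions: row_callback, match_list, collect and select_greater are hypothetical names, with the callback standing in for sse_add_row, and _mm_cmpgt_ps is used because it matches the C expression input[i] > key (cmpnleps also reports a match when either operand is NaN).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical callback type standing in for sse_add_row. */
typedef void (*row_callback)(size_t index, void *ctx);

/* Example sink for the callback: records matching indices. */
typedef struct {
    size_t rows[16];
    size_t n;
} match_list;

static void collect(size_t index, void *ctx)
{
    match_list *list = (match_list *)ctx;
    if (list->n < 16)
        list->rows[list->n++] = index;
}

/* Calls on_match(i, ctx) for every i with input[i] > key.
   count must be a multiple of 4, as in the original loop. */
static void select_greater(const float *input, size_t count, float key,
                           row_callback on_match, void *ctx)
{
    assert(count % 4 == 0);
    /* vkey is an ordinary C variable: if a call clobbers the register
       holding it, the compiler is responsible for restoring it. */
    const __m128 vkey = _mm_set1_ps(key);
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(input + i);
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vkey));
        for (int lane = 0; lane < 4; ++lane)
            if (mask & (1 << lane))
                on_match(i + (size_t)lane, ctx);
    }
}
```

For the input {1, 5, 2, 7, 9, 0, 3, 4} and key 3, select_greater with collect records the indices 1, 3, 4 and 7.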

Third, if your data is such that the majority of compares are "not found", then rearrange the code to place the NOT_FOUND_3 section directly after the first test:

START_LOOP:
...
test edx,edx
jnz FOUND
NOT_FOUND_3:
add esi, 16
add ecx, 4
cmp ecx, ebx
jne START_LOOP
jmp DONE

FOUND:
...
DONE:
}
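The same layout can be sketched in C with intrinsics: the common "nothing found" case skips straight to the next group of four values, and the per-lane tests run only when the mask is non-zero. The function name select_indices and the output array are assumptions made for this illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Writes every index i with input[i] > key to out and returns how many
   were written. out must have room for count entries; count must be a
   multiple of 4. */
static size_t select_indices(const float *input, size_t count, float key,
                             size_t *out)
{
    assert(count % 4 == 0);
    const __m128 vkey = _mm_set1_ps(key);
    size_t n = 0;
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(input + i);
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vkey));
        if (mask == 0)
            continue;  /* common case: no lane matched, go to next group */
        for (int lane = 0; lane < 4; ++lane)
            if (mask & (1 << lane))
                out[n++] = i + (size_t)lane;
    }
    return n;
}
```

For the input {0, 0, 0, 0, 1, 9, 1, 8, 0, 0, 0, 7} and key 5, the first group takes the fast path and the function returns 3 with the indices 5, 7 and 11.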

There are a few more tweaks, but I will let you find them for yourself.

Jim Dempsey

www.quickthreadprogramming.com