Difference using SSE on Intel and AMD processors (?)

Posted by jozef.timulak@gmail.com

Hi, I have the following problem: I use SSE in a database system to implement selection (the SQL SELECT command). I wrote the code on an AMD Duron 1800 MHz processor, where it works fine. But when I tested it on an Intel Pentium 4 and a Pentium D, it gave wrong results. In C, a simplified version would look like this (for the SQL command "select * from TABLE where VALUE > key"):

----------------------------------
float *input; // address to input data stored in array
// (data page with 820 entries of type float)
float key; // search key
...
for (i = 0; i < get_values_count_in_column(); i++)
{
if (input[i] > key)
add_row_to_output_table(i);
}
----------------------------------

My SSE code is:

----------------------------------
float *input;
float *key; // address of the search key
...
__asm
{
mov esi, input
mov edi, key
xor ecx, ecx // counter
xor edx, edx
mov ebx, values_count_in_column

movss xmm1, [edi] // xmm1 <- key
shufps xmm1, xmm1, 0 // broadcast

prefetchnta [esi+32]

START_LOOP:
movaps xmm0, [esi] // xmm0 <- input

// The following line is problematic. On the AMD processor it is
// not needed, while on the Intel processor, without this line, XMM1
// loses its contents after the first call of the procedure
// sse_add_row (see below).
movaps xmm1, [edi] // xmm1 <- key

cmpnleps xmm0, xmm1 // compare input > key
movmskps edx, xmm0 // store mask to edx

// for testing purposes, we show the xmm1 register (see below)
push eax
push ecx
push edx
call show_xmm1
pop edx
pop ecx
pop eax

test edx, edx // if nothing found, skip testing bits
jz NOT_FOUND_3

FOUND:
test edx, 1 // test bit 0
jz NOT_FOUND_0 // if not set, jump to test bit 1

// bit is set, we have to store data into output
// selection table - it is done by function sse_add_row
push eax
push ecx
push edx
call sse_add_row // sse_add_row stores the entry with offset ecx in the DBS output table
pop edx
pop ecx
pop eax

NOT_FOUND_0:
test edx, 2 // test bit 1
jz NOT_FOUND_1 // if not set, jump to test bit 2

push eax
push ecx
push edx
add ecx, 1
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_1:
test edx, 4 // test bit 2
jz NOT_FOUND_2 // if not set, jump to test bit 3

push eax
push ecx
push edx
add ecx, 2
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_2:
test edx, 8 // test bit 3
jz NOT_FOUND_3 // if not set, jump to end of bit testing

push eax
push ecx
push edx
add ecx, 3
call sse_add_row
pop edx
pop ecx
pop eax

NOT_FOUND_3:
add esi, 16
add ecx, 4
cmp ecx, ebx
jne START_LOOP
}
...

// write entry to output table of the DBS
// sse_ecx is offset of found entry
void __fastcall sse_add_row(register int sse_ecx)
{
Row *row = algebra -> generateRow(table, page, sse_ecx);
algebra -> syscat -> addRowData (output_table, row);
delete row;
}

// print the contents of XMM1 (for testing purposes only)
void __fastcall show_xmm1(register int sse_ecx)
{
float *o = (float *)malloc(4 * sizeof(float));
__asm
{
mov edi, o
movups [edi], xmm1
}

printf("%d: %f %f %f %f\n", sse_ecx, o[0], o[1], o[2], o[3]);
free(o);
}
----------------------------------

On the AMD Duron 1800 MHz processor, the line marked as problematic above (the movaps xmm1, [edi] inside the loop) is not needed, because XMM1 is already loaded (movss and broadcast) and its contents stay constant. But on Intel, its contents stay constant only until the procedure sse_add_row is called. After the first call, XMM1 is overwritten with the components 0.00000 2.90625 0.00000 0.00000 and then keeps these values.

I don't understand which part of the code is wrong or produces this strange side effect, why it runs fine on the AMD Duron, and why the new contents of XMM1 are exactly 0.00000 2.90625 0.00000 0.00000. I studied the instruction manuals and the function calling conventions, but I could not find what could modify XMM1, or why only on Intel processors.

I have now tested it on an AMD Turion, and it behaves the same way as on Intel: the contents of XMM1 are overwritten. So my program runs correctly only on the AMD Duron 1800 MHz.

Can somebody find the clue? Thanks in advance.
Jozef

Posted by Community Admin

I think the calling convention treats the MMX/XMM registers as "caller saved", which means that called functions may modify them. You have to save them before calls and restore them afterwards. Check the following links:


Windows:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/Kernel_d/hh/Kernel_d/64bitamd_6848c803-89d3-4f19-82b2-6fae5e63ec13.xml.asp

Intel compiler changes:
http://www.intel.com/support/performancetools/c/windows/sb/cs-020438.htm

Of passing interest, the SysV AMD64 ABI:
http://www.x86-64.org/documentation/abi-0.96.pdf#search=%22linux%20IA32%20ABI%20SSE%22

Posted by Michael Stoner (Intel)

Can you step into the function calls and see if some instruction is explicitly writing to the XMM1 register?

Posted by Intel Software Network Support

Another response to the original question, forwarded to us by engineering:



Which compiler are you using? XMM registers in 32-bit mode are volatile (not preserved across function calls), and the question appears to assume they are preserved. It is very likely that the compiler is calling an optimized memory routine in the Intel case.


Lexi S.
Intel Software Network Support
http://www.intel.com/software


Posted by jimdempseyatthecove

Jozef,


Some comments on your code:


First


movss xmm1, [edi] // xmm1 <- key
shufps xmm1, xmm1, 0 // broadcast


is not equivalent to


movaps xmm1, [edi] // xmm1 <- key

unless edi points to four identical single-precision FP values (I assume it does).
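To make the difference concrete, here is a minimal sketch using compiler intrinsics from `<xmmintrin.h>` instead of inline assembly. The helper names (`broadcast_key`, `load_four`, `same_lanes`) are invented for this illustration and do not appear in the original code. It assumes an x86 compiler with SSE enabled.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Counterpart of "movss xmm1, [edi]" followed by "shufps xmm1, xmm1, 0":
   reads only 4 bytes at key and copies that value into all four lanes. */
static __m128 broadcast_key(const float *key)
{
    __m128 k = _mm_load_ss(key);      /* lanes: key, 0, 0, 0 */
    return _mm_shuffle_ps(k, k, 0);   /* lanes: key, key, key, key */
}

/* Counterpart of a plain 16-byte load such as "movaps xmm1, [edi]".
   It reads key4[0..3]. The unaligned form is used here so that the
   sketch does not depend on 16-byte alignment (an assumption of this
   example; movaps itself requires an aligned address). */
static __m128 load_four(const float *key4)
{
    return _mm_loadu_ps(key4);
}

/* Returns 1 when all four lanes of a and b compare equal, 0 otherwise. */
static int same_lanes(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

The two loads agree only when the four floats starting at the key address are identical. If key points to a single float, the 16-byte load also picks up whatever happens to follow it in memory.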


Second, as per Lexi's suggestion, step through your code. You will most likely find that the code called by your sse_add_row is modifying XMM1 (it is the caller's responsibility to preserve and restore XMM registers). If you find this to be the case, then insert


movaps xmm1, [edi] // xmm1 <- key

after each call to sse_add_row. This way the overhead occurs only when needed. (Remove what you thought was the unnecessary movaps.)
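If hand-written assembly is not a hard requirement, one possible alternative is to write the loop with compiler intrinsics, so that the compiler itself decides how to keep the key alive across calls. The following is only a sketch under that assumption: `row_sink`, `select_greater`, `struct hits` and `collect` are hypothetical names, and `collect` merely stands in for sse_add_row.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical callback type standing in for sse_add_row. */
typedef void (*row_sink)(size_t index, void *ctx);

/* Hypothetical sink that records the reported indices. */
struct hits { size_t idx[16]; size_t n; };

static void collect(size_t index, void *ctx)
{
    struct hits *h = (struct hits *)ctx;
    if (h->n < 16)
        h->idx[h->n++] = index;
}

/* Calls emit(i, ctx) for every i with input[i] > key. */
static void select_greater(const float *input, size_t count, float key,
                           row_sink emit, void *ctx)
{
    /* The compiler spills and reloads vkey around calls as needed. */
    const __m128 vkey = _mm_set1_ps(key);
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(input + i);
        /* _mm_cmpgt_ps is false for NaN, like the scalar ">" operator. */
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vkey));
        for (int bit = 0; mask != 0; ++bit, mask >>= 1) {
            if (mask & 1)
                emit(i + (size_t)bit, ctx);
        }
    }
    for (; i < count; ++i) {   /* scalar tail when count is not a multiple of 4 */
        if (input[i] > key)
            emit(i, ctx);
    }
}
```

Note that the original assembly uses cmpnleps, which also reports NaN inputs as matches, so the two versions differ for NaN values.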


Third, if your data is such that the majority of compares are "not found", then rearrange the code to place the NOT_FOUND_3 section right after the first test:


START_LOOP:
...
test edx,edx
jnz FOUND
NOT_FOUND_3:
add esi, 16
add ecx, 4
cmp ecx, ebx
jne START_LOOP
jmp DONE


FOUND:
...
DONE:
}
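The same idea can be sketched in C with intrinsics: test the whole mask first and continue straight to the next group of four when nothing matched, keeping the per-bit work in a separate helper. This is an illustration only. `store_hits` and `select_sparse` are invented names, the output goes to a plain index array instead of the DBS output table, and the compiler is free to lay out the branches differently from the hand-written version.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Less frequent path: expand a non-zero 4-bit mask into row indices.
   Appends to out starting at position n and returns the new count. */
static size_t store_hits(int mask, size_t base, size_t *out, size_t n)
{
    for (int bit = 0; bit < 4; ++bit) {
        if (mask & (1 << bit))
            out[n++] = base + (size_t)bit;
    }
    return n;
}

/* Writes the indices i with input[i] > key to out and returns how many
   were written. out must have room for count entries. Only full groups
   of four are processed; this assumes count is a multiple of 4, as with
   the 820-entry page in the original post. */
static size_t select_sparse(const float *input, size_t count, float key,
                            size_t *out)
{
    const __m128 vkey = _mm_set1_ps(key);
    size_t n = 0;

    for (size_t i = 0; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(input + i);
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vkey));
        if (mask == 0)
            continue;               /* common case: nothing found */
        n = store_hits(mask, i, out, n);
    }
    return n;
}
```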


There are a few more tweaks, but I will let you find them for yourself.


Jim Dempsey



www.quickthreadprogramming.com
