MOVAPS alignment problem

Posted by iliyapolak

Hello!
I am trying to "port" my Java special-functions class to pure x86 assembly. In my project I use the SSE and SSE2 instruction sets, operating on floating-point REAL4 values. I would like to use the movaps instruction because of its timing (lower CPI than movups), but my program crashes with an "access violation" error. While debugging I found that the error is caused by a movaps instruction trying to access stack values local to the procedure (addressed as ebp-n, with the offset decremented in multiples of 16). When I use movups the problem is absent. I tried adding an align 16 directive, but it does not work, so I am stuck with the less efficient instruction.
Here is my code snippet, which calculates the first few terms of the Taylor expansion of e^x.

 movaps xmm0,one          ;movaps works perfectly while accessing memory
 addps  xmm0,argument     ;1+x xmm0 accumulator
 mov    eax,OFFSET coef1
 movaps xmm1,[eax]
 rcpps  xmm2,xmm1         ;1/coef1
 movaps xmm3,argument
 mulps  xmm3,xmm3         ;x^2
 movups [ebp-16],xmm3     ;store x^2 ;here movaps crashes program
 mulps  xmm2,xmm3
 addps  xmm0,xmm2         ;1+x+x^2/2! xmm0 accumulator
 mov    eax,OFFSET coef2
 movups xmm1,[eax]
 rcpps  xmm2,xmm1         ;1/coef2
 movups xmm7,argument
 movups xmm3,[ebp-16]
 mulps  xmm3,xmm7         ;x^3
 movups [ebp-32],xmm3     ;store x^3
 mulps  xmm2,xmm3
 addps  xmm0,xmm2         ;1+x+x^2/2!+x^3/3! xmm0 accumulator

Posted by Max Locktyukhin (Intel)

Short answer: you don't need to worry about MOVAPS vs. MOVUPS for loads and stores.

Long answer: although you could make an effort to align your stack (e.g. by adding AND EBP, 0xfffffff0), MOVUPS has been as fast as MOVAPS for four generations of Intel CPUs now. You are only really penalized when a store or load crosses a page boundary (a relatively rare case). Also, stores and subsequent loads from the stack are handled by a shortcut called store-to-load forwarding, without cache interaction. The performance bottlenecks are almost certainly elsewhere in this code.

-Max
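
For readers who still want aligned stack slots: movaps faults whenever its memory operand is not on a 16-byte boundary, and an ALIGN 16 directive only affects where the assembler places static data or code, not the run-time value of the stack pointer. The following is a minimal sketch of the stack-alignment idea mentioned above, not the original poster's procedure. It assumes 32-bit MASM with a hand-written prologue and epilogue, the procedure name ExpSketch is hypothetical, and it masks ESP rather than EBP so that EBP still holds the original frame address for the epilogue.

```asm
OPTION PROLOGUE:NONE            ; assumption: we write the prologue ourselves
OPTION EPILOGUE:NONE            ; assumption: and the epilogue as well

ExpSketch PROC                  ; hypothetical procedure name
    push   ebp
    mov    ebp, esp             ; EBP keeps the caller's (unaligned) frame address
    and    esp, 0FFFFFFF0h      ; round ESP down to a 16-byte boundary
    sub    esp, 32              ; reserve two 16-byte slots; ESP stays aligned

    mulps  xmm3, xmm3           ; x^2 (xmm3 assumed to hold the argument)
    movaps [esp+16], xmm3       ; aligned store: ESP+16 is a multiple of 16
    movaps xmm4, [esp+16]       ; aligned reload of x^2
    movaps [esp], xmm4          ; second aligned slot

    mov    esp, ebp             ; discard the aligned area and the padding
    pop    ebp
    ret
ExpSketch ENDP
```

The slots are addressed relative to ESP because, after the AND, the distance between EBP and the aligned area is no longer a fixed constant; [ebp-16] is only 16-byte aligned if EBP itself happens to be.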

Posted by iliyapolak

Thank you very much.