You should gain something if you precalculate the reciprocal coefficients (rev_coef) and reorder the code:
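movups xmm1,argument        ; xmm1 = x
mov ebx,OFFSET rev_coef     ; ebx -> precalculated coefficients
movups xmm0,argument        ; xmm0 = x
mulps xmm1,xmm1             ; xmm1 = x^2
movups xmm2,[ebx]           ; xmm2 = rev_coef
mulps xmm1,xmm0             ; xmm1 = x^3
mulps xmm1,xmm2             ; xmm1 = x^3 * rev_coef
subps xmm0,xmm1             ; xmm0 = x - x^3 * rev_coef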
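To make the data flow easier to follow, here is a rough C sketch of the same sequence using SSE intrinsics. The function name sine_two_terms and the contents of rev_coef (1/3! in every lane) are assumptions for illustration; the listing only tells us that rev_coef holds precalculated values.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... (x86 only) */

/* Assumed contents: 1/3! in each lane. The thread does not show the real table. */
static const float rev_coef[4] = {
    1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f
};

/* Computes out[i] = x[i] - x[i]^3 * rev_coef[i] for four packed floats. */
void sine_two_terms(const float *x, float *out)
{
    assert(x && out);

    __m128 x0 = _mm_loadu_ps(x);          /* movups xmm0,argument */
    __m128 x1 = _mm_mul_ps(x0, x0);       /* x^2 */
    __m128 c  = _mm_loadu_ps(rev_coef);   /* movups xmm2,[ebx] */
    x1 = _mm_mul_ps(x1, x0);              /* x^3 */
    x1 = _mm_mul_ps(x1, c);               /* x^3 * rev_coef */
    _mm_storeu_ps(out, _mm_sub_ps(x0, x1)); /* x - x^3 * rev_coef */
}
```

The coefficient load is independent of the multiplies, which is why it can be moved between them without changing the result.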
At the very least, the coefficients should be aligned on a 16-byte boundary. If they are, mulps can read them directly as a memory operand, so you could also try this:
movups xmm1,argument
movups xmm0,argument
mov ebx,OFFSET rev_coef
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[ebx]            ; memory operand: rev_coef must be 16-byte aligned
subps xmm0,xmm1
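In C terms, the aligned variant might look like the sketch below. The alignment is requested with C11 _Alignas, and _mm_load_ps (the aligned load) stands in for the memory operand of mulps; a compiler is free to fold that load into the multiply. The names rev_coef_aligned and sine_two_terms_aligned, and the 1/3! values, are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics (x86 only) */

/* Assumed contents (1/3! per lane), forced onto a 16-byte boundary. */
static _Alignas(16) const float rev_coef_aligned[4] = {
    1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f
};

/* Computes out[i] = x[i] - x[i]^3 * rev_coef_aligned[i]. */
void sine_two_terms_aligned(const float *x, float *out)
{
    /* An aligned load from a misaligned address would fault, so check first. */
    assert(((uintptr_t)rev_coef_aligned & 15u) == 0);

    __m128 x0 = _mm_loadu_ps(x);                       /* argument may be unaligned */
    __m128 x3 = _mm_mul_ps(_mm_mul_ps(x0, x0), x0);    /* x^3 */
    __m128 t  = _mm_mul_ps(x3, _mm_load_ps(rev_coef_aligned)); /* aligned read */
    _mm_storeu_ps(out, _mm_sub_ps(x0, t));             /* x - x^3 * coef */
}
```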
If the address of rev_coef is also constant, you can drop the load of ebx as well:
movups xmm1,argument
movups xmm0,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mulps xmm1,[OFFSET rev_coef]
subps xmm0,xmm1





Optimization of sine function's Taylor expansion
Hello!
While coding series expansions of various functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I took a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to make it execute faster.
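For readers without the original code at hand, the general idea of replacing division with multiplication by precalculated reciprocals can be sketched in scalar C. This is not the poster's code: the Horner-style evaluation, the 7-term table (the post used 14 terms), and the names inv_fact and sine_taylor are assumptions chosen for illustration.

```c
#define N_TERMS 7   /* illustrative only; the post used 14 terms */

/* Signed reciprocal factorials: +1/1!, -1/3!, +1/5!, ... (folded at compile time). */
static const float inv_fact[N_TERMS] = {
     1.0f,
    -1.0f / 6.0f,
     1.0f / 120.0f,
    -1.0f / 5040.0f,
     1.0f / 362880.0f,
    -1.0f / 39916800.0f,
     1.0f / 6227020800.0f
};

/* sin(x) ~ x * (c0 + c1*x^2 + c2*x^4 + ...), with no division at run time. */
float sine_taylor(float x)
{
    float x2 = x * x;
    float p = inv_fact[N_TERMS - 1];

    /* Horner's scheme in x^2: one multiply and one add per term. */
    for (int k = N_TERMS - 2; k >= 0; --k)
        p = p * x2 + inv_fact[k];

    return p * x;
}
```

Because the reciprocals are constants, neither a divide nor rcpps is needed inside the evaluation itself.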