Optimization of sine function's taylor expansion

iliyapolak's picture

Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by its Taylor series, I made a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.

movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2 ; rcpps used instead of divps
mulps xmm1,xmm3
subps xmm0,xmm1

sirrida's picture

You should be able to gain something if you precalculate the reciprocals of the coefficients and reorder the code.

movups xmm1,argument
mov ebx,OFFSET rev_coef
movups xmm0,argument
mulps xmm1,xmm1
movups xmm2,[ebx]
mulps xmm1,xmm0
mulps xmm1,xmm2
subps xmm0,xmm1

At least the coefficients should be aligned. If this is the case you could also try this:

movups xmm1,argument
movups xmm0,argument

mov ebx,OFFSET rev_coef
mulps xmm1,xmm1
mulps xmm1,xmm0

mulps xmm1,[ebx]

subps xmm0,xmm1

If, in addition, the offset to rev_coef is constant, you can also remove the load of ebx:

movups xmm1,argument
movups xmm0,argument

mulps xmm1,xmm1
mulps xmm1,xmm0

mulps xmm1,[OFFSET rev_coef]

subps xmm0,xmm1
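For reference, this final sequence maps almost one-to-one onto SSE intrinsics in C; a minimal sketch, assuming rev_coef holds the precalculated reciprocal coefficients in a 16-byte-aligned array (names are illustrative):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One Taylor step, four floats at a time: x - x^3 * (1/3!).
   rev_coef must be 16-byte aligned for _mm_load_ps. */
static __m128 sin_first_term(__m128 x, const float *rev_coef)
{
    __m128 x2 = _mm_mul_ps(x, x);                       /* x^2       */
    __m128 x3 = _mm_mul_ps(x2, x);                      /* x^3       */
    __m128 t  = _mm_mul_ps(x3, _mm_load_ps(rev_coef));  /* x^3 / 3!  */
    return _mm_sub_ps(x, t);                            /* x - x^3/3! */
}
```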

iliyapolak's picture

Thanks for your fast answer!
I did not think about precalculating the reciprocals of the coefficients; it seems like a good speed improvement.
I initially tried to write code that looks like your last example, but at runtime I got an access violation error in spite of the coefficients being aligned.
By rewriting this code the way you showed, I suppose I could reach 100 cycles for fewer than 14 terms; that would be comparable to the hardware-accelerated fcos and fsin instructions, but with single-precision accuracy.
Do you know what implementation fcos and fsin are based on? I mean the approximation algorithm.

Tim Prince's picture

As the other responder implied, it doesn't make sense to leave division in an implementation of a Taylor (or, better, Chebyshev economized or similar) power series. The rcp division approximation might be usable for the smallest term of a series. An iterative approximation to approach full accuracy takes as long as a full implementation of division (longer, on the most up-to-date CPUs), and is referred to by the Intel compilers as a "throughput" optimization, as the advantage shows when multiple independent calculations can be interleaved by instruction-level parallelism. The same compile options which introduce this "throughput" optimization usually also perform automatic inversion of constants so as to replace division by multiplication.
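To make the trade-off concrete: rcpps alone delivers only about 12 bits of accuracy, and the iterative refinement referred to above is typically one Newton-Raphson step. A hedged sketch in C with SSE intrinsics (an illustration, not the compiler's actual code):

```c
#include <xmmintrin.h>

/* Approximate 1/b: rcpps gives ~12 accurate bits, then one
   Newton-Raphson iteration  r' = r * (2 - b*r)  roughly doubles
   the number of accurate bits (close to full single precision). */
static __m128 recip_refined(__m128 b)
{
    __m128 r   = _mm_rcp_ps(b);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(b, r)));
}
```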

Tim Prince's picture

The fsincos in the x87 firmware probably uses a rational approximation over a reduced range, taking advantage of range reduction using the fact that the 64-bit precision (10-byte long double) rounded value for Pi happens to be accurate to 66 bits. There's no guarantee that all x87 firmware implementations are the same internally, even though compatibility demands they give the same result and all have the feature of returning the argument itself when its absolute value exceeds 2^64. Math libraries based on SSE2 don't use the non-vectorizable x87 built-ins.
When you're developing your own approximation, a Chebyshev economized series expansion of sin(x)/x over a suitable interval, such as |x| < Pi/4, may be a good point of comparison.

iliyapolak's picture

I have found this post "http://software.intel.com/en-us/forums/showthread.php?t=74354" where one of the Intel engineers stated that the compiler does not use x87 instructions, and you stated that math libraries do not use x87 instructions either.
I would like to ask what approximation can be used for high-precision and vectorizable code targeted at function approximation.
Sorry, but I do not know the Chebyshev series expansion.

Tim Prince's picture

You could study the Numerical Recipes on-line chapters about Chebyshev economization. A literal translation of their code into bc, or running it in 32-bit IA32 mode, will give you enough accuracy to come up with coefficients for a double series. You'll probably also want to consider variations on Horner's method for polynomial evaluation. These are about the simplest ways to come up with competitive math functions based on series approximations.
You might note that transforming a polynomial from separate terms to Horner's form is encouraged by the Fortran standard, although in practice no compilers can do the entire job automatically.
For vectorization, you can take the general approach of svml, where you calculate a number of results in parallel corresponding to the register width, or the vml method, applying your method to a vector.
You also have to consider accuracy vs. speed and vectorization issues in range reduction (discussed in the reference you mentioned).
I would go as far as possible with C or Fortran source code, then start applying intrinsics. This idea is strangely controversial, but I don't see how you can prototype or document without a working high level source code version. You're clearly up against a situation where starting to code in intrinsics without studying algorithms becomes pointless.

iliyapolak's picture

I have the "NR in C" book and I took a look at its section about polynomial expansion; it is not so hard to implement in source code, be it Java or C.
Sadly I do not know Fortran, although I think I could understand the code while reading it.
My intention is to code in pure assembly; I do not like the idea of using intrinsics. I could study the formulae and their implementation in high-level language source code, and try to write the code in assembly, or maybe use inline assembly in order to minimize the overhead of coding Windows I/O in assembly.
The best idea is to study various approximation methods and to compare them for speed and accuracy.

iliyapolak's picture

sirrida
I wrote improved code exactly as in your last snippet. I measured it and got 4-5 cycles on average.

bronxzv's picture

not sin(x) but somewhat related: here is a 2-instruction approximation to log(x) in AVX2. It's very coarse, but pretty valuable for my use cases (3D graphics):

vcvtdq2ps ymm1, ymm0

vfmsub213ps ymm1, ymm3, ymm2

ymm0: argument
ymm1: result
ymm2: constant (8x broadcast 8.2629582e-8f)
ymm3: constant (8x broadcast 87.989971f)

it's based on Paul Mineiro's "fasterlog" example here: http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html
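For readers without AVX2, the same trick can be sketched in scalar C: the IEEE-754 bit pattern of a positive float, read as an integer, is roughly 2^23*(log2(x)+127), so one multiply and one subtract recover an approximate natural log. This is an illustrative reconstruction using the constants from the snippet above, not the poster's exact code:

```c
#include <stdint.h>
#include <string.h>

/* Very coarse natural-log approximation ("fasterlog" style).
   The bit pattern of x, read as a uint32, is about
   2^23 * (log2(x) + 127); scaling by ~ln(2)/2^23 and shifting
   yields ln(x). Constants match the broadcast AVX2 values above.
   Valid only for x > 0. */
static float fasterlog(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* bit-cast float -> int */
    return (float)bits * 8.2629582e-8f - 87.989971f;
}
```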

iliyapolak's picture

For my approximations I am using the famous book "Handbook of Mathematical Functions". You can find there plenty of formulas for many elementary and special functions; greatly recommended. Sadly, my CPU does not support AVX2 instructions.

bronxzv's picture

The Milton Abramowitz classic (Dover)? I have it too; it's big and heavy!

A more useful book (IMO) for practical coding is Elementary Functions: Algorithms and Implementation by Jean-Michel Muller (2nd edition, Birkhäuser)

iliyapolak's picture

Yes, the Milton Abramowitz book is great. I have already implemented 52 elementary and special functions in Java; my code is based on formulas from this classic book. Now I am rewriting all those functions in pure assembly as simple programs. I would also recommend another classic book for the error analysis of numerical methods:
"Real Computing Made Real", a great little book written by the famous scientist Forman Acton.
Another "must have" book is "Numerical Recipes in C".
Elementary Functions is my next buy.
Btw, what is your IDE or assembler for your assembly projects: Visual Studio 2010, or maybe masm32 or nasm?
I use VS 2010 and masm32.

bronxzv's picture

Actually I never write directly in assembly; all the ASM snippets that I post here are compiler output (sometimes with some massaging).

I use VS 2010 as IDE along with the Intel XE 2011 C++ compiler for code generation.

For most performance-critical kernels I don't rely on the vectorizer but use a high-level framework (built around the intrinsics) based on C++ classes with inlined functions and operators. The code is very readable thanks to operator overloading; for example, simply writing

vImg = T*vSrc;

applies a 3D transform to packed vectors; it amounts to 9 packed multiplications + 9 packed additions (i.e. the equivalent of 144 scalar fp instructions with AVX).

The actual generated code depends on the code path; for example, register allocation will be different in 64-bit mode since there are more logical registers, the AVX2 path will use the FMA instructions, etc.

iliyapolak's picture

Regarding high-level OO languages, I mostly use Java (the obligatory language in my CS classes :( ) and, to a lesser extent, C++.
I see that you are interested in 3D graphics programming. I would like to recommend another classic book, "Physically Based Rendering"; I have the 1st edition. It is the most helpful book for learning 3D graphics programming, but it is heavy on advanced math. Following this book, I am trying to develop my own renderer.
I write it in Java; I have already written vector, point, rotation and transformation classes. Now I am struggling with the integrator class which describes the BSSRDF function; my main problem is calculating the integral, because I do not want to simply copy the book's code.
Do you know CUDA programming?

bronxzv's picture

I'm a 3D graphics old timer, so I have tons of books, including "Physically Based Rendering". I'm not sure it's the best for a newbie, though.

I have no practical experience with CUDA (just watched a few snippets in the GPU Gems / Shader X / GPU Pro series), which is a dead man walking anyway. I'm 110% convinced that the future of 3D graphics lies in languages at as high a level as possible, to cope with the ever increasing complexity of 3D engines. We are on the verge of being more constrained by programmer productivity than by raw FLOPS. Even C++ with operator overloading isn't enough for my taste; I would prefer a more readable notation (think Mathematica or the Word equation editor). At the moment we have to write a complex algorithm twice: one time for the source code and another time to document the thing

iliyapolak's picture

I had a great time with "Physically Based Rendering". I like to learn theory first. I made my first steps in 3D graphics with the help of books such as the "OpenGL Bible"; it is a great book for a newbie, but it lacks the advanced math needed to understand what rendering is.
I do not agree with you on the subject of programming languages for 3D graphics. I think that adding another layer of abstraction to hide the complexity of 3D algorithms is not good. A programmer must have knowledge of the inner workings of an algorithm, and of the technology the algorithm tries to describe, in order to understand it and to write better, more optimized code.
For example, at my Uni people do not take the assembly language class because it is not obligatory. Their understanding of the CPU architecture and OS inner workings suffers because they are accustomed to high-level languages like Java or C#.

bronxzv's picture

I'm not talking about hiding the algorithms; on the contrary, I want to express them very clearly (in a compact form) and only once: not one time at design, to document them for the rest of the team, and one time to explain to the compiler what I want.

iliyapolak's picture

Finally I understand you. You want to change the notation only?

bronxzv's picture

Mostly the notation, yes; that's what academic compiler people keep describing as "cosmetic" or "syntactic sugar". There are certainly good reasons all books use a mathematical notation for square roots, exponents, 1-to-N sums, division, etc.: it's far more readable for us humans, even after years of coding experience. Any complex project spends most of its lifecycle in the maintenance phase; very often you have only the source code at hand, and a refactoring effort (for a complex area) may well start with a very boring step where you attempt to retrieve the algorithm from the code and express it in an *equivalent*, more readable form

iliyapolak's picture

You are right: generic procedural or object-oriented programming languages do not employ pure mathematical notation. For example, solving a linear equation in C or Java is not an easy thing, because you have to deal with various indices, nested loops, pointers, etc. On the other hand, array programming languages like MATLAB mostly hide that from the programmer. I think that good old Fortran is better suited for mathematical programming.

Sergey Kostrov's picture
Quoting iliyapolak ...The best idea is to study various approximation methods and to compare them for speed and accuracy.

I'll provide you with performance numbers, compared to CRT-function 'sin', for:

Normalized Series ( up to 7th term )
Normalized Series ( up to 9th term )
Chebyshev Polynomial ( up to 7th term )
Chebyshev Polynomial ( up to 9th term )
Normalized Chebyshev Polynomial ( up to 7th term )
Normalized Chebyshev Polynomial ( up to 9th term )
Normalized Taylor Series ( up to 7th term )
Normalized Taylor Series ( up to 9th term )

My API is implemented in C as functions and macros.

Best regards,
Sergey

iliyapolak's picture

Thanks Sergey, you will save me a lot of coding :).
As far as I know, CRT sin uses a Taylor approximation (polynomial multiplication) up to the 14th term, combined with extensive range reduction and overflow checking, so we have a lot of branches in that code.
Sadly, my implementation of sin (written in Java) is based on the Taylor series: it is simply a for loop with the pow library function called from inside the loop, so it is not as fast as the other algorithms; I need to rewrite it.
Sergey, while comparing various approximations I think it is a good idea to code them in assembly language, in order to remove the compiler-generated code you get when compiling from a high-level language.
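For what it's worth, the pow and factorial calls can be avoided entirely by deriving each Taylor term from the previous one; a sketch in C (the same idea carries over directly to Java):

```c
/* sin(x) by Taylor series without pow() or factorial():
   each term is the previous one times  -x^2 / ((2k)*(2k+1)),
   so the whole loop is one multiply and one divide per term. */
static double sin_taylor(double x, int nterms)
{
    double term = x;        /* first term: x */
    double sum  = x;
    double x2   = x * x;
    for (int k = 1; k < nterms; k++) {
        term *= -x2 / ((2.0 * k) * (2.0 * k + 1.0));
        sum  += term;
    }
    return sum;
}
```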

bronxzv's picture

I'll be interested to hear about your findings (speed vs. precision) for the different methods.

AFAIK Taylor series are much worse than other approximations such as Legendre/Chebyshev/Jacobi polynomials or, best of all for minimizing the absolute error, minimax approximations.

iliyapolak's picture

In my library I was able to achieve an accuracy between 0.001 and 0.000001, depending on the formula being used. I remember that the worst accuracy was achieved with an alternating-sign polynomial, probably because of catastrophic cancellation. The results of the calculations were compared to Mathematica 8.

Tim Prince's picture

That's the reason for range reduction: so that Horner's method gives you the sum in order of increasing magnitude, using the well known precaution of evaluating
x + x**3 * (a3 + x**2 * (a5 + ...))
for the case a0==0, a1==1, a2==0, ....
As others indicated, Chebyshev or minimax economization of the polynomial gives you better convergence (and accuracy).

You've demonstrated fully that beginning by concentrating on intrinsics without getting the algorithm right is pointless.

iliyapolak's picture

Thanks TimP for your help. I must say that I put too much faith in the Taylor expansion;
I need to learn other methods.

Sergey Kostrov's picture
Hi Iliya,

Quoting iliyapolak ...As far as I know, CRT sin uses a Taylor approximation (polynomial multiplication) up to the 14th term...

I wonder how that is possible? All powers of the Taylor series for sin(X) are odd:

sin(X) ~= X - X^3/3! + X^5/5! - X^7/7! + X^9/9! - X^11/11! + X^13/13! - X^15/15! + X^17/17! - ...

Unless it is partially normalized:

sin(X) ~= X* (1 - X^2/3! + X^4/5! - X^6/7! + X^8/9! - X^10/11! + X^12/13! - X^14/15! )

iliyapolak's picture

Sorry, I made a mistake; my English is still not perfect. I meant the count of terms composing the Taylor series, not the 14th power of x.

Sergey Kostrov's picture
Quoting Sergey Kostrov
Quoting iliyapolak ...The best idea is to study various approximation methods and to compare them for speed and accuracy.

I'll provide you with performance numbers, compared to CRT-function 'sin', for:

Normalized Series ( up to 7th term )
Normalized Series ( up to 9th term )
Chebyshev Polynomial ( up to 7th term )
Chebyshev Polynomial ( up to 9th term )
Normalized Chebyshev Polynomial ( up to 7th term )
Normalized Chebyshev Polynomial ( up to 9th term )
Normalized Taylor Series ( up to 7th term )
Normalized Taylor Series ( up to 9th term )

My API is implemented in C as functions and macros.

Here are results for C functions:

Normalized Chebyshev Polynomial 7t Sin(45)=0.70710740923532034 - 140 ticks AE=+0.00000062804877288 BT

Normalized Series               7t Sin(45)=0.70710647085826561 - 141 ticks AE=-0.00000031032828185

Normalized Taylor Series        7t Sin(45)=0.70710646957517809 - 141 ticks AE=-0.00000031161136937

Normalized Taylor Series        9t Sin(45)=0.70710678293686713 - 156 ticks AE=+0.00000000175031967

Normalized Chebyshev Polynomial 9t Sin(45)=0.70710678268237914 - 172 ticks AE=+0.00000000149583168 BA

Normalized Series               9t Sin(45)=0.70710678421995354 - 187 ticks AE=+0.00000000303340608

Interpolated                       Sin(45)=0.70710676908493042 - 187 ticks AE=-0.00000001210161704

Chebyshev Polynomial            7t Sin(45)=0.70710740923532023 - 188 ticks AE=+0.00000062804877277

Chebyshev Polynomial            9t Sin(45)=0.70710678268237925 - 219 ticks AE=+0.00000000149583179

CRT                                Sin(45)=0.70710678118654746 - 312 ticks AE=N/A

Notes:

1. Visual Studio 2005 / Microsoft C++ compiler / Release configuration / All Optimizations Disabled
2. All tests for a Double-Precision data type ( 'double' ) / FPU precision was set to 53-bits
3. Every test was executed 4,194,304 ( 2^22 ) times with a Process priority set to Normal
4. These functions do not verify input parameters
5. A '7t' means up to 7th term
6. Interpolated Sine function uses a 1.00 degree step for a LUT of Sine Values generated by a CRT-function 'sin'
7. AE ( Absolute Error )= User-function-Value - CRT-function-Value
8. BT - Best Time / BA - Best Accuracy

Best regards,
Sergey

Sergey Kostrov's picture
Quoting iliyapolak ...The best idea is to study various approximation methods and to compare them for speed and accuracy...

Hi Iliya,

Take into account that the accuracy of the sine approximation varies. This is how it looks for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to the 9th term:

...

           Normalized

           Chebyshev Polynomial 9t   CRT-function 'sin'
Sin( 0) =  0.00000000000000000000 =  0.00000000000000000000 - AE =  0.000000000000000000

Sin( 1) =  0.01745240642776235700 =  0.01745240643728351200 - AE = -0.000000000009521155

Sin( 2) =  0.03489949668347069200 =  0.03489949670250096900 - AE = -0.000000000019030277

Sin( 3) =  0.05233595621442847100 =  0.05233595624294383500 - AE = -0.000000000028515364

Sin( 4) =  0.06975647370616093500 =  0.06975647374412530200 - AE = -0.000000000037964368

Sin( 5) =  0.08715574270029286000 =  0.08715574274765816600 - AE = -0.000000000047365306

Sin( 6) =  0.10452846321094722000 =  0.10452846326765347000 - AE = -0.000000000056706251

Sin( 7) =  0.12186934333917229000 =  0.12186934340514748000 - AE = -0.000000000065975184

Sin( 8) =  0.13917310088490520000 =  0.13917310096006544000 - AE = -0.000000000075160239

Sin( 9) =  0.15643446495598134000 =  0.15643446504023087000 - AE = -0.000000000084249524

Sin(10) =  0.17364817757369913000 =  0.17364817766693033000 - AE = -0.000000000093231201

Sin(11) =  0.19080899527445139000 =  0.19080899537654480000 - AE = -0.000000000102093417

Sin(12) =  0.20791169070693508000 =  0.20791169081775934000 - AE = -0.000000000110824266

Sin(13) =  0.22495105422445330000 =  0.22495105434386500000 - AE = -0.000000000119411703

Sin(14) =  0.24192189547182422000 =  0.24192189559966773000 - AE = -0.000000000127843514

Sin(15) =  0.25881904496641395000 =  0.25881904510252074000 - AE = -0.000000000136106793

Sin(16) =  0.27563735567281122000 =  0.27563735581699916000 - AE = -0.000000000144187939

Sin(17) =  0.29237170457066503000 =  0.29237170472273677000 - AE = -0.000000000152071744

Sin(18) =  0.30901699421520712000 =  0.30901699437494740000 - AE = -0.000000000159740277

Sin(19) =  0.32556815428998409000 =  0.32556815445715670000 - AE = -0.000000000167172609

Sin(20) =  0.34202014315132739000 =  0.34202014332566871000 - AE = -0.000000000174341319

Sin(21) =  0.35836794936408850000 =  0.35836794954530027000 - AE = -0.000000000181211768

Sin(22) =  0.37460659322817519000 =  0.37460659341591201000 - AE = -0.000000000187736826

Sin(23) =  0.39073112829542028000 =  0.39073112848927377000 - AE = -0.000000000193853489

Sin(24) =  0.40673664287632461000 =  0.40673664307580021000 - AE = -0.000000000199475603

Sin(25) =  0.42261826153621362000 =  0.42261826174069944000 - AE = -0.000000000204485817

Sin(26) =  0.43837114658035231000 =  0.43837114678907740000 - AE = -0.000000000208725093

Sin(27) =  0.45399049952756826000 =  0.45399049973954675000 - AE = -0.000000000211978490

Sin(28) =  0.46947156257193301000 =  0.46947156278589081000 - AE = -0.000000000213957796

Sin(29) =  0.48480962003205796000 =  0.48480962024633706000 - AE = -0.000000000214279094

Sin(30) =  0.49999999978756438000 =  0.49999999999999994000 - AE = -0.000000000212435569

Sin(31) =  0.51503807470229079000 =  0.51503807491005416000 - AE = -0.000000000207763362

Sin(32) =  0.52991926403380607000 =  0.52991926423320490000 - AE = -0.000000000199398831

Sin(33) =  0.54463903482879883000 =  0.54463903501502708000 - AE = -0.000000000186228255

Sin(34) =  0.55919290330392202000 =  0.55919290347074690000 - AE = -0.000000000166824887

Sin(35) =  0.57357643621167165000 =  0.57357643635104605000 - AE = -0.000000000139374401

Sin(36) =  0.58778525219088762000 =  0.58778525229247314000 - AE = -0.000000000101585518

Sin(37) =  0.60181502310146640000 =  0.60181502315204827000 - AE = -0.000000000050581872

Sin(38) =  0.61566147534288251000 =  0.61566147532565829000 - AE =  0.000000000017224222

Sin(39) =  0.62932039115611915000 =  0.62932039104983739000 - AE =  0.000000000106281761

Sin(40) =  0.64278760990861683000 =  0.64278760968653925000 - AE =  0.000000000222077579

Sin(41) =  0.65605902936184934000 =  0.65605902899050728000 - AE =  0.000000000371342068

Sin(42) =  0.66913060692114923000 =  0.66913060635885824000 - AE =  0.000000000562290992

Sin(43) =  0.68199836086740095000 =  0.68199836006249848000 - AE =  0.000000000804902478

Sin(44) =  0.69465837157023880000 =  0.69465837045899725000 - AE =  0.000000001111241543

Sin(45) =  0.70710678268237914000 =  0.70710678118654746000 - AE =  0.000000001495831681

Sin(46) =  0.71933980231473449000 =  0.71933980033865108000 - AE =  0.000000001976083408

Sin(47) =  0.73135370419195334000 =  0.73135370161917046000 - AE =  0.000000002572782876

Sin(48) =  0.74314482878804677000 =  0.74314482547739424000 - AE =  0.000000003310652530

Sin(49) =  0.75470958444175995000 =  0.75470958022277201000 - AE =  0.000000004218987937

Sin(50) =  0.76604444845135888000 =  0.76604444311897801000 - AE =  0.000000005332380870

Sin(51) =  0.77714596814850889000 =  0.77714596145697090000 - AE =  0.000000006691537990

Sin(52) =  0.78801076195092989000 =  0.78801075360672201000 - AE =  0.000000008344207880

Sin(53) =  0.79863552039351415000 =  0.79863551004729283000 - AE =  0.000000010346221324

Sin(54) =  0.80901700713761193000 =  0.80901699437494745000 - AE =  0.000000012762664481

Sin(55) =  0.81915205995818574000 =  0.81915204428899180000 - AE =  0.000000015669193942

Sin(56) =  0.82903759170854807000 =  0.82903757255504174000 - AE =  0.000000019153506337

Sin(57) =  0.83867059126240617000 =  0.83867056794542405000 - AE =  0.000000023316982123

Sin(58) =  0.84804812443294242000 =  0.84804809615642596000 - AE =  0.000000028276516462

Sin(59) =  0.85716733486866747000 =  0.85716730070211233000 - AE =  0.000000034166555141

Sin(60) =  0.86602544492579625000 =  0.86602540378443860000 - AE =  0.000000041141357654

Sin(61) =  0.87461975651689605000 =  0.87461970713939574000 - AE =  0.000000049377500311

Sin(62) =  0.88294765193557789000 =  0.88294759285892688000 - AE =  0.000000059076651016

Sin(63) =  0.89100659465699772000 =  0.89100652418836779000 - AE =  0.000000070468629931

Sin(64) =  0.89879413011395126000 =  0.89879404629916704000 - AE =  0.000000083814784224

Sin(65) =  0.90630788644835647000 =  0.90630778703664994000 - AE =  0.000000099411706533

Sin(66) =  0.91354557523791691000 =  0.91354545764260087000 - AE =  0.000000117595316040

Sin(67) =  0.92050499219778437000 =  0.92050485345244037000 - AE =  0.000000138745343992

Sin(68) =  0.92718401785703108000 =  0.92718385456678742000 - AE =  0.000000163290243660

Sin(69) =  0.93358061820976956000 =  0.93358042649720174000 - AE =  0.000000191712567821

Sin(70) =  0.93969284534074993000 =  0.93969262078590832000 - AE =  0.000000224554841610

Sin(71) =  0.94551883802529291000 =  0.94551857559931674000 - AE =  0.000000262425976172

Sin(72) =  0.95105682230340971000 =  0.95105651629515353000 - AE =  0.000000306008256179

Sin(73) =  0.95630511202798463000 =  0.95630475596303544000 - AE =  0.000000356064949192

Sin(74) =  0.96126210938689549000 =  0.96126169593831889000 - AE =  0.000000413448576597

Sin(75) =  0.96592630539896773000 =  0.96592582628906831000 - AE =  0.000000479109899421

Sin(76) =  0.97029628038365956000 =  0.97029572627599647000 - AE =  0.000000554107663087

Sin(77) =  0.97437070440439277000 =  0.97437006478523525000 - AE =  0.000000639619157528

Sin(78) =  0.97814833768545340000 =  0.97814760073380558000 - AE =  0.000000736951647817

Sin(79) =  0.98162803100239626000 =  0.98162718344766398000 - AE =  0.000000847554732286

Sin(80) =  0.98480872604589964000 =  0.98480775301220802000 - AE =  0.000000973033691620

Sin(81) =  0.98768945575902911000 =  0.98768834059513777000 - AE =  0.000001115163891341

Sin(82) =  0.99026934464787952000 =  0.99026806874157036000 - AE =  0.000001275906309162

Sin(83) =  0.99254760906557782000 =  0.99254615164132198000 - AE =  0.000001457424255835

Sin(84) =  0.99452355746963761000 =  0.99452189536827329000 - AE =  0.000001662101364319

Sin(85) =  0.99619659065267530000 =  0.99619469809174555000 - AE =  0.000001892560929750

Sin(86) =  0.99756620194650159000 =  0.99756405025982420000 - AE =  0.000002151686677387

Sin(87) =  0.99863197739962362000 =  0.99862953475457383000 - AE =  0.000002442645049783

Sin(88) =  0.99939359592819632000 =  0.99939082701909576000 - AE =  0.000002768909100559

Sin(89) =  0.99985082944048176000 =  0.99984769515639127000 - AE =  0.000003134284090489

Sin(90) =  1.00000354293488590000 =  1.00000000000000000000 - AE =  0.000003542934885914

...

Sergey Kostrov's picture

Precision Control of FPU also needs to be taken into account and here is example ( for 'double' data type ):

     24-bit accuracy

        CRT                      Sin(   45.0 ) =  0.7071067966408575200000

        Normalized:

        Chebyshev Polynomial 9t  Sin(   45.0 ) =  0.7071067690849304200000
     53-bit accuracy

        CRT                      Sin(   45.0 ) =  0.7071067811865474600000

        Normalized:

        Chebyshev Polynomial 9t  Sin(   45.0 ) =  0.7071067826823791400000
     64-bit accuracy

        CRT                      Sin(   45.0 ) =  0.7071067811865474600000

        Normalized:

        Chebyshev Polynomial 9t  Sin(   45.0 ) =  0.7071067826823791400000
     Default accuracy

        CRT                      Sin(   45.0 ) =  0.7071067811865474600000

        Normalized:

        Chebyshev Polynomial 9t  Sin(   45.0 ) =  0.7071067826823791400000

As you can see, the 53-bit, 64-bit and Default results are identical. I always recommend using the Default settings
for the FPU; in that case, results on different platforms with different C/C++ compilers will be highly consistent!

Best regards,
Sergey

Tim Prince's picture

I can't get much information from what you just said. I don't know whether you mean hardware or software default, in the case of x87 FPU.
The hardware default for x87 FPU is 64-bit precision mode, but Windows and Intel compiler practice is to set (by "default") 53-bit precision (which gives same results as SSE2 double, aside from exponent range). Microsoft libraries don't support the Intel options /Qlong-double -pc80, so their results aren't predictable for that case; printf will see only 64-bit doubles, never 80-bit long doubles.
gnu compilers don't set 53-bit precision mode in their run-time initialization, so there you get 64-bit precision mode. Linux glibc libraries have full support for long-double 64-bit precision mode. MinGW compilers rely on Microsoft libraries, thus no library support for long double.
When printf formats a double, IEEE standards support zeroing out the insignificant decimal digits, but nothing requires that various compilers make this same choice.

iliyapolak's picture

I must say that you have done a great job; thanks for posting that. I have a few questions regarding your methodology.
How did you make the running-time measurement? Did you use the KeGetTickCount function, or maybe an inline rdtsc?
For the interpolated sine, which interpolation did you use: linear, or Lagrange (quadratic or cubic)?
For the normalized Taylor series, did you use a polynomial fit or a for loop, and how did you calculate the powers of the radian argument (many times I have used calls to the library function pow, but it is an inefficient method)?
Were the Chebyshev polynomial coefficients precalculated, or did your measurement take into account the calculation of the Chebyshev coefficients (many calls to the library cos)?

iliyapolak's picture

Sergey
The CRT sin implementation is more than two times slower than your library's normalized Taylor series sin. I think that extensive range reduction, input checking and overflow checking are responsible for that.

iliyapolak's picture

Take into account that the accuracy of the sine approximation varies. This is how it looks for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to the 9th term

If I am not wrong, the CRT sin function uses or is based on the FDLIBM implementation. I can see that the AE rises as the angle gets close to 90 degrees. When I was checking my library's sine function for accuracy I also saw
such behaviour: when the argument was closer to Pi/2 the error was larger compared to smaller angles closer to 0.
I suppose that this is Taylor-series-induced inaccuracy.

bronxzv's picture

we can see that the error varies wildly within a range of values; that's why, when comparing algorithms, the standard methods are to compare the maximum absolute error or the average absolute error over the domain

iliyapolak's picture

I suppose that Sergey did some averaging of the error values. He wrote that he ran each test 2^22 times, but he did not say how many values of the 2^22 tests were averaged.

Tim Prince's picture

As the convergence properties of the Taylor series for sin(x) are poor at Pi/2, this isn't a satisfactory method. Chebyshev or minimax economization corrects this. However, given that range reduction is needed in general, it is likely to be used to reduce the range of series evaluation to |x| < Pi/4.
For quick and dirty usage, a rational approximation may be better if it is to cover the entire range |x| < Pi/2. As tan() lends itself to a rational approximation (with an odd polynomial in the numerator and an even one in the denominator), the trigonometric formulae can be used to deduce the appropriate forms for cos() and sin().
Minimax economization of the numerator and denominator polynomials is needed to make this competitive.
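To make the rational-approximation idea concrete, truncating the continued fraction tan x = x/(1 - x^2/(3 - x^2/(5 - ...))) gives a small odd-over-even rational form; a sketch in C (these coefficients come from the truncation itself, not from a minimax fit, so this is only a starting point):

```c
/* tan(x) ~= x*(15 - x^2) / (15 - 6*x^2), obtained by truncating the
   continued fraction tan x = x/(1 - x^2/(3 - x^2/5)).
   Odd numerator, even denominator; reasonable only well inside |x| < Pi/4. */
static double tan_pade(double x)
{
    double x2 = x * x;
    return x * (15.0 - x2) / (15.0 - 6.0 * x2);
}
```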

iliyapolak's picture

It seems that I need to rewrite my library completely, because I did not take into account the Taylor expansion inaccuracies.

iliyapolak's picture

TimP,
A slightly off-topic question: is it possible for the Intel profiler (the one that is part of Parallel Studio) to profile a Java class file? I am interested in finding hot-spots, function call counts and the CPU time spent calculating various pieces of code.

bronxzv's picture

Also don't forget to use Horner's scheme to evaluate your polynomials, as suggested by TimP. See an example for sin(x) below (coefficients taken from Abramowitz & Stegun, 4.3.97, p. 76); the maximum absolute error is around 2.1e-9 with only 5 terms (based on actual measurements).

inline double EvalPoly5(double a0, double a1, double a2, double a3, double a4, double a5, double x)
  // Returns a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
{
  return ((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0;
}

inline double SinApprox(double x)
  // 0 <= x <= Pi/2
{
  const double a0 = 1.0, a2 = -.1666666664, a4 = 0.0083333315, a6 = -.0001984090, a8 = .0000027526, a10 = -.0000000239;
  return EvalPoly5(a0,a2,a4,a6,a8,a10,x*x)*x;
}

As you can see, this kind of code is straightforward to vectorize, and the polynomial evaluation is a perfect fit for FMA.
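For readers following along in Java (the language iliyapolak uses), the same Horner evaluation maps naturally onto `Math.fma`, available since Java 9. The following is a hypothetical port of the snippet above, reusing the A&S 4.3.97 coefficients; the class name is made up.

```java
// Hypothetical Java port of the Horner-scheme sin(x) approximation above.
// Math.fma (Java 9+) compiles to a fused multiply-add where the CPU supports it.
public class SinHorner {
    // Valid for 0 <= x <= Pi/2; max absolute error around 2.1e-9 per A&S 4.3.97.
    public static double sinApprox(double x) {
        final double a0 = 1.0, a2 = -0.1666666664, a4 = 0.0083333315,
                     a6 = -0.0001984090, a8 = 0.0000027526, a10 = -0.0000000239;
        double x2 = x * x;
        double p = Math.fma(a10, x2, a8);  // innermost Horner step
        p = Math.fma(p, x2, a6);
        p = Math.fma(p, x2, a4);
        p = Math.fma(p, x2, a2);
        p = Math.fma(p, x2, a0);
        return p * x;
    }

    public static void main(String[] args) {
        double x = Math.PI / 4;
        System.out.println(sinApprox(x) + " vs " + Math.sin(x));
    }
}
```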

iliyapolak's picture

I took a different approach with my sin function, definitely not a fast one.
I used a for loop with a compound statement calculating factorials, and Math.pow library calls to calculate powers of the argument. The results were summed in an array. As I said, very slow compared to your method.

For example, here is code for the Gamma Stirling approximation (formula and coefficients taken from Abramowitz and Stegun), written in Java.

public double GammaStirlingApprox(double x){
	double sum = 0;
	double result = 0;
	if(x > Double.MAX_VALUE){
		return Double.NaN;
	}else{
		double ln;
		double pow;
		double pi_sqrt;
		double two_pi = 2*Math.PI;
		double[]coef = {12.0,288.0,51840.0,2488320.0,209018880.0,75246796800.0,902961561600.0,86684309913600.0};
		double[]coef2 = {1.0,1.0,139.0,571.0,163879.0,5246819.0,534703531.0,4483131259.0};
		double[]array = new double[coef.length];
		double[]array2 = new double[coef.length];
		int i;
		int j;
		int k = 0;
		for(i = 0; i < coef.length; i++){   // powers x^1 .. x^8
			++k;
			array[i] = Math.pow(x, k);
		}
		for(j = 0; j < coef.length; j++){   // terms coef2[j] / (coef[j] * x^(j+1))
			array2[j] = array[j]*coef[j];
			array2[j] = coef2[j]/array2[j];
		}
		ln = Math.exp(-x);
		pow = Math.pow(x, x-0.5);
		pi_sqrt = Math.sqrt(two_pi);
		sum = ln*pow*pi_sqrt;
		result = 1+array2[0]+array2[1]-array2[2]-array2[3]+array2[4]+array2[5]-array2[6]-array2[7];
	}
	return sum*result;
}

Sergey Kostrov's picture

Hi everybody, This is my consolidated response on posts from Iliyapolak:

>>...How did you make a running-time measurement?

In codes it looks like:

	...
	// Sub-Test 1.1 - CRT
	{
	///*
		CrtPrintf( RTU("Sub-Test 1.1\n") );
		g_uiTicksStart = ::GetTickCount();
		for( RTint t = 0; t < NUMBER_OF_TESTS; t++ )
		{
			dResult = CrtSin( dA );
		}
		CrtPrintf( RTU("Completed in %ld ticks\n"), ( RTint )( ::GetTickCount() - g_uiTicksStart ) );
		CrtPrintf( RTU("\tCRT                      Sin( % 6.1f ) = % 24.22f\n"), dA, dResult );
	//*/
	}
	...

>>...Did you use KeGetTickCount function, or maybe an inline rdtsc?

Please see above. I use a Win32 API function 'GetTickCount'.

>>...For interpolated sine what interpolation did you use Linear interpolation...

Yes.

>>Lagrange interpolation?..

No, unfortunately, it is on hold for a long time.

>>...For Normalized Taylor Series did you use polynomial fit or maybe for loop and how you calculated powers...

In codes it looks like:

...

// Normalized Series ( up to 7th term )
// Sin(x) = x*( a - x^2*( b - x^2*( c - x^2*d )))
//		a = 1.570796329000
//		b = 0.645964096000
//		c = 0.079692625990
//		d = 0.004681754102

RTdouble CalcSinNS7( RTdouble dX )
{
	...
	return ( RTdouble )( dX * ( 1.570796329L - dX * dX * ( 0.645964096000L - dX * dX *
							  ( 0.0796926259900L - dX * dX * ( 0.004681754102L ) ) ) ) * iSign );
}

...

>>...Were Chebyshev polynomial's coefficients pre-calculated...

Yes.

>>...did your measuremant take into account calculation of Chebyshev coefficients...

No, because all Chebyshev coefficients are pre-calculated and saved in constants.

>>...( many calls to library cos)...

No, a different method ( Telescoping a Series ) was used.

>>...CRT sin implementation is more than two times slower than your library Normalized Taylor series sin
>>I think that extensive range reduction, input checking and overflow checking are responsible for that...

Yes, that's correct. A special set of macros to calculate sine, cosine, etc. is even faster because all
overheads ( like making a call, creating the stack, clean-ups, return ) are minimized.

>>...I suppose that Sergey did some averaging of error's values. He wrote that he ran his tests 2^22 times,
>>but he did not say how many values of 2^22 tests were averaged...

No. For example, for all calls like 'CalcNTS9(45)' the Absolute Error is the same.

>>...I used for loop with compound statement calculating factorials and Math.pow library calls to calculate
>>powers of an argument. The results were summed in array. As I said very slow compared to your method...

I agree that it affects the performance of your calculations. As another example, this is how a macro declaration
looks:

...

RTdouble g_dNF3 = 1.0L / 6.0L;					//  3! = 6
RTdouble g_dNF5 = 1.0L / 120.0L;				//  5! = 120
RTdouble g_dNF7 = 1.0L / 5040.0L;				//  7! = 5040
RTdouble g_dNF9 = 1.0L / 362880.0L;				//  9! = 362880

...

#define CrtSinNTS9( A )		( A - A * A * A * ( g_dNF3 - A * A * ( g_dNF5 - A * A * \
						  ( g_dNF7 - g_dNF9 * A * A ) ) ) )

...
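A hypothetical Java rendering of the same idea (the identifiers below are invented; Sergey's actual code is C/C++): precomputing the reciprocal factorials removes every division from the per-call path, leaving only multiplies and adds.

```java
// Hypothetical Java equivalent of the CrtSinNTS9-style macro above:
// Taylor series of sin(A) up to the 9th term with precomputed
// reciprocal factorials, so no division happens at call time.
public class SinNTS9 {
    static final double NF3 = 1.0 / 6.0;      //  3! = 6
    static final double NF5 = 1.0 / 120.0;    //  5! = 120
    static final double NF7 = 1.0 / 5040.0;   //  7! = 5040
    static final double NF9 = 1.0 / 362880.0; //  9! = 362880

    public static double sinNTS9(double a) {
        double a2 = a * a;
        return a - a * a2 * (NF3 - a2 * (NF5 - a2 * (NF7 - NF9 * a2)));
    }

    public static void main(String[] args) {
        System.out.println(sinNTS9(0.5) + " vs " + Math.sin(0.5));
    }
}
```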

Sergey Kostrov's picture

Hi everybody, This is my consolidated response on a post from Bronxzv:

>>...We can see that the error varies wildly within a range of values, that's why when comparing
>>algorithms the standard methods are to compare the maximum absolute error or the average
>>absolute error over the domain...

It could be a set of different statistical methods like:

- Min error
- Max error
- Average error
- Standard Deviation of all errors

and it depends on how important that analysis is for a given project. There is no question that for mission-critical
applications it must be done.
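As a sketch of how such statistics can be collected, the following Java snippet samples an approximation over [0, Pi/2] against Math.sin and reports the min, max and average absolute error. The approximation under test (a plain 9-term Taylor series) and the sample count are arbitrary choices made for this illustration.

```java
// Minimal sketch: min / max / average absolute error of an approximation
// over [0, Pi/2], sampled at evenly spaced points.
public class ErrorStats {
    // Plain Taylor series of sin(x) up to the 9th term; used here only as
    // the approximation under test.
    public static double approxSin(double x) {
        double x2 = x * x;
        return x - x * x2 * (1.0 / 6 - x2 * (1.0 / 120 - x2 * (1.0 / 5040 - x2 / 362880)));
    }

    public static void main(String[] args) {
        int n = 1 << 16;                        // number of sample intervals
        double min = Double.MAX_VALUE, max = 0.0, sum = 0.0;
        for (int i = 0; i <= n; i++) {
            double x = (Math.PI / 2) * i / n;   // evenly spaced over [0, Pi/2]
            double err = Math.abs(approxSin(x) - Math.sin(x));
            min = Math.min(min, err);
            max = Math.max(max, err);
            sum += err;
        }
        System.out.printf("min=%e max=%e avg=%e%n", min, max, sum / (n + 1));
    }
}
```

As the discussion predicts, the maximum error lands at Pi/2 (a few units of 1e-6 for this truncated Taylor series), while the error near 0 is essentially zero.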

Sergey Kostrov's picture

Hi everybody, This is my consolidated response on posts from TimP:

>>...I can't get much information from what you just said. I don't know whether you mean hardware or
>>software default, in the case of x87 FPU...

I control precision with the CRT function '_control87', because '_control87' changes the control words
for both the x87 and the SSE2 units. In codes it looks like:

	...

//	uiControlWordx87 = CrtControl87( _RTFPU_PC_24, _RTFPU_MCW_PC );

	uiControlWordx87 = CrtControl87( _RTFPU_PC_53, _RTFPU_MCW_PC );

//	uiControlWordx87 = CrtControl87( _RTFPU_PC_64, _RTFPU_MCW_PC );

//	uiControlWordx87 = CrtControl87( _RTFPU_CW_DEFAULT, _RTFPU_MCW_PC );

	...

>>...MinGW compilers rely on Microsoft libraries...

Please take a look at a case with some inconsistency... I missed it about 1.5 years ago during intensive testing and
detected it recently.

I have a couple of tangent methods in a template class and here are results when a 'float' data type
is used:

MSC ( VS2005 SP1 / Windows CE / MSVCR80.DLL for ARMV4 CPU used ) - Microsoft C++ compiler
	CRT                   CrtTan(45) =  1.0000000000000000000000

	Normalized TS    msF.TanNTS9(45) =  1.0000000000000000000000

	Normalized TS   msF.TanNTS11(45) =  1.0000000000000000000000
MSC ( VS2005 SP1 / Windows XP / MSVCR80.DLL used               ) - Microsoft C++ compiler
	CRT                   CrtTan(45) =  1.0000000000000000000000

	Normalized TS    msF.TanNTS9(45) =  0.9999999403953552200000               <- Incorrect!

	Normalized TS   msF.TanNTS11(45) =  1.0000000000000000000000
MGW ( v3.4.2     / Windows XP / Uses MSVCRT.DLL                ) - MinGW C++ compiler
	CRT                   CrtTan(45) =  1.0000000000000000000000

	Normalized TS    msF.TanNTS9(45) =  1.0000000000000000000000

	Normalized TS   msF.TanNTS11(45) =  1.0000000000000000000000
BCC ( v5.5.1     / Windows XP / Doesn't Use Any Microsoft DLLs ) - Borland C++ compiler
	CRT                   CrtTan(45) =  1.0000000000000000000000

	Normalized TS    msF.TanNTS9(45) =  1.0000000000000000000000

	Normalized TS   msF.TanNTS11(45) =  1.0000000000000000000000
TCC ( v3.0.0     / MS-DOS     / Doesn't Use Any Microsoft DLLs ) - Turbo C++ compiler
	CRT                   CrtTan(45) =  1.0000000000000000000000

	Normalized TS    msF.TanNTS9(45) =  1.0000000000000000000000

	Normalized TS   msF.TanNTS11(45) =  1.0000000000000000000000

Summary is as follows:

A Win32 console application that uses MSVCR80.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:

msF.TanNTS9(45) = 0.9999999403953552200000

Compiled by Microsoft C++ compiler on Windows XP Desktop 32-bit platform.

A Win32 console application that uses MSVCRT.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:

msF.TanNTS9(45) = 1.0000000000000000000000

Compiled by MinGW C++ compiler on Windows XP Desktop 32-bit platform.

At the moment I can't explain why the result of Tan(45) is not equal to '1.0'. I suppose this is due to some differences in
the CRT-function '_control87' in these two DLLs, but I could be wrong.

iliyapolak's picture

Sergey!
Does your library also contain various implementations of special functions?
For example, here is my code which calculates the Bessel function of the first kind, J0(x). The polynomial approximation is taken from Abramowitz and Stegun, p. 369, 9.4.1.
The first part of the code contains for loops with array multiplications, while the second part is based on polynomial multiplications. I would like to ask you how to accurately measure the running time of these two parts of the code?

public double BesselJ0(double x){
	double sum = 0;
	double result = 0;
	if(x >= -3 && x <= 3){              // A&S 9.4.1: valid for |x| <= 3
		double third = x/3;
		int k = 0;
		int exp = 0;
		int i,j;
		double[]coef = {-2.2499997,
				1.2656208,
				-0.3163866,
				0.0444479,
				-0.0039444,
				0.0002100
		};
		double[]array = new double[coef.length];
		double small = 4.5*1e-8;
		double b = 0;
		for(i = 0; i < coef.length; i++){
			++k;
			exp = 2*k;                  // even powers (x/3)^2, (x/3)^4, ...
			b = Math.pow(third, exp);
			array[i] = b;
			array[i] = array[i]*coef[i];
			System.out.println("exp \t" + exp + " array \t" + array[i] + " ");
		}
		for(j = 0; j < array.length; j++){
			sum += array[j];
		}
		sum = sum+small;
		return 1+sum;
	}else if(x > 3.0 && x < Double.MAX_VALUE){   // A&S 9.4.3: valid for x > 3
		double sqr = Math.sqrt(x);
		sqr = 1/sqr;
		double f = 0;
		double cos = 0;
		double theta = 0;
		double three = 3.0/x;
		double term = 0.79788456;
		double term2 = 0.78539816;
		double[]coef2 = {-0.00000077,
				-0.00552740,
				-0.00009512,
				0.00137237,
				-0.00072805,
				0.00014476
		};
		double[]coef3 = {-0.04166397,
				-0.00003954,
				0.00262573,
				-0.00054125,
				-0.00029333,
				0.00013558
		};
		double c = 0;
		double d = 0;
		double three2 = 3.0/x;
		double small = 1.5*1e-8;
		// Horner evaluation of the A&S 9.4.3 polynomials; the coefficients
		// already carry their signs, so the results are added below.
		c = three*(coef2[0]+three*(coef2[1]+three*(coef2[2]+three*(coef2[3]+three*(coef2[4]+three*coef2[5])))));
		d = three2*(coef3[0]+three2*(coef3[1]+three2*(coef3[2]+three2*(coef3[3]+three2*(coef3[4]+three2*coef3[5])))));
		System.out.println("c \t" + c + " d \t" + d);
		f = term + c;
		theta = x-term2+d;
		cos = Math.cos(theta);
		result = sqr*f*cos;
		System.out.println("sqr \t" + sqr + " f \t" + f + " cos \t" + cos);
	}
	return result;
}

Tim Prince's picture

MSVC and ICL compiled Win32 applications include the equivalent of a _control87 function call to set 53-bit precision mode. You can reset with your own _control87 function. Mingw compiled win32 (but not X64) applications would be expected to start up in 64-bit precision mode. Any result formatted by printf will have been rounded to double, so digits beyond 17 would be meaningless if they had not been set to zero.
Your first quoted result looks as if it had been calculated in float (24-bit precision). There is a _control87 setting which does that.

Sergey Kostrov's picture
Quoting iliyapolak ...Does your library also contain various implementations of special functions too?...

No

...I would like to ask you how to accurately measure the running time of these two parts of code?

You could consider two versions and here are pseudo codes:

Version 1 - Uses Win32 API function GetTickCount

...
Declare Start, End, Diff1, Diff2 variables ( int32 )

Function()
{
Start = GetTickCount()
//Codes for Part1...
End = GetTickCount()
Diff1 = End - Start

Start = GetTickCount()
//Codes for Part2...
End = GetTickCount()
Diff2 = End - Start

return
}
...

Version 2 - Uses RDTSC instruction ( more accurate measurements )

...
Declare Start, End, Diff1, Diff2 variables ( int64 )

Function()
{
Start = RDTSC
//Codes for Part1...
End = RDTSC
Diff1 = End - Start

Start = RDTSC
//Codes for Part2...
End = RDTSC
Diff2 = End - Start

return
}
...
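For the Java case discussed below, the closest standard equivalent of both versions is `System.nanoTime()`; whether the JVM ultimately reads the TSC is platform-dependent. A minimal sketch of the same two-part pattern (the two workloads are placeholders to be replaced by the real Part 1 / Part 2 code):

```java
// Sketch of the two-part timing pattern above using System.nanoTime().
public class TwoPartTiming {
    // Times a single part of the code, in nanoseconds.
    public static long timeNanos(Runnable part) {
        long start = System.nanoTime();
        part.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Placeholder workloads; substitute the real Part 1 / Part 2 code.
        long diff1 = timeNanos(() -> {
            double s = 0;
            for (int i = 0; i < 1_000_000; i++) s += Math.sin(i * 1e-6);
        });
        long diff2 = timeNanos(() -> {
            double s = 0;
            for (int i = 0; i < 1_000_000; i++) s += Math.cos(i * 1e-6);
        });
        System.out.println("Part 1: " + diff1 + " ns, Part 2: " + diff2 + " ns");
    }
}
```

As with GetTickCount or RDTSC, each part should be run many times and the measurement repeated, since a single timed run is dominated by noise (and, on the JVM, by JIT warm-up).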

iliyapolak's picture

Sergey!
I cannot use those functions directly because my code is written in Java, but the time difference calculation can be performed exactly as you wrote. Afaik the Java system library contains a high-precision time measurement method, which I think is translated directly by jvm.dll to RDTSC; the overhead of such a translation could be significant only if the JVM JIT compiler does not cache the translation (does not recognize it as hot-spot code).

Sergey Kostrov's picture
Quoting iliyapolak ...such a behaviour when an argument was closer to Pi/2 the error was larger compared to smaller angles closer to 0.
I suppose that this is Taylor-series-induced inaccuracy...

I would also mention these two problems:

- Limitations of the IEEE 754 standard
- These series and polynomials are only approximating a curve of some function

Consider an equation for a Normalized Series up to the 9th term that calculates sine:

Sin(x) ~= x*( a - x^2*( b - x^2*( c - x^2*( d - x^2*e )))) ( S )

All coefficients are related to PI, that is, a = PI/2, b = ((PI/2)^3)/3!, and so on.

In the equation S two sets of coefficients could be used. For example:

Set of Coefficients 1		Set of Coefficients 2
less accurate ( SC1 )		very accurate ( SC2 )

a = 1.5707963290000		a = 1.57079632679489661923132169163980000
b = 0.6459640960000		b = 0.64596409750624625365575656389795000
c = 0.0796926259900		c = 0.07969262624616704512050554949047800
d = 0.0046817541020		d = 0.00468175413531868810068546393395340
e = 0.0001604411842		e = 0.00016044118478735982187266087016347

Then, for a double-precision data type and 53-bit FPU precision, sine of 30 degrees is:

Normalized Series up to 9th term with SC1 Sin(30) = 0.5000000008100623500000 ( V1 )
Normalized Series up to 9th term with SC2 Sin(30) = 0.5000000000202800000000 ( V2 )

The true value for sine of 30 degrees is 0.5:

AE1 = V1 - 0.5 = 0.00000000081006235
AE2 = V2 - 0.5 = 0.00000000002028000

You can see that Absolute Error AE2 is about 40 times smaller than AE1:

AE1 / AE2 = 0.00000000081006235 / 0.00000000002028000 = 39.94

Normalized Series up to 11th term with a Set of Coefficients ( SC3 )

a = 1.57079632679489661923132169163980000
b = 0.64596409750624625365575656389795000
c = 0.07969262624616704512050554949047800
d = 0.00468175413531868810068546393395340
e = 0.00016044118478735982187266087016347
f = 0.00000359253000000000000000000000000

does the magic:

Sin(30) = 0.5000000000000000000000

and I verified it with all C/C++ compilers that I use.
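Those SC2 numbers are easy to reproduce. Here is a minimal Java sketch of the normalized series up to the 9th term (the class name is invented), with the argument expressed in units of 90 degrees, so that 30 degrees becomes 1/3:

```java
// Normalized series up to the 9th term with the "very accurate" SC2
// coefficients quoted above. The argument t is the angle divided by 90
// degrees, so sinNS9(1.0/3.0) approximates Sin(30 degrees).
public class SinNormalized {
    static final double A = 1.57079632679489661923132169163980000;
    static final double B = 0.64596409750624625365575656389795000;
    static final double C = 0.07969262624616704512050554949047800;
    static final double D = 0.00468175413531868810068546393395340;
    static final double E = 0.00016044118478735982187266087016347;

    public static double sinNS9(double t) {
        double t2 = t * t;
        return t * (A - t2 * (B - t2 * (C - t2 * (D - t2 * E))));
    }

    public static void main(String[] args) {
        double v = sinNS9(30.0 / 90.0);          // Sin(30 degrees)
        System.out.println(v + ", AE = " + (v - 0.5));
    }
}
```

Running this reproduces the V2 value above to the displayed digits, with an absolute error of about 2e-11.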

Best regards,
Sergey
