# Optimization of sine function's taylor expansion

Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

Hi Iliya,

Quoting iliyapolak

...So I thought that the best way to learn is use such a book like "Handbook of Mathematical Functions" and try todevelopemy own implementation to the various formulas. I have already written 52 functions(If you are interested I can send you a Java source file)

with varying degree of an accuracy and speed of execution...

That would be awesome if you share your work. Since it is in Javaa forum"Android Applications on Intel Architecture" or "MKL"could bethe most suitable. Please confirm it with Moderators of Intel Forums.

Thank you!

Best regards,

Sergey

Quoting iliyapolak...

>>less accurate then standard CRT-functionsWhy do not You implement your sine functions exactly as fastsin() which has AE of 0 when compared to CRT sin()?

[

SergeyK] It was decided that for aReal-Time environment a number of terms in serieshas to be reduced to 7and 9.Do you have an experience in Java programming?[

SergeyK]My experience with Java is insignificant. I would say: 95%is C/C++ and 5%all the rest programminglanguages,like FoxPro, Java, Gupta, Assembler, Visual Basic, Pascal.

Have you considered a development ofan API which is based only on complex numbers and complex functions.[

SergeyK] No.Best regards,

Sergey

Thank You Sergey!

I decided to postthe source code of my Functions class.It is still in beta ,but all methods work.You can find there many esoteric functions like :Bessel , Kelvin, Sine Integral ,Cosine Integral ,various Gamma functions and many more.As I stated earlier in my post all of those methods are not written for speed of execution ,moreover they are very slow look herehttp://software.intel.com/en-us/forums/showthread.php?t=106032

You can see how pre-calculation of coefficients and Horner scheme can improve speed of execution(in case of my gamma functionit was ~4x faster).

Please feel free to use my code to port it to C/C++, to improve it and to test it.And do not forget togive your opinion on my work. :)

P.SYou can open the java source file in any editor even notepad ,but I recommned you to use Eclipse Indigo IDE.

## Anexos:

Hi Sergey

Now I'am porting java "Functions" class to C++ static library.Where is it possible I will optimize for speed.I have already ported a few functions.When my job will be done I will post the source file.

>>or "MKL"I think that this forum is too much math-centric and math-oriented and more suited for professional mathematicians than for programmers.

If you are interested in testing more complicated function than sin() please look here http://software.intel.com/en-us/forums/showthread.php?t=106032

I think that learning x86 assembly should be obligatory for professional programmers.It should be 70% C/C++ and 30% assembly.

Quoting Sergey Kostrov

Quoting iliyapolak

Interesting, but did you test standard taylor expansion for sine()exactly as I have written for speed and for an accuracy?..It is already in progress and I'll post results soon...

Comments:

- Number of Iterations - 2^22

- Time of executionof a single call has to be calculated as follows: ( Completed in XXX ticks )divided by a ( Number of iterations ), for example: 297 / 2^22

- Microsoft C++ compiler

- All optimizations disabled

- Release configuration

...

Application - ScaLibTestApp - WIN32_MSC

Tests: Start

>

Test1067 Start<Sub-Test 1.1

Completed in

297ticksCRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1

Completed in

234ticksNormalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1

Completed in

234ticksNormalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1

Completed in

266ticksNormalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1

Completed in

266ticksChebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1

Completed in

328ticksChebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1

Completed in

219ticksNormalized:

Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1

Completed in

234ticksNormalized:

Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1

Completed in

203ticksNormalized:

Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1

Completed in

234ticksNormalized:

Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1

Completed in

532ticksNormalized:

Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1

Completed in

453ticksNormalized:

Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Sub-Test 13.1

Completed in

109ticksNormalized:

Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro

Sub-Test 14.1

Completed in

266ticks1.00 deg step for a LUT of Sine Values:

Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 15.1

Completed in

265ticks1.00 deg step for a LUT of Sine Values:

Interpolated Cos( 30.0 ) = 0.8660254037844386000000

>

Test1067 End<Tests: Completed

...

I'll post modified codes of your '

FastSin' functionlater. I need to do my regular a project related stuff.Best regards,

Sergey

Quoting iliyapolak

...I would like to ask you howcan i rewrite this code in order to gain speed of execution improvment...Hi Iliya, I would say this is your "sacred" question and here are a couple of versions of your '

FastSin' function.A version of 'FastSinV2' without checks for a range 0 < x < Pi/2 will be your fastest version

in C language:

Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC

Tests: Start

>

Test1067 Start<Sub-Test 1.1

Completed in

297ticksCRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1

Completed in

235ticksNormalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1

Completed in

234ticksNormalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1

Completed in

266ticksNormalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1

Completed in

265ticksChebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1

Completed in

328ticksChebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1

Completed in

219ticksNormalized:

Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1

Completed in

219ticksNormalized:

Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1

Completed in

203ticksNormalized:

Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1

Completed in

234ticksNormalized:

Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1

Completed in

516ticksNormalized:

Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 -

FastSinV1- Not OptimizedSub-Test 12.1

Completed in

406ticksNormalized:

Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 -

FastSinV2- OptimizedSub-Test 13.1

Completed in

360ticksNormalized:

Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 -

FastSinV3- Optimized / No ChecksSub-Test 14.1

Completed in

109ticksNormalized:

Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 -

C MacroSub-Test 15.1

Completed in

266ticks1.00 deg step for a LUT of Sine Values:

Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 16.1

Completed in

265ticks1.00 deg step for a LUT of Sine Values:

Interpolated Cos( 30.0 ) = 0.8660254037844386000000

>

Test1067 End<Tests: Completed

Here are performance numbers for 'sin'C-Macros:

...

Completed in

62ticksNormalized:

Taylor Series

7tSin( 30.0 ) = 0.4999999918690232200000 - C Macro...

Completed in

94ticksNormalized:

Taylor Series

9tSin( 30.0 ) = 0.5000000000202798900000 - C Macro...

Completed in

109ticksNormalized:

Taylor Series

11tSin( 30.0 ) = 0.4999999999999643100000 - C Macro...

Hi Sergey!

Thanks for testing.I hjave a few question regarding the results and method of testing.

Bronxzv in one of his responses told me do not test with a constant value because of possible compiler optimization here isthe quote from his post "

also, as previously noted, I don't think that calling it with a constant is the best idea, insteadI'll advise to samplethe domain like in the example I provided the other day".What does XXX Ticks stand for? Is it nanoseconds or CPU cycles?

Here is my test for fastsin() written exactly as your Unoptimized version:

Tested today fastsin() 1e6 iterations and the result was 15 millisec i.e ~33.39 cycles per iterationfor my CPU.

results

start val of fastsin() 13214314

end val of fastsin() 13214329

running time of fastsin() release code is: 15 millisec

fastsin() is: 0.891207360591512180000000

Why your results are so large almost as my slow implementation of gamma stirling approximation.Please compare here linkhttp://software.intel.com/en-us/forums/showthread.php?t=105474

Seregey!

Did You download Java source file which I uploaded yesterday?

There you have a plenty room for implementing Horner scheme with coeficients pre-calculation.

One of the example optimization of gamma stirling approximation please look here http://software.intel.com/en-us/forums/showpost.php?p=188061

Here is the code and test results

Quoting iliyapolak

...Results of loop-overhead testing

As you can see I cannotmeasure loop overheadmoreover Ialso checked with debugger that empty for-loop is executed.Priority wasset with the help of Process Explorer.Assembly instructions can be counted so overheadis sum of a few x86 instr like"jae[target], add 1 cmp,some_value" andshould be not more than a few cycles per iteration.

...

start value of loop_overhead : 5600529

end value of loop_overhead : 5600529

delta of loop_overhead is : 0...

Hi Iliya,

This is a follow up on two Posts #117 and #114. I think you need to disable ALL optimizations in order to measure an overhead of

an empty '

for' statement. Intel C++ compiler could easily "remove" it. Since itdidn't andyour result was 0 something elsewas wrong. I'll take a look at it some time this week.

Best regards,

Sergey

Quoting iliyapolak

...What does XXX Ticks stand for? Is it nanoseconds or CPU cycles?[

SergeyK] XXX means anumber of ticks. A Win32 API function 'GetTickCount' returns a value inmilliseconds, or 'ticks'. In my tests every 'sine' function is called 2^22 times and a time

for one call is calculated by dividing a 'TicksNumber' by '2^22'. For example:

78ms (ticks ) /2^22=0.000018596649169921875ms ~=0.0000186ms...Quoting iliyapolak

...Why your results are so large almost as my slow implementation...[

SergeyK] Because your computer is faster.Iliya, you're trying to compare uncomparable values. Let's assume that for the same function:

- Value VA in mswas obtained on a computer A with CPU frequency FA

-Value VB in mswas obtained on a computer B with CPU frequency FB

- If value VB is less then value VA the computer B is faster then computer A

All performance numbers I usually post are for reference only.You need to compare your value(s) against another value(s) ( a "reference" ) obtained on the same

computer with a similar CRT-fucntion.

Once again, I don't measure absolute performance of some function in milliseconds, nanoseconds or

clock cycles. I always measure a relative performance. Let's say I've set a target to outperform some

"reference function" from CRT library. If my function is faster a target is achieved. I know that accuracy is

affected since 7, or 9, or 11 terms are used but that is another issue.

Best regards,

Sergey

Quoting Sergey Kostrov

Quoting iliyapolak...Why your results are so large almost as my slow implementation...[

SergeyK] Because your computer is faster....I always measure a relative performance...

And here are results of a very special test:

Performanceof

C 'Sin' Macrosmeasured withRDTSCinstruction is as follows:Normalized Taylor Series Sine (

7terms ) -25clock cyclesNormalized Taylor Series Sine (

9terms ) -35clock cyclesNormalized Taylor Series Sine (

11terms ) -43clock cyclesHowever, at the time of testing a Windows Update was in progress...

Best regards,

Sergey

These results are closer to the results achieaved by me and by bronxzv.From theoretical point of view it should be very fast because of a few addition and multiplication and one assignment of the accumulator register to the memory store.

Try to run your tests for 1e6 iterations.

Here are my results for 2^22 iterations unoptimized fastsin() ,but heavily optimized byIntel C++ compiler with full code inlining inside the loop and 10x loop unrolling.

start val of fastsin() 5566115end val of fastsin() 5566178

running time of fastsin() release code is: 63 millisec

fastsin() is: 0.988566034786635070000000

63 ms / 2^22 = 0.0000150203704833984375ms ~= 15 ns per iteration

Your code ran for 18.6 ns per iterationprobably due toless agressive compiler optimization.

Btw my CPU is kinda slow Core i3 2.226Ghz.

Yes something could be wrong with this measurement.I think that overhead in the worst-case scenario is a few cycles of compare to sth, inc counter and jmp to top_loop assembly instructions.In real world testing CPU should execute such a loop inparallel with inside the loop statements.

Yes I was wrong.It is not only the different CPU's , but also uncontrollable by us non-deterministic environment which can greatly affect the state of our measurements.

MSVCRT.DLL sin() functions after after argument checking calls x87 fsin instruction.I suppose that fsin performes also range reduction and input checking its cpi is ~40-100 cycles you have also add the overhead of an library wrapper and argument checking so the CRT sin() will be always slower than our fastsin() implementation.Even x87 fsin is slower on average than fastsin() because of range reduction and if I'am not wrong coefficient pre-calculations done on-the-fly.This explains the accuracy of fastsin() compared to CRT sin()

Here is the disassembled by the IDA PRO code snippet of _CIsin which is compiler optimized version of many sin() implementations.

indeed, that's why it's notsensibleto measure the timing of an empty loop as an estimate of the impact on a real loop,there is typically noconflicts between GPR increment and compare and the FADD/FMUL ports, moreover the branch is macro-fused and with high prediction hit rate, all in all the actual overhead is almost 0

a great tool to see how several instructions are executed in parallel and to study your critical path is the IACA available here:

http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

CPU internal logic will easily optimize such a case,as I could say this is a classic case when integer calculations are executed in parallel with floating-point instructions.Situation will be slightly different when floating point based for-loop is used.Here for-loop's fp code could be interleaved with main calculation body and executed by Port0 and Port1 in parallel withloop's body.

Btw. thanks for the link

Quoting iliyapolak

...63 ms / 2^22 = 0.0000150203704833984375ms ~= 15 ns per iteration

Your code ran for 18.6 ns per iterationprobably due toless agressive compiler optimization...I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.

Quoting iliyapolak

...CRT sin() will be always slower than our fastsin() implementation...This is not always true and CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183 /

based on Chebyshev polynomial). This is not a surprise for me because CRT 'sin'is implemented in assembler.

ALL my sine functions are implemented in C.

It could be possible because the cpi of fsin varies between 40-100 cycles per instruction.If fsin can reach 40-50 cpi and your custom chebyshev-based implementation is not optimized agressively by the compiler it is possible than CRT sin() could run faster.

>>This is not a surprise for me because CRT 'sin'is implemented in assemblerMSVCRT.DLL sin() functions have some overhead which is based on argument checking and comparing to 0 and branching on result to the location which in the end calls x87fsin(you can clearly see the call to fsin in my post above).When CRT sin() is called many times compiler can simply inline the code in for-loop block because of almost 100% predictable backwardbranching the assembly code for the overhead will be present in the cache already decoded to uops , so it can greatly speedup execution time of CRT sin().

Btw JVM solution Math.sin also uses fsin ,but range reduction is performed at the bytecode level so it is slower than MSVCRT sinefunctions.

Why do not you try to run fastsin() or your other sinefunctions heavily optimized by Intel compiler.

It can be very interesting to see the results of such a measurement.

Quoting iliyapolak

Why do not you try to run fastsin() or your other sinefunctions heavily optimized by Intel compiler.It can be very interesting to see the results of such a measurement.

Iliya,

I work for a project and I can't do everything I want. I tested your 'FastSin' functionand, unfortunately,

I need to move on with another things.Your set of special functions implemented in Java is good ( Thank you again! )but

it has no practical applications for the project. Please take a look at a thread for more details:

http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr

Post #3

Best regards,

Sergey

Quoting iliyapolak

MSVCRT.DLL sin() functions after after argument checking calls x87 fsin instruction.I suppose that fsin performes also range reduction...'

FSIN' instructionaccepts any values in the range from -2^63 to +2^63. If an input value is outside of that range areduction oftheinput valuemust be done and Intel recommends to use '

FPREM' instruction.Best regards,

Sergey

Thank You Sergey.Maybe in foreseeable future you could implement some parts of special functions class.I would be very glad to see it :)

What is your main area of software development?Is it numerical methods software?

I was talking about the range reduction(or mapping)of large values in radian(input -2^63,+2^63) to values in range |x|

Coming a bit back to the topic of this forum I'm wondering why no attempt has been made to parallelize the calculation or otherwise speed up the usual Horner scheme. One method is Estrin's method.

For sine Horner's scheme is as follows (the constants c_n = 1/n! for the usual Taylor approximation):

x2 = x*x;

result = ((((((c13*x2+c11)*x2+c9)*x2+c7)*x2+c5)*x2+c3)*x2+c1)*x;

In assembler (x and result via [eax]):

movsd xmm0,[eax]

movsd xmm1,xmm0 ; x

mulsd xmm0,xmm0

movsd xmm2,xmm0 ; x2

mulsd xmm0,[c13]

addsd xmm0,[c11]

mulsd xmm0,xmm2

addsd xmm0,[c9]

mulsd xmm0,xmm2

addsd xmm0,[c7]

mulsd xmm0,xmm2

addsd xmm0,[c5]

mulsd xmm0,xmm2

addsd xmm0,[c3]

mulsd xmm0,xmm2

addsd xmm0,[c1]

mulsd xmm0,xmm1

movsd [eax],xmm0

A simplified variant of Estrin's scheme is

x2 = x*x;

x4 = x2*x2;

result =

( (((+c11)*x4+c7)*x4+c3)*x2+

(((+c13)*x4+c9)*x4+c5)*x4+c1 )*x;

In assembler (x and result via [eax]):

movsd xmm0,[eax]

movsd xmm1,xmm0 ; x

mulsd xmm0,xmm0

movsd xmm2,xmm0 ; x2

mulsd xmm0,xmm0

movsd xmm3,xmm0

movsd xmm4,xmm0 ; x4

mulsd xmm0,[c13]

mulsd xmm3,[c11]

addsd xmm0,[c9]

addsd xmm3,[c7]

mulsd xmm0,xmm4

mulsd xmm3,xmm4

addsd xmm0,[c5]

addsd xmm3,[c3]

mulsd xmm0,xmm4

mulsd xmm3,xmm2

addsd xmm0,[c1]

addsd xmm0,xmm3

mulsd xmm0,xmm1

movsd [eax],xmm0

At least on Atom N450 and Core i7 920 this is a gain.

Very interesting.

Here is the Horner scheme for polynomial approximation oftan() function.The first four terms of tan() taylor series are shown below.It is not optimal in the term of speed becuse of stack accesses which are used to calculate powers of a variable x.

This is not the Horner scheme which starts with the last coefficient (highest index).

Why do you store the odd powers of x? You don't need to store them and you even need not reload them.

BTW: You should try to omit modifying a register which is needed the very next command - always keep a creator and its consumer dispersed as far as possible, especially when the producer has a long latency.

Right I was thinking about the my other implementation of tan() function written in C and made such a comment.It is straightforward implementation of taylor expansion with pre-calculated coefficients.

Btw polynomial evaluation can also start from the lowest first coefficient saw it many times for example FDLIBM implementation of sine function.

>>Why do you store the odd powers of x? You don't need to store them and you even need not reload themBecuse when I wrote this code I did not know how to efficient multiply the powers of x without the stack accesses.And I simply implemented in assembly varioustaylor expansion without even thinking about the code optimization.

that's one more reason for testing it over asub-domain instead ofa single value

once you have refactored your code to use basic building blockslike EvalPoly5() here http://software.intel.com/en-us/forums/showpost.php?p=186020

your code will be not only more readable but it will be very easy to test alternate evaluation methods like the one suggested by sirrida herehttp://software.intel.com/en-us/forums/showpost.php?p=188457

Interesting what were the values which caused such a great variation in cpi of fsin.

Thanks to this forum and you guys I know how to perform basic code optimization.Refactored code is a lot of faster and more readable and elegant when compared to my previous inefficient and ugly coding style.

Off topic question.How is it possible that every newn member already registered has some points assigned to him.

Quoting iliyapolak

Yes something could be wrong with this measurement.I think that overhead in the worst-case scenario is a few cycles of compare to sth, inc counter and jmp to top_loop assembly instructions.In real world testing CPU should execute such a loop inparallel with inside the loop statements.Hi Iliya,

I just completed a simple verification and when ALL compiler optimizations disabled that code works:

Quoting iliyapolak

Thanks to this forum and you guysI know how to perform basic code optimization...I recommend you to look at Intel manuals at:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?wapkw=Manuals

You could find there a manual on optimization techniques.

Best regards,

Sergey

Quoting iliyapolak

...How is it possible that every newn member already registered has some points assigned to him.I think because they didn't opt-out from the

Intel Black Belt Software Developer Program:http://www.intel.com/software/blackbelt

and every time when a member submits a postthe system assignssome number of points.

Best regards,

Sergey

That means that I have lost many points.

Thanks for the link.

I have this manual.I would also like to recommend you a very interesting book"Inner Loops",here is link http://www.amazon.com/Inner-Loops-Sourcebook-Software-Development/dp/0201479605/ref=sr_1_1?s=books&ie=UTF8&qid=1340854371&sr=1-1&keywords=inner+loops

The book is old (even very old)is still dealing with Pentium and Pentium Pro code optimization,but you can find there a lot of info regarding various code optimization techniques performed mainly in assembly.

In my case I probably did not disable optimization hence the result was 0.

Can you post your result?

Quoting iliyapolak

That means that I have lost many points.Yes, unfortunately. I see that you're already in the program!

I could have had a brown belt.Now working hard to regain my lost points.

The rcpps is likely to be useful only where 12 bits precision may be adequate. It's advocated with iterative improvement for "throughput" optimization where you have independent calculations pending which would be stalled during the time the fpu is occupied by divide. Core I7-3 cuts that stall time in half, so you should be aware that you may be optimizing for a past architecture. We are still struggling with software implementation of divide.

I don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients.

A speed improvement may be obtained by modifying the Horner scheme e.g.

x + x**3 * (a3 + a5 * x**2 + x**4 * (a7 + a9 * x**2))

so as to make fuller use of the instruction pipeline.

The Ivy Bridge "core I7-3" should speed up those minimax rational approximations, and fma hardware would help significantly with these polynomial evaluations.

## Páginas

## Deixar um comentário

Faça login para adicionar um comentário. Não é membro? Inscreva-se hoje mesmo!