# Optimization of sine function's taylor expansion

Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Here is a screenshot ( results forMicrosoft C++ compiler ):

We have to deal with curve approximation inacurracies and finite precision of those numbers that repressent and approximate the curve. As gou showed in your last post when more terms are added to normalized series the accuracy is closer to the real value but sometimes there are left the traces of rounding and inexact representation of various numbdrs. When writing my library i did some analysis of methods taken from Avramovitz and Stegun and got mixed results because limit of covergence was different when formula was evaluated programaticaly and compared to exact mathematical statementand many times i was stuck with widely varying divergence when my implementation(as exactand faithful as possible when compared to book's formulas)was evaluated software simplyfailed to produce the exact value .I blamed more java compiler implementation of IEEE 754 standard andhardware inacurate representation of real numbersthan formula accuracy(which is accurate mathematicaly as book stated). For example from mathematical point of view taylor series of sine converges almost completely withhigher precision if more terms are added, but try to approximate programmaticalywith x>3 and it will be a failure partly because of catastrophic cancellation induced by alternating sign series.

Sure but I don't really seea realworld use for the min error, for example avery roughLUT based solution may have severalexact results (up to LSB) over the domain (i.e. min error = 0.0) and a good minimax polynomial no single exact result (i.e. min error > 0.0). I'll be interested to see your error analysis here: http://software.intel.com/en-us/forums/showpost.php?p=185933 displayed as max AE and/or average AE over [0.0;90.0] deg if it's a simple option of your software

Quoting iliyapolak

...When writing my library i did some analysis of methods taken from Avramovitz and Stegun and got mixed resultsformula when implemented in software failed to produce the exact value. I blamed more java compiler implementation of Ieee 754 standard

than formula accuracy...

I recommend to run a set of very simple tests in order to verify quality of implementation of IEEE 754 Standard for Java, like:

Can you reproduce these numbers?

Quoting iliyapolak

...implementation of Ieee 754 standard...Discussions about consistency of floating-point results will never stop. Please take a look at a very good article regarding

that subject:

Consistency of Floating-Point Results using the Intel Compilerby Dr. Martyn J. Corden & David Kreitzer

Best regards,

Sergey

PS: Ihave very consistent results across different platforms and C/C++ compilers.

## Fichiers joints:

Here are my results for multiplication of x = 0.1f and y = 0.1f both of them declared as a float values:

x

0.10000000149011612000000000000000

y

0.10000000149011612000000000000000

Here is the multiplication result:

product

0.01000000070780515700000000000000

Here are the results for multiplication of java double values x = 0.1d and y = 0.1d

x1

0.10000000000000000000000000000000

y1

0.10000000000000000000000000000000

product1

0.01000000000000000200000000000000

Sergey how can I use 53-bit precision if java only supports standart IEEE 754 float and double values

Read please this pdf written by W Kahan about awful performance of floating point java implementation

www.cs.berkeley.edu/~wkahan/

JAVAhurt.pdfIEEE 64-bit double has 53-bit precision by definition. Decimal conversion functions can conform to IEEE754 while zeroing out digits beyond the significant 17; maybe yours is rounding earlier so as to hide the differences. Also, the strictest standards of IEEE754 don't apply to values < 0.1.

In order to fully analyze the results of floating point calculations which often can require thousands or millions operationsone need to know how the calculations were affected by round off error its propagation and accumulations. For millions of iterative operations it nearly impossible to predict what the error will be in some arbitrary step out of millions needed to perform calculation.

Hi Iliya,

I see thatyour result( product of 0.1*0.1 / single-precision)isconsistent with myresultinPost #56:

and the same applies to the double-precision data type.

I could assume that

Java VMinitialized theFPUto 24-bit precision in the 1st case, and to53-bit ( default ) precisionin the 2nd case.

So, in both casesresults are consistent.

My concern is as follows: in the 1st case I would expect initialization of the

FPUtoDefault settings instead of 24-bit.Do you need another test ( a matrix multiplication )to verify

Java's floating point support?>>...How can I use 53-bit precision if java only supports standart IEEE 754 float and double values...In C/C++ applications a developer could control a precision with functions '

_control87' or'

_controlfp'. Microsoft added several extensions, like '_control87_2' or '_controlfp_s', butthese two functions areMicrosoft specific.

In

Java APIyou should look for some method similar to '_control87' function. IfJavadoesn't have a method that controlssettings of the

FPUI would consider aJava Native Interface( JNI ) DLLthat implements a wrapper around '

_control87' function.Note: In case of using inline assemblerwith

SSE2instructions in aJNIDLL aprecision couldbe controlled with a set of macros, like '

_MM_SET_ROUNDING_MODE', orintrinsic functions, like '

_mm_setcsr'. Take a look at attached'xmmintrin.h' header file.Remember, that '

_control87' function controls both units (x87andSSE2).Best regards,

Sergey

## Fichiers joints:

Here are Reference Data for

sinefunction to verifyJava's floating point support.Two files are enclosed and this is a short example:

## Fichiers joints:

Sergey!

I think that FPU unit's initializations are performed by jvm automatically(maybe hardcoded and done at jvm.dll startup or when compiling fp code.I need to disassemble jvm.dll and look for instructions which accessMXCSR register)and the programmer is not given the option to manipulate the settings as it is done by Microsoft _control87 function.

Soon I will switch from java to c with inline assembly programming for my math library.

-I would consider aJava Native Interface(JNI)DLL-

The better options is to write in c language.-

Do you need another test ( a matrix multiplication )to verify Java's floating point support-

Yes-

My concern is as follows: in the 1st case I would expect initialization of theFPU toDefault settings instead of 24-bit-Because the results obtained from C/C++ and java calculations are the same when rounded I suppose that that CPU (both of us have Intel CPU) rounding algorithm is responsible for this.

note that there isn'tthenotion of precision with SSEx/AVXx anymore, all SS/PS instructions are with 24-bit precision and SD/PD instructions with 53-bit precision, MXCSRhas 2rounding mode bits but no fp precision control (only the precision exception flag and mask) unlike x87

Hi Sergey

I did very interesting comparision of three methods for sin(x) calculation.

First method is based entirely on coefficients calculation(nothing has been pre-calculated) this method implements a for-loop ,array access and Math.pow calls to obtain the result of taylor series division ofx^n/n!

There is a very large Absolute Error when the result is compared to library Math.sin function.I suppose that the source of error is due to accumulated roundoff error of for loop division.Time measurement ofthose methode was based on Java System.nano which returns time difference in nanoseconds.The result returned by System.nano was whopping 274415 nanoseconds.I suppose that the reason for this is very inefficient code which has two array's accesses ,for loop overhead ,many calls to external library Math.pow divisions and

assignment inside of loop.

Second method was much improved when compared to the first one.I simply pre-calculated reciprocals of factorials and did polynomial multiplication.The running time was 907 nanoseconds compared to Math.sin call (2264 nanoseconds).Absolute Error was 0.0 when compared to library Math.sin.

Here are the results of three sin(0.45)functions compared for speed of execution and accuracy:

value of sine inefficient implementation of taylor series

0.43724781952696220000000000000000

time of calculation of sine is: 274452 nanoseconds

Math.sin

0.43496553411123023000000000000000

time of calculation of Math.sin java library is : 2264 nanoseconds

poly fit taylor series

0.43496553411123023000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.002282285415731944

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

For high precision calculation SSE can not beat scalar x87 FPU.

Here is code for those two methods.

Please compare the results in post #65.

[bash]public class SineFunc {

//sin(x)~ x-x^3/3!+x^5/5!-x^7/7!

/**

* @param args

*/

public static void main(String[] args) {

// TODO Auto-generated method stub

double val = Double.parseDouble(args[0]);

int j = 1;//variable initialization

int exp =( 2*j++)+1;

double[] array = new double [11];

int k = 1;

int denom=1;

long start = System.nanoTime();

for (int i = 0;i

...

AE= 0.43496553411123023000000000000000 - 0.43496553411123023000000000000000=0.0...

Did you get ZeroAE for a complete range of input values? For example, from -180 degreesto +180 degrees?

I can see from your resultthat expansion of the Taylor Series to the 21st termor 23rd term could be used in orderto

get a reference table of sine values.

In my case I know a source of errors and it isrelated tocoefficients that currently use.

Now I'am posting my results for ranges 0,P1/2

sin(0.1), step 0.1value of sine inefficient implementation of taylor series

0.09986005837615321000000000000000

time of calculation of sine is: 285774 nanoseconds

Math.sin

0.09983341664682815000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.09983341664682815000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 2.66417293250526E-5

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.2)value of sine inefficient implementation of taylor series

0.19888046700922576000000000000000

time of calculation of sine is: 264488 nanoseconds

Math.sin

0.19866933079506122000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.19866933079506122000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 2.1113621416454786E-4

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.3)value of sine inefficient implementation of taylor series

0.29622157615613700000000000000000

time of calculation of sine is: 265394 nanoseconds

Math.sin

0.29552020666133955000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.29552020666133955000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 7.013694947974325E-4

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.4)value of sine inefficient implementation of taylor series

0.39104373607380610000000000000000

time of calculation of sine is: 326082 nanoseconds

Math.sin

0.38941834230865050000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.38941834230865050000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.0016253937651555805

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.5)value of sine inefficient implementation of taylor series

0.48250729701915250000000000000000

time of calculation of sine is: 302532 nanoseconds

Math.sin

0.47942553860420300000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.47942553860420300000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.003081758414949509

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.6)value of sine inefficient implementation of taylor series

0.56977260924909550000000000000000

time of calculation of sine is: 360501 nanoseconds

Math.sin

0.56464247339503540000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.56464247339503540000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.0051301358540600805

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.7)value of sine inefficient implementation of taylor series

0.65200002302055430000000000000000

time of calculation of sine is: 329252 nanoseconds

Math.sin

0.64421768723769100000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.64421768723769100000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.007782335782863248

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.8)

value of sine inefficient implementation of taylor series

0.72834988859044850000000000000000

time of calculation of sine is: 273546 nanoseconds

Math.sin

0.71735609089952280000000000000000

time of calculation of Math.sin java library is : 4529 nanoseconds

poly fit taylor series

0.71735609089952280000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.010993797690925677

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.9)

value of sine inefficient implementation of taylor series

0.79798255621569720000000000000000

time of calculation of sine is: 323817 nanoseconds

Math.sin

0.78332690962748340000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.78332690962748340000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.014655646588213833

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(1.0)

value of sine inefficient implementation of taylor series

0.86005837615321990000000000000000

time of calculation of sine is: 268111 nanoseconds

Math.sin

0.84147098480789650000000000000000

time of calculation of Math.sin java library is : 3170 nanoseconds

poly fit taylor series

0.84147098480789650000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.01858739134532339

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(1.1)

value of sine inefficient implementation of taylor series

0.91373769865993570000000000000000

time of calculation of sine is: 366842 nanoseconds

Math.sin

0.89120736006143540000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.89120736006143540000000000000000

time of calculation polynomial fit of sin is : 1359 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.022530338598500288

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(1.2)

value of sine inefficient implementation of taylor series

0.95818087399276370000000000000000

time of calculation of sine is: 420736 nanoseconds

Math.sin

0.93203908596722630000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.93203908596722630000000000000000

time of calculation polynomial fit of sin is : 905 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02614178802553746

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(1.3)

value of sine inefficient implementation of taylor series

0.99254825240862400000000000000000

time of calculation of sine is: 267658 nanoseconds

Math.sin

0.96355818541719300000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.96355818541719300000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02899006699143103

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(1.556) very close to Pi/2

value of sine inefficient implementation of taylor series

1.02879965351909490000000000000000

time of calculation of sine is: 379523 nanoseconds

Math.sin

0.99989053635379590000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.99989053635379600000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02890911716529898

Absolute Error compared to Java Math.sin for polynomial fit method is :1.1102230246251565E-16

As you can see only in last test we have a difference albeit small.Afaik Java Math.sin calls java strictMath class whos algorithms are based on FDLIBM 5.3 library.

FDLIBM library for sin calculation uses polynomial of degree 13.

Please look also at very long execution time for non-efficient method the difference is more than 200th times.I think that array accesses are responsible for this.

Quoting iliyapolak

...For millions of iterative operations it nearly impossible to predict what the error will be in some arbitrary step out of millionsneeded to perform calculation...

TheAccumulated Error could be detected and analyzed. In a very simple test, which I used some time ago, a constant 0.001 is

added 1000 times to a single-precision variable:

Results are as follows and you could see that errors started almost from the beginning:

In order to see an Exact Error at some step the Binary Representations ( in IEEE 754 format ) for

the Current and True Values have to be used. In another words, a bit-to-bit comparision must be done.

The same test with double-precision variables calculated result as follows:

Sergey

You are right , but only when we take into account a simple calculation be it summation or product of two binary inexact number representated.

Please take a lookat my code which calculates the sine function(the" inefficient method"),analyze thealgorithm which uses for-loop to calculate sine series coefficients.As you can see the AE is very large and this code at the beginning only involved 4 loop cycles and the running-time of this function is 100th times slower than library function.It is not soeasy as with simple summation to spot immediately the source of an error and explain the very slow runing time.

Now think about the some physical process which an algorithm tries to describe with time resolution measured in microsecond and calculation being performed inside maybe nested for-loop with dozens operation per cycle.

How you can effectivelyfind and track the source of anerror calculation and various contribution to an error and calculate the error's propagation and accumulation.I think that this would be very daunting and tedious task.

Btw the last sentence was taken from the book "A First Course in Numerical Analysis" by P.Rabinovitz and A. Ralston

not so tedious using interval arithmetic

http://en.wikipedia.org/wiki/Interval_arithmetic

it's easy for example to design an interval arithmetic framework,best in a language that support operators overloading like C++ (so that it's just as readable as classical arithmetic) and to see how the intervals are tighter or wider based on the code arrangement, it works well even for very complex formulas and a lot of iterations

In my numerical analysis book nothing has been said about the interval arithmetics.

The authors describe various statistical methods to measure dispersion of an error

Hi Iliya,

A whileago I posted some results for a Test-Case that multiplies 0.1 by 0.1. I just detected some

inconsistency of results for a Single-Precision ( 'float' )data type. Here is an example:

A test ( 0.1 * 0.1 ) executed from a DLL:

24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]

53-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]

64-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]

Default : [ 0.1 * 0.1 = 0.01000000029802323 ]

A test ( 0.1 * 0.1 ) executed from an EXE-program:

24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]

53-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]

64-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]

Default : [ 0.1 * 0.1 = 0.01000000070780516 ]

My questions: Is it related to a '

_control87' or 'printf' CRT-functions, orsomething else?Best regards,

Sergey

Hi Sergey

Hard to answer without therunning codeexample.Does your DLL call printf function?,or you are running it under VS debugger?.If it does call printf put the breakpoint on the printf call(printf is implemented in MSVCRT.DLL)and single-step into a function

check carefully floating-point registers values or memory arguments passed to printf and look for AE in processed values.You can also do the same procedure with your EXE code.

In order to fully track or follow printf call chain kernel debugger must be used to inspect memory-mapped I/O or I/O ports used by video card to read the valuesinto frame buffer and display it,but I think this would be an overkill.

You can also write in assembly or inline assemblysimple procedure involving 0.1 multiplicationand look at fpu registers context directly without formatting and displaying an output by doing this you can narrow down your AE and eliminate contribution(modification of the result)from printf function.

[

SergeyK] An example is in aPost #45. Please take a look.I wrote my time-measurement code almost exactly as You,but the difference calculated by the calls to GetTickCount() function was 0. I read about the time resolution or granularity of this function and on the MSDN site was clearly stated that the resolution is returned in miliseconds.Therefore I switched to using inline RDTSC for more accurate measurement.So howexactly Did you calculate the time delta and acieved nanoseconds accuracy?

Quoting iliyapolak

[SergeyK] An example is in aPost #45. Please take a look.I wrote my time-measurement code almost exactly as You,but the difference calculated by the calls to GetTickCount() function was 0. I read about the time resolution or granularity of this function and on the MSDN site was clearly stated that the resolution is returned in miliseconds.

[

SergeyK] Yes, that is correct. I'm always measuring a relative performance because many platformsare supported.I simply need to know what is a performanceratio between different

functions executed on the same platform,or ondifferent platforms.

Iliya, I'll do a test with RDTSC instruction and will post my results as soon as they are ready.

Therefore I switched to using inline RDTSC for more accurate measurement. So howexactly Did you calculate the time delta and acieved nanoseconds accuracy?

I've executedcalls more than 4 million times ( 2^22 ). As I told I didn't try to measure a performance in nanoseconds.

A pseudo code looks like:

...

NUM_OF_TEST = 2^22

StartTime = ::GetTickCount();

for( i=0;i < NUM_OF_TEST; i++)

{

Callto aSine Function

}

EndTime = ::GetTickCount();

Printf( "Delta: %d", ( EndTime - StartTime ) );

...

Best regards,

Sergey

Hi Sergey

Spent a few hours on performance testing of my elementary functions library(all functions implemented as a polynomial fit).Performance of every functions was measured with inline RDTSC.Did not calculate average values of many thousand runs(I will do it later).

For example the same implementation of sin(x) function written in java and tested with Java System.nanoTime() method returned on average of 1000 runs~1000 nanoseconds that's mean ~2226 CPU cycles(My Cpu runs at speed of2.226Ghz).In comparision sin(x) function written in C as a console app and measured with inline RDTSC achieved speed of execution which equals ~862(hex) cycles translated to nanoseconds it will be 964.24. Please bear in mindthat abovecalculations are not accurate and cannot be relied upon to draw conclusions , because it is well known that on multi-core CPU's with varying frequency as a consequence of speedstep technonlogy usage accurate time measurement with help of RDTSC can not be guaranted.

Here is my implementation of inlineRDTSC.

Running time of CRT

sin(x)was36396 cyclesin hex after translation to decimal units and I got whopping ~99796 nanoseconds this is the cost of CRT library calls.and here is console screen from one of the many tests

fastsin(x)

running time of fastsin() start_lo value :

1874052280 //

starting value of eaxrunning time of fastsin(): start_hi value:

10d

// starting value of edx

running time of fastsin() :end_lo value:788 //

eax value after subtraction

running time of fastsin() :end_hi value:0 //

edx value after subtractionSum:0.434965534111230230000000

fastcos(x)

running time of fastcos() start_lo value :

1876461176

running time of fastcos(): start_hi value:

10d

running time of fastcos() :end_lo value:

LoDelta : 1084

running time of fastcos() :end_hi value:

HiDelta : 0

Sum: 0.852524522059505680000000

fasttan()

running time of fasttan() start_lo value :

1878359514

running time of fasttan(): start_hi value:

10d

running time of fasttan() :end_lo value:

LoDelta : 668

running time of fasttan() :end_hi value:

HiDelta : 0

Sum :0.483055065616578410000000

time measurement of Library sin()

Delta is: 36396 //

LIBRARY CRT call

Result :0.434965534111230230000000Quoting iliyapolak

...that's mean ~2226 CPU cycles(My Cpu runs at speed of2.226Ghz).In comparision sin(x) function written in C as a console app and measured with inline RDTSC achieved speed of execution which equals ~862(hex) cycles...862(hex) =2146(dec) and it consistent with a2226number. In order toimprove accuracyyou can dothe following:- Do all tests on the 2nd CPU

- Switch a priority of the test process to Real-time ( thatreally shows better numbers for ~1% to ~2% )

Best regards,

Sergey

CPUID and RDTSC are very long latency instructions so you are not really measuring the time to compute your polynomial at the moment, the actual time for 12 mul + 11 add mustbe muchlower than what you report, something like 15-30 cycles with 100% cache hit

the best will be to follow Sergey advice: call a lot of times (for ex. 1 milliion times)the polynomial approximation in between your CPUID+RDTSC calls, you must be careful that your compiler don't optimize it by a single call

if you insist calling it only onceyou must remove a delta value from your timings, you can for example comment out all the code between the ASM block to do some calibration, you will see how much "time" it takes to do nothing

I ran the tests on second CPU and also set priority to 24(Real-Time) and did not observe any significant change in the terms of speed of execution.Measured performance of CRT sin(x) with 2^20 iteration loop and average value was

1919.0991 cycles.Disassembled msvcrt.dll inspected sin() function from what i have been able to understand CRT sin() has argument comparision to 0 and if/else statement.If arg >0 sin() jumps to subprocedure which calculates sine with the help of built-in instruction

fsin.Afaik fsin has an throughput(cpi) of ~ 40-100 cpi.I suppose that fsin also performes range-reduction and overflow checking.This situation would show fma to best advantage, but I don't know of a current architecture in which fma has a total latency of 2 clock cycles. On a core i7 the total latency of multiply and add in series is more like 11+ cycles.

Did exactly as youradvise.I performed a few tests with commented out polynomial approximation.The result was oscillating around400cycles.RDTSC and CPUIDtogether have cpi of ~224 cyclesandexecuted two times these instructions completely"destroyed" my measurement.Looked at assembly and saw that compiler used MOVSD,ADDSD,MULSD instructions for polynomial calculation.From the beginning I suspected that a handfull simple XMM instructions cannot be responsible for those hundreds of cycles.

When I tested my assembly code implementation of taylor expansion for sin(x) I ran my code (single term) inside of 1000 000 iterations loop average value was ~4.25cycles per 4 XMM instructions.

indeed for the totallatency my values are too optimitic,my estimates aremore for the rcp throughput that one will get when processing polynomials in a loop

I encourage you to read this insightful papertosee howyou can get the best fromRDTSCPor RDTSC:

http://download.intel.com/embedded/software/IA/324264.pdf

please don't forget to tell us at the end how much the optimized version is faster than you first naive implementation

bronxzv

Thanks for the pdf it was a great read.On average a fewoptimized(polynomial approximations and pre-calculated coefficients)methods(in java) when compared to the same method which are based on for-loop with library function call and divisions are 5-6 times faster and sometimes even 10 times faster.Now I have to optimize 4000 lines of code :).

Btw whenI must write code for some esoteric function like kelvin ker(x) it is not so easy task to precalculate coefficients so the procedure must be written to automate this task based on the for-loops iterations.

out of curiosity I have tested it, thepolynomial discussed here is evaluated in 25.5 +/-0.2 cycles on Ivy Bridge from data in the L1D$ with the scalar version shown below (vectorization avoidedusing \Qvec-)

with 4x unrolling it's down to 23.6 +/- 0.2 cycles

with vectorization and 4x unrolling we are at6.0 +/- 0.1 cycles

ASM :

source code :

[cpp]inline double fastsin(double x)

{

const double

coef1 = -0.16666666666666666666666666666667,

coef2 = 0.00833333333333333333333333333333,

coef3 = -1.984126984126984126984126984127e-4,

coef4 = 2.7557319223985890652557319223986e-6,

coef5 = -2.5052108385441718775052108385442e-8,

coef6 = 1.6059043836821614599392377170155e-10,

coef7 = -7.6471637318198164759011319857881e-13,

coef8 = 2.8114572543455207631989455830103e-15,

coef9 = -8.2206352466243297169559812368723e-18,

coef10 = 1.9572941063391261230847574373505e-20,

coef11 = -3.8681701706306840377169119315228e-23;

const double sqr = x*x; //x^2

return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));

}

void FastSinTest(const double a[], double b[], int size)

{

for (int i=0; i

Almost the same code was produced by my compiler,only because of CPU architecture VEX-prefixed instr. are not supported.Why Did you inline the function?try to running the test withoutinlining.

How the function chrono.getTime() is implemented?Does it use inline RDTSC or Win32 GetTickCount()?

The results with vectorization switched off are almost as expected to be in the range 20-30 cycles.

sorry but even without the "inline" keyword it's inlined by the Intel compiler, it's generally a good idea to inline polynomial evaluations since there is a long dependency chainand the compiler can schedule in between other unrelated instructions like another polynomial evaluation for ex. and/or unroll evaluation of the same polynomial (don't try it by hand in assembly!)

Chrono is based on WIN32 QueryPerformanceFrequency / QueryPerformanceCounter

Chrono::getTime() returns a time in [ms] with better than 1e-3 ms accuracy

in this example totalrun time is around 1 sec so anywall-clock time based stopwatchis OK

you typicallyget better than single clock accuracywhen you can execute a lot of time the studied code, RDTSC is better suited for more complex cases though

bronxzv

Today I ran another set of tests on my fastsin() function.The code beetwen blocks of RDTSC instructions was loopedmillion times and the delta was divided by the number of loop iterations.

On average the calculated result was 1660 cycles per iteration thats mean 745.73 nanosecond.When I measured only polynomial approximation(coefficients assignment outside of the loop)I got 1075 cycles per iteration.

I also commented out the code and ran the tests the result for the same settings was 117 cycles(probably loop overhead itself)per iteration.I performed those tests according to this post http://software.intel.com/en-us/forums/showthread.php?t=52482.

My only explanation is that RDTSC and CPUID affect the accuracy of thetime measurement for arbitrary number of iteration.Probably cumulative sum of these instructions cpi is added to polynomial execution time cpi and simply distorts the true value.

Ran another set of testswherefastsin() was called1 milliontimes from the main() and the delta measured by the GetTickCount() in miliseconds was 78 miliseconds for 1 million iterations so the one iteration is

0.000078 milisec and this is78000 nanosecond per function call.How did you guys were able to achievemore accurate results.?

Btw here is the response from one of the Intel's engineer:

"rdtsc

{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks} // AS IN MY CASE

rdtsc

is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.The rdtsc instruction is serializing with respect to itself. It is not actually serializing."

Second answer from the same post:

2." Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands".

sorry but your timings make nosense at all, for ex. 117 cycles per iteration for the loop overhead

without seeing your source code I can't point to the error(s) in your methodology

concerning RDTSC the best will beto read thePDF I have linked yesterday, it's more recent and far more insightful than the post you are refering to (btw using RDTSC is plain overkill for such a simple case with a single routine under study)

since I made the effort to test it and to post full source code I'll also suggest to compile my example in your environment, just replace the CPU frequency by yours and plug in your favorite stopwatch

I know i need to have some problems with my methodology :)

here is the snippet of code which calls fastsin() from the main()

The result of calculation was 78000 nanosecond for one iteration.

Here the fastsin() function with commented out polynomial approximation in order to" measure" RDTSC and CPUID overhead.

The result of 1million iterations loop in cycles is 117.514276 cycles.Delta is in(hex)7012024 in (dec) 117514276, result is 117514276/1000000 = 117.514276.

[bash] int start_lo,start_hi,time_lo,time_hi,counter;

counter = 1000000;

_asm{

xor eax,eax

xor edx,edx

cpuid

rdtsc

mov start_lo,eax

mov start_hi,edx

}

for(int i = 0;i

I'll be glad to also see your fastsin() implementation

since you call it with a constantthe Intel compiler will probably resolve it at compile time so I can't use it as is, also it will probably not loop over 1 million time since the same variable is written over and over again and it can see that fastsin has no global side effect, I'll suggest to base future tests on my code instead

I suggest to forget about RDTSC entirely at the moment, ged rid of all this _asm stuff and simply measure (with GetTickCount in the *outer loop*) what you are interested in: the polynomial approx.

Full implementation of fastsin() with for-loop to test polynomial block.

[bash] double fastsin(double x){

double sum = 0;

double half_pi,zero_arg;

half_pi = Pi/2;

zero_arg = Zero;

int const Iter = 1000000;

if(x > half_pi){

return (x-x)/(x-x) ;

}else if (x < zero_arg){

return (x-x)/(x-x);

}else{

int start_lo,start_hi,time_lo,time_hi;

_asm{

xor eax,eax

xor edx,edx

cpuid

rdtsc

mov start_lo,eax

mov start_hi,edx

}

for(int i = 0;i

so it looks like you actually call the poly approx.1e12 times, 1e6x from the outer loop, 1e6 xinner loop, the inner loop is probably simplified by the compiler though and you are mostly measuring the speed of printf + CPUID/RDSTC

I'll adapt my code for using GetTickCount and post it shortly so that we have some common testbed

here is the code using GetTickCount, to use it in your context: - replace "DS" by any ostream (for ex. cout) - replace the 3.5e6 constant by yours (CPU cyclesperms), note thatI get the same timings than with the previous version, justmore variation due to the poor resolution of GetTickCount vs QueryPerformanceCounter

[cpp]

inline double fastsin(double x)

{

const double

coef1 = -0.16666666666666666666666666666667,

coef2 = 0.00833333333333333333333333333333,

coef3 = -1.984126984126984126984126984127e-4,

coef4 = 2.7557319223985890652557319223986e-6,

coef5 = -2.5052108385441718775052108385442e-8,

coef6 = 1.6059043836821614599392377170155e-10,

coef7 = -7.6471637318198164759011319857881e-13,

coef8 = 2.8114572543455207631989455830103e-15,

coef9 = -8.2206352466243297169559812368723e-18,

coef10 = 1.9572941063391261230847574373505e-20,

coef11 = -3.8681701706306840377169119315228e-23;

const double sqr = x*x; //x^2

return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));

}

void FastSinTest(const double a[], double b[], int size)

{

#pragma unroll(4)

for (int i=0; i const int size = 256, nrOfTests = 1000000;

double a[size],b[size];

for (int i=0; i

fastsin() is called once from the main() and inside the function polynomial block is executed 1e6 times.I have never coded it to be called 1e12 times.When I tested it from the main polynomial block was not inside the loop.I simply posted two various test methods and you wrongly assumed that it ran 1e12 times.No it is not the case.

printf is called after the end of _asm block and the overhead of the full trip to the kernel mode multiplied by 1e6 values to be displayed is a lot of more than few hundred cycles per iteration.

here is the main() last test i got 1700cycles probably containsRDTSC+CPUID overhead.

I assumed you posted it in answer to my request here: http://software.intel.com/en-us/forums/showpost.php?p=187014

so, in other words,you have not posted a complete example using GetTickCount soit's difficult to say what you do wrong, more generally without full source code that all peers can compile welose a lot of time in misunderstandings (read:I'm getting bored)

Here is the full implementation of fastsin() tested from the main , and main() function which calls 1e6 times fastsin()the result in millisecond is63

start value: 19839910

end value : 19839973

delta is : 63

sine is: 0.434965534111230230000000

code snippet

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

## Pages

## Connectez-vous pour laisser un commentaire.