Optimization of the sine function's Taylor expansion


Hi everybody,

Iliya, your measured time, that is 63 milliseconds, for the 'fastsine' function doesn't take into account some
overhead for the 'for' loop.

Could you try the following scheme? This is pseudocode:

...
OverheadOfForStatement = 0
FastSineSpeedInMs = 0
Start = 0
End = 0
NumOfIterations = 2^22

ChangeProcessPriorityToRealTime()

Start = ::GetTickCount()
for( i = 0; i < NumOfIterations; i++ ) // Verify that a C++ compiler keeps the empty loop!
{
}
End = ::GetTickCount()
OverheadOfForStatement = ( End - Start )

Start = ::GetTickCount()
for( i = 0; i < NumOfIterations; i++ )
{
FastSine(...)
}
End = ::GetTickCount()
FastSineSpeedInMs = ( End - Start - OverheadOfForStatement ) / NumOfIterations

ChangeProcessPriorityToNormal()
...
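For reference, a compilable C++ version of the above scheme might look like this (a sketch only: FastSine is a placeholder for the function under test, and the empty loop has to be checked in the disassembly because an optimizing compiler will usually remove it):

#include <windows.h>
#include <stdio.h>

double FastSine(double x); // placeholder: the function under test, assumed defined elsewhere

int main()
{
    const unsigned int numOfIterations = 1u << 22;
    volatile double sink = 0.0; // volatile sink so the calls are not optimized away

    ::SetPriorityClass( ::GetCurrentProcess(), REALTIME_PRIORITY_CLASS );

    DWORD start = ::GetTickCount();
    for( unsigned int i = 0; i < numOfIterations; i++ )
    {
        // verify in the disassembly that the compiler keeps this empty loop!
    }
    DWORD overheadOfForStatement = ::GetTickCount() - start;

    start = ::GetTickCount();
    for( unsigned int i = 0; i < numOfIterations; i++ )
        sink = FastSine( 0.45 );
    DWORD total = ::GetTickCount() - start;

    printf( "FastSine: %f ms per call\n",
            double( total - overheadOfForStatement ) / numOfIterations );

    ::SetPriorityClass( ::GetCurrentProcess(), NORMAL_PRIORITY_CLASS );
    return 0;
}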

Best regards,
Sergey

Hi Sergey
Tomorrow I will start posting more comprehensive results from my testing. I will compare Java methods for calculating various math functions against native-code implementations of the same functions.

"Iliya, your measured time, that is 63 milliseconds, forthe 'fastsine' functiondoesn't take into account some
overhead for the 'for' loop.

Could you try the following scheme? This is a pseudo code":

Results of loop-overhead testing
As you can see, I cannot measure any loop overhead; moreover, I also checked with the debugger that the empty for-loop is executed. Priority was set with the help of Process Explorer. The assembly instructions can be counted, so the overhead is the sum of a few x86 instructions like "add, cmp, jae [target]" and should be no more than a few cycles per iteration.

start value of loop_overhead : 5600529

end value of   loop_overhead : 5600529

delta of loop_overhead is : 0

start value of fastsin(): 5600529

end value of fastsin()  : 5600607

delta  of fastsin() and loop_overhead is : 78

sine is: 0.434965534111230230000000

Hi everybody

Today I tested the execution time of the Java version of my fastsin() function. Two timing methods were used:

1. The delta of the polynomial approximation block was measured directly with nanosecond resolution as returned by System.nanoTime(); the result was 906 nanosec.

2. A one-million-iteration for-loop with the fastsin() call inside; the delta was measured by the System.currentTimeMillis() method. The result was 63 millisec, exactly the same as the native-code time delta.

The slow execution time of the first method is, I think, probably related to the possibility of System.nanoTime()
being binary-translated by the JIT compiler to the RDTSC instruction.

How could it be possible that the native-code fastsin() and the Java fastsin() both have the same execution speed, even if the Java JIT recognizes and optimizes hot spots by caching binary-translated code?

double value55 = 0; // Java version: test of the fastsin() method
long start1, end1;
final int max_Iter = 1000000;

start1 = System.currentTimeMillis();
for (int k = 0; k < max_Iter; k++) {
    value55 = func.fastsin(0.45);
}
end1 = System.currentTimeMillis();

System.out.println("running time of fastsin() is :" + (end1 - start1) + " millisec");
System.out.printf("%15s%n", "fastsin()");
System.out.printf("%.24f%n", value55);
}
RESULT:

running time of fastsin() is :63 millisec

fastsin()

0.434965534111230230000000

How could it be possible that the native-code fastsin() and the Java fastsin() both have the same execution speed, even if the Java JIT recognizes and optimizes hot spots by caching binary-translated code?

I won't rule out that the JIT is able to do well on such a simple example

Now, you are using a stopwatch with 1 ms resolution, so it's not precise enough to compare timings of around 63 ms (~3% max potential error). To improve the situation:

1) increase the iteration count so that it runs for roughly 2 s, i.e. ~30x more iterations; you'll then have a 0.1% max error

and/or

2) use a more precise stopwatch based on QueryPerformanceCounter

also, as previously noted, I don't think that calling it with a constant is the best idea; instead I'd advise sampling the domain like in the example I provided the other day
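For reference, a minimal stopwatch built on QueryPerformanceCounter might look like this (a sketch, not the exact code from my example):

#include <windows.h>

// minimal high-resolution stopwatch based on QueryPerformanceCounter
class Stopwatch
{
    LARGE_INTEGER m_freq, m_start;
public:
    Stopwatch()  { ::QueryPerformanceFrequency( &m_freq ); }
    void start() { ::QueryPerformanceCounter( &m_start ); }
    double elapsedMs() const
    {
        LARGE_INTEGER now;
        ::QueryPerformanceCounter( &now );
        return 1000.0 * double( now.QuadPart - m_start.QuadPart ) / double( m_freq.QuadPart );
    }
};

// usage: Stopwatch sw; sw.start(); /* loop under test */ double ms = sw.elapsedMs();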

Hi
I increased the iteration count to 30,000,000 and ran it a few times. I got very confusing results. It seems to me that the Java fastsin() ran faster than the native C-code fastsin(). I cannot use timing methods based on
QueryPerformanceCounter when I measure Java code.
Results for native code in msec

start value of fastsin(): 16389855
end value of fastsin() : 16392273
delta of fastsin() is : 2418
sine is: 0.434965534111230230000000

Results for Java in msec

running time of fastsin() is :647 millisec

fastsin()

0.434965534111230230000000

As you can see, the Java fastsin() is almost four times faster than the native code.

it's difficult for me to trust that the timings are roughly the same at 1M iterations and very different at 30M

I'd be interested to see the timings for both implementations with varying iteration counts; simply add an outer loop with a 1M increment

I forgot to tell you that the Java fastsin() was tested from within the Eclipse IDE; in the past I got different results with the same code when tested from the console (running under conhost.exe) versus from the Eclipse IDE.
I rewrote fastsin() as a standalone class and tested it from the console only. Two settings were used:
1. the -server switch, optimized for speed
2. the -client switch, optimized for client applications
The native fastsin() was run from VS 2010.
results for 1e6 iterations -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339584027637
end value : 1339584027644
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

results for 1e6 iterations -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339584169499
end value : 1339584169520
running time of fastsin() is :21 milisec
sine 0.434965534111230230000000

native code results for 1e6 iterations

start value of fastsin(): 24697391
end value of fastsin() : 24697469
delta of fastsin() is : 78
sine is: 0.434965534111230230000000

results for 2 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339584894864
end value : 1339584894871
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339584860265
end value : 1339584860304
running time of fastsin() is :39 milisec

native code 2 million iterations

start value of fastsin(): 25231664
end value of fastsin() : 25231804
delta of fastsin() is : 140 milisec
sine is: 0.434965534111230230000000

results for 5 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339585376561
end value : 1339585376568
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch 5 million iterations

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339585333038
end value : 1339585333135
running time of fastsin() is :97 milisec
sine 0.434965534111230230000000

native code 5 million iterations

start value of fastsin(): 25629607
end value of fastsin() : 25629965
delta of fastsin() is : 358 milisec
sine is: 0.434965534111230230000000

results for 20 million iterations

java -server switch

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339585728253
end value : 1339585728260
running time of fastsin() is :7 milisec
sine 0.434965534111230230000000

java -client switch

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339585809120
end value : 1339585809500
running time of fastsin() is :380 milisec
sine 0.434965534111230230000000

native code

start value of fastsin(): 26029671
end value of fastsin() : 26031153
delta of fastsin() is : 1482 milisec

sine is: 0.434965534111230230000000

"java -server" scores are obviously bogus, probably due to the way you call your function with a constant (3rd time thatI point it out) which can be optimized in all sort of ways

you were saying that your Java/C++ timings are the same at 1M iterations, weren't you (both 63 ms)? It's far from being the case after all

btw you were reporting C++ 1M at 63 ms, and now 78 ms (?). Are you sure that you follow basic advice such as disabling SpeedStep and turbo, ensuring thread affinity to a single core, and not running any other software (including antivirus, Windows indexing, etc.)?

Sorry, I did not understand what constant you were referring to. Now I have corrected my error and fastsin() is called with an argument derived from the loop counter.

>>you were saying that your java/C++ timings are the same at 1M iterations isn't it ? (both 63 ms) it's far from being the case after all

Disregard that; my measurements were wrong (I did them with a constant).

Now I have performed another quick set of tests; here are the results. I set affinity to a single processor, and priority was set to 24 (real-time). Everything non-relevant was closed.

java -server 20 million iterations (fastsin() called with reciprocal of loop counter)

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339595370770
end value : 1339595371222
running time of fastsin() is :452 milisec
sine 0.000000000000000000000000

java -client 20 million iterations (fastsin() called with reciprocal of loop counter)

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339595386620
end value : 1339595387533
running time of fastsin() is :913 milisec
sine 0.000000000000000000000000

native code 20 million loop iterations (fastsin() called with reciprocal of loop counter)

start value of fastsin(): 35965687
end value of fastsin() : 35967372
delta of fastsin() is : 1685
sine is: 0.000000000000000000000000

results for 1 million iterations (fastsin() called with reciprocal of loop counter)

java -server

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec
sine 0.000000000000000000000000

java -client

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec
sine 0.000000000000000000000000

native code

start value of fastsin(): 36452722
end value of fastsin() : 36452800
delta of fastsin() is : 78
sine is: 0.000000000000000000000000

Still, as you can see, the native code is slower, perhaps because of the overhead of the security cookie being checked on the stack.

Now I have performed another quick set of tests; here are the results. I set affinity to a single processor, and priority was set to 24 (real-time). Everything non-relevant was closed.

neat, did you also disable Enhanced SpeedStep in the BIOS (+ turbo if you have it)? It typically introduces wild variations for short runs

fastsin() called with reciprocal of loop counter

bad idea! division may well be as slow as your full polynomial evaluation; I'd advise working with precomputed values in an array fitting in the L1D$, plus inlining the C++ call. In other words, do exactly as in my example, sketched below:
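(Illustrative names only; the real code is in the example I linked.)

// fill a small array of arguments once (256 doubles = 2 KB, fits easily in the L1D$),
// then benchmark over it so no division is timed inside the loop
const int N = 256;
double args[N];
for( int i = 0; i < N; i++ )
    args[i] = -1.5 + 3.0 * double(i) / N; // sample the domain [-1.5, 1.5)

double sum = 0.0; // accumulate the results so the calls can't be optimized away
for( int iter = 0; iter < 1000000; iter++ )
    for( int i = 0; i < N; i++ )
        sum += fastsin( args[i] ); // fastsin assumed inlined by the compiler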

I disabled turbo in the BIOS.
Your method of testing seems to include the overhead of two for-loops, i.e. 256 million calculations of the loop counter: main calls FastSinTest() 1e6 times, and inside this function there are 256 calculations of the loop counter. Can this add significant overhead and saturate the running-time measurement?
I used a compound addition statement as an argument to fastsin(); fastsin() is inlined.

result for native code 1 million iterations.

start value of fastsin(): 39492698
end value of fastsin() : 39492760
delta of fastsin() is : 62
sine is: 0.841470444509448080000000

java -server

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec

java -client

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec

>>I used a compound addition statement as an argument to fastsin()

I'm not sure what you mean here, how about posting the code?

Can this add significant overhead and saturate the running-time measurement?

zero overhead, I'd say, on a modern Intel processor (4-wide issue): the GPR increment and effective address will be computed in parallel with the FPU computations, and the branches (with a 99+% prediction hit rate) will be macro-fused; the critical path is clearly the polynomial evaluation with its long dependency chain

I'm not sure if you still have your branches (the domain check) at the beginning of your fastsin(); if you remove them you should start to see timings in clock cycles nearer to mine (besides, my CPU has a bit better IPC), and probably more consistent scores between JIT and native too

I'm not sure if you still have your branches (the domain check) at the beginning of your fastsin(); if you remove them you should start to see timings in clock cycles nearer to mine (besides, my CPU has a bit better IPC)

I removed the branches at the beginning of fastsin() and got the same result: 63 ms per 1e6 iterations. What was your result?

For example, the running time for the gamma() Stirling approximation implemented as in post #44,
for 1e6 iterations with compound addition, is 856 millisec. Can you check this on your machine?

my latest full example is here: http://software.intel.com/en-us/forums/showpost.php?p=187021

scores were reported here: http://software.intel.com/en-us/forums/showpost.php?p=186968

I suggest that you test it *exactly as is* on your configuration to see how your timings compare to mine

I'm not up for testing yet another example

Your CPU is more powerful than mine; I have a Core i3 at 2.226 GHz. I have never been able to reach less than 60 ns.
What are your compiler settings?

Did you really test my example *as is*? 60 ms/1e6 (i.e. >120 clocks) looks way too high

Which Core i3 ?

We don't use the same compiler AFAIK, so my settings aren't very useful for you, I'd say; look at the ASM instead and post yours

Yes, a few days ago I tested your code, but I deleted it.
Core i3, Nehalem architecture.
Tomorrow I will post the asm code. Now I gotta go to work :).
Thanks for your help.

I'm not sure what you mean here, how about posting the code?

double arg;
arg = 1.0d;
Inside the loop, arg is incremented by a floating-point addition of 0.000001:
arg += 0.000001;
So inside the for-loop there is the overhead of an addsd instruction and a movsd instruction.

Asm code for the polynomial approximation block of the fastsin() function:

 000f0	f2 0f 10 45 88	 movsd	 xmm0, QWORD PTR _rad$[ebp]

  000f5	f2 0f 59 45 80	 mulsd	 xmm0, QWORD PTR _sqr$[ebp]

  000fa	f2 0f 10 4d 80	 movsd	 xmm1, QWORD PTR _sqr$[ebp]

  000ff	f2 0f 59 4d 90	 mulsd	 xmm1, QWORD PTR _coef11$[ebp]

  00104	f2 0f 58 4d 98	 addsd	 xmm1, QWORD PTR _coef10$[ebp]

  00109	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  0010e	f2 0f 58 4d a0	 addsd	 xmm1, QWORD PTR _coef9$[ebp]

  00113	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  00118	f2 0f 58 4d a8	 addsd	 xmm1, QWORD PTR _coef8$[ebp]

  0011d	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  00122	f2 0f 58 4d b0	 addsd	 xmm1, QWORD PTR _coef7$[ebp]

  00127	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  0012c	f2 0f 58 4d b8	 addsd	 xmm1, QWORD PTR _coef6$[ebp]

  00131	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  00136	f2 0f 58 4d c0	 addsd	 xmm1, QWORD PTR _coef5$[ebp]

  0013b	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  00140	f2 0f 58 4d c8	 addsd	 xmm1, QWORD PTR _coef4$[ebp]

  00145	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  0014a	f2 0f 58 4d d0	 addsd	 xmm1, QWORD PTR _coef3$[ebp]

  0014f	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  00154	f2 0f 58 4d d8	 addsd	 xmm1, QWORD PTR _coef2$[ebp]

  00159	f2 0f 59 4d 80	 mulsd	 xmm1, QWORD PTR _sqr$[ebp]

  0015e	f2 0f 58 4d e0	 addsd	 xmm1, QWORD PTR _coef1$[ebp]

  00163	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  00167	f2 0f 58 45 88	 addsd	 xmm0, QWORD PTR _rad$[ebp]

  0016c	f2 0f 11 45 f8	 movsd	 QWORD PTR _sum$[ebp], xmm0

And look at this fastsin() prologue; as you can see, int 3 opcode bytes (0xCC) are copied to the stack buffer. AFAIK this is only done in pre-release code. I think that this accounts for the slower execution speed when compared to the Java solution.

00000	55		 push	 ebp

  00001	8b ec		 mov	 ebp, esp

  00003	81 ec 80 00 00

	00		 sub	 esp, 128		; 00000080H

  00009	57		 push	 edi

  0000a	8d 7d 80	 lea	 edi, DWORD PTR [ebp-128]

  0000d	b9 20 00 00 00	 mov	 ecx, 32			; 00000020H

  00012	b8 cc cc cc cc	 mov	 eax, -858993460		; ccccccccH

  00017	f3 ab		 rep stosd

Concerning your prologue code, I don't understand why you have it with an inlined call; maybe your "inline" statement is ignored? Do you have a stronger keyword available, such as "__forceinline"? But indeed, starting a function with a useless "rep stosd" is not the best thing to do if you want fast code; it's probably a primary source of the poor performance you are enduring so far

anyway, *before starting any performance tuning* the first thing to do is to *compile in release mode*

your compiler keeps loading 'sqr' from the stack instead of using a register for it (when it's used by almost half the instructions, always read-only); this looks like another potential source of the lackluster performance. On the other hand, the Java JIT may well be smarter and allocate a register for it, given the low register pressure in this example

I see that the computation of 'sqr' is missing in your ASM dump and the useless store to 'rad' isn't shown either, so I suppose there is at least one mulsd + 2 avoidable store instructions not shown here; useless stores are typically bad for performance. Next time please post a complete example, i.e. the full loop, like I do here: http://software.intel.com/en-us/forums/showpost.php?p=186968

all in all your compiler looks pretty weak at optimization, and this is the most likely explanation for the JIT-compiled Java code being faster than the native code that you are seeing here

also note this is a perfect example for a JIT compiler, since the compilation time of a short program is amortized over several million iterations of the loop; the JIT compilation overhead is basically zero

Quoting iliyapolak

>>...Please read this pdf written by W. Kahan about the awful performance of the floating-point Java implementation
>>...www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

Hi everybody,

Iliya,

Thanks for the link but, unfortunately, I'm not impressed with W. Kahan's work. I think it makes sense
to discuss it on some Java forum, or on the 'Software Tuning, Performance Optimization & Platform Monitoring' forum.

Best regards,
Sergey

>>anyway, *before starting any performance tuning* the first thing to do is to *compile in release mode*

Of course, I did not compile it in release mode; hence this pesky debug-mode overhead was introduced by the compiler.

>>I see that the computation of 'sqr' is missing in your ASM dump and the useless store to 'rad' isn't shown either

The useless rad assignment was removed; thanks for spotting it. For millions of iterations such a useless store can be costly.

>>all in all your compiler looks pretty weak at optimization, and this is the most likely explanation for the JIT-compiled Java code being faster than the native code that you are seeing here

Now I am posting code which was compiled in release mode.

The results of 10 million iterations for the release code

running time of fastsin() release code is: 31 milisec
fastsin() is: 0.909297421962549370000000

Full code, which also includes the main() loop;
the fastsin() call is fully inlined inside main()'s for-loop

; 23   :  int main(void){
  00000	55		 push	 ebp

  00001	8b ec		 mov	 ebp, esp

  00003	83 e4 c0	 and	 esp, -64		; ffffffc0H

  00006	83 ec 30	 sub	 esp, 48			; 00000030H
; 24   : 	double e1 = 0;

; 25   : 	 double sine;

; 26   : 	 sine = 0;

; 27   : 	double gam;

; 28   : 	gam = 0;

; 29   : 	double fastgam;

; 30   : 	fastgam = 0;

; 31   : 	 double arg1;

; 32   : 	 arg1 = 1.0f;
  00009	f2 0f 10 05 00

	00 00 00	 movsd	 xmm0, QWORD PTR _one

  00011	53		 push	 ebx

  00012	55		 push	 ebp

  00013	56		 push	 esi

  00014	57		 push	 edi
; 33   : 	unsigned int start2,end2;

; 34   : 	 start2 = GetTickCount();
  00015	8b 3d 00 00 00

	00		 mov	 edi, DWORD PTR __imp__GetTickCount@0

  0001b	f2 0f 11 44 24

	30		 movsd	 QWORD PTR _arg1$[esp+64], xmm0

  00021	ff d7		 call	 edi

  00023	f2 0f 10 15 00

	00 00 00	 movsd	 xmm2, QWORD PTR __real@3e7ad7f2a0000000

  0002b	f2 0f 10 25 00

	00 00 00	 movsd	 xmm4, QWORD PTR __real@3b4761b41316381a

  00033	f2 0f 10 2d 00

	00 00 00	 movsd	 xmm5, QWORD PTR __real@3bd71b8ef6dcf572

  0003b	f2 0f 10 35 00

	00 00 00	 movsd	 xmm6, QWORD PTR __real@3c62f49b46814157

  00043	f2 0f 10 5c 24

	30		 movsd	 xmm3, QWORD PTR _arg1$[esp+64]

  00049	8b f0		 mov	 esi, eax

  0004b	b8 40 42 0f 00	 mov	 eax, 1000000		; 000f4240H

$LL9@main:
; 35   : 	 for(int i2 = 0;i2<10000000;i2++){
  00050	48		 dec	 eax
; 36   : 		 arg1 += 0.0000001f;
  00051	f2 0f 58 da	 addsd	 xmm3, xmm2

  00055	f2 0f 58 da	 addsd	 xmm3, xmm2

  00059	f2 0f 58 da	 addsd	 xmm3, xmm2

  0005d	f2 0f 58 da	 addsd	 xmm3, xmm2

  00061	f2 0f 58 da	 addsd	 xmm3, xmm2

  00065	f2 0f 58 da	 addsd	 xmm3, xmm2

  00069	f2 0f 58 da	 addsd	 xmm3, xmm2

  0006d	f2 0f 58 da	 addsd	 xmm3, xmm2

  00071	f2 0f 58 da	 addsd	 xmm3, xmm2

  00075	f2 0f 58 da	 addsd	 xmm3, xmm2
; 37   : 		 sine = fastsin(arg1);
  00079	66 0f 28 cb	 movapd	 xmm1, xmm3

  0007d	f2 0f 59 cb	 mulsd	 xmm1, xmm3

  00081	66 0f 28 f9	 movapd	 xmm7, xmm1

  00085	f2 0f 59 fc	 mulsd	 xmm7, xmm4

  00089	66 0f 28 c5	 movapd	 xmm0, xmm5

  0008d	f2 0f 5c c7	 subsd	 xmm0, xmm7

  00091	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  00095	f2 0f 5c c6	 subsd	 xmm0, xmm6

  00099	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  0009d	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ce952c77030ad4a

  000a5	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000a9	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3d6ae7f3e733b81f

  000b1	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000b5	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3de6124613a86d09

  000bd	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000c1	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3e5ae64567f544e4

  000c9	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000cd	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ec71de3a556c734

  000d5	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000d9	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3f2a01a01a01a01a

  000e1	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000e5	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3f81111111111111

  000ed	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000f1	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3fc5555555555555

  000f9	f2 0f 59 cb	 mulsd	 xmm1, xmm3

  000fd	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  00101	f2 0f 58 c3	 addsd	 xmm0, xmm3

  00105	f2 0f 11 44 24

	30		 movsd	 QWORD PTR _sine$[esp+64], xmm0

  0010b	0f 85 3f ff ff

	ff		 jne	 $LL9@main

Here is code where the compiler chooses to use x87 instructions:

  00000	dd 44 24 04	 fld	 QWORD PTR _x$[esp-4]

  00004	d9 c0		 fld	 ST(0)

  00006	d8 c9		 fmul	 ST(0), ST(1)
; 103  :

; 104  : 		    sum = x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
  00008	dd 05 00 00 00

	00		 fld	 QWORD PTR __real@3b4761b41316381a

  0000e	d8 c9		 fmul	 ST(0), ST(1)

  00010	dc 2d 00 00 00

	00		 fsubr	 QWORD PTR __real@3bd71b8ef6dcf572

  00016	d8 c9		 fmul	 ST(0), ST(1)

  00018	dc 25 00 00 00

	00		 fsub	 QWORD PTR __real@3c62f49b46814157

  0001e	d8 c9		 fmul	 ST(0), ST(1)

  00020	dc 05 00 00 00

	00		 fadd	 QWORD PTR __real@3ce952c77030ad4a

  00026	d8 c9		 fmul	 ST(0), ST(1)

  00028	dc 25 00 00 00

	00		 fsub	 QWORD PTR __real@3d6ae7f3e733b81f

  0002e	d8 c9		 fmul	 ST(0), ST(1)

  00030	dc 05 00 00 00

	00		 fadd	 QWORD PTR __real@3de6124613a86d09

  00036	d8 c9		 fmul	 ST(0), ST(1)

  00038	dc 25 00 00 00

	00		 fsub	 QWORD PTR __real@3e5ae64567f544e4

  0003e	d8 c9		 fmul	 ST(0), ST(1)

  00040	dc 05 00 00 00

	00		 fadd	 QWORD PTR __real@3ec71de3a556c734

  00046	d8 c9		 fmul	 ST(0), ST(1)

  00048	dc 25 00 00 00

	00		 fsub	 QWORD PTR __real@3f2a01a01a01a01a

  0004e	d8 c9		 fmul	 ST(0), ST(1)

  00050	dc 05 00 00 00

	00		 fadd	 QWORD PTR __real@3f81111111111111

  00056	d8 c9		 fmul	 ST(0), ST(1)

  00058	dc 25 00 00 00

	00		 fsub	 QWORD PTR __real@3fc5555555555555

  0005e	d9 c9		 fxch	 ST(1)

  00060	d8 ca		 fmul	 ST(0), ST(2)

  00062	de c9		 fmulp	 ST(1), ST(0)

  00064	de c1		 faddp	 ST(1), ST(0)
; 105  :

; 106  :

; 107  :

; 108  :

; 109  :

; 110  :

; 111  :

; 112  : 	   return sum;

; 113  :    }
  00066	c3		 ret	 0

?fastsin@@YANN@Z ENDP					; fastsin

we can see that this code still has a lot of useless computations: all the addsd xmm3, xmm2 at the start of the loop. It's a strange thing to basically multiply the value by 10; it's maybe due to the increment being declared as a float that's added to a double?

31 ms for 1e7 iterations, i.e. ~6 cycles per iteration (20x faster than your previous reports), is too low; I don't see how it can match the code you posted above. I suspect that you reported the timings of the 1e6x case

it looks like your compiler does a better job at generating x87 code than SSE2

>>it's maybe due to the increment being declared as a float that's added to a double

Yes.

>>31 ms for 1e7 iterations, i.e. ~6 cycles per iteration (20x faster than your previous reports), is too low

This is strange, because when I measured the execution time with 1e6 iterations the delta was 0. Bear in mind that only fastsin() is inlined inside the for-loop; my other function (an Exponential Integral polynomial approximation), when compiled in release mode, is called from main, and its result is 78 millisec per 1e6 iterations.

start val of fastsin() 19222349
end val of fastsin() 19222349
running time of fastsin() release code is: 0 millisec
fastsin() is: 0.891207360591512180000000

Compiler settings

/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Here are the results for the fastexp() function, which is almost the same as fastsin():
1e6 iterations, called with the loop counter.

start value of fastexp() is: 20642489
end value of fastexp() is: 20642520
delta of fastexp() is: 31 millisec

so after all it looks like your compiler unrolls it 10x (see mov eax, 1000000 in the ASM); since you sum the results of the polynomial evaluation and your argument series is perfectly regular and known at compile time, it's actually "optimized" and defeats your measurements => one more time, I'd advise using my (now one week old!) example instead: it was carefully designed to avoid such compiler issues

I will give your method a try and use it as a model to defeat compiler unrolling and inlining. But my other methods, which are also implemented as polynomial approximations, were not inlined and unrolled. I will post the asm code of those methods. Does it not seem strange that only in the case of fastsin the compiler used such an optimization?
Bronxzv, thanks for your help; I learned many interesting things by reading your answers.

Hi everybody,

Iliya, I just completed reading all the latest posts from Page #9 and I'll try to do some testing with your 'FastSin' function
implemented in C. So, in what thread do you want to continue the discussion? At the moment you already have two threads...

Best regards,
Sergey

Hi Sergey
Glad to see you back.:)
I prefer to continue the discussion in this thread because it is already known and the most viewed.
I am uploading a very interesting document named "K5 Transcendental Functions".
This pdf describes the testing and the hardware implementation (in the AMD K5 CPU) of transcendental functions.
From this doc you can clearly see that the engineers used the Mathematica program as a reference model to test their fixed-precision implementation against. You can also learn that a Taylor expansion with pre-calculated coefficients was used.

Attachment: cp_1995-04.pdf (506.96 KB)

You can also learn that a Taylor expansion with pre-calculated coefficients was used

Actually, they say the coefficients were generated on the fly (page 165), thus the choice of the Taylor series, which is easy and fast

AFAIK modern methods use precomputed coefficients with minimax polynomials computed offline; for a software implementation the memory used by the coefficients is a non-issue anyway

also note that the range reduction algorithm they report isn't easily amenable to a software implementation since it works only with multiple-precision arithmetic; see Elementary Functions: Algorithms and Implementation, 2nd edition (Jean-Michel Muller, Birkhäuser), pages 173-191, for other range reduction methods like Cody and Waite or Payne and Hanek
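For illustration, the Cody and Waite idea in its simplest form (a sketch: the split constants below are the fdlibm values for pi/2, and this is only valid for arguments of moderate magnitude):

#include <math.h>

// Cody-and-Waite style reduction by the nearest multiple of pi/2:
// pi/2 is split into a high part C1 (chosen so that k*C1 is exact for moderate k)
// and a low part C2; subtracting in two steps keeps the rounding error small
const double C1 = 1.57079632673412561417e+00; // high part of pi/2
const double C2 = 6.07710050650619224932e-11; // pi/2 - C1

double reducePiOver2( double x, int &quadrant )
{
    double k = floor( x / 1.5707963267948966 + 0.5 ); // nearest multiple of pi/2
    quadrant = (int)k & 3; // tells which quadrant the original angle was in
    return ( x - k * C1 ) - k * C2;
}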

Hi everybody!

Another set of tests, this time the gamma() function Stirling approximation.
There are two implementations.
One of them is based on 1-D arrays, library pow() calls, and division inside a for-loop
to calculate the coefficients and to sum the series expansion. Both functions use 14 terms of the Stirling expansion.
The second method is faster because the filling and summation of the coefficient arrays is eliminated
and pow() library calls are not used. Instead, the function relies on simple calculation of
consecutive powers of the argument x, with each numerator divided by the power multiplied by a constant denominator.
Pre-calculation of the coefficients was not possible because of the dependency of the denominator on the argument x.
Accuracy when compared to Mathematica code is between 15 and 16 decimal places.

Code for C implementation of stirling gamma function
slower version

inline double gamma(double x){
double sum,result;
sum = 0;
result = 0;
if( x > gamma_huge){
return (x-x)/(x-x); //NaN

}else if( x < one){
return (x-x)/(x-x);
}else{

double ln,power,pi_sqrt,two_pi,arg;

two_pi = 2*Pi;
double coef1[] = {1.0,1.0,139.0,571.0,163879.0,
5246819.0,534703531.0,4483131259.0,
432261921612371.0, 6232523202521089.0,
25834629665134204969.0,1579029138854919086429.0,
746590869962651602203151.0,1511513601028097903631961.0
};

double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,
75246796800.0, 902961561600.0,86684309913600.0,
514904800886784000.0, 86504006548979712000.0,
13494625021640835072000.0,9716130015581401251840000.0,
116593560186976815022080000.0,2798245444487443560529920000.0

};
int const size = 14;

double temp[size];
double temp2[size];
double temp3[size];
int i,j;
long k;
k = 0;
for(i = 0;i
Results of 1e6 iterations of gamma(), called with an argument incremented by 0.000001. Microsoft optimizing compiler:

gamma() start value is: 13214329
gamma() end value is: 13214688
execution time of gamma() 1e6 iterations is: 359 millisec
gamma 1.999999997453735200000000

Intel Compiler optimized for speed

gamma() start value is: 24295767
gamma() end value is: 24296016
execution time of gamma() 1e6 iterations is: 249 millisec
gamma 1.999999997453735200000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP

Today I tested fastsin() with 1e6 iterations and the result was 15 millisec, i.e. ~33.39 cycles per iteration on my CPU. Finally the results are close to yours.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000

bronxzv wrote:
"AFAIK modern methods use precomputed coefficients with minimax polynomials computed offline; for a software implementation the memory used by the coefficients is a non-issue anyway."

When I studied the book "Numerical Recipes in C" I saw that many of its implementations are based on
for-loop coefficient calculation. In order to optimize the speed of execution, the authors do not use library pow() calls for power calculations. A few times I also saw division used where multiplication by a reciprocal would have done.
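That technique reads roughly like this (a hypothetical helper, not code from the book):

// evaluates c[0]*x + c[1]*x^2 + ... + c[n-1]*x^n;
// consecutive powers are built with one multiply each instead of calling pow()
double polyPowers( const double *c, int n, double x )
{
    double z = x;      // current power of x
    double sum = 0.0;
    for( int i = 0; i < n; i++ )
    {
        sum += c[i] * z;
        z *= x;        // next power: a single multiply replaces a pow() call
    }
    return sum;
}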

Results for the faster version of the gamma Stirling approximation, which uses 10 Stirling series coefficients. I was not able to use the Horner scheme directly because of accumulated error.

Code for faster version

inline double fastgamma(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x); // NaN
    }else if( x < one){
        return (x-x)/(x-x); // NaN
    }else{
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        double nom1,nom2,nom3,nom4,nom5,nom6,nom7,nom8,nom9,nom10;
        double denom1,denom2,denom3,denom4,denom5,denom6,denom7,denom8,denom9,denom10;
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10;
        double z1,z2,z3,z4,z5,z6,z7,z8,z9,z10;
        nom1 = 1.0;
        nom2 = 1.0;
        nom3 = 139.0;
        nom4 = 571.0;
        nom5 = 163879.0;
        nom6 = 5246819.0;
        nom7 = 534703531.0;
        nom8 = 4483131259.0;
        nom9 = 432261921612371.0;
        nom10 = 6232523202521089.0;
        denom1 = 12.0;
        denom2 = 288.0;
        denom3 = 51840.0;
        denom4 = 2488320.0;
        denom5 = 209018880.0;
        denom6 = 75246796800.0;
        denom7 = 902961561600.0;
        denom8 = 86684309913600.0;
        denom9 = 514904800886784000.0;
        denom10 = 86504006548979712000.0;
        z1 = x;
        z2 = z1*z1;  // x^2
        z3 = z2*z1;  // x^3
        z4 = z3*z1;  // x^4
        z5 = z4*z1;  // x^5
        z6 = z5*z1;  // x^6
        z7 = z6*z1;  // x^7
        z8 = z7*z1;  // x^8
        z9 = z8*z1;  // x^9
        z10 = z9*z1; // x^10
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt; // sqrt(2*Pi) * x^(x-1/2) * e^(-x)
        coef1 = nom1/(z1*denom1);
        coef2 = nom2/(z2*denom2);
        coef3 = nom3/(z3*denom3);
        coef4 = nom4/(z4*denom4);
        coef5 = nom5/(z5*denom5);
        coef6 = nom6/(z6*denom6);
        coef7 = nom7/(z7*denom7);
        coef8 = nom8/(z8*denom8);
        coef9 = nom9/(z9*denom9);
        coef10 = nom10/(z10*denom10);
        result = one+coef1+coef2-coef3-coef4+coef5+coef6-coef7-coef8+coef9+coef10;
    }
    return sum*result;
}

Results of 1e6 iterations of fastgamma(), called with an argument starting at 2.0 and incremented by 0.000001

fastgamma() start value is 13214688
fastgamma() end value is 13214875
execution time of fastgamma() 1e6 iterations is: 187 millisec
fastgamma 2.000000016396600500000000

The same test compiled with Intel compiler

fastgamma() start value is 24296016
fastgamma() end value is 24296157
execution time of fastgamma() 1e6 iterations is: 141 millisec
fastgamma 2.000000016396600500000000

Compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP

I found a couple of notes related to some concerns about Performance and Security Cookie Checking ( or Buffer Security Checks )...

1. In Post #132 ( current thread ):
// ...Still, as you can see, the native code is slower, perhaps because of the overhead of the security cookie being checked on the stack...

2. In Post #142 ( current thread ):
// ...indeed starting a function with a useless "rep stosd" is not the best thing to do if you want fast code...

3. In a thread 'Java polynomial approx. faster than C code polynomial approx.':

http://software.intel.com/en-us/forums/showthread.php?t=105973&o=a&s=lr

In Post #5:

/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE"
/Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope
/Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

I see that the compiler option Buffer Security Check is disabled: '/GS-'.

Best regards,
Sergey

Quoting iliyapolak: "Results for the faster version of the gamma Stirling approximation which uses 10 Stirling series coefficients..."
Hi Iliya,

Wouldn't it be better to have a different thread for it? For example, all discussions could be separated as follows:

'FastSin' in a Thread A
'FastGamma' in a Thread B
'FastExp' in a Thread C

etc.

I think in that case the content of the threads in the forum will be more consistent and useful for everybody.

Best regards,
Sergey

Wouldn't it be better to have a different thread for it? For example, all discussions could be separated as follows:

'FastSin' in a Thread A
'FastGamma' in a Thread B
'FastExp' in a Thread C

Sergey!
I think it is a good idea to separate the discussion into different threads.
Now I am starting with gamma function optimization and speed-of-execution testing.

It was not disabled when I compiled in Debug mode; hence those pesky rep stosd instructions in the function's prologue were present.

I've made a comment regarding Buffer Security Checks ( compiler option /GS ) in Post #160. Please take a look.

Hi Sergey!

I created a new thread on the gamma Stirling approximation. Can you try to test those functions?

I've made a comment regarding Buffer Security Checks ( compiler option /GS ) in Post #160. Please take a look.

The code was compiled in release mode with buffer checks disabled. The earlier compilations were in Debug mode.

Hi Iliya,

I'd like to make a comment regarding your method of error processing, that is:
...
return ( x-x ) / ( x-x );
...

For any 'x', the expression ( x-x ) / ( x-x ) equals -1.#IND00000000000. Here is a Test-Case:
...
double x = -25.0L;
double y = ( x-x ) / ( x-x );
printf( "%.15f\n", y );
...
Output:
-1.#IND00000000000

The CRT trigonometric functions 'sin' and 'cos' always return some value because they do a range reduction. Why did you decide
to use ( x-x ) / ( x-x )?

Best regards,
Sergey

Hi Sergey, I did not implement range reduction in fastsin. The function simply returns a NaN value when the input is out of range. I know that the Taylor series for sin gives wrong results when the argument in radians is larger than 3, so I used a simple input-checking method to calculate only in the range 0 < |x| < Pi/2. Btw, please take a look at the speed-of-execution testing of the various fastgamma() implementations; you can see that the Horner scheme approach can be 4x faster than the iterative method.
P.S
In the fastsin() function posted here the input checking is written incorrectly. It should be:

if(x > (Pi/2)){
return (x-x)/(x-x); // return NaN
}else if(x < -(Pi/2)){
return (x-x)/(x-x); // return NaN
}else{
// fastsin() code here .....

Quoting iliyapolak...I did not implement range reduction in fastsin.

[SergeyK] It is almost the same ( 75% ) for the trigonometric API in my case.

The function simply returns NaN value when an input is out of range...

Hi Iliya,

I don't know if you have some requirements for your API or you're doing some R&D. So, I simply would like to express
my opinion and it is as follows:

Let me mention the well-known fact that the sine trigonometric function repeats itself every 2*PI radians. In the case of
large angles, a subtraction of a multiple of 2*PI can be done and the new value of the angle will be
between 0 and 2*PI. Then, let's think about practical applications of some API. What if somebody needs to calculate
the sine of 750 degrees ( which reduces to PI/6, or 30 degrees )? It is possible to make a note that it is mandatory for a user
to do their own range reduction. The CRT function 'sin' would be very impractical if it didn't have range reduction and
everybody were forced to do it. It would create a "perfect" case for inconsistencies of FP operations
with trigonometric functions, and everybody would complain about it.
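For illustration, the naive form of that reduction (a sketch: fastsin_kernel is a hypothetical kernel assumed valid on the reduced range, and fmod-based reduction loses accuracy for very large angles):

#include <math.h>

double fastsin_kernel( double x ); // hypothetical: assumed valid on [-Pi, Pi]

// bring any angle into [-Pi, Pi] by subtracting a multiple of 2*Pi, then evaluate
double sinAnyAngle( double x )
{
    const double Pi    = 3.14159265358979323846;
    const double twoPi = 2.0 * Pi;
    double r = fmod( x, twoPi ); // remainder in (-2*Pi, 2*Pi)
    if( r >  Pi ) r -= twoPi;
    if( r < -Pi ) r += twoPi;
    return fastsin_kernel( r );
}

// e.g. 750 degrees = 750*Pi/180 radians, which reduces to Pi/6 ( 30 degrees )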

Best regards,
Sergey

Quoting Sergey Kostrov: "...let's think about practical applications of some API. What if somebody needs to calculate the sine of 750 degrees ( which reduces to PI/6, or 30 degrees )?.."

I decided to apply a more flexible approach when working on the implementation of the trigonometric API:

- For a set of macros:

input value is in radians
there is no range reduction
always returns some value
speed rating ( ***** )

- For a set of methods in a template class:

input value could be in radians or in degrees
there is no range reduction
always returns some value
speed rating ( **** )

- For a set of non-template based external functions:

input value could be in radians or in degrees
there is a range reduction
always returns some value
speed rating ( *** )

All these sets are implemented in C, absolutely portable, faster than the standard CRT functions ( speed rating *** ),
and less accurate than the standard CRT functions.

Best regards,
Sergey

Quoting Sergey Kostrov...All these sets are implemented in C, absolutely portable, faster then standard CRT-functions...

And one more comment regarding performance...

Performance of different Sine functions compared to the CRT function 'sin' when tested on different platforms:

- Normalized Taylor Series up to 7th term is the Fastest implementation
for Windows Desktop Platforms

- Normalized Series up to 7th term is the Fastest implementation
for Windows CE Platforms

- Normalized Chebyshev Polynomial up to 9th term is the Fastest implementation
for 16-bit Platforms ( like, MS-DOS )

Performance of different Sine functions compared to the CRT function 'sin' when tested on different platforms:

Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for speed and for accuracy?
My result was 16 millisec for 1e6 iterations: exactly 35.616 cycles per iteration for fastsin() without the input checking. As I remember, your results were higher, something like 140 cycles per iteration.

I don't know if you have some requirements for your API or you're doing some R&D. So, I simply would like to express
my opinion and it is as follows

Hi Sergey
I am doing it for fun and for the sake of learning. I have only 8 months of serious programming experience, and I am interested mainly in numerical analysis and computational physics (taking my first steps). So I thought that the best way to learn is to use a book like the "Handbook of Mathematical Functions" and try to develop my own implementations of the various formulas. I have already written 52 functions (if you are interested I can send you a Java source file) with varying degrees of accuracy and speed of execution.

>>Let me mention a well known fact that the sine trigonometric function repeats itself every 2*PI

I know this pretty well, and as I stated above it is not serious API code and will not be distributed among users. Not now, but maybe later. :)

I decided to apply a more flexible approach when working on implementation of trigonometric API

This is a very good reference model for how to implement and develop code.

>>less accurate then standard CRT-functions

Why don't you implement your sine functions exactly like fastsin(), which has an absolute error of 0 when compared to the CRT sin()?

Do you have experience in Java programming?
Have you considered developing an API based only on complex numbers and complex functions?

Quoting iliyapolak

Performance of different Sine functions compared to the CRT function 'sin' when tested on different platforms:

Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for speed and for accuracy?..
It is already in progress and I'll post the results soon.

Best regards,
Sergey
