Optimization of the sine function's Taylor expansion


Hi Iliya,

Quoting iliyapolak: ...So I thought that the best way to learn is to use a book like "Handbook of Mathematical Functions" and try to develop
my own implementations of the various formulas. I have already written 52 functions (if you are interested I can send you a Java source file)
with varying degrees of accuracy and speed of execution...

It would be awesome if you shared your work. Since it is in Java, a forum like "Android Applications on Intel Architecture" or "MKL" could be the most suitable. Please confirm it with the Moderators of the Intel Forums.

Thank you!

Best regards,
Sergey

Quoting iliyapolak...
>>less accurate than standard CRT functions

Why don't you implement your sine functions exactly like fastsin(), which has an absolute error of 0 when compared to CRT sin()?

[SergeyK] It was decided that for a real-time environment the number of terms in the series has to be reduced to 7 and 9.

Do you have any experience in Java programming?

[SergeyK] My experience with Java is insignificant. I would say: 95% is C/C++ and 5% all the rest of the programming languages,
like FoxPro, Java, Gupta, Assembler, Visual Basic, Pascal.

Have you considered developing an API which is based only on complex numbers and complex functions?

[SergeyK] No.

Best regards,
Sergey

Thank you, Sergey!
I decided to post the source code of my Functions class. It is still in beta, but all methods work. You will find there many esoteric functions: Bessel, Kelvin, Sine Integral, Cosine Integral, various Gamma functions and many more. As I stated earlier in my post, those methods are not written for speed of execution; in fact they are very slow, look here: http://software.intel.com/en-us/forums/showthread.php?t=106032
You can see how pre-calculation of coefficients and the Horner scheme can improve speed of execution (in the case of my gamma function it was ~4x faster).
Please feel free to use my code, to port it to C/C++, to improve it and to test it. And do not forget to give your opinion on my work. :)
P.S.
You can open the Java source file in any editor, even Notepad, but I recommend using the Eclipse Indigo IDE.
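To make the Horner gain concrete, here is a minimal hypothetical sketch (not taken from the attached Functions.java) contrasting a naive Taylor sine, which recomputes powers and factorials on every call, with a Horner evaluation over pre-calculated coefficients:

#include <math.h>

/* Naive: recomputes powers and factorials on every call. */
double sin_naive( double x )
{
	double sum = 0.0, fact;
	for( int n = 0; n < 8; n++ )
	{
		fact = 1.0;
		for( int k = 2; k <= 2*n + 1; k++ ) fact *= k;      /* (2n+1)! */
		sum += ( n % 2 ? -1.0 : 1.0 ) * pow( x, 2*n + 1 ) / fact;
	}
	return sum;
}

/* Horner scheme over pre-calculated +-1/(2n+1)! coefficients. */
double sin_horner( double x )
{
	static const double c[8] = {
		 1.0,                      /*  1/1!  */
		-1.6666666666666666e-01,   /* -1/3!  */
		 8.3333333333333333e-03,   /*  1/5!  */
		-1.9841269841269841e-04,   /* -1/7!  */
		 2.7557319223985891e-06,   /*  1/9!  */
		-2.5052108385441719e-08,   /* -1/11! */
		 1.6059043836821615e-10,   /*  1/13! */
		-7.6471637318198165e-13    /* -1/15! */
	};
	double x2 = x * x, p = c[7];
	for( int i = 6; i >= 0; i-- ) p = p * x2 + c[i];        /* Horner step */
	return p * x;
}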

Attachments:

Download Functions.java (88.93 KB)

Since it is in Java, a forum like "Android Applications on Intel Architecture" or "MKL" could be the most suitable. Please confirm it with the Moderators of the Intel Forums.

Hi Sergey

Now I'm porting the Java "Functions" class to a C++ static library. Where possible I will optimize for speed. I have already ported a few functions. When the job is done I will post the source file.

>>or "MKL"

I think that this forum is too math-centric and more suited for professional mathematicians than for programmers.

It is already in progress and I'll post results soon

If you are interested in testing a more complicated function than sin(), please look here: http://software.intel.com/en-us/forums/showthread.php?t=106032

My experience with Java is insignificant. I would say: 95% is C/C++ and 5% all the rest of the programming languages,
like FoxPro, Java, Gupta, Assembler, Visual Basic, Pascal

I think that learning x86 assembly should be obligatory for professional programmers. It should be 70% C/C++ and 30% assembly.

Quoting Sergey Kostrov, quoting iliyapolak:

Performance of different sine functions compared to the CRT function 'sin' when tested on different platforms:

Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for speed and accuracy?..
It is already in progress and I'll post results soon...

Comments:

- Number of Iterations - 2^22
- Time of execution of a single call has to be calculated as follows: ( Completed in XXX ticks ) divided by ( Number of iterations ), for example: 297 / 2^22
- Microsoft C++ compiler
- All optimizations disabled
- Release configuration

...
Application - ScaLibTestApp - WIN32_MSC
Tests: Start
> Test1067 Start <
Sub-Test 1.1
Completed in 297 ticks
CRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1
Completed in 234 ticks
Normalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1
Completed in 234 ticks
Normalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1
Completed in 266 ticks
Normalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1
Completed in 266 ticks
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1
Completed in 328 ticks
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1
Completed in 234 ticks
Normalized:
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1
Completed in 203 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1
Completed in 234 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1
Completed in 532 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 453 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Sub-Test 13.1
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro

Sub-Test 14.1
Completed in 266 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 15.1
Completed in 265 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Cos( 30.0 ) = 0.8660254037844386000000

> Test1067 End <
Tests: Completed
...
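A note on the LUT sub-tests ( 14.1 and 15.1 ): a 1-degree lookup table with linear interpolation is conceptually as simple as the sketch below; the table layout and names are illustrative assumptions, not the actual ScaLib code. At 30.0 degrees the argument falls exactly on a table node, which is why the interpolated result is exactly 0.5.

#include <math.h>

static double g_dSinLut[ 361 ];                 // sin( 0..360 deg ), 1 deg step

void InitSinLut( void )
{
	for( int i = 0; i <= 360; i++ )
		g_dSinLut[ i ] = sin( i * 3.14159265358979323846 / 180.0 );
}

// Linear interpolation between the two nearest table entries;
// expects 0 <= dDeg < 360 and InitSinLut() called beforehand.
double LutSin( double dDeg )
{
	int    i = ( int )dDeg;
	double f = dDeg - i;                        // fractional part of the degree
	return g_dSinLut[ i ] + f * ( g_dSinLut[ i + 1 ] - g_dSinLut[ i ] );
}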

I'll post modified versions of your 'FastSin' function later. I need to do my regular project-related stuff.

Best regards,
Sergey

Quoting iliyapolak: ...I would like to ask you how I can rewrite this code in order to gain a speed-of-execution improvement...
Hi Iliya, I would say this is your "sacred" question, and here are a couple of versions of your 'FastSin' function.

...

// Normalized Taylor Series ( up to 23rd term ) - V1 - Not Optimized
RTdouble FastSinV1( RTdouble dX );
RTdouble FastSinV1( RTdouble dX )
{
	RTdouble dSum = 0.0L;

	if( dX > ( MC_PI / 2.0L ) )								// Checks for a range -Pi/2 < x < Pi/2
	{
		return ( dX - dX )/( dX - dX );						// Returns NaN
	}
	else
	if( dX < ( -MC_PI / 2.0L ) )
	{
		return ( -dX + dX )/( -dX + dX );					// Returns NaN
	}
	else
	{
		RTdouble dCoef1, dCoef2, dCoef3, dCoef4, dCoef5, dCoef6,
				 dCoef7, dCoef8, dCoef9, dCoef10, dCoef11,
				 dRad, dSqr;

		dCoef1  = -0.16666666666666666666666666666667000;	// -1/3!
		dCoef2  =  0.00833333333333333333333333333333000;	//  1/5!
		dCoef3  = -1.9841269841269841269841269841270e-04;	// -1/7!
		dCoef4  =  2.7557319223985890652557319223986e-06;	//  1/9!
		dCoef5  = -2.5052108385441718775052108385442e-08;	// -1/11!
		dCoef6  =  1.6059043836821614599392377170155e-10;	//  1/13!
		dCoef7  = -7.6471637318198164759011319857881e-13;	// -1/15!
		dCoef8  =  2.8114572543455207631989455830103e-15;	//  1/17!
		dCoef9  = -8.2206352466243297169559812368723e-18;	// -1/19!
		dCoef10 =  1.9572941063391261230847574373505e-20;	//  1/21!
		dCoef11 = -3.8681701706306840377169119315228e-23;	// -1/23!

		dRad = dX;
		dSqr = dX * dX;										// dX^2

		dSum = dRad + dRad*dSqr*( dCoef1 + dSqr*( dCoef2 + dSqr*( dCoef3 + dSqr*( dCoef4 +
					  dSqr*( dCoef5 + dSqr*( dCoef6 + dSqr*( dCoef7 + dSqr*( dCoef8 +
					  dSqr*( dCoef9 + dSqr*( dCoef10 + dSqr*( dCoef11 )))))))))));
	}
	return ( RTdouble )dSum;
}

...

...

// Normalized Taylor Series ( up to 23rd term ) - V2 - Optimized
RTdouble dCoef1  = -0.16666666666666666666666666666667000;	// -1/3!
RTdouble dCoef2  =  0.00833333333333333333333333333333000;	//  1/5!
RTdouble dCoef3  = -1.9841269841269841269841269841270e-04;	// -1/7!
RTdouble dCoef4  =  2.7557319223985890652557319223986e-06;	//  1/9!
RTdouble dCoef5  = -2.5052108385441718775052108385442e-08;	// -1/11!
RTdouble dCoef6  =  1.6059043836821614599392377170155e-10;	//  1/13!
RTdouble dCoef7  = -7.6471637318198164759011319857881e-13;	// -1/15!
RTdouble dCoef8  =  2.8114572543455207631989455830103e-15;	//  1/17!
RTdouble dCoef9  = -8.2206352466243297169559812368723e-18;	// -1/19!
RTdouble dCoef10 =  1.9572941063391261230847574373505e-20;	//  1/21!
RTdouble dCoef11 = -3.8681701706306840377169119315228e-23;	// -1/23!

RTdouble FastSinV2( RTdouble dX );
RTdouble FastSinV2( RTdouble dX )
{
	if( dX > ( MC_PI / 2.0L ) )								// Checks for a range -Pi/2 < x < Pi/2
	{
		return ( dX - dX )/( dX - dX );						// Returns NaN
	}
	else
	if( dX < ( -MC_PI / 2.0L ) )
	{
		return ( -dX + dX )/( -dX + dX );					// Returns NaN
	}
	return ( RTdouble )dX + dX*dX*dX*( dCoef1 + dX*dX*( dCoef2 + dX*dX*( dCoef3 + dX*dX*( dCoef4 +
							dX*dX*( dCoef5 + dX*dX*( dCoef6 + dX*dX*( dCoef7 + dX*dX*( dCoef8 +
							dX*dX*( dCoef9 + dX*dX*( dCoef10 + dX*dX*( dCoef11 )))))))))));
}

...

A version of 'FastSinV2' without checks for the range -Pi/2 < x < Pi/2 will be your fastest version
in the C language:

...

// Normalized Taylor Series ( up to 23rd term ) - V3 - Optimized - Without checks for the range -Pi/2 < x < Pi/2
RTdouble dCoef1  = -0.16666666666666666666666666666667000;	// -1/3!
RTdouble dCoef2  =  0.00833333333333333333333333333333000;	//  1/5!
RTdouble dCoef3  = -1.9841269841269841269841269841270e-04;	// -1/7!
RTdouble dCoef4  =  2.7557319223985890652557319223986e-06;	//  1/9!
RTdouble dCoef5  = -2.5052108385441718775052108385442e-08;	// -1/11!
RTdouble dCoef6  =  1.6059043836821614599392377170155e-10;	//  1/13!
RTdouble dCoef7  = -7.6471637318198164759011319857881e-13;	// -1/15!
RTdouble dCoef8  =  2.8114572543455207631989455830103e-15;	//  1/17!
RTdouble dCoef9  = -8.2206352466243297169559812368723e-18;	// -1/19!
RTdouble dCoef10 =  1.9572941063391261230847574373505e-20;	//  1/21!
RTdouble dCoef11 = -3.8681701706306840377169119315228e-23;	// -1/23!

RTdouble FastSinV3( RTdouble dX );
RTdouble FastSinV3( RTdouble dX )
{
	return ( RTdouble )dX + dX*dX*dX*( dCoef1 + dX*dX*( dCoef2 + dX*dX*( dCoef3 + dX*dX*( dCoef4 +
							dX*dX*( dCoef5 + dX*dX*( dCoef6 + dX*dX*( dCoef7 + dX*dX*( dCoef8 +
							dX*dX*( dCoef9 + dX*dX*( dCoef10 + dX*dX*( dCoef11 )))))))))));
}

...

Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC
Tests: Start
> Test1067 Start <

Sub-Test 1.1
Completed in 297 ticks
CRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1
Completed in 235 ticks
Normalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1
Completed in 234 ticks
Normalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1
Completed in 266 ticks
Normalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1
Completed in 265 ticks
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1
Completed in 328 ticks
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1
Completed in 203 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1
Completed in 234 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1
Completed in 516 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 406 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Sub-Test 13.1
Completed in 360 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV3 - Optimized / No Checks

Sub-Test 14.1
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro

Sub-Test 15.1
Completed in 266 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 16.1
Completed in 265 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Cos( 30.0 ) = 0.8660254037844386000000

> Test1067 End <
Tests: Completed

Here are performance numbers for the 'sin' C macros:

...
Completed in 62 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000 - C Macro
...
Completed in 94 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000 - C Macro
...
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro
...
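The macros themselves are not posted in this thread; a 7-term variant in the same spirit might look like the sketch below ( purely illustrative, not the actual ScaLib macro ). A macro avoids the function-call overhead entirely, which is consistent with these being the fastest numbers reported:

/* Hypothetical Taylor-series sine macro ( terms up to x^7 );
   the real ScaLib 'C Macro' is not shown in this thread. */
#define FAST_SIN_7T( x )                                        \
	( ( x ) + ( x )*( x )*( x )*( -1.666666666666667e-01 +      \
	  ( x )*( x )*(  8.333333333333333e-03 +                    \
	  ( x )*( x )*( -1.984126984126984e-04 ) ) ) )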

Sub-Test 11.1
Completed in 532 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 453 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Hi Sergey!
Thanks for testing. I have a few questions regarding the results and the method of testing.
Bronxzv in one of his responses told me not to test with a constant value because of possible compiler optimization; here is the quote from his post: "also, as previously noted, I don't think that calling it with a constant is the best idea, instead I'll advise to sample the domain like in the example I provided the other day".
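What bronxzv suggests amounts to something like the following sketch: sweeping the domain keeps the compiler from constant-folding the call and exercises many input values ( fastsin() is assumed to be the function under test; the bounds and step are arbitrary ):

#include <stdio.h>

extern double fastsin( double x );          /* function under test */

int main( void )
{
	double sum = 0.0;
	/* Sweep the valid domain instead of calling with a constant. */
	for( double x = -1.57; x <= 1.57; x += 0.000001 )
		sum += fastsin( x );
	printf( "checksum = %.17f\n", sum );    /* keeps the loop live */
	return 0;
}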

What does XXX Ticks stand for? Is it nanoseconds or CPU cycles?

Here is my test of fastsin(), written exactly as your unoptimized version:

Tested today: fastsin() over 1e6 iterations, and the result was 15 millisec, i.e. ~33.39 cycles per iteration for my CPU.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000

Why are your results so large, almost as my slow implementation of the gamma Stirling approximation? Please compare here: http://software.intel.com/en-us/forums/showthread.php?t=105474

Sergey!

Did you download the Java source file which I uploaded yesterday?
There you have plenty of room for implementing the Horner scheme with coefficient pre-calculation.
For an example optimization of the gamma Stirling approximation, please look here: http://software.intel.com/en-us/forums/showpost.php?p=188061

Here is the code and test results

inline double fastgamma3( double x )
{
	// 'one', 'gamma_huge' and 'Pi' are constants defined elsewhere in the original source.
	double result = 0, sum = 0, num, denom;

	if( x >= 0.01 && x <= one )
	{
		// MiniMax approximation calculated by Mathematica 8
		double const coef1 =  6.69569585833067770821885e+6;
		double const coef2 =  407735.985300921332020398;
		double const coef3 =  1.29142492667105836457693e+6;
		double const coef4 =  1.00000000000000000000000000e+00;
		double const coef5 =  6.69558099277749024219574e+6;
		double const coef6 =  4.27571696102861619139483e+6;
		double const coef7 = -2.89391642413453042503323e+6;
		double const coef8 =  317457.367152592609873458;

		num   = coef1 + x*( coef2 + x*( coef3 ));
		denom = coef4 + x*( coef5 + x*( coef6 + x*( coef7 + x*( coef8 ))));
		return num / denom;
	}
	else if( x >= one && x <= gamma_huge )
	{
		// Stirling-series coefficients
		double const coef_1  =  0.08333333333333333333333333;
		double const coef_2  =  0.00347222222222222222222222;
		double const coef_3  = -0.00268132716049382716049383;
		double const coef_4  = -0.000229472093621399176954733;
		double const coef_5  =  0.000784039221720066627474035;
		double const coef_6  =  0.0000697281375836585777429399;
		double const coef_7  = -0.000592166437353693882864836;
		double const coef_8  = -0.0000517179090826059219337058;
		double const coef_9  =  0.000839498720672087279993358;
		double const coef_10 =  0.0000720489541602001055908572;

		double two_pi  = 2*Pi;
		double invx    = 1/x;
		double expterm = exp( -x );				// e^-x
		double arg     = x - 0.5;
		double power   = pow( x, arg );			// x^( x - 1/2 )
		double pi_sqrt = sqrt( two_pi );

		sum    = expterm*power*pi_sqrt;			// sqrt( 2*Pi ) * x^( x - 1/2 ) * e^-x
		result = one + invx*( coef_1 + invx*( coef_2 + invx*( coef_3 + invx*( coef_4 +
					   invx*( coef_5 + invx*( coef_6 + invx*( coef_7 + invx*( coef_8 +
					   invx*( coef_9 + invx*( coef_10 ))))))))));
	}

	return sum*result;							// 0.0 if x was outside both ranges
}
Speed of execution for the first branch ( MiniMax approximation ), 1e6 iterations:

fastgamma3() start value is 25488363
fastgamma3() end value is 25488379
execution time of fastgamma3() 1e6 iterations is: 16 millisec
fastgamma3() is: 1.489191725597434100000000

Quoting iliyapolak: ...Results of loop-overhead testing
As you can see, I cannot measure the loop overhead; moreover, I also checked with a debugger that the empty for-loop is executed. Priority was set with the help of Process Explorer. The assembly instructions can be counted, so the overhead is the sum of a few x86 instructions like "jae [target], add 1, cmp some_value" and should be no more than a few cycles per iteration.
...
start value of loop_overhead : 5600529
end value of loop_overhead : 5600529
delta of loop_overhead is : 0
...

Hi Iliya,

This is a follow-up on two posts, #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of
an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't and your result was 0, something else
was wrong. I'll take a look at it some time this week.

Best regards,
Sergey

Quoting iliyapolak: ...What does XXX ticks stand for? Is it nanoseconds or CPU cycles?

[SergeyK] XXX means a number of ticks. The Win32 API function 'GetTickCount' returns a value in
milliseconds, or 'ticks'. In my tests every 'sine' function is called 2^22 times and the time
for one call is calculated by dividing the 'TicksNumber' by 2^22. For example:

78 ms ( ticks ) / 2^22 = 0.000018596649169921875 ms ~= 0.0000186 ms
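Conceptually the harness is something like the following sketch ( the names are illustrative, not the actual ScaLibTestApp code ):

#include <windows.h>
#include <stdio.h>

extern double FastSinV2( double x );               /* function under test */

void TimeSine( void )
{
	const int iterations = 1 << 22;                /* 2^22 calls */
	volatile double d = 0.0;                       /* defeats dead-code removal */

	DWORD dwStart = GetTickCount();
	for( int i = 0; i < iterations; i++ )
		d += FastSinV2( 0.523598775598298873 );    /* 30 deg in radians */
	DWORD dwTicks = GetTickCount() - dwStart;

	printf( "Completed in %lu ticks\n", dwTicks );
	printf( "Per call: %.10f ms\n", ( double )dwTicks / iterations );
}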

...

Quoting iliyapolak: ...Why are your results so large, almost as my slow implementation...

[SergeyK] Because your computer is faster.

Iliya, you're trying to compare incomparable values. Let's assume that for the same function:

- Value VA in ms was obtained on computer A with CPU frequency FA
- Value VB in ms was obtained on computer B with CPU frequency FB
- If value VB is less than value VA, computer B is faster than computer A

All performance numbers I usually post are for reference only.

You need to compare your value(s) against other value(s) ( a "reference" ) obtained on the same
computer with a similar CRT function.

Once again, I don't measure the absolute performance of some function in milliseconds, nanoseconds or
clock cycles. I always measure relative performance. Let's say I've set a target to outperform some
"reference function" from the CRT library. If my function is faster, the target is achieved. I know that accuracy is
affected since 7, or 9, or 11 terms are used, but that is another issue.

Best regards,
Sergey

Quoting Sergey Kostrov, quoting iliyapolak: ...Why are your results so large, almost as my slow implementation...

[SergeyK] Because your computer is faster.

...I always measure relative performance...

And here are results of a very special test:

Performance of the C 'sin' macros, measured with the RDTSC instruction, is as follows:

Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles

However, at the time of testing a Windows Update was in progress...
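For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using the MSVC '__rdtsc' intrinsic; serialization and core affinity are ignored, so treat the numbers as approximate:

#include <intrin.h>
#include <stdio.h>

extern double FastSinV2( double x );       /* function under test */

void CyclesPerCall( void )
{
	const int iterations = 1 << 20;
	volatile double d = 0.0;               /* defeats dead-code removal */

	unsigned __int64 t0 = __rdtsc();
	for( int i = 0; i < iterations; i++ )
		d += FastSinV2( 0.5 );
	unsigned __int64 t1 = __rdtsc();

	printf( "~%.1f cycles per call\n", ( double )( t1 - t0 ) / iterations );
}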

Best regards,
Sergey

Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles

These results are closer to the results achieved by me and by bronxzv. From a theoretical point of view it should be very fast, because of the few additions and multiplications and the single store of the accumulator register to memory.

78 ms ( ticks ) / 2^22 = 0.000018596649169921875 ms ~= 0.0000186 ms

Try to run your tests for 1e6 iterations.

Here are my results for 2^22 iterations of the unoptimized fastsin(), but heavily optimized by the Intel C++ compiler with full code inlining inside the loop and 10x loop unrolling.

start val of fastsin() 5566115
end val of fastsin() 5566178
running time of fastsin() release code is: 63 millisec
fastsin() is: 0.988566034786635070000000

63 ms / 2^22 = 0.0000150203704833984375 ms ~= 15 ns per iteration
Your code ran at 18.6 ns per iteration, probably due to less aggressive compiler optimization.
Btw, my CPU is a kinda slow Core i3 at 2.226 GHz.

This is a follow-up on two posts, #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of
an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't and your result was 0, something else
was wrong. I'll take a look at it some time this week

Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles for the compare, increment-counter and jump-to-top-of-loop assembly instructions. In real-world testing the CPU should execute such a loop in parallel with the statements inside the loop.

- Value VA in ms was obtained on computer A with CPU frequency FA
- Value VB in ms was obtained on computer B with CPU frequency FB
- If value VB is less than value VA, computer B is faster than computer A

Yes, I was wrong. It is not only the different CPUs, but also the non-deterministic environment, uncontrollable by us, which can greatly affect our measurements.

Let's say I've set a target to outperform some
"reference function" from the CRT library

The MSVCRT.DLL sin() function, after argument checking, calls the x87 FSIN instruction. I suppose that FSIN also performs range reduction and input checking; its CPI is ~40-100 cycles. On top of that you have to add the overhead of the library wrapper and argument checking, so CRT sin() will always be slower than our fastsin() implementation. Even x87 FSIN is slower on average than fastsin(), because of range reduction and, if I'm not wrong, coefficient pre-calculations done on the fly. This explains the accuracy of fastsin() compared to CRT sin().
Here is a code snippet of _CIsin, disassembled by IDA Pro, which is the compiler-optimized version of many sin() implementations.

.text:6FF759A2                 push    edx
.text:6FF759A3                 fstcw   [esp+4+var_4]
.text:6FF759A7                 jz      loc_6FF7E5AC
.text:6FF759AD                 cmp     [esp+4+var_4], 27Fh
.text:6FF759B3                 jz      short loc_6FF759BB
.text:6FF759B5                 fldcw   ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:                           ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB                 fsin                    ; x87 instruction call
.text:6FF759BD                 fstsw   ax
.text:6FF759C0                 sahf
.text:6FF759C1                 jp      loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:                           ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7                 cmp     dword_6FFF50C4, 0
.text:6FF759CE                 jnz     loc_6FF6C003
.text:6FF759D4                 mov     edx, 1Eh
.text:6FF759D9                 lea     ecx, unk_6FFF5390
.text:6FF759DF                 jmp     loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2    endp


In real-world testing the CPU should execute such a loop in parallel with the statements inside the loop

indeed, that's why it's not sensible to measure the timing of an empty loop as an estimate of its impact on a real loop: there are typically no conflicts between the GPR increment-and-compare and the FADD/FMUL ports, moreover the branch is macro-fused and has a high prediction hit rate, so all in all the actual overhead is almost 0

a great tool to see how several instructions are executed in parallel, and to study your critical path, is IACA, available here:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

there are typically no conflicts between the GPR increment-and-compare and the FADD/FMUL ports, moreover the branch is macro-fused and has a high prediction hit rate, so all in all the actual overhead is almost 0

The CPU's internal logic will easily optimize such a case; I would say this is a classic case where integer calculations are executed in parallel with floating-point instructions. The situation will be slightly different when a floating-point-based for-loop is used. There the loop's FP code could be interleaved with the main calculation body and executed by Port 0 and Port 1 in parallel with the loop's body.

Btw, thanks for the link.

Quoting iliyapolak: ...63 ms / 2^22 = 0.0000150203704833984375 ms ~= 15 ns per iteration
Your code ran at 18.6 ns per iteration, probably due to less aggressive compiler optimization...

I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.

Quoting iliyapolak: ...CRT sin() will always be slower than our fastsin() implementation...
This is not always true, and CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183,
based on a Chebyshev polynomial ). This is not a surprise for me because CRT 'sin' is implemented in assembler.
ALL my sine functions are implemented in C.

This is not always true, and CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183,
based on a Chebyshev polynomial )

It could be possible, because the CPI of FSIN varies between 40 and 100 cycles per instruction. If FSIN can reach 40-50 CPI and your custom Chebyshev-based implementation is not optimized aggressively by the compiler, it is possible that CRT sin() could run faster.

>>This is not a surprise for me because CRT 'sin' is implemented in assembler

The MSVCRT.DLL sin() function has some overhead from argument checking, comparing to 0 and branching on the result to the location which in the end calls x87 FSIN ( you can clearly see the call to FSIN in my post above ). When CRT sin() is called many times, the compiler can simply inline the code in the for-loop block; because of the almost 100% predictable backward branching, the assembly code for the overhead will already be present in the cache, decoded to uops, so it can greatly speed up the execution time of CRT sin().
Btw, the JVM's Math.sin also uses FSIN, but range reduction is performed at the bytecode level, so it is slower than the MSVCRT sine functions.

I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms

Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
It could be very interesting to see the results of such a measurement.

Quoting iliyapolak

I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms

Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
It could be very interesting to see the results of such a measurement.

Iliya,

I work on a project and I can't do everything I want. I tested your 'FastSin' function and, unfortunately,
I need to move on to other things. Your set of special functions implemented in Java is good ( thank you again! ) but
it has no practical application for the project. Please take a look at this thread for more details:

http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr

Post #3

Best regards,
Sergey

Quoting iliyapolak

Let's say I've set a target to outperform some
"reference function" from the CRT library

The MSVCRT.DLL sin() function, after argument checking, calls the x87 FSIN instruction. I suppose that FSIN also performs range reduction...
The 'FSIN' instruction accepts any value in the range from -2^63 to +2^63. If an input value is outside of that range, a reduction of
the input value must be done, and Intel recommends using the 'FPREM' instruction.
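In C the same reduction can be sketched with fmod() standing in for FPREM; note that a naive reduction like this loses accuracy for very large arguments, which is why Payne-Hanek-style reduction exists. 'fastsin_core' below is a placeholder for one of the FastSin variants:

#include <math.h>

#define MC_PI 3.14159265358979323846

extern double fastsin_core( double x );      /* series valid on [-Pi/2, Pi/2] */

double ReducedSin( double dX )
{
	double r = fmod( dX, 2.0 * MC_PI );      /* r in ( -2Pi, 2Pi ) */

	if( r >  MC_PI ) r -= 2.0 * MC_PI;       /* fold into ( -Pi, Pi ] */
	if( r < -MC_PI ) r += 2.0 * MC_PI;
	if( r >  MC_PI / 2.0 ) r =  MC_PI - r;   /* sin( Pi - r )  = sin( r ) */
	if( r < -MC_PI / 2.0 ) r = -MC_PI - r;   /* sin( -Pi - r ) = sin( r ) */

	return fastsin_core( r );
}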

Best regards,
Sergey

Your set of special functions implemented in Java is good ( thank you again! ) but
it has no practical application for the project

Thank you, Sergey. Maybe in the foreseeable future you could implement some parts of the special functions class. I would be very glad to see it :)

What is your main area of software development? Is it numerical-methods software?

'FSIN' instruction accepts any value in the range from -2^63 to +2^63

I was talking about the range reduction ( or mapping ) of large radian values ( input -2^63, +2^63 ) to values in the range |x| <= Pi/2.

Coming back a bit to the topic of this forum, I'm wondering why no attempt has been made to parallelize the calculation or otherwise speed up the usual Horner scheme. One such method is Estrin's method.

For sine, Horner's scheme is as follows ( the constants c_n = ±1/n! for the usual Taylor approximation ):
x2 = x*x;
result = ((((((c13*x2+c11)*x2+c9)*x2+c7)*x2+c5)*x2+c3)*x2+c1)*x;

In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,[c13]
addsd xmm0,[c11]
mulsd xmm0,xmm2
addsd xmm0,[c9]
mulsd xmm0,xmm2
addsd xmm0,[c7]
mulsd xmm0,xmm2
addsd xmm0,[c5]
mulsd xmm0,xmm2
addsd xmm0,[c3]
mulsd xmm0,xmm2
addsd xmm0,[c1]
mulsd xmm0,xmm1
movsd [eax],xmm0

A simplified variant of Estrin's scheme is
x2 = x*x;
x4 = x2*x2;
result =
( (((+c11)*x4+c7)*x4+c3)*x2+
(((+c13)*x4+c9)*x4+c5)*x4+c1 )*x;

In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,xmm0
movsd xmm3,xmm0
movsd xmm4,xmm0 ; x4
mulsd xmm0,[c13]
mulsd xmm3,[c11]
addsd xmm0,[c9]
addsd xmm3,[c7]
mulsd xmm0,xmm4
mulsd xmm3,xmm4
addsd xmm0,[c5]
addsd xmm3,[c3]
mulsd xmm0,xmm4
mulsd xmm3,xmm2
addsd xmm0,[c1]
addsd xmm0,xmm3
mulsd xmm0,xmm1
movsd [eax],xmm0

At least on Atom N450 and Core i7 920 this is a gain.

Very interesting.
Here is the Horner scheme for polynomial approximation of the tan() function. The first four terms of the tan() Taylor series are shown below. It is not optimal in terms of speed, because of the stack accesses which are used to calculate the powers of the variable x.

main_loop:
	addps	xmm5, step
	movups	xmm7, xmm5
	movups	xmm0, xmm7		; xmm0 accumulator
	mulps	xmm7, xmm7
	movups	xmm6, xmm7		; copy of x^2
	mulps	xmm7, xmm0		; x^3
	movups	[ebp-16], xmm7		; store x^3
	movups	xmm1, coef1
	mulps	xmm1, xmm7
	addps	xmm0, xmm1		; x + x^3/3
	movups	xmm7, [ebp-16]
	mulps	xmm7, xmm6		; x^5
	movups	[ebp-32], xmm7		; store x^5
	movups	xmm1, coef2
	mulps	xmm1, xmm7
	addps	xmm0, xmm1		; + 2*x^5/15
	movups	xmm7, [ebp-32]
	mulps	xmm7, xmm6		; x^7
	movups	[ebp-48], xmm7		; store x^7
	movups	xmm1, coef3
	mulps	xmm1, xmm7
	addps	xmm0, xmm1		; + 17*x^7/315
	movups	xmm7, [ebp-48]
	mulps	xmm7, xmm6		; x^9
	movups	[ebp-64], xmm7		; store x^9
	movups	xmm1, coef4
	mulps	xmm1, xmm7
	addps	xmm0, xmm1		; + 62*x^9/2835

This is not the Horner scheme, which starts with the last coefficient ( highest index ).
Why do you store the odd powers of x? You don't need to store them, and you don't even need to reload them.
BTW: You should try to avoid modifying a register which is needed by the very next instruction; always keep a producer and its consumer as far apart as possible, especially when the producer has a long latency.

This is not the Horner scheme, which starts with the last coefficient ( highest index ).

Right, I was thinking about my other implementation of the tan() function, written in C, when I made that comment. It is a straightforward implementation of the Taylor expansion with pre-calculated coefficients.
Btw, polynomial evaluation can also start from the lowest coefficient; I have seen it many times, for example in the FDLIBM implementation of the sine function.

>>Why do you store the odd powers of x? You don't need to store them and you even need not reload them

Because when I wrote this code I did not know how to efficiently multiply the powers of x without the stack accesses. I simply implemented various Taylor expansions in assembly without even thinking about code optimization.

It could be possible, because the CPI of FSIN varies between 40 and 100 cycles per instruction

that's one more reason for testing it over a sub-domain instead of a single value

once you have refactored your code to use basic building blocks like EvalPoly5() here: http://software.intel.com/en-us/forums/showpost.php?p=186020
your code will not only be more readable, it will also be very easy to test alternate evaluation methods, like the one suggested by sirrida here: http://software.intel.com/en-us/forums/showpost.php?p=188457
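the actual EvalPoly5() is in the linked post; presumably it is close to this Horner building block, shown here with a hypothetical 11-term sine on top of it:

/* Hypothetical reconstruction of the EvalPoly5() building block:
   evaluates a degree-5 polynomial in x by Horner's rule. */
static inline double EvalPoly5( double x, double c0, double c1, double c2,
                                double c3, double c4, double c5 )
{
	return c0 + x*( c1 + x*( c2 + x*( c3 + x*( c4 + x*c5 ))));
}

/* Example: 11-term Taylor sine via the building block, sin(x) ~ x*P(x^2). */
double Sin11t( double x )
{
	double x2 = x * x;
	return x * EvalPoly5( x2, 1.0, -1.0/6, 1.0/120, -1.0/5040,
	                      1.0/362880, -1.0/39916800 );
}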

Interesting; what were the values which caused such a great variation in the CPI of FSIN?

Thanks to this forum and you guys I know how to perform basic code optimization. The refactored code is a lot faster, more readable and more elegant compared to my previous inefficient and ugly coding style.
Off-topic question: how is it possible that every newly registered member already has some points assigned to him?

Quoting iliyapolak

This is a follow-up on two posts, #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of
an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't and your result was 0, something else
was wrong. I'll take a look at it some time this week

Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles for the compare, increment-counter and jump-to-top-of-loop assembly instructions. In real-world testing the CPU should execute such a loop in parallel with the statements inside the loop.

Hi Iliya,

I just completed a simple verification, and with ALL compiler optimizations disabled this code works:

	...

	// Sub-Test 6 - Overhead of Empty For Statement
	{
	///*
		CrtPrintf( RTU("Sub-Test 6   - [ Empty For Statement   ]\n") );

		g_uiTicksStart = SysGetTickCount();
		for( RTint t = 0; t < 100000000; t++ )
		{
			;
		}
		CrtPrintf( RTU("Sub-Test 6   - 100,000,000 iterations - %4ld ms\n"),
				   ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
	//*/
	}

	...

Quoting iliyapolak: Thanks to this forum and you guys I know how to perform basic code optimization...
I recommend that you look at the Intel manuals at:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?wapkw=Manuals

You could find there a manual on optimization techniques.

Best regards,
Sergey

Quoting iliyapolak: ...How is it possible that every newly registered member already has some points assigned to him?

I think it is because they didn't opt out of the Intel Black Belt Software Developer Program:

http://www.intel.com/software/blackbelt

and every time a member submits a post the system assigns some number of points.

Best regards,
Sergey

and every time a member submits a post the system assigns some number of points

That means that I have lost many points.

I recommend that you look at the Intel manuals at

Thanks for the link.
I have this manual. I would also like to recommend to you a very interesting book, "Inner Loops"; here is the link: http://www.amazon.com/Inner-Loops-Sourcebook-Software-Development/dp/0201479605/ref=sr_1_1?s=books&ie=UTF8&qid=1340854371&sr=1-1&keywords=inner+loops

The book is old ( even very old ) and still deals with Pentium and Pentium Pro code optimization, but you can find there a lot of info regarding various code-optimization techniques, performed mainly in assembly.

I just completed a simple verification, and with ALL compiler optimizations disabled this code works

In my case I probably did not disable optimization, hence the result was 0.
Can you post your result?

Quoting iliyapolak

and every time a member submits a post the system assigns some number of points

That means that I have lost many points.

Yes, unfortunately. I see that you're already in the program!

I could have had a brown belt. Now I'm working hard to regain my lost points.

The rcpps instruction is likely to be useful only where 12 bits of precision may be adequate. It's advocated, with iterative improvement, for "throughput" optimization where you have independent calculations pending which would otherwise be stalled while the FPU is occupied by the divide. Core i7-3 cuts that stall time in half, so you should be aware that you may be optimizing for a past architecture. We are still struggling with software implementations of divide.
I don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients.
A speed improvement may be obtained by modifying the Horner scheme, e.g.
x + x**3 * (a3 + a5 * x**2 + x**4 * (a7 + a9 * x**2))
so as to make fuller use of the instruction pipeline.
The Ivy Bridge "Core i7-3" should speed up those minimax rational approximations, and FMA hardware would help significantly with these polynomial evaluations.
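In scalar C that regrouping looks like the sketch below ( degree 9, signed Taylor coefficients ); the x4*( ... ) sub-expression can execute while ( a3 + a5*x2 ) is still in flight, which is where the pipeline gain comes from:

/* Pipelined regrouping of the degree-9 sine polynomial,
   following the expression above. */
double Sin9Pipelined( double x )
{
	const double a3 = -1.0/6, a5 = 1.0/120, a7 = -1.0/5040, a9 = 1.0/362880;
	double x2 = x * x;
	double x3 = x2 * x;
	double x4 = x2 * x2;
	return x + x3 * ( a3 + a5 * x2 + x4 * ( a7 + a9 * x2 ) );
}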
