Results for the faster version of gamma stirling approximation which uses 10 stirling series coefficients.Was not able to use Horner scheme directly because of an accumulated error when implemented division inside the polynomial.So my code looks really ugly:)

Code for faster version.

inline double fastgamma(double x){ double sum,result; sum = 0; result = 0; if( x > gamma_huge){ return (x-x)/(x-x); }else if( x < one){ return (x-x)/(x-x); }else{ double ln,power,pi_sqrt,two_pi,arg; two_pi = 2*Pi; double nom1,nom2,nom3,nom4,nom5,nom6,nom7,nom8,nom9,nom10; double denom1,denom2,denom3,denom4,denom5,denom6,denom7,denom8,denom9,denom10; double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10; double z1,z2,z3,z4,z5,z6,z7,z8,z9,z10; nom1 = 1.0; nom2 = 1.0; nom3 = 139.0; nom4 = 571.0; nom5 = 163879.0; nom6 = 5246819.0; nom7 = 534703531.0; nom8 = 4483131259.0; nom9 = 432261921612371.0; nom10 = 6232523202521089.0; denom1 = 12.0; denom2 = 288.0; denom3 = 51840.0; denom4 = 2488320.0; denom5 = 209018880.0; denom6 = 75246796800.0; denom7 = 902961561600.0; denom8 = 86684309913600.0; denom9 = 514904800886784000.0; denom10 = 86504006548979712000.0; z1 = x; z2 = z1*z1;//x^2 z3 = z2*z1;//x^3 z4 = z3*z1;//x^4 z5 = z4*z1;//x^5 z6 = z5*z1;//x^6 z7 = z6*z1;//x^7 z8 = z7*z1;//x^8 z9 = z8*z1;//x^9 z10 = z9*z1;//x^10 ln = exp(-x); arg = x-0.5; power = pow(x,arg); pi_sqrt = sqrt(two_pi); sum = ln*power*pi_sqrt; coef1 = nom1/(z1*denom1); coef2 = nom2/(z2*denom2); coef3 = nom3/(z3*denom3); coef4 = nom4/(z4*denom4); coef5 = nom5/(z5*denom5); coef6 = nom6/(z6*denom6); coef7 = nom7/(z7*denom7); coef8 = nom8/(z8*denom8); coef9 = nom9/(z9*denom9); coef10 = nom10/(z10*denom10); result = one+coef1+coef2-coef3-coef4+coef5+coef6-coef7-coef8+coef9+coef10; } return sum*result; }

Results of 1e6 iterations of fastgamma() called with an argument 2.0 incremented by 0.000001

Microsoft optimized compiler

**fastgamma() start value is 13214688fastgamma() end value is 13214875execution time of fastgamma() 1e6 iterations is: 187 millisecfastgamma 2.000000016396600500000000**

The same test compiled with Intel compiler

**fastgamma() start value is 24296016fastgamma() end value is 24296157execution time of fastgamma() 1e6 iterations is: 141 millisecfastgamma 2.000000016396600500000000**

IntelCompiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP

## Optimization and Execution speed testing of gamma stirling approximation

Hi everybody!

My previous thread on sine function optimization grow very large so I was being asked by Sergey Kostrov to create a new threads for my other functions implementations.

Now I'am starting with gamma stirling approximation.The code is based on formula taken from "Handbook of Mathematical Functions" p. 257 formula 6.1.37

Function was tested against the book's table and later against the Mathematica 8which is used as areference model for an accuracy comparision.

There are two implementations.

One of them is based on 1d arrays ,library pow() calls and division inside for-loop

to calculate coefficients and to sum the series expansion.First function uses 14 terms of stirling expansion second uses 10 terms of stirling expansion btw regarding the slower implementation with more terms was not able to increase the accuracy when compared to Mathematica 8 reference result.

Second method is faster because array'scoefficients fillingand summation is eliminated

and also pow() library calls are not used.Instead the function relies on simple calculation of

argument x ofconsecutivepowers mulitplied by constant denominator and divided by thenumerator.

Pre-calculation of coefficients was not possible because of dependency of denominator on argument x.

Accuracy when compared to Mathematica code is between 15-16 decimal places.

Feel free to optimize and test this code and please do not forget to post your opinion.

Code for C implementation of stirling gamma function

slower version

[bash]inline double gamma(double x){

double sum,result;

sum = 0;

result = 0;

if( x > gamma_huge){

return (x-x)/(x-x); //NaN

}else if( x < one){

return (x-x)/(x-x);

}else{

double ln,power,pi_sqrt,two_pi,arg;

two_pi = 2*Pi;

double coef1[] = {1.0,1.0,139.0,571.0,163879.0,

5246819.0,534703531.0,4483131259.0,

432261921612371.0, 6232523202521089.0,

25834629665134204969.0,1579029138854919086429.0,

746590869962651602203151.0,1511513601028097903631961.0

};

double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,

75246796800.0, 902961561600.0,86684309913600.0,

514904800886784000.0, 86504006548979712000.0,

13494625021640835072000.0,9716130015581401251840000.0,

116593560186976815022080000.0,2798245444487443560529920000.0

};

int const size = 14;

double temp[size];

double temp2[size];

double temp3[size];

int i,j;

long k;

k = 0;

for(i = 0;i

Results of 1e6 iterations gamma() called with an argument incremented by 0.000001.Microsoft Optimizied compiler

gamma() start value is: 13214329gamma() end value is: 13214688

execution time of gamma() 1e6 iterations is: 359 millisec

gamma 1.999999997453735200000000

Intel Compiler optimized for speed

gamma() start value is: 24295767gamma() end value is: 24296016

execution time of gamma() 1e6 iterations is: 249 millisec

gamma 1.999999997453735200000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\\inline.c.pch" /FAcs /Fa"Release\\" /Fo"Release\\" /Fd"Release\\vc100.pdb" /Gd /TP