Optimization and Execution speed testing of gamma stirling approximation

Hi everybody!

My previous thread on sine function optimization grew very large, so Sergey Kostrov asked me to create new threads for my other function implementations.
Now I'm starting with the gamma Stirling approximation. The code is based on the formula taken from the "Handbook of Mathematical Functions", p. 257, formula 6.1.37.
The function was tested against the book's table and later against Mathematica 8, which is used as a reference model for the accuracy comparison.
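For readers without the book at hand, formula 6.1.37 is the asymptotic expansion written out below (added for reference; the numerators and denominators are the ones that appear in the code):

```latex
\Gamma(x) \sim e^{-x}\, x^{x-\frac{1}{2}}\, \sqrt{2\pi}\,
\left( 1 + \frac{1}{12x} + \frac{1}{288x^2} - \frac{139}{51840x^3}
     - \frac{571}{2488320x^4} + \cdots \right)
```

Note the sign pattern + + - - repeating, which is why the summation in the code below alternates in pairs.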

There are two implementations.
The first is based on 1D arrays, library pow() calls, and division inside a for-loop
to calculate the coefficients and to sum the series expansion. The first function uses 14 terms of the Stirling expansion, the second uses 10 terms. Regarding the slower implementation: the additional terms were not able to increase the accuracy when compared to the Mathematica 8 reference result.
The second method is faster because the filling and summation of the coefficient arrays is eliminated
and pow() library calls are not used. Instead the function relies on a simple calculation of
consecutive powers of the argument x, each multiplied by a constant denominator, with the numerator divided by that product.
Pre-calculation of the coefficients was not possible because the denominators depend on the argument x.
Accuracy when compared to the Mathematica code is 15-16 decimal places.
Feel free to optimize and test this code, and please do not forget to post your opinion.

Code for the C implementation of the Stirling gamma function, slower version. (The for-loop body was cut off by the forum software; the loops below are a reconstruction from the description above: fill the term array with library pow() calls and divisions, then sum the series.)

inline double gamma(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x); // NaN
    }else if( x < one){
        return (x-x)/(x-x); // NaN
    }else{
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        // numerators of the Stirling series terms
        double coef1[] = {1.0,1.0,139.0,571.0,163879.0,
            5246819.0,534703531.0,4483131259.0,
            432261921612371.0, 6232523202521089.0,
            25834629665134204969.0,1579029138854919086429.0,
            746590869962651602203151.0,1511513601028097903631961.0
        };
        // denominators of the Stirling series terms
        double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,
            75246796800.0, 902961561600.0,86684309913600.0,
            514904800886784000.0, 86504006548979712000.0,
            13494625021640835072000.0,9716130015581401251840000.0,
            116593560186976815022080000.0,2798245444487443560529920000.0
        };
        int const size = 14;
        double temp[size];
        double temp2[size];
        int i;
        for(i = 0; i < size; i++){
            temp[i] = pow(x,(double)(i+1));          // x^(i+1)
            temp2[i] = coef1[i]/(temp[i]*coef2[i]);  // term i of the expansion
        }
        result = one;
        for(i = 0; i < size; i++){
            if((i & 2) == 0)                         // signs follow the + + - - pattern
                result += temp2[i];
            else
                result -= temp2[i];
        }
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;                      // sqrt(2*Pi)*x^(x-0.5)*e^(-x)
    }
    return sum*result;
}
Results of 1e6 iterations of gamma() called with an argument incremented by 0.000001. Microsoft optimizing compiler:

gamma() start value is: 13214329
gamma() end value is: 13214688
execution time of gamma() 1e6 iterations is: 359 millisec
gamma 1.999999997453735200000000

Intel Compiler optimized for speed

gamma() start value is: 24295767
gamma() end value is: 24296016
execution time of gamma() 1e6 iterations is: 249 millisec
gamma 1.999999997453735200000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\\inline.c.pch" /FAcs /Fa"Release\\" /Fo"Release\\" /Fd"Release\\vc100.pdb" /Gd /TP


Results for the faster version of the gamma Stirling approximation, which uses 10 Stirling series coefficients. I was not able to use Horner's scheme directly because of an accumulated error when division was implemented inside the polynomial, so my code looks really ugly :)

Code for faster version.

inline double fastgamma(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x); // NaN
    }else if( x < one){
        return (x-x)/(x-x); // NaN
    }else{
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        // numerators of the Stirling series terms
        double nom1 = 1.0,                nom2 = 1.0;
        double nom3 = 139.0,              nom4 = 571.0;
        double nom5 = 163879.0,           nom6 = 5246819.0;
        double nom7 = 534703531.0,        nom8 = 4483131259.0;
        double nom9 = 432261921612371.0,  nom10 = 6232523202521089.0;
        // denominators of the Stirling series terms
        double denom1 = 12.0,             denom2 = 288.0;
        double denom3 = 51840.0,          denom4 = 2488320.0;
        double denom5 = 209018880.0,      denom6 = 75246796800.0;
        double denom7 = 902961561600.0,   denom8 = 86684309913600.0;
        double denom9 = 514904800886784000.0, denom10 = 86504006548979712000.0;
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10;
        double z1,z2,z3,z4,z5,z6,z7,z8,z9,z10;
        // consecutive powers of x, built with multiplications only
        z1 = x;
        z2 = z1*z1;  // x^2
        z3 = z2*z1;  // x^3
        z4 = z3*z1;  // x^4
        z5 = z4*z1;  // x^5
        z6 = z5*z1;  // x^6
        z7 = z6*z1;  // x^7
        z8 = z7*z1;  // x^8
        z9 = z8*z1;  // x^9
        z10 = z9*z1; // x^10
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;  // sqrt(2*Pi)*x^(x-0.5)*e^(-x)
        coef1 = nom1/(z1*denom1);
        coef2 = nom2/(z2*denom2);
        coef3 = nom3/(z3*denom3);
        coef4 = nom4/(z4*denom4);
        coef5 = nom5/(z5*denom5);
        coef6 = nom6/(z6*denom6);
        coef7 = nom7/(z7*denom7);
        coef8 = nom8/(z8*denom8);
        coef9 = nom9/(z9*denom9);
        coef10 = nom10/(z10*denom10);
        result = one+coef1+coef2-coef3-coef4+coef5+coef6-coef7-coef8+coef9+coef10;
    }
    return sum*result;
}

Results of 1e6 iterations of fastgamma() called with an argument starting at 2.0 and incremented by 0.000001.
Microsoft optimizing compiler:

fastgamma() start value is 13214688
fastgamma() end value is 13214875
execution time of fastgamma() 1e6 iterations is: 187 millisec
fastgamma 2.000000016396600500000000

The same test compiled with Intel compiler

fastgamma() start value is 24296016
fastgamma() end value is 24296157
execution time of fastgamma() 1e6 iterations is: 141 millisec
fastgamma 2.000000016396600500000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP

for better speed, a first easy change will be to replace

coefN = nomN/(zN*denomN)

with

coefN = nomN/denomN * rcpzN

nomN/denomN: precomputed constant
rcpzN = (1/x)^N; you can compute 1/x once, then use only multiplications to get (1/x)^2, etc.

it should be a lot faster since you'll avoid most divides

for better accuracy, add the smallest absolute values first; since x>1 it looks like it will be better to do result = coef10+coef9-coef8 ... +one

next steps may be to see how you can rewrite it using Horner's scheme (based on 1/x instead of x) and to get rid of the libc calls like exp() and pow()

Good answer bronxzv :)
After studying the formula I came to the same solution as yours. If I could get rid of the divides I would probably be able to reduce the execution time by ~140-160 cycles. I initially implemented a Horner-like scheme exactly as in the formula, but encountered an error, probably because of wrong operator precedence. The accuracy was less than 2 decimal places.
How can I get rid of the libc calls? In the case of exp() I can "inline" my fastexp(), but will it be faster than the library call?
Can I force the compiler to inline the pow() and exp() calls? Both of them are not hot-spots for this code.

How can I get rid of the libc calls? In the case of exp() I can "inline" my fastexp(), but will it be faster than the library call

It's beyond my expertise, but I suppose the goal will be to find the best polynomial or rational approximations for gamma(x) without using Stirling's formula (Muller advises using Maple for this, but Mathematica should be OK too). It will probably be a series of approximations to cover the whole domain. Now, I may well be wrong since I have zero practical experience with gamma functions.

There is also the Lanczos approximation to the gamma function. With the Stirling approximation I'm able to achieve an accuracy of 15-16 decimal places when compared to the Mathematica 8 implementation. Developing a more accurate approximation is also beyond my expertise; I'm only trying to implement in programming languages various function approximations from the "Handbook of Mathematical Functions".
It is possible to force the compiler to inline library calls.

It is possible to force compiler to inline library calls.

probably, but the best you can hope for is a direct call to the native x87 instructions (it will be slow and amenable neither to vectorized code nor to FMA usage for AVX2 targets). Even if you inline the code for some exp() and pow() approximations, you will basically do a range reduction three times and compute 3 different polynomial/rational approximations, when the best (speed vs accuracy) solution is probably a single range reduction and a single minimax polynomial/rational approximation. Now, if Mathematica does all the work for you, you'll not learn as much as you do by studying different methods, so I suppose one question is what your key goal is

>>probably but the best you can hope is to have a direct call to the native x87 instructions

I think that inlining of library calls could depend on the frequency of these calls. As you can see, fastgamma() calls the math library functions only once, so the compiler did not inline the code; but if the calls were made from within the for-loop, which is a natural hot-spot and 100% branch predicted, a smart compiler could perform inlining to reduce the calling overhead.

>>even if you inline the code for some exp() and pow() approximations you will basically do 3x a range reduction and will compute 3 different polynomial/rational approximations when the best (speed vs accuracy)

Range reduction and input checking in non-periodic functions will also be present during the function call when code execution is transferred to MSVCRT.DLL, which implements the math library.
The best solution is to write the formula in such a way as to eliminate the dependency on library calls, but that will not always be possible.

>>the solution is probably to use a single range reduction and a single minimax polynomial/rational approximation

Yes, that would be the best option in terms of speed.

>>Mathematica do all the work for you you'll not learn as much as you do by studying different methods,

I suppose that Mathematica uses a Lanczos approximation with arbitrary precision.
I could enter a command like N[Gamma[some value], precision value] to obtain the result, but I would like to write the code by studying different methods, relying on Mathematica as a reference model for testing.
It is also possible to write such code directly in Mathematica as a procedural program, but my knowledge of Mathematica programming is basic.

I could enter a command like N[Gamma[some value], precision value] to obtain the result, but I would like to write the code by studying different methods, relying on Mathematica as a reference model for testing.

I was talking about a method to get all the coefficients of the minimax approximation, not a single value. IIRC in Muller's book he says it's something you get after runs of around 5 minutes with Waterloo Maple, so I suppose it's as simple as saying something like "give me the minimax approximation for gamma(x) in the sub-domain [1.0,1e6] with 53-bit precision" and after a while it will return saying "you must use an nth-order approximation and your coefficients are these: a0=xxx, a1=xxx,..."

EDIT: just found this http://www.maplesoft.com/support/help/Maple/view.aspx?path=numapprox/minimax

also, if you don't yet have Muller's book, some background material on minimax approximations and the Remez algorithm can be found in this paper (it lacks a lot of details though):
http://www.research.scea.com/research/pdfs/RGREENfastermath_GDC02.pdf

Hi Iliya,

Thank you. I'll provide feedback soon, including some additional comments for the 'FastSin' function.

A message for the Intel Software Network administration:

What do you think about a new forum "Algorithms", or something like this?

Best regards,
Sergey

I was talking about a method to get all the coefficients of the minimax approximation, not a single value

I misunderstood your post. Mathematica 8 has a minimax approximation package which is suitable for such a calculation; here is the link: http://reference.wolfram.com/mathematica/FunctionApproximations/tutorial/FunctionApproximations.html
(scroll down to see the minimax approximation).
Today I will try to use this approximation on the gamma function.

>>in the sub-domain [1.0,1e6]

For an arbitrary-precision system such as Mathematica the upper range is possible to calculate, but on a fixed-precision system a value like 2.0e2 gives overflow.

>>also if you don't have yet the Muller's book some background material on minimax approximations and the Remez algorithm can be found in this paper (lacka lot of details though)

Thanks for the link. I don't have Muller's book, but I will try to study "A First Course in Numerical Analysis", a good textbook but filled with Fortran-like pseudocode, without a word on how to implement those algorithms in C.

I'll provide feedback soon including some additional comments for the 'FastSin' function

Sergey, could you also try to test the gamma approximation?

give me the minimax approximation for gamma(x) in the sub-domain [1.0,1e6] with 53-bit precision" and after a while it will return saying "you must use a xth order approximation and your coefficients are these: a0=xxx, a1=xxx,..."

Here are the results for MiniMaxApproximations obtained from Mathematica 8 package FunctionApproximation

command entered:
mmagam = MiniMaxApproximation[Gamma[x], {x, {1, 2}, 2, 4}, WorkingPrecision -> 24]
// evaluation on the interval [1,2] by a rational polynomial of degrees 2 and 4, precision set to 24 decimal places
The result of the approximation:

{{1.00000000000000000000000, 1.04467894661900711357003,   <- these points are the places where the error extrema occur
  1.17319919869742825989377, 1.36675484999411230927585,
  1.59085223396159103095876, 1.79954784055008990140185,
  1.94692992215923571877971, 2.00000000000000000000000},
 {(-30.8228716136997999101970 + 3.59003268120785313248828 x - 5.50695236588033950921865 x^2)/
  (1 - 35.5921195782261091560663 x - 5.00150832504309216048443 x^2 + 8.14273016476890957389399 x^3 - 1.28888709928612902004067 x^4),
  -1.97331347687791230838473*10^-7}}   <- this is the max error of the approximation

The rational polynomial:

(-30.8228716136997999101970 + 3.59003268120785313248828 x - 5.50695236588033950921865 x^2) /
(1 - 35.5921195782261091560663 x - 5.00150832504309216048443 x^2 + 8.14273016476890957389399 x^3 - 1.28888709928612902004067 x^4)

give me the minimax approximation for gamma(x) in the sub-domain [1.0,1e6] with 53-bit precision" and after a while it will return saying "you must use a xth order approximation and your coefficients are these: a0=xxx

Second MiniMaxApproximation of Gamma(x), on the range [0.01,1]. The relative error plot was added as an attachment.

MiniMaxApproximation of Gamma[x] on the interval [0.01,1]:

(6.69569585833067770821885*10^6 + 407735.985300921332020398 x + 1.29142492667105836457693*10^6 x^2) /
(1 + 6.69558099277749024219574*10^6 x + 4.27571696102861619139483*10^6 x^2 - 2.89391642413453042503323*10^6 x^3 + 317457.367152592609873458 x^4)

Attachment: project_testing1.jpg (7.34 KB)

fastgamma() was completely rewritten. I pre-calculated the coefficients and implemented Horner's scheme for the polynomial fit of the Stirling approximation.

Here is the code of improved version:

inline double fastgamma2(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x); // NaN
    }else if( x < one){
        return (x-x)/(x-x); // NaN
    }else{
        // pre-calculated ratios nomN/denomN of the Stirling series
        double const coef1 = 0.08333333333333333333333333;
        double const coef2 = 0.00347222222222222222222222;
        double const coef3 = -0.00268132716049382716049383;
        double const coef4 = -0.000229472093621399176954733;
        double const coef5 = 0.000784039221720066627474035;
        double const coef6 = 0.0000697281375836585777429399;
        double const coef7 = -0.000592166437353693882864836;
        double const coef8 = -0.0000517179090826059219337058;
        double const coef9 = 0.000839498720672087279993358;
        double const coef10 = 0.0000720489541602001055908572;
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        double invx = 1/x; // the only divide
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;
        // Horner scheme in 1/x
        result = one+invx*(coef1+invx*(coef2+invx*(coef3+invx*(coef4+invx*(coef5+invx*(coef6+invx*(coef7+invx*(coef8+invx*(coef9+invx*coef10)))))))));
    }
    return sum*result;
}
Here are the results for 1e6 iterations, compiled with the Intel compiler:
fastgamma2() start value is 14962227
fastgamma2() end value is 14962305
execution time of fastgamma2() 1e6 iterations is: 78 millisec
fastgamma2 is: 2.000000016396600100000000

As you can see, this code is 2x faster than the previous version.

As You can see this code is 2x faster than previous version

cool, you're fast, optimizing code is fun isn't it ?

cool, you're fast, optimizing code is fun isn't it ?

I really like to optimize code. Now I will try to optimize 45 functions which are based on iterative methods.
fastgamma2() is 2x faster than the previous version, but it is still dominated by two library calls. By a crude calculation, each of them takes 25-30 ns to execute.
Bronxzv, did you look at the MiniMaxApproximation for Gamma(x) on the interval [0.01,1]? This rational polynomial will introduce some overhead when implemented in fastgamma2().

now I will try to optimize 45 functions which are based on iterative methods

that's quite a lot! to keep your code clean and manageable I'd advise designing a library of simple helper functions such as EvalPoly5() here: http://software.intel.com/en-us/forums/showpost.php?p=186020; the main functions will be a bit shorter and more readable without impact on speed

look at MiniMaxApproximation for Gamma(x) at interval [0.01,1] this rational polynomial will introduce some overhead when implemented in fastgamma2().

I don't know which overhead you are referring to; it's simply two low-order polynomials and a division, so it should be a lot faster than your last example since it's not calling pow() and exp().
the code for the rational approx. will basically be: EvalPoly2(a0,a1,a2,x) / EvalPoly4(b0,b1,b2,b3,b4,x) with a0..a2, b0..b4 constants

I'd advise designing a library of simple helper functions such as EvalPoly5()

This is a good approach to keep the code clean. I will redesign my C++ static library exactly as you advised. I have only 8 months of serious programming experience :)

>>I don't know which overhead you are referring to, it's simply two low order polynomials and a division, it should be a lot faster than your last example since it's not calling pow() and exp()

Sadly, Mathematica 8 was not able to calculate the MiniMaxApproximation for fastgamma(x) on the whole interval, i.e. [0.001,160], because of "division by zero" errors. Knowing that fastgamma(x) is not accurate on the interval [0,1],
I decided to calculate the minimax approximating polynomial only for this interval. So I cannot get rid of the library pow() and exp() calls. I will implement two branching statements on input checking: when x is between 0.01 and 1.0, Mathematica's minimax approximating polynomial will be executed, and above 1.0 the Stirling approximation.
P.S.
Bronxzv, have you ever tried to write your own library of special and elementary functions from the book "Handbook of Mathematical Functions"?

Bronxzv, have you ever tried to write your own library of special and elementary functions from the book "Handbook of Mathematical Functions"?

Hi Ilia,

I have basically written only the functions useful for my 3D rendering needs, some taken directly from Abramowitz & Stegun, but never a full library like the one you are doing

3D rendering use cases are generally easy since you don't need a lot of precision, the key goal is to have approximations that you can vectorize, though

functions that I use the most are sin, cos, arccos, ln, exp

thanks to Muller's book I now have a basic understanding of the subject, beyond my current needs

3D rendering use cases are generally easy since you don't need a lot of precision, the key goal is to have approximations that you can vectorize, though

functions that I use the most are sin, cos, arccos, ln, exp

Yes, 3D rendering is mostly based on the elementary functions. Regarding precision: do you use double-precision floating-point values in your project? Because AFAIK the Windows 7 rendering stack and current display hardware cannot work with more than 10-12 bits per colour component.
How do you calculate more esoteric functions like a BSDF, which could be physically based and involve the calculation of an integral (as stated in the "Physically Based Rendering" book)?

Regarding precision: do you use double-precision floating-point values in your project? Because AFAIK the Windows 7 rendering stack and current display hardware cannot work with more than 10-12 bits per colour component.

mostly single precision, but double precision is needed in some cases; think of the traversal of a deep scene graph for example: in a big model the camera movements will be jerky when close to a small detail with single precision
final LDR images (low dynamic range) aren't a good indicator of the precision required for shading; imagine a bright area behind a dark translucent surface. Thus all intermediate shading computations are with HDR images and only at the end do we do the final tone mapping; this is the standard solution also for modern GPU-based realtime renderers

How do you calculate more esoteric function like "BSDF" which could be physically based and involves calculation of an integral(as stated in "Physically based rendering")book.

rational approximations and LUT-based ones (using look-up tables); AVX2 gather instructions will come in handy to speed up the LUT-based ones

final LDR images (low dynamic range) aren't a good indicator of the precision required for shading; imagine a bright area behind a dark translucent surface. Thus all intermediate shading computations are with HDR images and only at the end do we do the final tone mapping; this is the standard solution also for modern GPU-based realtime renderers

As I understood it, for internal processing you use double precision, which could be converted to single precision for the final result, which will be displayed on 10-12 bit RGBA hardware.

As I understood it, for internal processing you use double precision, which could be converted to single precision for the final result, which will be displayed on 10-12 bit RGBA hardware.

no, all shading computations are in single precision (good enough quality and twice the throughput of double precision; also note that 16 bits per component is already considered HDR); only some geometry transforms are in double precision. I'd say that for the renderer it's 99% float (SP) and 1% double (DP) overall

the tone mapping operator is also float to float; only the very final stage (gamma conversion + decimation) is float to int

So the duty, or perhaps the role, of the GPU in your project KRIBI 3D is reduced to copying, with the help of DirectX, the rendering context to the device's frame buffer and displaying it.

exactly, there is a cool picture of this pure software rendering approach at slide 62 of this seminal presentation:
http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf

It was a great read, a very insightful presentation. So I think that in the near future the pure software approach will win over the hybrid software-hardware approach, which means "death" to Nvidia.
The arrival of an architecture such as MIC, which is more able to operate as a typical CPU and will probably be used as a large co-processor to offload vector-intensive floating-point calculations from the main CPU, plus Nvidia's inability to enter the x86 market and be a threat to Intel or even AMD, probably means that Nvidia will disappear like 3dfx.
What is your opinion on this?

IIRC the nVIDIA exascale roadmap has very ambitious designs with ARM cores and massive SIMD FP arrays, so it's a bit early to count them out; also, I was among the many people predicting the unavoidable demise of Apple in the 90s, see how wrong we were...

provided that our engine is 100% based on high-level C++ code, there is no reason to keep it only x86-based, depending on future market realities

Testing and implementation of fastgamma3(), which is based on the MiniMaxApproximation for 0.01 <= x <= 1 and the Stirling approximation above 1.
Here is code:

inline double fastgamma3(double x){
    double result,sum,num,denom;
    result = 0;
    sum = 0;
    if(x >= 0.01 && x <= one){
        // MiniMaxApproximation calculated by Mathematica 8
        double const coef1 = 6.69569585833067770821885e+6;
        double const coef2 = 407735.985300921332020398;
        double const coef3 = 1.29142492667105836457693e+6;
        double const coef4 = 1.0;
        double const coef5 = 6.69558099277749024219574e+6;
        double const coef6 = 4.27571696102861619139483e+6;
        double const coef7 = -2.89391642413453042503323e+6;
        double const coef8 = 317457.367152592609873458;
        num = coef1+x*(coef2+x*coef3);
        denom = coef4+x*(coef5+x*(coef6+x*(coef7+x*coef8)));
        return num/denom;
    }else if( x >= one && x <= gamma_huge){
        // pre-calculated Stirling series coefficients, as in fastgamma2()
        double const coef_1 = 0.08333333333333333333333333;
        double const coef_2 = 0.00347222222222222222222222;
        double const coef_3 = -0.00268132716049382716049383;
        double const coef_4 = -0.000229472093621399176954733;
        double const coef_5 = 0.000784039221720066627474035;
        double const coef_6 = 0.0000697281375836585777429399;
        double const coef_7 = -0.000592166437353693882864836;
        double const coef_8 = -0.0000517179090826059219337058;
        double const coef_9 = 0.000839498720672087279993358;
        double const coef_10 = 0.0000720489541602001055908572;
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        double invx = 1/x;
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;
        result = one+invx*(coef_1+invx*(coef_2+invx*(coef_3+invx*(coef_4+invx*(coef_5+invx*(coef_6+invx*(coef_7+invx*(coef_8+invx*(coef_9+invx*coef_10)))))))));
    }else{
        return (x-x)/(x-x); // NaN: x outside [0.01, gamma_huge]
    }
    return sum*result;
}
Speed of execution for the first branch (MiniMaxApproximation), 1e6 iterations:
fastgamma3() start value is 25488363
fastgamma3() end value is 25488379
execution time of fastgamma3() 1e6 iterations is: 16 millisec
fastgamma3() is: 1.489191725597434100000000

In the range 0.01 <= x <= 1, speed of execution has been improved by 4.5x when compared to my first implementation based on iterative calculations.

nVIDIA roadmap for hexascalehas very ambitious designs with ARM cores

It also depends on the support from Microsoft.
