Efficient cube root calculation.

Efficient cube root calculation.

I am trying to write a cube root routine that is many times faster than the susual C expression "pow(a,1.0/3.0)".Recently Ihave received outstanding help from the Intel Fortran Compiler Forum, and one of the routines that were suggested in that Forum was:

http://www.worldserver.com/turk/opensource/CubeRoot.c.txt.

The idea is to write a low level C routine, based on the rational approximation that is nearly as efficient as the square root routine that is implemented in hardware. Square root uses 38 clock cycles on a Pentium 4 CPU, and a software cube root routine using something like 70-80 clock cycles is the goal. In the Fortran Forum it was recommended to contact the C forum, due to the many low level C experts in this forum.

In the routine in the link above I have made the following observations.

1) As the numerator and denominator in the rational approximation both can be written (((a* fr + b) * fr +c)*fr+d)*fr + e

14 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

(message was sent before it was finished)

In http://www.worldserver.com/turk/opensource/CubeRoot.c.txt I have made the following observations.

1) As the numerator and denominator in the rational approximation both can be written (((a* fr + b) * fr +c)*fr+d)*fr + e, one could calcualte these simulataneously in the upper and lower part of an SSE2 register.

2) It should be possible to do some of the calculations in parallell, as x87 and SSE2 instruction can be executed simultaneously in the Pentium 4.

Here is a draft, the "|" denotes that it we hope that the instructions on both sides of the "|" will be calculated simultaneously due the out of order execution core in the Pentium 4.

I would very much appreciate if you would help me to write this routine in the proper way to make the routine as efficient as possible. The idea is to use SSE or SSE2 and the program should run on a Intel Pantium 4 CPU using the Intel C++ compiler.

Best regards,

Lars Petter Endresen

float CubeRoot(float x)
{
float fr, r;
int ex, shx;

/* Store constants in SSE2 registers and argument reduction at the same time */
fr = frexp(x, &ex);| xmm1 = [45.2548339756803022511987494,14.80884093219134573786480845];
shx = ex % 3;| xmm2 = [192.2798368355061050458134625, 151.9714051044435648658557668];
if (shx > 0)| xmm3= [119.1654824285581628956914143, 168.5254414101568283957668343];
shx -= 3; | xmm4 =[13.43250139086239872172837314, 33.9905941350215598754191872];
fr = ldexp(fr, shx);| xmm5 = [0.1636161226585754240958355063, 1.0];

xmm7 = xmm0 = fr;

/* Use quadric rational polynomial with error < 2^(-24) and x87 division at the same time */
| mulpd xmm0, xmm1
|addpd xmm0, xmm2
| mulpd xmm0, xmm7
| addpd xmm0, xmm3
ex = (ex - shx) / 3; | mulpd xmm0, xmm7
| addpd xmm0, xmm4
| mulpd xmm0, xmm7
| addpd xmm0, xmm5

/*Extract numerator and denumerator from SSE2 register, and do division */
shufpd xmm0, xmm6, /*what is the syntax here ? */
divsd xmm0, xmm6
fr = xmm0;

r = ldexp(fr, ex);

return;
}

But why don't you usecbrt functions from Intel Compiler?
On C language they are:
float cbrtf(float); // about 60 clock cycles
double cbrt(double);

But why don't you usecbrt functions from Intel Compiler?
On C language they are:
float cbrtf(float); // about 60 clock cycles
double cbrt(double);

(Sorry for late answer -- I'm new user here:)

But why don't you use cbrt functions from Intel Compiler (like sqrt)?

In C interface they look like:

float cbrtf (float); // about 60 clock cycles

double cbrt (double); // about 82 clock cycles

long double cbrtl (long double); // about 274 clock cycles

All have accuracy 0.55 ulp (practically all bits in result are correct).

-----------

And if you use "these" cbrt and cbrtf in a loop and turned Vectorizer on (see compiler's command line options), then you can gain better performance:

cbrt: ??? cycles

cbrtf: 40 cycles (if I am right)

(These functions will have accuracy of about 2 incorrect bits.)

-----------

And if you compute cbrt() of each element of some big array,

then you can gain even better performance using Intel libraries IPP or MKL.

MKL cbrt vector functions:

VDCBRT, VML_HA-version: 41 cycles (double-precision, almost all bits are correct)

VDCBRT, VML_LA-version: 32 cycles (double-precision, 2 bits are incorrect)

VSCBRT, VML_HA-version: 22 cycles (single-precision, almost all bits are correct)

VSCBRT, VML_LA-version: 17 cycles (single-precision, 2 bits are incorrect)

MKL fortran example of using VML_HA-version of double-precision vector cbrt:

call VMLSETMODE(VML_HA)

call VSCBRT(n,a,r)

Here a -- array of input arguments (single-precision),

r -- array of results (single-precision),

n -- length of these arrays.

Comment: you can call VMLSETMODE function only once (or can not to call it at all, because VML_HA is default mode).

----------

IPP cbrt vector functions (VMLMODE is not needed):

ippsCbrt_64f_A53(a,r,n): 41 cycles (double-precision, almost all bits are correct)

ippsCbrt_64f_A50(a,r,n): 32 cycles (double-precision, 2 bits are incorrect)

ippsCbrt_32f_A24(a,r,n): 22 cycles (single-precision, almost all bits are correct)

ippsCbrt_32f_A21(a,r,n): 17 cycles (single-precision, 2 bits are incorrect)

ippsCbrt_32f_A11(a,r,n): 12 cycles (single-precision, 11 bits are correct or 12 incorrect)

(This is C-interface. Sorry, I don't know exactly, but maybe in fortran names of these functions should be converted to upper case or lower case. I can reveal it if you need.)

(All data are approximate.)

Thanks a lot for the help. I have received some comments directly from Intel also, regarding the cbrt function in C++. Here is the FORTRAN intertface to that function. This interface works for any calling convention, STDCALL, CDECL andFASTCALL. We have found that this function is at least four times faster than writing X**0.3333333333333333 in FORTRAN.

Question: Would this interface result in an inlined cbrt in FORTRAN? We are using Intel Visual Fortran 8.0!


DOUBLE PRECISION FUNCTION CBRTC(X)

DOUBLE PRECISION X

INTERFACE

DOUBLE PRECISION FUNCTION CBRT(Y)

DOUBLE PRECISION Y

!DEC$ ATTRIBUTES C, ALIAS:'_cbrt' :: CBRT

END FUNCTION CBRT

END INTERFACE

CBRTC = CBRT(%VAL(X))

END

I don't know will this function be inlined or not, but I think that inlining would not give much performance gain.

I can suggest one way to get higher performance using Vectorizer (it was mentioned in message above). Vectorizer is part of compiler. Vectorized CBRT should be about 40 cycles (2 incorrect bits in significand). To use it on Pentium 4 you should:
1) Use CBRT in a loop and use successive elements of arrays as arguments and results. For example in C-interface it looks like:
for ( i=0; i<10; i=i+1 )
R[i] = CBRT( A[i] )
here R and A are arrays containing double-precision elements.
2) add a compiler's command line option /QxW (on Windows) or -xW (on Linux).
During compilation you should see a message: ": LOOP WAS VECTORIZED."

Thanks a lot for this very useful help.

Yes, actually we are going to do two cbrt calculations, in the follwing manner.

y = cbrt(beta) -cbrt(1-beta);

This could be reqritten:

A[0] = beta;

A[1] = 1-beta;

for ( i=0; i<1; i=i+1 )
R[i] = CBRT( A[i] );

y = R[0] - R[1];

This would be 80 clock cycles then?

Best regards from Lars Petter Endresen

In this case, the compiler would not auto-vectorize. If there is an SSE2 paired cbrt() function in the svml, you could call it explicitly.

It really seems that the loop will not be vectorized because itis too small... Cbrt function exists in svml lib, but maybeit is not good ideato call it directly from there (this is Compiler'sinternal library)...

The entire function that contains the expression "cbrt(beta)-cbrt(1-beta)" is very simple:

double __stdcall FUN(double *beta)

{

double e = *beta;

double f = 1.0 - *beta;

double d = (cbrt(e) + 1 - e * .12614422239872725 - cbrt(f)) *

1.6765391932197435 - ((f * f + e * e) * 4. + 1) * .005 * f *

e * (1 - e * 2.);

return(d);

}

Maybe this function with be more efficient without using SSE2 at all, but instead use only standard X87 FPU instruction, since it takes time to convert the argument to cbrt() from SSE2 to X87 format (the input to cbrt is X87 I think)?

Best wishes

Lars Petter Endresen

Hello!

I learned that cbrt function will be vectorized in next update of Intel C and Fortran compilers 8.1.

Instead calling cbrt function in Fortran, you should use expression like a(i)**(1.0d0/3.0d0), because there is no cbrt function in Fortran standard.

Below is an example of using vectorized cbrt in Fortran on Pentium 4.

Command line to compile example:
> ifort cbrtf.f /QxW /FAs
Here:
cbrtf.f is source file (see below).
/QxW option enables vectorization
/FAs option is needed if you want to check if loop with "r(i) = a(i)**(1.0d0/3.0d0)" was vectorized correctly. In correct case you should find string "call _vmldCbrt2" in assembly file cbrtf.asm

The file cbrtf.f:
function FUN(beta)
real*8 beta
real*8 a(2),r(2),d,d1,d2
a(1) = beta
a(2) = 1.0d0 - beta
do i=1,2
r(i) = a(i)**(1.0d0/3.0d0)
enddo
d1 = r(1) + 1.0d0 - beta * 0.12614422239872725d0 - r(2)
d1 = d1 * 1.6765391932197435d0
d2 = ( a(2) * a(2) + beta * beta) * 4.0d0 + 1.0d0
d2 = d2 * 0.005d0 * a(2) * beta * (1.0d0 - beta * 2.0d0)
d = d1 - d2
FUN = d
end

program myprogram
real*8 x
x = FUN(1.0d0)
print *,x
end

OK. Thanks a lot for the superuseful information.

:-)

When do you expect 8.1 to be around?

Unfortunately I don't know.

Leave a Comment

Please sign in to add a comment. Not a member? Join today