3th degree equation optimisation

3th degree equation optimisation

Dear All,

How can we rewriting the follow code to improve speed ?

I tried all the compiler option and found what was the optimum but still, I would like to have it quicker.

TCARRE=T*T      
TCUBE=T*TCARRE
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

 

Thanks for your help,

Vincent

16 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Read, among others, about the Horner/ Qin Jiushao method:

     ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

P.S.: Corrected in response to TimP below: he pointed out that I had the indices of C in ascending order, whereas the O.P. has them in descending order.

Or modified Horner,

ENTH=(C(1)*T+C(2))*TCARRE+(C(3)*T+C(4))

Try with /assume:protect_parens

Vincent,

Are you saying that what you wrote is better than as mecej4 implied ?

ENTH = ((C(1)*T + C(2))*T + C(3))*T + C(4)

This version has fewer multiplies, but I am wondering if your version has better register retention ?

John

>>...This version has fewer multiplies, but I am wondering if your version has better register retention?..

John,

Even if there are less multiplications in that equation you don't take into account that at a lowest level there are Latencies, Throughputs ( some calculations could be executed in parallel ) for instructions and Pipelines in a CPU. So, for a really small equation with a very limited number of degrees I wouldn't worry. Is that really critical for a 3rd degree polynomial?

If you think that you're right go practical and prove that there is even a small difference.

I hope that you've heard that expression from Mr. Spock in Star Trek movie: Any difference that makes no difference, is no difference...

Sergey,

I have always used the approach of mecej4, on the basis of minimal operations.
I was only asking Vincent if his question was based on experience that showed his approach worked better.
Tim's organisation of the equation is interesting, if the compiler can utilise this.

It is difficult to see the practical effect, as if it is being performed many times, then other vectorisation arrangements outside this code snippet could be more relevant.

John

 At assembly level the code (taylor sine approx. by Horner scheme) could look like this:

                    movups xmm1,veclib1//load of coefficients
                    mulps xmm0,xmm1
                    addps xmm7,xmm0 //Xmm7 accumulates the result
                    movups xmm1,veclib2
                    mulps xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib3
                    mulps  xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib4
                    mulps  xmm0,xmm1
                    addps xmm7,xmm0
                    movups xmm1,veclib5

For few terms I do not think if one less or one more multiplication does really matter.

Thank you all, I replaced my code by the code suggested by mecej4 and got 10% speed increase.

 

Another (but similar) question about solving a 2sd degree equation, is there a way to rewrite this code so that it works faster ?

 

<fortran>

     ABC_Error = 0

       IF (A.EQ.0.)  THEN

          IF (B.EQ.0.)  THEN

             IF (C.EQ.0.)  THEN

                ABC_Error = -2

             ELSE

                ABC_Error = -3

             END IF

          ELSE

             X1 = -C/B

             X2 = X1

             ABC_Error = 1

          END IF

       ELSE

          D = B*B-4.*A*C

          IF (D.LT.0.)  THEN

               ABC_Error = -1

          ELSE IF (D.EQ.0.)

  THEN

             X1 = -B/(2.*A)

             X2 = X1

          ELSE

             XA = (-B-SIGN(SQRT(D),B))/(2.*A)

             XB = C/(A*XA)

             X1 = MAX(XA,XB)

             X2 = MIN(XA,XB)

          END IF

       END IF

 </fortran>

>>...It is difficult to see the practical effect, as if it is being performed many times...

I disagree with that because I tested many polynomials of higher degree during last two years.

One cycle of latency per term does not make really any difference.Such a code can be also executed in parallel by two different Ports if there is no interdependencies involved.

Latencies obviously depend on factors which haven't been defined here.  It may be easiest to guess latencies if you compile with /QxHost for Haswell.  In that case, the sequence mecej4 suggested (making allowance for apparent reversal of the array) would execute 3 fma operations in sequence.  The best which could be done, using 2 more registers, would be to reduce this to 2 fma operations in sequence, along with another 2 fp operations which could be done in parallel.

The form posted first would probably take the latency of at least 3 sequential additions (12 cycles?) more than the one posed by mecej4, while also maximizing number of registers tied up.

The code sequence above was not optimized for newest architecture and was compiled on Centrino processor.

The real concern is the load of coefficients in the form of SoA from memory.Btw on Merom processor addps will execute in 3 cycles(by Agner).

>>>s there a way to rewrite this code so that it works faster>>>

At least you can try to get rid of division operation and replace it with multiplication by inverted value(of course where it is applicable).

I think a major impediment to us providing useful answers is you have expressed your requirements in too narrow of scope.

TCARRE=T*T       
TCUBE=T*TCARRE 
ENTH=C(1)*TCUBE+C(2)*TCARRE+C(3)*T+C(4)

or

ENTH = ((C(4)*T + C(3))*T + C(2))*T + C(1)

As recommended.

I suspect that your application has (the equivilent of) arrays of these 4-element arrays. The organization of your data will greatly affect the advice you receive. Example:

real(8) :: yourModel(:,:) ! allocated to yourModel(4,nThingies)

And you encapsulate your expression into:

real(8) function foo(C,T)
real(8) :: C(4), T
foo = ((C(4)*T + C(3))*T + C(2))*T + C(1)
end function foo

And you call the funciton
DO I=1, nThingies
ENTH(I) = foo(yourModel(:,I), T)
END DO

Then note that T is invariatant for the duration of the loop

In this case, then consider:

real(8) function foo(C,T)
real(8) :: C(4), T(4)
real(8) :: temp(4)
temp = C * T
foo = temp(1) + temp(2) + temp(3) + temp(4)
end function foo

real(8) :: Tvec(4)
...
Tvec(1) = 1.0_8
Tvec(2) = Tvec(1)*T
Tvec(3) = Tvec(2)*T
Tvec(4) = Tvec(3)*T
DO I=1,nThingies
ENTH(I) = foo(yourModel(:,I), Tvec)
END DO

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Very interesting and valid points.

On your second option for consideration by Vincent, can you please comment on the relative merits and demerits of using DOT_PRODUCT intrinsic in the FOO function, say as follows?

PURE FUNCTION foo(C,T) RESULT(ENTH)
 
   REAL(8), INTENT(IN) :: C(:)
   REAL(8), INTENT(IN) :: T(:)
 
   !.. Function result
   REAL(KIND(C)) :: ENTH
 
   ENTH = DOT_PRODUCT(C, T)
 
END FUNCTION foo
 

FortranFan has it right, my code was presented to illustrate how you should re-think the problem.

Jim Dempsey

www.quickthreadprogramming.com

Leave a Comment

Please sign in to add a comment. Not a member? Join today