Keeping data in cache

Hi,

I have a program that takes a week or so to run on an i7 or E5-2690 (single-threaded), and I'm looking forward to moving it to a MIC/Phi board (multi-threaded).  About 15% of the time is spent evaluating a single polynomial that uses four 122-element PARAMETER arrays as the sources for its coefficients.  I would like to try to lock those real coefficients into cache.  Is there a way to do that from Fortran?  Is there a way to see whether they are already being kept in cache (there is a fair amount of work between calls to that function, so I doubt it)?

Thanks for any ideas here.


PARAMETER constants are an unlikely source of cache misses, but you can check with profiling tools such as Intel VTune. Code cache misses would be equally likely.
Intel(R) Xeon Phi(TM) increases the importance of vectorization as well as threading for full performance, so the list of questions usually ignored about the performance of polynomial evaluation grows longer.

I used VTune to identify this equation as the hottest hotspot in a couple thousand lines of code. I suppose I could, if I knew how, lock the equation in cache as well. They would certainly fit in the Phi or E5-2690 L2 caches, and it would be reasonable to try the L1 caches. I understand the importance of vectorization, which is difficult in this code, and we are addressing that as well.
The question remains: is there a way I can lock the data (and code) in cache from a Fortran program?

H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H

or any other way to speed this up?
thanks

The statements shown access four different arrays. Thus, the memory accesses have rather large stride. Have you considered declaring Hx(5,122), and writing, for example,

H=(((Hx(5,i)*a+Hx(4,i))*a+Hx(3,i))*a+Hx(2,i))*a+Hx(1,i)?
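A minimal, self-checking sketch of that layout change (the coefficient values below are illustrative, not Bruce's actual tables):

```fortran
program layout_demo
    implicit none
    integer, parameter :: n = 122
    real :: Hx(5,n), a, h_horner, h_naive
    integer :: i, k
    ! fill with illustrative coefficients (not the real H0..H4 tables)
    do i = 1, n
        do k = 1, 5
            Hx(k,i) = 0.01*k + 0.001*i
        end do
    end do
    a = 0.05
    i = 37
    ! Horner form over the contiguous column Hx(:,i): all five
    ! coefficients sit in adjacent memory, typically one cache line
    h_horner = (((Hx(5,i)*a + Hx(4,i))*a + Hx(3,i))*a + Hx(2,i))*a + Hx(1,i)
    ! naive power form, as a cross-check
    h_naive = Hx(1,i) + a*Hx(2,i) + a**2*Hx(3,i) + a**3*Hx(4,i) + a**4*Hx(5,i)
    if (abs(h_horner - h_naive) > 1.0e-6) stop 1
    print *, 'Horner over contiguous column:', h_horner
end program layout_demo
```

The point is data locality: with four separate arrays, one evaluation touches four widely separated cache lines; with Hx(5,122), it usually touches one.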

Hi mecej4:
Good idea, we'll test it. I had discarded that idea on the basis that double indices were slow, but had not considered the stride issue. Still, it would be much faster if it didn't have to go out to memory for the array. Anyone? Locking in cache?

mecej4: I should have mentioned that I appreciate the Boltzmann gravestone as the Saha-Boltzmann equation is a central part of what I'm doing.
TimP: could you elaborate on "PARAMETER constants are an unlikely subject for cache miss activity"?

As your code is limited by the latencies of multiplication and addition, and you don't appear to be making an effort to vectorize, there would be plenty of time to resolve any cache misses on the coefficient arrays. Even with no cache misses, you are spending 52 cycles (on Core i7-2) on multiply and add instructions per result.
You could achieve more instruction level parallelism by
H = H0(i) + a*H1(i) + a*a*(H2(i) + a*H3(i) + a*a*H4(i))
but you would still need to get both vector and threaded parallelism to approach available CPU performance, where now you are well below 1% of peak floating point performance.
If you were limited by cache misses, you would expect VTune to show most of the time spent on the first multiply and none on the adds, accompanied by a high level of last level cache misses.
If you vectorize, you have effectively locked each coefficient in register at least across the parallel operands. So, if vectorization has no other appeal to you, you can at least consider it as improving data locality as you suggested.
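The rearrangement TimP shows splits the serial Horner dependency chain into independent sub-chains (an Estrin-style split) that the pipeline can overlap. A self-checking sketch, with made-up coefficient values standing in for H0(i)..H4(i):

```fortran
program estrin_demo
    implicit none
    real :: a, a2, c0, c1, c2, c3, c4, h_estrin, h_horner
    ! illustrative coefficients, standing in for H0(i)..H4(i)
    c0 = 1.0; c1 = -1.1; c2 = 0.9; c3 = -0.7; c4 = 0.5
    a = 0.05
    a2 = a*a                        ! computed once, reused
    ! two sub-chains that the CPU can execute in parallel
    h_estrin = (c0 + a*c1) + a2*((c2 + a*c3) + a2*c4)
    ! serial Horner chain: each step waits on the previous one
    h_horner = c0 + a*(c1 + a*(c2 + a*(c3 + a*c4)))
    if (abs(h_estrin - h_horner) > 1.0e-6) stop 1
    print *, 'Estrin and Horner agree:', h_estrin
end program estrin_demo
```

Both forms compute the same degree-4 polynomial; the Estrin form trades one extra multiply for a shorter critical path.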

If "i" is varying more than "a", you could create a power vector of the form "forall (i=0:4) aa(i) = a**i"
or: aa(0) = 1 ; do i = 1,4 ; aa(i) = a*aa(i-1) ; end do
then, using mecej4's approach,
H=(((Hx(5,i)*a+Hx(4,i))*a+Hx(3,i))*a+Hx(2,i))*a+Hx(1,i)
becomes
H = dot_product (hx(:,i), aa)
This might be more suited to vectorisation; i.e., simplify what is not in the inner loop.
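That combination (power vector built once, then dot_product over a contiguous coefficient column) might look like the following sketch; the Hx layout and values are illustrative:

```fortran
program dotprod_demo
    implicit none
    integer, parameter :: n = 122
    real :: Hx(5,n), aa(5), a, h_dot, h_horner
    integer :: i, k
    ! illustrative coefficients (not the real tables)
    do i = 1, n
        do k = 1, 5
            Hx(k,i) = 0.01*k + 0.001*i
        end do
    end do
    a = 0.05
    ! power vector: aa(k) = a**(k-1), built outside the hot path
    aa(1) = 1.0
    do k = 2, 5
        aa(k) = a*aa(k-1)
    end do
    i = 37
    ! dot_product maps naturally onto vector multiply-add hardware
    h_dot = dot_product(Hx(:,i), aa)
    h_horner = (((Hx(5,i)*a + Hx(4,i))*a + Hx(3,i))*a + Hx(2,i))*a + Hx(1,i)
    if (abs(h_dot - h_horner) > 1.0e-6) stop 1
    print *, 'dot_product matches Horner:', h_dot
end program dotprod_demo
```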

TimP:
Thanks for the elaboration. I had originally written it simply: H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i), but found, with a non-Intel compiler on an earlier chip, that the multi-line version seemed to be more efficient. In the old days, that form would help the compiler optimize register use. One of the major reasons we recently switched to the Intel compiler was VTune. Actually, the first line, H=a*H4(i), does take longer than any of the rest, but I was/am puzzled by the fact that it was not hugely more time consuming. I assumed this meant that all the coefficients were being pulled from memory. It seems that one lesson here is to reconsider some optimizations when changing compilers and CPUs.

Threaded parallelism is probably not much of an option as we are multi-threading at the program level (the outputs of each program image can be added). Also, only one of these is calculated at a time (calculate H, do a lot of calculations to determine a new a, calculate a new H) so I think that the only target for vectorization of this polynomial is one instance of this equation.

I have not figured out how to show L2 or L1 data cache misses yet. I'll start breaking the code apart to tease out some better diagnostics.

Hi John,
This function is not in an inner loop and i and a are unpredictable. The routine is called by a parent routine that does much calculation to come up with an a and a (precursor to) i. I'll be trying all the suggested forms as small improvements in this evaluation will have a noticeable effect on my run times.
thanks

Bruce,
The combination of the DO loop in a, then dot_product, might improve the caching. It can be difficult to predict.
Also, based on what I have understood from your posts, it appears to me that the searching for the updated values of a and i might be the more time consuming part. A smarter way of finding a better estimate of "a" can be an elusive goal. Would using a history of "a" help, say using the last 2 estimates of "a", as a 2-point interpolation or 3-point extrapolation, although this would require storing 2 "a" fields?

Hi John,
H is actually the calculation of the Hjerting function, used to calculate the Voigt function. The values that go into the calculation are usually independent of their history. About 15% of the time they are not uncorrelated, which turns out to be about the cost in saving the resulting H and testing from the previous call to the function H. Not saving the previous value, which is a push, lets me make the H function elemental. It has to be calculated each time.

"I had originally written it simply: H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i),"
In principle, various re-arrangements between this and the Horner form (which minimizes the number of multiplications) are possible within the limits of the Fortran (but not C) rules. C rules would permit this to be optimized down to 7 multiplications. I haven't seen a Fortran compiler which could optimize polynomial evaluation beyond what C rules permit.
Where the series has terms of magnitude decreasing with order, the Horner evaluation should be the most accurate.

If the evaluation of H is at all time consuming, your VTune profile could be showing a delay in availability of H for the first usage. That would further swamp the possibility of re-fetching the coefficients as a bottleneck. It would also indicate that vectorization would not help unless this H evaluation can be vectorized.

Bruce,

I haven't tested the code (I do not have all the particulars), but an outline of my suggestion is shown in the following code:


subroutine compute(n,a,H,H0,H1,H2,H3,H4)
    integer :: n
    real(8) :: a
    real(8) :: H(n),H0(n),H1(n),H2(n),H3(n),H4(n)
    real(8) :: Awide(4),Hwide(4)
    integer :: i
    Awide = a
    Awide(2:4) = Awide(2:4) * a
    Awide(3:4) = Awide(3:4) * a
    Awide(4) = Awide(4) * a
    do i=1,n
        Hwide(1) = H1(i)
        Hwide(2) = H2(i)
        Hwide(3) = H3(i)
        Hwide(4) = H4(i)
        Hwide = Hwide * Awide
        H(i) = H0(i) + (Hwide(1) + Hwide(2) + Hwide(3) + Hwide(4))
    end do
end subroutine compute

Where your actual implementation might use module arrays for H,H0,...

Jim Dempsey

www.quickthreadprogramming.com

I appreciate all the responses to my cache question. This function costs me about 25 cpu hours/week/thread! Again, there is no significant hysteresis in u and a. Since folks have asked questions about other features of the function and it is very short (now), I reproduce here the full working code:

real elemental function H(u,a) !elemental
intent (in) u,a
real::a,u
real, parameter :: H0(121)= (/1.,.990050,.960789,.913931,.852144, &
.778801,.697676,.612626,.527292,.444858, &
.367879,.298197,.236928,.184520,.140858, &
.105399,.077305,.055576,.039164,.027052, &
.0183156,.0121552,.0079071,.0050418,.0031511, &
.0019305,.0011592,.0006823,.0003937,.0002226, &
.0001234,.0000671,.0000357,.0000186,.0000095, &
.0000048,.0000024,.0000011,.0000005,.0000002,.0000001, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)
real, parameter :: H1(121)= (/-1.12838,-1.10596,-1.04048,-.93703,-.80346, &
-0.64945,-0.48552,-0.32192,-.16772,-.03012, &
.08594,.17789,.24537,.28981,.31394, &
.32130,.31573,.30094,.28027,.25648, &
.231726,.207528,.184882,.164341,.146128, &
.130236,.116515,.104739,.094653,.086005, &
.078565,.072129,.066526,.061615,.057281, &
.053430,.049988,.046894,.044098,.041561,.039250, &
.037222,.035195,.033478,.031762,.030293,.028824, &
.027556,.026288,.025184,.024081,.023113,.022146, &
.021293,.020441,.019685,.018929,.018255,.017582, &
.016978,.016375,.015833,.015291,.014801,.014312, &
.013869,.013426,.013023,.012620,.012253,.011886, &
.01155025,.0112145,.01090675,.0105990,.0103161,.0100332, &
.00977255,.0095119,.00927125,.0090306, &
.0088079,.0085852,.0083787,.0081722,.00798035,.0077885, &
.00760995,.0074314,.00726495,.0070985, &
.0069430,.0067875,.0066421,.0064967,.0063605,.0062243, &
.00609655,.0059688,.00584875,.0057287, &
.00561585,.0055030,.00539665,.0052903,.00519005,.0050898, &
.0049952,.0049006,.00481115,.0047217, &
.00463715,.0045526,.0044725,.0043924,.00431645,.0042405, &
.00416845,.0040964,.00402795,.0039595/)
real, parameter :: H2(121)= (/1.,.9702,.8839,.7494,.5795, &
.3894,.1953,.0123,-.1476,-.2758, &
-.3679,-.4234,-.4454,-.4392,-.4113, &
-.3689,-.3185,-.2657,-.2146,-.1683, &
-.12821,-.09505,-.06863,-.04830,-.03315, &
-.02220,-.01451,-.00927,-.00578,-.00352, &
-.00210,-.00122,-.00070,-.00039,-.00021, &
-.00011,-.00006,-.00003,-.00001,-.00001,.00000, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)
real, parameter :: H3(121) = (/-.752,-.722,-.637,-.505,-.342, &
-.165,.007,.159,.280,.362, &
.405,.411,.386,.339,.280, &
.215,.153,.097,.051,.015, &
-.0101,-.0265,-.0355,-.0391,-.0389, &
-.0363,-.0325,-.0282,-.0239,-.0201, &
-.0167,-.0138,-.0115,-.0096,-.0080, &
-.0068,-.0058,-.0050,-.0043,-.0037,-.0033, &
-.00293,-.00257,-.00231,-.00205,-.00186,-.00166, &
-.00151,-.00137,-.00125,-.00113,-.00104,-.00095, &
-.00087,-.00080,-.00074,-.00068,-.00063,-.00059, &
-.00055,-.00051,-.00047,-.00044,-.00041,-.00038, &
-.00036,-.00034,-.00032,-.00030,-.00028,-.00026, &
-.000245,-.00023,-.00022,-.00021,-.00020, &
-.00019,-.00018,-.00017,-.00016,-.00015,-.00014, &
-.00013,-.000125,-.00012,-.000115,-.00011,-.000105, &
-.00010,-.000095,-.00009,-.000085,-.00008,-.00008, &
-.00008,-.000075,-.00007,-.00007,-.00007,-.000065, &
-.00006,-.00006,-.00006,-.000055,-.00005,-.00005, &
-.00005,-.000045,-.00004,-.00004,-.00004,-.00004, &
-.00004,-.000035,-.00003,-.00003,-.00003,-.00003, &
-.00003,-.00003,-.00003/)
real, parameter :: H4(121)= (/.50,.48,.40,.30,.17, &
.03,-.09,-.20,-.27,-.30,-.31,-.28,-.24,-.18,-.12, &
-.07,-.02,.02,.04,.05, &
.058,.056,.051,.043,.035,.027,.020,.015,.010,.007, &
.005,.003,.002,.001,.001,.000,.000,.000,.000,.000,.000, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)

i= (10* abs(u)) +1

if(i.gt.121) then
!cc outside of table use far wing formula p229 Gray.
u2=u*u
u4=u2*u2
Hsub1=0.56419/u2 + 0.846/u4
Hsub3=-0.56/u4
H=a*(Hsub1+a*a*Hsub3)
return
endif

!cc H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i)
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H

return
end

There have been some good ideas about re-structuring the expression, which I will test in the coming week. HOWEVER, my first question (TimP?) remains: how can I lock the parameters into cache? If it is reasonably possible, it seems worth trying. I have not yet found in VTune how to see whether there are L1 or L2 cache misses. There seem to be a modest number of LLC misses, but it should be useful to move up the cache hierarchy.

thanks to all.

An alternative might be:
i = (10* abs(u)) + 1
!
if (i > 121) then
!cc outside of table use far wing formula p229 Gray.
u2 = u*u
u4 = u2*u2
Hsub1 = 0.56419/u2 + 0.846/u4
Hsub3 = -0.56/u4
!
H = a*(Hsub1+a*a*Hsub3)
!
else if (i > 41) then
!
! H = a*a* H3(i)
! H = a * (H1(i) + H)
H = a*(H1(i) + a*a*H3(i))
else
!
!cc H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i)
! H = a* H4(i)
! H = a*(H3(i) + H)
! H = a*(H2(i) + H)
! H = a*(H1(i) + H)
! H = H0(i) + H
!
H = H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) )))

endif

return
end
.
I would expect this to perform better if a significant proportion of i values is > 41.
I tried to test this, but found your use of the following might do better:
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
.
It might be that H is retained in a register and improves performance ?
Test both and see.
These results, which defy old-style optimisation logic, make it difficult to understand how to do better !!
.
John

Bruce,
.
I wondered if the following (change of order) might work better:
H = ((( H4(i)*a + H3(i))*a + H2(i))*a + H1(i))*a + H0(i)
.
Certainly the performance difference between this and your 5-line expression is a puzzle.
.
I am attaching my test example, which shows, using Ifort Ver 12.1, that all attempts I have made to improve performance fail !!
Even the attempt to minimise multiplications for i > 41 fails !!
.
I note that the accuracy of the Hi tables appears to be poor. Could these be calculated at the first call, or is accuracy not that important ?
.
H_dev was my best guess at an improved coding and it does not get there.
The only explanation I can give is that your 5-line expression is more suited to vectorising. Otherwise, how can fewer multiplications produce a worse result ?
Including the test "if (i > 41)" to reduce the multiplication count makes things worse ? Is the code size causing the problem ?
Does anyone else have an explanation for this result ?
.
John

Attachment: halt.f90 (10 KB)

If I have not made a silly mistake, this is a good example of how frustrating it can be to optimise code, when you don't know what existing optimisation the compiler and the processor are providing.
The simple approach assumes that a reduced operations count or fewer instructions should improve performance.
I can make excuses that I don't understand the way the processor manages the cache, as being the reason for this simple approach not working, but at the end of the day, I don't understand this result !
( I have tried to write the testing code to trick the compiler out of skipping redundant calls, which can be a problem, so hopefully this is not the reason.)
(Testing for i>41 looks to be a good option, although I don't know the probability that i > 41, to make the test worthwhile.)
.
Does anyone have a good explanation why testing for i > 41, to reduce the multiply count, does not work in my test example ?
.
John

This article
http://chadaustin.me/2009/02/latency-vs-throughput/

contains a brief mention of dependency chains, for those who haven't become used to the concepts of instruction pipelining, throughput, and latency, over the last 3 decades.
A fuller description of the topics might be found in Agner Fog's articles.

Tim,
.
Thank you for providing advice on the concepts of instruction pipelining, throughput, and latency, over the last 3 decades.
My question relates to the impact of changes I made to the H function.
Basically, I changed the H function code from:
.
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
return
.
to:
.
if (i > 41) then
H_alt = a*a* H3(i)
H_alt = a * (H1(i) + H_alt)
else
H_alt = a* H4(i)
H_alt = a*(H3(i)+H_alt)
H_alt = a*(H2(i)+H_alt)
H_alt = a*(H1(i)+H_alt)
H_alt = H0(i)+H_alt
end if
return
.
where i > 41 occurs 66% of the time in my test program.
.
A traditional analysis would suggest that identifying i>41 should provide a significant improvement, especially in my test example, where the value of i is randomly selected between 1:120.
The run times I get with ifort Ver 12.1 are:
H 1.60600000000000 Original multi-line formula
H_new 1.68500000000000 use of dot product for vectorisation (not much worse!)
H_alt 2.32500000000000 limit multiply if i>41
H_dev 2.37100000000000 limit multiply if i>41 and single line formula
H_xx 2.37100000000000 limit multiply if i>41 and use multi-line formula
.
This basically shows that function H_alt runs significantly slower than function H, even though there are significantly fewer multiplications in H_alt. The introduction of the "if (i>41)" test slows the function down.
The article you referred to identifies ways of removing dependency chain latency, although function H does not appear to have solved that problem, as the main calculation sequentially updates the value of H.
.
I am still wondering why the function H_alt did not succeed in improving performance?
.
Perhaps I should try:
H=a* H3(i)
H=a* H
H=a*(H1(i)+H)
.
John

Postscript:
I can not see how either:
- the introduction of the "if (i > 41)" test can change the dependency-chain latency characteristic of the main (i<=41) code, or
- how the code for i>41 runs slower.
.
There is, however, a definite change in the efficiency of the modified code, as the slower performance of H_alt occurs with multiple compilers, so something is improving the performance of the original H calculation.
.
There must be something in the way the processor groups the instructions which the if test is upsetting.
Could Bruce's idea be correct, where jumping to i>41 drops H0, H2 and H4 from the cache ?
Without the code for i>41, the code for H is more compact. Could this be an explanation ?
If anyone can explain this change I'd appreciate their input.

Hi,
The way I broke up the lines originally reflected what I had learned about FORTRAN compilers in the 60s and 70s. It is nice to see that that approach still seems to work well -- thanks John. Agner Fog's manuals cover C++ and assembly; C++ compilers are, I gather, substantially different from Fortran compilers, and writing assembly, it would seem, is what one does when the compiler can't do its job well. I think that my original coding reflects chadaustin's advice to some degree (H and a should be kept in registers, the parameters are only pulled once, and there are no intermediate calculations); I'll have to look at it more carefully to test the optimization.

BUT, Tim, returning again to my original question about locking the parameter data in cache: I take it that either I can't do it or you don't know how to do it. But thanks for the other insights.

Bruce,
.
I am not aware of how you could lock your parameter arrays in the cache to improve performance. I would not think it possible or a good idea, as optimisation of the cache is best kept away from us programmers.
.
I have tested two ideas for improving your routine:
1) was to generate an a power vector and store the Hi values in a vector to utilise the vector instructions. I called this H_new in the attached file. I tested this and it runs slightly slower than your H routine; so no help.
.
2) was to identify that for i > 41, only H1 and H3 are non zero, so I wrote a change of the form:
.
else if (i > 41) then
!
H = a*(H1(i) + a*a*H3(i))
else
!
H = H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) )))
end if
.
Unfortunately this change results in significantly slower performance, which I did not expect.
.
I did change your H function by:
-introducing a MODULE to store your parameter constants H0...H4, and
-using a 1-line statement for the general case.
.
Both "tidy" up the code (personal opinion) but do nothing to improve performance.
I also noted that your parameter array values are not very accurate, especially H4 and H3, which might imply a << 1
.
Compiling with /O2 will introduce vector instructions, which is the best option for improving a single thread.
.
If you want to improve performance, then you might want to look at other parts of the code.
While I don't understand your solution method, if the values of "a" and "u" are being iteratively estimated, then a trending algorithm could help. By identifying whether they are monotonically increasing or oscillating, this might help in predicting a better estimate from past values. Would using a history of "a" help, say using the last 2 estimates of "a", as a 2-point interpolation or 3-point extrapolation, although this would require storing 2 "a" fields?
.
If this assumption of convergence is relevant, then stepping between values of i might also cause problems. Could using the value of u to interpolate between H0(i) and H0(i+1) provide a smoother value that converges more quickly ?
r= (10*abs(u)) + 1
i = r
j = i+1
fj = (r-i) ! factor to apply to H0(i+1)
fi = 1 - fj ! factor to apply to H0(i)
!
H = a * (fi*H4(i)+fj*H4(i+1))
H = a * (fi*H3(i)+fj*H3(i+1) + H)
H = a * (fi*H2(i)+fj*H2(i+1) + H)
H = a * (fi*H1(i)+fj*H1(i+1) + H)
H = fi*H0(i)+fj*H0(i+1) + H
!
! or
H = ( H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) ))) ) * fi + &
( H0(j)+ a* (H1(j)+ a* (H2(j)+ a* (H3(j)+ a* H4(j) ))) ) * fj
.
possibly another idea that could work
.
Sorry I could not be of more help. (still puzzled as to why i>41 fails!).
.
John

Bruce,

Your function (H) is declared as elemental. Implying your intended use is:
SomeWhere(i) = H(u,a)
Or more likely:
SomeWhere(i) = H(u(i),a)

A performance problem as I see it is the conditional branching in your elemental function H making it non-friendly to vectorization.
An optimization technique to explore is to create (perhaps a once-only thing) a pick list of regions of u(i) that qualify for each branch in your H functions. Then:

! run thru the .LE. 121 sections of U
DO I=1,maxPick
if(PickLE121low(i) .eq. 0) exit
Somewhere(PickLE121low(i):PickLE121high(i)) = HLE121(u(PickLE121low(i):PickLE121high(i)), a)
END DO

! run through the .GT. sections of U
DO I=1,maxPick
if(PickGT121low(i) .eq. 0) exit
Somewhere(PickGT121low(i):PickGT121high(i)) = HGT121(u(PickGT121low(i):PickGT121high(i)), a)
END DO

The idea being you have clusters of adjacent (vectorizable) like entries in u.

Jim Dempsey

www.quickthreadprogramming.com

Bruce,
Your function H uses a lookup table for U and a quartic polynomial for a.
An alternative might be to also use a lookup table for a, then use linear interpolation from the table.
You could also use a bi-linear interpolation for a and u, although I do not know if u=>i is a continuous or step function.
Depending on the range of values for a, a lookup table could be a good alternative.
Transforming a to another value then using this can also help, such as b = log(a) or (exp, tan or sqrt) might transform to a table of b that better suits uniform spacing of the b table.
This way you can reduce the run time, and size the spacing of values in the table to achieve acceptable accuracy.
I would expect that a linear interpolation from this table would be faster than your quartic polynomial, while the generation of the table values could be an initialisation stage or a set of data statements.
John

Hi,
The lookup table(s) for a is about 45 MB and about to become much larger; u is calculated from quantum thermodynamic conditions under some otherwise unpredictable conditions and is continuous. The tables for the a values are pre-calculated by another program for a specific set of conditions. I don't think that there are any interpolation schemes that will be profitable.
Bruce

Bruce,
I looked at the polynomials for H0-H4 and the values of H for various values of "a" and "u"(i). Depending on the range of a, I would expect that you could achieve similar accuracy with a look-up table of values much smaller than the 45 MB you are using to calculate a.
You could populate the a/u interpolation table using the existing function for H, although it would be worth checking the accuracy of your H0-H4, especially if a can be large.
Your dividing u into steps of 0.1 also appears to be a significant compromise for the H estimate, in comparison to the accuracy provided to "a" in the formula for H. It looks to me that there is much better accuracy for H in relation to "a" than you achieve for "u". (Perhaps du could be less than 0.1.)
Depending on the range of "a" values, you might be able to reduce the table size if you use a table for b = atan(a*factor), so that b is in the range -pi/2 to pi/2.
If you can reduce the look-up table spacing (du and da) to use linear interpolation, and provide sufficient accuracy, then I think there would be some run-time gains.
Hopefully some uninformed ideas that might provide a new look at your solution.
John
.
I roughed up a lookup table approach for u and a; for u in the range 0 to 12 and a in the range 0 to 1.2 (using du = 0.1 and da = 0.01)
I also changed the test program to report time as processor ticks per function call.
From testing, the Function H takes about 16 processor ticks, (which is fairly quick)
With H_table being a lookup for u and a, it takes about 6 ticks
If H_table does a linear interpolation for a (see below) it takes about 8.5 ticks, which is half the reported ticks for H
(I am attaching the latest version of the test program, which tries a few different ideas.)
.
This shows there is potential for this approach, depending on the range of u and a.
You could use H_table lookup if a and u are within range, else use H.
.
real function H_yy (u,a) !elemental
    use H_tables
    real, intent (in) :: u,a
    !
    ! table lookup for u, linear interpolation for a
    !
    integer iu,ia
    real*4 :: du = 10 ! steps per U value ( U range 0:12.0 )
    real*4 :: da = 100 ! steps per A value ( A range 0:1.2 )
    logical :: first = .true.
    real*4 H, ut, at, fa
    external H
    !
    if (first) then
        do ia = 1,120
            at = (ia-1)/da
            do iu = 1,121
                ut = (iu-1)/du
                ua_table(iu,ia) = H(ut,at)
            end do
        end do
        first = .false.
    end if
    !
    iu = (abs(u)*du)+1
    !
    fa = (abs(a)*da)+1
    ia = fa
    fa = fa-ia
    !
    h_yy = ua_table(iu,ia)*(1.0-fa) + ua_table(iu,ia+1)*fa
    !
    return
end function H_yy

Attachment: halt.f90 (16.32 KB)

Thanks for the ideas. I guess I didn't explain the source of 'a' very completely... they are values from QM calculations for spectral lines. Each of thousands of lines has a series of these depending on the conditions at different altitudes in the atmosphere of a star. They are non-linear and typically range from 1e-3 to 1e-1. This doesn't keep me from bringing a & u into a 2-D table lookup as opposed to four 1-D lookups. I'll look into your suggestion. Thanks.

Bruce,
.
I produced a surface chart for H as a fn of a and u. For what I anticipate is the range of a and u, H shows greater variation wrt u, than a, which makes your 121 step look-up table for u appear to be a relatively high source of inaccuracy. On this basis, having a look-up for a and an interpolation for u could provide a more accurate estimate of H on average. (The example I posted was a look-up for u and an interpolation for a).
Based on the performance target, doing a linear interpolation for both a and u from a 2D table would probably be similar/slower than your existing H function, so the better solution might be a look-up for a and a linear interpolation for u.
You might also want to review the accuracy of the H function, for populating the 2D table values and to use it when a or u are outside the range of the 2D table.
.
You could also speed up the interpolation formula by having first-difference tables:
h_yy = ua_table(iu,ia)*(1.0-fa) + ua_table(iu,ia+1)*fa ! original interpolation in a
rearranging becomes
= ua_table(iu,ia) + fa*(ua_table(iu,ia+1)-ua_table(iu,ia)) ! only 1 multiplication
providing a first-difference table removes a difference calculation
= ua_table(iu,ia) + fa*da_table(iu,ia) ! where the difference table "da_table(iu,ia)" can be pre-calculated
.
To improve accuracy, and reduce the u spacing, two difference tables might help to improve performance.
( fa = (abs(a)*da)+1 ; ia = fa ; fa = fa-ia )
( fu = (abs(u)*du)+1 ; iu = fu ; fu = fu-iu )
h = ua_table(iu,ia) + fa*da_table(iu,ia) + fu*du_table(iu,ia) ! is a possibility for 2 way interpolation with 3 pre-calculated tables
.
This would probably save on storage and improve performance !
The table spacing du and da could be refined, depending on the accuracy required, although a finer spacing only affects the size of the table and not the run time.
.
John
.
I did some tests with the following code sample:
.
real function H_yy (u,a) !elemental
    use H_tables
    real, intent (in) :: u,a
    !
    ! bi-linear interpolation from a table
    ! assumes a and u are in range
    !
    integer iu,ia
    real*4 :: du = 5 ! steps per U value ( U range 0:12.0 )
    real*4 :: da = 500 ! steps per A value ( A range 0:0.12 )
    logical :: first = .true.
    real*4 H, ut, at, fa, fu
    external H
    !
    if (first) then
        do ia = 1,60
            at = (ia-1)/da
            do iu = 1,61
                ut = (iu-1)/du
                ua_table(iu,ia) = H(ut,at)
            end do
        end do
        do ia = 1,60-1
            do iu = 1,61-1
                da_table(iu,ia) = ua_table(iu,ia+1) - ua_table(iu,ia)
                du_table(iu,ia) = ua_table(iu+1,ia) - ua_table(iu,ia)
            end do
        end do
        first = .false.
    end if
    !
    fu = (abs(u)*du)+1
    iu = fu
    fu = fu - iu
    !
    fa = (abs(a)*da)+1
    ia = fa
    fa = fa - ia
    !
    h_yy = ua_table(iu,ia) &
         + da_table(iu,ia)*fa &
         + du_table(iu,ia)*fu
    !
    return
end function H_yy
.
Based on modifying yesterday's test program :
The H function uses about 16.5 clock cycles
H_yy using bi-linear interpolation (above) uses 12.5 clock cycles
H_yy using linear interpolation in a or u and lookup in the other (omitting 1 difference table) uses 10 clock cycles
H_yy using lookup in a and u uses 6.2 clock cycles (yesterday's example)
.
If we knew why the "if(i>41)" did not work, then possibly we could improve it some more, as H_yy needs a test for ia or iu out of range !
.
I've found this an interesting problem and hopefully this might give you some options.

Attachment: h-func.xlsx (133.02 KB)

Hi,

I hope to examine the range of a next week. This is an interesting idea and I'll be able to explore it then.

thanks

Bruce,

I hope you had some success with your problem.

I have been addressing a similar problem, where I have a 2-dimensional interpolation, similar to the requirement of your function H(u,a). I have resisted the use of a look-up table of values, as the results from this approach produce a jagged plot of performance. I have persisted with the bi-linear interpolation, producing a smoother system response. This is easier than trying to explain the reason for rough chart results.

To improve the run-time performance, I have found an outer loop (do i = 1,7), which is independent of my H function, and shifted it to be an inner loop, and so reducing the number of times I have to call the interpolation functions by a factor of 7. This produces a vector of results, rather than a single value at the H loop level. There is a penalty to pay, as more information needs to be stored to support the new loop order, but there has been a significant performance improvement.

John

Hi John,

I've been off working on other enhancements to the code and just got back to looking at current hotspots.  It has gotten much faster so now H takes up 10% of my compute time.  I'm going to have a summer college intern start next week and the first thing I'm going to have him do is to compare the tabular-generated values to the computation of the exact (enough) value.

I'm not sure I completely understand your comments.  I can only use one H at a time in the calling code.  Are you saying that you generate several of your Hs at a time, or that you get several iterations of the interpolation at a time?  The former doesn't help me and the latter seems hard to imagine.  I'm starting to wonder about a large table with very simple interpolation, as the function is quite well behaved.  I did a plot of it about the time this thread started and it seemed pretty smooth, with very modest second derivatives.

After I see how the comparisons the student generates come out, I am going to set him to work on trying your approach or something like it.  Since, for the entire program, I am getting huge differences between execution times in release and debug mode, I guess I will have to forgo VTune and do old-fashioned timing.

--Bruce

John,

I haven't seen your code. What are your thoughts on making the inner loop 1,8 (adding a dummy result) and working the code so that the inner loop is fully vectorized?

Jim Dempsey

www.quickthreadprogramming.com

@bruce,
The loop "do i = 1,7" is independent of the arguments of function H, so I was able to move this "i" loop to the inside, but now have to store the results for the 7 values of i. This cost only about 10 MB in extra storage, although there are more arrays to keep track of.

@jim,
I have not fully optimised the code. There are 5 interpolation functions; 4 are linear and one is bi-linear. There were significant savings in shifting all of these outside the inner loop. This is a "working" code, so I don't have a lot of time to do fine optimisation. The first time I ran this latest approach, it took over 3 hours. I have now reduced it to 15 minutes with a number of changes. Prior to the need for this latest approach, the run took about 5 minutes, so I thought I needed to identify something. This will probably not be the final version in an evolving simulation where, as it adapts, I need to keep the run times under control. I will probably do more if this approach remains in use.

John

Good catch on reorganizing your loop order (3hr to 15min). There may be some other changes that get you the additional distance back down to ~5min (that won't break working code).

Jim Dempsey

www.quickthreadprogramming.com

Just trying to follow along. Not sure which loop (do i = 1,7) is being discussed. Is it the loop outside the call to H in freep1, which is in the line

alphadL= 0.014974* (fn/dnudop) * H(u,a) * dL * fudge

I have an old version of the code -- has H been converted to a bilinear interpolation function? Does stride become an issue with this approach? Stride length depends upon which index is incremented sequentially.
