tools for understanding large programs, and optimization

I'm in the middle of trying to optimize a large F90 program and was wondering if anyone knows of tools that can help with the following:

1) Showing where computed values are used, or where values come from; i.e., I want to see the data-flow analysis, especially across procedure and file boundaries.

2) Something that can give me some indication of how to restructure loops to optimize the code. I was looking at some code whose innermost loops are:


sumrd = 0.0
DO k = 1,nkin
   DO np = 1,nreactmin(k)
      DO i2 = 1,ncomp+nexchange
         sumrd(i2) = sumrd(i2) + &
                     mumin(np,k,i)*jac_rmin(i2,np,k)
      END DO
   END DO
END DO

I thought the loop nest was in a bad order, so I moved the i2 loop to just outside the k loop, and it absolutely killed performance: the whole program ran 30% slower, and this loop nest went from less than 5% to 30% of the time. What did the compiler do, and how can I use that information to help speed up other parts of the code?
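
For reference, the reordering I tried looked roughly like this (reconstructed from memory, with i2 hoisted outside the k loop):

sumrd = 0.0
DO i2 = 1,ncomp+nexchange
   DO k = 1,nkin
      DO np = 1,nreactmin(k)
         sumrd(i2) = sumrd(i2) + &
                     mumin(np,k,i)*jac_rmin(i2,np,k)
      END DO
   END DO
END DO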

I have 800 lines of code that I need to speed up. There are very few IF statements; it's all just crank and grind over a big grid. It all fits in L2 cache without a problem, but I'm getting only 10% of peak performance and I'd expect better.

I'm using the 9.0 compiler on a Linux/Itanium machine. I use

 
-O3 -fno-alias
-mP2OPT_hlo_loadpair=F
-mP2OPT_hlo_prefetch=F
-mP2OPT_hlo_loop_unroll_factor=2
-mP3OPT_ecg_mm_fp_ld_latency=8 -i4

as this is some magic I found on the net.

Thanks


On your question 2: remember that Fortran arrays are stored in column-major order, the reverse of C and other Algol-based languages, so the loop as written should be quite efficient. Although -O3 gives the compiler permission to interchange loops, it appears it may not have done so here; the -opt_report will tell you whether it did. For Itanium, swapping the np and i2 loops might improve performance, particularly as you have turned off the load-pair optimization, which is likely to be important for the loop as written.
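
Concretely, the interchange I have in mind (only np and i2 swapped, k still outermost) would look roughly like this:

DO k = 1,nkin
   DO i2 = 1,ncomp+nexchange
      DO np = 1,nreactmin(k)
         sumrd(i2) = sumrd(i2) + &
                     mumin(np,k,i)*jac_rmin(i2,np,k)
      END DO
   END DO
END DO
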
With the loop nest order changed as you did, the software prefetch option, which you turned off, would likely have become more important.
The 9.0 compiler is much better than 8.1 at choosing good loop-unroll parameters, so the unroll-factor option is also one you would not normally use unless you can show a specific benefit.
-i4 is the default (it sets Fortran default integers to 32 bits).
-fno-alias is not much more aggressive than standard Fortran rules. If it makes any difference to the code you have quoted, IMO that would be a bug.
A pitfall of these -mP[23]OPT options is that you need to devote some time to understanding them and how their implications change between compiler versions, and to testing whether they are actually beneficial in your case.

I tried compiling with just -O3 -i4 and it ran probably 20% slower. You said swapping the np and i2 loops might speed things up, but I tried that when I tried moving it outside the k loop, and I got the same 30% slowdown. I can try playing with the flags, but the question was how do I speed up _other_ parts of the code. I accidentally found what works in this case, but I want to know how to figure out the rest of the 370 lines of code that are the heart of this program. I have 370 lines of code with 5 IF conditions, over 50 DO loops, and everything fits in L2 cache, yet I can't get more than 10% of peak performance. Is that reasonable for this compiler and processor? I vectorized an LU solver to do multiple solves at a time and got close to 25% of peak.

Is there something like VAST I can use? Is there a way to explicitly control different optimization strategies loop by loop, such as load-pair, prefetch, and unrolling?

In certain applications, it is necessary to split the source down to an individual file for each function, and set compiler flags file by file.
There are directives/pragmas, documented in the .pdf files in the compiler docs directory, for controlling prefetch and unrolling at the loop level.
If you want to deal with performance at this level, you will need to examine the opt_report, which can give you the information you need about unrolling, scheduling, and versioning of the loops.
On the loop you show, load-pair should help significantly. If it hurts instead, that may be an indication that something has gone wrong; for example, you could check the opt_report to see whether the compiler has optimized all the loop versions it made for load-pair.
If it is a question of the compiler mis-guessing the favored loop trip count, there is a directive for that, or Profile Guided Optimization can collect and use the data automatically. Since you appear to be thorough, you would want to check loop by loop to see if PGO does useful things.
We have worked with VAST, and some of the optimizations identified there have already been incorporated in the compiler. If you can afford the expenditure, you may find it valuable.
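
As a rough illustration of the loop-level directives mentioned above (the exact directive names and spellings are from memory and should be checked against the documentation for your compiler version), control over prefetch, unrolling, and the expected trip count looks something like this:

subroutine accum(n, total, b, c)
   ! Hypothetical example: verify directive spellings against your compiler docs.
   integer, intent(in) :: n
   real(8), intent(inout) :: total
   real(8), intent(in) :: b(n), c(n)
   integer :: j
   ! Suppress software prefetch of b and c, request an unroll factor of 2,
   ! and hint at the favored trip count for the loop that follows.
!DIR$ NOPREFETCH b, c
!DIR$ UNROLL(2)
!DIR$ LOOP COUNT (800)
   do j = 1, n
      total = total + b(j)*c(j)
   enddo
end subroutine accum

These directives apply only to the DO loop that immediately follows, so you can tune each hot loop individually without changing the flags for the whole file.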

This chunk of code can run for days at a time on 2000 processors, so speeding it up is worth it if there's a reasonable approach. I'll look at the opt_report and the docs directory.
I'm sure I'll have more questions.

Thanks

Hi,

You must use -x? (where ? is the processor symbol) and probably the -O3 switch to persuade the compiler to do loop interchange automatically. It is a little strange to me that processor-independent transformations depend on the processor switch used, but that is Intel's way of optimization.

Regards,
Alberti

Do you have a URL for the documentation you mentioned (the one that explains the opt-report)? There isn't anything at our site. I'm mostly looking for a description of the SWP report.

It's in the "Optimizing Applications" manual provided in the on-disk documentation. You can also find it here.

Steve - Intel Developer Support

Thanks. I read the SWP report info and I still have some questions. If the report says it unrolled a loop by 2 and the Scheduled II value is 8, does that mean it's doing 2 of the original iterations every 8 cycles (after the pipeline fills)? I wrote a small program with:


do j=1,800
   sum = sum + b(j)*c(j)
enddo

and the SWP report said it unrolled the loop by 2, and I got:


Swp report for loop at line 8 in f_ in file t.F90

Loop at line 9: unrolled loadpair-ver-1

Resource II = 2
Recurrence II = 8
Minimum II = 8
Scheduled II = 8

Estimated GCS II = 11

Percent of Resource II needed by arithmetic ops=50%
Percent of Resource II needed by memory ops =100%
Percent of Resource II needed by floatpt ops = 50%

Number of stages in the software pipeline = 2

I changed the loop to


do j=1,800,2
   sum = sum + b(j)*c(j) + b(j+1)*c(j+1)
enddo

and the report said the loop was not unrolled:


Resource II = 2
Recurrence II = 2
Minimum II = 2
Scheduled II = 2

Estimated GCS II = 15

Percent of Resource II needed by arithmetic ops=50%
Percent of Resource II needed by memory ops = 100%
Percent of Resource II needed by floatpt ops = 100%

Number of stages in the software pipeline = 11

Does this mean that once the pipe is full it's doing 1 iteration every 2 cycles? If so, what do the Percent of Resource II values mean? It would be doing 2 adds and 2 multiplies every 2 cycles, which would use only half the floating-point resources. Thanks.
