Degraded Run Time

In going from one version of my Fortran program to another, I've somehow increased the run time from 24:10 (24 minutes) to 35:34.  Looking over the code changes, there are a few more algebraic expressions, changed FORMAT statements, some extra assignments, a few DATA statements - nothing we haven't done before.  This only happens in release mode:

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" GenAlternateCodePaths="codeForAVX" IntegerKIND="integerKIND8" RealKIND="realKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true"/>

 

In debug mode, the two code versions have pretty much identical run times.  I tried to put major subroutines from the new version into the old version, one at a time, to figure out which one is causing the slow-down; apparently none of them are.  I tried to run it through a trial version of VTune, with flags like this:

 

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" GenAlternateCodePaths="codeForAVX" WarnInterfaces="true" IntegerKIND="integerKIND8" RealKIND="realKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true" RuntimeChecks="rtChecksAll" BoundsCheck="true" RuntimeLibrary="rtMultiThreadedDebug" Interfaces="true"/>

 

But I got identical run times between the two code versions again.  With 15 MB of source code, I'm worried that I may be hitting some kind of compiler limit - one of my configurations caused this error:

 

    catastrophic error: Internal compiler error: internal abort. Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.

 

(I did report it btw)

 

Any ideas on the best way to figure out why my run time suddenly jumped by 46%?  By the way, I tried the same experiment on OpenVMS, and there the difference in run times in release mode is only 28%.  My test machine: Windows 8, Xeon E5-2687W, Intel(R) Visual Fortran Compiler XE 12.1.4.325 [Intel(R) 64].


Difficult to say without code (I know, your code may be too big or too private to post), but some ideas to consider:
(a) An increase in the creation of temporary arrays in your "few more algebraic statements".
(b) Increased paging - are you accessing matrices in the correct (most efficient) subscript order? (See the sketch below.)
(c) Increased I/O.
(d) Antivirus software interference.
(e) Networking traffic.
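
For illustration of point (b), a minimal sketch with a hypothetical array ZA: Fortran stores arrays in column-major order, so the leftmost subscript should vary fastest in the innermost loop.

REAL ZA(1000,1000)
INTEGER I,J
DO J=1,1000       ! outer loop over columns
  DO I=1,1000     ! inner loop walks down one column: unit stride
    ZA(I,J)=0.E0
  END DO
END DO
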
Les

A couple of other things I noticed.

First - the VTune run was with non-optimized (i.e., debug-enabled) code. You said there was little to no difference between the two program versions with non-optimized code, so finding no difference in VTune is probably expected.

Second - I find it interesting that you're seeing a degradation with your OpenVMS implementation as well. The Intel Fortran compiler and the OpenVMS Fortran compiler have the same heritage, but they diverged about 10 years ago and have virtually nothing in common anymore, especially related to optimizations.

Therefore -- I think you should look at some of Les's suggestions in the previous reply. Are you accessing matrices differently now? Are there more temporary arrays? I *believe* that /check:arg_temp_created was supported in the OpenVMS version (it has been 10 years and all) so you may be able to use that to find any new temps.
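
For illustration, a minimal sketch (invented names) of the kind of call that /check:arg_temp_created reports - passing a strided array section where contiguous storage is expected forces a copy-in/copy-out temporary:

PROGRAM TEMPDEMO
REAL ZM(100,100)
ZM = 0.0
CALL SUB(ZM(1,:))   ! row section is strided: the compiler builds a contiguous temp
CALL SUB(ZM(:,1))   ! column section is contiguous: no temporary needed
CONTAINS
SUBROUTINE SUB(ZV)
REAL ZV(100)        ! explicit-shape dummy expects contiguous storage
ZV(1) = 1.0
END SUBROUTINE
END PROGRAM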

I hope this helps --

--Lorri

Thanks everyone for the great suggestions!

The algebraic statements are of the kind:

IF (ZSGWRTAP(2,IL).GT.TINY) THEN
  ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL))
  FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL))
ENDIF

One of the tests I tried was to compile the "old" code with the "new" common blocks (where I store all the arrays), and this didn't change the run time. We haven't added any big arrays - all in all, we've added about 150 REAL variables.

It's very possible that a small change can de-stabilize some large algorithm, but if that's the case, I'm having trouble finding it.

In terms of accessing the matrices in the most efficient subscript order: mostly we do, sometimes we don't. But we haven't added any big matrix operations in this round.

The old code reads and writes about 205 MB of data, and the new one 193 MB.

It's all done on a local machine, just one user logged on, and using the "Console" build to minimize any other effects on anything else. I've repeated this result enough times that I doubt anti-virus or any such interference is causing the difference.

Maybe I'm not compiling the code correctly for VTune. When I try to run my Release code with VTune, I don't get any stack traces. That's why I started with the Debug version and added "Maximize Speed" and "Generate AVX Code." Is there a better way?

You don't show what TINY is. While intrinsic names aren't reserved, this usage seems dangerous, particularly if you don't use IMPLICIT NONE. TINY(ZSGWRTAP(2,IL)) might make sense.

Sorry...we use statements like these very frequently:

REAL LARGE,TINY,SMALL
DATA LARGE/1.E6/,TINY/1.E-6/,SMALL/1.E-3/

Typical design intent is to ignore small physical quantities (e.g. less than 1 mm or 1 g) rather than small numerical quantities. We have a "check for uninitialized variables" build; that should catch the type of error you're referring to, right?

Try commenting out just the:

IF (ZSGWRTAP(2,IL).GT.TINY) THEN
  ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL))
  FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL))
ENDIF

statements. Leave the remainder of the code in place.

This particular test inside a loop may interfere with vectorization. Should commenting it out restore the performance, you may need to look at a vector-friendly conditional statement (see the sketch below).
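
For example, a minimal branch-free sketch using MERGE (assuming IL runs from 1 to some hypothetical N; both arms are computed every iteration, and the MAX(TINY,...) guard keeps the division safe):

DO IL=1,N
  ZWSGWRL(IL)=MERGE(MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(1,IL)),ZWSGWRL(IL),ZSGWRTAP(2,IL).GT.TINY)
  FZWSGWRL(IL)=MERGE(ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(2,IL)-ZSGWRTAP(1,IL)),FZWSGWRL(IL),ZSGWRTAP(2,IL).GT.TINY)
END DO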

Also, if IL is your loop control variable, then consider swapping indices on ZSGWRTAP so that it becomes

IF (ZSGWRTAP(IL,2).GT.TINY) THEN
  ZWSGWRL(IL)=MAX(0.E0,ZWDC2SG(IL)-ZSGWRTAP(IL,1))
  FZWSGWRL(IL)=ZWSGWRL(IL)/MAX(TINY,ZSGWRTAP(IL,2)-ZSGWRTAP(IL,1))
ENDIF

Jim Dempsey

www.quickthreadprogramming.com

Thanks! I tried commenting that out, and I tried swapping out various pieces of code. Now this gets even more interesting: I ran the "old" code vs. the "new" code, varying 32-bit vs. 64-bit platforms, and whether I set RealKIND="realKIND8" (DX = double precision) or not (SX = single precision). It's only the double-precision builds that take longer to run...

Build   Precision  Platform  Run time
Old     SX         32-bit    29:43
Old     DX         32-bit    34:37
Old     SX         64-bit    20:41
Old     DX         64-bit    24:09
New     SX         32-bit    30:13
New     DX         32-bit    48:06
New     SX         64-bit    20:15
New     DX         64-bit    33:55

How does that happen?

I just realized this... the "old build" used ifort 12.0.4.196, and the "new build" used ifort 12.1.4.325. So I tried:
Old Build DX 64 (new compiler) 36:03
New Build DX 64 (old compiler) 26:09
It's not just the clock time that's increasing; it's also the number of timesteps taken to converge (mainly in the subroutine that produced the "catastrophic error" - see Premier issue 630226).
What changed in 12.1.4.325 to make this worse? Has this been improved in the 2013 release? Should I wait for "Update 1"?

In any code where you have convergence, the number of iterations is highly dependent upon the precision. The algorithm determines how the iteration count to converge (or fail) varies with precision; higher precision does not necessarily dictate faster or slower convergence.
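
This is not your algorithm, just a toy illustration of the point: the same fixed-point iteration (x = cos(x)) run in REAL(4) and REAL(8), each to a precision-appropriate tolerance, takes a different number of iterations to converge.

PROGRAM CONVDEMO
IMPLICIT NONE
REAL(4) :: X4
REAL(8) :: X8
INTEGER :: N4, N8
X4 = 0.0E0; N4 = 0
DO WHILE (ABS(COS(X4)-X4) .GT. 1.0E-5)    ! single-precision tolerance
  X4 = COS(X4); N4 = N4 + 1
END DO
X8 = 0.0D0; N8 = 0
DO WHILE (ABS(COS(X8)-X8) .GT. 1.0D-12)   ! double-precision tolerance
  X8 = COS(X8); N8 = N8 + 1
END DO
PRINT *, 'REAL(4) iterations:', N4, '  REAL(8) iterations:', N8
END PROGRAM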

It is important to note that the default precision is REAL(4) in IVF. Therefore you should be aware that 1.0 and 123.45E+6 are REAL(4) literals, whereas 1.0D0 and 123.45D+6 are REAL(8) literals (compiler options can change default real size). Therefore

IF(Real8Var .gt. 1.0) then

is a comparison of a REAL(8) to a REAL(4), which, depending upon compile-time options, may result in runtime code that loads the REAL(4) literal, promotes it to REAL(8), and then performs the test. Promotion of literals is not guaranteed at compile time; this preserves compatibility of results with older code where promotion was only performed at run time.
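
A small self-contained sketch of the point (variable name invented):

REAL(8) :: R8
R8 = 1.5D0
IF (R8 .GT. 1.0)   PRINT *, 'compared against a REAL(4) literal (may be promoted at run time)'
IF (R8 .GT. 1.0D0) PRINT *, 'compared against a REAL(8) literal (kinds match, no promotion)'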

Jim Dempsey

www.quickthreadprogramming.com

We run a utility to convert things like 1.E0 to 1.D0 for production DX code. I also tried running "un-converted" single-precision code as double precision without the utility, by adding the RealKIND="realKIND8" flag alone, and the run times between "un-converted" and "converted" code were pretty much the same (33:36 vs 33:55). Good job on that one, Intel!
Now, here are the clock run times and the number of time-steps for my different builds (keep in mind "old build" also means "old compiler" and "new build" also means "new compiler"):
Build   Precision  Platform  Run time  Timesteps
Old     SX         32-bit    29:43     267762
Old     DX         32-bit    34:37     265440
Old     SX         64-bit    20:41     270827
Old     DX         64-bit    24:09     271496
New     SX         32-bit    30:13     272290
New     DX         32-bit    48:06     316374
New     SX         64-bit    20:15     271076
New     DX         64-bit    33:55     327141
Do you see why I'm very suspicious of the compiler here?

You must look closely at how you compute the convergence - not only at the source-code level but also in combination with the compiler options relating to expression evaluation (the default may have changed). Check/experiment with the /fp:... switches.
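
For example (source file name hypothetical), rebuilding with value-safe floating-point semantics helps isolate whether expression re-association changed between compiler updates:

    ifort /O2 /fp:source mysolver.f90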

You might consider inserting trace printouts during various sections of the convergence:
Iterations 1:100
Iterations 1000:1100
Iterations 10000:10100
Iterations 100000:100100
Iterations 270000:270100
Iterations 316200:316300
Iterations 327000:327100

(or use spans of +10)

Seeing how the components of what determines convergence vary may yield information as to what is going on and how to fix it.
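
A minimal sketch of such windowed tracing, assuming an iteration counter ITER and a convergence residual RESID (both hypothetical names):

IF ((ITER.GE.1      .AND. ITER.LE.100)    .OR. &
    (ITER.GE.270000 .AND. ITER.LE.270100) .OR. &
    (ITER.GE.316200 .AND. ITER.LE.316300)) THEN
  WRITE(*,'(A,I8,A,ES14.6)') 'iter=',ITER,'  resid=',RESID
ENDIF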

You may find that your old code was so wrong that correct code may have yielded convergence in a fraction of the number of iterations (and with better results).
Jim Dempsey

www.quickthreadprogramming.com

Whether I use a 4-year-old version of my code or today's version makes no difference. All single-precision runs, OpenVMS runs, G95/Linux runs, and double-precision runs with the old compiler take around 270,000 iterations to complete the calculation. Double-precision runs with ifort 2011 Update 6 or later take more iterations and more run time. Check out the file I attached, built as Release/Win32. I compared run times with Update 4 vs. Update 10.

On a Core 2 Quad Q9450: new ifort -> 17.55 seconds; old ifort -> 9.45 seconds
On a Xeon E5-2687W: new ifort -> 14.96 seconds; old ifort -> 4.52 seconds

    <Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" OmitFramePointers="false" GenAlternateCodePaths="codeForAVX" IntegerKIND="integerKIND8" LocalSavedScalarsZero="true" FloatingPointExceptionHandling="fpe0" FlushDenormalResultsToZero="true" Traceback="true"/>

Now tell me again that there is something wrong with my code?? What changed in Update 6???

Attachment: source1.txt (1.63 KB)

Look at the disassembly code for the 32-bit Release build:
do i=1,10000000000 !1e10
00EC108E mov dword ptr [I],1
00EC1095 mov dword ptr [ebp-14h],0
a=a*b
00EC109C movsd xmm0,mmword ptr [A]
00EC10A1 movsd xmm1,mmword ptr [B]
00EC10A6 mulsd xmm0,xmm1
00EC10AA movsd mmword ptr [A],xmm0
enddo
00EC10AF mov eax,dword ptr [I]
00EC10B2 mov edx,dword ptr [ebp-14h]
00EC10B5 mov ecx,1
00EC10BA mov ebx,0
00EC10BF add eax,ecx
00EC10C1 adc edx,ebx
00EC10C3 mov dword ptr [I],eax
00EC10C6 mov dword ptr [ebp-14h],edx
00EC10C9 mov eax,540BE400h
00EC10CE mov edx,2
00EC10D3 mov ecx,dword ptr [I]
00EC10D6 mov ebx,dword ptr [ebp-14h]
00EC10D9 sub ecx,eax
00EC10DB mov dword ptr [ebp-3Ch],ecx
00EC10DE sbb ebx,edx
00EC10E0 mov dword ptr [ebp-38h],ebx
00EC10E3 jl MAIN+9Ch (0EC109Ch)

This being 32-bit code, with the loop control variable I being 64-bit and therefore handled in pairs of 32-bit registers, the management of the loop control variable exceeds the computation portion of the loop.
BTW, this loop has invariant content, and therefore the loop should have been eliminated (its content executed once).
In looking at the 64-bit code:
do i=1,10000000000 !1e10
000000013F2810A0 mov qword ptr [I],1
a=a*b
000000013F2810A8 movsd xmm0,mmword ptr [A]
000000013F2810AD movsd xmm1,mmword ptr [B]
000000013F2810B2 mulsd xmm0,xmm1
000000013F2810B6 movsd mmword ptr [A],xmm0
enddo
000000013F2810BB mov eax,1
000000013F2810C0 add rax,qword ptr [I]
000000013F2810C4 mov qword ptr [I],rax
000000013F2810C8 mov rax,2540BE400h
000000013F2810D2 mov rdx,qword ptr [I]
000000013F2810D6 cmp rax,rdx
000000013F2810D9 jge MAIN+0A8h (13F2810A8h)
The computation of the expression is the same, but there is less overhead in processing the loop control variable.
This too suffers from the optimization issue of not eliminating the loop when the contents are invariant.
...
*** Both loops have an optimization issue where the actual variable I is updated at each iteration (its memory location is updated) as opposed to keeping a temporary I in a register (64-bit) or register pair (32-bit).
...
The above was compiled with /O3 - Maximum Speed.
*** However, Intel, in their non-infinite wisdom, chooses not to fully optimize unless you also have Interprocedural Optimization (IPO) enabled.
Code generated using IPO
32-bit
a=a*b
00021065 mulpd xmm1,xmm0
00021069 mulpd xmm1,xmm0
0002106D mulpd xmm1,xmm0
do i=1,10000000000 !1e10
00021071 add esi,8
00021074 mov edi,esi
00021076 adc ecx,0
00021079 sub edi,eax
0002107B mov edi,ecx
a=a*b
0002107D mulpd xmm1,xmm0
do i=1,10000000000 !1e10
00021081 sbb edi,edx
00021083 jb MAIN+65h (21065h)
a=1.d0
00021085 movaps xmm0,xmm1
do i=1,10000000000 !1e10
00021088 mov esi,eax
a=1.d0
0002108A unpckhpd xmm0,xmm1
b=1.000000001d0
0002108E mov ecx,2
a=1.d0
00021093 mulsd xmm1,xmm0
00021097 movsd mmword ptr [esp+28h],xmm1
b=1.000000001d0
0002109D movsd xmm0,mmword ptr [___xt_z+74h (0A62A0h)]
000210A5 movsd xmm1,mmword ptr [esp+28h]
do i=1,10000000000 !1e10
000210AB add esi,1
000210AE mov edi,esi
000210B0 adc edx,0
000210B3 sub edi,eax
000210B5 mov edi,edx
a=a*b
000210B7 mulsd xmm1,xmm0
do i=1,10000000000 !1e10
000210BB sbb edi,ecx
000210BD jb MAIN+0ABh (210ABh)
---------------------------------------------------------
64-bit
do i=1,10000000000 !1e10
000000013F24104E xor eax,eax
a=1.d0
000000013F241050 movaps xmm6,xmmword ptr [__xt_z+50h (13F2EB4C0h)]
b=1.000000001d0
000000013F241057 movaps xmm0,xmmword ptr [__xt_z+60h (13F2EB4D0h)]
a=a*b
000000013F24105E mulpd xmm6,xmm0
000000013F241062 mulpd xmm6,xmm0
000000013F241066 mulpd xmm6,xmm0
do i=1,10000000000 !1e10
000000013F24106A add rax,8
a=a*b
000000013F24106E mulpd xmm6,xmm0
do i=1,10000000000 !1e10
000000013F241072 cmp rax,540BE400h
000000013F241078 jb MAIN+5Eh (13F24105Eh)
a=1.d0
000000013F24107A movaps xmm0,xmm6
do i=1,10000000000 !1e10
000000013F24107D mov edx,540BE400h
a=1.d0
000000013F241082 unpckhpd xmm0,xmm6
b=1.000000001d0
000000013F241086 mov rax,2540BE400h
a=1.d0
000000013F241090 mulsd xmm6,xmm0
b=1.000000001d0
000000013F241094 movsd xmm0,mmword ptr [__xt_z+70h (13F2EB4E0h)]
do i=1,10000000000 !1e10
000000013F24109C inc rdx
a=a*b
000000013F24109F mulsd xmm6,xmm0
do i=1,10000000000 !1e10
000000013F2410A3 cmp rdx,rax
000000013F2410A6 jb MAIN+9Ch (13F24109Ch)
enddo
The above is vectorized (two double results per xmm register) and unrolled 4x2 times, then performs the remainder loop, and keeps the loop control in a register.

This dummied-up example is not necessarily representative of your application problem (unless your loop control variable is INTEGER(8)).
If your application uses INTEGER(8), then I would suggest you code it differently. Example:
INTEGER(4) :: nOuter, nInner, iOuter, iInner
INTEGER(8) :: N, i
...
i = 1
nOuter = INT(N / 1000000000_8)
do iOuter = 1, nOuter
  do iInner = 1, 1000000000      ! inner trip count fits INTEGER(4)
    ... call yourWork(...)       ! i remains available as the INTEGER(8) index
    i = i + 1
  end do
end do
nInner = INT(MOD(N, 1000000000_8))
do iInner = 1, nInner
  ... call yourWork(...)
  i = i + 1
end do

www.quickthreadprogramming.com

Sorry I could not embellish the above post with additional comments due to this $%&! forum editor going bonkers when your reply exceeds the size to which you can stretch this edit box.

Suggestions now are:

Enable IPO in the compiler (at least single file)
*** and enable IPO in the linker.

Jim Dempsey

www.quickthreadprogramming.com

BTW, regarding the prior post (the INTEGER(8) loop control variable):
Perform this optimization only if (when) IPO does not reduce the loop control overhead to an acceptable portion of the overall loop iteration.
Also note that in your example, i was not used inside the loop, therefore the i=i+1 could be eliminated.
Jim Dempsey

www.quickthreadprogramming.com

This is interesting... if I do 1e9 iterations, integer(4) or integer(8) doesn't make any difference, with either the new or the old compiler. It's the 1e10 iterations that make a difference.

If I re-write it as two nested loops of 1e5 iterations each (one loop inside the other) - both versions of the compiler run at the same speed (fast), AND also give me digit-for-digit identical results! (See the sketch below.)
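
For reference, a minimal sketch of the rewrite just described, using the a=a*b kernel from the attached test (loop bounds as described above):

INTEGER(4) :: I, J
REAL(8) :: A, B
A = 1.D0
B = 1.000000001D0
DO J = 1, 100000        ! outer 1e5 iterations
  DO I = 1, 100000      ! inner 1e5 iterations: 1e10 total, all INTEGER(4) control
    A = A*B
  END DO
END DO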

I'll try my full-blown code with the new compiler, changing integer(8) to integer(4) everywhere (though my "single-precision" build also has all integers as integer(8), so I'm not sure how much difference it will make; changing all integers to integer(8) was just a quick, cheap fix for a legacy hashing/integer-packing algorithm).

Unfortunately, making all integers integer(4) didn't help my run time, or the number of iterations to converge in my release code.

So, is the issue identified above another (unrelated) bug, Intel?

>>Unfortunately, making all integers integer(4) didn't help my run time, or the number of iterations to converge in my release code.
If your integer(4) loop control variables are iterating 2^31 or more times, then this will cause havoc in those loops (note, this may not cause a program crash). You will need to ensure that, should a loop require more iterations than that, you either modify the code as discussed above (nested loops) or use an integer(8) loop control.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:
If your integer(4) loop control variables are iterating 2^31 or more times, then this will cause havoc in those loops (note, this may not cause a program crash).

Understood... I didn't need integer(8) for loops; I needed it for a goofy utility integer-packing algorithm.
