significant speed reduction when converting to dynamic code

I have been updating some legacy code to make it dynamic. I have primarily converted COMMON blocks to modules. The code uses many very large multi-dimensional arrays, mostly of rank 3 but some of higher rank as well. I have seen about a 50% increase in run time after making this change. I am guessing that some optimization is taking a hit now that the array sizes are not known at compile time. Is this type of speed reduction typical, and/or are there some flags I could set at compile time that would help?

Thanks.

P.S. I am currently compiling for 64-bit Windows but will also be compiling for 64-bit Linux. I am using Intel Visual Fortran Composer XE 2013.1.119.

Switching to modules ought not to prevent the compiler from knowing array sizes, although I agree I also have such concerns. Alignment will definitely be affected; you may want to try the 13.0 Fortran options such as /align:array32byte. With the Intel 64 compilers, one would think the default should be the equivalent of /align:array16byte, but it may be worth checking (e.g. with LOC()).
I would never have guessed that "making it dynamic" meant changing from COMMON to MODULE. Do you mean also changing arrays to allocatable? That might kill optimizations based on array length, unless you add loop count directives.
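For reference, a minimal sketch of such a loop count directive (an Intel Fortran extension; the MIN/MAX/AVG trip counts here are purely illustrative, not taken from your code):

```fortran
SUBROUTINE FILL(ARRAY1, NX, NY, NZ, VAL)
  INTEGER, INTENT(IN) :: NX, NY, NZ
  REAL, INTENT(OUT)   :: ARRAY1(NX,NY,NZ)
  REAL, INTENT(IN)    :: VAL
  INTEGER :: I, J, K
  DO K = 1, NZ
    DO J = 1, NY
      ! Hint the optimizer about the trip count it can no longer
      ! deduce at compile time; numbers are illustrative only.
      !DIR$ LOOP COUNT MIN(20), MAX(300), AVG(70)
      DO I = 1, NX
        ARRAY1(I,J,K) = VAL
      END DO
    END DO
  END DO
END SUBROUTINE FILL
```

The directive applies only to the loop that immediately follows it, which is why it would have to be repeated at each loop.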

Yes, I do also make the arrays allocatable and allocate them early on in the program.

Can you run VTune on both revisions of the code? It may give you insight into how small changes could improve performance. A 50% difference in performance is rather large; finding out what accounts for this difference should point to a solution.

Jim Dempsey

www.quickthreadprogramming.com

To be a bit more specific,
the code had many common blocks with a mix of arrays and scalars most of which were implicitly typed.
A representative example would be something like

PARAMETER(NX=170,NY=59,NZ=80,NPS=12,NSL=650)
COMMON /CBLOCK/ARRAY1(NX,NY,NZ),ARRAY2(NX,NY,NZ),NSCAL1,ARRAY3(NPS,NSL),NSCAL2,NARRAY4(NSL,NPS,NX,NY,NZ),SCAL3
DO K=1,NZ
  DO J=1,NY
    DO I=1,NX
      ARRAY1(I,J,K)=something
    END DO
  END DO
END DO
(note that the limits of the do loops were parameters)

I made a module as follows

MODULE CBLOCK
REAL SCAL3
INTEGER NSCAL1,NSCAL2
REAL, ALLOCATABLE :: ARRAY1(:,:,:),ARRAY2(:,:,:),ARRAY3(:,:)
INTEGER, ALLOCATABLE :: NARRAY4(:,:,:,:,:)
CONTAINS
SUBROUTINE ALLOC_CBLOCK(NX,NY,NZ,NPS,NSL)
ALLOCATE(ARRAY1(NX,NY,NZ),ARRAY2(NX,NY,NZ),ARRAY3(NPS,NSL),NARRAY4(NSL,NPS,NX,NY,NZ))
END SUBROUTINE ALLOC_CBLOCK
END MODULE CBLOCK

I then replaced the COMMON block statement with "USE CBLOCK" and the PARAMETER statement with COMMON /PARAMS/NX,NY,NZ,NPS,NSL in all of the routines which contained them.

In a NEW routine called near the start of the code, I have the following

USE CBLOCK
COMMON /PARAMS/NX,NY,NZ,NPS,NSL

{read in some file info to determine NX, NY, NZ, NPS, and NSL}
CALL ALLOC_CBLOCK(NX,NY,NZ,NPS,NSL)

There were something like 120 different common blocks replaced in this manner with a total of several dozen different parameters.
These parameters have historically been set for each case and a new executable compiled using them. We want the code to be compiled once and used many times without re-compiling for each case. During any execution of the code the array sizes will be constant once allocated; between runs, however, the sizes may differ.
Was this the best way to replace these parameters with variables that can be determined at run time?
If so is a 50% increase in run time reasonable?

I think that the nested DO loops whose limits were parameters and are now variables may be where the optimization is hurting.
I guess that this may be where loop count directives come in.
I have never used these, but it seems they give the compiler an idea of the range of a DO loop: a directive placed before each loop whose limit is now a variable, so the compiler has some idea of how to optimize that loop. If that is actually the case, that's thousands of additional lines of code (which I would like to avoid if at all possible).
Is it possible to define a directive for a given variable once per routine and let the compiler apply it to each loop?
i.e. at the top of the routine I would give a directive stating that NX has a max of 300, a min of 20, and a mean of 70, with a similar directive for each old parameter, then have the compiler apply that as a loop count directive wherever NX is used as a limit.

Would such a broad range even be helpful to the compiler if it were given as a loop count directive or otherwise?

In response to Jim: I have run VTune on both versions, but I am not really sure what to make of the difference between the two. (This is my first time using VTune, so I am still learning how to interpret its output. Also, I have never done any form of code optimization or profiling before, which is probably evident in my obvious lack of knowledge on the subject.) What would be your first suggestion to look at in the VTune output? I can see which routines are running slower (many are), but I am not sure how to determine the cause.
Thanks

Have you disabled the runtime checks for array index out of bounds? If not, do so and compare runtimes again.
If you have...

There is a runtime option to emit a diagnostic report when an array temporary is created. When rearranging code for module use, the new code may generate array temporaries. Identifying these sections of code and reworking them to eliminate the temporaries may improve performance.
The next issue can be that allocatable arrays require the code to generate and pass array descriptors (which are used to locate the array data), as opposed to the base address of a fixed array (when using COMMON with fixed dimensions). Also, the use of allocatables might create array temporaries where none were required before. This may be correctable with minor rework of the code.
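As a hedged illustration of where such temporaries come from (routine names here are made up): passing a contiguous array section to an explicit-shape dummy is a straight address pass, but passing a strided section forces a copy into a temporary, which /check:arg_temp_created will report at run time.

```fortran
SUBROUTINE DEMO(A, NX, NY, NZ)
  INTEGER, INTENT(IN) :: NX, NY, NZ
  REAL, INTENT(INOUT) :: A(NX,NY,NZ)
  ! Contiguous section: passed by base address, no copy made.
  CALL WORK(A(:,1,1), NX)
  ! Strided section: an array temporary is created to make the
  ! data contiguous before the call, then copied back afterwards.
  CALL WORK(A(1,:,1), NY)
END SUBROUTINE DEMO

SUBROUTINE WORK(V, N)
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(INOUT) :: V(N)
  V = V + 1.0  ! placeholder computation
END SUBROUTINE WORK
```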

VTune

I suggest you configure a test program that runs 60 seconds or so on the old COMMON structured code. Then configure the new code to use the same size and data source (or copy thereof).

Using two instances of Visual Studio, run and collect the data for each configuration (do not run at the same time, and do not futz with Performance Monitor while running your data collections). Then look at the reports side-by-side (dual monitors can help). Usually the report is a sorted table in runtime order (routine by routine).

*** Note: run this test with optimization at least /O1 and with IPO disabled. Also, remove the runtime checks for array index out of bounds.

www.quickthreadprogramming.com

I included the /check:arg_temp_created option but did not see any output statements. Should they show up on standard output or in some sort of diagnostic file?
Also Jim mentioned,
Quote:

The next issue can be that allocatable arrays require the code to generate and pass array descriptors (which are used to locate the array data), as opposed to the base address of a fixed array (when using COMMON with fixed dimensions).

Is there a way to detect and correct this, or is it just one of the costs of using allocatable arrays?

It's a cost of using deferred-shape arrays - not just allocatable. But the overhead for that itself is relatively low.

Steve - Intel Developer Support

>>If so is a 50% increase in run time reasonable?

How are you timing the 50% increase in runtime? (e.g. 0.5 seconds versus 0.75 seconds, or 50 seconds versus 75 seconds)

Is the timing inside the program (or wall clock from command line)?

If your timing covers a relatively short duration, note that the COMMON-block arrays load in with the executable, whereas the allocatable arrays load later.
Note, allocatables can (often) have their pages deferred until first touch. IOW:

! using COMMON
T1 = omp_get_wtime()
DO K=1,NZ
DO J=1,NY
DO I=1,NX
ARRAY1(I,J,K)=something
END DO
END DO
END DO
T2 = omp_get_wtime()
ElapseTime = T2-T1
----
! using module
ALLOCATE(ARRAY1(NX,NY,NZ))
T1 = omp_get_wtime()
DO K=1,NZ
DO J=1,NY
DO I=1,NX
ARRAY1(I,J,K)=something
END DO
END DO
END DO
T2 = omp_get_wtime()
ElapseTime = T2-T1
--------------------
Comments
Although the above two are functionally the same, under the hood they are different.
In the COMMON program, the virtual memory of ARRAY1 has already been "touched" by the program load.
In the ALLOCATABLE program, the virtual memory of ARRAY1 is not "touched" until the first

ARRAY1(I,J,K)=something

Virtual memory is committed in page-size chunks (4KB or 4MB) at run time (not at allocation time), the first time each page of virtual memory is "touched". Should ARRAY1 be DEALLOCATEd and subsequently reALLOCATEd, or something else be allocated at the same virtual address, the virtual memory pages at those addresses need not be committed again. This first-touch commitment induces a latency into the program.

Due to this latency, you should consider making multiple passes over the timed sections of your application so that you can discard the first-touch runs. Note: if your application has run-once-then-exit characteristics, consider looking at your O/S runtime library for a routine that can perform this virtual memory allocation/first touch in one step. This may be a linker option or compile-time option as well.
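A minimal sketch of pre-touching the pages outside the timed region (assuming the omp_get_wtime timer shown above):

```fortran
ALLOCATE(ARRAY1(NX,NY,NZ))
! First touch outside the timed region: this whole-array write commits
! the virtual pages now, so the page-fault latency is not billed to
! the timed loop below.
ARRAY1 = 0.0
T1 = omp_get_wtime()
! ... timed work on ARRAY1 ...
T2 = omp_get_wtime()
ElapseTime = T2 - T1
```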

Jim Dempsey

www.quickthreadprogramming.com

The program is one that typically runs for a week or more, so a 50% increase in run time is an enormous cost. It performs many large iterations (made up of a hundred or more smaller iterations, which are in turn made up of thousands of tiny iterations) that take on the order of an hour or more. I have compared the run time for several of the largest iterations, and that is what has increased by roughly 50%: from roughly 1 hr 30 min with the original code to 2 hrs 15 min with the dynamic version. I have run the same case with very small data sets, which run much faster, and gotten similar ratios: from 12 min to 17 min.

In comparing the profiles between the old and new code, it seems as though the lines with the greatest increase in CPU time are the ones that reference several of the multidimensional arrays, particularly inside nested loops.

for example something like
R=A1(I,J,K)*B(I+1,J,K)+A2(I,J,K)*B(I-1,J,K)+A3(I,J,K)*B(I,J+1,K)+A4(I,J,K)*B(I,J-1,K)+A5(I,J,K)*B(I,J,K)+A6(I,J,K)
takes much longer.
For reference, the A arrays are included by a "use" statement within this routine; the B array is a dummy argument passed to the routine from its caller and is included in each of the calling routines by "use" statements. The B array does not seem to have as much difficulty as the A arrays.

Any thoughts?

Without looking at the disassembly (VTune can show disassembly), I will guess that what's at issue here is register pressure.
In the module format the above statement could require 7 registers to hold the base of A1-A6,B (or address to descriptor for each), plus registers for R, etc...
In the COMMON format, A1-A6,B are at a fixed address, the base of which is a known offset (not requiring a register).

You can reduce the register pressure by coalescing the An arrays into a single array:
ALLOCATE(A(NX, NY, NZ*6))
...
R=A(I,J,K)*B(I+1,J,K)+A(I,J,K+NZ)*B(I-1,J,K)+A(I,J,K+NZ*2)*B(I,J+1,K)+A(I,J,K+NZ*3)*B(I,J-1,K)+A(I,J,K+NZ*4)*B(I,J,K)+A(I,J,K+NZ*5)

Or you might consider using the Fortran Preprocessor
#define A1(i,j,k) A(i,j,k)
#define A2(i,j,k) A(i,j,k+NZ)
...
#define A6(i,j,k) A(i,j,k+(NZ*5))

Then use original statement

R=A1(I,J,K)*B(I+1,J,K)+A2(I,J,K)*B(I-1,J,K)+A3(I,J,K)*B(I,J+1,K)+A4(I,J,K)*B(I,J-1,K)+A5(I,J,K)*B(I,J,K)+A6(I,J,K)

Jim Dempsey

www.quickthreadprogramming.com

Would this register pressure still be an issue on other lines where a single array is referenced?
I see a similar increase in time from a given line with something like
G(I)=A(I,J,K)*X

Combining the arrays into a single array may be possible, but it's not simple.
Would the preprocessor command work if placed within the module, or would it have to be in each routine which uses it?

Preprocessor statement processing happens at compile time. The scope of a #define is from the #define to the end of the compilation unit. Use #include 'YourMacros.inc' and place your macros in there. Place the #include towards the top of each source file in which you wish the macros to take effect:

! fooBar - blabla
! Copyright(c)...
! ...
#include 'YourMacros.inc'

SUBROUTINE FOOBAR(...
...
END SUBROUTINE FOOBAR

SUBROUTINE FEE(...
(macros still in effect here)
...
------------------------
>>I see similar increase in time from a given line with something like
G(I)=A(I,J,K)*X

That shouldn't be an issue.

DO I=1, NX
G(I)=A(I,J,K)*X
END DO

If NX is larger than a few, the above should run the same for both configurations.

Have you verified that the runtime check for subscript out of bounds is disabled?
Same for uninitialized variable checks
(and any other runtime checks)

Also, did your code change to modules include a feature change as well (e.g. to OpenMP)?
If so, and if you are misusing PRIVATE and/or REDUCTION, you may have unnecessary overhead.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

Have you verified that the runtime check for subscript out of bounds is disabled?
Same for uninitialized variable checks
(and any other runtime checks)

I believe so. I am compiling from the command line with the following:
ifort /fpp /O3 /align:array32byte /Qdiag-disable:8290,8291 *.f -o program.exe
(the two disabled warnings are about print statement sizes being smaller than a suggested size)
The bounds checking and uninitialized-variable checks should both be off by default, from my understanding.
I suppose I could disable them explicitly just in case.

In looking at the assembly, the line I discussed before:
G(I)=A(I,J,K)*X
is blowing up. The assembly for this line in the original code was:
movss xmm10, dword ptr [rdi+r14*1+0x1f30f0fc]
mulss xmm10, xmm13
movss dword ptr [rdi+rsi*1+0x12660a7c], xmm10

now its:
mov r15, qword ptr [rip+0x34028f]
mov qword ptr [rbp+0x108], r15
mov rax, qword ptr [rip+0x3401fd]
mov qword ptr [rbp+0x278], rax
mov rdi, qword ptr [rbp+0x108]
imul rdi, rsi
mov rcx, qword ptr [rbp+0xb8]
add rdi, qword ptr [rbp+0x148]
mov qword ptr [rbp+0x1e0], rdi
mov qword ptr [rbp+0x1d8], r8
mov qword ptr [rbp+0x1d0], r9
mov qword ptr [rbp+0x1c8], r10
mov qword ptr [rbp+0x1c0], r11
mov qword ptr [rbp+0x1b8], r12
mov qword ptr [rbp+0x1b0], r13
mov qword ptr [rbp+0x1a8], r14
mov qword ptr [rbp+0x1a0], rbx
mov qword ptr [rbp+0x198], r15
mov qword ptr [rbp+0x190], rdx
mov qword ptr [rbp+0x1f8], rcx
mov qword ptr [rbp+0xd0], rsi
mov qword ptr [rbp+0x188], rax
mov qword ptr [rbp+0x200], r14
mov rdx, r14
mov r15, qword ptr [rbp+0x1e0]
lea r15, ptr [r15+rax*4]
mov qword ptr [rbp+0x208], r15
mov rax, qword ptr [rbp+0x208]
movss xmm10, dword ptr [rdi+rax*1]
mulss xmm10, xmm13
movss dword ptr [r15+r14*1+0x555405c], xmm10

I post this in the hope that someone can see why the compiler would generate all of these additional operations, which would hopefully point to a more general solution. I myself am an engineer by trade and only program out of necessity. I know nearly nothing about assembly, but it sure seems like a lot of extra work is going on in the new version of the code, and I am assuming that is directly leading to the increased run time.

Thanks

Tim,
.
You stated:
"I then replaced the COMMON block statement with "USE CBLOCK" and the PARAMETER statement with COMMON /PARAMS/NX,NY,NZ,NPS,NSL in all of the routines which contained them."
.
I would place the variables NX,NY,NZ,NPS,NSL in the module CBLOCK and not in a new COMMON, if possible.
.
A problem you might be having is that, with these parameters now changeable, the size of the problem might be exceeding your available physical memory. I would place a memory-usage report in SUBROUTINE ALLOC_CBLOCK to identify if this is a problem. SIZEOF can report the size of these arrays. Make sure it reports as INTEGER(8), then divide by 2.**30 for gigabytes. (Report each of the arrays.)
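A sketch of such a report, inserted into the ALLOC_CBLOCK routine shown earlier (SIZEOF is an Intel Fortran extension returning the size in bytes; the format is illustrative):

```fortran
SUBROUTINE ALLOC_CBLOCK(NX, NY, NZ, NPS, NSL)
  INTEGER(8) :: NBYTES
  ALLOCATE(ARRAY1(NX,NY,NZ), ARRAY2(NX,NY,NZ), &
           ARRAY3(NPS,NSL), NARRAY4(NSL,NPS,NX,NY,NZ))
  ! Report the combined footprint right after allocation, so an
  ! oversized case is caught before the long run starts.
  NBYTES = SIZEOF(ARRAY1) + SIZEOF(ARRAY2) + SIZEOF(ARRAY3) + SIZEOF(NARRAY4)
  WRITE(*,'(A,F10.3,A)') ' CBLOCK arrays: ', REAL(NBYTES)/2.0**30, ' GB'
END SUBROUTINE ALLOC_CBLOCK
```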
.
For arrays that have a large memory footprint, especially if you are extending beyond the available physical memory, you should check the array index order so that memory is addressed sequentially as much as possible/practical. Getting this wrong can result in reduced performance.
.
Changes as Jim has described can increase the complexity of the code, especially when coming back to change it later. If you adopt this, make sure it is well documented.
.
For "G(I)=A(I,J,K)*X", you could try using array syntax, such as:
G(:)=A(:,J,K)*X or
G(i1:i2)=A(i1:i2,J,K)*X or even a simple F77 wrapper like
call vector_multiply ( G, A(1,j,k), X, N)
You might find the compiler can clean this up.
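A minimal sketch of the F77-style wrapper suggested above (not tested): the dummies are explicit-shape, so inside the routine the compiler sees plain contiguous arrays with no descriptors to chase.

```fortran
SUBROUTINE VECTOR_MULTIPLY(G, A, X, N)
  INTEGER, INTENT(IN) :: N
  REAL, INTENT(OUT)   :: G(N)
  REAL, INTENT(IN)    :: A(N), X
  INTEGER :: I
  ! Simple contiguous loop; the caller passes A(1,J,K) so this
  ! walks one pencil of the rank-3 array.
  DO I = 1, N
    G(I) = A(I) * X
  END DO
END SUBROUTINE VECTOR_MULTIPLY
```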
.
I have done similar restructures to transfer COMMON arrays to allocatable arrays in a MODULE and found the conversion worked well. The size looks to be the most likely problem.
.
Modern compilers are (or claim to be) better at managing higher-rank arrays, so I would expect the problem is somewhere else. If the problem is not related to what I have suggested, I'd use VTune or a profiler and try to isolate where the reduced performance is occurring.
.
Hope this might help.
.
John

John,

Quote:

I would place the variables NX,NY,NZ,NPS,NSL in the module CBLOCK and not in a new COMMON, if possible.

I could put them there, and they are passed to the allocate routine that is within the module; however, I have hundreds of different common blocks which all share about 20 or so parameters, so if they were in each of these modules they would be repeated many times in a routine that uses many of these modules. What is the benefit of putting them in the modules?

Quote:

A problem you might be having is that, with these parameters now changeable, the size of the problem might be exceeding your available physical memory.

Why would my physical memory usage now be any greater than it was before? As far as I can tell it should now be less than or equal to what it was: before, the arrays only had to be equal to or greater than my needed space, and now they should match my needed space exactly. Also, would these memory issues explain the increase in assembly code for a given line of Fortran? My same subroutine goes from 167 lines of assembly (with the common block method) to 317 (with the modules method) when no optimization has been performed, and from around 650 to 1500 when optimized using /O3.

Both the optimized and un-optimized codes take around 50% longer to run when using allocatable arrays in modules vs. common blocks.
the un-optimized code went from 5357 sec to 7935 sec, a 48% increase
the optimized code went from 1613 sec to 2305 sec, a 43% increase
Could a memory issue still explain that?

Thanks

In the disassembly listing for "G(I)=A(I,J,K)*X", most of the "mov" instructions were for saving registers that will be used later in the code sequence. These happen to get "billed" to the listed statement. Discounting the register-saving mov's, you can still see that a fair amount of code is generated when the array descriptor must be accessed. The question to ask then becomes: how can I amortize the array descriptor work over several (many) statements?

Old code tended to be written to reduce loop overhead:

DO I=1,NX
(tens or hundreds of statements)
END DO

Due to the finite number of GP registers (16, ~14 usable), the body of the loop may exceed the ability to keep all frequently referenced data (information from within array descriptors) in registers. To correct for this, where possible, we code to reduce the number of registers required. Consider:

DO I=1,NX
(part of statements)
END DO

DO I=1,NX
(other part of statements)
END DO
...
DO I=1,NX
(remaining part of statements)
END DO

The compiler can do some of this for you, but it cannot when the legality of the transformation is questionable.

These loops may also be replaced by array syntax as John Campbell suggests, G(:)=A(:,J,K)*X (an implied loop).
Of course you will have to check the code for dependencies and code accordingly.

Jim Dempsey

www.quickthreadprogramming.com

I think I can see a bit better what the problem may be related to.
In the un-optimized code, the Fortran lines which refer only to local or COMMON variables are translated to essentially identical assembly code.
This is true even for an array which was passed to the routine and given dimensions in the routine by parameters, as in this case.
eg
SUBROUTINE FOO(ISTART,JSTART,KSTART,NI,NJ,NK,B)
use CBLOCK
PARAMETER( ... declarations of NX NY NZ and NBIG)
DIMENSION B(NX,NY,NZ)
COMMON /CBLOCK2/E(NBIG),F(NBIG)

Any lines such as
F(I) = B(I,J,K)
or
R = R-D*B(I,J,K)
come out in assembly essentially identical to how they did before the use of the modules,

but a line like
E(I)=AE(I,J,K)*X (where AE came from the common block before but now comes from a module)
now goes from 27 lines of assembly to 57.

The actual line of code (so you can match up spacing) is
E(I) = AE(I,J,K)*DENOM
and COEF is the name of the common block/module
from
mov eax, DWORD PTR [40+rbp] ;58.15
movsxd rax, eax ;58.15
imul rax, rax, 13924 ;58.23
lea rdx, QWORD PTR [COEF] ;58.23
add rdx, 6182256 ;58.23
add rdx, rax ;58.15
add rdx, -13924 ;58.15
mov eax, DWORD PTR [52+rbp] ;58.15
movsxd rax, eax ;58.23
imul rax, rax, 236 ;58.23
add rdx, rax ;58.15
add rdx, -236 ;58.15
mov eax, DWORD PTR [68+rbp] ;58.15
movsxd rax, eax ;58.23
imul rax, rax, 4 ;58.23
add rdx, rax ;58.15
add rdx, -4 ;58.15
movss xmm0, DWORD PTR [rdx] ;58.23
movss xmm1, DWORD PTR [112+rbp] ;58.23
mulss xmm0, xmm1 ;58.15
mov eax, DWORD PTR [68+rbp] ;58.32
movsxd rax, eax ;58.32
imul rax, rax, 4 ;58.15
lea rdx, QWORD PTR [COEFL] ;58.15
add rdx, rax ;58.15
add rdx, -4 ;58.15
movss DWORD PTR [rdx], xmm0 ;58.15

to
lea rax, QWORD PTR [COEF_mp_AE] ;58.23
add rax, 56 ;58.23
mov edx, 48 ;58.15
add rax, rdx ;58.15
mov edx, DWORD PTR [40+rbp] ;58.15
movsxd rdx, edx ;58.23
imul rdx, QWORD PTR [rax] ;58.23
add rdx, QWORD PTR [COEF_mp_AE] ;58.15
lea rax, QWORD PTR [COEF_mp_AE] ;58.23
add rax, 56 ;58.23
mov ecx, 48 ;58.15
add rax, rcx ;58.15
mov rax, QWORD PTR [rax] ;58.23
lea rcx, QWORD PTR [COEF_mp_AE] ;58.23
add rcx, 64 ;58.23
mov ebx, 48 ;58.15
add rcx, rbx ;58.15
imul rax, QWORD PTR [rcx] ;58.15
sub rdx, rax ;58.15
lea rax, QWORD PTR [COEF_mp_AE] ;58.23
add rax, 56 ;58.23
mov ecx, 24 ;58.15
add rax, rcx ;58.15
mov ecx, DWORD PTR [52+rbp] ;58.15
movsxd rcx, ecx ;58.23
imul rcx, QWORD PTR [rax] ;58.23
add rdx, rcx ;58.15
lea rax, QWORD PTR [COEF_mp_AE] ;58.23
add rax, 56 ;58.23
mov ecx, 24 ;58.15
add rax, rcx ;58.15
mov rax, QWORD PTR [rax] ;58.23
lea rcx, QWORD PTR [COEF_mp_AE] ;58.23
add rcx, 64 ;58.23
mov ebx, 24 ;58.15
add rcx, rbx ;58.15
imul rax, QWORD PTR [rcx] ;58.15
sub rdx, rax ;58.15
mov eax, DWORD PTR [68+rbp] ;58.15
movsxd rax, eax ;58.23
imul rax, rax, 4 ;58.23
add rdx, rax ;58.15
lea rax, QWORD PTR [COEF_mp_AE] ;58.23
add rax, 64 ;58.23
mov rax, QWORD PTR [rax] ;58.23
imul rax, rax, 4 ;58.15
sub rdx, rax ;58.15
movss xmm0, DWORD PTR [rdx] ;58.23
movss xmm1, DWORD PTR [112+rbp] ;58.23
mulss xmm0, xmm1 ;58.15
mov eax, DWORD PTR [68+rbp] ;58.32
movsxd rax, eax ;58.32
imul rax, rax, 4 ;58.15
lea rdx, QWORD PTR [COEFL] ;58.15
add rdx, rax ;58.15
add rdx, -4 ;58.15
movss DWORD PTR [rdx], xmm0 ;58.15

Like I said, I don't know anything about assembly. These lines are from the file resulting from the command
ifort /fpp /Od /S file.f

To Steve Lionel:
Is this additional assembly just the added overhead for the deferred-shape arrays?
It does not seem relatively low in this case.

Thanks

The structure of this portion of the code is of the form:
01 DO K = KSTART,KEND
02   DO J = JSTART,JEND
03     E(IST) = 0.0
04     F(IST) = B(IST,J,K)
05     DO I = ISTART,IEND
06       IF(AP(I,J,K).GT.1.0E+19) THEN
07         E(I) = 0.0
08         F(I) = B(I,J,K)
09       ELSE
10         R = A1(I,J,K)*B(I,J+1,K)+A2(I,J,K)*B(I,J-1,K)+A3(I,J,K)*B(I,J,K+1)+A4(I,J,K)*B(I,J,K-1)+C(I,J,K)
11         Y = T*(A1(I,J,K)+A2(I,J,K)+A3(I,J,K))
12         R = R - Y*B(I,J,K)
13         X = 1.0/(A6(I,J,K)-F-A6(I,J,K)*E(I-1))
14         E(I) = A5(I,J,K)*X
15         F(I) = (R+A6(I,J,K)*F(I-1))*X
16       ENDIF
17     END DO
18     DO II = ISTART,IEND
19       I = NI+IEND-II
20       B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)
21       C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))
22     END DO
23   END DO
24 END DO

Lines like 3, 4, 7, 8, 12, 19, 20, and 21 have identical assembly (at least when not optimized),
while lines like 6, 10, 11, 13, 14, and 15 are much larger.
The variables in the first group of lines are all local, COMMON, or passed as an argument with the size declared locally;
the second group contains variables from the module.

Jim,
You mentioned before that combining the A1-A5 arrays might help.
It seems as though the original code does not have this issue, as it indexes the entire common block from a single memory location, whereas the new code needs to index from many different locations, one per module variable/array.

I am now inclined to pursue this option for this module (although others may not work nearly as well).
I would prefer not to change the main body of the code, as then I am much less likely to inadvertently change the results.
(I am not working with a small piece of code but with hundreds of routines, many of which are quite complex, so any code changes will have to take place in a very general manner, like in each of the modules but not in the code body.)

In my limited experience it seems as though some changes will need to be made within the routines themselves and not just the module, or the program will still think of each of the individual A1-A6 arrays as individual blocks of memory and not one large contiguous one.
If I change the module such that the only variable in it is, let's say, ARRAY(NX,NY,NZ,6), then in the routine put preprocessor statements like
#define A1(i,j,k) A(i,j,k,1:1)
#define A2(i,j,k) A(i,j,k,2:2)
#define A3(i,j,k) A(i,j,k,3:3)
etc.

should that work?

thanks again

OK, I ran a quick test using my last idea.
I modified #define A1(i,j,k) A(i,j,k,1:1) to #define A1(i,j,k) A(i,j,k,1).
-----I got the exact same number of assembly lines for both versions;
now to see if it runs faster.----- THIS WAS WRONG (how come we can't use a del html tag?)
thanks
i'll let you know how it goes

Tim s,

I used the preprocessor trick on an F77-to-F90 conversion of a solution with 13 projects, ~750 files, and 600,000 lines of code. The #define and USE were conditionalized at the top of the files such that for most of the conversion process the same source files could compile using either modules or COMMON. This helped immensely in the first phase of the conversion. Once all was running as it should, new features were added to the source files (making the code no longer compilable as F77).

I think you can start slowly, such as with the A1...A6 hack. This will be a diminishing-returns type of thing; hopefully a few such changes will be all that is required.

Don't give up on modules too soon. On my conversion from F77 COMMONs to F90 ALLOCATABLEs, principally for conversion from serial to parallel programming with OpenMP, I was able to get the serial program to run 10x faster (finding and correcting some serious performance issues); added to that was the scaling from parallel coding. In the end, on a 4-core system I attained a ~40x performance boost for my efforts. Your Mileage May Vary (this was non-typical).

Jim Dempsey

www.quickthreadprogramming.com

Tim s,

In your post beginning with "the structure of this portion of the code is of the form...":

If the frequency of IF(AP(I,J,K).GT.1.0E+19) being true is very low, then consider removing the test from the DO I loop and adding a second DO I loop after the current one, which (seldom) overstrikes the results generated in the first loop.

This will improve vectorization.

Jim Dempsey

www.quickthreadprogramming.com

OK, my last post was wrong.
I am not sure what happened, but I must have looked at the wrong file.
Give me a minute to sort things out.

OK, so the number of assembly lines actually went up.
I will try the new version though and see what the difference is.

Jim,
I would like to understand the #define command better.
Is using #define A1(i,j,k) A(i,j,k,1)
somewhat like a find-and-replace done at compile time, with each instance of A1(i,j,k) replaced by A(i,j,k,1)? Or is it smarter than that, and will it also replace A1(i,j-1,k) with A(i,j-1,k,1),
or A1(3,j,k) with A(3,j,k,1), or A1(l,m,n) with A(l,m,n,1)?
Also, is it case sensitive?
Thanks

Essentially, yes: it is a token-level find-and-replace done by the preprocessor, and the macro's dummy arguments are substituted, so A1(i,j-1,k) expands to A(i,j-1,k,1). (BTW, this is almost the same as the C preprocessor.)

There is a "gotcha", though: the substitution is "as-is" (or "as-was") for each argument token.
This means you may need to enclose the dummy arguments in ()'s in order to avoid operator-precedence issues.
Example:

#define myScale(a,b) a * b
Now consider
myScale(x + y, z + q)

This will expand to "x + y * z + q", which is not what you expect.
Whereas:

#define myScale(a,b) (a) * (b)

the above ()'s look unnecessary, but they are necessary:

it expands to "(x + y) * (z + q)"

What you want

Sometimes you may need additional ()'s

#define myScale(a,b) ((a) * (b))

As to what is required, this will depend on your code. The extra parens "((a) * (b))" will almost always work.

Note: use #include for preprocessor directive files.
The Fortran INCLUDE may mislead you into thinking anything in the file has the scope of the subroutine/function; a #define's scope extends to the end of the compilation unit (or until #undef yourMacroName).

FORTRAN also has an ASSOCIATE / END ASSOCIATE

I suggest you experiment with that before the #define.
I used the #define hack prior to ASSOCIATE / END ASSOCIATE being available.

Jim Dempsey

www.quickthreadprogramming.com

You're lucky that you don't know anything about assembly language because otherwise looking at the way the unoptimized code generated by the compiler crawls along would cause debilitating physical pain. The explosion of code is due to the compiler having to walk through the descriptor for allocatable array AE(59,59,*), but all that should be absent in optimized code.
There are some who don't like global variables at all, whether in COMMON or in modules. There is always the possibility that another, invisible part of the program could touch your global variables. For example, if your program does I/O in a loop, you might have installed code to hook the I/O, which could result in your global variables being modified on return; they would then have to be reloaded on every iteration of the loop. This could cause the compiler to recompute the address of the next array element from scratch at every iteration rather than just adding a constant offset to the address of the last one.
You could avoid this if it were a problem for optimized code by placing the compute-intensive part in a subroutine and passing in the global variables to be used as dummy arguments. Then the Fortran aliasing rules forbid changing or reallocating the global variables associated with dummy arguments, assuming the subroutine references only the dummy arguments, not the module variables.
Your code doesn't seem to do I/O or invoke external procedures in the inner loop, though, so you probably don't have to go through the contortions of the last paragraph. Try posting the optimized assembly language; it gives a clearer picture of what your program may be doing to hinder optimization and is also easier to read.
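A minimal sketch of that wrapper idea (routine names here are hypothetical): the outer routine references the module once and the compute kernel receives the arrays as explicit-shape dummy arguments, so inside the kernel Fortran's no-aliasing rules apply and the compiler sees ordinary contiguous arrays.

```fortran
SUBROUTINE SOLVE_WRAPPER(NX, NY, NZ)
  USE CBLOCK            ! module holding the allocatable arrays
  INTEGER, INTENT(IN) :: NX, NY, NZ
  ! Pass the module arrays down as arguments; the kernel never
  ! touches the module directly.
  CALL SOLVE_KERNEL(ARRAY1, ARRAY2, NX, NY, NZ)
END SUBROUTINE SOLVE_WRAPPER

SUBROUTINE SOLVE_KERNEL(A1, A2, NX, NY, NZ)
  INTEGER, INTENT(IN) :: NX, NY, NZ
  REAL, INTENT(INOUT) :: A1(NX,NY,NZ), A2(NX,NY,NZ)
  ! Compute-intensive loops reference only the dummies, so the
  ! compiler may keep base addresses in registers across the loop.
  A1 = A1 + A2          ! placeholder for the real work
END SUBROUTINE SOLVE_KERNEL
```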

ALLOCATE(A(NX,NY,NZ*6))

...

DO K = KSTART,KEND

  DO J = JSTART,JEND

    E(IST) = 0.0

    F(IST) = B(IST,J,K)

    ! ASSOCIATE partition A(NX,NY,NZ*6) into 1D slices

    ASSOCIATE ( A1 => A(:,J,K), A2 => A(:,J,K+NZ), A3 => A(:,J,K+NZ*2), &

            & A4 => A(:,J,K+NZ*3), A5 => A(:,J,K+NZ*4), A6 => A(:,J,K+NZ*5) )

    DO I = ISTART,IEND

      R = A1(I)*B(I,J+1,K)+A2(I)*B(I,J-1,K)+A3(I)*B(I,J,K+1)+A4(I)*B(I,J,K-1)+C(I,J,K)

      Y = T*(A1(I)+A2(I)+A3(I))

      R = R - Y*B(I,J,K)

      X = 1.0/(A6(I)-F-A6(I)*E(I-1))

      E(I) = A5(I)*X

      F(I) = (R+A6(I)*F(I-1))*X

    END DO

    END ASSOCIATE ! end of ASSOCIATED A1,... A6

    DO I = ISTART,IEND

      IF(AP(I,J,K).GT.1.0E+19) THEN

        E(I) = 0.0

        F(I) = B(I,J,K)

      ENDIF

    END DO

    DO II = ISTART,IEND

      I = NI+IEND-II

      B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)

      C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))

    END DO

 END DO

END DO

Jim Dempsey

www.quickthreadprogramming.com

Looking at the ASSOCIATE method (without examining the disassembly), it appears the array descriptor accesses would be reduced, but the loop would not be able to keep all the necessary data in registers:

6 for A1-A6
5 for the variations on B
1 for C
1 for E
1 for F
n for misc.

Fully optimizing this loop may be a bit more difficult.

Jim Dempsey

www.quickthreadprogramming.com

You might improve vectorization by the following:


ALLOCATE(A(NX,NY,NZ*6))

...

! local variables

REAL :: R(IEND)

DO K = KSTART,KEND

  DO J = JSTART,JEND

    E(IST) = 0.0

    F(IST) = B(IST,J,K)

    ! ASSOCIATE partition A(NX,NY,NZ*6) into 1D slices

    ASSOCIATE ( A1 => A(:,J,K), A2 => A(:,J,K+NZ), A3 => A(:,J,K+NZ*2), &

            & A4 => A(:,J,K+NZ*3), A5 => A(:,J,K+NZ*4), A6 => A(:,J,K+NZ*5) )

    DO I = ISTART,IEND

      RTEMP = A1(I)*B(I,J+1,K)+A2(I)*B(I,J-1,K)+A3(I)*B(I,J,K+1)+A4(I)*B(I,J,K-1)+C(I,J,K)

      Y = T*(A1(I)+A2(I)+A3(I))

      R(I) = RTEMP - Y*B(I,J,K)

    END DO

    DO I = ISTART,IEND

      IF(AP(I,J,K).GT.1.0E+19) THEN

        E(I) = 0.0

        F(I) = B(I,J,K)

      ELSE

        X = 1.0/(A6(I)-F-A6(I)*E(I-1))

        E(I) = A5(I)*X

        F(I) = (R(I)+A6(I)*F(I-1))*X

      ENDIF

    END DO

    END ASSOCIATE ! end of ASSOCIATED A1,... A6

    DO II = ISTART,IEND

      I = NI+IEND-II

      B(I,J,K) = (E(I)*B(I+1,J,K))+F(I)

      C = C + MIN(ABS(E(I)*B(I+1,J,K)),ABS(F(I)))/MAX(1.0E-25,ABS(E(I)*B(I+1,J,K)),ABS(F(I)))

    END DO

 END DO

END DO

Jim Dempsey

www.quickthreadprogramming.com

So the optimized, modified code with the preprocessor assignments was faster. Not fast enough yet, so I will have to look at other modules that may be slowing things down.
Can the #define statement be used to adjust the shape of an array?
i.e., treat the array Z(L,N,M) as a 1D array of size L*N*M, or vice versa? (Assume that Z is in a module and I can declare it as an allocatable array of either shape.)
I am thinking it might not be too hard to go from 3D to 1D, but the other way around might not be so easy.
Would #define Z(i,j,k) Z(i+((j)-1)*(L)+((k)-1)*(N)*(L)) work?
If so, would it kill auto-vectorization?
Assume that the Z array is in a simple three-tier nested DO loop.

I also just realized another issue:
In the code, the module variables are sometimes passed to other routines, something like
call foo(A3,A4,r,t,j)
where A3 and A4 are included in the module in the calling routine but not in the called routine.
When using the #define,
A3 and A4 now seem to be treated as uninitialized scalars passed to the called routine, instead of the base memory location of an initialized array.
I tried adding #define A3 A(:,:,:,3)
but now it gives a warning that the A3 macro is redefined, and the intermediate preprocessed file has A(:,:,:,3)(I,J,K) in place of A3(I,J,K).
Do I have to limit the code each #define applies to (using #undef A3),
or would it be better to change the original code to use
call foo(A3(:,:,:),A4(:,:,:),r,t,j)

I like the second option best, as it seems less intrusive.
Thanks again

Per my last post, loops containing references to E(I-1) and F(I-1) will thwart vectorization.
Breaking the loop apart into two loops, with the addition of the R(IEND) array, is designed to let the first (more complex) portion execute vectorized, at the expense of writing to the new array R(I). Should vectorization be attained, the number of reads and arithmetic operations is reduced to 1/2, 1/4, or 1/8 (depending on float vs. double, vector width, and SSE vs. AVX). Your NX was stated as having a mean of 70. The number of writes to R is a subset of 70 (ISTART to IEND); you did not disclose the relationship of NX to ISTART and IEND, but possibly this is 68. 68 floats is 272 bytes, or 544 bytes if the variables are doubles; in either case it will easily fit in L1 cache, so in the second loop the reads of R(I) are almost free.

BTW I just noticed your original code and my copy/paste rework has an error relating to the original statement:

X = 1.0/(A6(I,J,K)-F-A6(I,J,K)*E(I-1))

(missing subscript on F?)

Note, you should be able to partition the inner I loop into two inner loops when using the #define route.

Jim Dempsey

www.quickthreadprogramming.com

Thanks for all of your help to this point.
I have now tackled the routine that was causing the significant slowdown. It runs just as fast as before, which is all I was hoping for at this point of the project; additional optimization will have to come later.

I now have what is, to me, an even more baffling problem:
A separate routine that was untouched (at least directly) by the COMMON-to-module conversion is running about half as fast as before.
The source file for the routine was unchanged, and the resulting assembly is identical if I compile it alone, and essentially identical when I look at the VTune "assembly" (I am guessing this is what Jim refers to as disassembly, i.e. converted from the binary back into assembly).

The routine itself is passed 60+ arguments, most of which are 3D arrays, but only the name is passed and the size is declared in the routine.
Many of these arrays come from a module in a routine that passes them to a second routine, which passes them to a third, which then passes them to the routine in question.
i.e.


subroutine sub1

 use cblock !this module and many others have many arrays, scalars, and logicals. Let's assume for this example that they contain A1(NX,NY,NZ),A2(NX,NY,NZ),A3(NX,NY,NZ), ... , A60(NX,NY,NZ)

 some useful code

 call sub2(A1,A2,A3,...,A60)

 more useful code

end
subroutine sub2(A1,A2,A3,...,A60)

 some useful code

 call sub3(A1,A2,A3,...,A60)

 more useful code

end
subroutine sub3(A1,A2,A3,...,A60)

 lots of useful code

 more useful code

end

In this example it is sub3 causing all of the trouble.
As far as I can tell it should behave identically, because it sees all of its arguments in exactly the same way.
But it runs half as fast. There are about 10 lines that slow down, surrounded by a bunch of other lines that run at normal speed. Most of the other lines are conditionals, or are potentially bypassed by the conditionals, so the fact that I don't see slowdown throughout might be irrelevant.

Any ideas on how this could be happening?

Thanks

Are the names of the arguments on the top-level call the same as the names in the module cblock?
Are the dummy names in sub2 and sub3 the same as in cblock?
Can you show at least SUBROUTINE sub3(...) together with the dummy declarations?
Can you show the loops with the 10 lines that slow down?

Is there anything your learned from the first issue that can apply to this issue?

Jim Dempsey

www.quickthreadprogramming.com

Passing 60 arguments in itself is time-consuming. If you can reduce that, perhaps by passing a derived type with the variables in it, or using module variables, that may help.

Steve - Intel Developer Support

I would not have chosen to write the code the way that it was. I think that whoever did wanted to avoid global variables.
Steve, so passing 60 arguments may be slow, but why slower now than before?
Jim, nothing I learned before seems to apply now: before, it seemed that using multiple variables where a common block had been used added to register pressure. This routine should not show any difference, since the same assembly is used. As far as I know, the same variable names are used throughout, from the module down through the three routines.

I think the main problem is the loops in the last level.

The 60 args per call level may be a secondary issue. This (60 args) can be addressed by using a derived type to package the scalars (Steve's suggestion) and either using the arrays directly from the module (if the arrays are the same on all calls) or packaging pointers to the arrays in a separate derived type (the module arrays would then require the TARGET attribute). The reasoning behind the two packagings is that at some time in the near future you may wish to parallelize this code, and the further out in the call stack you perform the parallelization, generally the better the performance. The scalars would be used for the slice-and-dice (sectioning) of the arrays; these parameters would vary thread by thread.

An alternate way is to take Repete Offender's advice and pass the arrays via an F77-style call (pass a cell reference to a subroutine with an unknown interface). This works well when the dummy array can be expressed using lower rank. In the example you listed earlier, all three indices of (some of) the arrays were being manipulated, so the dummy would have to construct a near duplicate of the original array descriptor (no advantage).

As for optimizing the last level, none of us on this forum can offer advice without seeing the problem code.

Jim Dempsey

www.quickthreadprogramming.com

I guess I am not looking for how to optimize the lowest level of the code, at least not yet. Right now I am really just trying to track down the cause of the slowdown, which seems to have happened without any modification to the code.
I could post the whole code but it is ugly, really ugly, and I am afraid that no one would focus on the real issue I am facing and only on how to rework the code.

The real issue I am having with this routine is that it takes twice as long to run and I have not changed the routine at all. The only thing that has changed is how a routine several levels up gets access to the variables that get passed down to this routine. Once I figure out the cause of that, I may have time to move on to optimizing the routine some.

I guess what I really need is someone who understands the internal workings of how arguments are passed to a routine to explain how that could be changing and why that may cause slow down.

I just added a counter and verified that the routine is called the same number of times (57000) in both cases.
Thanks

Look in the IVF Document

Start | All Programs | Intel Parallel Studio XE | Documentation | Visual Fortran ... | (click on Link to html documentation)
(I suggest to Intel that the VS Help have a direct link to this in addition to, or in lieu of, the link to the commingled documentation.)

Then select Index tab | declarations | for arrays

This will give you the specifics on the different way arrays are passed as arguments:


SUBROUTINE SUB(N, C, D, Z)

  REAL, DIMENSION(N, 15) :: IARRY       ! An explicit-shape array

  REAL C(:), D(0:)                      ! An assumed-shape array

  REAL, POINTER :: B(:,:)               ! A deferred-shape array pointer

  REAL, ALLOCATABLE, DIMENSION(:) :: K  ! A deferred-shape allocatable array

  REAL :: Z(N,*)                        ! An assumed-size array

The older documents had a table illustrating the performance impact (in relative terms) on each of the calling conventions (faster to slowest).
I am unable to locate this table in the current IVF document.
Generally the order (fastest to slowest) is: explicit-shape, assumed-size, assumed-shape, deferred-shape.

Steve may be able to locate the table in the reference

Jim Dempsey

www.quickthreadprogramming.com

I don't remember such a table, and I don't think the issue can be reduced to a table or list. It depends a lot on what the called routine does. But, seriously, speculation without hard evidence, such as VTune Amplifier XE analysis, is a waste of time. Once you identify the sections of code that are dragging down the performance, only then can you fruitfully think about ways to improve it.

Steve - Intel Developer Support

On a system having IVF ca 2007: Contents | Optimizing Applications | Programming Guidelines | Using Arrays Efficiently


Using Arrays Efficiently

This topic discusses how to efficiently access arrays and pass array arguments.

Accessing Arrays Efficiently

Many of the array access efficiency techniques described in this section are applied automatically by the Intel Fortran loop transformation optimizations. Several aspects of array use can improve run-time performance; the following sections discuss the most important aspects.

Perform the fewest operations necessary

The fastest array access occurs when contiguous access to the whole array or most of an array occurs. Perform one or a few array operations that access all of the array or major parts of an array instead of numerous operations on scattered array elements. Rather than use explicit loops for array access, use elemental array operations, such as the following line that increments all elements of array variable A:

Example

A = A + 1

When reading or writing an array, use the array name and not a DO loop or an implied DO-loop that specifies each element number. Fortran 95/90 array syntax allows you to reference a whole array by using its name in an expression.

For example:

Example

REAL :: A(100,100)

A = 0.0

A = A + 1 ! Increment all elements

! of A by 1

...

WRITE (8) A ! Fast whole array use

Similarly, you can use derived-type array structure components, such as:

Example

TYPE X

INTEGER A(5)

END TYPE X

...

TYPE (X) Z

WRITE (8)Z%A ! Fast array structure

! component use

Access arrays using the proper array syntax

Make sure multidimensional arrays are referenced using proper array syntax and are traversed in the natural ascending storage order, which is column-major order for Fortran. With column-major order, the leftmost subscript varies most rapidly with a stride of one. Whole array access uses column-major order.

Avoid row-major order, as is done by C, where the rightmost subscript varies most rapidly. For example, consider the nested DO loops that access a two-dimension array with the J loop as the innermost loop:

Example

INTEGER X(3,5), Y(3,5), I, J

Y = 0

DO I=1,3 ! I outer loop varies slowest

DO J=1,5 ! J inner loop varies fastest

X (I,J) = Y(I,J) + 1 ! Inefficient row-major storage order

END DO ! (rightmost subscript varies fastest)

END DO

...

END PROGRAM

Since J varies the fastest and is the second array subscript in the expression X (I,J), the array is accessed in row-major order. To make the array accessed in natural column-major order, examine the array algorithm and data being modified. Using arrays X and Y, the array can be accessed in natural column-major order by changing the nesting order of the DO loops so the innermost loop variable corresponds to the leftmost array dimension:

Example

INTEGER X(3,5), Y(3,5), I, J

Y = 0

DO J=1,5 ! J outer loop varies slowest

DO I=1,3 ! I inner loop varies fastest

X (I,J) = Y(I,J) + 1 ! Efficient column-major storage order

END DO ! (leftmost subscript varies fastest)

END DO

...

END PROGRAM

The Intel Fortran whole array access ( X = Y + 1 ) uses efficient column major order. However, if the application requires that J vary the fastest or if you cannot modify the loop order without changing the results, consider modifying the application to use a rearranged order of array dimensions. Program modifications include rearranging the order of:

•	Dimensions in the declaration of the arrays X(5,3) and Y(5,3)

•	The assignment of X(J,I) and Y(J,I) within the DO loops

•	All other references to arrays X and Y

In this case, the original DO loop nesting is used where J is the innermost loop:

Example

INTEGER X(5,3), Y(5,3), I, J

Y = 0

DO I=1,3 ! I outer loop varies slowest

DO J=1,5 ! J inner loop varies fastest

X (J,I) = Y(J,I) + 1 ! Efficient column-major storage order

END DO ! (leftmost subscript varies fastest)

END DO

...

END PROGRAM

Code written to access multidimensional arrays in row-major order (like C) or random order can often make inefficient use of the CPU memory cache. For more information on using natural storage order during record I/O, see Improving I/O Performance.

Use available intrinsics

Whenever possible, use Fortran array intrinsic procedures instead of creating your own routines to accomplish the same task. Fortran array intrinsic procedures are designed for efficient use with the various Intel Fortran run-time components.

Using the standard-conforming array intrinsics can also make your program more portable.

Avoid leftmost array dimensions

With multidimensional arrays where access to array elements will be noncontiguous, avoid leftmost array dimensions that are a power of two (such as 256, 512).

Since cache sizes are a power of 2, array dimensions that are also a power of 2 may make inefficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably make use of the cache less efficiently. This does not apply to contiguous sequential access or whole array access.

One work-around is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:

Example

REAL A (512,100)

DO I = 2,511

DO J = 2,99

A(I,J)=(A(I+1,J-1) + A(I-1, J+1)) * 0.5

END DO

END DO

In this code, array A has a leftmost dimension of 512, a power of two. The innermost loop accesses the rightmost dimension (row major), causing inefficient access. Increasing the leftmost dimension of A to 520 (REAL A (520,100)) allows the loop to provide better performance, but at the expense of some unused elements.

Because loop index variables I and J are used in the calculation, changing the nesting order of the DO loops changes the results.

For more information on arrays and their data declaration statements, see the Intel® Fortran Language Reference.

Passing Array Arguments Efficiently

In Fortran, there are two general types of array arguments:

•	Explicit-shape arrays (introduced with Fortran 77); for example, A(3,4) and B(0:*)

These arrays have a fixed rank and extent that is known at compile time. Other dummy argument (receiving) arrays that are not deferred-shape (such as assumed-size arrays) can be grouped with explicit-shape array arguments.

•	Deferred-shape arrays (introduced with Fortran 95/90); for example, C(:,:)

Types of deferred-shape arrays include array pointers and allocatable arrays. Assumed-shape array arguments generally follow the rules about passing deferred-shape array arguments.

When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:

•	When using explicit-shape (or assumed-size) arrays to receive an array, the starting address of the array is passed.

•	When using deferred-shape or assumed-shape arrays to receive an array, the address of the array descriptor is passed (the compiler creates the array descriptor).

Passing an assumed-shape array or array pointer to an explicit-shape array can slow run-time performance. This is because the compiler needs to create an array temporary for the entire array. The array temporary is created because the passed array may not be contiguous and the receiving (explicit-shape) array requires a contiguous array. When an array temporary is created, the size of the passed array determines whether the impact on slowing run-time performance is slight or severe.

The following table summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.

Actual argument: explicit-shape array
  - Dummy is explicit-shape: Very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional.
  - Dummy is deferred-shape or assumed-shape: Efficient. Only allowed for assumed-shape arrays (not deferred-shape arrays). Does not use an array temporary. Passes an array descriptor. Requires an interface block.

Actual argument: deferred-shape or assumed-shape array
  - Dummy is explicit-shape: When passing an allocatable array, very efficient. Does not use an array temporary. Does not pass an array descriptor. Interface block optional. When not passing an allocatable array, not efficient: uses an array temporary (does not pass an array descriptor; interface block optional). Use allocatable arrays whenever possible.
  - Dummy is deferred-shape or assumed-shape: Efficient. Requires an assumed-shape or array pointer dummy argument. Does not use an array temporary. Passes an array descriptor. Requires an interface block.

Jim Dempsey

www.quickthreadprogramming.com

Sorry the table formatting got blown, results are somewhat readable though.

Jim Dempsey

www.quickthreadprogramming.com
