Global Variables with ifort

Hello,

  I'm using the Intel Fortran compiler and I'm seeing performance problems when I use global variables (arrays). Basically, I have a four-dimensional array and a loop that processes all of its elements. When I pass the array to the subroutine as a parameter, I get a certain execution time. When I use it directly inside the routine as a global variable, the execution time doubles. I'm guessing the compiler disables some optimizations when I use a global variable. Is that the case? If so, how can I enable them again? If not, does anyone have any idea what causes this slowdown?

The arrays are declared in a global module like this:

real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)

And allocated with:

allocate(ux(nmin1-4:nmax1+4+u_pad, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))

Any idea what is happening?


If you've read the ifort documentation but are still having difficulty understanding the compiler's optimization reports, you could follow up with an actual code sample on the Fortran forum appropriate to your platform (Windows, or Linux and Mac).
Your description leaves too much up to the imagination.

I've read the documentation and didn't find anything specific to globals. Since this is related to optimization, I thought this was the correct place to post. I can't see what is missing from the description. The central point is: I changed an array from a parameter to a global and it slowed the code down, doubling the execution time (same code, same allocation, same types, same names). I would like to know what is missing from the description...

If you believe there's a difference in compilation, comparing the results of opt-report-file option should verify it. If you don't want to give a working example, you could at least quote the differences in the reports, in case they mean more to us than to you.
Again, you'd get more expert opinions on the relevant Fortran forum, in case that's what you are interested in.

I've posted on the Intel Fortran Compiler forum, but got no answer. If you think it is more appropriate to move this topic there, please do so (or whoever is able to).

This is the loop:

do k=nmin3-4,nmax3+4
  do j=nmin2-4,nmax2+4
    do i=nmin1-4,nmax1+4
      ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt

      uy(i,j,k,3) = (20.*uy(i,j,k,2) - 6.*uy(i,j,k,1) - 4.*uy(i,j,k,0) + uy(i,j,k,-1) + 12.*uy(i,j,k,3)*dt2)*ctt

      uz(i,j,k,3) = (20.*uz(i,j,k,2) - 6.*uz(i,j,k,1) - 4.*uz(i,j,k,0) + uz(i,j,k,-1) + 12.*uz(i,j,k,3)*dt2)*ctt
    end do
  end do
end do

This is the declaration:
real, allocatable, target :: ux(:,:,:,:), uy(:,:,:,:)
real, allocatable, target :: uz(:,:,:,:)

This is the allocation:
allocate(ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
allocate(uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))
allocate(uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4,-1:3))

Case 1:

They are declared and allocated inside a subroutine called Source. This subroutine calls another one, called
Update, passing ux, uy, uz as parameters, which are declared in the Update subroutine like this:

real :: ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
real :: uy(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
real :: uz(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)

In fact, the bounds are the same as in the caller subroutine, Source.

Case 2:

They are still allocated by the Source subroutine, but declared as globals in
another module accessible to both the Source and Update subroutines. So the
Update subroutine just accesses them as global variables, with no need to
declare them or pass them as parameters.
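
To make the two cases concrete, here is a minimal sketch of how I understand them. It is not the real code: the module and subroutine names (bounds, fields, kernels, update_args, update_globals), the array sizes and the values of dt2 and ctt are made up for illustration, and only ux is shown (uy and uz would be handled the same way). Note that the two declarations are not identical: in case 1 Update sees an explicit-shape dummy argument, in case 2 it sees an allocatable, target module array.

```fortran
! Sketch only. Module/subroutine names, sizes and values are hypothetical.
! Only ux is shown; uy and uz would be handled the same way.
module bounds
  implicit none
  integer :: nmin1 = 1, nmax1 = 8
  integer :: nmin2 = 1, nmax2 = 8
  integer :: nmin3 = 1, nmax3 = 8
  real :: dt2 = 1.0e-4, ctt = 1.0/12.0
end module bounds

module fields
  implicit none
  ! Case 2: module-level ("global") array, declared as in the post.
  real, allocatable, target :: ux(:,:,:,:)
end module fields

module kernels
  use bounds
  implicit none
contains

  ! Case 1: ux is an explicit-shape dummy argument.
  subroutine update_args(ux)
    real :: ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3)
    integer :: i, j, k
    do k = nmin3-4, nmax3+4
      do j = nmin2-4, nmax2+4
        do i = nmin1-4, nmax1+4
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_args

  ! Case 2: the same loop, but ux is taken from the module.
  subroutine update_globals()
    use fields, only: ux
    integer :: i, j, k
    do k = nmin3-4, nmax3+4
      do j = nmin2-4, nmax2+4
        do i = nmin1-4, nmax1+4
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_globals

end module kernels

! The main program plays the role of the Source subroutine here.
program compare_cases
  use bounds
  use fields, only: ux
  use kernels
  implicit none
  real, allocatable :: ux_local(:,:,:,:)

  allocate(ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3))
  allocate(ux_local(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3))
  ux = 1.0
  ux_local = 1.0

  call update_args(ux_local)   ! case 1
  call update_globals()        ! case 2

  ! Both cases perform the same arithmetic on the same initial data.
  print *, 'max |case1 - case2| =', maxval(abs(ux - ux_local))
end program compare_cases
```

Compiling a small reproducer like this with the optimization and vectorization reports enabled should show whether the two routines get different treatment on their own, outside the full application.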

My problem is: case 2 is two times slower than case 1.

In case 1, the line numbers are:
19: do k=nmin3-4,nmax3+4
20: do j=nmin2-4,nmax2+4
21: do i=nmin1-4,nmax1+4

The HLO report for case 1 on these lines is:

LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21
LOOP DISTRIBUTION in update_3d_ at line 21

Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 19 20 21
: Original Order found to be proper, but by a close margin

In case 2, the line numbers are:
12: do k=nmin3-4,nmax3+4
13: do j=nmin2-4,nmax2+4
14: do i=nmin1-4,nmax1+4

and the report says:

LOOP DISTRIBUTION in update_3d_ at line 14

Loop Interchange not done due to: Original Order seems proper
Advice: Loop Interchange might help Loopnest at lines: 12 13 14
: Original Order found to be proper, but by a close margin

I can see that there are two more LOOP DISTRIBUTION lines
in the first case. This is the only difference I see. Any idea what
is causing this and why?

Are you saying that neither case (or maybe both) show full vectorization, so you don't see a difference there? The compiler is more likely to perform distribution on vectorized loops, but I certainly wouldn't like to rely on that as the only indicator.

You're right, I was not enabling the full report.
The faster version (case 1) says:

(629:13-629:13):VEC:sourcewave_: PARTIAL LOOP WAS VECTORIZED

The slower version (case 2) says:

HPO Vectorizer Report (update_3d_)

src/Update.f90(14): (col. 13) remark: loop was not vectorized: existence of vector dependence.
src/Update.f90(21): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 21 and (unknown) line 15.
src/Update.f90(15): (col. 15) remark: vector dependence: assumed FLOW dependence between (unknown) line 15 and (unknown) line 21.
src/Update.f90(20): (col. 15) remark: vector dependence: assumed ANTI dependence between (unknown) line 20 and (unknown) line 15.
(...)

But why? Considering it is the same code, why does the compiler consider one version to have a dependence and the other not?

Forcing vectorization, using the directive that says there is no loop-carried dependence, makes the execution time almost the same for both cases.
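
For anyone who wants to try the same workaround, this is roughly what it looks like. It is a sketch under assumptions: the directive is assumed to be !DIR$ IVDEP (which tells ifort to ignore assumed, not proven, vector dependences, matching the "assumed ANTI/FLOW dependence" remarks in the report), and the module name, sizes and values are invented for the example. Only ux is shown.

```fortran
! Sketch only. !DIR$ IVDEP is an assumption about which directive was used;
! the module name, sizes and values are hypothetical.
module wave_globals
  implicit none
  integer :: nmin1 = 1, nmax1 = 8
  integer :: nmin2 = 1, nmax2 = 8
  integer :: nmin3 = 1, nmax3 = 8
  real :: dt2 = 1.0e-4, ctt = 1.0/12.0
  ! Global array declared as in the post (case 2).
  real, allocatable, target :: ux(:,:,:,:)
contains

  subroutine update_3d()
    integer :: i, j, k
    do k = nmin3-4, nmax3+4
      do j = nmin2-4, nmax2+4
        ! Ask ifort to ignore assumed dependences in the loop that follows.
        ! Each iteration reads and writes only its own index i, so there is
        ! no real loop-carried dependence. Compilers that do not recognize
        ! the directive treat it as an ordinary comment.
        !DIR$ IVDEP
        do i = nmin1-4, nmax1+4
          ux(i,j,k,3) = (20.*ux(i,j,k,2) - 6.*ux(i,j,k,1) - 4.*ux(i,j,k,0) &
                         + ux(i,j,k,-1) + 12.*ux(i,j,k,3)*dt2)*ctt
        end do
      end do
    end do
  end subroutine update_3d

end module wave_globals

program ivdep_demo
  use wave_globals
  implicit none

  allocate(ux(nmin1-4:nmax1+4, nmin2-4:nmax2+4, nmin3-4:nmax3+4, -1:3))
  ux = 1.0
  call update_3d()
  print *, 'sample value after update:', ux(nmin1, nmin2, nmin3, 3)
end program ivdep_demo
```

The directive only changes what the vectorizer is allowed to assume. It is safe here because the loop body touches a single i per iteration, but it would give wrong results on a loop that really does carry a dependence.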

But I still can't understand why the compiler considers one case to have dependences and the other not. The code is the same... I just changed the declarations from local to global.
