# Function returning an array of large size: performance issue with Intel Fortran

I am helping a colleague who has written an engineering program that involves a function returning a large array (typically 0.5 to 2 GB, on Windows x64 with 16 GB RAM) and who is seeing significantly slower performance with Intel Fortran than with gfortran.  We are trying to understand why this is the case.

Toward this, consider the following simplification of the program, which reduces to a function invocation equivalent to the formula y = f(x), where x and y are arrays of significant size; in this example, each is on the order of 0.5 GB.  The function has been made trivial for this example: a simple assignment, y = x.

What one notices is that, with Intel Fortran at /O2 optimization, the function call takes significantly longer than simply copying all the data.  Expressed as a ratio of the two timings, the function call is more than 4 times slower than the plain assignment with Intel Fortran, whereas the exact same code built with gfortran shows a much smaller ratio.

Can someone from the Intel Fortran team please provide some insight and guidance with this?

Thanks,

```fortran
program p

   use, intrinsic :: iso_fortran_env, only : R8 => real64

   implicit none

   integer, parameter :: N = 2**27
   integer, parameter :: NUM_SAMPLES = 100
   integer :: i
   real :: x(N)  ! 0.5 GB storage size
   real :: y(N)
   real(R8) :: t1
   real(R8) :: t2
   real(R8) :: t_assign
   real(R8) :: t(NUM_SAMPLES)

   call random_number( x )

   ! Baseline: time the plain array assignment
   print *, "Checking y=x Assignment:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
      call cpu_time( t1 )
      y = x
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   t_assign = sum(t)
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)
   print *

   ! Same copy, but through a function that returns the array
   print *, "Checking y=equals(x) Function Call:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
      call cpu_time( t1 )
      y = equals( x )
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)
   print *

   ! Total time of the function-call loop divided by that of the assignment loop
   print "(g0,f10.2)", "Ratio of the two instructions: ", sum(t)/t_assign

   stop

contains

   pure function equals( a ) result( r )

      real, intent(in) :: a(:)
      ! Function result
      real :: r( size(a) )

      r = a   ! trivial stand-in for the real computation

      return

   end function equals

end program p
```

Compilation and execution:

```
C:\Fortran>ifort /heap-arrays:0 /standard-semantics p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.065 Beta Build 20170320

ifort: NOTE: The Beta evaluation period for this product ends on 12-oct-2017 UTC.
Microsoft (R) Incremental Linker Version 14.00.24215.1

-out:p.exe
-subsystem:console
p.obj

C:\Fortran>p.exe
Checking y=x Assignment:
i        x              y
20       .9765872E-01   .9765872E-01
40       .4800636       .4800636
60       .6310107       .6310107
80       .8096844       .8096844
100      .4880919       .4880919
Average CPU Time =  .4383628099999999E-01

Checking y=equals(x) Function Call:
i        x              y
20       .9765931E-01   .9765931E-01
40       .4800642       .4800642
60       .6310113       .6310113
80       .8096850       .8096850
100      .4880925       .4880925
Average CPU Time =  .2020212949999999

Ratio of the two instructions:       4.61

C:\Fortran>
```

On the Windows-based system I tried, the ratio shown above is consistently on the order of 4.5.  Note that using /O3 with Intel Fortran makes only a small difference.  By contrast, when the same program is compiled with gfortran (with its -O2 or -O3 optimization option) on the same system, the ratio is typically below 1.3.

P.S. I've taken a peek at the assembler instructions, and something in the Intel Fortran output looks bothersome, but I'll keep my thoughts to myself and let the Intel team follow up on this.

What happens with:

```fortran
pure subroutine equalsSub( r, a )
   real, intent(in)  :: a(:)
   real, intent(out) :: r( size(a) )  ! a pure subroutine needs an explicit intent on every dummy
   r = a
   return
end subroutine equalsSub
```

And performing call equalsSub(y, x) in your timed loop.
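For reference, the suggestion above could be sketched as the self-contained program below. This is only a sketch under a few assumptions: the array size is reduced from 2**27 to 2**20 so it runs quickly on any machine, the arrays are made allocatable, the program name `p_sub` is made up for illustration, and the subroutine copies its dummy argument `a` and declares `intent(out)` on `r`.

```fortran
program p_sub

   use, intrinsic :: iso_fortran_env, only : R8 => real64

   implicit none

   integer, parameter :: N = 2**20   ! assumption: reduced from 2**27 for a quick run
   integer, parameter :: NUM_SAMPLES = 100
   integer :: i
   real, allocatable :: x(:), y(:)
   real(R8) :: t1, t2
   real(R8) :: t(NUM_SAMPLES)

   allocate( x(N), y(N) )
   call random_number( x )

   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x)   ! keep the sampling loop from being optimized away
      call cpu_time( t1 )
      call equalsSub( y, x )                     ! the result is written directly into y
      call cpu_time( t2 )
      t(i) = t2 - t1
   end do

   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)
   print "(*(g0,1x))", "y equals x:", all( y == x )

contains

   pure subroutine equalsSub( r, a )
      real, intent(in)  :: a(:)
      real, intent(out) :: r( size(a) )
      r = a
   end subroutine equalsSub

end program p_sub
```

Because `y` is passed as an actual argument, the subroutine has somewhere to write its result without the compiler needing a separate function-result array.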

Jim Dempsey

jimdempseyatthecove wrote:

> What happens with ... performing call equalsSub(y, x) in your timed loop.
>
> Jim Dempsey

Jim,

Yes, replacing the function subprogram with a subroutine clearly helps with Intel Fortran: it brings the ratio I show above to around unity.  This implies that, with optimization in effect, the subroutine is effectively inlined, resulting in essentially no procedure invocation overhead.  This is indeed an option we have already considered; it's just that refactoring the code this way will be a lot of change for my colleague.

But, of course, the question for the Intel team is why such a difference relative to gfortran for the code in the original post!

P.S. With gfortran, there was hardly any change between function and subroutine subprograms (I think I either need some other compiler option or a different compiler version to notice a difference).

Can you tell if an array temporary is created in the function call...
... or if the generated code is using scalar copy versus vector copy?

With a really old version of IVF I had an issue of scalar array copy being performed when vector copy should have been chosen. The fix was to use

    !DIR$ VECTOR ALWAYS
    r = a

You might give that a try. (There are other clauses to VECTOR that may be of interest too.)
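As a sketch of where the directive would sit in the function from the original post: the size below is reduced to 2**16 and the program name `p_vec` is made up, both purely for illustration. Whether the directive has any effect on an array assignment in a given ifort version is not confirmed here, and compilers that do not recognize `!DIR$` lines treat them as ordinary comments.

```fortran
program p_vec

   implicit none

   integer, parameter :: N = 2**16   ! assumption: small size so the function result fits on a default stack
   real, allocatable :: x(:), y(:)

   allocate( x(N), y(N) )
   call random_number( x )

   y = equals( x )
   print "(*(g0,1x))", "y equals x:", all( y == x )

contains

   pure function equals( a ) result( r )
      real, intent(in) :: a(:)
      real :: r( size(a) )
      ! Intel-specific directive placed immediately before the statement it targets
      !DIR$ VECTOR ALWAYS
      r = a
   end function equals

end program p_vec
```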

Jim Dempsey

It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.

(Internal tracking id: CMPLRS-43227)

Kevin D (Intel) wrote:

> It appears ifort allocates/deallocates an array temporary for the function variant but not for the array assignment or subroutine variants. I submitted this to Development for their analysis.
>
> (Internal tracking id: CMPLRS-43227)

Thanks much, Kevin - that's exactly what I noticed, and it seems to be what degrades performance.  My hope is that Intel Development will follow up soon with an approach that greatly enhances performance, making it as good as or better than gfortran for the use case shown in the original post.  I look forward to your feedback from the Development analysis.
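To make the effect of such a temporary concrete, here is a hand-written model of the two cases. It is not the code ifort actually generates; the explicit `tmp` array, the reduced size, and the program name `p_temp` are all assumptions made for illustration. The point is only that routing the result through a temporary adds an allocation, a second pass over the data, and a deallocation on every call.

```fortran
program p_temp

   implicit none

   integer, parameter :: N = 2**16   ! assumption: small size for a quick check
   real, allocatable :: x(:), y(:), tmp(:)

   allocate( x(N), y(N) )
   call random_number( x )

   ! Model of the function variant: the result is built in a temporary first
   allocate( tmp(N) )   ! extra allocation on every call
   tmp = x              ! first pass: the function body fills the temporary
   y = tmp              ! second pass: the temporary is copied into y
   deallocate( tmp )    ! extra deallocation on every call
   print "(*(g0,1x))", "with temporary, y equals x:", all( y == x )

   ! Model of the assignment and subroutine variants: one pass, no temporary
   y = 0.0
   y = x
   print "(*(g0,1x))", "direct copy, y equals x:", all( y == x )

end program p_temp
```

Both paths produce the same `y`; they differ only in how much memory traffic and allocation work it takes to get there.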