I am helping a colleague whose engineering program involves a function that returns a large array (typically 0.5 to 2 GB, on a Windows x64 system with 16 GB RAM). The program runs significantly slower when built with Intel Fortran than with gfortran, and we are trying to understand why.
To that end, consider the following simplified version of the program. It reduces to a function call of the form y = f(x), where x and y are large arrays; in this example each is 0.5 GB. The function itself has been made trivial, a plain assignment: y = x.
What one notices is that, with Intel Fortran at /O2 optimization, the function call takes significantly longer than simply copying the same data with a plain assignment. Expressed as the ratio of function-call time to plain-assignment time, the Intel Fortran build comes in at more than 4, far higher than the exact same code built with gfortran.
Can someone from the Intel Fortran team please provide some insight and guidance with this?
program p

   use, intrinsic :: iso_fortran_env, only : R8 => real64

   implicit none

   integer, parameter :: N = 2**27
   integer, parameter :: NUM_SAMPLES = 100
   integer :: i
   real :: x(N) ! 0.5 GB storage size
   real :: y(N)
   real(R8) :: t1
   real(R8) :: t2
   real(R8) :: t_assign
   real(R8) :: t(NUM_SAMPLES)

   call random_number( x )

   print *, "Checking y=x Assignment:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
      call cpu_time( t1 )
      y = x
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   t_assign = sum(t)
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)

   print *
   print *, "Checking y=equals(x) Function Call:"
   print "(a,t10,a,t25,a)", "i", "x", "y"
   do i = 1, NUM_SAMPLES
      if ( mod(i,20) == 0 ) x = x + epsilon(x) ! For compiler not to optimize away the sampling loop
      call cpu_time( t1 )
      y = equals( x )
      call cpu_time( t2 )
      t(i) = t2 - t1
      if ( mod(i,20) == 0 ) then
         print "(g0,t10,g0,t25,g0)", i, x(i), y(i)
      end if
   end do
   print "(*(g0,1x))", "Average CPU Time = ", sum(t)/real(NUM_SAMPLES,kind=R8)

   print *
   print "(g0,f10.2)", "Ratio of the two instructions: ", sum(t)/t_assign

   stop

contains

   pure function equals( a ) result( r )

      real, intent(in) :: a(:)
      ! Function result
      real :: r( size(a) )

      r = a ! copy the dummy argument into the result

      return

   end function equals

end program p
Compilation and execution:
C:\Fortran>ifort /heap-arrays:0 /standard-semantics p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.0.065 Beta Build 20170320
Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
ifort: NOTE: The Beta evaluation period for this product ends on 12-oct-2017 UTC.
Microsoft (R) Incremental Linker Version 14.00.24215.1
Copyright (C) Microsoft Corporation. All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\Fortran>p.exe
 Checking y=x Assignment:
i        x              y
20       .9765872E-01   .9765872E-01
40       .4800636       .4800636
60       .6310107       .6310107
80       .8096844       .8096844
100      .4880919       .4880919
Average CPU Time = .4383628099999999E-01

 Checking y=equals(x) Function Call:
i        x              y
20       .9765931E-01   .9765931E-01
40       .4800642       .4800642
60       .6310113       .6310113
80       .8096850       .8096850
100      .4880925       .4880925
Average CPU Time = .2020212949999999

Ratio of the two instructions: 4.61

C:\Fortran>
On the Windows-based system I tried, the ratio shown above is consistently around 4.5. Using /O3 with Intel Fortran makes only a small difference. By contrast, when the same program is compiled with gfortran (with its -O2 or -O3 optimization option) on the same system, the ratio is typically below 1.3.
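One further data point that might help isolate the cause, offered as a suggestion rather than something measured in the runs above: a subroutine form of the same copy, writing directly into an intent(out) argument, could be timed next to the function form. The sketch below only shows the two forms side by side and checks that they give the same result. The name copy_to is hypothetical, and N is deliberately kept small so the automatic function result fits within default stack limits; it would need to be raised to 2**27 (with suitable heap or stack settings) and wrapped in the cpu_time loop from the program above to make a comparable measurement.

```fortran
program sketch

   implicit none

   ! Small on purpose (256 KB per array); the benchmark above uses 2**27
   integer, parameter :: N = 2**16
   real, allocatable :: x(:), y_fun(:), y_sub(:)

   allocate( x(N), y_fun(N), y_sub(N) )
   call random_number( x )

   y_fun = equals( x )        ! function form, as in the benchmark
   call copy_to( x, y_sub )   ! hypothetical subroutine form, no function result

   print *, "Results identical: ", all( y_fun == y_sub )

contains

   pure function equals( a ) result( r )
      real, intent(in) :: a(:)
      real :: r( size(a) )    ! automatic array function result
      r = a
   end function equals

   pure subroutine copy_to( a, r )
      real, intent(in)  :: a(:)
      real, intent(out) :: r(:)   ! caller supplies the destination storage
      r = a
   end subroutine copy_to

end program sketch
```

If the subroutine form times close to the plain y = x assignment while the function form does not, that would point toward how the function result is materialized and copied, rather than toward the copy itself.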
P.S.> I've taken a peek at the generated assembly and something in the Intel Fortran output looks troubling, but I'll keep my thoughts to myself for now and let the Intel team follow up on this.