overhead when using subroutine - how to use inline

Hi all,

I have tried to test whether there is an overhead in calling a subroutine compared to writing the code directly in the loop. My impression was that the compiler automatically inlines small procedures, so that a subroutine or function call costs nothing extra. However, I have made a test example that shows otherwise (or perhaps I am not using the right compiler settings). The test is this (see also the attached source):

do i=1,nLoop
  call MySubroutine(Var1,var2,...Var20)
end do

do i=1,nLoop
  Var1 = Var1 + 1.5
  Var2 = Var2 + 1.5
  ...
  Var20 = Var20 + 1.5
end do


Here MySubroutine performs the same calculations as the second loop. The code that is not placed in a subroutine takes only about 50% of the time of the code that calls the subroutine.

I have tried /Ob2, /Qinline-forceinline, /Qipo and other settings, but the difference in run time remains.

Is this general behaviour, and is the lesson that code should not be placed in subroutines if you want the fastest code, or am I doing something wrong?

I am using Intel(R) Visual Fortran Compiler XE [IA-32]



Attachment: speedtestsubroutinecall.zip (6.49 KB)

Your test is deficient in one regard: the subroutine call passes twenty separate arguments where a single array argument vec(1:20) holding the twenty values would do. Making that change, compiling with /fast and running on a laptop with an i7-2720QM CPU, I get

Subroutine = 2.31 2.32
Without subroutine = 2.29 2.28
L2Norm in seperate subroutine = 0.98 0.98
L2Norm inline = 0.97 0.98

The conclusion I would draw is that subroutine/function argument lists should be kept as short as possible, rather than that subroutines should be avoided altogether.

This becomes even more important if the ABI involves passing arguments through registers rather than on the stack. On AMD/Intel X64, for example, it is impossible to pass twenty double-precision reals through the XMM registers. The code to perform a read-modify-write cycle on the twenty arguments would consist of a substantial part that simply moves values between memory and the register file.
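
As a minimal sketch of the array-argument version described above (the names update_mod, add_increment and array_arg_demo are made up for illustration and are not taken from the attached project), the twenty scalars can be replaced by one array so that the call passes a single address:

```fortran
module update_mod
  implicit none
contains
  ! Hypothetical replacement for a twenty-argument subroutine:
  ! one array argument plus its length.
  subroutine add_increment(vec, n)
    integer, intent(in)    :: n
    real(8), intent(inout) :: vec(n)
    vec = vec + 1.5d0            ! same update as Var1..Var20 in the test
  end subroutine add_increment
end module update_mod

program array_arg_demo
  use update_mod
  implicit none
  integer, parameter :: n = 20, nLoop = 1000
  real(8) :: vec(n)
  integer :: i

  vec = 0.0d0
  do i = 1, nLoop
    call add_increment(vec, n)
  end do
  ! Print a result so the loop cannot be discarded as unused work.
  print *, vec(1), vec(n)
end program array_arg_demo
```

real(8) is used here only to match the kind notation used elsewhere in this thread. Whether this form matches the speed of the hand-written loop will depend on the compiler and options.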

Hi mecej4,

Thanks for the reply. I intentionally wanted to test with a large number of values and not an array (or derived type for that matter), however as you show there might be some advantage in using arrays.

I also see a factor-of-two difference for L2Norm when I run it, while there is almost no difference in your run. Did you change anything in the call to L2Norm?

Edit: I tried to recompile using the /fast compiler option and now my results are:
Subroutine = 8.98 8.75
Without subroutine = 7.25 7.05
L2Norm in seperate subroutine = 2.38 2.30
L2Norm inline = 1.02 0.98


"did you change anything in the call to L2Norm?"
I merely changed
Var = L2Norm(vec)
to
Var(1) = L2Norm(vec)

When you are depending on in-lining for loop optimization, it may be helpful to use internal subroutines, or follow the ancient principle of pushing the inner loop inside the subroutine.
If you've looked at the ifort documentation, there are so many options with ATTRIBUTES and command line options to raise threshold limits that you may well conclude that the old methods are preferable.
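
A rough sketch of both suggestions combined, an internal subroutine with the loop pushed inside it (the names loop_inside_demo and accumulate are invented for this example), might look like this:

```fortran
program loop_inside_demo
  implicit none
  integer, parameter :: nLoop = 1000
  real(8) :: a, b

  a = 0.0d0
  b = 0.0d0
  ! One call in total: the loop runs inside the subroutine, so the
  ! call cost is paid once rather than nLoop times.
  call accumulate(a, b, nLoop)
  print *, a, b
contains
  ! Internal subroutine: compiled together with its caller, so no
  ! cross-file optimization is needed for the compiler to see it.
  subroutine accumulate(x, y, n)
    real(8), intent(inout) :: x, y
    integer, intent(in)    :: n
    integer :: i
    do i = 1, n
      x = x + 1.5d0
      y = y + 1.5d0
    end do
  end subroutine accumulate
end program loop_inside_demo
```

With this layout the result no longer depends on whether the compiler decides to inline the call.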


I think some of the optimizations are not performed unless you enable IPO (interprocedural optimization). This has to be enabled in both the compiler and the linker. If this improves your situation, then please report back so others reading this thread can be informed.

Jim Dempsey

Also consider:

do i=1,nLoop
call MySubroutine((/Var1,var2,...Var20/))
end do

subroutine MySubroutine( args )
real :: args(20)

or, with a derived type:

do i=1,nLoop
argsOfYourType = YourType(iVar,fVar,dVar,'TextVal'...)
call MySubroutine(argsOfYourType )
end do

Jim Dempsey

Hi Jim,
Thank you for the replies.

I tried to use /Qipo in both the compiler and the linker, but this does not change my results. I think, however, that my original test case is flawed in two respects:
1) In the code that I posted initially, the values computed by my function L2Norm were not used for anything. When I rewrote the code to print the value to the screen, I got almost the same run time for the subroutine version and the manually "inlined" code.
2) The order in which I do the computations also seems to influence the results: when I swapped the loop that calls MySubroutine with the loop that does the same thing without the subroutine, my results changed. I changed the test so that different variables are used for each test case (previously the same variables were reused), and now the results are more consistent. My latest results are

WC-time CPU-time

Subroutine = 11.75 11.73
Without subroutine = 11.73 11.72
L2Norm inline = 2.61 2.61
L2Norm in seperate subroutine = 3.05 3.05

There still is a small difference between computing the L2Norm "inline" or using a subroutine (in the above approximately 17%).
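
For readers who want to repeat the measurement, here is a small timing sketch that applies the two fixes described above: each test case gets its own variable, and every result is printed so the work cannot be optimized away. It is not the code from the attached project; timing_sketch and bump are illustrative names, and the loop body is reduced to a single variable.

```fortran
program timing_sketch
  implicit none
  integer, parameter :: nLoop = 10000000
  real(8)    :: v1, v2              ! separate variables per test case
  real(8)    :: cpu0, cpu1
  integer(8) :: c0, c1, rate
  integer    :: i

  ! Case 1: update through a subroutine call
  v1 = 0.0d0
  call system_clock(c0, rate)
  call cpu_time(cpu0)
  do i = 1, nLoop
    call bump(v1)
  end do
  call cpu_time(cpu1)
  call system_clock(c1)
  ! Columns: wall-clock seconds, CPU seconds, result (printed so it is used)
  print *, 'Subroutine         =', real(c1 - c0, 8) / real(rate, 8), cpu1 - cpu0, v1

  ! Case 2: the same update written directly in the loop
  v2 = 0.0d0
  call system_clock(c0, rate)
  call cpu_time(cpu0)
  do i = 1, nLoop
    v2 = v2 + 1.5d0
  end do
  call cpu_time(cpu1)
  call system_clock(c1)
  print *, 'Without subroutine =', real(c1 - c0, 8) / real(rate, 8), cpu1 - cpu0, v2
contains
  subroutine bump(x)
    real(8), intent(inout) :: x
    x = x + 1.5d0
  end subroutine bump
end program timing_sketch
```

Running the two cases in both orders, as done above, is still a useful check that the ordering does not affect the numbers.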



I downloaded your project and had an issue with MS VS

1>ipo: error #11034: Il version for C:\Downloads\speedtestsubroutinecall\SpeedTestSubroutineCall\Release\Subroutine.obj (216458) does not match compiler's il version (213490), please regenerate

I created a new solution file and this fixed the issue.

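Noticing that L2Norm uses an assumed-shape dummy argument to receive the array, I took the liberty of creating L2NormN, which takes the size n as an extra argument and declares the array with explicit shape: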

! pass arg dimension(:)
function L2Norm(a)
    real(8), dimension(:), intent(in) :: a
    real(8)                           :: L2Norm
    L2Norm = sqrt(dot_product(a,a))
end function L2Norm

! add n, pass arg dimension(n)
function L2NormN(a, n)
    integer :: n
    real(8), dimension(n), intent(in) :: a
    real(8)                           :: L2NormN
    L2NormN = sqrt(dot_product(a,a))
end function L2NormN

Running the test (x64 Release)

Subroutine = 11.21 11.20
Without subroutine = 5.75 5.74
L2Norm in seperate subroutine = 6.85 6.85
L2NormN in seperate subroutine = 5.90 5.90
L2Norm inline = 5.90 5.91

You can see that the "L2NormN in seperate subroutine" case runs at the same speed as "L2Norm inline".
The difference lies in how the array argument is passed.

Hi Jim,
So in this case the overhead is due to the assumed-shape argument rather than to a failure to inline the function.
