subroutine with many arguments - how to ensure maximum performance?

Hello,

I have a performance-critical piece of code that currently sits in a nested loop, e.g.:

do kk=1,k_max
  do jj=1,j_max
    do ii=1,i_max
      function_with_11_inputs_4_outputs
    enddo
  enddo
enddo

For readability, I'd like this to appear in a separate subroutine. (This gets rid of all of the array indices and makes things look much cleaner.) Assuming that I do this, to achieve maximum performance, do I need to create a subroutine with 15 arguments and only operate on the direct inputs and outputs? Or, can I do something like the following and count on the compiler to get rid of the intermediate steps?

real :: inputs(11), outputs(4)

do kk=1,k_max
  do jj=1,j_max
    do ii=1,i_max
      inputs(1) = varA
      inputs(2) = varB
      inputs(3) = varC
      ...

      call function(inputs, outputs)   ! "function" is a placeholder name for the extracted routine

      varX = outputs(1)
      varY = outputs(2)
      varZ = outputs(3)
      ...
    enddo
  enddo
enddo

(Presumably, there would be a similar translation inside the function itself.) Alternatively, is there any other "cleaner" way to pass a lot of arguments to a subroutine efficiently and still achieve maximum performance?

Thanks,
Greg


The traditional way to get maximum performance is to push a sufficient number of the inner loops inside the function. Otherwise, you depend on inlining, which ifort attempts by default when permitted to do so, subject to limits that many options let you adjust. Checking that the vectorization reports are satisfactory is an excellent first step.
Current CPU generations are less susceptible than past ones to the fill buffer thrashing associated with pushing too many arguments. If your cases are marginal, you will likely need to analyze the actual cases under VTune.
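
To illustrate pushing the loops inside the routine, here is a minimal sketch. The names (kernel_mod, kernel_loops, a, b, x) are hypothetical, and two inputs and one output stand in for the 11 inputs and 4 outputs of the original question; the arithmetic is made up.

```fortran
module kernel_mod
  implicit none
contains
  ! Hypothetical kernel: whole arrays are passed once and the
  ! loop nest lives inside the routine, so no call sits in the inner loop.
  subroutine kernel_loops(a, b, x)
    real, intent(in)  :: a(:,:,:), b(:,:,:)
    real, intent(out) :: x(:,:,:)
    integer :: ii, jj, kk
    do kk = 1, size(a, 3)
      do jj = 1, size(a, 2)
        do ii = 1, size(a, 1)      ! innermost loop runs over the contiguous index
          x(ii,jj,kk) = a(ii,jj,kk) + 2.0*b(ii,jj,kk)
        end do
      end do
    end do
  end subroutine kernel_loops
end module kernel_mod

program demo_loops
  use kernel_mod
  implicit none
  real :: a(4,3,2), b(4,3,2), x(4,3,2)
  a = 1.0
  b = 0.5
  call kernel_loops(a, b, x)       ! one call for the whole grid
  if (any(abs(x - 2.0) > 1.0e-6)) error stop "unexpected result"
  print *, "ok"
end program demo_loops
```

The caller still gets rid of the array indices, and the compiler sees the whole loop nest in one place without having to inline anything.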

There was a similar query here in which a user-defined type was tried. It yielded lower performance.

On the call, each argument in the list "costs" an LEA and a PUSH (load effective address, then push that address on the stack). This is highly efficient. Copying the argument into a temporary array first "costs" the equivalent of an LEA plus a read plus a write plus a PUSH. In effect, passing the argument directly saves a read and a write, and it avoids potential cache line evictions.

On the receiving end (the called subroutine) the costs are about the same.

When the call returns, copying the outputs out of the temporary array is additional overhead before they can be used.

It looks like passing the args directly is the faster way to go, unless the args are already in a user-defined type (in which case pass a reference to the type).
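
As a rough sketch of the direct-argument form, with hypothetical names (point_kernel, a, b, c, x, y) and three inputs and two outputs standing in for the real 11 and 4:

```fortran
module point_kernel_mod
  implicit none
contains
  ! Hypothetical per-point kernel with scalar dummy arguments.
  pure subroutine point_kernel(va, vb, vc, vx, vy)
    real, intent(in)  :: va, vb, vc
    real, intent(out) :: vx, vy
    vx = va + vb*vc
    vy = va - vb*vc
  end subroutine point_kernel
end module point_kernel_mod

program demo_direct
  use point_kernel_mod
  implicit none
  integer, parameter :: n = 4
  real, dimension(n,n,n) :: a, b, c, x, y
  integer :: ii, jj, kk
  a = 3.0
  b = 2.0
  c = 0.5
  do kk = 1, n
    do jj = 1, n
      do ii = 1, n
        ! Array elements are passed directly; there is no inputs(:)/outputs(:)
        ! staging copy before or after the call.
        call point_kernel(a(ii,jj,kk), b(ii,jj,kk), c(ii,jj,kk), &
                          x(ii,jj,kk), y(ii,jj,kk))
      end do
    end do
  end do
  if (any(abs(x - 4.0) > 1.0e-6) .or. any(abs(y - 2.0) > 1.0e-6)) &
    error stop "unexpected result"
  print *, "ok"
end program demo_direct
```

Putting the kernel in a module gives it an explicit interface. Whether the call is actually inlined still depends on the compiler and its options.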

Jim Dempsey

www.quickthreadprogramming.com

Could you declare all the variables inside a MODULE and then USE this module inside the subroutines? You would then not have to pass any arguments to the subroutine. This might improve performance.
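
A minimal sketch of what I mean, with made-up names (state_mod, var_a, var_b, var_x, kernel_no_args) and a made-up calculation:

```fortran
module state_mod
  implicit none
  ! Hypothetical shared variables; they take the place of the argument list.
  real :: var_a, var_b, var_x
contains
  subroutine kernel_no_args()
    ! Reads and writes the module variables directly.
    var_x = var_a*var_b
  end subroutine kernel_no_args
end module state_mod

program demo_module
  use state_mod
  implicit none
  var_a = 3.0
  var_b = 4.0
  call kernel_no_args()
  if (abs(var_x - 12.0) > 1.0e-6) error stop "unexpected result"
  print *, "ok"
end program demo_module
```

Note that inside the triple loop you would still have to load the per-point values into the module variables before each call, so it is worth timing against the plain argument list.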

Roman

Threadprivate variables are an additional option to investigate.
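
A sketch of how that could look, assuming OpenMP is enabled at compile time and using hypothetical names (tp_state_mod, work_in, work_out, tp_kernel); without OpenMP the directives are treated as comments and the program runs serially:

```fortran
module tp_state_mod
  implicit none
  ! Hypothetical per-thread scratch variables in place of an argument list.
  real :: work_in(3), work_out(2)
  !$omp threadprivate(work_in, work_out)
contains
  subroutine tp_kernel()
    work_out(1) = work_in(1) + work_in(2)*work_in(3)
    work_out(2) = work_in(1) - work_in(2)*work_in(3)
  end subroutine tp_kernel
end module tp_state_mod

program demo_threadprivate
  use tp_state_mod
  implicit none
  integer, parameter :: n = 8
  real :: a(n), x(n), y(n)
  integer :: ii
  do ii = 1, n
    a(ii) = real(ii)
  end do
  !$omp parallel do
  do ii = 1, n
    ! Each thread fills and reads its own copy of work_in/work_out.
    work_in = [a(ii), 2.0, 0.5]
    call tp_kernel()
    x(ii) = work_out(1)
    y(ii) = work_out(2)
  end do
  !$omp end parallel do
  if (any(abs(x - (a + 1.0)) > 1.0e-6) .or. any(abs(y - (a - 1.0)) > 1.0e-6)) &
    error stop "unexpected result"
  print *, "ok"
end program demo_threadprivate
```

This keeps the module-variable style usable from a parallel loop, since the threads no longer share one set of scratch variables.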

Jim Dempsey

www.quickthreadprogramming.com
