Huge slowdown when packing arguments into derived type with pointers

Hello everybody,

I have a complicated program in which a driver subroutine calls several user libraries. Many arguments need to be passed. To reduce the chance of making a mistake, I decided to pack the arguments into a derived type and use pointers. I've noticed that this packing causes a big slowdown. I've made a working example with the two alternatives: working_example.f90 uses the derived type, while working_example_alt.f90 passes the arguments directly. The real system is much more complicated than this, but the example is representative of what happens.

I compile both with -O2 and use time to measure their performance:

p3tris@Odysseus:~/Desktop$ ifort -O2 -o test working_example_alt.f90
p3tris@Odysseus:~/Desktop$ time ./test 
real    0m0.003s
user    0m0.000s
sys    0m0.000s

p3tris@Odysseus:~/Desktop$ ifort -O2 -o test working_example.f90
p3tris@Odysseus:~/Desktop$ time ./test 
real    0m0.210s
user    0m0.204s
sys    0m0.004s

I expected some overhead, but not that much... Is this normal? Is it due to the association, are some optimizations disabled, or is it simply the cost of passing a derived type?

Thanks in advance,

Petros

In this example, you might look at how the shortcuts taken in your top level loop differ, and how the ability of the compiler to take them is affected by your changes.

Considering the key loop in main:

   do i=1, 100000000
      f=0.d0
      x=1.d0
      eq=0.d0
      z=0
     
      call driver1(1,4,2,eq,x,z,f)
      call driver1(2,4,2,eq,x,z,f)
      call driver1(3,4,2,eq,x,z,f)
   enddo

The example passing the arguments directly (in lib1 & lib2) is much, much more efficient -- only a dozen or so instructions to implement the loop. 

In comparison, the example passing the arguments through the derived type takes hundreds of instructions to implement the loop. There is a huge penalty, apparently coming from the associate statement, for example, in lib1:

      associate(mode => tdata%mode, indexf => tdata%indexf, indexx => tdata%indexx, indexz => tdata%indexz, &
            indexmsg => tdata%indexmsg, msg=> tdata%msg, eq => tdata%eq, z => tdata%z, x => tdata%x, f => tdata%f)

Checking the generated code, here is a small sample

        movq      %rdi, 8040+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rsi, 8024+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rcx, 8048+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rdx, 8032+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rsi, 8072+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rcx, 8080+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %r8, 8064+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %r13, 8016+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rdi, 8112+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rsi, 8096+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rcx, 8120+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rdx, 8104+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rsi, 8144+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12
        movq      %rcx, 8152+drivers_mp_driver1_$TDATA.0.5(%rip) #118.12

It is also the case that the alternate example is able to move some stores out of the loop (which occur in the lib1 & lib2 subroutines); the original example cannot.

LOOP BEGIN at working_example_alt.f90(95,4)
   remark #25087: Preprocess Loopnests <MAIN__>: Moving Out Store @Line<22> in Loop @Line<95>
   remark #25087: Preprocess Loopnests <MAIN__>: Moving Out Store @Line<44> in Loop @Line<95>
   remark #25087: Preprocess Loopnests <MAIN__>: Moving Out Store @Line<45> in Loop @Line<95>

Perhaps we can do better -- I will discuss with the developers.

Patrick

 

 

Thanks for the answer, Patrick. Actually, I profiled the code with VTune, which flagged the associate construct. But I dismissed that as a mistake, because I always thought that associate is a simple "alias" with no actual effect on performance! I thought the compiler performed a simple text substitution before compiling. I'll try the two examples without the associate to see whether they perform better.

Petros

I tried removing the associate construct from inside the lib subroutines and accessing the needed elements directly as tdata%element. The behavior is now even stranger: it's even slower...

$ ifort -O2 -o test working_example_alt.f90
$ time ./test 
real	0m0.002s
user	0m0.000s
sys	0m0.000s

$ ifort -O2 -o test working_example.f90
$ time ./test 
real	0m0.210s
user	0m0.204s
sys	0m0.004s

$ ifort -O2 -o test working_example_no_associate.f90
$ time ./test 
real	0m0.463s
user	0m0.460s
sys	0m0.004s

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120

 

 

 

As Tim hints - with your alt case the optimizer is able to see that the iterations of the loop are independent and that it only needs to execute the contents of the loop once in order to work out the final value of x.  So ... that's all it does!  Consequently you are not running the test you think you are running.

Even in the pointer case, the compiler is smart enough to move lots of the calculations out of the loop. From looking at the assembly (which for me is synonymous with "I am talking through my hat when I say..."), it then still executes a loop that does very little, bar what might be repeated definition of x, f, etc. (perhaps because it can see the aliasing of those variables in the original code).

Neither of those situations is likely similar to your real code.  You need a better test - one with an outer loop that the optimizer cannot ignore.  Use some random data as input, and make sure there is a clear iteration to iteration dependence in the calculations all the way through to some final result that gets printed out.
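A minimal sketch of the kind of test described above. It is not the original working_example.f90: the names bench_sketch and kernel, the arithmetic, and the iteration count are all assumptions made for illustration. The points it tries to show are random input, a dependence carried from one iteration to the next, and a final result that is printed.

```fortran
program bench_sketch
   implicit none
   integer, parameter :: dp = kind(1.d0)
   integer, parameter :: n = 4, niter = 1000000   ! hypothetical sizes
   real(dp) :: x(n), f(n), total
   integer :: i

   ! Random input, so the values are not known at compile time.
   call random_number(x)
   f = 0.0_dp
   total = 0.0_dp

   do i = 1, niter
      ! kernel is a stand-in for driver1: it reads the x and f left by
      ! the previous iteration, so iteration i depends on iteration i-1.
      call kernel(i, x, f)
      total = total + f(1)
   end do

   ! Printing the result keeps the whole chain of calculations live.
   print *, 'total =', total
   print *, 'x     =', x

contains

   subroutine kernel(i, x, f)
      integer, intent(in) :: i
      real(dp), intent(inout) :: x(:), f(:)
      ! Arbitrary but bounded update that mixes x, f and the loop index.
      f = 0.5_dp*f + x + 1.0e-3_dp*real(mod(i, 7), dp)
      x = 0.9_dp*x + 1.0e-3_dp*f
   end subroutine kernel

end program bench_sketch
```

The same skeleton can be compiled once per argument-passing variant and timed with the time command, as in the earlier posts.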

Hi Ian,

Thanks for the answer. Yes, you were right. It was kind of stupid to keep the loops without dependencies... The calls in my original program are more irregular. The driver is still called hundreds of thousands of times, but between two calls the values are modified by other functions. I repeated the test after adding a dependence on the previous values of x and f and on the loop iteration i.

$ ifort -O2 -o test working_example_alt.f90 && time ./test 
real	0m0.769s
user	0m0.768s
sys	0m0.000s

$ ifort -O2 -o test working_example.f90 && time ./test 
real	0m1.441s
user	0m1.444s
sys	0m0.000s

$ ifort -O2 -o test working_example_no_associate.f90 && time ./test 
real	0m1.524s
user	0m1.516s
sys	0m0.004s

Now it makes more sense. And it's consistent with the slowdown I get in my program (direct passing is 20-30% faster).

So is my take-home message that, when the arguments are passed as a struct, the compiler cannot do the same optimizations as when they are passed directly?

Again thanks for all the time spent!

I think the issue is that the type dat_struct contains a pointer to an array (:),
which in this case means a pointer to an array descriptor (not a simple POD reference to the first element of an array).

In the calling program, the  => operation is constructing the array descriptor.
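To make the contrast concrete, here is a reduced sketch of the two styles. Only the names dat_struct, tdata, mode, x and f come from the thread; the module name, lib_direct, lib_packed and the arithmetic are hypothetical, and the real type has more components. How a given compiler represents the pointer components is an implementation detail, so treat the comments as a description of typical behavior rather than a guarantee.

```fortran
module packing_sketch
   implicit none
   integer, parameter :: dp = kind(1.d0)

   ! Reduced, hypothetical version of the packed-argument type.
   ! Each pointer array component typically carries a descriptor
   ! (base address, bounds, strides), not just a bare address.
   type :: dat_struct
      integer :: mode = 0
      real(dp), pointer :: x(:) => null()
      real(dp), pointer :: f(:) => null()
   end type dat_struct

contains

   ! Direct passing: explicit-shape dummies, and the language rules let
   ! the compiler assume that x and f do not overlap.
   subroutine lib_direct(mode, n, x, f)
      integer, intent(in) :: mode, n
      real(dp), intent(in) :: x(n)
      real(dp), intent(inout) :: f(n)
      f = f + real(mode, dp)*x
   end subroutine lib_direct

   ! Packed passing: the data is reached through pointer components,
   ! which may alias each other, so the compiler has to be more careful.
   subroutine lib_packed(tdata)
      type(dat_struct), intent(inout) :: tdata
      tdata%f = tdata%f + real(tdata%mode, dp)*tdata%x
   end subroutine lib_packed

end module packing_sketch

program packing_demo
   use packing_sketch
   implicit none
   real(dp), target :: x(4), f(4)
   type(dat_struct) :: tdata

   x = 1.0_dp
   f = 0.0_dp
   call lib_direct(2, size(x), x, f)

   ! Packing step: each => fills in the component's descriptor.
   tdata%mode = 2
   tdata%x => x
   tdata%f => f
   call lib_packed(tdata)

   print *, f
end program packing_demo
```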

Unfortunately, what you would need is a user-declared type containing a pointer to an assumed-size array (of rank 1 in this case). Using c_f_pointer would defeat the purpose, since you would be performing the array descriptor construction yourself.

I haven't installed the latest Fortran, nor consulted the F2008 specification to see if you can now have a pointer to an assumed size array within a user defined type.

Maybe Steve (Dr. Fortran) can comment on this.

Jim Dempsey

 

www.quickthreadprogramming.com

No, Fortran doesn't have the concept of a pointer to an assumed-size array.

I haven't really followed the discussion here, but my usual advice is to not make guesses about where performance issues lie. Instead, run the program through a profiler such as Intel VTune Amplifier XE. What I find is that most programmers' guesses as to where the bottleneck lies are wrong, and they waste time micro-optimizing things that don't matter. Memory and cache behavior often become large factors in such applications.

My other usual suggestion is to write the code that is most straightforward and easy for a human to understand. The compiler should be able to help from there.

Steve

Hi Steve,

Thanks for taking the time to answer. Actually, the first thing I always do is run my code through VTune. It's my invaluable ally in the war against hotspots and for better concurrency. VTune showed the extra delay coming from inside the library subroutines. For example, on a small case from the real program, this is the main difference in one library:

In a bigger (bigger case => more calls) example, for the same library:

I was surprised by the flagging of the associate construct. I don't know why this construct would be penalizing at all. I thought it was a simple -but extremely convenient- text substitution. I was probably wrong.

Thanks in advance,

Petros

Yes, you were wrong about ASSOCIATE. A popular misconception, though. ASSOCIATE effectively creates a new variable with some, but not necessarily all, properties of the selector. I would not be too quick to conclude that ASSOCIATE itself is the issue, but there is some bookkeeping that needs to be done when an ASSOCIATE construct is entered. It may be that the times ascribed to ASSOCIATE are actually for code related to other statements.
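A small sketch of why ASSOCIATE is not text substitution. It is unrelated to the original example, and the program and variable names are made up for illustration. With an expression selector, the associate name is bound to the value computed on entry to the construct; with a variable selector, it is associated with the variable itself.

```fortran
program associate_sketch
   implicit none
   real :: a, b

   a = 1.0
   b = 2.0

   ! Expression selector: a + b is evaluated once, on entry.
   ! s cannot be assigned to inside the construct.
   associate (s => a + b)
      a = 10.0
      ! s still holds the value from entry (3.0), while the expression
      ! a + b re-evaluated here gives 12.0. Text substitution would
      ! have made the two identical.
      print *, 's =', s, '  a + b =', a + b
   end associate

   ! Variable selector: v is associated with a itself, so an
   ! assignment to v changes a.
   associate (v => a)
      v = 5.0
   end associate
   print *, 'a =', a   ! a is now 5.0
end program associate_sketch
```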

Steve

>>It may be that the times ascribed to ASSOCIATE are actually for code related to other statements.

Right, and this can be checked by viewing the disassembly at the points in the code that use the associated variable. Note that when the associate does not reduce to a single reference, it becomes more like an inline function: some of the values and/or expressions can be evaluated at the associate statement, and the rest at the point of use of the associated variable.

Jim Dempsey

www.quickthreadprogramming.com
