Automatic variable allocation on stack increases runtime 10X

In a compute-bound program, a loop called two functions, each of which
declared a large local array. Allocating these arrays on the stack
increased the execution time by a factor of 10 to 20. What can be done?

The original code, running under CVF6, used a real array for two
distinct purposes. Under certain circumstances, the entire array
contained real data as input to the function, and returned modified
data. Under different circumstances, the first few elements of the
array contained inputs to the function, and were returned unmodified.

So, we have something like

SUBROUTINE MySub(LongData)
  REAL, DIMENSION(100000), INTENT(INOUT) :: LongData
  REAL, DIMENSION(3) :: ShortData
  x = MyFunc(LongData)   ! the first case
  ! OR
  x = MyFunc(ShortData)  ! the second case
  ...

REAL FUNCTION MyFunc(RealArray)
  REAL, DIMENSION(100000), INTENT(INOUT) :: RealArray
  ...

The trouble is, this gives a compiler error under IF9, because MyFunc
can exceed the dimensions of ShortData. I tried to get around this by
creating a new array, LongData2, and copying ShortData to its initial
elements. This works, but now the large array LongData2 must be created
on the stack each time MySub is called, and MySub is called millions of
times.

Even in a fully optimized release build, temporary space on the stack
is allocated one page at a time, and that turns out to take about 10
times as long as everything else combined!

My first question is, have the compiler designers already considered
this problem, and used a more efficient way of allocating stack space?
If it is reserved and committed, the process shouldn't need to check
every page, and that probably blows the cache, too.

If not, may I suggest you could save a lot of instructions by
allocating stack with a few instructions: a compare to the end of known
good space, followed by a move to esp.

Steve Lionel (Intel):

The compiled code has no way of knowing if the stack space is committed. The only way to reliably give a stack overflow error if the stack does overflow is to do repeated checks on pages.

Have you considered declaring RealArray in the subroutine as assumed size (*)? Or you can make it "adjustable" with the bound passed as an argument, or assumed-shape (:). All of these would avoid the need for a local copy.

Steve
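
A minimal sketch of the assumed-size suggestion above. The function body is elided in the thread, so the summation here is only a placeholder, and the extra argument n is a hypothetical addition that tells the callee how many elements it may touch. The module name is likewise made up for illustration.

```fortran
module myfunc_sketch
  implicit none
contains
  ! Assumed-size dummy: no extent is declared, so a 3-element or a
  ! 100000-element actual argument can be passed without a local copy.
  ! n (hypothetical) is the number of elements the callee may use.
  real function MyFunc(RealArray, n)
    real, intent(inout) :: RealArray(*)
    integer, intent(in) :: n
    integer :: i
    ! Placeholder body: the real computation is not shown in the thread.
    MyFunc = 0.0
    do i = 1, n
      MyFunc = MyFunc + RealArray(i)
    end do
  end function MyFunc
end module myfunc_sketch
```

With this declaration, MyFunc(ShortData, 3) and MyFunc(LongData, 100000) should both compile, and neither needs a large temporary on the stack. Declaring the dummy as assumed-shape, RealArray(:), would let the callee query SIZE(RealArray) instead of taking n, but that requires an explicit interface such as the module used here.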

"The compiled code has no way of knowing if the stack space is committed." There must be a Windows API to determine this.

However, you could just assume the stack allocation was valid. Then, if
it turned out not to be, you would get the stack overflow error when
you tried to use the "allocated" stack space. Granted, this wouldn't be
as convenient, but a clever error handler would keep track of which
stack memory was untested.

I will consider your other suggestions for the next time I have this
situation. For now I just added another argument to the subroutine call.

Steve Lionel (Intel):

No, you wouldn't get a stack error. You'd get either an access violation or perhaps other odd behavior if you were now in some dynamically allocated address space. There are some guard pages at the bottom of the reserved stack area but in order to detect stack overflow, you can't blindly do the subtract and hope for the best. Instead, you have to repeatedly subtract a bit (less than the size of the guard area), test, repeat. There is no other reliable way to detect this problem.
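
To put a rough number on the "subtract, test, repeat" loop, here is a small estimate of how many probes one large local array costs per call. It assumes a 4096-byte page, a 4-byte default REAL, and one probe per page; all three are assumptions, not figures from the thread, and the names are hypothetical.

```fortran
module probe_count_sketch
  implicit none
  integer, parameter :: page_bytes = 4096  ! assumed page size
  integer, parameter :: real_bytes = 4     ! assumed size of default REAL
contains
  ! Estimated number of page-sized steps needed to move the stack
  ! pointer past a local array of n default REALs, one probe per page.
  pure integer function probes_needed(n)
    integer, intent(in) :: n
    ! Integer division rounded up to a whole number of pages.
    probes_needed = (n * real_bytes + page_bytes - 1) / page_bytes
  end function probes_needed
end module probe_count_sketch
```

Under those assumptions, probes_needed(100000) evaluates to 98, so each call that declares a 100000-element local array would touch on the order of a hundred pages before doing any useful work.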

Another solution for you in the next update is to make the local array ALLOCATABLE and allocate it to the desired size. But avoiding the copy seems a better approach to me.

Steve
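
A minimal sketch of the ALLOCATABLE suggestion above, applied to the LongData2 copy from the original post. The MySub signature, the MyFunc body, and the names long_len and allocatable_sketch are hypothetical stand-ins, since the thread elides the real code.

```fortran
module allocatable_sketch
  implicit none
  integer, parameter :: long_len = 100000
contains
  ! Placeholder for the real MyFunc, whose body is not shown in the
  ! thread. It keeps the original explicit-shape dummy argument.
  real function MyFunc(RealArray)
    real, dimension(long_len), intent(inout) :: RealArray
    MyFunc = RealArray(1) + RealArray(2) + RealArray(3)
  end function MyFunc

  ! Hypothetical reduced MySub covering only the short-data case.
  subroutine MySub(ShortData, x)
    real, dimension(3), intent(inout) :: ShortData
    real, intent(out) :: x
    ! ALLOCATABLE local: storage comes from ALLOCATE rather than from
    ! the fixed-size stack frame of the routine.
    real, allocatable :: LongData2(:)
    allocate(LongData2(long_len))
    LongData2(1:3) = ShortData
    x = MyFunc(LongData2)
    ShortData = LongData2(1:3)
    ! LongData2 is deallocated automatically on return.
  end subroutine MySub
end module allocatable_sketch
```

This trades the per-page stack checks for one ALLOCATE and one automatic deallocation per call, which still has a cost when MySub runs millions of times. Passing ShortData directly to an assumed-size or assumed-shape dummy avoids both the allocation and the copy.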

I agree that you'd get an AV. However, your handler could see that the
AV was in reading/writing space which the process thought was valid
stack, and give the same error as would have occurred from _chkstk
failure.

If the stack were the lowest virtual address (like PDP-11), and you did
the subtract, checking the final address only would suffice, since
lower addresses would not be part of the process's address space. In
this method, you would also have to check for wraparound, but this is
just looking at the carry bit after subtraction. No guard area needed.

Another approach is to determine the limits of committed stack once, at
the start of the Fortran runtime, by using a slightly modified _chkstk.
Then you know that this limit is legal, and the stack pointer can be
moved anywhere up to that limit without checking. If some dynamic stack
allocation were done, you might have to redo the limit check.
