allocatable array efficiency issue

Is it normal to see such a big difference in execution time between an allocatable array and a fixed-size array?

An example program produces the following:

ifort pexcample.f90 -O3 ; time ./a.out

(a, fixed size)  Time =   0.000004 seconds.

real    0m0.041s
user    0m0.002s
sys    0m0.004s

(b, allocatable array)  Time =  14.428799 seconds.

real    0m19.532s
user    0m4.383s
sys    0m12.056s

 

program pexample
	implicit none
	integer(kind=8), parameter :: n = 1000,	m = 1024, p = 1024, np = n*m*p
!	real(kind=8) :: x(3,n,m,p)
	real(kind=8), allocatable :: x(:,:,:,:)
	real(kind=8) :: tic, toc
	integer(kind=8) :: i, j, k, l, id
	
	allocate( x(1:3,1:n,1:m,1:p) )		
	call cpu_time(tic)

	do k = 1, p
		do j = 1, m
			do i = 1, n
!				id = I + N*(J-1) + N*M*(K-1)
				x(1,i,j,k) = 0.d0!(i-1)*1.d0/(n-1)
				x(2,i,j,k) = 0.d0!(j-1)*1.d0/(m-1)					
				x(3,i,j,k) = 0.d0!(k-1)*1.d0/(p-1)
			end do
		end do
	end do
	
	call cpu_time(toc)		
	deallocate( x )
	print '("Time = ",f10.6," seconds.")', toc-tic
	
end program pexample

 


Scenario static defined array:

The data of the empty array is loaded into memory at program load time, before cpu_time is called. In other words, the CPU time is measured after the array addresses are already loaded in RAM.

Scenario allocated array:

In the code above, allocate(x(...)) obtains a node from the heap, but the memory behind that node has never been touched (except possibly the page that holds the node's header). cpu_time is called before the first touch. Then, as the loop walks onto pages of the array's address range that have not yet been used, each first touch causes a page fault. The O/S obtains a page from the page file (assuming one is available), maps it into the virtual address space of the process (pexample), possibly wipes the page (or reads the page, then possibly wipes it), and returns to your code, which continues the loop until it reaches the next untouched page of the array. This repeats until the loop finishes.

Your loop above is designed to measure array access time. Therefore the appropriate action would be to insert

x = 0.0

between the allocate and the call to cpu_time.
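A minimal sketch of the first-touch effect described above, not taken from the original post. The sizes are hypothetical and deliberately small (about 96 MiB), and it uses the standard system_clock intrinsic for wall-clock timing. On systems that map pages lazily, the first pass would be expected to take longer than the second, but the actual numbers depend on the OS and compiler.

```fortran
program first_touch_demo
   use, intrinsic :: iso_fortran_env, only : real64, int64
   implicit none
   ! Hypothetical sizes: 3*64*256*256 elements of 8 bytes is about 96 MiB.
   integer, parameter :: n = 64, m = 256, p = 256
   real(real64), allocatable :: x(:,:,:,:)
   real(real64) :: s
   integer(int64) :: t0, t1, t2, t3, rate

   allocate( x(3,n,m,p) )

   call system_clock(t0, rate)
   x = 0.0_real64          ! first pass: pages are touched for the first time
   call system_clock(t1)
   s = sum(x)              ! read x so the first pass cannot be discarded

   call system_clock(t2)
   x = 1.0_real64          ! second pass: the same pages are already mapped
   call system_clock(t3)
   s = s + sum(x)

   print '("first pass  = ",f10.6," s")', real(t1 - t0, real64) / real(rate, real64)
   print '("second pass = ",f10.6," s")', real(t3 - t2, real64) / real(rate, real64)
   print *, 'check sum: ', s
   deallocate( x )
end program first_touch_demo
```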

Before you do that, as a learning experience, add another integer variable, iRep, then wrap everything from before the allocate to after the print in a DO iRep = 1, 3 loop. I also suggest adding a print of 'array located at', LOC(x) after the allocate.

If the same memory space gets reallocated, then the 2nd and later iterations will be fast. If not, they will be slow until the heap allocations cycle back to memory locations that have already been first-touched. This behavior depends on the CRTL (the C Run Time Library used by Fortran).

Jim Dempsey

www.quickthreadprogramming.com

Thank you Jim for the detailed explanation.

Indeed, the same memory space gets reallocated, but there is no improvement:

 array located at            4475424768
Time =  14.375916 seconds.
 array located at            4475424768
Time =  14.443573 seconds.
 array located at            4475424768
Time =  14.414449 seconds.

real    0m56.574s
user    0m13.171s
sys    0m36.430s

When x is initialised after allocation (x = 0.0):

 array located at            4401082368
Time =  10.560929 seconds.
 array located at            4401082368
Time =  10.038265 seconds.
 array located at            4401082368
Time =  10.134284 seconds.

real    1m38.054s
user    0m25.221s
sys    0m55.270s

When I use a static array, calling LOC(x) causes the elapsed time to increase on each iteration of the loop. I did not experience this before on Linux. Could it be related to OS X? I am using Xcode 5.1, which is not supported yet.

Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.139 Build 20140121
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved 

	do iRep = 1, 3	
	print*,'array located at', LOC(x)
	x = 0
	call cpu_time(tic)
	do k = 1, p
		do j = 1, m
			do i = 1, n
				x(1,i,j,k) = 0.d0!(i-1)*1.d0/(n-1)
				x(2,i,j,k) = 0.d0!(j-1)*1.d0/(m-1)					
				x(3,i,j,k) = 0.d0!(k-1)*1.d0/(p-1)
			end do
		end do
	end do
	
	call cpu_time(toc)
	print '("Time = ",f10.6," seconds.")', toc-tic
	end do

 array located at            4320210208
Time =   9.153378 seconds.
 array located at            4320210208
Time =  18.616963 seconds.
array located at            4320210208
Time =  23.337855 seconds.

You may also want to compile the code with vec-report and opt-report turned on. -O3 is a relatively aggressive optimization flag, and it is possible that the compiler finds optimizations for one case and not for the other. At the very least, the reports might give some insight into what the compiler is doing behind the scenes.

-Zaak

Try using omp_get_wtime() for timing.
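As a hedged illustration of why wall-clock timing matters here: cpu_time reports processor time, which can differ from elapsed time when the OS spends time servicing page faults. The sketch below uses the standard system_clock intrinsic as a stand-in for omp_get_wtime() so that it needs no OpenMP; the program name and the workload size are hypothetical.

```fortran
program wall_vs_cpu
   use, intrinsic :: iso_fortran_env, only : real64, int64
   implicit none
   integer, parameter :: nelem = 4000000     ! hypothetical small workload
   real(real64), allocatable :: a(:)
   real(real64) :: cpu0, cpu1
   integer(int64) :: c0, c1, rate

   allocate( a(nelem) )

   call cpu_time(cpu0)                        ! processor time
   call system_clock(c0, rate)                ! wall-clock ticks and tick rate

   a = 0.0_real64                             ! the work being timed

   call cpu_time(cpu1)
   call system_clock(c1)

   print '("cpu_time     = ",f12.8," s")', cpu1 - cpu0
   print '("system_clock = ",f12.8," s")', real(c1 - c0, real64) / real(rate, real64)
   print *, 'check sum: ', sum(a)             ! use a so the store is not removed
   deallocate( a )
end program wall_vs_cpu
```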

I cannot run this program: it requires more than 25 GB and I have only 16 GB.
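The memory figure can be checked with a few lines of arithmetic. This sketch uses the dimensions from the first program in the thread (n = 1000, m = 1024, p = 1024, three 8-byte reals per point) and does the multiplication in 64-bit integers so that it cannot overflow. The result is 25165824000 bytes, about 25.2 GB (23.4 GiB).

```fortran
program array_size_check
   use, intrinsic :: iso_fortran_env, only : real64, int64
   implicit none
   integer(int64), parameter :: n = 1000, m = 1024, p = 1024
   integer(int64) :: bytes_per_elem, nbytes

   ! storage_size returns bits, so divide by 8 to get bytes per element.
   bytes_per_elem = storage_size(1.0_real64) / 8
   nbytes = 3_int64 * n * m * p * bytes_per_elem

   print '("bytes = ",i0)',   nbytes
   print '("GB    = ",f8.2)', real(nbytes, real64) / 1.0e9_real64
   print '("GiB   = ",f8.2)', real(nbytes, real64) / 2.0_real64**30
end program array_size_check
```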

Jim


I don't really know how to report this issue, since the problem seems to appear at random. I've tested the same code on different Macs (same OS and compiler, but different amounts of RAM).

When I use omp_get_wtime, the code compiles only some of the time. When it does compile, the program fails at run time with "Illegal instruction: 4", regardless of the array size.

When it does not compile, the error is: error #6930: The size of the array dimension is too large,

Yet another ridiculous observation is that the loop timing is the same regardless of the array dimensions (this also occurs at random).

Time  = 0.00000400 seconds for (n = 600,    m = 1024, p = 1024) 

Time = 0.00000400 seconds for (n = 600,    m = 124, p = 124)

program pexample
#ifdef WTIME
	USE omp_lib
#endif
	implicit none
	integer(kind=4), parameter :: n = 50,	m = 1024, p = 1024, np = n*m*p
#ifdef ALOC
	real(kind=8), allocatable :: x(:,:,:,:)
#else
	real(kind=8) :: x(3,n,m,p)
#endif
	real(kind=8) :: tic, toc
	integer(kind=4) :: i, j, k, l, iRep

	do iRep = 1,3
#ifdef ALOC	
	allocate( x(1:3,1:n,1:m,1:p) )
#endif
!	print '("Array located at",I,f)', LOC(x), sizeof(x)*9.3132e-10
#ifdef WTIME
	tic = omp_get_wtime()
#else
	call cpu_time(tic)
#endif
	x = 0.0
	do k = 1, p
		do j = 1, m
			do i = 1, n
				x(1,i,j,k) = 0.d0!(i-1)*1.d0/(n-1)
				x(2,i,j,k) = 0.d0!(j-1)*1.d0/(m-1)					
				x(3,i,j,k) = 0.d0!(k-1)*1.d0/(p-1)
			end do
		end do
	end do
			
#ifdef WTIME
	toc = omp_get_wtime()
#else	
	call cpu_time(toc)
#endif

#ifdef ALOC
	deallocate( x )	
#endif
	print '("Time = ",f15.8," seconds.")', toc-tic
	end do
end program pexample

! Results
!ifort -fpp pexample.f90 -o pexample
	!Time =      0.00000400 seconds.
	!Time =      0.00000100 seconds.
	!Time =      0.00000100 seconds

!ifort -fpp pexample.f90 -DALOC -o pexampleALOC
	!Time =     10.44109600 seconds.
	!Time =     10.46074400 seconds.
	!Time =     10.52917000 seconds.

!ifort -fpp pexample.f90 -DWTIME -openmp -o pexample_WTIME
	! random result 
	! 1. Illegal instruction: 4
	! 2. error #6930: The size of the array dimension is too large, and overflow occurred when computing the array size.   [X]
	!	real(kind=8) :: x(3,n,m,p)

!ifort -fpp pexample.f90  -DALOC -DWTIME -openmp -o pexampleALOC_WTIME
	!Time =     10.51393390 seconds.
	!Time =     10.62356901 seconds.
	!Time =     10.63541198 seconds.

 

I seem to recall an allocation issue that occurs when the size of the allocation is over 2GB/4GB and the indexes used in the allocation are integer(4).

Try changing allocate( x(1:3,1:n,1:m,1:p) ) to allocate(  x(1_8:3_8,1_8:INT8(n),1_8:INT8(m),1_8:INT8(p)) )
or allocate(  x(:3_8, INT8(n), INT8(m), INT8(p)) ).

You may also need to experiment with changing one or more of the loop control variables to integer(8).
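A portable sketch of the same idea, assuming standard Fortran 2008: int(n, int64) replaces the INT8 extension for the bounds, and size(x, kind=int64) returns the element count as a 64-bit integer so that the byte count cannot wrap. The program name and the small sizes are hypothetical, chosen only so the sketch runs on any machine; whether this resolves the issue reported in the thread is not confirmed.

```fortran
program int64_bounds
   use, intrinsic :: iso_fortran_env, only : real64, int32, int64
   implicit none
   integer(int32), parameter :: n = 50, m = 64, p = 64   ! hypothetical small sizes
   real(real64), allocatable :: x(:,:,:,:)
   integer(int64) :: nelem, nbytes

   ! Every bound is passed as a 64-bit integer.
   allocate( x(3_int64, int(n, int64), int(m, int64), int(p, int64)) )

   ! Ask for the element count as int64 rather than as a default integer.
   nelem  = size(x, kind=int64)
   nbytes = nelem * (storage_size(x) / 8)

   if (nelem /= 3_int64 * n * m * p) error stop 'unexpected element count'
   print '("elements = ",i0,", bytes = ",i0)', nelem, nbytes
   deallocate( x )
end program int64_bounds
```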

Jim Dempsey


Actually, I've been playing with the integer kind, but without any success. I cannot even run a simple program:

program pexample
	USE omp_lib 
	use, intrinsic :: ISO_FORTRAN_ENV, only : RP => REAL64, IP => INT64

	implicit none

	integer(kind=IP), parameter :: ub = 3_IP
	integer(kind=IP), parameter :: lb = 1_IP
	integer(kind=IP), parameter :: n = 2_IP
	integer(kind=IP), parameter :: m = 1024_IP
	integer(kind=IP), parameter :: p = 1024_IP
	integer(kind=IP), parameter :: np = n*m*p
	
	real(kind=RP) :: x(lb:ub,lb:n,lb:m,lb:p)
	real(kind=RP) :: tic, toc
	
	tic = omp_get_wtime();
	x = 0.0_RP;
	toc = omp_get_wtime();
	
end program pexample
! Results
! x(1:3,1:2,1:1024,1:1024)
! ifort -openmp pexample1.f90 ; ./a.out 
! Segmentation fault: 11

! x(1:4,1:2,1:1024,1:1024)
! Illegal instruction: 4

 

Enable heap arrays (the -heap-arrays compiler option).
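An alternative to a compiler switch, sketched under the assumption that the crash comes from the 48 MiB fixed-size array not fitting on the stack (the thread does not confirm this): declare the array allocatable so that its storage comes from the heap. The program name is hypothetical, and system_clock replaces omp_get_wtime() so that the sketch needs no OpenMP.

```fortran
program pexample_heap
   use, intrinsic :: iso_fortran_env, only : RP => real64, IP => int64
   implicit none
   integer(IP), parameter :: n = 2_IP, m = 1024_IP, p = 1024_IP
   real(RP), allocatable :: x(:,:,:,:)      ! heap storage instead of a fixed-size local
   integer(IP) :: c0, c1, rate
   integer :: stat

   allocate( x(3,n,m,p), stat=stat )
   if (stat /= 0) error stop 'allocation failed'

   call system_clock(c0, rate)
   x = 0.0_RP
   call system_clock(c1)

   print '("Time = ",f15.8," seconds.")', real(c1 - c0, RP) / real(rate, RP)
   print *, 'check sum: ', sum(x)           ! use x so the store is not removed
   deallocate( x )
end program pexample_heap
```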

Jim Dempsey

