Hello everybody,

I am trying to reduce the execution time of an existing 2D flow simulation program that was originally written in Fortran 77. Since I am quite new to Fortran programming, I first tried to gather some information about efficient coding. I was repeatedly advised to use the "new" intrinsic functions, because they tell the compiler what is going on and allow it to perform optimizations (e.g. vectorization). Although these functions, together with array expressions in place of scalar ones, made my code much more compact and elegant, the run time increased, to my surprise.

Here are some code snippets in the old and new versions, together with the CPU time each one required (3 runs each):

===========================================================================

1. Example CSHIFT: shifts each of the eight directional planes of f by one element along its own direction (axes and diagonals) with periodic wrap-around; plane 0 is copied unchanged

===========================================================================

*old: t = 8.777; 8.553; 8.789 s*

-----------------------------------

    do j=1,nj
      do i=1,ni
        ie = mod(i,ni) + 1
        iw = ni - mod(ni+1-i,ni)
        jn = mod(j,nj) + 1
        js = nj - mod(nj+1-j,nj)
        fn(ie,j ,1) = f(i,j,1)
        fn(i ,jn,2) = f(i,j,2)
        fn(iw,j ,3) = f(i,j,3)
        fn(i ,js,4) = f(i,j,4)
        fn(ie,jn,5) = f(i,j,5)
        fn(iw,jn,6) = f(i,j,6)
        fn(iw,js,7) = f(i,j,7)
        fn(ie,js,8) = f(i,j,8)
        fn(i ,j ,0) = f(i,j,0)
      enddo
    enddo

-----------------------------------

*new: t = 11.009; 11.241; 11.033 s*

-----------------------------------------

    fn(:,:,0) = f(:,:,0)
    fn(:,:,1) = cshift(f(:,:,1),-1,1)
    fn(:,:,2) = cshift(f(:,:,2),-1,2)
    fn(:,:,3) = cshift(f(:,:,3),1,1)
    fn(:,:,4) = cshift(f(:,:,4),1,2)
    fn(:,:,5) = cshift(cshift(f(:,:,5),-1,1),-1,2)
    fn(:,:,6) = cshift(cshift(f(:,:,6),1,1),-1,2)
    fn(:,:,7) = cshift(cshift(f(:,:,7),1,1),1,2)
    fn(:,:,8) = cshift(cshift(f(:,:,8),-1,1),1,2)

=====================================================

2. Example WHERE: assigns new values to f wherever the condition (obst) is true

=====================================================

*old: t = 2.488; 2.460; 2.712 s*

-----------------------------------

    do j=1,nj
      do i=1,ni
        if (obst(i,j)) then
          f(i,j,1) = fn(i,j,3)
          f(i,j,2) = fn(i,j,4)
          f(i,j,3) = fn(i,j,1)
          f(i,j,4) = fn(i,j,2)
          f(i,j,5) = fn(i,j,7)
          f(i,j,6) = fn(i,j,8)
          f(i,j,7) = fn(i,j,5)
          f(i,j,8) = fn(i,j,6)
          f(i,j,0) = fn(i,j,0)
        endif
      enddo
    enddo

-----------------------------------

*new: t = 5.404; 5.628; 5.456 s*

------------------------------------

    where(obst)
      f(:,:,1) = fn(:,:,3)
      f(:,:,2) = fn(:,:,4)
      f(:,:,3) = fn(:,:,1)
      f(:,:,4) = fn(:,:,2)
      f(:,:,5) = fn(:,:,7)
      f(:,:,6) = fn(:,:,8)
      f(:,:,7) = fn(:,:,5)
      f(:,:,8) = fn(:,:,6)
      f(:,:,0) = fn(:,:,0)
    endwhere

=====================

3. Example WHERE, SUM, various: computes rho, u and v where obst is false and assigns constants elsewhere

=====================

*old: t = 6.020; 6.056; 6.160 s*

-----------------------------------

    do j=1,nj
      do i=1,ni
        if(.not.obst(i,j))then
          rho(i,j) = fn(i,j,0)+fn(i,j,1)+fn(i,j,2)+fn(i,j,3)+fn(i,j,4)+fn(i,j,5)+fn(i,j,6)+fn(i,j,7)+fn(i,j,8)
          u(i,j) = (fn(i,j,1)+fn(i,j,5)+fn(i,j,8)-fn(i,j,6)-fn(i,j,3)-fn(i,j,7))/rho(i,j)
          v(i,j) = (fn(i,j,5)+fn(i,j,2)+fn(i,j,6)-fn(i,j,7)-fn(i,j,4)-fn(i,j,8))/rho(i,j)
        else
          rho(i,j) = rho_in
          u(i,j) = 0.d0
          v(i,j) = 0.d0
        endif
      enddo
    enddo

-----------------------------------

*new: t = 12.421; 11.897; 11.645 s*

-----------------------------------------

    where (.not.obst)
      rho(:,:) = sum(fn,3)
      u(:,:) = (fn(:,:,1)+fn(:,:,5)+fn(:,:,8)-fn(:,:,6)-fn(:,:,3)-fn(:,:,7))/rho
      v(:,:) = (fn(:,:,5)+fn(:,:,2)+fn(:,:,6)-fn(:,:,7)-fn(:,:,4)-fn(:,:,8))/rho
    elsewhere
      rho = rho_in
      u(:,:) = 0.d0
      v(:,:) = 0.d0
    endwhere

-------------------------------------------

I did not expect a significant performance boost, but I did not expect a slowdown either. Has anybody made similar observations, or can anyone explain these results? Any help would be greatly appreciated.

With best regards,

Eric

My setup:

- OS: Kubuntu 11.10
- Compiler Version: Fortran Intel(R) 64 Compiler XE, Version 12.1.5.339
- Compilation flags: none
- CPU: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz (cache size: 1024 KB)
- Memory: 2 GB
- Time measurement: using the subroutine CPU_TIME; each code snippet is looped 20000 times (the variables change on every iteration)
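For anyone who wants to reproduce the measurements, a timing loop of the kind described in the last item could look roughly like the sketch below. It is not the original program: the program name, the constant `nrep`, the random initialization, and the way `f` is perturbed on every repetition are assumptions made for illustration, and only one plane of the CSHIFT example is timed.

```fortran
program time_kernel
  implicit none
  integer, parameter :: dp = kind(1.d0)
  integer, parameter :: ni = 100, nj = 100
  integer, parameter :: nrep = 20000        ! number of repetitions (assumed name)
  real(dp) :: f(ni,nj,0:8), fn(ni,nj,0:8)
  real(dp) :: t0, t1
  integer  :: irep

  call random_number(f)                     ! arbitrary test data in [0,1)
  fn = 0.d0

  call cpu_time(t0)
  do irep = 1, nrep
     ! kernel under test: plane 1 of the CSHIFT example
     fn(:,:,1) = cshift(f(:,:,1), -1, 1)
     ! perturb the input so that the data change on every repetition
     f(1,1,1) = 0.5d0*(f(1,1,1) + fn(1,1,1))
  end do
  call cpu_time(t1)

  print '(a,f8.3,a)', 'cpu time: ', t1 - t0, ' s'
  ! printing a value derived from fn keeps the result live
  print *, 'checksum: ', sum(fn(:,:,1))
end program time_kernel
```

Swapping the kernel line for the loop version allows a like-for-like comparison under the same harness.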

Used variables:

real*8, dimension(1:100,1:100,0:8) :: f, fn

logical, dimension(100,100) :: obst (10% of the elements are true, arranged as a sphere)

integer :: ni, nj (both equal to 100)

real*8, dimension(100,100) :: u, v, rho

real*8 :: rho_in
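
To confirm that the loop and CSHIFT versions of example 1 really compute the same thing, a small stand-alone check along the following lines can be used. This is only a sketch: the program name and the names `fn_loop` and `fn_vec` are made up for the test, `ni = nj = 100` is assumed, and `f` is filled with random numbers rather than real simulation data.

```fortran
program check_shift
  implicit none
  integer, parameter :: dp = kind(1.d0)
  integer, parameter :: ni = 100, nj = 100
  real(dp), dimension(1:ni,1:nj,0:8) :: f, fn_loop, fn_vec
  integer :: i, j, ie, iw, jn, js

  call random_number(f)

  ! old version: explicit loops with periodic neighbour indices
  do j = 1, nj
     do i = 1, ni
        ie = mod(i,ni) + 1
        iw = ni - mod(ni+1-i,ni)
        jn = mod(j,nj) + 1
        js = nj - mod(nj+1-j,nj)
        fn_loop(ie,j ,1) = f(i,j,1)
        fn_loop(i ,jn,2) = f(i,j,2)
        fn_loop(iw,j ,3) = f(i,j,3)
        fn_loop(i ,js,4) = f(i,j,4)
        fn_loop(ie,jn,5) = f(i,j,5)
        fn_loop(iw,jn,6) = f(i,j,6)
        fn_loop(iw,js,7) = f(i,j,7)
        fn_loop(ie,js,8) = f(i,j,8)
        fn_loop(i ,j ,0) = f(i,j,0)
     end do
  end do

  ! new version: CSHIFT on whole planes
  fn_vec(:,:,0) = f(:,:,0)
  fn_vec(:,:,1) = cshift(f(:,:,1),-1,1)
  fn_vec(:,:,2) = cshift(f(:,:,2),-1,2)
  fn_vec(:,:,3) = cshift(f(:,:,3),1,1)
  fn_vec(:,:,4) = cshift(f(:,:,4),1,2)
  fn_vec(:,:,5) = cshift(cshift(f(:,:,5),-1,1),-1,2)
  fn_vec(:,:,6) = cshift(cshift(f(:,:,6),1,1),-1,2)
  fn_vec(:,:,7) = cshift(cshift(f(:,:,7),1,1),1,2)
  fn_vec(:,:,8) = cshift(cshift(f(:,:,8),-1,1),1,2)

  ! both versions only copy values, so an exact comparison is valid
  if (any(fn_loop /= fn_vec)) then
     print *, 'MISMATCH between loop and cshift versions'
  else
     print *, 'loop and cshift versions agree'
  end if
end program check_shift
```

Since both variants only move values around, an exact comparison is sufficient here; the arithmetic in example 3 would need a tolerance instead.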