Optimisation of Do Loops

I have been looking at the response of old-style DO loops to optimisation, and have found one case that performs opposite to what is expected.
I have considered six options for calculating a dot product, using ifort ver 11.1:

1) Including a conventional DO loop in a block of code.

```
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
```

2) Converting the DO Loop to Dot_Product, using array sections.

```
A_ptr_t = A_ptr_b + JBAND - 1
B_ptr_t = B_ptr_b + JBAND - 1
A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t))
```

3) Wrapping the DO loop into an F77-style function.

```
A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)
```

where

```
REAL*8 FUNCTION VEC_SUM (A, B, N)
!
integer*4,               intent (in) :: n
real*8,    dimension(n), intent (in) :: a
real*8,    dimension(n), intent (in) :: b
!
real*8    c
integer*4 k
!
c = 0
do k = 1,N
   c = c + A(k) * B(k)
end do
vec_sum = c
!
RETURN
END
```

4) Wrapping Dot_Product into an F77-style function, to avoid array sections.

```
A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND)
```

where

```
REAL*8 FUNCTION VEC_SUM_d (A, B, N)
!
integer*4,               intent (in) :: n
real*8,    dimension(n), intent (in) :: a
real*8,    dimension(n), intent (in) :: b
!
vec_sum_d = dot_product ( a, b )
!
RETURN
END
```

5) Modifying the DO loop to hold subscripts in temporary variables.

```
c = 0
do k = JEQ_bot, J-1
   ia = A_ptr_0+k
   ib = B_ptr_0+k
   c = c + A(ia) * B(ib)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c
```

6) Modifying the DO loop to auto-increment temporary subscript variables.

```
c = 0
do k = JEQ_bot, J-1
   c = c + A(A_ptr_b) * B(B_ptr_b)
   A_ptr_b = A_ptr_b + 1
   B_ptr_b = B_ptr_b + 1
end do
A(A_ptr_b) = A(A_ptr_b) - c
```

A and B are real*8 vectors, and all subscripts are integer*4.
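For anyone wanting to reproduce the comparison, the sketch below is a minimal self-contained harness (my own, not the original test code; the array size, random data, and SYSTEM_CLOCK timing are illustrative) that checks options 1 and 2 give the same answer and times each:

```
! Illustrative harness: compares the hand-coded loop (option 1) with
! the DOT_PRODUCT intrinsic (option 2) on the same random data.
program dot_check
   implicit none
   integer, parameter :: n = 100000
   real*8  :: a(n), b(n), c_loop, c_dp
   integer :: k, t0, t1, rate
   call random_number(a)
   call random_number(b)
   ! option 1: explicit DO loop
   call system_clock(t0, rate)
   c_loop = 0
   do k = 1, n
      c_loop = c_loop + a(k)*b(k)
   end do
   call system_clock(t1)
   print *, 'loop sum        =', c_loop, '  time(s) =', real(t1-t0)/real(rate)
   ! option 2: DOT_PRODUCT intrinsic
   call system_clock(t0)
   c_dp = dot_product(a, b)
   call system_clock(t1)
   print *, 'dot_product sum =', c_dp, '  time(s) =', real(t1-t0)/real(rate)
   ! the two forms should agree to rounding
   if (abs(c_loop - c_dp) > 1.0d-6*abs(c_dp)) stop 'MISMATCH'
end program dot_check
```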

I have tested these on a Xeon processor with compiler options:
/o1, /o2 or /o3 (/o2 is default)
/Qvec or /Qvec- (/Qvec is default)
/QxHost

All coding options EXCEPT option 3 show improvement from /o1 (or /o2 /Qvec-) to /o2, with /o3 similar, reducing from 13 seconds to 8 seconds, presumably benefiting from the vector instructions. /QxHost does not have any significant effect on the results.

However, for option 3, /o1, or /o2 with /Qvec- and /QxHost, took 13 seconds; but if /QxHost and (/o2 or /o3) were combined, with /Qvec at its default, the run time blew out to 22 seconds. /o3 without /QxHost reduced slightly to 11.5 seconds. This is much different from all the other options.

I was expecting /QxHost to select the preferred instructions for the installed processor, but in case 3 it appears to fail.

Option 3 is a coding approach I have used in a lot of old F77-style codes, where the wrapper is one of a set of common calculations stored in a library of shared routines. These are introduced as simple routines that have been expected to benefit from optimisation at a local level. This approach suits a number of other compilers.

From this I conclude that for ifort:
/QxHost should not be used, and
I should review my use of libraries of common calculations.

Why would the combination of /QxHost with /o2 or /o3 cause such a contrary result in coding example 3?

John


What make and exact model CPU are you using? I would not expect /QxHost to make things worse.

Your option 3 means that the compiler has no idea how long the loops are nor whether the arguments are aligned. Otherwise it looks ok. But I would suggest you look at the BLAS dot-product routines in MKL if performance is important.
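To make the MKL suggestion concrete, a drop-in alternative to VEC_SUM could delegate to the standard BLAS Level 1 routine DDOT, which MKL provides. This is a sketch, assuming the usual BLAS interface; the wrapper name VEC_SUM_BLAS is mine, and you would link against MKL (e.g. with /Qmkl):

```
! Hypothetical wrapper (name is illustrative) delegating the dot
! product to the BLAS routine DDOT, as supplied by MKL.
      REAL*8 FUNCTION VEC_SUM_BLAS (A, B, N)
      INTEGER*4,               INTENT (IN) :: N
      REAL*8,    DIMENSION(N), INTENT (IN) :: A, B
      REAL*8,    EXTERNAL :: DDOT
!     DDOT(N, X, INCX, Y, INCY) returns sum(X(1:N)*Y(1:N)) for unit strides
      VEC_SUM_BLAS = DDOT (N, A, 1, B, 1)
      END
```

It would be called exactly like option 3: `A(A_ptr_0+J) = A(A_ptr_0+J) - VEC_SUM_BLAS (A(A_ptr_b), B(B_ptr_b), JBAND)`.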

Retired 12/31/2016

I have now updated ifort from Ver 11.1 to Ver 12.1.5.344, upgraded the Xeon processor, and moved to Win 7_64.
The problem I found with /QxHost has now gone, so I am relieved that my old approach of using libraries of simple procedures can still be used. (The problem still occurred with Ver 11.1 on the upgraded PC.)

I now need to go back and see what other problems I was having.

The aim of this review ( and selection of ifort) has been to identify how to "parallel" a skyline direct solver for large sets of linear equations.
I have found that vectorising was easy to use, but I struggled with the advice I received last year on parallelizing my code. While I am very experienced in coding to F77 and F95, a consequence of this experience is that my age makes it more difficult to learn new techniques !

If you could recommend to me sections of the ifort documentation that I should first read to better understand how to approach parallelizing, it would be appreciated.

John

I suggest that you do a build with Guided Auto Parallelization/Vectorization (GAP) on to see what the compiler has to say about what you might do differently. This is a build using /Qguide - it does not create an executable, but can output diagnostics with recommendations.
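For reference, a GAP pass is just an extra diagnostic compile; the command line below is illustrative (the source file name is assumed):

```
ifort /Qguide /Qparallel solver.f90
```

The compiler then emits messages suggesting source changes or additional options that would allow loops to vectorize or auto-parallelize; adding /Qparallel extends the guidance to parallelization, and no executable is produced.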
