I have been looking at how old-style DO loops respond to optimisation, and have found one case which performs opposite to what I expected.
I have considered 6 options for calculating a Dot_Product, using ifort ver 11.1:
1) Including a conventional DO loop in a block of code.
      c = 0
      do k = JEQ_bot,J-1
         c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
      end do
      A(A_ptr_0+J) = A(A_ptr_0+J) - c
2) Converting the DO Loop to Dot_Product, using array sections.
      A_ptr_t = A_ptr_b + JBAND - 1
      B_ptr_t = B_ptr_b + JBAND - 1
      A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
3) Wrapping the DO loop in an F77-style function.
      A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)

where

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
      integer*4, intent (in) :: n
      real*8, dimension(n), intent (in) :: a
      real*8, dimension(n), intent (in) :: b
!
      real*8    c
      integer*4 k
!
      c = 0
      do k = 1,N
         c = c + A(k) * B(k)
      end do
      vec_sum = c
!
      RETURN
!
      END
4) Wrapping the Dot_Product in an F77-style function, to avoid array sections.
      A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND)

where

      REAL*8 FUNCTION VEC_SUM_d (A, B, N)
!
      integer*4, intent (in) :: n
      real*8, dimension(n), intent (in) :: a
      real*8, dimension(n), intent (in) :: b
!
      vec_sum_d = dot_product ( a, b )
!
      RETURN
!
      END
5) Modified the DO loop to take the subscripts as temporary variables.
      c = 0
      do k = JEQ_bot,J-1
         ia = A_ptr_0+k
         ib = B_ptr_0+k
         c = c + A(ia) * B(ib)
      end do
      A(A_ptr_0+J) = A(A_ptr_0+J) - c
6) Modified the DO loop to auto-increment temporary subscripts.
      c = 0
      do k = JEQ_bot,J-1
         c = c + A(A_ptr_b) * B(B_ptr_b)
         A_ptr_b = A_ptr_b+1
         B_ptr_b = B_ptr_b+1
      end do
      A(A_ptr_b) = A(A_ptr_b) - c
A and B are real*8 vectors, and all subscripts are integer*4.
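For reference, the declarations I am assuming in the fragments above look something like the following (a sketch only; the array bounds and the pointer initialisations depend on the surrounding solver code and are not shown):

      real*8    A(*), B(*), c
      integer*4 J, JEQ_bot, JBAND, k, ia, ib
      integer*4 A_ptr_0, B_ptr_0              ! base offsets into A and B
      integer*4 A_ptr_b, B_ptr_b              ! bottom of the active band
      integer*4 A_ptr_t, B_ptr_t              ! top of the active band (option 2)

Here JBAND = J - JEQ_bot, so the array sections in option 2 cover the same JBAND elements that the DO loop in option 1 traverses.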
I have tested these on a Xeon processor with the compiler options:
/o1, /o2 or /o3 (/o2 is the default)
/Qvec or /Qvec- (/Qvec is the default)
All coding options EXCEPT option 3 show improvement from (/o1 or /o2 /Qvec-) to /o2, with /o3 similar, reducing from 13 seconds to 8 seconds, presumably benefiting from the vector instructions. /QxHost does not have any significant effect on these results.
However, for option 3, /o1 alone or (/o2 /Qvec- /QxHost) took 13 seconds, but if /QxHost and (/o2 or /o3) were combined, with /Qvec at its default, the run time blew out to 22 seconds. /o3 without /QxHost reduced slightly to 11.5 seconds. This is much different from all the other options.
I was expecting /QxHost to select the preferred instruction coding for the installed processor, but in this case (option 3) it appears to fail.
Option 3 is a coding approach I have used in a lot of old F77-style codes, where the wrapper is one of a set of common calculations stored in a library of shared routines. These are introduced as simple routines that have been expected to benefit from optimisation at a local level. This approach suits a number of other compilers.
From this I conclude that for ifort:
/QxHost should not be used, and
I should review my use of libraries of common calculations.
Why would the combination of /QxHost with /o2 or /o3 cause such a contrary result in coding example 3?