# Special loops

## Special loops

I am trying to implement some Fortran code from someone else as follows:

do i=1,n
do j=1,i
S = A(i,j) - G(j,j:n) * G(i,j:n)'
enddo
enddo

I believe it is meant to be equivalent to the following:

do i=1,n
do j=1,i
S = A(i,j)
do k=j,n
S = S - G(j,k) * G(i,k)
enddo
enddo
enddo

The first code sample results in a compiler error because it doesn't like the ' at the end of line 3 which I believe is meant to tell it to accept S as a scalar. If I take the ' away then I get a different compiler error saying that S is dimensioned incorrectly, presumably because it expects it to be an array.

Does anyone know how to get around this and what the equivalent of the ' at the end of line 3 is in Intel Fortran?

9 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

There should not be a ' at the end of line three it is not valid syntax, remove it.

Yes, but then I get a compiler error because it is expecting S to be dimensioned as an array.

What I really want to know is what to replace the ' with so that line 3 sums the G(j,j:n) * G(i,j:n) terms and puts them into S as a scalar. I'm pretty sure that's what the ' is there for but I don't know what the Intel equivalent is.

G(j,j:n)  and  G(i,j:n) are both arrays that are 1 x (n-j+1) big so the result  of the multiply is a 1d array . You are trying to assign and array to a scaler.

I think you really want SUM( G(j,j:n) * G(i,j:n)) which adds  the array elements of the multiplication result to give a single scaler.

Quote:

schulzey wrote:

Yes, but then I get a compiler error because it is expecting S to be dimensioned as an array.

What I really want to know is what to replace the ' with so that line 3 sums the G(j,j:n) * G(i,j:n) terms and puts them into S as a scalar. I'm pretty sure that's what the ' is there for but I don't know what the Intel equivalent is.

The prime at the end makes sense in a mathematical way, in the context of computing the scalar product of two row vectors taken from matrices. It makes no sense in Fortran, since the quote character is used to delimit character strings.

It strikes me that someone may have gone too far in paring down the first version of the code extract before posting here. The nested loops do nothing useful, since only the last-computed value of S is available after the loops are completed, so the entire code extract is equivalent to

`S = A(n,n) - G(n,n) * G(n,n)`

if the value of S is not assigned to something else with subscripts inside the loops.

If, however, the lines in question were comments in Fortran code, those would be understandable as mathematical explanations of subsequent lines of Fortran code with DO loops.

In your second code extract, the loop with index k could probably be replaced by either A(i,j) = A(i,j) - sum(G(j,j:n) * G(i,j:n)) or A(i,j) = A(i,j) - DOT_PRODUCT(G(j,j:n) , G(i,j:n))

On the other hand, the code suggests that some sort of matrix orthogonalization is intended, in which case there are routines in Lapack (or MKL) that will serve you better.

Thanks guys, SUM(G(j,j:n) * G(i,j:n)) and DOT_PRODUCT(G(j,j:n) , G(i,j:n)) both do what I want. I am after maximum speed and so I will try some speed tests to see which is faster (DOT_PRODUCT I suspect). mecej4, I will also look into Lapack and MKL to see if they are faster.

>>SUM(G(j,j:n) * G(i,j:n)) and DOT_PRODUCT(G(j,j:n) , G(i,j:n)) both do what I want. I am after maximum speed

Without seeing all your code it is impossible for me, or anyone else not seeing all your code, to suggest improvements.

However, in abstract terms, if G were kept in transposed form then

DOT_PRODUCT(G(j:n,i) , G(j:n),i) ! Transposed faster
DOT_PRODUCT(G(j,j:n) , G(i,j:n))

Note, I am not suggesting you immediately transpose G prior to the above statement. Rather, I am suggesting that you investigate as to if making (new) G the transposition of (old) G. This may require significant code change.

When the code you are using had a heritage from decades ago, the computers then may not have had cache nor SIMD instructions. On "ancient" systems without cache, the order of subscripts made no difference in performance. On any system with cache memory, memory order makes a large difference (on the order of 5x-100x). Add vectorization and you might attain an additional 2x-10x.

In Fortran G(i,j) and G(i+1,j) are in adjacent memory, and have a high probability of being in the same cache line. Cache lines are typically 64 bytes (16 floats, 8 doubles). *** Note, Fortran fastest (nearest) references go left to right, C/C++ are right to left.

Look through the rest of your code to see how much of it is indexing subscripts from right to left. In some cases you can simply change the loop order, in other cases like the above, it may (may stressed) make a significant difference to organize the data in the transposed format.(swap rows with columns).

Jim Dempsey

WOW, simply working with the transpose of G sped it up by around 16x on my Intel i7 laptop !!! Note that A, G and S are all doubles.

I have always known that 2D Fortran arrays are arranged in memory from left to right, but I didn't realise that having the left index in the innermost loop would have such a big speed benefit !

By the way, using an inner loop with k=1,j-1 rather than using DOT_PRODUCT (shown below) or SUM doesn't seem to make any difference in speed.

My application is a simple Cholesky factorization of a symmetric positive definite matrix [A] of size n x n. The aim is to find a lower triangular matrix [G] so that [A] = [G][G]T. I am just as happy to find [G]T instead of [G] by swapping the subscripts of [G] in lines 5 and 8 below, and this simple change gives the 16x speed increase. The full code for the factorization is:

do i=1,n
do  j=1,i
S=A(j,i)-DOT_PRODUCT(G(1:j-1,j),G(1:j-1,i))   ! Changed from ...(G(j,1:j-1),G(i,1:j-1))
if (j.lt.i) then
G(j,i)=S/G(j,j)   ! Changed from G(i,j)=S/G(j,j)
else
G(i,i)=SQRT(S)
endif
enddo
enddo

Can you see any other ways I could speed it up further? How do I add vectorization?

Peter Schulze

The routines dpotrf, etc., in the MKL library (which is included with the Intel compilers) performs the GGT factorization of a positive definite matrix. Try calling the appropriate MKL routine instead of writing your own, if speed is your goal.