# what is the fastest way to compute 3x3 matrix inverse, 3x3 matrix multiplication?


I perform tens of millions of computations of 3x3 matrix products. The code is already parallelized using a very efficient domain decomposition, so let's consider only single-thread code for now. (At the end of the post, I mention my target architecture; maybe you will recognize a way to utilize multithreading to speed up the computation.)

My program is nothing but computations like this, and takes days to run on very fast cluster systems. Every little bit will help!

Due to legacy code, the matrix dimension, NSD, is stored in a common block. Its value is always 3.

The two methods of matrix multiplication I know of are:

```
do i = 1,NSD
   do j = 1,NSD
      C(i,j) = 0
      do k = 1,NSD
         C(i,j) = C(i,j) + A(i,k)*B(k,j)
      enddo !k
   enddo !j
enddo !i
```

and

```
do i = 1,NSD
   do j = 1,NSD
      C(i,j) = sum( A(i,:)*B(:,j) )
   enddo !j
enddo !i
```

Can these loops be unrolled, given that their trip counts are set at runtime? Is vectorizing a 3-place operation worth the bother? Are there other speedup possibilities? (Again, "macroscopic" speedups through MPI or OpenMP have already been utilized to their limit.) Ten to fifteen sig figs of accuracy is plenty, so I am comfortable with flags such as -O3 and -fast.

I use ifort and the MKL on linux.

What about computing the 3x3 matrix inverse?

The target processors are

• Xeon Intel Duo-Core 64-bit processors
• Dell Poweredge 2950 workstations with 2 dualcore, hyperthreading 3.73 GHz Xeon processors

If this means something to you, awesome.


MKL won't help until the dimension is fairly large, something on the order of 25. You would likely need to disable hyperthreading in the BIOS, particularly if you are running an OS version as old as your hardware or do not want to learn about affinity tools.

You imply that neither you nor your predecessor heard of MATMUL? Not that it should make a difference here; compilers ought to unroll the code automatically (ifort -O3), if the dimension is written into the code as a constant. LOOP COUNT and UNROLL directives might be used to accomplish a similar result, but, if you are allowed to correct the code at all, why not simply write in the case you intend?
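A minimal sketch of that suggestion: write the dimension as a compile-time constant and call MATMUL over fixed-shape arrays, so ifort -O3 sees constant trip counts it can fully unroll. (The module and subroutine names here are illustrative, not from the original code.)

```fortran
! Sketch: with NSD as a compile-time parameter, the shapes are
! known statically, so the compiler may fully unroll MATMUL at -O3.
module dims
   integer, parameter :: NSD = 3
end module dims

subroutine mult3(A, B, C)
   use dims
   implicit none
   real*8, intent(in)  :: A(NSD,NSD), B(NSD,NSD)
   real*8, intent(out) :: C(NSD,NSD)
   C = matmul(A, B)   ! fixed 3x3 shapes; candidate for full unrolling
end subroutine mult3
```

Whether the compiler actually unrolls it should be confirmed from the optimization report or the generated assembly.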

You don't provide enough information to tell why you want matrix inversions, rather than a potentially faster method (not much faster, for such a small matrix), nor to advise about threading. If you can split up the work of these matrix operations between threads, that would be an obvious thing to do. I think you are hinting that your source code is written so as to prevent a compiler from performing auto-parallel.

> You would likely need to disable hyperthreading in the BIOS, particularly if running an OS version as old as your hardware,

> or not wanting to learn about affinity tools.

The OS is Ubuntu, probably 32-bit, I can never tell. I've never heard of affinity tools before. I do not have admin access, only user access on these machines.

>You imply that neither you nor your predecessor heard of MATMUL?

I hadn't, before yesterday. My predecessor is a brute-force do-it-yourself kind of guy--there's not a single library call in any of his code--and fully admits his code could be improved. I'm trying to learn as much as I can about easy speedup tips (ie, "use this construct instead of that") because an all-out overhaul (which would require months of re-testing and re-benchmarking) is beyond either of us.

> compilers ought to unroll the code automatically (ifort -O3), if the dimension is written into the code as a constant.

That's part of my question: how can they unroll it? The variable is set in a module at run time:

```
module foo
   integer NSD
end module foo

program main
   use foo
   NSD = 3
   call mysub()
end program main

subroutine mysub()
   use foo
   real*8 a(NSD,NSD), b(NSD,NSD), c(NSD,NSD)
   do i = 1,3
      do j = 1,3
         c(i,j) = sum( a(i,:)*b(:,j) )
      enddo !j
   enddo !i
end subroutine mysub
```

though I could make NSD a parameter, thus potentially knowable at compile time:

```
module foo
   integer, parameter :: NSD = 3
end module foo
```

>but, if you are allowed to correct the code at all, why not simply write in the case you intend?

My advisor wants to preserve the possibility that the code can be used for 2D problems in the future, at which point NSD=2.

> You don't provide enough information to tell why you want matrix inversions, rather than a potentially faster method

> (not much faster, for such a small matrix), nor to advise about threading. If you can split up the work of these

> matrix operations between threads, that would be an obvious thing to do. I think you are hinting that your source

> code is written so as to prevent a compiler from performing auto-parallel.

The matrix measures the deformation of an elastic material from its original configuration (ie, at rest, time t=0) into a new configuration (ie, with forces applied, at some time t>0). The matrix is used to map vectors in one configuration to what they would be in the other coordinate system if they were to ride on the elastic material:

```
do i = 1,NSD
   vector_in_ref_coords(i) = sum( Finv(i,:) * same_vector_in_current_coords(:) )
enddo !i
```

I guess I could have just said "coordinate transformations" because that's what they are. Every graphics card for the past fifteen years has done more 4x4 matrix multiplications than I could possibly imagine. That's what makes me think there's got to be a fast way. But every time I look for fast methods of matrix manipulation, they immediately start assuming large matrices, as if the brute-force methods are sufficient for small ones.
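The "fast way" for a fixed 3x3 is essentially just full unrolling: every index a constant, no loop overhead. A hedged sketch of a hand-unrolled product C = A*B (the subroutine name is illustrative; the inner j loop walks contiguous columns of B and C in Fortran's column-major layout):

```fortran
! Hand-unrolled 3x3 product: each column of C computed with
! constant row indices; this is roughly what a compiler emits
! when it fully unrolls the textbook triple loop.
subroutine mm3x3(A, B, C)
   implicit none
   real*8, intent(in)  :: A(3,3), B(3,3)
   real*8, intent(out) :: C(3,3)
   integer j
   do j = 1, 3
      C(1,j) = A(1,1)*B(1,j) + A(1,2)*B(2,j) + A(1,3)*B(3,j)
      C(2,j) = A(2,1)*B(1,j) + A(2,2)*B(2,j) + A(2,3)*B(3,j)
      C(3,j) = A(3,1)*B(1,j) + A(3,2)*B(2,j) + A(3,3)*B(3,j)
   enddo
end subroutine mm3x3
```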

I am computing the inverse like so, and I use it in exactly the manner as described above. That is to say, I don't really care about the inverse, I just want (1) the vector vector_in_ref_coords(:) given the vector same_vector_in_current_coords(:), and (2) the determinant of the matrix. (I need the determinant for other computations.)

```
real*8 A(3,3)           ! input
real*8 Ainv(3,3), Adet  ! output

Ainv = 0d+0

Ainv(1,1) = A(2,2)*A(3,3) - A(3,2)*A(2,3)
Ainv(1,2) = A(3,2)*A(1,3) - A(1,2)*A(3,3)
Ainv(1,3) = A(1,2)*A(2,3) - A(1,3)*A(2,2)
! ... remaining rows of Ainv ...

Adet = Ainv(1,1)*A(1,1) + Ainv(1,2)*A(2,1) + Ainv(1,3)*A(3,1)

return ! end of invert3x3 matrix code
end subroutine invert3x3
```
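For reference, a sketch of the cofactor approach carried through to the full inverse: the six remaining adjugate entries follow the same pattern as the first row, the determinant costs exactly the three terms already shown, and one division finishes the job. (This is the standard adjugate formula, not the poster's complete routine, which was truncated above.)

```fortran
! Full 3x3 inverse by adjugate/determinant. Ainv first holds the
! adjugate (transposed cofactors); dividing by the determinant
! turns it into the inverse.
subroutine invert3x3_full(A, Ainv, Adet)
   implicit none
   real*8, intent(in)  :: A(3,3)
   real*8, intent(out) :: Ainv(3,3), Adet
   Ainv(1,1) = A(2,2)*A(3,3) - A(3,2)*A(2,3)
   Ainv(1,2) = A(3,2)*A(1,3) - A(1,2)*A(3,3)
   Ainv(1,3) = A(1,2)*A(2,3) - A(1,3)*A(2,2)
   Ainv(2,1) = A(3,1)*A(2,3) - A(2,1)*A(3,3)
   Ainv(2,2) = A(1,1)*A(3,3) - A(3,1)*A(1,3)
   Ainv(2,3) = A(2,1)*A(1,3) - A(1,1)*A(2,3)
   Ainv(3,1) = A(2,1)*A(3,2) - A(3,1)*A(2,2)
   Ainv(3,2) = A(3,1)*A(1,2) - A(1,1)*A(3,2)
   Ainv(3,3) = A(1,1)*A(2,2) - A(2,1)*A(1,2)
   ! expansion down the first column, reusing the cofactors
   Adet = Ainv(1,1)*A(1,1) + Ainv(1,2)*A(2,1) + Ainv(1,3)*A(3,1)
   Ainv = Ainv / Adet       ! adjugate -> inverse
end subroutine invert3x3_full
```

If only the determinant is wanted, stop after the Adet line: three cofactors and one dot product.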

- Fred

Quoting - nooj

> The OS is Ubuntu, probably 32-bit, I can never tell. I've never heard of affinity tools before.
>
> though I could make NSD a parameter, thus potentially knowable at compile time:
>
> ```
> module foo
>    integer, parameter :: NSD = 3
> end module foo
> ```
>
> Every graphics card for the past fifteen years has done more 4x4 matrix multiplications than I could possibly imagine. That's what makes me think there's got to be a fast way.

Everyone has permission to execute 'uname -a', which tells whether you have i386 (32-bit) or x86_64 (64-bit). If you have 64-bit, you need to ask specifically if you don't want 64-bit compilation, as by 'gfortran -m32' or by installing and invoking the ia32 version of ifort.

Your linux should include tools such as taskset, and OpenMP compilers such as gfortran and ifort observe environment variables such as GOMP_CPU_AFFINITY and KMP_AFFINITY in order to accomplish similar things.

MATMUL would be a simple substitution, but probably would accomplish little except to make your source code more readable. It's nothing like the level of effort usually involved in parallelizing.

I thought you understood the point we were discussing: you must set the size of the array at compile time in order for the compiler to unroll fully and eliminate loops, and that looks easy to do by parameter, as you say.

SSE parallel is well suited for 4x4; you should be able to find reference low-level source code. It might almost be worthwhile to use that code, ignoring the extra element.

You could evaluate the determinant by "brute force," in your terminology, faster than you could invert. No magic there.

Fred,

>>I guess I could have just said "coordinate transformations" because that's what they are.

Well, it would have helped if you said that.

Your original question was interpreted as (by me at least)

Arbitrary 3x3 matrix to multiply other arbitrary 3x3 matrix

It would have helped (me) if you requested

Arbitrary 3x3 matrix to multiply large number of arbitrary 3x3 matrices

The way to get this to be most efficient is to create a subroutine that is called once for the list of the large number of arbitrary 3x3 matrices. In this manner the transformation matrix has a better chance of remaining within the register set. If your large list of matrices (both input and output) is not conveniently stored together, then consider passing in a prebuilt array of pointers to 3x3 matrices.

```
type type_3x3
   real(8), target :: x(3,3)
end type type_3x3

type type_3x3_pointer
   type(type_3x3), pointer :: p
end type type_3x3_pointer

subroutine matmult_3x3_ListOf_3x3(inA, ListOf_B, ListOf_C, n)
   real(8) :: inA(3,3)
   integer :: n
   type(type_3x3_pointer) :: ListOf_B(n), ListOf_C(n)
   real(8) :: A(3,3)
   real(8) :: temp
   A = inA
   do iVector = 1, n
      do i = 1, 3
         do j = 1, 3
            temp = 0.0d0
            do k = 1, 3
               temp = temp + ListOf_B(iVector)%p%x(i,k) * A(k,j)
            end do
            ListOf_C(iVector)%p%x(i,j) = temp
         end do
      end do
   end do
end subroutine matmult_3x3_ListOf_3x3
```

Or something like that.

Jim Dempsey

Quoting - jimdempseyatthecove

> >>I guess I could have just said "coordinate transformations" because that's what they are.
>
> Well, it would have helped if you said that.
>
> Your original question was interpreted as (by me at least)
>
> Arbitrary 3x3 matrix to multiply other arbitrary 3x3 matrix
>
> It would have helped (me) if you requested
>
> Arbitrary 3x3 matrix to multiply large number of arbitrary 3x3 matrices

Jim -

It turns out that your original interpretation was the one I meant. Mathematically, my matrices are coordinate transformations, but each point has its own transformation (believe it or not).

Until you sent this, I didn't realize fortran had pointers!

Thanks for the reply, and my apologies for delaying my response.

- Fred

Quoting - nooj

> ```
> subroutine mysub()
>    use foo
>    real*8 a(NSD,NSD), b(NSD,NSD), c(NSD,NSD)
>    do i = 1,3
>       do j = 1,3
>          c(i,j) = sum( a(i,:)*b(:,j) )
>       enddo !j
>    enddo !i
> ```
>
> The matrix measures the deformation of an elastic material from its original configuration (ie, at rest, time t=0) into a new configuration (ie, with forces applied, at some time t>0). The matrix is used to map vectors in one configuration to what they would be in the other coordinate system if they were to ride on the elastic material:
>
> ```
> do i = 1,NSD
>    vector_in_ref_coords(i) = sum( Finv(i,:) * same_vector_in_current_coords(:) )
> enddo !i
> ```
>
> I guess I could have just said "coordinate transformations" because that's what they are.
>
> I am computing the inverse like so, and I use it in exactly the manner as described above. That is to say, I don't really care about the inverse, I just want (1) the vector vector_in_ref_coords(:) given the vector same_vector_in_current_coords(:), and (2) the determinant of the matrix. (I need the determinant for other computations.)

Fred, before I think any further, I just wanted to make sure I am understanding things correctly:

Above, Finv is the coordinate transform which not only rotates, but also stretches and shears?

Where do the matrix multiplications come into play? Are you just transforming three vectors simultaneously? (Perhaps some initial cube which is rotated, stretched and sheared?)

When you mention the inverse and the determinant do you mean of the transformation matrix? Do you actually need the inverse or just the determinant? It seems usually the inverse is needed to solve Ax=b for x. If this is the case will you solve for x given just one b, or many x's each corresponding to a different b?

Ultimately, besides the way you code it, using the right algorithm to solve the system may make a difference. Since it is not a large system, exact methods seem appropriate. Any special mathematical properties of your matrices may help expedite this. Since you mention you are seeking the determinant, I have inferred that the transformation does more than just rotate the coordinate system, since rotation matrices are orthogonal and the determinant of a rotation matrix is always 1. Anyway, I'm sure you already knew this last bit; just thinking out loud.
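Since only F⁻¹·v and the determinant are needed, one option is to apply the adjugate directly to the vector without ever storing the inverse: x = adj(A)·v / det(A). A hedged sketch (the subroutine name is illustrative; the cofactor expressions match the invert3x3 snippet above):

```fortran
! Solve A*x = v for a single right-hand side via the adjugate,
! producing the determinant as a byproduct; no 3x3 inverse stored.
subroutine solve3x3(A, v, x, Adet)
   implicit none
   real*8, intent(in)  :: A(3,3), v(3)
   real*8, intent(out) :: x(3), Adet
   real*8 :: c11, c12, c13
   ! first row of the adjugate, reused for the determinant
   c11 = A(2,2)*A(3,3) - A(3,2)*A(2,3)
   c12 = A(3,2)*A(1,3) - A(1,2)*A(3,3)
   c13 = A(1,2)*A(2,3) - A(1,3)*A(2,2)
   Adet = c11*A(1,1) + c12*A(2,1) + c13*A(3,1)
   x(1) = (c11*v(1) + c12*v(2) + c13*v(3)) / Adet
   x(2) = ((A(3,1)*A(2,3) - A(2,1)*A(3,3))*v(1) + (A(1,1)*A(3,3) - A(3,1)*A(1,3))*v(2) + (A(2,1)*A(1,3) - A(1,1)*A(2,3))*v(3)) / Adet
   x(3) = ((A(2,1)*A(3,2) - A(3,1)*A(2,2))*v(1) + (A(3,1)*A(1,2) - A(1,1)*A(3,2))*v(2) + (A(1,1)*A(2,2) - A(2,1)*A(1,2))*v(3)) / Adet
end subroutine solve3x3
```

If the same A is applied to many vectors, forming the inverse once and reusing it is cheaper than re-deriving the cofactors per vector.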

-Zaak

Fred,

Since each point has its own matrix, can you carry your transformation matrices flipped about the main diagonal, i.e. make the columns rows and the rows columns? If you can, then the vector instructions might be used more effectively by the compiler.

Jim Dempsey

ifort is entitled to perform loop interchange and unrolling optimizations, if you set -O3. Whether it will do so, you will need to check. It's possible you might get more optimization by using matmul where it's applicable. While early gfortran favored the style where you avoid matmul and dot_product and use only sum, I doubt any current compilers have that limitation.

If nooj can maintain the flipped transformation matrix, his inner loop will have

```
temp = temp + A(k,j)*B(k,i) ! notice index swap on A
```

This will vectorize to some extent.
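The index-swap idea in a small sketch: if At holds the transpose of the transform, each output component is a dot product over the first index, which is contiguous in Fortran's column-major layout and thus friendlier to SSE. (Names here are illustrative.)

```fortran
! Transform a vector using the transposed matrix At = transpose(A):
! w(i) = sum_k A(i,k)*v(k) = sum_k At(k,i)*v(k), and At(:,i) is a
! contiguous column, so the products stream through memory in order.
subroutine transform_t(At, v, w)
   implicit none
   real*8, intent(in)  :: At(3,3), v(3)
   real*8, intent(out) :: w(3)
   integer i
   do i = 1, 3
      w(i) = At(1,i)*v(1) + At(2,i)*v(2) + At(3,i)*v(3)
   enddo
end subroutine transform_t
```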

zbeekman:
> Above, Finv is the coordinate transform which not only rotates, but also
> stretches and shears?

Yes, and you are right, there is stretch and shear, so its determinant is
positive, but not necessarily 1.

I need both the inverse and the determinant at various times. And
remember that it's always a 3x3 matrix: can even vectorization provide
a speedup? The entire procedure and all the variables accessed can live
in the L1 cache.

> Where do the matrix multiplications come into play?

Allow me to elaborate.

60% of my computation time lies in the following routine,
"e3lhs_3d_struct", (at bottom). It is called roughly 28,000 times per
Newton-Raphson iteration. Below is the timing of my code using gprof
info for 3 Newton-Raphson iterations.

Three things to note:

First and foremost, ifort requires fully 4.5 minutes to compile this routine (compared with 0.5 seconds using just -O0), so it's doing the best job it can, for sure!

Second, 14% of time is spent in "vizgmv_nooj", but that is my results output routine, and its time is all disk access. Let's ignore that one, and all the others below it, for this conversation.

Third, for codes like mine, spending 60% of time in this routine is completely normal. But I think spending extra time on speedup now, when I have no pressing deadlines, is worthwhile, even if development time + run time ends up larger.

The exact compile line (as near as my records show) was

```
ifort -g -p              ! include gprof profiling calls
      -O3                ! max optimization
      -static            ! no shared libraries; static linking only
      -no-prec-div       ! enable optimizations that give slightly less precise results
      -unroll-aggressive ! complete unrolling for loops with small constant trip counts
      -ftrapuv           ! initialize stack local variables to an unusual value to aid error detection
      -vec-guard-write   ! perform a conditional check in a vectorized loop
      -sox               ! include the compiler options in the executable
      -r8 -i4 -u         ! all reals are 8 bytes; all ints are 4 bytes; implicit none
      -C                 ! obviously the -C slows things down; I'm not sure if I turned it off.
      -fpscomp logicals  ! .TRUE. == 1; .FALSE. == 0
```

Flat profile:

```
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
59.74     23.70     23.70    82944     0.00     0.00  e3lhs_3d_struct_
14.55     29.47      5.77        1     5.77     6.20  vizgmv_3d_nooj_ ! results output, ie disk access
 8.27     32.75      3.28      354     0.01     0.01  sparseprodfull_3d_
 6.25     35.23      2.48        3     0.83     1.92  sparsegmres_up_
 3.10     36.46      1.23    82944     0.00     0.00  eval_shape_3d_
 1.61     37.10      0.64    82944     0.00     0.00  e3int_
 0.86     37.44      0.34    27648     0.00     0.00  eval_shape_3d_orig_
 0.78     37.75      0.31  8957952     0.00     0.00  sparsematloc_3d_
 0.50     37.95      0.20        3     0.07     8.89  intelmass_3d_
 0.45     38.13      0.18                             __intel_new_memset
 0.43     38.30      0.17   829440     0.00     0.00  dersbasisfuns_
 0.43     38.47      0.17    27648     0.00     0.00  eval_face_3d_
```

And here is the routine: It's a bunch of matrix multiplications, and some
adding and placing into the xKebe matrix. Ftens, Finv, and Stens are 3x3,
symmetric and positive definite matrices. Positive definite means its
determinant is strictly positive. In fact, we can assume here that
everyone's determinant is bounded away from zero (even the inverses); thus,
condition numbers, floating point precision, overflow, etc. are not an
issue.

In the comments, I write the terms being calculated using Einstein summation
notation. If that already means something to you, great; if not, ignore the
comments with impunity. And if you happen to recognize the equation I am
solving and wonder why there is no proper mass or momentum term, it's because
the simulation is quasi-static (ie, at equilibrium).

Okay, anything else? Oh, yeah: Ctens has a lot of symmetry:

Ctens(I,J,K,L) = Ctens(J,I,K,L) (minor symmetry)
= Ctens(I,J,L,K) (minor symmetry)
= Ctens(K,L,I,J) (major symmetry)
= Ctens(L,K,I,J) (the symmetries combine)

That means Ctens has at most ??? unique elements. I could spoil your fun
by telling you how many, but I won't. The spoiler is behind this link:
http://en.wikipedia.org/wiki/Elasticity_tensor#Anisotropic_homogeneous_m...

The above link provides a way to represent Ctens as a symmetric 6x6 matrix
(ie, two indices) using the so-called Voigt Notation.

Unless there is a consensus that using the less-memory-intensive implementation
is the way to go, I'd rather avoid it. I'd have to change a lot of routines.

```
subroutine e3LHS_3D_struct (gwt, shg, shgradg, Ftens,
     & Stens, Ctens, xKebe)

! real*8, intent(in) :: gwt
! real*8, intent(in) :: shg
! real*8, intent(in) :: Ftens(NSD,NSD)
! real*8, intent(in) :: Stens(NSD,NSD)
! real*8, intent(in) :: Ctens(NSD,NSD,NSD,NSD)
! real*8, intent(out) :: xKebe(NSD*NSD,NSHLu,NSHLu)

implicit none
include "common.h"

integer aa, bb, i,j,k,l,m,o

& xKebe(NSD*NSD,NSHLu,NSHLu), fact1, pt5

real*8 Ftens(NSD,NSD), Stens(NSD,NSD), Ctens(NSD,NSD,NSD,NSD),
     & temp1(NSHLu,NSHLu), temp2(NSD,NSD,NSD,NSD),
     & temp3(NSD,NSD,NSD,NSD), temp4(NSHLu,NSHLu,NSD,NSD),
     & temp5(NSHLu,NSHLu,NSD,NSD), temp6(NSHLu,NSHLu)

pt5 = 1d+0/2d+0

! fact1 is a momentum term coefficient.
fact1 = 1d+0

temp2 = 0d+0 ! iJLo component
do i = 1, NSD
   do J = 1, NSD
      do L = 1, NSD
         do o = 1, NSD
            ! temp2_iJLo = F_iK C_KJLM F_oM (= F C F^T)
            temp2(i,J,L,o) = Ftens(i,1)*Ctens(1,J,L,1)*Ftens(o,1)
     &                     + Ftens(i,2)*Ctens(2,J,L,1)*Ftens(o,1)
     &                     + Ftens(i,3)*Ctens(3,J,L,1)*Ftens(o,1)
     &                     + Ftens(i,1)*Ctens(1,J,L,2)*Ftens(o,2)
     &                     + Ftens(i,2)*Ctens(2,J,L,2)*Ftens(o,2)
     &                     + Ftens(i,3)*Ctens(3,J,L,2)*Ftens(o,2)
     &                     + Ftens(i,1)*Ctens(1,J,L,3)*Ftens(o,3)
     &                     + Ftens(i,2)*Ctens(2,J,L,3)*Ftens(o,3)
     &                     + Ftens(i,3)*Ctens(3,J,L,3)*Ftens(o,3)
         enddo
      enddo
   enddo
enddo
temp2(:,:,:,:) = pt5*fact1*temp2(:,:,:,:)

! Ctens(:,:,L,M) == Ctens(:,:,M,L) implies temp3 == temp2.
temp3 = temp2

! Build material nonlinearities
temp4 = 0d+0
do AA = 1, NSHLu
   do BB = 1, NSHLu
      do i = 1, NSD
         do o = 1, NSD
            ! temp4_AABBio = N_AA,J temp2_iJLo N_BB,L
            !              = N_AA,J F_iK C_KJLM F_oM N_BB,L
            temp4(AA,BB,i,o)
         enddo
      enddo
   enddo
enddo

! temp3 == temp2 implies temp5 == temp4.
temp5 = temp4

! Build geometric nonlinearity
temp6 = 0d+0
do AA = 1, NSHLu
   do BB = 1, NSHLu
      ! temp6_AABB = N_AA,J S_IJ N_BB,I
   enddo
enddo
temp6(:,:) = fact1*temp6(:,:)

! Thinking of xKebe(:,AA,BB)
! as a second-order tensor K_AABB(:,:)
! (and similarly for the temps), we have
!
! K_AABB(:,:) = DetJ*gwt*(
!    temp1_AABB*Itens(:,:)    ! Mass
!    temp4_AABB(:,:)          ! Mat Stiff
!    temp5_AABB(:,:)          ! Mat Stiff
!    temp6_AABB*Itens(:,:)    ! Geom Stiff
! )
! = DetJ*gwt*(
!    almi*N_AA*rho*N_BB * Itens(:,:) (or 0 in quasi-static case)
!    N_AA,J F_:K C_KJLM F_:M N_BB,L
!    N_AA,J F_:K C_KJLM F_:L N_BB,M
!    N_AA,J S_IJ N_BB,I * Itens(:,:)
! )

! diagonal entries (1,5,9)
xKebe (1,:,:) = xKebe(1,:,:) +
     & (temp4(:,:,1,1) +           ! Mat Stiff
     &  temp5(:,:,1,1) +           ! Mat Stiff
     &  temp6(:,:) )*DetJ*gwt      ! Geom Stiff

xKebe (5,:,:) = xKebe(5,:,:) +
     & (temp4(:,:,2,2) +           ! Mat Stiff
     &  temp5(:,:,2,2) +           ! Mat Stiff
     &  temp6(:,:) )*DetJ*gwt      ! Geom Stiff

xKebe (9,:,:) = xKebe(9,:,:) +
     & (temp4(:,:,3,3) +           ! Mat Stiff
     &  temp5(:,:,3,3) +           ! Mat Stiff
     &  temp6(:,:) )*DetJ*gwt      ! Geom Stiff

! off-diagonal entries
xKebe (2,:,:) = xKebe(2,:,:) +
     & (temp4(:,:,1,2) +           ! Mat Stiff
     &  temp5(:,:,1,2) )*DetJ*gwt

xKebe (4,:,:) = xKebe(4,:,:) +
     & (temp4(:,:,2,1) +           ! Mat Stiff
     &  temp5(:,:,2,1) )*DetJ*gwt

xKebe (3,:,:) = xKebe(3,:,:) +
     & (temp4(:,:,1,3) +           ! Mat Stiff
     &  temp5(:,:,1,3) )*DetJ*gwt

xKebe (7,:,:) = xKebe(7,:,:) +
     & (temp4(:,:,3,1) +           ! Mat Stiff
     &  temp5(:,:,3,1) )*DetJ*gwt

xKebe (6,:,:) = xKebe(6,:,:) +
     & (temp4(:,:,2,3) +           ! Mat Stiff
     &  temp5(:,:,2,3) )*DetJ*gwt

xKebe (8,:,:) = xKebe(8,:,:) +
     & (temp4(:,:,3,2) +           ! Mat Stiff
     &  temp5(:,:,3,2) )*DetJ*gwt

return
end
```

The following is a rework of your code. This adds some temporary arrays and in the process should reduce some work. The code has not been tested.

```
subroutine e3LHS_3D_struct (gwt, shg, shgradg, Ftens,
& Stens, Ctens, xKebe)

! real*8, intent(in) :: gwt
! real*8, intent(in) :: shg
! real*8, intent(in) :: Ftens(NSD,NSD)
! real*8, intent(in) :: Stens(NSD,NSD)
! real*8, intent(in) :: Ctens(NSD,NSD,NSD,NSD)
! real*8, intent(out) :: xKebe(NSD*NSD,NSHLu,NSHLu)

implicit none
include "common.h"

integer aa, bb, i,j,k,l,m,o

& xKebe(NSD*NSD,NSHLu,NSHLu), fact1, pt5

!dec$ attributes align : 16 :: temp1
!dec$ attributes align : 16 :: temp2
!dec$ attributes align : 16 :: temp3
!dec$ attributes align : 16 :: temp4
!dec$ attributes align : 16 :: temp5
!dec$ attributes align : 16 :: temp6
!dec$ attributes align : 16 :: temp45

real*8 Ftens(NSD,NSD), Stens(NSD,NSD), Ctens(NSD,NSD,NSD,NSD),
& temp1(NSHLu,NSHLu), temp2(NSD,NSD,NSD,NSD),
& temp3(NSD,NSD,NSD,NSD), temp4(NSHLu,NSHLu,NSD,NSD),
& temp5(NSHLu,NSHLu,NSD,NSD), temp6(NSHLu,NSHLu),
& temp45(NSHLu,NSHLu,NSD,NSD)

! use aligned stack local arrays (automatic)
! for pieces of Ftens.
! likelihood of being placed in XMM registers high
!dec$ attributes align : 16 :: FtensI
!dec$ attributes align : 16 :: FtensO
!dec$ attributes align : 16 :: shgradgAA
!dec$ attributes align : 16 :: shgradgBB
!dec$ attributes align : 16 :: Stens1
!dec$ attributes align : 16 :: Stens2
!dec$ attributes align : 16 :: Stens3
real*8, automatic :: FtensI(3), FtensO(3)
real*8, automatic :: Stens1(3), Stens2(3), Stens3(3)

real*8 :: pt5fact1

pt5 = 1d+0/2d+0

! fact1 is a momentum term coefficient.
fact1 = 1d+0

pt5fact1 = pt5*fact1

! All of temp2 is written in nested loops immediately following
! no requirement to zero out array
! temp2 = 0d+0 ! iJLo component
do i = 1, NSD
! copy interested part of Ftens to SSE aligned array
! possible XMM registered variable
FtensI(1)= Ftens(i,1)
FtensI(2)= Ftens(i,2)
FtensI(3)= Ftens(i,3)
! swap order of index (make life easier for compiler optimizations)
! copy interested part of Ftens to SSE aligned array
! possible XMM registered variable
do o = 1, NSD
FtensO(1)= Ftens(o,1)
FtensO(2)= Ftens(o,2)
FtensO(3)= Ftens(o,3)
do J = 1, NSD
do L = 1, NSD
! temp2_iJLo = F_iK C_KJLM F_oM (= F C F^T)
temp2(i,J,L,o) =
&        (FtensI(1)*Ctens(1,J,L,1)*FtensO(1)
&       + FtensI(2)*Ctens(2,J,L,1)*FtensO(1)
&       + FtensI(3)*Ctens(3,J,L,1)*FtensO(1)
&       + FtensI(1)*Ctens(1,J,L,2)*FtensO(2)
&       + FtensI(2)*Ctens(2,J,L,2)*FtensO(2)
&       + FtensI(3)*Ctens(3,J,L,2)*FtensO(2)
&       + FtensI(1)*Ctens(1,J,L,3)*FtensO(3)
&       + FtensI(2)*Ctens(2,J,L,3)*FtensO(3)
&       + FtensI(3)*Ctens(3,J,L,3)*FtensO(3))*pt5fact1
enddo
enddo
enddo
enddo
! the following was performed inside above nested loops
! temp2(:,:,:,:) = pt5*fact1*temp2(:,:,:,:)

! Ctens(:,:,L,M) == Ctens(:,:,M,L) implies temp3 == temp2.
! therefore temp3 references can use temp2 and temp3 can be removed
! temp3 = temp2

! Build material nonlinearities
! all of temp4 is written in following loops
! no need to zero
! temp4 = 0d+0
do AA = 1, NSHLu
do BB = 1, NSHLu
! change order of inner two loops
do o = 1, NSD
do i = 1, NSD
! temp4_AABBio = N_AA,J temp2_iJLo N_BB,L
! = N_AA,J F_iK C_KJLM F_oM N_BB,L
temp4(AA,BB,i,o)
enddo
enddo
enddo
enddo

! temp3 == temp2 implies temp5 == temp4.
! temp5 not used; remove
! temp5 = temp4

! reorder Stens
Stens1(1) = Stens(1,1)
Stens1(2) = Stens(1,2)
Stens1(3) = Stens(1,3)
Stens2(1) = Stens(2,1)
Stens2(2) = Stens(2,2)
Stens2(3) = Stens(2,3)
Stens3(1) = Stens(3,1)
Stens3(2) = Stens(3,2)
Stens3(3) = Stens(3,3)
! Build geometric nonlinearity
! all of temp6 is written, remove initialization
!temp6 = 0d+0
do AA = 1, NSHLu
do BB = 1, NSHLu
! temp6_AABB = N_AA,J S_IJ N_BB,I
temp6(AA,BB) =
enddo
enddo
! multiplication taken care of inside above loops
! temp6(:,:) = fact1*temp6(:,:)

! Thinking of xKebe(:,AA,BB)
! as a second-order tensor K_AABB(:,:)
! (and similarly for the temps), we have
!
! K_AABB(:,:) = DetJ*gwt*(
! temp1_AABB*Itens(:,:) ! Mass
! temp4_AABB(:,:) ! Mat Stiff
! temp5_AABB(:,:) ! Mat Stiff
! temp6_AABB*Itens(:,:) ! Geom Stiff
! )
! = DetJ*gwt*(
! almi*N_AA*rho*N_BB * Itens(:,:) (or 0 in quasi-static case)
! N_AA,J F_:K C_KJLM F_:M N_BB,L
! N_AA,J F_:K C_KJLM F_:L N_BB,M
! N_AA,J S_IJ N_BB,I * Itens(:,:)
! )

! apply weights to temps
! consolidate temp4 and temp5
temp45 = (temp4 + temp5) * DetJ*gwt
temp6 = temp6 * DetJ*gwt
! diagonal entries (1,5,9)
xKebe (1,:,:) = xKebe(1,:,:) +
& temp45(:,:,1,1) + ! Mat Stiff
& temp6(:,:)        ! Geom Stiff

xKebe (5,:,:) = xKebe(5,:,:) +
& temp45(:,:,2,2) + ! Mat Stiff
& temp6(:,:)        ! Geom Stiff

xKebe (9,:,:) = xKebe(9,:,:) +
& temp45(:,:,3,3) + ! Mat Stiff
& temp6(:,:)        ! Geom Stiff

! off-diagonal entries
xKebe (2,:,:) = xKebe(2,:,:) +
& temp45(:,:,1,2)   ! Mat Stiff

xKebe (4,:,:) = xKebe(4,:,:) +
& temp45(:,:,2,1)   ! Mat Stiff

xKebe (3,:,:) = xKebe(3,:,:) +
& temp45(:,:,1,3)   ! Mat Stiff

xKebe (7,:,:) = xKebe(7,:,:) +
& temp45(:,:,3,1)   ! Mat Stiff

xKebe (6,:,:) = xKebe(6,:,:) +
& temp45(:,:,2,3)   ! Mat Stiff

xKebe (8,:,:) = xKebe(8,:,:) +
& temp45(:,:,3,2)   ! Mat Stiff

return
end
```

Jim Dempsey

Just wondering why you didn't make the J loop innermost at line 75.

Thanks
Dan

Quoting - jimdempseyatthecove

> The following is a rework of your code. This adds some temporary arrays and in the process should reduce some work. The code has not been tested.

In Fortran (as opposed to C/C++) the leftmost index represents adjacent data (in multi-dimensional arrays). In C/C++ it is the other way around. In this example, J and L are interior indices. Order may only be significant under paging circumstances. In any event, the changes outlined in my post will provide the most impact on performance. As is often the case, there is almost always room for improvement.

Jim Dempsey