I perform tens of millions of 3x3 matrix products. The code is already parallelized using a very efficient domain decomposition, so let's consider only single-threaded code for now. (At the end of the post I mention my target architecture; maybe you will recognize a way to use multithreading to speed up the computation.)

My program is nothing but computations like this, and takes days to run on very fast cluster systems. Every little bit will help!

Due to legacy code, the matrix dimension, NSD, is stored in a common block. Its value is always 3.

The two methods of matrix multiplication I know of are:

do i = 1,NSD
  do j = 1,NSD
    C(i,j) = 0
    do k = 1,NSD
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo !k
  enddo !j
enddo !i

and

do i = 1,NSD
  do j = 1,NSD
    C(i,j) = sum( A(i,:)*B(:,j) )
  enddo !j
enddo !i

Can these loops be unrolled, given that the loop bound NSD is only known at runtime? Is vectorizing a three-iteration loop worth the bother? Are there other speedup possibilities? (Again, "macroscopic" speedups through MPI and OpenMP have already been exploited to their limit.) Ten to fifteen significant figures of accuracy is plenty, so I am comfortable with flags such as -O3 and -fast.
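For reference, since NSD is always 3, full unrolling can also be done by hand rather than left to the compiler. A sketch of what I mean (the subroutine name, `real*8` kind, and argument order here are my own choices, not taken from the actual code):

```fortran
      subroutine matmul33(A, B, C)
        implicit none
        real*8, intent(in)  :: A(3,3), B(3,3)
        real*8, intent(out) :: C(3,3)
        ! Every entry written out explicitly: no loop overhead, no
        ! runtime bound, and the compiler may schedule all 27
        ! multiplies and 18 adds freely.
        C(1,1) = A(1,1)*B(1,1) + A(1,2)*B(2,1) + A(1,3)*B(3,1)
        C(2,1) = A(2,1)*B(1,1) + A(2,2)*B(2,1) + A(2,3)*B(3,1)
        C(3,1) = A(3,1)*B(1,1) + A(3,2)*B(2,1) + A(3,3)*B(3,1)
        C(1,2) = A(1,1)*B(1,2) + A(1,2)*B(2,2) + A(1,3)*B(3,2)
        C(2,2) = A(2,1)*B(1,2) + A(2,2)*B(2,2) + A(2,3)*B(3,2)
        C(3,2) = A(3,1)*B(1,2) + A(3,2)*B(2,2) + A(3,3)*B(3,2)
        C(1,3) = A(1,1)*B(1,3) + A(1,2)*B(2,3) + A(1,3)*B(3,3)
        C(2,3) = A(2,1)*B(1,3) + A(2,2)*B(2,3) + A(2,3)*B(3,3)
        C(3,3) = A(3,1)*B(1,3) + A(3,2)*B(2,3) + A(3,3)*B(3,3)
      end subroutine matmul33
```

Is hand-unrolling like this ever worthwhile with ifort, or will the compiler do at least as well on its own once it can prove NSD = 3?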

I use ifort and MKL on Linux.

What about computing the 3x3 matrix inverse?
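On the inverse: for a 3x3 matrix, the closed-form cofactor (adjugate) formula avoids any factorization routine entirely. A sketch (again, the subroutine name and kind are my own; note there is no pivoting, so accuracy degrades for near-singular A):

```fortran
      subroutine inv33(A, B)
        implicit none
        real*8, intent(in)  :: A(3,3)
        real*8, intent(out) :: B(3,3)
        real*8 :: det
        ! Determinant by cofactor expansion along the first row.
        det = A(1,1)*(A(2,2)*A(3,3) - A(2,3)*A(3,2)) &
            - A(1,2)*(A(2,1)*A(3,3) - A(2,3)*A(3,1)) &
            + A(1,3)*(A(2,1)*A(3,2) - A(2,2)*A(3,1))
        ! B = adj(A)/det: entry (i,j) is the (j,i) cofactor over det.
        B(1,1) =  (A(2,2)*A(3,3) - A(2,3)*A(3,2)) / det
        B(1,2) = -(A(1,2)*A(3,3) - A(1,3)*A(3,2)) / det
        B(1,3) =  (A(1,2)*A(2,3) - A(1,3)*A(2,2)) / det
        B(2,1) = -(A(2,1)*A(3,3) - A(2,3)*A(3,1)) / det
        B(2,2) =  (A(1,1)*A(3,3) - A(1,3)*A(3,1)) / det
        B(2,3) = -(A(1,1)*A(2,3) - A(1,3)*A(2,1)) / det
        B(3,1) =  (A(2,1)*A(3,2) - A(2,2)*A(3,1)) / det
        B(3,2) = -(A(1,1)*A(3,2) - A(1,2)*A(3,1)) / det
        B(3,3) =  (A(1,1)*A(2,2) - A(1,2)*A(2,1)) / det
      end subroutine inv33
```

Would something like this beat calling an MKL/LAPACK factorize-and-invert pair at this size, or is there a faster approach I'm missing?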

The target machines are

- 64-bit dual-core Intel Xeon processors
- Dell PowerEdge 2950 workstations with two dual-core, hyper-threading 3.73 GHz Xeon processors

If this means something to you, awesome.