How to calculate a multiplication of two matrices efficiently?

How to calculate a multiplication of two matrices efficiently?

Hello, everyone!

I want to calculate the product of two matrices. This could be done, of course, by using the intrinsic function MATMUL(A,B) where A and B are two matrices. But I found this is not very efficient, especially for large matrices, e.g. two 1000-by-1000 matrices.
Could anyone give me some suggestions on doing the multiplication EFFICIENTLY? Thank you in advance!
7 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice. Quoting - woshiwuxin
I want to calculate the product of two matrices. This could be done, of course, by using the intrinsic function MATMUL(A,B) where A and B are two matrices. But I found this is not very efficient, especially for large matrices, e.g. two 1000-by-1000 matrices.
Could anyone give me some suggestions on doing the multiplication EFFICIENTLY? Thank you in advance!

This is where we let library developers do the work for us. MKL comes with the Intel compilers.
By the way, MATMUL probably optimizes much better with ifort at -O3, but I think does not yet perform any automatic threading, nor observe OMP PARALLEL WORKSHARE on compilers for Intel platforms.
gfortran provides options to translate MATMUL into an MKL call, but this implies a temporary array in some cases where direct BLAS usage avoids it.

I wrote a short program to compare the performances of doing matrix multiplications (two 2000-by-2000 matrices) in three different ways: i) a triple do-loop; ii) an intrinsic matmul(a,b) function; and iii) ?gemm subroutine in INTEL MKL. The differences in CPU time are noticeable. ?gemm is very much efficient!

A triple do-loop takes 13.025 seconds.
A matmul(a,b) function takes 32.910 seconds.
A DGEMM subroutine takes 1.872 seconds.

The source code is also attached.
The compile command is

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! This program compares the performances of the matrix multiplication
! implemented in terms of i) a triple do-loop; ii) a fortran intrinsic
! function - matmul(a,b); and iii) a subroutine in INTEL MKL - ?GEMM.
program test
implicit none
! The order of the square matrices is 2000.
integer(kind=4)::n=2000
! Calculate the matrix multiplications:
! i) c:=a*b in a triple do-loop.
! ii) d:=a*b by matmul(a,b).
! iii) e:=a*b by dgemm in INTEL MKL.
real(kind=8),allocatable::a(:,:),b(:,:),c(:,:),d(:,:),e(:,:)
real(kind=8)::alpha,beta
integer(kind=4)::i,j,k,lda,ldb,lde
real(kind=8)::start,finish

allocate(a(n,n),b(n,n),c(n,n),d(n,n),e(n,n))
alpha=1.0;beta=1.0
lda=n;ldb=n;lde=n

! Generate the matrices, a and b, randomly.
call cpu_time(start)
call random_seed()
do j=1, n
do i=1, n
call random_number(a(i,j))
call random_number(b(i,j))
enddo
enddo
call cpu_time(finish)
write(unit=6,fmt=100) "The generation of two matrices takes ",finish-start," seconds."

! i) c:=a*b in a triple do-loop.
call cpu_time(start)
c=0.0D0
do j=1, n
do i=1, n
do k=1, n
c(i,j)=c(i,j)+a(i,k)*b(k,j)
enddo
enddo
enddo
call cpu_time(finish)
write(unit=6,fmt=100) "A triple do-loop takes ",finish-start," seconds."

! ii) d:=a*b by matmul(a,b).
call cpu_time(start)
d=0.0D0
d=matmul(a,b)
call cpu_time(finish)
write(unit=6,fmt=100) "A matmul(a,b) function takes ",finish-start," seconds."

! iii) e:=a*b by dgemm in INTEL MKL.
call cpu_time(start)
e=0.0D0
call dgemm("N","N",n,n,n,alpha,a,lda,b,ldb,beta,e,lde)
call cpu_time(finish)
write(unit=6,fmt=100) "A DGEMM subroutine takes ",finish-start," seconds."

deallocate(a,b,c,d,e)

stop
100 format(A,F8.3,A)
end program test -O3 -xSSE3 would be required to get automatic optimization of your DO loops. This is an odd situation where an Intel-only option is needed for full optimization of nested DO loops, but it appears to make the MATMUL worse.

Quoting - tim18
-O3 -xSSE3 would be required to get automatic optimization of your DO loops. This is an odd situation where an Intel-only option is needed for full optimization of nested DO loops, but it appears to make the MATMUL worse.

Yes! That's true!

Without any ifort options.

ifort test.f90
The generation of two matrices takes 0.184 seconds.
A triple do-loop takes 12.904 seconds.
A matmul(a,b) function takes 32.777 seconds.

With ifort options.
ifort -O3 -xSSSE3 test.f90
The generation of two matrices takes 0.278 seconds.
A triple do-loop takes 4.857 seconds.
A matmul(a,b) function takes 49.055 seconds.

Hardware: Intel Core 2 Duo 2.4 GHz
OS: Mac OS X Leopard

I think the best choice is to use ?GEMM subroutine, and the matrix multiplication has been optimized for parallel compuations. This is awesome! MATMUL is most likely to give competitive performance for matrices which aren't large enough for full performance in DGEMM, say up to 15x15, so there isn't a need for threading.
I'm somewhat surprised that you saw such a speedup for DGEMM, when you timed by cpu_time, which should give you the total time for all threads (increasing with number of threads), while omp_get_wtime() would show you the elapsed time.
In the past, there were compilers which would choose automatically between DGEMM and MATMUL, and should have given the same performance for each case, supposing that it is possible for the compiler to guess the problem size accurately.
The choice of allocatable arrays appears to have broken the compiler's ability to optimize MATMUL for the problem size.

Quoting - tim18
MATMUL is most likely to give competitive performance for matrices which aren't large enough for full performance in DGEMM, say up to 15x15, so there isn't a need for threading.
I'm somewhat surprised that you saw such a speedup for DGEMM, when you timed by cpu_time, which should give you the total time for all threads (increasing with number of threads), while omp_get_wtime() would show you the elapsed time.
In the past, there were compilers which would choose automatically between DGEMM and MATMUL, and should have given the same performance for each case, supposing that it is possible for the compiler to guess the problem size accurately.
The choice of allocatable arrays appears to have broken the compiler's ability to optimize MATMUL for the problem size.

1) For matrix multiplications of small sizes, i.e. less than 100, the performances for the three approaches are nearly the same.
2) If I use
integer(kind=4),parameter::n=2000
to specify the orders of those matrices, and declare them, but NOT allocate them. DGEMM is still better than the others.
3) I tried to link the program with a blas library provided by Apple, and DGEMM is also better, but not as good as that observed by a blas library provided by Intel.

So I think ?gemm subroutine is better for large matrice multiplications, although the blas performances provided by different vendors might be slightly different.

i) The following links with the Apple blas lib.
ifort -lblas test.f90
otool -L a.out
a.out:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib (compatibility version 1.0.0, current version 218.0.0)
/usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libmx.A.dylib (compatibility version 1.0.0, current version 292.4.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.4)
./a.out
The generation of two matrices takes 0.186 seconds.
A triple do-loop takes 13.042 seconds.
A matmul(a,b) function takes 32.843 seconds.
A DGEMM subroutine takes 3.316 seconds.

2) The following links with the Intel blas lib.
ifort -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread test.f90
otool -L a.out
a.out:
libmkl_intel_lp64.dylib (compatibility version 0.0.0, current version 0.0.0)
libmkl_sequential.dylib (compatibility version 0.0.0, current version 0.0.0)
libmkl_core.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.4)
/usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libmx.A.dylib (compatibility version 1.0.0, current version 292.4.0)
./a.out
The generation of two matrices takes 0.189 seconds.
A triple do-loop takes 13.089 seconds.
A matmul(a,b) function takes 33.056 seconds.
A DGEMM subroutine takes 1.840 seconds.