Intel® MKL provides the general purpose BLAS* matrix multiply routines ?GEMM defined as follows:
C := alpha*op(A)*op(B) + beta*C
where alpha and beta are scalars, op(A) is an m-by-k matrix, op(B) is a k-by-n matrix, C is an m-by-n matrix, with op(X) being either X, or XT, or XH.
In this article, we describe a ?GEMM extension called ?HAMM that allows us to update only the upper or lower triangular part of the result matrix at performance levels close to that which ?GEMM provides. When both C and op(A)*op(B) are symmetric, we are able to exploit these symmetries by only updating the upper (or lower) triangular part of the output matrix C. This lower or upper triangular update requires approximately half of the floating-point computations of ?GEMM, and thus can be computed in roughly half the time of ?GEMM.
A naïve ?HAMM implementation could simply be a call to ?GEMM where we ignore the upper or lower half of the computed result, but that approach would not provide the 50% performance boost sought over a single ?GEMM call. If we instead employ a recursive divide-and-conquer approach to compute just the upper or lower half of the result matrix, then we should be able to get within about 80% of ?GEMM performance. The first step of the algorithm is shown in Figure 1. First we call ?GEMM on just some of the rows and columns of A, B, and C to compute the rectangular submatrix C1. In Figure 2, we do the same again for the smaller triangular parts of Figure 1 to produce C2 and C3. We then continue this process, recursively, until the the full upper or lower triangular result has been computed. We stop the recursion when the problem size reaches a predefined threshold, which is the base case of the recursive algorithm. We solve the base case using the naïve algorithm, which involves computing a small submatrix no bigger than the threshold size on either dimension. The brown blocks in Figure 3 shows the base case situation. Note that some spurious computation (in light brown) are done in the naïve algorithm -- we just simply discard these elements because they are not needed.
Using DHAMM as an example (where the data type is double precision floating point), the interface of DHAMM in C is:
void dhamm(char *uplo, char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);
As a comparison, the interface of DGEMM looks like:
void dgemm(char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);
The only difference is that DHAMM has an extra uplo argument in the first position of the argument list.
A sample implementation is provided for DHAMM. See the link at the bottom of this article to download the source code.
Benchmarking on a two-socket Intel® Xeon® E5 processor shows that the recursive algorithm can get within roughly 80% of DGEMM performance, while the naïve implementation slowly approaches 50% of the performance of DGEMM.
- If matrix C was symmetric or Hermitian before calling ?HAMM, then this symmetry is not preserved during this operation, unless the product op(A)*op(B) is symmetric or Hermitian, for example, when op(A)*op(B) = F*D*FT, where D is a diagonal matrix and F is a general matrix.
- The default value for block size threshold is suggested to be 512. If the problem size is very large then a threshold larger than 512 may be used. Similarly, if the problem size is very small then a threshold smaller than 512 may be used.
- The sample implementation source code is only meant to be used as an example. It is not rigorously tested and does not necessarily exhibit all elements of good programming practice. Users should build their own solutions based on it.