Intel® MKL provides the general purpose BLAS* matrix multiply routines ?GEMM defined as follows:
C := alpha*op(A)*op(B) + beta*C
where alpha and beta are scalars, op(A) is an m-by-k matrix, op(B) is a k-by-n matrix, C is an m-by-n matrix, with op(X) being either X, or XT, or XH.
In this article, we describe a ?GEMM extension called ?HAMM that allows us to update only the upper or lower triangular part of the result matrix at performance levels close to that which ?GEMM provides. When both C and op(A)*op(B) are symmetric, we are able to exploit these symmetries by only updating the upper (or lower) triangular part of the output matrix C. This lower or upper triangular update requires approximately half of the floating-point computations of ?GEMM, and thus can be computed in roughly half the time of ?GEMM.
A naïve ?HAMM implementation could simply be a call to ?GEMM where we ignore the upper or lower half of the computed result, but that approach would not provide the 50% performance boost sought over a single ?GEMM call. If we instead employ a recursive divide-and-conquer approach to compute just the upper or lower half of the result matrix, then we should be able to get within about 80% of ?GEMM performance. The first step of the algorithm is shown in Figure 1. First we call ?GEMM on just some of the rows and columns of A, B, and C to compute the rectangular submatrix C1. In Figure 2, we do the same again for the smaller triangular parts of Figure 1 to produce C2 and C3. We then continue this process, recursively, until the the full upper or lower triangular result has been computed. We stop the recursion when the problem size reaches a predefined threshold, which is the base case of the recursive algorithm. We solve the base case using the naïve algorithm, which involves computing a small submatrix no bigger than the threshold size on either dimension. The brown blocks in Figure 3 shows the base case situation. Note that some spurious computation (in light brown) are done in the naïve algorithm -- we just simply discard these elements because they are not needed.
Using DHAMM as an example (where the data type is double precision floating point), the interface of DHAMM in C is:
void dhamm(char *uplo, char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);
As a comparison, the interface of DGEMM looks like:
void dgemm(char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);
The only difference is that DHAMM has an extra uplo argument in the first position of the argument list.
A sample implementation is provided for DHAMM. See the link at the bottom of this article to download the source code.
Benchmarking on a two-socket Intel® Xeon® E5 processor shows that the recursive algorithm can get within roughly 80% of DGEMM performance, while the naïve implementation slowly approaches 50% of the performance of DGEMM.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804