# A Matrix Multiplication Routine that Updates Only the Upper or Lower Triangular Part of the Result Matrix

### Background

Intel® MKL provides the general purpose BLAS*  matrix multiply routines ?GEMM defined as follows:

`C := alpha*op(A)*op(B) + beta*C`

where alpha and beta are scalars, op(A) is an m-by-k matrix, op(B) is a k-by-n matrix, C is an m-by-n matrix, withop(X) being either X, or XT, or XH.

In this article, we describe a ?GEMM extension called ?HAMM  that allows us to update only the upper or lower triangular part of the result matrix at performance levels close to that which ?GEMM provides. When both C and op(A)*op(B) are symmetric, we are able to exploit these symmetries by only updating  the upper (or lower)  triangular part of the output matrix C. This lower or upper triangular update requires approximately half of the floating-point computations of ?GEMM, and thus can be computed in roughly half the time of ?GEMM.

### Algorithm

A naïve ?HAMM  implementation could simply be a call to ?GEMM where we ignore the upper or lower half of the computed result, but that approach would not provide the 50% performance boost sought over a single ?GEMM call.  If we instead employ a recursive divide-and-conquer approach to compute just the upper or lower half of the result matrix, then we should be able to get within about 80% of ?GEMM performance.  The first step of the algorithm is shown in Figure 1.  First we call ?GEMM  on just some of the rows and columns of A, B, and C to compute the rectangular submatrix C1. In Figure 2, we do the same again for the smaller triangular parts of Figure 1 to produce C2 and C3. We then continue this process, recursively, until the the full upper or lower triangular result has been computed. We stop the recursion when the problem size reaches a predefined threshold, which is the base case of the recursive algorithm. We solve the base case using the naïve algorithm, which involves computing a small submatrix no bigger than the threshold size on either dimension.  The brown blocks in Figure 3 shows the base case situation. Note that some spurious computation (in light brown) are done in the naïve algorithm -- we just simply discard these elements because they are not needed.

Using DHAMM as an example (where the data type is double precision floating point), the interface of DHAMM in C is:

```
void dhamm(char *uplo, char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);

```

As a comparison, the interface of DGEMM looks like:

```
void dgemm(char *transa, char *transb, MKL_INT *m, MKL_INT *n, MKL_INT *k, double *alpha, double *A, MKL_INT *lda, double *B, MKL_INT *ldb, double *beta, double *C, MKL_INT* ldc);

```

The only difference is that DHAMM has an extra uplo argument in the first position of the argument list.