By Zhang Zhang, Published: 02/15/2013, Last Updated: 02/14/2013

Intel® MKL provides the general-purpose BLAS* matrix multiply routines ?GEMM, defined as follows:

C := alpha*op(A)*op(B) + beta*C

where *alpha* and *beta* are scalars, *op(A)* is an *m*-by-*k* matrix, *op(B)* is a *k*-by-*n* matrix, *C* is an *m*-by-*n* matrix, and *op(X)* is either *X*, *X^{T}*, or *X^{H}* (the conjugate transpose).

In this article, we describe a ?GEMM extension called ?HAMM that updates only the upper or lower triangular part of the result matrix, at performance levels close to those of ?GEMM. When both *C* and the product *op(A)*op(B)* are symmetric, we can exploit this symmetry by updating only the upper (or lower) triangular part of the output matrix *C*. This triangular update requires approximately half the floating-point operations of ?GEMM, and thus can be computed in roughly half the time of ?GEMM.
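The "approximately half" claim can be counted directly. For a square *n*-by-*n* result with inner dimension *k*, using the usual convention of two flops (one multiply, one add) per term, a sketch of the count is:

```latex
% Full ?GEMM update of an n-by-n matrix C:
\mathrm{flops}_{\mathrm{GEMM}} = 2n^2k
% One triangle of C has n(n+1)/2 entries, each needing 2k flops:
\mathrm{flops}_{\mathrm{HAMM}} = 2k \cdot \frac{n(n+1)}{2} = n(n+1)k \approx n^2k
```

so the triangular update performs roughly half the arithmetic of the full product.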

A naïve ?HAMM implementation could simply call ?GEMM and ignore the upper or lower half of the computed result, but that approach would not provide the 50% performance boost sought over a single ?GEMM call. If we instead employ a recursive divide-and-conquer approach that computes just the upper or lower half of the result matrix, then we should be able to get within about 80% of ?GEMM performance. The first step of the algorithm is shown in Figure 1. First we call ?GEMM on a subset of the rows and columns of *A*, *B*, and *C* to compute the rectangular submatrix *C _{1}*. In Figure 2, we apply the same step to the two smaller triangular parts remaining from Figure 1, and the recursion continues until the remaining triangles are small enough to compute directly.
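The recursive scheme can be sketched as follows for a lower-triangular update. This is an illustrative sketch, not the downloadable sample: `gemm_block` is a naive triple-loop stand-in for a real ?GEMM call (an optimized version would call DGEMM here), and all names are our own.

```c
/* Naive stand-in for DGEMM on column-major submatrices:
   C (m x n) += A (m x k) * B (k x n); no transposes, alpha = beta = 1. */
static void gemm_block(int m, int n, int k,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* Recursive lower-triangular update: C (n x n) += A (n x k) * B (k x n),
   touching only entries C[i][j] with i >= j. */
static void hamm_lower(int n, int k,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
{
    if (n <= 0) return;
    if (n == 1) {                       /* tiny triangle: one dot product */
        gemm_block(1, 1, k, A, lda, B, ldb, C, ldc);
        return;
    }
    int h = n / 2;
    /* Step 1 (as in Figure 1): the rectangular block below the diagonal,
       rows h..n-1 and columns 0..h-1, is one ordinary GEMM call. */
    gemm_block(n - h, h, k, A + h, lda, B, ldb, C + h, ldc);
    /* Step 2 (as in Figure 2): recurse on the two remaining triangles
       that sit along the diagonal. */
    hamm_lower(h, k, A, lda, B, ldb, C, ldc);
    hamm_lower(n - h, k, A + h, lda, B + (long)h * ldb, ldb,
               C + h + (long)h * ldc, ldc);
}
```

Each level of the recursion turns half of the remaining work into a single rectangular ?GEMM call, which is why the overall rate approaches (rather than equals) ?GEMM's.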

Using DHAMM as an example (where the data type is double precision floating point), the interface of DHAMM in C is:

```
void dhamm(char *uplo, char *transa, char *transb,
           MKL_INT *m, MKL_INT *n, MKL_INT *k,
           double *alpha, double *A, MKL_INT *lda,
           double *B, MKL_INT *ldb,
           double *beta, double *C, MKL_INT *ldc);
```

As a comparison, the interface of DGEMM looks like:

```
void dgemm(char *transa, char *transb,
           MKL_INT *m, MKL_INT *n, MKL_INT *k,
           double *alpha, double *A, MKL_INT *lda,
           double *B, MKL_INT *ldb,
           double *beta, double *C, MKL_INT *ldc);
```

The only difference is that DHAMM takes an extra *uplo* argument in the first position of the argument list.
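To illustrate the interface semantics, here is a hypothetical naive reference (`dhamm_ref` is our name, not part of MKL or the downloadable sample). It follows the Fortran-style by-reference convention of the signature above, handles only *transa* = *transb* = 'N' for brevity, and makes no attempt at the recursive blocking:

```c
/* Stand-in for the MKL integer type; real code gets this from mkl.h. */
typedef long long MKL_INT;

/* Naive illustration of DHAMM semantics: updates only the uplo triangle
   of C with alpha*A*B + beta*C.  Column-major storage throughout. */
void dhamm_ref(const char *uplo, const char *transa, const char *transb,
               const MKL_INT *m, const MKL_INT *n, const MKL_INT *k,
               const double *alpha, const double *A, const MKL_INT *lda,
               const double *B, const MKL_INT *ldb,
               const double *beta, double *C, const MKL_INT *ldc)
{
    (void)transa; (void)transb;     /* 'N', 'N' assumed in this sketch */
    int lower = (*uplo == 'L' || *uplo == 'l');
    for (MKL_INT j = 0; j < *n; j++) {
        /* Row range of the kept triangle in column j. */
        MKL_INT ilo = lower ? j : 0;
        MKL_INT ihi = lower ? *m - 1 : (j < *m ? j : *m - 1);
        for (MKL_INT i = ilo; i <= ihi; i++) {
            double s = 0.0;
            for (MKL_INT p = 0; p < *k; p++)
                s += A[i + p * *lda] * B[p + j * *ldb];
            C[i + j * *ldc] = *alpha * s + *beta * C[i + j * *ldc];
        }
    }
}
```

A call then mirrors DGEMM apart from the leading argument, for example `dhamm_ref("L", "N", "N", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc)`; entries strictly above the diagonal are never read or written.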

A sample implementation is provided for DHAMM. See the link at the bottom of this article to download the source code.

Benchmarking on a two-socket Intel® Xeon® E5 processor shows that the recursive algorithm can get within roughly 80% of DGEMM performance, while the naïve implementation slowly approaches 50% of the performance of DGEMM.

- If matrix *C* was symmetric or Hermitian before calling ?HAMM, this symmetry is *not* preserved by the operation unless the product *op(A)*op(B)* is also symmetric or Hermitian, for example, when *op(A)*op(B)* = *F*D*F^{T}*, where *D* is a diagonal matrix and *F* is a general matrix.
- The default value for
- The sample implementation source code is only meant to be used as an example. It is not rigorously tested and does not necessarily exhibit all elements of good programming practice. Users should build their own solutions based on it.

| Attachment | Size |
|---|---|
| dhamm.c | 6.1 KB |

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804