### Introduction

By **Joe Bissell**, University of Delaware**Gary Zoppetti**, University of Delaware, and**Walter Triebel**, Fairleigh Dickinson University

Integer matrix multiplication is a common procedure that is used generically to explore options for optimizing nested loops and specifically for computations with multimedia algorithms, such as JPEG image decoding, all the way to uses as diverse as oceanographic analysis. The essence of this algorithm is the assignment statement:

c[i][j] = c[i,j] + a[i][k] * b[k][j];

This innocuous looking line of code, which is typically known as "madd" - for multiply and add, is one of the more difficult processing tasks that a CPU must perform. Such computations break down into a series of multiply-accumulate operations that cannot efficiently be built out of simple, separate functions. The problem here is that multiplication takes several clock cycles and addition only one. Therefore, a critical scheduling problem exists. If the addition is done too early, the CPU stalls. If it is done too late, there is wasted time because the multiplication units are not kept busy.

No CPU escapes without some penalty, but as will be shown; processors from the Itanium® processor family pay a smaller price than current pure RISC machines. The reason is in part due to the massive CPU register resources, generous primary caches, the rich instruction set, and the fact that integer multiplication is done in the floating-point unit. The advantage of using the floating-point unit is that much of the housekeeping associated with each madd instruction can be shared between the multiply and add stages. The rest of this paper discusses an approach to implementing a matrix multiply algorithm in the C language which optimizes performance of that calculation. The paper further illustrates some of the strengths of an Itanium processor in performing this calculation, a key component in solving many high computationally intensive problems.

### Test Algorithms

There's no gainsaying the choice of algorithm and adherence to good general optimization principles is of utmost importance. In this paper, we show two ways to implement the matrix multiply, both of which perform well, but one is clearly superior as the value of N (iterations) increases. This fact is important because the matmul programs are computationally intensive. For example, doubling the value of N increases execution time by a factor of 8 - the basic algorithm is cubic, so we expect an increase of approximately 2³ as N is doubled. When N=800, each matrix stores 640,000 *ints*. Each *int* is 4 bytes so this requires 2,560,000 bytes for each of the three matrices.

In the following sections we show test algorithms written in the C language and which have been run in the context of the test environment shown in the nex t section, Figure 1. The test programs matmul_const.c and matmul_stride.c are given in Listing 1 and Listing 2, respectively. In both programs, memory is allocated dynamically in contiguous blocks to maximize the spatial locality. In the second algorithm, the inner loop is further tuned by making it "unit stride". What this means is that for multiple-dimensioned arrays, access to each element is fastest if it can be done on the smallest possible step size; i.e., unit stride. The outer loop will automatically be unit stride, but the inner loop will not, because its stride is N, which is always assumed greater than one. The comments in the matmul_stride.c program explain how the unit stride was obtained.

The matmul_stride.c program demonstrates how the inner loop can be coerced into being unit stride (under the assumption that the matrix, b is already transposed). This means that once N is longer than the cache line, the performance degradation that existed in matmul_cont.c prior to this tuning is eliminated. (It is worth noting that in C, accessing matrix 'b' is by column and C uses row-major order.)**Listing 1: matmul_const.c program**

/***********************************/ #include <time.h> /* matmul_cont.c */ #include <stdio.h> #include <stdlib.h> /***********************************/ int main (int argc, char *argv[]) { int i, j, k; int N; // command line arg for no. of iterations int *a, *b, *c; // pointers to matrices a, b, and c int *am, *bm, *cm; // row pointers for matrices a, b, and c clock_t start; // from 'time.h' if (argc != 2) { printf ("Usage: %s <dimension> ", argv[0]); exit (-1); } N = atoi (argv[1]); // convert command line argument to int a = malloc (N * sizeof (int *)); // allocate row pointers am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's for (i = 0; i < N; i++) // initialize row pointers a[i] = &am[i * N]; b = malloc (N * sizeof (int *)); // (same for b and c) bm = malloc (N * N * sizeof (int)); for (i = 0; i < N; i++) b[i] = &bm[i * N]; c = malloc (N * sizeof (int *)); cm = malloc (N * N * sizeof (int)); for (i = 0; i < N; i++) c[i] = &cm[i * N]; for (i = 0; i < N; i++) // initialize a and b to random int's for (j = 0; j < N; j++) // mod 32768 { a[i][j] = rand () % 32768; b[i][j] = rand () % 32768; } start = clock (); // start the clock /* begin matrix multiply */ for (i = 0; i < N; i++) for (j = 0; j < N; j++) ; { c[i][j] = 0; for (k = 0; k < N; k++) // inner loop to compute dot product c[i][j] += a[i][k] * b[k][j]; } /* end matrix multiply */ /* note that there is no unit stride in inner loop, so cache misses account for substantial overhead */ start = clock () - start; // stop clock and calc. elapsed clocks printf ("N = %d Time = %f ", N, (float) start / CLOCKS_PER_SEC); free (a); free (am); free (b); free (bm); free (c); free (cm); return (0); }

**Listing 2: matmul_stride.c program**

/**********************************/ #include <time.h> /* matmul_stride.c */ #include <stdio.h> #include <stdlib.h> /**********************************/ int main (int argc, char *argv[]) { int i, j, k; int N; // command line arg for no. of iterations int *a, *b, *c; // command line arg - no. of iterations int *am, *bm, *cm; // pointers to matrices a, b, and c int stride_scale; // stride_scale provides unit stride in inner loop clock_t start; if (argc != 2) { printf ("Usage: %s <dimension> ", argv[0]); exit (-1); } N = atoi (argv[1]); // convert command line argument to int a = malloc (N * sizeof (int *)); // allocate row pointers am = malloc (N * N * sizeof (int)); // allocate NxN matrix of int's for (i = 0; i < N; i++) // initialize row pointers a[i] = &am[i * N]; b = malloc (N * sizeof (int *)); // (same for b and c) bm = malloc (N * N * sizeof (int)); for (i = 0; i < N; i++) b[i] = &bm[i * N]; c = malloc (N * sizeof (int *)); cm = malloc (N * N * sizeof (int)); for (i = 0; i < N; i++) c[i] = &cm[i * N]; for (i = 0; i < N; i++) for (j = 0; j < N; j++) { a[i][j] = rand () % 32768; b[i][j] = rand () % 32768; } start = clock (); // start the clock /* begin matrix multiply */ for (i = 0; i < N; i++) for (j = 0; j < N; j++) c[i][j] = 0; for (i = 0; i < N; i++) for (j = 0; j < N; j++) { stride_scale = b[j][i]; for (k = 0; k < N; k++) // inner loop to compute dot product c[j][k] += a[j][k] * stride_scale; // but now we have unit stride } // in inner loop, /* end matrix multiply */ start = clock () - start;; printf ("N = %d Time = %f ", N, (float) start / CLOCKS_PER_SEC); free (a); free (am); free (b); free (bm); free (c); free (cm); return (0); } /**********************************/

### Test Environment

The specification of the computers used to perform the testing of the matmul programs are given in Figure 1. The Itanium-based system we used had 1 Gigabyte of main memory, and the UltraSPARC* has 2 Gigabytes of main memory. The choice of compiler is also a factor that must be considered in any intra-CPU analysis of matrix operations. For the Itanium-based system two compilers were tested, the Red Hat Linux* gcc and the Intel C++ Compiler for Linux*. For the SPARC* processor, cc and gcc were used, but only the results from gcc reported because gcc performed so poorly-more than 3X slower than the Itanium processor's gcc.

We provide the results of the two algorithms, each with two levels of optimization, using four different data sets varying in size. Since the optimization level of the Itanium processor defaults to -O2, we used that and optimization level -O3 for the test runs.

**Figure 1. Hardware Specifications for Example Systems.**

### Conclusion

**Measured Results**

The programs were run on the Itanium processor running Red Hat Linux* 7.0 using both the Intel® C++ Compiler for Linux and gcc. The results of data sets with N having values of 400, 500, 800, and 1000-the latter representing a billion operations are shown in Table 1. The runs were then repeated on an UltraSPARC II* with comparable processing capability, and produced the results in Table 2. From the elapsed time, it is seen that the Itanium-based system performs the madd matrix computation faster than the UltraSPARC II does, and the lead of the Itanium-based system increases with larger N. A significant performance speedup is observed over the first program matmul_cont.c in the results for the second program matmul_stride.c. This gain is seen in all runs on all compilers, but the Intel Compiler for Linux ecc is the clear winner.

In general, the rules that govern algorithm performance on most computers apply similarly to Itanium processor-based systems. In fact, the first runs with lower values of N were about the same for both machines and all compilers, but as N increased, the Itanium was able to take greater advantage of the efficient algorithms.

We expect that similar tests on the Itanium 2 processor will give times will be significantly faster because of the on-chip L3 cache and higher front side bandwidth. We plan further testing of these algorithms and one more finely tuned algorithm.**Table 1: Results of 2-D Matrix Madd for Itanium Processor using ecc and gcc**

a) matmul_cont.c results with ecc -O2 - contiguous memory (2 mallocs only) ======================================================== N = 400 Time = 3.159312 N = 500 Time = 6.680720 N = 800 Time = 59.979103 N = 1000 Time = 118.821167 b)matmul_cont.c results with ecc -O3 - contiguous memory (2 mallocs only) ================================================================ N = 400 Time = 3.372080 N = 500 Time = 6.778320 N = 800 Time = 60.158688 N = 1000 Time = 118.190674 --------------------------------------------------------------- --------------------------------------------------------------- c) matmul_stride.c results with ecc -O2 - contiguous memory (2 mallocs only) assumes unit stride ================================================================ N = 400 Time = 2.687904 N = 500 Time = 5.384592 N = 800 Time = 27.390465 N = 1000 Time = 55.467056 d) matmul_stride.c results with ecc -O3 - contiguous memory (2 mallocs only) assumes unit stride ======================================================== N = 400 Time = 2.456592 N = 500 Time = 4.796064 N = 800 Time = 20.189535 N = 1000 Time = 39.765167 --------------------------------------------------------------- --------------------------------------------------------------- e) matmul_cont.c results with gcc -O2 - contiguous memory (2 mallocs only) ======================================================== N = 400 Time = 3.123200 N = 500 Time = 6.510896 N = 800 Time = 53.237873 N = 1000 Time = 119.425308 f) matmul_cont.c results with gcc -O3 - contiguous memory (2 mallocs only) ======================================================== N = 400 Time = 3.174928 N = 500 Time = 6.482592 N = 800 Time = 53.376465 N = 1000 Time = 119.184242

**Table 2: Results of 2-D Matrix Madd for UltraSPARC II Processor using cc**

a) matmul_cont.c results with cc -O2 - contiguous memory (2 mallocs only) ======================================================== N = 400 Time = 8.814000 N = 500 Time = 18.448000 N = 800 Time = 87.832001 N = 1000 Time = 188.445999 b) matmul_cont.c results with cc -O3 - contiguous memory (2 mallocs only) ======================================================== N = 400 Time = 8.145000 N = 500 Time = 17.061001 N = 800 Time = 81.969002 N = 1000 Time = 172.462006 c) matmul_stride.c results with cc -O2 - contiguous memory (2 mallocs only) assumes unit stride ======================================================== N = 400 Time = 6.938000 N = 500 Time = 13.755000 N = 800 Time = 55.528999 N = 1000 Time = 109.347000 d) matmul_stride.c results with cc -O3 - contiguous memory (2 mallocs only) assumes unit stride ======================================================== N = 400 Time = 6.341000 N = 500 Time = 12.388000 N = 800 Time = 51.528000 N = 1000 Time = 101.095001

### Related Resources

- Recognizing Efficient Use of Caches in Code for the Itanium Processor Family
- Linux and Itanium Architecture: The Developer Side
- Porting Your Code to the Itanium Architecture
- Tune-up your Itanium code with VTune™
- Get help porting and tuning your code for Itanium-based systems

### About the Authors

**Walter Triebel** has authored college textbooks and engineering reference books and has over twenty years of experience as an application engineer and educator. He is the author of "Itanium Architecture for Software Developers" and a co-author of "Programming Itanium-based Systems,"-both published by Intel Press. For eighteen years, he worked for Intel Corporation, and he is currently an Adjunct Professor at Fairleigh Dickinson University. Currently he teaches microprocessor architecture courses, including a course solely on the Itanium Processor. He holds a Bachelor of Science in Electrical Engineering from Fairleigh Dickinson University and a Master of Science in Electrical Engineering from New York University.

**Joseph D. Bissell** has taught computer science/engineering for over 16 years. Currently, he teaches computer architecture, IA-32 and SPARC assembly language and embedded systems at the University of Delaware, and has also taught there a number of courses on C, C++ and compiler construction. His main interests are in the areas of high performance computing. He is also a consultant in the private sector for the West Company. He is the co-author of the book "Programming Itanium-based Systems" with Walt Triebel and Rick Booth; Intel Press.

His formal education is a bit non-traditional. He holds a Doctor of Dental Science degree from Temple University, a Bachelor of Science in Physiology from the University of Pennsylvania, and Master of Science in Electrical Engineering from Drexel University.

**Gary M. Zoppetti** earned the Ph.D. (2001) and M.S. (1997) degrees in Computer and Information Sciences from the University of Delaware, and the B.S. (1992) degree in Mathematics and Computer Science from California U niversity of Pennsylvania. As an undergraduate, he participated in the National Student Exchange program to further his studies in computer science, physics, and mathematics at the University of Utah. Prior to joining the faculty at Millersville University in 2002, he held an assistant professorship at the University of Delaware. Dr. Zoppetti's research interests include high performance computing; programming languages, compilers, and runtime systems; and computer architecture.