# Tests of Efficient Implementation of Madd Algorithms on an Itanium®-based System

By Joe Bissell, University of Delaware;
Gary Zoppetti, University of Delaware; and
Walter Triebel, Fairleigh Dickinson University

### Introduction

Integer matrix multiplication is a common procedure, used generically to explore options for optimizing nested loops and specifically in multimedia computations such as JPEG image decoding, as well as in fields as diverse as oceanographic analysis. The essence of this algorithm is the assignment statement:

```
c[i][j] = c[i][j] + a[i][k] * b[k][j];
```

This innocuous-looking line of code, typically known as "madd" (for multiply and add), is one of the more difficult processing tasks a CPU must perform. Such computations break down into a series of multiply-accumulate operations that cannot be built efficiently out of simple, separate functions. The problem is that multiplication takes several clock cycles while addition takes only one, so a critical scheduling problem exists. If the addition is issued too early, the CPU stalls; if it is issued too late, time is wasted because the multiplication units are not kept busy.

No CPU escapes without some penalty, but as will be shown, processors from the Itanium® processor family pay a smaller price than current pure RISC machines. The reason is due in part to the massive CPU register resources, generous primary caches, the rich instruction set, and the fact that integer multiplication is done in the floating-point unit. The advantage of using the floating-point unit is that much of the housekeeping associated with each madd instruction can be shared between the multiply and add stages. The rest of this paper discusses an approach to implementing a matrix multiply algorithm in the C language that optimizes the performance of that calculation. The paper further illustrates some of the strengths of an Itanium processor in performing this calculation, a key component of many computationally intensive problems.

### Test Algorithms

The choice of algorithm and adherence to good general optimization principles are of the utmost importance. In this paper, we show two ways to implement the matrix multiply, both of which perform well, but one is clearly superior as the matrix dimension N increases. This fact is important because the matmul programs are computationally intensive: the basic algorithm is cubic, so doubling N increases execution time by a factor of approximately 2³ = 8. When N = 800, each matrix stores 640,000 ints; at 4 bytes per int, that is 2,560,000 bytes for each of the three matrices.

In the following sections we show test algorithms written in the C language, which have been run in the test environment described in the next section (Figure 1). The test programs matmul_cont.c and matmul_stride.c are given in Listing 1 and Listing 2, respectively. In both programs, memory is allocated dynamically in contiguous blocks to maximize spatial locality. In the second algorithm, the inner loop is further tuned by making it "unit stride". What this means is that for multiple-dimensioned arrays, access to each element is fastest if it proceeds with the smallest possible step size; i.e., unit stride. Access along a row is automatically unit stride, but the inner loop of the first program steps down a column of b, so its stride is N, which is always assumed greater than one. The comments in the matmul_stride.c program explain how unit stride was obtained.

The matmul_stride.c program demonstrates how the inner loop can be coerced into being unit stride: one matrix element is hoisted out of the inner loop as a scalar, and the inner loop accumulates along rows of c and b rather than down a column of b. This means that once a row of N ints exceeds a cache line, the performance degradation that exists in matmul_cont.c is eliminated. (It is worth noting that matmul_cont.c accesses matrix b by column, while C stores arrays in row-major order.)

Listing 1: matmul_cont.c program

```
/***********************************/
/*          matmul_cont.c          */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
/***********************************/

int
main (int argc, char *argv[])
{
  int i, j, k;
  int N;               // command line arg for the matrix dimension
  int **a, **b, **c;   // row pointers for matrices a, b, and c
  int *am, *bm, *cm;   // contiguous storage for matrices a, b, and c
  clock_t start;       // from 'time.h'

  if (argc != 2)
    {
      printf ("Usage: %s <dimension>\n", argv[0]);
      exit (-1);
    }

  N = atoi (argv[1]);                  // convert command line argument to int

  a = malloc (N * sizeof (int *));     // allocate row pointers
  am = malloc (N * N * sizeof (int));  // allocate NxN matrix of int's
  for (i = 0; i < N; i++)              // initialize row pointers
    a[i] = &am[i * N];

  b = malloc (N * sizeof (int *));     // (same for b and c)
  bm = malloc (N * N * sizeof (int));
  for (i = 0; i < N; i++)
    b[i] = &bm[i * N];

  c = malloc (N * sizeof (int *));
  cm = malloc (N * N * sizeof (int));
  for (i = 0; i < N; i++)
    c[i] = &cm[i * N];

  for (i = 0; i < N; i++)        // initialize a and b to random int's
    for (j = 0; j < N; j++)      // mod 32768
      {
        a[i][j] = rand () % 32768;
        b[i][j] = rand () % 32768;
      }

  start = clock ();              // start the clock

  /* begin matrix multiply */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      {
        c[i][j] = 0;
        for (k = 0; k < N; k++)  // inner loop to compute dot product
          c[i][j] += a[i][k] * b[k][j];
      }
  /* end matrix multiply */
  /* note that there is no unit stride in the inner loop, so cache
     misses account for much of the running time */

  start = clock () - start;      // stop clock and calc. elapsed clocks
  printf ("N = %d\t\tTime = %f\n", N, (float) start / CLOCKS_PER_SEC);

  free (a);
  free (am);
  free (b);
  free (bm);
  free (c);
  free (cm);
  return (0);
}
```

Listing 2: matmul_stride.c program

```
/**********************************/
/*         matmul_stride.c        */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
/**********************************/

int main (int argc, char *argv[])
{
  int i, j, k;
  int N;               // command line arg for the matrix dimension
  int **a, **b, **c;   // row pointers for matrices a, b, and c
  int *am, *bm, *cm;   // contiguous storage for matrices a, b, and c
  int stride_scale;    // stride_scale provides unit stride in inner loop
  clock_t start;

  if (argc != 2)
    {
      printf ("Usage: %s <dimension>\n", argv[0]);
      exit (-1);
    }

  N = atoi (argv[1]);                  // convert command line argument to int

  a = malloc (N * sizeof (int *));     // allocate row pointers
  am = malloc (N * N * sizeof (int));  // allocate NxN matrix of int's
  for (i = 0; i < N; i++)              // initialize row pointers
    a[i] = &am[i * N];

  b = malloc (N * sizeof (int *));     // (same for b and c)
  bm = malloc (N * N * sizeof (int));
  for (i = 0; i < N; i++)
    b[i] = &bm[i * N];

  c = malloc (N * sizeof (int *));
  cm = malloc (N * N * sizeof (int));
  for (i = 0; i < N; i++)
    c[i] = &cm[i * N];

  for (i = 0; i < N; i++)        // initialize a and b to random int's
    for (j = 0; j < N; j++)      // mod 32768
      {
        a[i][j] = rand () % 32768;
        b[i][j] = rand () % 32768;
      }

  start = clock ();              // start the clock

  /* begin matrix multiply */
  for (i = 0; i < N; i++)        // zero c before accumulating
    for (j = 0; j < N; j++)
      c[i][j] = 0;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      {
        stride_scale = a[i][j];  // hoist a[i][j] out of the inner loop
        for (k = 0; k < N; k++)  // inner loop accumulates partial products
          c[i][k] += stride_scale * b[j][k];  // but now we have unit stride
                                              // in inner loop
      }
  /* end matrix multiply */

  start = clock () - start;      // stop clock and calc. elapsed clocks
  printf ("N = %d\t\tTime = %f\n", N, (float) start / CLOCKS_PER_SEC);

  free (a);
  free (am);
  free (b);
  free (bm);
  free (c);
  free (cm);
  return (0);
}
/**********************************/
```

### Test Environment

The specifications of the computers used to test the matmul programs are given in Figure 1. The Itanium-based system we used had 1 Gigabyte of main memory, and the UltraSPARC* system had 2 Gigabytes. The choice of compiler is also a factor that must be considered in any cross-CPU analysis of matrix operations. For the Itanium-based system two compilers were tested, the Red Hat Linux* gcc and the Intel® C++ Compiler for Linux*. For the SPARC* processor, cc and gcc were used, but only the results from cc are reported, because gcc performed so poorly on that machine (more than 3X slower than gcc on the Itanium-based system).

We provide the results of the two algorithms, each with two levels of optimization, using four data sets of increasing size. Since the compilers default to optimization level -O2, we used -O2 and -O3 for the test runs.

Figure 1. Hardware Specifications for Example Systems.

### Conclusion

Measured Results

The programs were run on the Itanium processor under Red Hat Linux* 7.0 using both the Intel® C++ Compiler for Linux and gcc. The results for data sets with N values of 400, 500, 800, and 1000 (the latter representing a billion multiply-add operations) are shown in Table 1. The runs were then repeated on an UltraSPARC II* with comparable processing capability, producing the results in Table 2. From the elapsed times, it is seen that the Itanium-based system performs the madd matrix computation faster than the UltraSPARC II does, and its lead increases with larger N. A significant speedup over the first program, matmul_cont.c, is observed in the results for the second program, matmul_stride.c. This gain is seen in all runs with all compilers, but the Intel C++ Compiler for Linux (ecc) is the clear winner.

In general, the rules that govern algorithm performance on most computers apply similarly to Itanium processor-based systems. In fact, the first runs with lower values of N took about the same time on both machines with all compilers, but as N increased, the Itanium-based system was able to take greater advantage of the efficient algorithms.

We expect that similar tests on the Itanium 2 processor will give significantly faster times because of the on-chip L3 cache and higher front-side bus bandwidth. We plan further testing of these algorithms and of one more finely tuned algorithm.

Table 1: Results of 2-D Matrix Madd for Itanium Processor using ecc and gcc

```
a) matmul_cont.c results with ecc -O2 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 3.159312
N = 500     Time = 6.680720
N = 800     Time = 59.979103
N = 1000    Time = 118.821167

b) matmul_cont.c results with ecc -O3 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 3.372080
N = 500     Time = 6.778320
N = 800     Time = 60.158688
N = 1000    Time = 118.190674

c) matmul_stride.c results with ecc -O2 - contiguous memory (2 mallocs only)
   assumes unit stride
================================================================
N = 400     Time = 2.687904
N = 500     Time = 5.384592
N = 800     Time = 27.390465
N = 1000    Time = 55.467056

d) matmul_stride.c results with ecc -O3 - contiguous memory (2 mallocs only)
   assumes unit stride
================================================================
N = 400     Time = 2.456592
N = 500     Time = 4.796064
N = 800     Time = 20.189535
N = 1000    Time = 39.765167

e) matmul_cont.c results with gcc -O2 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 3.123200
N = 500     Time = 6.510896
N = 800     Time = 53.237873
N = 1000    Time = 119.425308

f) matmul_cont.c results with gcc -O3 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 3.174928
N = 500     Time = 6.482592
N = 800     Time = 53.376465
N = 1000    Time = 119.184242
```

Table 2: Results of 2-D Matrix Madd for UltraSPARC II Processor using cc

```
a) matmul_cont.c results with cc -O2 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 8.814000
N = 500     Time = 18.448000
N = 800     Time = 87.832001
N = 1000    Time = 188.445999

b) matmul_cont.c results with cc -O3 - contiguous memory (2 mallocs only)
================================================================
N = 400     Time = 8.145000
N = 500     Time = 17.061001
N = 800     Time = 81.969002
N = 1000    Time = 172.462006

c) matmul_stride.c results with cc -O2 - contiguous memory (2 mallocs only)
   assumes unit stride
================================================================
N = 400     Time = 6.938000
N = 500     Time = 13.755000
N = 800     Time = 55.528999
N = 1000    Time = 109.347000

d) matmul_stride.c results with cc -O3 - contiguous memory (2 mallocs only)
   assumes unit stride
================================================================
N = 400     Time = 6.341000
N = 500     Time = 12.388000
N = 800     Time = 51.528000
N = 1000    Time = 101.095001
```