# Superscalar programming 101 (Matrix Multiply) Part 2 of 5

By Jim Dempsey

In my last article we left off with:

The above charts, impressive as they are, are an "apples and oranges" type of comparison. The chart is comparing a non-cache sensitive serial technique against a cache sensitive parallel technique. Good for promotional literature, certainly a good incentive to learn how to program in parallel, but this falls short for analysis purposes by the seasoned parallel programmer.

The first thing any parallel programmer should know is: improve the serial algorithm first, and then address the problem through parallelization (save intrinsic small vector functions for last).

If you look at the innermost loop of the Serial function we find:

```
for (int k = 0; k < size; k++)
{
    temp += m1[i][k] * m2[k][j];
}
```

We addressed the issue of the traditional C/C++ programming practice of using an array of pointers to rows, and saw that dispensing with this practice yielded a 6x to 12x improvement in performance. What else can be done?

Set aside the issue of the TLBs and row table for a moment. In examining the principal computational statement we find that while the k index in m1[i][k] accesses memory sequentially, the k index in m2[k][j] does not. This means the compiler is unable to vectorize the loop. And potentially worse, stepping down the rows at certain distance intervals is known to cause cache evictions (the dip in the curve observed on the Core i7 920). Using vectors on doubles might attain a 2x improvement - well worth going after. So let's improve the vectorizability of the Serial version of this program (and of the Parallel variant of the Serial version too).

Note, while we could use the approach taken in the Cilk++ implementation (eliminate the row table of pointers), I will choose an alternate means that will become useful later. First we address the simple Serial and simple Parallel versions to see what happens.

In order to assist the compiler in vectorizing the matrix multiplication (and minimize cache evictions) you can transpose the matrix m2 (to m2t), and then you can transpose the indexes in the inner loop. At first this may sound absurd. To transpose the matrix m2 you will have to add overhead to:

- perform an allocation (actually size+1 allocations, with 1 for the row pointers)
- run the transposition loops
- read and re-write one of the arrays (array m2 to m2t)

You might assume that this must be slower than the simple matrix multiplication.
Well, you would be wrong to assume this. Each cell in m2 is used N times, and each cell in m1 is used N times. We have the potential of trading off the cache misses on 2N**2 reads plus N*N writes against N reads + N/2 writes + (2N**2)/2 reads + N/2 writes. The /2 comes from the inner loop now becoming a DOT function that operates on two sequential arrays and is an ideal candidate for vectorization by the compiler. The rewritten Serial loop (with the inner loop transformed into an inline function) looks like:

```
// compute DOT product of two vectors, return result as double
inline double DOT(double* v1, double* v2, size_t size)
{
    double temp = 0.0;
    for (size_t i = 0; i < size; i++)
    {
        temp += v1[i] * v2[i];
    }
    return temp;
}
```

The matrix multiply function, with transposition now looks like:

```
void matrix_multiplyTranspose(
    double** m1, double** m2, double** result, size_t size)
{
    // create a matrix to hold the transposition of m2
    double** m2t = create_matrix(size);
    // perform transposition of m2 into m2t
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            m2t[j][i] = m2[i][j];
        }
    }
    // now matrix multiply with the transposed matrix m2t
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            result[i][j] = DOT(m1[i], m2t[j], size);
        }
    }
    // remove the matrix that holds the transposition of m2
    destroy_matrix(m2t, size);
}
```

Note, in a real implementation you might want to consider preserving the buffers allocated on the first call. Then on subsequent calls you check whether the prior allocation is sufficient for the current call. If so, you bypass the allocation. If not, you delete the buffers and allocate new ones. This further optimization detail is left to the reader.

We will include the allocation/deallocation overhead in our charts. Also note, if the m2 matrix is going to be used many times (e.g. in a filter system), then you can perform the transposition once and reuse the transposed matrix many times. These are implementation strategies to keep in mind when taking these suggestions into your own code.

Let's see how the serial with transposition technique (ST) stacks up against the serial without transposition (S) and the parallel code (P) using the non-transposed matrix method:

Fig 5 (above)

Fig 6

The revised technique, which still uses a row table, is a significant improvement over the Serial and Parallel methods. It does not, however, approach the row table elimination method of the Cilk++ demo program. Is there anything to learn from this?

The revised serial code, using one thread, overwhelms the parallel code using 4 or 8 threads on the two different processors at higher N values.

Although this is 4x better parallel performance than before, it is still less than the Cilk++ method. Is this information useful in helping produce a better algorithm?

Why the "turbo charged" boost at 513 for the Serial Transpose method?
Although Parallel Transpose (PT) is 4x faster than Parallel (P) it is actually less than the performance of the Serial Transpose method! Why!?!?

The answer to this is the chart lines are relative to the original (non-transposed) Serial performance. Let's see the actual run time for the (non-transposed) Serial method to see if something shows up:

Fig 7

And there it is. At N = 513 we find the serial time experiencing a drastic change in slope. (The Q6600 shows an earlier slope change, at about 350.)

What is significant about N = 513 on the Core i7 920? We principally have access to two arrays of doubles dimensioned at 513 x 513. The number of bytes is 2 x 513 x 513 x 8 = 4,210,704. Hmm...

The Core i7 920 specifications, gleaned from the internet, indicate that this processor has:

- L1 cache: 32KB instruction + 32KB data (one for each core)
- L2 cache: 256KB (one for each core)
- L3 cache: 8MB (shared)

So why the performance drop at ~4MB instead of ~8MB? To be honest, I haven't performed a thorough analysis of this; my postulation is that this is a TLB issue, so I will simply state that it appears the cache system is being overtaxed at this point. What is important is that the steep rise in the curve at this point accounts for the superscalar performance gain of a better serial technique over a non-optimal serial technique.

The important piece of information learned is that the Serial Transpose method trounces the performance of the parallel technique using the non-transposed method. (Although the Cilk++ technique is still 3.nnx faster than either.)

Now when the Cilk++ row table elimination method is compared to Serial Transpose method we have:

Fig 8

Fig 9

The Cilk++ technique shows approximately 3.25x performance over the Serial Transpose method.
More noteworthy, the Parallel Transpose is not faster than the Serial Transpose method. Why? And more importantly, does it matter why?

As a general rule, when you observe your parallel code not performing better than your serial code, this is a good indication that you have reached a memory bandwidth problem. The Cilk++ code performance clearly indicates the memory bandwidth problem is due to poor cache utilization by the Parallel Transpose technique.

Is the Cilk++ method the best we can expect to do?

The answer to this is: No.

We will discuss this in the next article (Part 3).

Jim Dempsey


Werner,

The issues addressed in these articles will apply equally to most compiled languages, including C#. The intention of the series of articles is to teach a discovery method, and to illustrate that you should not give up too soon if one (or some) attempts at improvement do not appear fruitful. There is information to learn from otherwise "failures to improve" steps. The 2nd article illustrates a "failure to improve" method, but also includes information to be gleaned from this failure. The 3rd article will apply what we learned in the 1st and 2nd articles to produce an improved method. The 4th article will apply this learning procedure to a larger system (Dual Xeon 5570). And the 5th article will make improvements and run on a Quad Xeon 7560 (32 cores, 64 hw threads) - assuming I can get the test program to build on that system (working on this now). Larger systems and larger problems benefit from additional programming.

Jim Dempsey

Excellent pair of articles so far. I don't know C++, just some C#, but your examples are so clear that I have learned a lot from them so far. I'm already gladly expecting the next part.

After reading the text below Fig 9, I think the text needs clarification:

The cache sensitive (row packing) Cilk++ parallel code with 4 cores yields 3.25x performance over the cache sensitive Serial version of the Matrix Multiply code (now an apples to apples comparison).

Jim Dempsey