**By Jim Dempsey**

The subject matter of this article is: How to optimally tune a well known algorithm. We will take this well known (small) algorithm, a common approach to parallelizing this algorithm, a better approach to parallelizing this algorithm, and then produce a fully cache sensitized approach to parallelizing this algorithm. The intention of this article is to teach you a methodology of how to interpret the statistics gathered during test runs and then use those interpretations at improving your parallel code.

This is a multi-part article, where the author believes the process in reaching a goal (learning how to assess results and improve upon your parallel programming technique) is as important as the goal (finished application). For without the process you will not know how to attain goal to the extent possible.

To titillate you into reading the complete set of articles, you will experience how to attain up to 80x parallel programming performance increase on a single processor (4 core with HT) system over a simple serial method on the same system. (don't zoom the charts)

Let us begin…

One of the posters on the Intel Developer Zone forums provided a link to an MSDN article titled "**How to: Write a parallel_for Loop**" (http://msdn.microsoft.com/en-us/library/dd728073.aspx). This article illustrates how to use the Visual Studio 2010 **Concurrency::parallel_for** to compute the product of two matrices. The example illustrates how to use the **parallel_for** in a nested loop.

The typical C++ serial method to compute the matrix multiplication of a square matrix might look as follows:

// Computes the product of two square matrices. void matrix_multiply( double** m1, double** m2, double** result, size_t size) { for (size_t i = 0; i < size; i++) { for (size_t j = 0; j < size; j++) { double temp = 0; for (int k = 0; k < size; k++) { temp += m1[i][k] * m2[k][j]; } result[i][j] = temp; } } }

Using the VS concurrency **collection parallel_for** the code looks like this:

// Compute the product of two square matrices in parallel. void parallel_matrix_multiply( double** m1, double** m2, double** result, size_t size) { parallel_for (size_t(0), size, [&](size_t i) { for (size_t j = 0; j < size; j++) { double temp = 0; for (int k = 0; k < size; k++) { temp += m1[i][k] * m2[k][j]; } result[i][j] = temp; } }); }

This makes use of the C++0x Lambda functions and the Concurrency template for the **parallel_for**. This conversion on the surface * appears* to be elegant (unusually effective and simple) by essentially rewriting two lines of code: The first for statement and closing brace of that

**for**statement.

The MSDN reported performance boost on a 4 processor system for 750 x 750 matrix multiplication as:

Serial: 3853 (ticks)

Parallel: 1311 (ticks)

Approximately a 2.94x speed-up.

This is not an ideal scaling situation in that it does not produce a 4x speed-up using 4 processors. But considering that there must be some setup overhead, 2.94x is not too bad on this small of matrix. And you might be inclined to think that there is only 25% room remaining for improvement.

The article did not chart the scaling as a function of N (one dimension of a square matrix) so it is difficult to tell the shape of the performance gain trend line as a function of N.

Although this article was written to illustrate the use of the

**parallel_for**in the MS Concurrency, a parallel programmer might be miss-lead into assuming that this example illustrates how to write a parallel matrix multiplication function. After all, this article describes a parallel method for matrix multiplication and was written in an MSDN article - an authoritative source.

Let's see how we can do better at parallelization of the Matrix Multiplication.

First off, I do not have Visual Studio 2010, so I cannot use the sample program as-is.

I do have QuickThread (I am the author of this parallel programming toolkit www.quickthreadprogramming.com). Therefore, I adapted the code to use QuickThread.

Adaptation is relatively easy. Change the include files and minor differences in syntax.

// Computes the product of two square matrices in parallel. void parallel_matrix_multiply( double** m1, double** m2, double** result, size_t size) { parallel_for( 0, size, [&](intptr_t iBegin, intptr_t iEnd) { for(intptr_t i = iBegin; i < iEnd; ++i) { for (intptr_t j = 0; j < size; j++) { double temp = 0; for (intptr_t k = 0; k < size; k++) { temp += m1[i][k] * m2[k][j]; } result[i][j] = temp; } } } ); }

In the QuickThread dialect, the **parallel_for** passes the half open range as arguments to the function, as opposed to the **parallel_for** of the VS 2010 concurrency collection passing the single index into the body of the function. QuickThread chose to pass the half open range, as opposed to a single index, because knowing the range, the programmer and/or compiler can better optimize the code within the loop. N.B. this loop could have as easily been written using OpenMP or Cilk++, or TBB. The choice parallel tool language/dialect is not important at this point of the article.

Using the QuickThread modified code, and run on Windows XP x64 and using an Intel Q6600 4 Core processor without Hyper Threading we find the scaling as:

**Fig 1**- Parallel Scaling

For an average scaling ratio of 3.77x at N values between 128 and 1024. And considering four threads are involved the above chart shows the parallel version of the serial code yielding a scaling factor of 0.94 (scale / number of cores). This is a reasonably good scaling factor.

The processor model was not listed for the Visual Studio test so it is hard to tell the reason why QuickThread yields 3.77x improvement on 4 cores, while VS yields 2.94x improvement on 4 processors (cores?). Setting aside the issue of which threading toolkit is better, let's concentrate on the parallelization of the matrix multiplication.

First thing to do is get an additional (different) set of sample data. Let's see what the performance is on a processor supporting L3 cache plus Hyper Threading (HT).

When running the program on an Intel Core i7 920, 4 cores with HT (8 hardware threads - but only 4 floating point instruction paths) we find a completely different trend line:

**Fig 2**

Why the dip when N ranges from 350 to 570?

When the performance recovers, why do we see 4x improvement instead of 8x improvement?

Why the jagged line (noise) towards lower N?

The dip is due to cache evictions caused by one thread adversely interacting with other threads.

We get just over 4x performance because the Core i7 920 is a 4 core processor capable of 4 execution streams for floating point (although it has 8 hardware threads for integer execution streams). On the higher N values it would appear that we are maxing out the capabilities of processor, 4 cores == 4 x performance boost.

The trend line is jagged due to only one sample run being taken, and due to the short run times when N is low. Also, the variable overhead in starting the thread team (now 8 threads) is more noticeable on the left hand side of the curve. An average of a larger number of runs would smooth out this line, but this consumes unnecessary time from the developer. Additionally the large spike at about 250 indicates that the Serial version took a large tick count hit at that point due to out of application control circumstances (expansion of page file or other activity on the system). You do not need a precise chart for program analysis purposes.

Is there anything we can we do to improve this performance and/or correct for the dip in performance for N between 350 and 570?

The main computational section of code for both Serial and Parallel are the same:

for (intptr_t j = 0; j < size; j++) { double temp = 0; for (intptr_t k = 0; k < size; k++) { temp += m1[i][k] * m2[k][j]; } result[i][j] = temp; }

Notice that the product accumulation statement

temp += m1[i][k] * m2[k][j];

uses k to index the column of array m1 and the row of array m2. This access pattern has two problems: First, it is not cache friendly, and second, it is not amenable to vectorization by the compiler. Let's look at fixing these problems.

In the Cilk++ sample programs, which is (or will be) available with the Intel Parallel Studio, we find a sample program showing a cache friendly implementation of matrix multiply:

// Multiply double precision square n x n matrices. A = B * C // Matrices are stored in row major order. void matrix_multiply(double* A, double* B, double* C, unsigned int n) { if (n < 1) { return; } cilk_for(unsigned int i = 0; i < n; ++i) { // This is the only Cilk++ keyword used in this program // Note the order of the loops and the code motion of the i*n and k*n // computation. This gives a 5-10 performance improvement over // exchanging the j and k loops. int itn = i * n; for (unsigned int k = 0; k < n; ++k) { for (unsigned int j = 0; j < n; ++j) { int ktn = k * n; // Compute A[i,j] in the inner loop. A[itn + j] += B[itn + k] * C[ktn + j]; } } } return; }

The itn (i times n) and ktn (k times n) variables take large steps through the arrays. The array pointers point to single dimensioned arrays that are row packed. This is to say each row is appended to the prior row into one large single dimension array. Which is then indexed by way of i*n and/or k*n.

Serial and Parallel function signatures:

void matrix_multiply( double** m1, double** m2, double** result, size_t size); void parallel_matrix_multiply( double** m1, double** m2, double** result, size_t size);

Note double** on the 2D array pointers.

Cilk++ function signature:

void matrix_multiply(double* A, double* B, double* C, unsigned int n);

Note double* on the 2D array pointers.

Row packing principally does one beneficial thing to your program. Row packing eliminates a memory fetch of a pointer to the row (as done with array of row pointers). This uses fewer Virtual Memory Translation Look Aside Buffers. And this improves cache utilization.

temp += m1[i][k] * m2[k][j]; // serial/parallel

Uses: 1 TLB for code + 1 TLB for stack (could be 0) + 1TLB for m1 row table + 1 TLB for m1 data + 1 TLB for m2 row table + 1 TLB for m2 data = 6 (or 5) TLB's.

Whereas the row packing method employed by the Cilk++ sample program:

A[itn + j] += B[itn + k] * C[ktn + j];

Uses: 1 TLB for code + 1 TLB for stack (could be 0) + 1 TLB for m1 data + 1 TLB for m2 data = 4 (or 3) TLB's.

The reduction in the numbers of TLB's required should produce an advantage.

And the corrisponding charts:

**Fig 3**(above)

Wow!, the Cilk++ technique on the Q6600 has a nice peak between 400 and 700 where it attains close to 25x performance over Serial but dropps off to about 14x at the higher range. This represents a 3x to 6x improvement over the parallel version of the standard matrix multiply code. The elimination of the array of row pointers made a significant difference.

Now looking at the Core i7 920 processor we find:

**Fig 4**

The Cilk++ method on the Core i7 920 attains almost 50x performance gain over serial method at about N=800 (12x over parallel), but appears as if it will be dropping off after 1024 (as it did earlier at 700 on the Q6600). Additional data points should be collected.

What this teaches us is: constructing 2D arrays as an array of pointers to 1D arrays looks good but can cost you dearly. Clearly the more efficient route is to use row packing to reduce the number of TLBs and corrisponding entries in the cache. But this means changing

double** A; // referenced as A[iRow][iCol]

into some class

Array2D<double> A;

And then changing the allocations and references (or fancy template operators).

If you look at this from a different perspective the array class object(s) are effectively array descriptors. Array descriptors are a well used technique by FORTRAN.

WIth an improvement of 12x over a parallel version of the serial implimentation this shows that paying attention to cache issues realy pays off. And that you may have to dispense with the familiar C/C++ programming practice of referencing a multi-dimensioned array as an array of pointers. A little extra programming effort pays off with significant returns in performance.

You would have to ask yourself: is this the best you can do?

And, is it worth your while to attempt at better performance?

The answer to this is the subject of my next article: **Part 2**Jim Dempsey

jim@quickthreadprogramming.com

www.quickthreadprogramming.com