Superscalar Programming 101 (Matrix Multiply) Part 1 of 5

By Jim Dempsey

The subject of this article is how to optimally tune a well-known algorithm. We will take this well-known (small) algorithm, look at a common approach to parallelizing it, then a better approach, and finally produce a fully cache-sensitized parallel implementation. The intention of this article is to teach you a methodology for interpreting the statistics gathered during test runs and then using those interpretations to improve your parallel code.

This is a multi-part article, and the author believes the process of reaching a goal (learning how to assess results and improve your parallel programming technique) is as important as the goal itself (the finished application). For without the process you will not know how to attain the goal to the fullest extent possible.

To entice you into reading the complete set of articles: you will see how to attain up to an 80x performance increase through parallel programming on a single-processor (4-core with HT) system over a simple serial method on the same system. (don't zoom the charts)
 

Let us begin…

One of the posters on the Intel Developer Zone forums provided a link to an MSDN article titled "How to: Write a parallel_for Loop" (http://msdn.microsoft.com/en-us/library/dd728073.aspx). This article illustrates how to use the Visual Studio 2010 Concurrency::parallel_for to compute the product of two matrices. The example illustrates how to use the parallel_for in a nested loop.

The typical C++ serial method to compute the matrix multiplication of a square matrix might look as follows:
 

// Computes the product of two square matrices.
void matrix_multiply(
double** m1, double** m2, double** result, size_t size)
{
   for (size_t i = 0; i < size; i++) 
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   }
}

Using the Visual Studio Concurrency::parallel_for, the code looks like this:
 

// Compute the product of two square matrices in parallel.
void parallel_matrix_multiply(
double** m1, double** m2, double** result, size_t size)
{
   parallel_for (size_t(0), size, [&](size_t i)
   {
      for (size_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (size_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   });
}

This makes use of C++0x lambda functions and the Concurrency parallel_for template. On the surface the conversion appears elegant (unusually effective and simple): essentially only two lines of code are rewritten, the outer for statement and the closing brace of that for statement.

The MSDN article reported the performance on a 4-processor system for a 750 x 750 matrix multiplication as:

Serial: 3853 (ticks)
Parallel: 1311 (ticks)

Approximately a 2.94x speed-up (3853 / 1311 ≈ 2.94).

This is not an ideal scaling situation in that it does not produce a 4x speed-up using 4 processors. But considering that there must be some setup overhead, 2.94x is not too bad for a matrix this small. And you might be inclined to think that only about 25% room for improvement remains.

The article did not chart the scaling as a function of N (one dimension of the square matrix), so it is difficult to tell the shape of the performance-gain trend line as N varies.

Although this article was written to illustrate the use of the parallel_for in the Microsoft Concurrency library, a parallel programmer might be misled into assuming that this example illustrates how to write a parallel matrix multiplication function. After all, it describes a parallel method for matrix multiplication and appears in an MSDN article - an authoritative source.

Let's see how we can do better at parallelizing the matrix multiplication.

First off, I do not have Visual Studio 2010, so I cannot use the sample program as-is.
I do have QuickThread (I am the author of this parallel programming toolkit, www.quickthreadprogramming.com), so I adapted the code to use QuickThread.

Adaptation is relatively easy: change the include files and account for minor differences in syntax.
 

// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(
double** m1, double** m2, double** result, size_t size)
{
    parallel_for(
        0, size,
        [&](intptr_t iBegin, intptr_t iEnd)
        {
            for(intptr_t i = iBegin; i < iEnd; ++i)
            {
                for (intptr_t j = 0; j < size; j++)
                {
                    double temp = 0;
                    for (intptr_t k = 0; k < size; k++)
                    {
                        temp += m1[i][k] * m2[k][j];
                    }
                    result[i][j] = temp;
                }
            }
        }
    );
}

In the QuickThread dialect, the parallel_for passes the half-open range as arguments to the function, whereas the parallel_for of the VS 2010 Concurrency library passes a single index into the body of the function. QuickThread chose to pass the half-open range because, knowing the range, the programmer and/or compiler can better optimize the code within the loop. N.B. this loop could just as easily have been written using OpenMP, Cilk++, or TBB. The choice of parallel tool/dialect is not important at this point in the article.
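For comparison, here is a minimal sketch of the same loop written with OpenMP (my own adaptation, not code from either article); the function name parallel_matrix_multiply_omp is simply an illustrative choice:

// Compute the product of two square matrices in parallel using OpenMP.
// Build with /openmp (MSVC) or -fopenmp (gcc/icc); intptr_t comes from <cstdint>.
void parallel_matrix_multiply_omp(
double** m1, double** m2, double** result, intptr_t size)
{
   #pragma omp parallel for
   for (intptr_t i = 0; i < size; i++)
   {
      for (intptr_t j = 0; j < size; j++)
      {
         double temp = 0;
         for (intptr_t k = 0; k < size; k++)
         {
            temp += m1[i][k] * m2[k][j];
         }
         result[i][j] = temp;
      }
   }
}

The pragma distributes the iterations of the outer i loop across the available threads, much as the QuickThread and Concurrency versions do.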

Running the QuickThread-modified code on Windows XP x64 with an Intel Q6600 4-core processor (no Hyper-Threading), we find the following scaling:
 

 

Fig 1 - Parallel Scaling


This shows an average scaling ratio of 3.77x for N values between 128 and 1024. Considering that four threads are involved, the above chart shows the parallel version of the serial code yielding a scaling efficiency of 0.94 (3.77x / 4 cores). This is a reasonably good scaling factor.

The processor model was not listed for the Visual Studio test, so it is hard to tell why QuickThread yields a 3.77x improvement on 4 cores while VS yields a 2.94x improvement on 4 processors (cores?). Setting aside the question of which threading toolkit is better, let's concentrate on the parallelization of the matrix multiplication.

The first thing to do is gather an additional (different) set of sample data. Let's see what the performance is on a processor with an L3 cache plus Hyper-Threading (HT).

When running the program on an Intel Core i7 920, 4 cores with HT (8 hardware threads, but only 4 floating-point instruction paths), we find a completely different trend line:
 

 

Fig 2


Why the dip when N ranges from 350 to 570?
When the performance recovers, why do we see 4x improvement instead of 8x improvement?
Why the jagged line (noise) towards lower N?

The dip is due to cache evictions caused by one thread adversely interacting with other threads.
We get just over 4x performance because the Core i7 920 is a 4-core processor capable of 4 execution streams for floating point (although it has 8 hardware threads for integer execution streams). At the higher N values it would appear that we are maxing out the capabilities of the processor: 4 cores == 4x performance boost.

The trend line is jagged because only one sample run was taken and because the run times are short when N is low. Also, the variable overhead in starting the thread team (now 8 threads) is more noticeable on the left-hand side of the curve. Averaging a larger number of runs would smooth out this line, but that consumes unnecessary time from the developer. Additionally, the large spike at about N=250 indicates that the Serial version took a large tick-count hit at that point due to circumstances outside the application's control (expansion of the page file or other activity on the system). You do not need a precise chart for program analysis purposes.

Is there anything we can do to improve this performance and/or correct the dip in performance for N between 350 and 570?

The main computational section of code for both the Serial and Parallel versions is the same:
 

for (intptr_t j = 0; j < size; j++)
{
   double temp = 0;
   for (intptr_t k = 0; k < size; k++)
   {
      temp += m1[i][k] * m2[k][j];
   }
   result[i][j] = temp;
}

Notice that the product accumulation statement
 

temp += m1[i][k] * m2[k][j];

uses k to index the column of array m1 and the row of array m2. This access pattern has two problems: first, it is not cache friendly, and second, it is not amenable to vectorization by the compiler. Because the matrices are stored row by row, stepping k down a column of m2 jumps size doubles at a time, touching a different cache line on nearly every iteration. Let's look at fixing these problems.
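As a first step, here is a minimal sketch of my own (not code from either article) showing how simply reordering the loops to i, k, j makes the innermost index j walk sequentially along rows of m2 and result. Note that this version assumes result has been zeroed before the call:

// Cache-friendlier loop order (i, k, j) for the double** layout.
// result must be zero-initialized before calling.
void matrix_multiply_ikj(
double** m1, double** m2, double** result, size_t size)
{
   for (size_t i = 0; i < size; i++)
   {
      for (size_t k = 0; k < size; k++)
      {
         double m1ik = m1[i][k];  // loop-invariant over the inner loop
         for (size_t j = 0; j < size; j++)
         {
            result[i][j] += m1ik * m2[k][j];  // unit-stride through m2 and result
         }
      }
   }
}

The inner loop now streams through consecutive memory, which is both cache friendly and amenable to compiler vectorization. The Cilk++ sample shown next uses this same loop order and, in addition, packs each matrix into a single contiguous array.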

Among the Cilk++ sample programs, which are (or will be) available with Intel Parallel Studio, we find a sample showing a cache-friendly implementation of matrix multiply:
 

// Multiply double precision square n x n matrices. A = B * C
// Matrices are stored in row major order.
// A is assumed to be zero-initialized on entry.
void matrix_multiply(double* A, double* B, double* C, unsigned int n)
{
    if (n < 1) {
        return;
    }

    // cilk_for is the only Cilk++ keyword used in this program.
    // Note the order of the loops and the code motion of the i*n and k*n
    // computations out of the inner loop. This gives a 5-10 performance
    // improvement over exchanging the j and k loops.
    cilk_for(unsigned int i = 0; i < n; ++i) {
        unsigned int itn = i * n;
        for (unsigned int k = 0; k < n; ++k) {
            unsigned int ktn = k * n;
            for (unsigned int j = 0; j < n; ++j) {
                // Compute A[i,j] in the inner loop.
                A[itn + j] += B[itn + k] * C[ktn + j];
            }
        }
    }
    return;
}

The itn (i times n) and ktn (k times n) variables take large steps through the arrays. The array pointers point to single-dimension arrays that are row packed. That is to say, each row is appended to the prior row in one large single-dimension array, which is then indexed by way of i*n + j and/or k*n + j.
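As a minimal sketch (my own illustration, not taken from the Cilk++ sample, with hypothetical helper names), a row-packed n x n matrix can be allocated and indexed like this:

#include <cstddef>   // size_t
#include <cstring>   // std::memset

// Allocate an n x n row-packed matrix as one contiguous block, zeroed
// so it can serve as the accumulation target of +=.
double* alloc_packed(unsigned int n)
{
   double* A = new double[(size_t)n * n];
   std::memset(A, 0, sizeof(double) * (size_t)n * n);
   return A;
}

// Element (i, j) of a row-packed matrix lives at offset i*n + j.
inline double& at(double* A, unsigned int n, unsigned int i, unsigned int j)
{
   return A[(size_t)i * n + j];
}

With this layout, at(A, n, i, j) reads or writes the same element that A[itn + j] addresses in the Cilk++ sample when itn == i * n.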

Serial and Parallel function signatures:
 

void matrix_multiply(
	double** m1, double** m2, double** result, size_t size);
void parallel_matrix_multiply(
	double** m1, double** m2, double** result, size_t size);

Note double** on the 2D array pointers.

Cilk++ function signature:
 

void matrix_multiply(double* A, double* B, double* C, unsigned int n);

Note double* on the (row-packed, single-dimension) array pointers.

Row packing principally does one beneficial thing for your program: it eliminates the memory fetch of a pointer to the row (as required with an array of row pointers). This uses fewer virtual-memory Translation Look-aside Buffer (TLB) entries, and that in turn improves cache utilization.
 

temp += m1[i][k] * m2[k][j]; // serial/parallel

Uses: 1 TLB entry for code + 1 TLB entry for stack (could be 0) + 1 TLB entry for the m1 row table + 1 TLB entry for m1 data + 1 TLB entry for the m2 row table + 1 TLB entry for m2 data = 6 (or 5) TLB entries.

Whereas the row packing method employed by the Cilk++ sample program:
 

	A[itn + j] += B[itn + k] * C[ktn + j];

Uses: 1 TLB entry for code + 1 TLB entry for stack (could be 0) + 1 TLB entry for m1 data + 1 TLB entry for m2 data = 4 (or 3) TLB entries.

The reduction in the number of TLB entries required should produce an advantage.

And the corresponding charts:
 

 

Fig 3


Wow! The Cilk++ technique on the Q6600 has a nice peak between N = 400 and 700, where it attains close to 25x performance over Serial, but it drops off to about 14x at the higher end of the range. This represents a 3x to 6x improvement over the parallel version of the standard matrix multiply code. The elimination of the array of row pointers made a significant difference.
 

I should state that the technique used by the Cilk++ demo program is referred to as "Cilk++" on the charts and in the body of this text. The same technique can be used with other parallel programming dialects. Keep this in mind as you read this series of articles.


Now looking at the Core i7 920 processor we find:
 

 

Fig 4


The Cilk++ method on the Core i7 920 attains almost a 50x performance gain over the serial method at about N=800 (12x over parallel), but it appears that it will drop off after 1024 (as it did earlier at 700 on the Q6600). Additional data points should be collected.

What this teaches us is: constructing 2D arrays as an array of pointers to 1D arrays looks good but can cost you dearly. Clearly the more efficient route is to use row packing to reduce the number of TLB entries and corresponding entries in the cache. But this means changing
 

   double** A; // referenced as A[iRow][iCol]

into some class
 

   Array2D<double>  A;

And then changing the allocations and references (or providing fancy template operators), as sketched below.
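A minimal sketch of what such an Array2D<T> class might look like (a hypothetical wrapper of my own, not a specific library class):

template<typename T>
class Array2D
{
public:
   Array2D(size_t rows, size_t cols)
      : rows_(rows), cols_(cols), data_(new T[rows * cols]())  // () zero-initializes
   {}
   ~Array2D() { delete[] data_; }

   // Element access: A(i, j) maps to the row-packed offset i*cols + j.
   T&       operator()(size_t i, size_t j)       { return data_[i * cols_ + j]; }
   const T& operator()(size_t i, size_t j) const { return data_[i * cols_ + j]; }

   size_t rows() const { return rows_; }
   size_t cols() const { return cols_; }

private:
   Array2D(const Array2D&);             // copying omitted for brevity
   Array2D& operator=(const Array2D&);
   size_t rows_, cols_;
   T* data_;
};

References then change from A[i][j] to A(i, j), while the storage underneath remains one contiguous, row-packed block.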

If you look at this from a different perspective, the array class objects are effectively array descriptors. Array descriptors are a well-used technique in FORTRAN.

With an improvement of 12x over a parallel version of the serial implementation, this shows that paying attention to cache issues really pays off. It also shows that you may have to dispense with the familiar C/C++ programming practice of referencing a multi-dimensional array as an array of pointers. A little extra programming effort pays off with significant returns in performance.

You would have to ask yourself: is this the best you can do?
And is it worth your while to attempt better performance?

The answer to this is the subject of my next article: Part 2
Jim Dempsey
jim@quickthreadprogramming.com
www.quickthreadprogramming.com


16 comments

leifblaese:

Cool article, but am I the only one seeing the wrong pictures? Fig 2 shows the parallel baseline, NUMA-aware threading, and hybrid-MPI threading; Figures 3 and 4 show code snippets of some Android code, completely unrelated to the text. I guess something is wrong.

jimdempseyatthecove:

yuriisig,

Can you send me a working copy of the referenced source code as an application? The input should be an arbitrary size for two N x N arrays, generated with random data, followed by a timed multiplication producing an N x N product.

With that in hand, I can make a test run on the various systems used in this article.
And then I can return the results to this blog, or as a follow-up article.

For a follow-up article, I am writing a newer implementation using the CEAN (C++ Extension for Array Notation). The algorithm will be somewhat different due to the characteristics of CEAN.

Best Regards,
Jim Dempsey
jim@quickthreadprogramming.com
(I have license to MKL)

Taronyu:

Very good article!

yuriisig:

Did you compare with Intel MKL's dgemm? Is this a presentation for beginners? See how to correctly use matrix multiplication: http://software.intel.com/en-us/forums/showthread.php?t=77331&o=d&s=lr

jimdempseyatthecove:

Jean,

I would like to add:

I like FORTRAN, in particular Intel's version with Array Visualizer (anyone listening?). One of my major projects (when I get time to work on it) is an independent scientific research project for studying Space Elevators. This project contains ~750 source files (~600,000 lines) and pays particular attention to parallel programming needs, although not to the extent of the subject of this article (attention to cache levels). I use OpenMP for that project instead of QuickThread due to QuickThread's dependency on C++ templates. Time permitting, I can (will?) write a FORTRAN "preprocessor" that will provide the necessary template-like capabilities. My earliest version of QuickThread targeted FORTRAN and worked; however, without templates it was a real pain to write the interfaces, making it non-marketable. The preprocessor will provide functionality similar to -gen-interfaces but incorporating the template-like capability. That is a future project waiting for development time/funds.

Jim Dempsey

jimdempseyatthecove:

Jean,

I am sorry for the delay in my response (I was looking at responses in the most recent articles).

When using IVF and

c = matmul(a,b)

with optimizations set to Maximum Speed, SSE3 enabled, and /Qparallel enabled,

one run (not definitive) on an Intel Q6600 (4 cores, no HT) at 2.4 GHz under Windows XP x64 gave, for square matrix sizes of:

N       MATMUL    PTTx
1024     3.245    0.231    (14x faster – fits in cache)
2048    28.79     5.06     (5.69x faster – spills out of cache)

It is unclear to me whether the Fortran MATMUL (with /Qparallel) is executed in parallel.

Jim Dempsey

anonymous:

This is perhaps slightly out of topic, but I have to vent some of my frustrations:

In Fortran, matrix multiply can be simply written as:

c= matmul(a,b)

And all the low level details are supposed to be handled by the compiler, rather than by the programmer, to get the most efficient code possible. The programmer can then work at the algorithm level, not at the machine level.

It is rather disappointing to see how Intel considers Fortran a second-class citizen. Look, for example, at the time they are taking to develop a full Fortran 2003 implementation. When will a complete implementation be available? 2011? 2012? Yet they nevertheless add array extensions and parallel development tools to C++. If they had put the same effort into Fortran, be assured that we would have had a full Fortran 2003 compiler long ago.

But eventually one has to make a decision. A programming language is a tool, and if the development of that tool stagnates or takes too much time, then one has to switch to another to stay current in software development methodology. OOP is now an essential part of application development. That's why I decided, with regret, to abandon Fortran.

The fact that Fortran compiler development has been deemphasized is not unique to Intel. Absoft's and Lahey's compilers have stagnated for several years now.

Cray and IBM now have fully compliant Fortran 2003 compilers, but they run only on specialized platforms.

Sorry for the rant,

Jean

anonymous:

Really good topic. I used this; it's useful.

jimdempseyatthecove:

Ingo,

You are correct (although "flaw" might not be the correct choice of words).

The point of these articles is to get the reader thinking about cache access issues, as opposed to applying brute force by adding more cores. In this respect, together with your suggestions for improved performance, I would consider the articles a success.

One of the other readers, who was unable to post a comment here, emailed a similar message and attached a .cpp file with improved serial and parallel methods. Although his contributions are too late to be included in the current series of articles, I will insert the suggestions into a test program that will be available at my website (www.quickthreadprogramming.com). The site does not have the test program today but will have it after the series. I have one more set of test runs to make for this series of articles; then the test code will be uploaded to the web site.

Arseny Kapoulkine was the reader referenced above who emailed me directly, also pointing out the "flaw" in the analysis. He offered an improved method, which cannot be addressed in this comment but will eventually appear in the test code (to be uploaded to my web site later). Arseny also pointed out an improved row-table allocation strategy that is applicable to the original serial code (and the parallel derivation thereof). This technique keeps the row table yet exhibits improved cache utilization (~25%):

double** create_matrix(size_t size)
{
   // One contiguous block holds all of the matrix data...
   double* data = new double[size * size];
   // ...while a separate table of row pointers preserves the m[i][j] syntax.
   double** m = new double*[size];
   for (size_t i = 0; i < size; ++i)
   {
      m[i] = data + i * size;
   }
   return m;
}

The above technique assures that the block of data holding the array uses a minimum number of virtual memory pages (a page-aligned allocation may get you one less page). And this in turn reduces the number of different TLB entries involved in mapping the data.

The path traversal (the order in which you process the arrays) has a much more significant impact. However, every little bit helps.
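For completeness, a matching deallocation (my own sketch, not part of the emailed code) only needs to free the single data block and the row-pointer table:

void destroy_matrix(double** m)
{
   delete[] m[0];  // m[0] points at the start of the contiguous data block
   delete[] m;     // then free the row-pointer table
}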

Jim Dempsey
