Multiplying dense matricies

Multiplying dense matricies

What is the recommended (optimised) way of multiplying 2 matrices?I've seen the matrix * vector example - not sure how to extend that to the full matrix *matrix case.thanks.....

20 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I'm working the same problem for sparse matrices, so having a dense solution is my first step. As with the matrix * vector example, where a "temporary matrix" is formed by duplicating the vector and then multiplying coincident elements and add_reduce() to form the answer, our goal is to multiply coincident elements of a 3-D container and add_reduce() to obtainthe2-D matrix answer.

Example: A . B = C

My first cut is to duplicate B as n pages, n = B.num_rows(). This is our first cube. Then we need A duplicated as n pages, n = B.num_cols(), but unfortunately these pagesneed to benormal to the column axis so we have to form this cube another way. I chose to serialize the entries, duplicate itwithout collating, and reshape intoa cube. Then multiply the cubes' entiries and add_reduce() along the row axis. Example code...

dense A = dense::parse("{{1.,2.},{3.,4.}}");
dense B = dense::parse("{{1.,2.,3.},{4.,5.,6.}}");

dense B3 = repeat_page(B, B.num_rows());
dense A3 = reshape(repeat(A.flatten(), B.num_cols(), false), B.num_cols(), B.num_rows(), B.num_rows());
dense ans = add_reduce(A3*B3, 1); //should be{{9,12,15},{19,26,33}}

Anyone else? Note,the quotes around "temporary matrix" (above) are to indicate ArBB optimizes such that usuallyonly part of theconceptual matrix is realized in memory at any time.

- paul

My second cut uses map(), which will be multicore friendly. Instead of add_reduce() on a cube, calculate the answer "in place" (as we were taught in high school)by summing subproducts of column/row traversals. (IMHO, this is an excellentargumentforsupportingreductionfunctions onread-only data inside map().)

void mul2D_map(flt &Ans, const dense &A, const dense &B)
usize col, row;
position(col, row);
Ans = 0.;
_for(usize idx = 0, idx < A.num_cols(), idx++) {
Ans += A(idx, row)*B(col, idx);
} _end_for;
void mul2D(dense &Ans, const dense &A, const dense &B)
map(mul2D_map)(Ans, A, B);

Example code...

denseA = dense::parse("{{1.,2.},{3.,4.}}");
denseB = dense::parse("{{1.,2.,3.},{4.,5.,6.}}");
dense ans(_b.num_cols(), _b.num_rows()); // sized so map() knows how to iterate

const closure &, const dense &, const dense &)>
mul2D = capture(mul2D);
mul2D(ans, A, B); // should be {{9,12,15},{19,26,33}}

Anyone else? Bueller?...

- paul

My third cut loops over rows, creating a nearly one-liner solution. Same set-up as my second cut (above)...

void mul2D(dense &Ans, const dense &A, const dense &B)
_for(usize idx = 0, idx < B.num_rows(), idx++) {
Ans = replace_row(Ans, idx, add_reduce(repeat_col(A.row(idx), B.num_cols())*B, 1));
} _end_for;

Anyone else?

- paul

Interesting - what kind of performance do you get?I've tried running it on my machine (I7 920, 64bit, 6Gb memory), and it is taking 100ms to multiply two 500x500 matricies.... It is thrashing the hell out of the CPUs - 98% most of the time!

I am justrunning the small example. Your case is 500 multiplies and 499 additions, 500 x 500 times, thus 249.75 Mflops in ~100ms, yielding ~2.5 Gflops/sec -- not bad for starting out! I am still iterating on the ways to "skin this cat" so try them out and please report your times. I've noticed that reduction operations along the column axis can be 8x faster than e.g. along the row axis, as (my hunch) the former is friendly to memory layout & access and the SSE hardware, loads & stores. And any of my "solutions" that form a cube will exceed cache for your problem, sothere is that to contend with.

By the way, "thrashing the CPUs" is a good thing ... means they are busy. Just need to make them effective.

- paul

My 4th cut uses map() to populatethe temporary cube, and arranges it so thatreduce is along the column axis (this seems to be friendly to memory layout & access and the SSE hardware, loads & stores). Same set-up as before (above)...

void mul2D_map(flt &Tmp, const dense &A, const dense &B)
usize col, row, pag;
position(col, row, pag);
Tmp = A(col, pag)*B(row, col);

void mul2D(dense &Ans, const dense &A, const dense &B)
dense tmp(B.num_rows(), B.num_cols(), B.num_rows()); // sized so map() knows how to iterate
map(mul2D_map)(tmp, A, B);
Ans = add_reduce(tmp, 0);
// for 3D column-wise reduce, the axes "flip down" by one dimension
// the old row axis becomes the newcolumn axis and
// the oldpage axis becomes the new row axis

This trades off space for time, as map() distributes the look-ups & multiplyacross cores and the add_reduce does too.

- paul

Yes - the reason I mentioned the CPUs is that it means the setup is OK and it is going heavily multithreaded.The times are slightly worse than doing the same calculation in Octave, which uses the ATLAS BLAS library (single threaded)..... so there must be some more fat in there somewhere!Many thanks for taking a look at this.

version 4 dies on a 400x400 caculation... investigating

My 5th (and likely final) cut is like my 3rd, but reversing the roles of columns and rows. My hunch is one of these two methods will be the fastest for larger matrices as the intermediate data bloat is minimized as it proceeds, which maximizes the size of problem that will fit in cache. The question is which way executes faster: reducing along columns to create rows, or reducing along rows to create columns? Same set-up as before (above)...

void mul2D(dense &Ans, const dense &A, const dense &B)
_for(usize idx = 0, idx < B.num_cols(), idx++) {
Ans = replace_col(Ans, idx, add_reduce(A*repeat_row(B.col(idx), B.num_rows()), 0));
} _end_for;

Now I shouldfocus onmy little problem of doing the same for sparse matrices.

- paul

Results! Ta-da!!!

My workhorse laptop is a Lenovo T60p, Core2 dual, T7600 @ 2.33GHz. This means, on a good day, with 100% concurrency of floating-point add and multiply units(i.e. not SSE), yields 5.66Gflops/sec for a core. Good luck with that.

Methods #1, #2, and #4 were relative dogs: single core times for 100x100 single-precision FP were 31.9ms, 7.53ms, and 10.6ms, respectively; Gflops/sec of 0.062, 0.264, and 0.188, respectively. These did scale well to two cores, FYI.

Methods #3 and #5 were winners: single core times for single-precision FP, single core, 100x100 and 300x300 were 10.5ms and 843us, Gflops/sec were 1.9 and 2.4, respectively!Dual core times were 548us and 446us, Gflops/sec of 3.6 and 4.5, respectively! This is awesome, considering the data shuffling of 1.9Mops and 53.9Mops.

Method #5, as champion (until the ArBB groups tells us how we should have done this!), had the following measurements for 300x300 (53Mops), 400x400 (128Mops), and 500x500 (250Mops), single-precision FP, single core: 15.9ms, 35.6ms, and 67.0ms; Gflops/sec of 3.412, 3.591, and 3.728. Dual core times were: 8.19ms, 18.5ms, and 34.7ms; Gflops/sec of 6.59, 6.91, and 7.20!

This is the first time I've seen anything faster than the best-case non-SSE floating-point performance! And thetwo-core scalingis quite good.

Method #5double-precision measurements single/dual core for100x100, 300x300, 400x400, and500x500: 1.33ms/0.700ms, 28.2ms/14.5ms, 61.4ms/31.7ms, and 120ms/61.9ms; Gflops/sec of 1.50/2.84, 1.92/3.72, 2.082/4.033, and 2.081/4.035. Very impressive ratesand scaling!

Wonder if Intel will loan an i7 Sandy Bridge machine to me?
- paul

Hello Paul,
Hello Malmesbury,

thank you guys for sharing your results in the forum, and helping each other!

First of all, we are working on making straight-forward variants of an algorithm the best performing implementation. This would include reproducing similar performance in the light of variations up to some degree.

Having said this, I am still guessing you guys like to play a little bit:

void mul(dense& result, const dense& a, const dense& b)
struct local {
static void mul(std::size_t u, dense& result,
const dense& a, const dense& b)
const usize m = a.num_rows();
const usize n = a.num_cols();

result = repeat_col(a.col(0), m) * repeat_row(b.row(0), m);
for (std::size_t j = 1; j < u; ++j) {
result += repeat_col(a.col(j), m) * repeat_row(b.row(j), m);

const usize size = n / u;
_for (usize i = 1, i < size, ++i) {
const usize base = i * u;
for (std::size_t j = 0; j != u; ++j) {
const usize k = base + j;
result += repeat_col(a.col(k), m) * repeat_row(b.row(k), m);
} _end_for;

local::mul(2, result, a, b);

const auto_closure arbb_mul = capture(mul); // does not include the compilation!
arbb_mul(ans, A, B); // warmup
arbb_mul(ans, A, B); // etc.

Please note, that the given factor 'u' requires to have a problem size which is a multiple of this factor in order to achieve correct results. Also, this variant is obviously not what we can recommend with respect to what's said earlier.


Since multiplying two dense matrices is a relatively fixed problem, it's worth considering a fixed function library which is optimized for this kind of problem, e.g. Intel MKL. Although Intel ArBB does well with multiplying two dense matrices, it might be even better in case of custom algorithms.

Multiplying two dense matrices using Intel ArBB in a straight-forward manner might looks like:

void mul(dense& result, const dense<32, 2>& a, const dense& b)
dense t;
_for (usize i = 0, i != a.num_rows(), ++i){
t = b * repeat_col(a.row(i), b.num_rows());
c = replace_row(c, i, add_reduce(t, 1));
} _end_for;

One objective is to replace rows (instead of columns) when the data format is row-major. Another objective is to grab as much data as possible by using operators to populate the intermediate results. Since the add_reduce in the above variant is collapsing along the non-default dimension, the earlier/above version (without unrolling the code :-) can be seen to be straight-forward as well:

void mul(dense& result, const dense& a, const dense& b)
const usize n = a.num_rows();
result = repeat_col(a.col(0), n) * repeat_row(b.row(0), n);
_for (usize i = 1, i < a.num_cols(), ++i) {
result += repeat_col(a.col(i), n) * repeat_row(b.row(i), n);
} _end_for;

Also, a version repeating (_for loop) the recommended way to multiply a dense matrix and a vector
result = add_reduce(a * repeat_row(b.col(i), a.num_rows()));
is certainly straight-forward (despite of grabbing the columns of the second matrix).


Hans makes two proposals, the former is similar to my 3rd cut and the latter is new to this thread. As it turns out, the latter needs a fix for non-square matricies...

void mul2D(dense &Result, const dense &A, const dense &B)
const usize n = B.num_cols();
const usize m = B.num_rows();
Result = repeat_col(A.col(0), n) * repeat_row(B.row(0), m);
_for(usize i = 1, i < m, i++) {
Result += repeat_col(A.col(i), n) * repeat_row(B.row(i), m);
} _end_for;

The performance I measured is close "but no cigar" with my 5th cut. Firstly, the single-precision FP, single-core results (sameCPU describedbefore)for 100x100, 300x300, and 400x400 are: 882us, 16.7ms, and 37.4ms; with Gflops/sec of 2.26, 3.23, and 3.42. The disappointment comes for dual-core: 848us, 10.1ms, and 22.0ms; with Gflops/sec of 2.35, 5.34, and 5.81. Something is not scaling as well, at least in beta 4.

The companion double-precision results, single/dual core, are: 1.37ms/1.11ms, 30.0ms/17.6ms, and 64.9ms/38.3ms; with Gflops/sec of 1.45/1.79, 1.80/3.06, and 1.97/3.34.

- paul

Thank you Paul!

Nice you've improved the code for non-square matrices.
I guess you just run the serious proposal, how is this?


Hi Hans,

Well, I may yet see how effective loop unrolling is, e.g. I've come across a case where hardcoding the zeroeth iteration is slightly slower, it's faster to let the loop do that one too! Also, it would be nice to not insist the size of the matrix be a integer multiple of the un-roll amount; a final loop to handle the remainder.

By the way, while you are being a good company-man mentioning MKS most of us think your job is to put things like MKS out of business! Hand-tuned libraries are so '90s! And it's just a matter of time before someone has back-ended ATLAS with ArBB.

Back to my little problems, I embarked on this one-day "matrix fest" to understand all the ways, efficient or otherwise, to multiply dense matrices to learn what I really needed, which is the same for sparse matrices. That illumination came today with a row-only method thatI haven't posted cuz it's only slightly faster than my 3rd cut but still slower than my 5th cut much above 100x100. But it's the technique I was looking for with CSR-style data format.

But, wow, those Gflop rates for NUMA were impressive, eh?

- paul

Many thanks for all of this. I have been trying the Intel MLK BLAS libarary as an altenative - should have looked at it earlier, really. I will do a test of this vs MLK in the next couple of days.thanks again!

One thing I'v enoticed is that while ARBB uses 95%+ of CPU resouces (8 threads on my I7 920), MLK seems top out at 5 threads and about 50%. I've seen some references to the more cunning BLAS libraries avoiding strangling themselves due to maxing out resources. Might some strategic throttling in ARBB help?

Hello Malmesbury,

Thank you for your continued support of Intel ArBB! Regarding the Intel MKL, did you link against the Intel OpenMP library? Here is a starting point for linking with the Intel Math Kernel Library (MKL) in general.


"Trust, but verify."

I downloaded the latest Eigen release ( to spot check using a 100x100 dense double-precision matrix. Same laptop (above) and a single core measured, 934us and 455us with SSE2 code generation. The respective Gflops/sec rates are 2.13 and 4.37 which is77% ofthe 5.66 limit for that core. (Could they also sneak in anx87 multiplyduring SSE2 execution? Curious.)

For single-precision SSE2, Eigen100x100 measured at 246us, Gflops/sec of8.09 (wow).

To recap, the best SP/DP times for the same calculation in ArBB, my 5th cut, are 843us/1.33ms and 446us/700us for one and two cores respectively; Gflops/sec of 2.40/1.50 and 4.50/2.84 . . .

Just FYI.

- paul

Leave a Comment

Please sign in to add a comment. Not a member? Join today