dgemm: very slow with rectangular matrices

I need to calculate the products of some relatively small matrices. I've written a test program (attached, hopefully) which demonstrates several performance problems with MKL's dgemm function on Phi. The matrices are tall and narrow, which seems to be particularly bad for Phi. (All of my dimensions are multiples of 64 bytes, so I don't think alignment is the culprit.)

The first problem (a bug, in my opinion) is that performance falls dramatically when fewer than 4 threads are used. With 4 MKL threads I can complete nearly 250 dgemm calls per second; with 1 or 2 threads I can barely complete one dgemm call per second. The problem only seems to occur when one of my matrices has fewer than 16 columns.

The second problem is that performance actually improves when I increase the number of columns in one of my matrices. With 8 columns I can complete 1244 dgemm calls per second, but by doubling to 16 columns I can get up to 1330 dgemm calls per second, even though twice as many FLOPs are required. Since the VPU registers are 512 bits wide and can operate on 8 doubles in parallel, it surprises me that performance is lower with 8 columns.

The dgemm implementation seems to thrive on large matrices. However, for the matrices I'm using, performance is very poor. For example, when calculating A[192x8] = B[192x1536] * C[1536x8], the Phi achieves a maximum throughput of just under 1400 dgemm calls per second with 30 threads. My 3GHz Xeon X5570 can achieve over 10,000 dgemm calls per second with 8 threads, and over 1700 calls per second with 1 thread. I understand that the Phi is optimized for larger matrices, but it surprises me that the Xeon beats the Phi by such a huge margin (7.5 times faster!).

1. Is the first problem a known bug in MKL? Is there a tracking number?

2. Are there tuning options or something else I can do to improve the performance of dgemm for relatively small matrices (e.g. B[192x1536] * C[1536x8])?

Sample output from Phi:

A[192x8] = B[192x1536] * C[1536x8]
Warmed up and verified
240: 1243.329845 dgemm per second
120: 1242.103201 dgemm per second
60: 1243.661724 dgemm per second
30: 1389.185937 dgemm per second
16: 1075.548270 dgemm per second
8: 775.800619 dgemm per second
4: 335.006705 dgemm per second
2: 1.531311 dgemm per second
1: 1.523509 dgemm per second

Sample output from Xeon:

A[192x8] = B[192x1536] * C[1536x8]
Warmed up and verified
240: 10376.460424 dgemm per second
120: 10520.274101 dgemm per second
60: 10459.884941 dgemm per second
30: 10514.169381 dgemm per second
16: 10519.579217 dgemm per second
8: 10537.820857 dgemm per second
4: 6037.885280 dgemm per second
2: 3315.041953 dgemm per second
1: 1738.238651 dgemm per second

Attachment: gemm-test.c (2.87 KB)

As you indicated, MKL dgemm for MIC is optimized for large matrices (all dimensions >= 32).  In the automatic offload case, much larger dimensions are required to make the threshold for offloading to MIC rather than running on host.

By compiling the netlib dgemm source and adding OpenMP, I've seen up to 4x the performance of MIC MKL for matrices large enough to use threads effectively but not large enough to engage MKL's optimizations. Full performance depends on using threads effectively; the nature of "many-core" is using more threads than you would on the host. Without OpenMP collapse, the open-source code would use only 8 threads effectively for your problem; you could run several instances in parallel if that were useful.
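The collapse idea looks roughly like this (a sketch, not the netlib source): collapsing the two outer loops gives OpenMP m*n iterations to distribute, so a small dimension no longer caps the usable thread count.

```c
#include <assert.h>

/* Triple-loop dgemm with the two outer loops collapsed. With collapse(2)
 * OpenMP schedules m*n independent dot products across threads; without
 * it, a narrow outer dimension limits how many threads can stay busy.
 * (Compiles and runs serially if OpenMP is not enabled.) */
void collapsed_dgemm(int m, int n, int k,
                     const double *B, const double *C, double *A)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += B[i * k + p] * C[p * n + j];
            A[i * n + j] = sum;
        }
}
```

Each (i, j) output element is independent, which is what makes the collapsed loop nest safe to parallelize without further synchronization.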

Hi Tim,

I'm not using automatic offload - I'm running the entire test case on the Phi, so PCIe latency isn't an issue for me.

In my workload I have some cases where I do need to perform many independent dgemm operations in parallel. The 200X penalty from going from 4 threads to 1 thread will cripple my workload. The massive penalty for using fewer than 4 threads must be a bug - can you confirm this?

I'll take a look at netlib, but I'm surprised that MIC MKL performance degrades so badly with small matrices.

The only reason I can think of for superlinear scaling with number of threads would be that your problem is taking advantage of additional cache by spreading across multiple cores.

You could easily devote 4 threads to each of 30 or more parallel instances of dgemm, so I don't see an objection in principle to choosing an optimum number of threads for each.

Hi Tim - I agree that I can work around the problem by using four or more threads. I can also work around the problem by using a larger matrix. The slowdown is NOT reproducible with A[192x16] = B[192x1536] * C[1536x16]; i.e. if I double the size of C, it runs more than 200 times faster. This suggests that the problem is not cache related. You can easily reproduce this by compiling my test case and running it as ./a.out 1536 192 16.

This behavior does not make sense. I am convinced that this is a bug in MKL.

I agree that current MIC MKL dgemm is not a good solution for these cases with a small dimension.  The lack of documentation on this question (other than the stated limits for automatic offload) might lead one to hope incorrectly that MKL would optimize all such cases, as it may do more successfully for the host.

By the way, I wonder if you meant the title of this thread to refer to non-rectangular matrices.  Rectangular ones, and those which are suitable for high linpack benchmark performance, tend to be the first to be optimized, and some of the techniques don't apply well to the other cases.

Hi Tim,

Thanks for your note. Initially, I suspected that the problem was related to non-square matrices, thus the title. I now believe that the problem is due to very narrow matrices. (Of course all matrices are rectangular.) I tried to edit the title, but was unable to do so.

Unfortunately, my application demonstrates two different BLAS use cases which both appear to expose very poor performance in MKL.

I need to calculate matrix-matrix products of tall-but-narrow matrices (e.g. 10000x8). This is a serial part of my program, so I need MKL to parallelize it efficiently. Unfortunately it fails to do so; even a naive OpenMP implementation (three nested loops) is far faster.

I also need to calculate tens of thousands of small matrix-vector and vector-vector products (dgemv/ddot). These typically have dimensions of between 8 and 192. I've found that ddot, for example, is 2 to 4 times slower than a simple loop for calculating dot products, even for several thousand elements.
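The "simple loop" I'm comparing ddot against is nothing more elaborate than this (function name is mine, for illustration):

```c
#include <assert.h>

/* Plain unit-stride dot product; compilers auto-vectorize this readily.
 * The MKL equivalent is cblas_ddot(n, x, 1, y, 1). */
double dot_simple(int n, const double *x, const double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```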

Furthermore, the dimensions of my data change depending on the input. I'm very uncomfortable with the discovery that MKL performance can change dramatically and unpredictably due to small changes in the input data. For example, MKL can calculate dgemm of two 65536x16 matrices at a rate of 40/second (1.3 GFLOPS). It can calculate dgemm of two 65536x8 matrices at a rate of 6/second (50 MFLOPS). (By comparison, my own naive OpenMP implementation can calculate 430/second (3.6 GFLOPS)). I can't provide SLA guarantees to my customers if my underlying libraries are this unpredictable.

This is frustrating, because this application seems like a good fit for Phi and the programming model is very attractive.


We feel your frustration, and I highly recommend filing bugs with test cases on premier.intel.com if you have not already. The MKL dgemm (and all the other functions), as you might expect, isn't implemented as a single subroutine, but rather as a series of heavily optimized versions for common matrix dimensions/ratios, plus a default implementation (hopefully well optimized) that handles the rest of the cases. You have clearly fallen into problem sizes/ratios that don't work well in the default implementation (they probably do something nasty to the caches) and don't have special-case implementations. It would be good to capture those usage models and possibly get you a fix.

Thanks, Charles
