I need to calculate the products of some relatively small matrices. I've written a test program (attached, hopefully) which demonstrates several performance problems with MKL's dgemm function on Phi. The matrices are tall and narrow, which seems to be particularly bad for Phi. (All of my dimensions are multiples of 64 bytes, so I don't think alignment is the culprit.)

The first problem (a bug, in my opinion) is that performance falls dramatically when fewer than 4 threads are used. With 4 MKL threads I can complete nearly 250 dgemm calls per second; with 1 or 2 threads I can barely complete one dgemm call per second. The problem only seems to occur when one of my matrices has fewer than 16 columns.

The second problem is that performance actually improves when I increase the number of columns in one of my matrices. With 8 columns I can complete 1244 dgemm calls per second, but by doubling to 16 columns I can get up to 1330 dgemm calls per second, even though each call then requires twice as many FLOPs (at 8 columns each call is 2*192*1536*8 ≈ 4.7 MFLOP, so 1244 calls per second is only about 5.9 GFLOP/s, versus roughly 12.6 GFLOP/s at 16 columns). Since the VPU registers are 512 bits wide and can operate on 8 doubles in parallel, it surprises me that performance is lower with exactly 8 columns.

The dgemm implementation seems to thrive on large matrices, but for the matrices I'm using the performance is very poor. For example, when calculating A[192x8] = B[192x1536] * C[1536x8], the Phi achieves a maximum throughput of just under 1400 dgemm calls per second with 30 threads. My 3 GHz Xeon X5570 can achieve over 10,000 dgemm calls per second with 8 threads, and over 1700 calls per second with a single thread. I understand that the Phi is optimized for larger matrices, but it surprises me that the Xeon beats the Phi by such a huge margin (7.5 times faster!).

1. Is the first problem a known bug in MKL? If so, is there a tracking number?

2. Are there tuning options, or anything else I can do, to improve the performance of dgemm for relatively small matrices (e.g. B[192x1536] * C[1536x8])?

Sample output from Phi (first column is the thread count):

A[192x8] = B[192x1536] * C[1536x8]
Warmed up and verified
240: 1243.329845 dgemm per second
120: 1242.103201 dgemm per second
60: 1243.661724 dgemm per second
30: 1389.185937 dgemm per second
16: 1075.548270 dgemm per second
8: 775.800619 dgemm per second
4: 335.006705 dgemm per second
2: 1.531311 dgemm per second
1: 1.523509 dgemm per second

Sample output from Xeon (first column is the thread count):

A[192x8] = B[192x1536] * C[1536x8]
Warmed up and verified
240: 10376.460424 dgemm per second
120: 10520.274101 dgemm per second
60: 10459.884941 dgemm per second
30: 10514.169381 dgemm per second
16: 10519.579217 dgemm per second
8: 10537.820857 dgemm per second
4: 6037.885280 dgemm per second
2: 3315.041953 dgemm per second
1: 1738.238651 dgemm per second