Traditional MKL performance tests being executed on Nehalem-EX by default got rather poor performance on a number of MKL LAPACK functions, particularly on DGETRF. It's been detected that using such utility as numactll on that system under Linux, namely controlling NUMA policy for shared memory, improves performance dramatically , that is, starting an application in this way:
numactl --interleave=all <application>
instead of just running the application in the default mode, when NUMA memory policy isn't controlled:
gets another performance level for MKL LAPACK functions. BLAS functions behave identically.
Windows doesn't happen to provide the same sort of utility, so the performance on Windows is poor.
It was observed once on Itanium NUMA-based SGI Altix system, 64 nodes, that initializing the factorized matrix in multiple threads (i.e. distributing matrix) helps performance much. The article aims to explain this idea applied to Nehalem-EX system in details and make recommendations how to get the best performance on NUMA-based system regardless of NUMA memory policy.
1numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for command and inherited by all of its children.
In addition it can set persistent policy for shared memory segments or files.
Policy settings are:
--interleave=nodes, -i nodes
Set an memory interleave policy. Memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target fall back to other nodes.
Performance issue w/o NUMA memory policy control
MKL DGETRF example
Having an input matrix initialized on one node, we can't get the same level of performance in the default mode as with NUMA memory policy controlled. The following chart shows a serious performance limitation on a wide range of sizes in the default mode (no numactl used).
Interleaved memory policy has the effect that the data gets closer in general to the core where a code is being executed, thus extremely impacts on the performance.
High performance level at 1K-3K equations is explained by the fact that the working data fits LLC (Last Level Cache), that is, pulled in cache once it's never popped out - the cache is missed only once in the very beginning.
Since around 80-90% time of DGETRF is spent in DGEMM, this limitation is obviously inherited from DEGMM performance limitation working on the "far" memory.
Matrix distribution effect
Experiments have been made on Linpack first. Different matrix initialization styles have been examined:
- traditional single node distribution: the whole matrix is initialized (and previously allocated) in the master thread;
- multiple nodes distribution: the matrix is initialized in multiple threads with the 1-d block-cyclic distribution, that is, threads #1 fills columns 1:nb, threads #2 columns nb+1:2*nb and so on in round-robin manner. nb here depends on the problem size and is equal to the LU algorithm block size.
For both styles Linpack has been run in the default mode and under numactl.
The following chart demonstrates how seriously the matrix initialization style impacts on performance.
Distributing matrix over multiple nodes keeps performance close to the highest level even in the default mode.
In other words, distributing the data over nodes before calling MKL has about the same effect as controlling NUMA memory policy with numactl.
Effective distribution parameter nb
Further experiments show the effective value of 1-d block-cyclic distribution parameter nb. As demonstrated on the next chart, the nb values up to 64 give the optimal performance when numactl isn't used. Larger values slow down the code, the larger the worse.
Let's consider DGEMM outer product case relevant to Linpack, that is, A, B are not transposed, large M = N, K ~100. The operation is C = C - A*B:
where A is M-by-K, B is K-by-N, C is M-by-N.
Again, several kinds of performance measurements have been done at different K:
- NUMA interleave memory policy (numactl --interleave=all) turned on, matrices are not distributed;
- no NUMA memory policy (default mode), matrices are distributed;
- no NUMA memory policy (default mode), matrices are not distributed.
The results can be seen on the following charts for K=64 and K=192.
The following statements are actual for DGEMM performance concerning data distribution and NUMA memory policy:
- if the data are not distributed, numactl allows to get higher or much higher performance compared to default mode, the less K the larger the difference;
- distributing matrices in the default mode has almost the same effect as applying NUMA interleave memory policy.
How to distribute a matrix
The distribution method used here is a parallel initialization of a matrix in 1-d block-cyclic manner. Here's a code fragment distributing matrix a: mxn, column-major layout, leading dimension lda, with a distribution factor nb by means of OpenMP:
#pragma omp parallel for default(shared) private(j) schedule(static,1) for( j = 0; j < n; j += nb ) // init_matrix(m,n,a,lda,i,j) initializes sub-matrix a:m-by-n // with leading dimension lda, located at (i,j) offset from // the top-left corner of some global matrix init_matrix( m, min( nb, n-j ), &a[j*lda], lda, 0, j );
The same can be done while copying the matrix b into a for the purpose of using matrix a in further computations:
#pragma omp parallel for default(shared) private(j) schedule(static,1) for( j = 0; j < n; j += nb ) // copy_matrix(m,n,b,ldb,a,lda) copies matrix b:m-by-n // with leading dimension ldb into matrix a:m-by-n with // leading dimension lda copy_matrix( m, min( nb, n-j ), &b[j*ldb], ldb, &a[j*lda], lda );
At last, a matrix can remain distributed as a result of the previous multi-threaded function, which access different parts of the matrix from different threads.