Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

Introduction

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple: it can be easily implemented in any programming language, and its performance improves significantly when different optimization techniques are applied.

Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)7. Tests are completed on a computer system with Intel® Xeon Phi™ processor 72105 running the Linux Red Hat* operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*6.

The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.

If needed, the reader may review References 1 and 2 for a description of the mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.

An Overview of the Classic Matrix Multiplication Algorithm

A fundamental property of any algorithm is its asymptotic complexity (AC)3.

In generic form, AC for MMA can be expressed as follows:

MMA AC = O(N^Omega)

where O denotes the order of growth, known in computer science as Big O notation; N is one dimension of the matrix; and omega is the matrix exponent, which equals 3.0 for CMMA. That is:

CMMA AC = O(N^3)

In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.

An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.

A list of several MMAs with different values of omega is as follows:

Algorithm                        Omega        Note
Francois Le Gall                 2.3728639    (1)
Virginia Vassilevska Williams    2.3728642
Stothers                         2.3740000
Bini                             2.7790000
Pan                              2.7950000
Strassen                         2.8070000    (2)
Classic                          3.0000000    (3)

Table 1. Algorithms are sorted by omega in ascending order.

Total Number of Floating Point Operations

Let's assume that:

M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]

There are three relations between M, N and P:

Relation #1: A[...,N] = B[N,...]
Relation #2: A[M,...] = C[M,...]
Relation #3: B[...,P] = C[...,P]

If one of these three relations is not met, the product of two matrices cannot be computed.
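The three relations above can be expressed as a trivial check in C. This is only an illustrative sketch; 'mm_dims_ok' is a hypothetical helper name, not part of the test programs described later:

```c
/* Returns 1 when A[am,an] * B[bn,bp] = C[cm,cp] is well formed,
   i.e., when the three relations above hold; 0 otherwise. */
int mm_dims_ok( int am, int an, int bn, int bp, int cm, int cp )
{
    return ( an == bn ) &&      /* Relation #1: A[...,N] = B[N,...] */
           ( am == cm ) &&      /* Relation #2: A[M,...] = C[M,...] */
           ( bp == cp );        /* Relation #3: B[...,P] = C[...,P] */
}
```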

In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:

A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]

The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).

N        Omega = 2.3728639 (1)    Omega = 2.807 (2)        Omega = 3.0 (3)
128      100,028                  822,126                  2,097,152
256      518,114                  5,753,466                16,777,216
512      2,683,668                40,264,358               134,217,728
1024     13,900,553               281,781,176              1,073,741,824
2048     72,000,465               1,971,983,042            8,589,934,592
4096     372,939,611              13,800,485,780           68,719,476,736
8192     1,931,709,091            96,579,637,673           549,755,813,888
16384    10,005,641,390           675,891,165,093          4,398,046,511,104
32768    51,826,053,965           4,730,074,351,662        35,184,372,088,832
65536    268,442,548,034          33,102,375,837,652       281,474,976,710,656

Table 2. Number of multiplications needed to compute a product of two square matrices of dimension N for three values of omega.

For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.

Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:

~35,184,372,088,832 / ~51,826,053,965 = ~678.9

In the case of using a famous Strassen (2) MMA, ~91 days would be needed:

~4,730,074,351,662 / ~51,826,053,965 = ~91.3

In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.

With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.

However, an estimate value of the TN of FP operations, multiplications and additions, for CMMA when square matrices are used can be easily calculated. Here are two simple examples:

Example 1: N = 2

Multiplications = 8             // 2 * 2 * 2 = 2^3
Additions       = 4             // 2 * 2 * 1 = 2^2 * (2-1)
TN FP Ops       = 8 + 4 = 12

Example 2: N = 3

Multiplications = 27            // 3 * 3 * 3 = 3^3
Additions       = 18            // 3 * 3 * 2 = 3^2 * (3-1)
TN FP Ops       = 27 + 18 = 45

It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:

TN FP Ops = (N^3) + ((N^2) * (N-1))
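The formula can be checked with a small helper; 'cmma_tn_fp_ops' is a hypothetical name used here only to illustrate the calculation:

```c
/* Total number of FP operations (multiplications plus additions) for
   CMMA on two square matrices of dimension n: n^3 + n^2 * (n - 1). */
unsigned long long cmma_tn_fp_ops( unsigned long long n )
{
    return n * n * n + n * n * ( n - 1ULL );
}
```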

Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.

Implementation Complexity

In the C programming language only four lines of code are needed to implement a core part of the CMMA:

for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];

Therefore, CMMA's implementation complexity (IC) could be rated as very simple.

Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.

More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.
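As a hedged illustration, the four-line core can be wrapped into a complete, compilable function for a fixed dimension of 2. 'cmma_2x2' is a hypothetical name; note that matrix C must be zero initialized before the loops run, a detail the four-line core assumes:

```c
/* Minimal runnable sketch of the four-line CMMA core for 2 x 2 matrices. */
void cmma_2x2( float A[2][2], float B[2][2], float C[2][2] )
{
    int i, j, k;
    for( i = 0; i < 2; i += 1 )
        for( j = 0; j < 2; j += 1 )
        {
            C[i][j] = 0.0f;                 /* C must start at zero */
            for( k = 0; k < 2; k += 1 )
                C[i][j] += A[i][k] * B[k][j];
        }
}
```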

Optimization Techniques

In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.

Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):

float fA[N*N];
float fB[N*N];
float fC[N*N];

and this is what a core part of the CMMA looks like:

for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[N*i+j] += A[N*i+k] * B[N*k+j];

Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):

float fA[N][N];
float fB[N][N];
float fC[N][N];

and this is what the core part of CMMA looks like:

for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];

Many other variants of the core part of CMMA are possible and they will be reviewed.

Memory Allocation Schemes

In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:

Declaration of matrices A, B, and C as 1-D data structures:

__attribute__( ( aligned( 64 ) ) ) float *fA;
__attribute__( ( aligned( 64 ) ) ) float *fB;
__attribute__( ( aligned( 64 ) ) ) float *fC;

and this is how memory needs to be allocated:

fA = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fB = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );
fC = ( float * )_mm_malloc( N * N * sizeof( float ), 64 );

Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.

Declaration of matrices A, B, and C as 2-D data structures:

__attribute__( ( aligned( 64 ) ) ) float **fA;
__attribute__( ( aligned( 64 ) ) ) float **fB;
__attribute__( ( aligned( 64 ) ) ) float **fC;

and this is how memory needs to be allocated:

fA = ( float ** )calloc( N, sizeof( float * ) );
fB = ( float ** )calloc( N, sizeof( float * ) );
fC = ( float ** )calloc( N, sizeof( float * ) );
for( i = 0; i < N; i += 1 )
{
    fA[i] = ( float * )calloc( N, sizeof( float ) );
    fB[i] = ( float * )calloc( N, sizeof( float ) );
    fC[i] = ( float * )calloc( N, sizeof( float ) );
}

Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.
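One common workaround, shown here only as a sketch, is to allocate a single contiguous block for all N*N elements plus a separate array of row pointers: the convenient 2-D indexing (m[i][j]) is kept while the data itself stays in one unbroken block. 'alloc_matrix_2d' and 'free_matrix_2d' are hypothetical helpers, not part of the test programs:

```c
#include <stdlib.h>

/* Allocate an n x n matrix as one contiguous data block + row pointers. */
float ** alloc_matrix_2d( int n )
{
    float **rows = ( float ** )malloc( ( size_t )n * sizeof( float * ) );
    float  *data = ( float *  )calloc( ( size_t )n * n, sizeof( float ) );
    int i;

    if( rows == NULL || data == NULL )
    {
        free( rows );
        free( data );
        return NULL;
    }
    for( i = 0; i < n; i += 1 )
        rows[i] = data + ( size_t )i * n;   /* row i starts at offset i*n */
    return rows;
}

void free_matrix_2d( float **m )
{
    if( m != NULL )
    {
        free( m[0] );                       /* the contiguous data block  */
        free( m );                          /* the row-pointer array      */
    }
}
```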

In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system5 a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library11 when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:

fA = ( float * )hbw_malloc( N * N * sizeof( float ) );
fB = ( float * )hbw_malloc( N * N * sizeof( float ) );
fC = ( float * )hbw_malloc( N * N * sizeof( float ) );

Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.

On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:

Matrix A    Matrix B    Matrix C    Note
DDR4        DDR4        DDR4        (1)
DDR4        DDR4        MCDRAM      (2)
DDR4        MCDRAM      DDR4
DDR4        MCDRAM      MCDRAM
MCDRAM      DDR4        DDR4
MCDRAM      DDR4        MCDRAM
MCDRAM      MCDRAM      DDR4
MCDRAM      MCDRAM      MCDRAM

Table 3.

It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory5.

Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):

Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 50.918 seconds
Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 47.385 seconds

It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.

Loop Processing Schemes

A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:

for( i = 0; i < N; i += 1 )                     // loop 1
    for( j = 0; j < N; j += 1 )                 // loop 2
        for( k = 0; k < N; k += 1 )             // loop 3
            C[i][j] += A[i][k] * B[k][j];

corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.

Table 4 below includes short descriptions of different LPSs:

LPS      Note
1:1:1    Loops not unrolled
1:1:2    3rd loop unrolled to 2-in-1 computations
1:1:4    3rd loop unrolled to 4-in-1 computations
1:1:8    3rd loop unrolled to 8-in-1 computations
1:2:1    2nd loop unrolled to 2-in-1 computations
1:4:1    2nd loop unrolled to 4-in-1 computations
1:8:1    2nd loop unrolled to 8-in-1 computations

Table 4.

For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:

for( i = 0; i < N; i += 1 )                     // :1
{
    for( j = 0; j < N; j += 1 )                 // :1
    {
        for( k = 0; k < N; k += 2 )             // :2 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
        }
    }
}
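The unrolled loop above assumes that N is a multiple of the unroll factor; when it is not, a remainder step is needed. Here is a hedged sketch of the same 2-in-1 unrolling written as a self-contained inner-product helper ('dot_unroll2' is a hypothetical name, not from the test programs):

```c
/* 2-in-1 unrolled accumulation over k, with a remainder step for odd n. */
float dot_unroll2( const float *a, const float *b, int n )
{
    float sum = 0.0f;
    int k;

    for( k = 0; k < n - 1; k += 2 )             /* 2-in-1 computations   */
        sum += a[k] * b[k] + a[k+1] * b[k+1];
    if( k < n )                                 /* remainder when n odd  */
        sum += a[k] * b[k];
    return sum;
}
```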

Note: A C++ compiler can also unroll loops when the corresponding command-line options are used. A software engineer should avoid combining compiler unrolling with manual unrolling in the source code, because the combination can prevent vectorization of inner loops and degrade performance of processing.

Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:

for( i = 0; i < N; i += 1 )                     // loop 1
    for( k = 0; k < N; k += 1 )                 // loop 2
        for( j = 0; j < N; j += 1 )             // loop 3
            C[i][j] += A[i][k] * B[k][j];

It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.

Loop unrolling and the LIOT improve performance of processing because elements of matrices A and B are accessed more efficiently.

Compute Schemes

A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.

In a CMMA an element (i,j) of the matrix C is calculated as follows:

C[i][j] += A[i][k] * B[k][j]

and its CS is ij:ik:kj.

However, elements of matrix B are accessed in a very inefficient way: the next element of matrix B needed in the calculation is located at a distance of (N * sizeof( datatype )) bytes. For very small matrices this is not critical because they fit into CPU caches, but for larger matrices performance can degrade significantly due to cache misses.

In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.

In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:

C[i][j] += A[i][k] * B[j][k]

and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].

It is very important to use the fastest possible algorithm to transpose matrix B before processing is started. Appendix B gives an example of how much time is needed to transpose a square matrix of dimension 32,768, and how much time is needed to compute the product, on an Intel Xeon Phi processor system.
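A straightforward out-of-place transpose for a square matrix stored as a 1-D array can be sketched as follows; a blocked, vectorized transpose would be faster, and this only shows the data movement ('transpose_sq' is a hypothetical name):

```c
#include <stddef.h>

/* Out-of-place transpose of an n x n matrix stored in row-major order. */
void transpose_sq( const float *src, float *dst, int n )
{
    int i, j;
    for( i = 0; i < n; i += 1 )
        for( j = 0; j < n; j += 1 )
            dst[ ( size_t )j * n + i ] = src[ ( size_t )i * n + j ];
}
```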

Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:

for( i = 0; i < N; i += BlockSize )
{
    for( j = 0; j < N; j += BlockSize )
    {
        for( k = 0; k < N; k += BlockSize )
        {
            for( ii = i; ii < ( i+BlockSize ); ii += 1 )
                for( jj = j; jj < ( j+BlockSize ); jj += 1 )
                    for( kk = k; kk < ( k+BlockSize ); kk += 1 )
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
        }
    }
}

Note: A detailed description of LBOT can be found at10. The code above assumes that N is a multiple of BlockSize; otherwise, the bounds of the inner loops must be clamped to N.

Table 5 shows four examples of CSs:

CS                Note
ij:ik:kj          Default
ij:ik:jk          Transposed
iijj:iikk:kkjj    Default LBOT
iijj:iikk:jjkk    Transposed LBOT

Table 5.

Error Analysis

In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision4, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.

However, a very simple optimization technique allows improvement in the accuracy of computations.

If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:

for( i = 0; i < N; i += 1 )
{
    for( j = 0; j < N; j += 1 )
    {
        double sum = 0.0;
        for( k = 0; k < N; k += 1 )
        {
            sum += ( double )A[i][k] * B[k][j];     // promoting one operand makes the multiply itself DP
        }
        C[i][j] = sum;
    }
}

The accuracy of computations will be improved, but performance of processing can be lower.

An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).

Performance on the Intel® Xeon Phi™ Processor System

Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7. Also see Appendix D for more evaluations.

Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.

Here are the names of source files with a short description of tests:

mmatest1.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests of matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on an Intel Xeon Phi processor.

OpenMP* Product Thread Affinity Control

OpenMP* product compiler directives can easily be used to parallelize processing and significantly speed it up. However, it is very important to distribute OpenMP threads across the logical CPUs of modern multicore processors so that their internal resources are utilized as fully as possible.

In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:

export KMP_AFFINITY=scatter
export KMP_AFFINITY=balanced
export KMP_AFFINITY=compact

These two screenshots of the Htop* utility12 demonstrate how OpenMP threads are assigned (pinned) to logical CPUs of the Intel Xeon Phi processor 72105 during processing of an MMA using 64 cores of the processor:

Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.

Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.

Recommended Intel® C++ Compiler Command-Line Options

Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:

-O3
-Os
-fp-model
-parallel
-qopenmp
-ansi-alias
-fma
-unroll
-unroll-aggressive
-qopt-matmul
-qopt-block-factor
-qopt-streaming-stores
-qopt-mem-layout-trans
-qopt-prefetch

The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.

Conclusion

The application of different optimization techniques to the CMMA was reviewed in this article.

Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7.

Tests were completed on a computer system with an Intel® Xeon Phi processor 72105 running the Linux Red Hat operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.

It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.

Advantages of using CMMA to compute the product of two matrices are as follows:

• In any programming language, simple to implement to run on CPUs or GPUs9
• Highly portable source codes when implemented in C, C++, or Java programming languages
• Simple to integrate with existing software for a wide range of computer platforms
• Simple to debug and troubleshoot
• Predictable memory footprint at run time
• Easy to optimize using parallelization and vectorization techniques
• Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
• Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)

Disadvantages of using CMMA to compute a product of two matrices are as follows:

• Poor performance for large matrices with sizes greater than 2,048 x 2,048
• Poor performance when implemented using high-level programming languages due to processing overheads
• Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)

References
