# Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

## Introduction

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple, it could be easily implemented in any programming language, and its performance significantly improves when different optimization techniques are applied.

Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)7. Tests are completed on a computer system with Intel® Xeon Phi™ processor 72105 running the Linux Red Hat* operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*6.

The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.

If needed, the reader may review the contents of References1 or2 for a description of mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.

## An Overview of the Classic Matrix Multiplication Algorithm

A fundamental property of any algorithm is its asymptotic complexity (AC)3.

In generic form, AC for MMA can be expressed as follows:

MMA AC = O(N^Omega)

where O stands for operation on a data element, also known in computer science as a Big O; N is one dimension of the matrix, and omega is a matrix exponent which equals 3.0 for CMMA. That is:

CMMA AC = O(N^3)

In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.

An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.

A list of several MMAs with different values of omega is as follows:

AlgorithmOmegaNote
Francois Le Gall2.3728639(1)
Virginia Vassilevska Williams2.3728642
Stothers2.3740000
Bini2.7790000
Pan2.7950000
Strassen2.8070000(2)
Classic3.0000000(3)

Table 1. Algorithms are sorted by omega in ascending order.

## Total Number of Floating Point Operations

Let's assume that:

M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]

There are three relations between M, N and P:

Relation #1: A[...,N] = B[N,...]
Relation #2: A[M,...] = C[M,...]
Relation #3: B[...,P] = C[...,P]

If one of these three relations is not met, the product of two matrices cannot be computed.

In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:

A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]

The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).

NOmega = 2.3728639 (1)Omega = 2.807 (2)Omega = 3.0 (3)
128100,028822,1262,097,152
256518,1145,753,46616,777,216
5122,683,66840,264,358134,217,728
102413,900,553281,781,1761,073,741,824
204872,000,4651,971,983,0428,589,934,592
4096372,939,61113,800,485,78068,719,476,736
81921,931,709,09196,579,637,673549,755,813,888
1638410,005,641,390675,891,165,0934,398,046,511,104
3276851,826,053,9654,730,074,351,66235,184,372,088,832
65536268,442,548,03433,102,375,837,652281,474,976,710,656

Table 2.

For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.

Imagine the case of the product of two square matrices where N equals 32,768 needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:

~35,184,372,088,832 / ~51,826,053,965 = ~678.9

In the case of using a famous Strassen (2) MMA, ~91 days would be needed:

~4,730,074,351,662 / ~51,826,053,965 = ~91.3

In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.

With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.

However, an estimate value of the TN of FP operations, multiplications and additions, for CMMA when square matrices are used can be easily calculated. Here are two simple examples:

Example 1: N = 2

```	Multiplications	= 8				// 2 * 2 * 2 = 2^3
Additions	= 4				// 2 * 2 * 1 = 2^2*(2-1)
TN FP Ops	= 8 + 4 = 12```

Example 2: N = 3

```	Multiplications	= 27				// 3 * 3 * 3 = 3^3
Additions	= 18				// 3 * 3 * 2 = 3^2*(3-1)
TN FP Ops	= 27 + 18 = 45```

It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:

TN FP Ops = (N^3) + ((N^2) * (N-1))

Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.

## Implementation Complexity

In the C programming language only four lines of code are needed to implement a core part of the CMMA:

```for( i = 0; i < N; i += 1 )
for( j = 0; j < N; j += 1 )
for( k = 0; k < N; k += 1 )
C[i][j] += A[i][k] * B[k][j];```

Therefore, CMMA's implementation complexity (IC) could be rated as very simple.

Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.

More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.

## Optimization Techniques

In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.

Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):

```	float fA[N*N];
float fB[N*N];
float fC[N*N];```

and this is what a core part of the CMMA looks like:

```	for( i = 0; i < N; i += 1 )
for( j = 0; j < N; j += 1 )
for( k = 0; k < N; k += 1 )
C[N*i+j] += A[N*i+k] * B[N*k+j];```

Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):

```	float fA[N][N];
float fB[N][N];
float fC[N][N];```

and this is what the core part of CMMA looks like:

```	for( i = 0; i < N; i += 1 )
for( j = 0; j < N; j += 1 )
for( k = 0; k < N; k += 1 )
C[i][j] += A[i][k] * B[k][j];```

Many other variants of the core part of CMMA are possible and they will be reviewed.

## Memory Allocation Schemes

In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:

Declaration of matrices A, B, and C as 1-D data structures:

```	__attribute__( ( aligned( 64 ) ) ) float *fA;
__attribute__( ( aligned( 64 ) ) ) float *fB;
__attribute__( ( aligned( 64 ) ) ) float *fC;```

and this is how memory needs to be allocated:

```	fA = ( float * )_mm_malloc( N * sizeof( float ), 64 );
fB = ( float * )_mm_malloc( N * sizeof( float ), 64 );
fC = ( float * )_mm_malloc( N * sizeof( float ), 64 );```

Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.

Declaration of matrices A, B, and C as 2-D data structures:

```	__attribute__( ( aligned( 64 ) ) ) float **fA;
__attribute__( ( aligned( 64 ) ) ) float **fB;
__attribute__( ( aligned( 64 ) ) ) float **fC;```

and this is how memory needs to be allocated:

```	fA = ( float ** )calloc( N, sizeof( float * ) );
fB = ( float ** )calloc( N, sizeof( float * ) );
fC = ( float ** )calloc( N, sizeof( float * ) );
for( i = 0; i < N; i += 1 )
{
fA[i] = ( float * )calloc( N, sizeof( float ) );
fB[i] = ( float * )calloc( N, sizeof( float ) );
fC[i] = ( float * )calloc( N, sizeof( float ) );
}```

Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.

In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system5 a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library11 when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:

```	fA = ( float * )hbw_malloc( N * sizeof( float ) );
fB = ( float * )hbw_malloc( N * sizeof( float ) );
fC = ( float * )hbw_malloc( N * sizeof( float ) );```

Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.

On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:

Matrix AMatrix BMatrix CNote
DDR4DDR4DDR4(1)
DDR4DDR4MCDRAM(2)
DDR4MCDRAMDDR4
DDR4MCDRAMMCDRAM
MCDRAMDDR4DDR4
MCDRAMDDR4MCDRAM
MCDRAMMCDRAMDDR4
MCDRAMMCDRAMMCDRAM

Table 3.

It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory5.

Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):

```	Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)
Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 50.918 seconds```
```	Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 47.385 seconds```

It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.

## Loop Processing Schemes

A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:

```	for( i = 0; i < N; i += 1 )						// loop 1
for( j = 0; j < N; j += 1 )					// loop 2
for( k = 0; k < N; k += 1 )				// loop 3
C[i][j] += A[i][k] * B[k][j];```

corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.

Table 4 below includes short descriptions of different LPSs:

LPSNote
1:1:1Loops not unrolled
1:1:23rd loop unrolls to 2-in-1 computations
1:1:43rd loop unrolls to 4-in-1 computations
1:1:83rd loop unrolls to 8-in-1 computations
1:2:12nd loop unrolls to 2-in-1 computations
1:4:12nd loop unrolls to 4-in-1 computations
1:8:12nd loop unrolls to 8-in-1 computations

Table 4.

For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:

```	for( i = 0; i < N; i += 1 )						// :1
{
for( j = 0; j < N; j += 1 )					// :1
{
for( k = 0; k < N; k += 2 )				// :2 (unrolled loop)
{
C[i][j] += A[i][k  ] * B[k   ][j];
C[i][j] += A[i][k+1] * B[k+1][j];
}
}
}```

Note: A C++ compiler could unroll loops as well if command line-options for unrolling are used. A software engineer should prevent such cases when source code unrolling is used at the same time, because it prevents vectorization of inner loops, and degrades performance of processing.

Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:

```	for( i = 0; i < N; i += 1 )						// loop 1
for( k = 0; k < N; k += 1 )					// loop 2
for( j = 0; j < N; j += 1 )				// loop 3
C[i][j] += A[i][k] * B[k][j];```

It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.

The loops unrolling and LIOT allow for improved performance of processing because elements of matrices A and B are accessed more efficiently.

## Compute Schemes

A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.

In a CMMA an element (i,j) of the matrix C is calculated as follows:

`	C[i][j] += A[i][k] * B[k][j]`

and its CS is ij:ik:kj.

However, elements of matrix B are accessed in a very inefficient way. That is, the next element of matrix B, which needs to be used in the calculation, is located at a distance of (N * sizeof (datatype)) bytes. For very small matrices this is not critical because they can fit into CPU caches. However, for larger matrices it affects performance of computations, which can be significantly degraded, due to cache misses.

In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.

In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:

`	C[i][j] += A[i][k] * B[j][k]`

and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].

It is very important to use the fastest possible algorithm for the matrix B transposition before processing is started. In Appendix B an example is given on how much time is needed to transpose a square matrix of 32,768 elements, and how much time is needed to compute the product on an Intel Xeon Phi processor system.

Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:

```	for( i = 0; i < N; i += BlockSize )
{
for( j = 0; j < N; j += BlockSize )
{
for( k = 0; k < N; k += BlockSize )
{
for( ii = i; ii < ( i+BlockSize ); ii += 1 )
for( jj = j; jj < ( j+BlockSize ); jj += 1 )
for( kk = k; kk < ( k+BlockSize ); kk += 1 )
C[ii][jj] += A[ii][kk] * B[kk][jj];
}
}
}```

Note: A detailed description of LBOT can be found at10.

Table 5 shows four examples of CSs:

CSNote
ij:ik:kjDefault
ij:ik:jkTransposed
iijj:iikk:kkjjDefault LBOT
iijj:iikk:jjkkTransposed LBOT

Table 5.

## Error Analysis

In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision4, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.

However, a very simple optimization technique allows improvement in the accuracy of computations.

If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:

```	for( i = 0; i < N; i += 1 )
{
for( j = 0; j < N; j += 1 )
{
double sum = 0.0;
for( k = 0; k < N; k += 1 )
{
sum += ( double )( A[i][k] * B[k][j] );
}
C[i][j] = sum;
}
}```

The accuracy of computations will be improved, but performance of processing can be lower.

An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).

## Performance on the Intel® Xeon Phi™ Processor System

Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7. Also see Appendix D for more evaluations.

Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.

Here are the names of source files with a short description of tests:

mmatest1.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification matrix multiply algorithms accuracy of computations on an Intel Xeon Phi processor.

## OpenMP* Product Thread Affinity Control

OpenMP* product compiler directives can be easily used to parallelize processing and significantly speed up processing. However, it is very important to execute OpenMP threads on different logical CPUs of modern multicore processors in order to utilize their internal resources as best as possible.

In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:

```	KMP_AFFINITY = scatter
KMP_AFFINITY = balanced
KMP_AFFINITY = compact```

These two screenshots of the Htop* utility12 demonstrate how OpenMP threads are assigned (pinned) to Intel Xeon Phi processor 72105 logical CPUs during processing of an MMA using 64 cores of the processor:

Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.

Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.

## Recommended Intel® C++ Compiler Command-Line Options

Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:

O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans

Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch

## Conclusion

Application of different optimization techniques to the CMMA were reviewed in this article.

Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7.

Tests were completed on a computer system with an Intel® Xeon Phi processor 72105 running the Linux Red Hat operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.

It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.

Advantages of using CMMA to compute the product of two matrices are as follows:

• In any programming language, simple to implement to run on CPUs or GPUs9
• Highly portable source codes when implemented in C, C++, or Java programming languages
• Simple to integrate with existing software for a wide range of computer platforms
• Simple to debug and troubleshoot
• Predictable memory footprint at run time
• Easy to optimize using parallelization and vectorization techniques
• Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
• Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)

Disadvantages of using CMMA to compute a product of two matrices are as follows:

• Poor performance for large matrices with sizes greater than 2048 x 2048
• Poor performance when implemented using high-level programming languages due to processing overheads
• Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)

## References

1. Matrix Multiplication on Mathworld

http://mathworld.wolfram.com/MatrixMultiplication.html

2. Matrix Multiplication on Wikipedia

https://en.wikipedia.org/wiki/Matrix_multiplication

3. Asymptotic Complexity of an Algorithm

https://en.wikipedia.org/wiki/Time_complexity

4. The IEEE 754 Standard for Floating Point Arithmetic

http://standards.ieee.org/

5. Intel® Many Integrated Core Architecture

6. Intel® C++ Compiler

7. Intel® MKL

8. Intel® Developer Zone Forums

https://software.intel.com/en-us/forum

9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9

https://software.intel.com/en-us/articles/sgemm-ocl-opt

10. Performance Tools for Software Developers Loop Blocking

https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking

11. Memkind library

https://github.com/memkind/memkind

12. Htop* monitoring utility

https://sourceforge.net/projects/htop

Performance_CMMA_system.zip

List of all files (sources, test reports, and so on):

Performance_CMMA_system.pdf - Copy of this paper.

mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.

dataset1.txt - Results of tests.

mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.

dataset2.txt - Results of tests.

mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRA MASs.

dataset3.txt - Results of tests.

mmatest4.c - Verification of matrix multiply algorithms accuracy of computations on Intel® Xeon Phi processors.

dataset4.txt - Results of tests.

Note:   Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*

### Abbreviations

CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
FP - Floating point
FPA - Floating point arithmetic4
SP - Single precision4
DP - Double precision4
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler6
MKL - Math kernel library7
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone8
IEEE - Institute of Electrical and Electronics Engineers4
GB - Gigabytes
TN - Total number

## Appendix

### Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System

Summary of the Intel Xeon Phi processor system used for testing:

Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)

Detailed processor specifications:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

Summary of a Linux operating system:

[guest@... ~]\$ uname -a

Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[guest@... ~]\$ cat /proc/version

Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016

### Appendix B - Comparison of Processing Times for MMAs versus MTA

Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:

[Intel MKL & CMMA]

Matrix A [32768 x 32768] Matrix B [32768 x 32768]
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds

[MTA]

Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs

When compared processing time of MTA to:
MKL 'cblas_sgemm'. the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.

### Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)

NMMACalculated SP ValueAbsolute Error
8MKL8.0000800.000000
8CMMA8.0000800.000000
16MKL16.0001600.000000
32CMMA16.0001600.000000
32MKL32.000309-0.000011
32CMMA32.0003200.000000
64MKL64.0006710.000031
128CMMA64.0006410.000001
128MKL128.001160-0.000120
128CMMA128.0012820.000002
256MKL256.002319-0.000241
512CMMA256.0025630.000003
512MKL512.004639-0.000481
512CMMA512.005005-0.000115
1024MKL1024.009521-0.000719
2048CMMA1024.009888-0.000352
2048MKL2048.019043-0.001437
2048CMMA2048.0214840.001004
4096MKL4096.038574-0.002386
8192CMMA4096.037109-0.003851
8192MKL8192.074219-0.007701
8192CMMA8192.0996090.017689
16384MKL16384.14648-0.017356
32768CMMA16384.09961-0.064231
32768MKL32768.335940.008258
32768CMMA32768.10156-0.226118
65536MKL65536.718750.063390
65536CMMA65536.10156-0.553798

Table 6.

### Appendix D - Performance of MMAs for Different MASs

Figure 2. Performance of Intel® MKL 'cbals_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.

Figure 3. Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 4. Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.

Figure 5. Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.

For more complete information about compiler optimizations, see our Optimization Notice.

Top

I have been attempting to reproduce the benchmarks as provided in the code. One observation I have is that there is a considerable warm-up time which leads to big overhead on the first algorithm being benchmarked. (In this case, the cblas_sgemm function.)

16 loop counts are often not enough to offset the thread 'warm-up' time. I am not sure what the correct terminology for this would be.

Can anyone confirm this? When benchmarking, is it better to give a 'warm-up' kernel to the threads?

For matrix multiply on KNL it would be essential to test multiple hardware threads per core.

KMP_HW_SUBSET=64C,2T

KMP_HW_SUBSET=64C,4T