# Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System

Por Sergey Kostrov, publicado em 14 de abril de 2017

## Introduction

Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple, it could be easily implemented in any programming language, and its performance significantly improves when different optimization techniques are applied.

Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)^{7}. Tests are completed on a computer system with Intel® Xeon Phi™ processor 7210^{5} running the Linux Red Hat* operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*^{6}.

The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.

If needed, the reader may review the contents of References^{1} or^{2} for a description of mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.

## An Overview of the Classic Matrix Multiplication Algorithm

A fundamental property of any algorithm is its asymptotic complexity (AC)^{3}.

In generic form, AC for MMA can be expressed as follows:

MMA AC = O(N^Omega)

where O stands for operation on a data element, also known in computer science as a Big O; N is one dimension of the matrix, and omega is a matrix exponent which equals 3.0 for CMMA. That is:

CMMA AC = O(N^3)

In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.

An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.

A list of several MMAs with different values of omega is as follows:

Algorithm | Omega | Note |
---|---|---|

Francois Le Gall | 2.3728639 | (1) |

Virginia Vassilevska Williams | 2.3728642 | |

Stothers | 2.3740000 | |

Coppersmith-Winograd | 2.3760000 | |

Bini | 2.7790000 | |

Pan | 2.7950000 | |

Strassen | 2.8070000 | (2) |

Strassen-Winograd | 2.8070000 | |

Classic | 3.0000000 | (3) |

**Table 1.** Algorithms are sorted by omega in ascending order.

## Total Number of Floating Point Operations

Let's assume that:

M x N is a dimension of a matrix A, or A[M,N]

N x P is a dimension of a matrix B, or B[N,P]

M x P is a dimension of a matrix C, or C[M,P]

There are three relations between M, N and P:

Relation #1: A[...,N] = B[N,...]

Relation #2: A[M,...] = C[M,...]

Relation #3: B[...,P] = C[...,P]

If one of these three relations is not met, the product of two matrices cannot be computed.

In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:

A[N,N] is the same as A[M,N]

B[N,N] is the same as B[N,P]

C[N,N] is the same as C[M,P]

The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).

N | Omega = 2.3728639 (1) | Omega = 2.807 (2) | Omega = 3.0 (3) |
---|---|---|---|

128 | 100,028 | 822,126 | 2,097,152 |

256 | 518,114 | 5,753,466 | 16,777,216 |

512 | 2,683,668 | 40,264,358 | 134,217,728 |

1024 | 13,900,553 | 281,781,176 | 1,073,741,824 |

2048 | 72,000,465 | 1,971,983,042 | 8,589,934,592 |

4096 | 372,939,611 | 13,800,485,780 | 68,719,476,736 |

8192 | 1,931,709,091 | 96,579,637,673 | 549,755,813,888 |

16384 | 10,005,641,390 | 675,891,165,093 | 4,398,046,511,104 |

32768 | 51,826,053,965 | 4,730,074,351,662 | 35,184,372,088,832 |

65536 | 268,442,548,034 | 33,102,375,837,652 | 281,474,976,710,656 |

**Table 2.**

For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.

Imagine the case of the product of two square matrices where *N equals 32,768* needs to be computed on a very slow computer system. It means that if the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days on the same computer system, or almost two years. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute a product:

~35,184,372,088,832 / ~51,826,053,965 = ~678.9

In the case of using a famous Strassen (2) MMA, ~91 days would be needed:

~4,730,074,351,662 / ~51,826,053,965 = ~91.3

In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.

With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.

However, an estimate value of the TN of FP operations, multiplications and additions, for CMMA when square matrices are used can be easily calculated. Here are two simple examples:

Example 1: N = 2

Multiplications = 8 // 2 * 2 * 2 = 2^3 Additions = 4 // 2 * 2 * 1 = 2^2*(2-1) TN FP Ops = 8 + 4 = 12

Example 2: N = 3

Multiplications = 27 // 3 * 3 * 3 = 3^3 Additions = 18 // 3 * 3 * 2 = 3^2*(3-1) TN FP Ops = 27 + 18 = 45

It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:

TN FP Ops = (N^3) + ((N^2) * (N-1))

Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.

## Implementation Complexity

In the C programming language only four lines of code are needed to implement a core part of the CMMA:

for( i = 0; i < N; i += 1 ) for( j = 0; j < N; j += 1 ) for( k = 0; k < N; k += 1 ) C[i][j] += A[i][k] * B[k][j];

Therefore, CMMA's implementation complexity (IC) could be rated as very simple.

Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.

More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.

## Optimization Techniques

In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.

Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):

float fA[N*N]; float fB[N*N]; float fC[N*N];

and this is what a core part of the CMMA looks like:

for( i = 0; i < N; i += 1 ) for( j = 0; j < N; j += 1 ) for( k = 0; k < N; k += 1 ) C[N*i+j] += A[N*i+k] * B[N*k+j];

Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):

float fA[N][N]; float fB[N][N]; float fC[N][N];

and this is what the core part of CMMA looks like:

for( i = 0; i < N; i += 1 ) for( j = 0; j < N; j += 1 ) for( k = 0; k < N; k += 1 ) C[i][j] += A[i][k] * B[k][j];

Many other variants of the core part of CMMA are possible and they will be reviewed.

## Memory Allocation Schemes

In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:

Declaration of matrices A, B, and C as 1-D data structures:

__attribute__( ( aligned( 64 ) ) ) float *fA; __attribute__( ( aligned( 64 ) ) ) float *fB; __attribute__( ( aligned( 64 ) ) ) float *fC;

and this is how memory needs to be allocated:

fA = ( float * )_mm_malloc( N * sizeof( float ), 64 ); fB = ( float * )_mm_malloc( N * sizeof( float ), 64 ); fC = ( float * )_mm_malloc( N * sizeof( float ), 64 );

Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.

Declaration of matrices A, B, and C as 2-D data structures:

__attribute__( ( aligned( 64 ) ) ) float **fA; __attribute__( ( aligned( 64 ) ) ) float **fB; __attribute__( ( aligned( 64 ) ) ) float **fC;

and this is how memory needs to be allocated:

fA = ( float ** )calloc( N, sizeof( float * ) ); fB = ( float ** )calloc( N, sizeof( float * ) ); fC = ( float ** )calloc( N, sizeof( float * ) ); for( i = 0; i < N; i += 1 ) { fA[i] = ( float * )calloc( N, sizeof( float ) ); fB[i] = ( float * )calloc( N, sizeof( float ) ); fC[i] = ( float * )calloc( N, sizeof( float ) ); }

Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.

In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system^{5} a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library^{11} when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:

fA = ( float * )hbw_malloc( N * sizeof( float ) ); fB = ( float * )hbw_malloc( N * sizeof( float ) ); fC = ( float * )hbw_malloc( N * sizeof( float ) );

Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.

On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:

Matrix A | Matrix B | Matrix C | Note |
---|---|---|---|

DDR4 | DDR4 | DDR4 | (1) |

DDR4 | DDR4 | MCDRAM | (2) |

DDR4 | MCDRAM | DDR4 | |

DDR4 | MCDRAM | MCDRAM | |

MCDRAM | DDR4 | DDR4 | |

MCDRAM | DDR4 | MCDRAM | |

MCDRAM | MCDRAM | DDR4 | |

MCDRAM | MCDRAM | MCDRAM |

**Table 3.**

It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory^{5}.

Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):

Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768) Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4 Initializing matrix data Matrix multiplication started Matrix multiplication completed at 50.918 seconds

Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM Initializing matrix data Matrix multiplication started Matrix multiplication completed at 47.385 seconds

It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.

## Loop Processing Schemes

A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:

for( i = 0; i < N; i += 1 ) // loop 1 for( j = 0; j < N; j += 1 ) // loop 2 for( k = 0; k < N; k += 1 ) // loop 3 C[i][j] += A[i][k] * B[k][j];

corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.

Table 4 below includes short descriptions of different LPSs:

LPS | Note |
---|---|

1:1:1 | Loops not unrolled |

1:1:2 | 3rd loop unrolls to 2-in-1 computations |

1:1:4 | 3rd loop unrolls to 4-in-1 computations |

1:1:8 | 3rd loop unrolls to 8-in-1 computations |

1:2:1 | 2nd loop unrolls to 2-in-1 computations |

1:4:1 | 2nd loop unrolls to 4-in-1 computations |

1:8:1 | 2nd loop unrolls to 8-in-1 computations |

**Table 4.**

For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:

for( i = 0; i < N; i += 1 ) // :1 { for( j = 0; j < N; j += 1 ) // :1 { for( k = 0; k < N; k += 2 ) // :2 (unrolled loop) { C[i][j] += A[i][k ] * B[k ][j]; C[i][j] += A[i][k+1] * B[k+1][j]; } } }

Note: A C++ compiler could unroll loops as well if command line-options for unrolling are used. A software engineer should prevent such cases when source code unrolling is used at the same time, because it prevents vectorization of inner loops, and degrades performance of processing.

Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:

for( i = 0; i < N; i += 1 ) // loop 1 for( k = 0; k < N; k += 1 ) // loop 2 for( j = 0; j < N; j += 1 ) // loop 3 C[i][j] += A[i][k] * B[k][j];

It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.

The loops unrolling and LIOT allow for improved performance of processing because elements of matrices A and B are accessed more efficiently.

## Compute Schemes

A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.

In a CMMA an element (i,j) of the matrix C is calculated as follows:

C[i][j] += A[i][k] * B[k][j]

and its CS is ij:ik:kj.

However, elements of matrix B are accessed in a very inefficient way. That is, the next element of matrix B, which needs to be used in the calculation, is located at a distance of (N * sizeof (datatype)) bytes. For very small matrices this is not critical because they can fit into CPU caches. However, for larger matrices it affects performance of computations, which can be significantly degraded, due to cache misses.

In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.

In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:

C[i][j] += A[i][k] * B[j][k]

and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].

It is very important to use the fastest possible algorithm for the matrix B transposition before processing is started. In Appendix B an example is given on how much time is needed to transpose a square matrix of 32,768 elements, and how much time is needed to compute the product on an Intel Xeon Phi processor system.

Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:

for( i = 0; i < N; i += BlockSize ) { for( j = 0; j < N; j += BlockSize ) { for( k = 0; k < N; k += BlockSize ) { for( ii = i; ii < ( i+BlockSize ); ii += 1 ) for( jj = j; jj < ( j+BlockSize ); jj += 1 ) for( kk = k; kk < ( k+BlockSize ); kk += 1 ) C[ii][jj] += A[ii][kk] * B[kk][jj]; } } }

Note: A detailed description of LBOT can be found at^{10}.

Table 5 shows four examples of CSs:

CS | Note |
---|---|

ij:ik:kj | Default |

ij:ik:jk | Transposed |

iijj:iikk:kkjj | Default LBOT |

iijj:iikk:jjkk | Transposed LBOT |

**Table 5.**

## Error Analysis

In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision^{4}, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.

However, a very simple optimization technique allows improvement in the accuracy of computations.

If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:

for( i = 0; i < N; i += 1 ) { for( j = 0; j < N; j += 1 ) { double sum = 0.0; for( k = 0; k < N; k += 1 ) { sum += ( double )( A[i][k] * B[k][j] ); } C[i][j] = sum; } }

The accuracy of computations will be improved, but performance of processing can be lower.

An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).

## Performance on the Intel® Xeon Phi™ Processor System

Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL^{7}. Also see Appendix D for more evaluations.

**Figure 1.** Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.

Here are the names of source files with a short description of tests:

**mmatest1.c** - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor.**mmatest2.c** - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.**mmatest3.c** - Performance tests matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.**mmatest4.c** - Verification matrix multiply algorithms accuracy of computations on an Intel Xeon Phi processor.

## OpenMP* Product Thread Affinity Control

OpenMP* product compiler directives can be easily used to parallelize processing and significantly speed up processing. However, it is very important to execute OpenMP threads on different logical CPUs of modern multicore processors in order to utilize their internal resources as best as possible.

In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:

KMP_AFFINITY = scatter KMP_AFFINITY = balanced KMP_AFFINITY = compact

These two screenshots of the Htop* utility^{12} demonstrate how OpenMP threads are assigned (pinned) to Intel Xeon Phi processor 7210^{5} logical CPUs during processing of an MMA using 64 cores of the processor:

**Screenshot 1.** KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.

**Screenshot 2. **KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.

## Recommended Intel® C++ Compiler Command-Line Options

Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:

O3

fp-model

parallel

unroll

unroll-aggressive

opt-streaming-stores

opt-mem-layout-trans

Os

openmp

ansi-alias

fma

opt-matmul

opt-block-factor

opt-prefetch

The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.

## Conclusion

Application of different optimization techniques to the CMMA were reviewed in this article.

Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL^{7}.

Tests were completed on a computer system with an Intel® Xeon Phi processor 7210^{5} running the Linux Red Hat operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.

It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.

It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.

Advantages of using CMMA to compute the product of two matrices are as follows:

- In any programming language, simple to implement to run on CPUs or GPUs
^{9} - Highly portable source codes when implemented in C, C++, or Java programming languages
- Simple to integrate with existing software for a wide range of computer platforms
- Simple to debug and troubleshoot
- Predictable memory footprint at run time
- Easy to optimize using parallelization and vectorization techniques
- Low overheads and very good performance for matrices of sizes ranging from 256 x 256 to 1,024 x 1,024 (see Figures 1 through 5)
- Very good accuracy of computations for matrices of sizes ranging from 8 x 8 to 2,048 x 2,048 (see Table 6 in Appendix C)

Disadvantages of using CMMA to compute a product of two matrices are as follows:

- Poor performance for large matrices with sizes greater than 2048 x 2048
- Poor performance when implemented using high-level programming languages due to processing overheads
- Reduced accuracy of computations for matrices of sizes ranging from 2,048 x 2,048 to 65,536 x 65,536 (see Table 6 in Appendix C)

## References

**1. Matrix Multiplication on Mathworld**

http://mathworld.wolfram.com/MatrixMultiplication.html

**2. Matrix Multiplication on Wikipedia**

https://en.wikipedia.org/wiki/Matrix_multiplication

**3. Asymptotic Complexity of an Algorithm**

https://en.wikipedia.org/wiki/Time_complexity

**4. The IEEE 754 Standard for Floating Point Arithmetic**

**5. Intel® Many Integrated Core Architecture**

https://software.intel.com/en-us/xeon-phi/x200-processor

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

https://software.intel.com/en-us/forums/intel-many-integrated-core

**6. Intel® C++ Compiler**

https://software.intel.com/en-us/c-compilers

https://software.intel.com/en-us/forums/intel-c-compiler

**7. Intel® MKL**

https://software.intel.com/en-us/intel-mkl

https://software.intel.com/en-us/intel-mkl/benchmarks

https://software.intel.com/en-us/forums/intel-math-kernel-library

**8. Intel® Developer Zone Forums**

https://software.intel.com/en-us/forum

**9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9**

https://software.intel.com/en-us/articles/sgemm-ocl-opt

**10. Performance Tools for Software Developers Loop Blocking**

https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking

**11. Memkind library**

https://github.com/memkind/memkind

**12.** **Htop* monitoring utility**

https://sourceforge.net/projects/htop

### Downloads

List of all files (sources, test reports, and so on):

**Performance_CMMA_system.pdf** - Copy of this paper.

**mmatest1.c** - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.

**dataset1.txt** - Results of tests.

**mmatest2.c** - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.

**dataset2.txt** - Results of tests.

**mmatest3.c** - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRA MASs.

**dataset3.txt** - Results of tests.

**mmatest4.c** - Verification of matrix multiply algorithms accuracy of computations on Intel® Xeon Phi processors.

**dataset4.txt ** - Results of tests.

Note: Intel C++ Compiler versions used to compile tests:

17.0.1 Update 132 for Linux*

16.0.3 Update 210 for Linux*

### Abbreviations

CPU - Central processing unit

GPU - Graphics processing unit

ISA - Instruction set architecture

MIC - Intel® Many Integrated Core Architecture

RAM - Random access memory

DRAM - Dynamic random access memory

MCDRAM - Multichannel DRAM

HBW - High bandwidth memory

DDR4 - Double data rate (generation) 4

SIMD - Single instruction multiple data

SSE - Streaming SIMD extensions

AVX - Advanced vector extensions

FP - Floating point

FPA - Floating point arithmetic^{4}

SP - Single precision^{4}

DP - Double precision^{4}

FLOPS - Floating point operations per second

MM - Matrix multiplication

MMA - Matrix multiplication algorithm

CMMA - Classic matrix multiplication algorithm

MTA - Matrix transpose algorithm

AC - Asymptotic complexity

IC - Implementation complexity

EA - Error analysis

MAS - Memory allocation scheme

LPS - Loop processing scheme

CS - Compute scheme

LIOT - Loop interchange optimization technique

LBOT - Loop blocking optimization technique

ICC - Intel C++ Compiler^{6}

MKL - Math kernel library^{7}

CBLAS - C basic linear algebra subprograms

IDZ - Intel® Developer Zone^{8}

IEEE - Institute of Electrical and Electronics Engineers^{4}

GB - Gigabytes

TN - Total number

## Appendix

### Appendix A - Technical Specifications of the Intel® Xeon Phi™ Processor System

Summary of the Intel Xeon Phi processor system used for testing:

Process technology: 14nm

Processor name: Intel Xeon Phi processor 7210

Frequency: 1.30 GHz

Packages (sockets): 1

Cores: 64

Processors (CPUs): 256

Cores per package: 64

Threads per core: 4

On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)

DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)

ISA: Intel® AVX-512 (Vector length 512-bit)

Detailed processor specifications:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

Summary of a Linux operating system:

[guest@... ~]$ uname -a

Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP

Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[guest@... ~]$ cat /proc/version

Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)

(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016

### Appendix B - Comparison of Processing Times for MMAs versus MTA

Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:

**[Intel MKL & CMMA]**

Matrix A [32768 x 32768] Matrix B [32768 x 32768]

Number of OpenMP threads: 64

MKL - Completed in: 51.2515874 seconds

CMMA - Completed in: 866.5838490 seconds

**[MTA]**

Matrix size: 32768 x 32768

Transpose Classic - Completed in: 1.730 secs

Transpose Diagonal - Completed in: 1.080 secs

Transpose Eklundh - Completed in: 0.910 secs

When compared processing time of MTA to:

MKL 'cblas_sgemm'. the transposition takes ~2.42 percent of the processing time.

CMMA, the transposition takes ~0.14 percent of the processing time.

### Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)

N | MMA | Calculated SP Value | Absolute Error |
---|---|---|---|

8 | MKL | 8.000080 | 0.000000 |

8 | CMMA | 8.000080 | 0.000000 |

16 | MKL | 16.000160 | 0.000000 |

32 | CMMA | 16.000160 | 0.000000 |

32 | MKL | 32.000309 | -0.000011 |

32 | CMMA | 32.000320 | 0.000000 |

64 | MKL | 64.000671 | 0.000031 |

128 | CMMA | 64.000641 | 0.000001 |

128 | MKL | 128.001160 | -0.000120 |

128 | CMMA | 128.001282 | 0.000002 |

256 | MKL | 256.002319 | -0.000241 |

512 | CMMA | 256.002563 | 0.000003 |

512 | MKL | 512.004639 | -0.000481 |

512 | CMMA | 512.005005 | -0.000115 |

1024 | MKL | 1024.009521 | -0.000719 |

2048 | CMMA | 1024.009888 | -0.000352 |

2048 | MKL | 2048.019043 | -0.001437 |

2048 | CMMA | 2048.021484 | 0.001004 |

4096 | MKL | 4096.038574 | -0.002386 |

8192 | CMMA | 4096.037109 | -0.003851 |

8192 | MKL | 8192.074219 | -0.007701 |

8192 | CMMA | 8192.099609 | 0.017689 |

16384 | MKL | 16384.14648 | -0.017356 |

32768 | CMMA | 16384.09961 | -0.064231 |

32768 | MKL | 32768.33594 | 0.008258 |

32768 | CMMA | 32768.10156 | -0.226118 |

65536 | MKL | 65536.71875 | 0.063390 |

65536 | CMMA | 65536.10156 | -0.553798 |

**Table 6.**

### Appendix D - Performance of MMAs for Different MASs

**Figure 2.** Performance of Intel® MKL 'cbals_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.

**Figure 3.** Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.

**Figure 4.** Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.

**Figure 5.** Performance of Intel® MKL 'cbals_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.

## About the Author

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.