In-place matrix transposition, a standard operation in linear algebra, is a memory-bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, because transposition accesses memory non-contiguously, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the efficiency of the transposition algorithm.
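The efficiency metric described above can be illustrated with a short C sketch. The function names are hypothetical and do not come from the paper's code. The byte count also rests on an assumption: every matrix element is counted as read once and written once.

```c
#include <stddef.h>

/* Hypothetical helper: transposition rate in GB/s for an n x n matrix.
   Assumption: each element is counted as one read plus one write,
   so the traffic is 2 * n * n * elem_size bytes. */
double transposition_rate_gbs(size_t n, size_t elem_size, double seconds) {
    const double bytes = 2.0 * (double)n * (double)n * (double)elem_size;
    return bytes / seconds * 1e-9;
}

/* Hypothetical helper: efficiency as the ratio of the transposition rate
   to the memory copy bandwidth, both given in the same units. */
double transposition_efficiency(double rate_gbs, double copy_bw_gbs) {
    return rate_gbs / copy_bw_gbs;
}
```

Under this counting, a rate equal to the copy bandwidth gives an efficiency of 1.0, and any overhead from non-contiguous access shows up as a value below 1.0.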
This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel® Xeon® Processors and 113 GB/s (67% efficiency) on Intel® Xeon Phi™ coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP*, and vectorization is automatically implemented by the Intel compiler. This approach allows the same C code to be used for both the CPU and the MIC architecture executables, each demonstrating high efficiency. For benchmarks, an Intel Xeon Phi 7110P coprocessor is used.
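As a point of reference, a minimal sketch of parallel in-place square matrix transposition with OpenMP might look like the following. This is an untuned baseline written for illustration, not the paper's optimized implementation. The function name, the row-major float layout, and the choice of dynamic scheduling are all assumptions.

```c
#include <stddef.h>

/* Hypothetical baseline (not the paper's tuned code): transposes an
   n x n row-major matrix of floats in place by swapping each element
   below the diagonal with its mirror element above the diagonal. */
void transpose_inplace(float *a, size_t n) {
    const long nn = (long)n;
    /* Each (i, j) pair with j < i is touched by exactly one iteration,
       so rows can be distributed across threads without conflicts.
       Row i performs i swaps, so the work per row is uneven; dynamic
       scheduling is an assumption made here to balance the load.
       If OpenMP is not enabled, the pragma is ignored and the loop
       runs serially. */
#pragma omp parallel for schedule(dynamic)
    for (long i = 1; i < nn; i++) {
        for (long j = 0; j < i; j++) {
            const float tmp = a[i * nn + j];
            a[i * nn + j] = a[j * nn + i];
            a[j * nn + i] = tmp;
        }
    }
}
```

The inner loop reads one operand with unit stride and the other with a stride of n elements, which is the non-contiguous access pattern that keeps a simple version like this well below the memory copy bandwidth.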
To run the benchmark, execute the script ./benchmark.sh.
The included Makefile and script ./benchmark.sh are designed for Linux.
To compile the CPU code, you must have the Intel C++ compiler installed on the system.
To compile and run the MIC platform code, you must have an Intel Xeon Phi coprocessor in the system and the MIC Platform Software Stack (MPSS) installed and running.
The code for Multithreaded Transposition of Square Matrices is available on GitHub: https://github.com/ColfaxResearch/Transposition