Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency.

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel® Xeon® processors and 113 GB/s (67% efficiency) on Intel® Xeon Phi™ coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP*, and vectorization is implemented automatically by the Intel compiler. This approach allows the same C code to be used to build executables for both the CPU and the MIC architecture, both demonstrating high efficiency. Benchmarks were performed on an Intel Xeon Phi 7110P coprocessor.

To run the benchmark, execute the script ./benchmark.sh.

The included Makefile and script ./benchmark.sh are designed for Linux.

In order to compile the CPU code, you must have the Intel C++ compiler installed on the system.

In order to compile and run the MIC platform code, you must have an Intel Xeon Phi coprocessor in the system and the MIC Platform Software Stack (MPSS) installed and running.

Multithreaded Transposition of Square Matrices code GitHub link: https://github.com/ColfaxResearch/Transposition

 


Comments (3)

Joseph R.:

Dear Mike P., thanks for the stimulating squib on your square matrix in-place transpositioner. I imagine it as handling 2D problems. Do you think it will lead, or has it led, to attempts to transpose a 3D matrix in-place, and if so, what is the estimated efficiency? Thanks again.

Mike P. (Intel):

So glad you are able to increase your point count with a simple "Thanks".

