Multithreaded Development
Detecting Memory Bandwidth Saturation in Threaded Applications
Abstract
Superscalar programming 101 (Matrix Multiply) Part 5 of 5
In part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of Matrix Multiplication performed on a dual Xeon 5570 system with two sockets and two L3 caches, each shared by four cores (8 threads), and with four L2 and four L1 caches per processor, each shared by one core (2 threads). We find:

Superscalar programming 101 (Matrix Multiply) Part 4 of 5
In the last installment (Part 3) we saw the effects of the QuickThread Parallel Tag Team method of Matrix Multiplication performed on two single-processor systems:

Superscalar programming 101 (Matrix Multiply) Part 3 of 5
By Jim Dempsey
In the previous article (part 2) we saw that reorganizing the loops and using a temporary array yields a performance gain from SSE small-vector optimizations (the compiler does this), but that a larger gain came from better cache utilization due to the layout change and array access order. The improvements pushed us into a memory bandwidth limitation whereby the Serial method now outperforms the parallelized version of that same method.
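The loop reorganization with a temporary array can be sketched roughly as follows. This is a minimal illustration of the general technique, not the code from the article series: the names `Matrix`, `multiply_naive`, and `multiply_transposed` are hypothetical, and the sketch assumes square, row-major matrices stored in a flat `std::vector<double>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<double>;

// Baseline triple loop. The inner loop reads b with a stride of n elements,
// so consecutive iterations touch different cache lines.
void multiply_naive(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Reorganized version. b is first transposed into a temporary array so that
// the inner loop reads both operands with unit stride, which is friendlier to
// the cache and easier for a compiler to vectorize.
void multiply_transposed(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    Matrix bt(n * n);  // temporary holding the transpose of b
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            bt[j * n + i] = b[i * n + j];
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Both functions compute the same product; only the memory access pattern of the inner loop differs. The transpose costs an extra pass over `b`, which is small compared with the multiplication itself for large `n`.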
Superscalar Programming 101 (Matrix Multiply) Part 1 of 5
By Jim Dempsey
The subject of this article is how to optimally tune a well-known algorithm. We will take this well-known (small) algorithm, look at a common approach to parallelizing it, then a better approach, and finally produce a fully cache-sensitized approach to parallelizing it. The intention of this article is to teach you a methodology for interpreting the statistics gathered during test runs and then using those interpretations to improve your parallel code.

