Here are the questions and answers from the Future-Proof Your Applications's Performance With Vectorization technical presentation held February 15, 2012.
In part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of Matrix Multiplication performed on a Dual Xeon 5570 systems with 2 sockets and two L3 caches, each shared by four cores (8 threads). and each processor with four L2 and four L1 caches each shared by one core and 2 threads, we find:
In the last installment (Part 3) we saw the effects of the QuickThread Parallel Tag Team method of Matrix Multiplication performed on two single processor systems:
By Jim Dempsey
In the previous article (part 2) we have seen that by reorganizing the loops and with use of temporary array we can observe a performance gain with SSE small vector optimizations (compiler does this) but a larger gain came from better cache utilization due to the layout change and array access order. The improvements pushed us into a memory bandwidth limitation whereby the Serial method now outperforms the Parallel method (of the Serial method).