This workshop is inspired by Dan Grossman’s SIGCSE 2011 workshop on Data Abstractions. We will review C/C++ conversions of the original Java-based materials and include material from the Parallel Algorithms course at Kent State. The workshop will appeal to instructors of data-structures and algorithms courses. Topics will include divide-and-conquer approaches, work-sharing concepts, and a scoped locking scheme in OpenMP for C++ classes.
In Part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of Matrix Multiplication on a dual Xeon 5570 system with two sockets and two L3 caches, each shared by four cores (8 threads). Each processor also has four L2 and four L1 caches, each shared by one core (2 threads). We find:
In the last installment (Part 3) we saw the effects of the QuickThread Parallel Tag Team method of Matrix Multiplication on two single-processor systems:
By Jim Dempsey
In the previous article (Part 2) we saw that reorganizing the loops and using a temporary array yields a performance gain from SSE small-vector optimizations (the compiler does this), but that a larger gain comes from better cache utilization due to the layout change and array access order. The improvements pushed us into a memory bandwidth limitation whereby the Serial method now outperforms the Parallel method (the parallelized form of the Serial method).
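To make the loop reorganization concrete, here is a minimal serial sketch in C++. It is not the code from the article series: the names `Matrix`, `multiply_naive`, and `multiply_transposed` are hypothetical, and the sketch assumes square, row-major matrices stored in a single contiguous buffer. It shows only the general idea of copying one operand into a transposed temporary array so that the inner loop reads both operands sequentially.

```cpp
#include <cstddef>
#include <vector>

// Assumption for illustration: an n x n matrix stored row-major
// in one contiguous buffer, element (r, c) at index r * n + c.
using Matrix = std::vector<double>;

// Baseline i-j-k ordering. The inner loop reads b with a stride of n
// elements, so nearly every access to b touches a different cache line.
Matrix multiply_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Variant with a temporary transposed copy of b. After the copy, the inner
// loop walks a row of a and a row of bt, both contiguous in memory, which
// is friendlier to the cache and easier for a compiler to vectorize.
Matrix multiply_transposed(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix bt(n * n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            bt[j * n + k] = b[k * n + j];
        }
    }

    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

Both functions accumulate each dot product in the same order, so they return identical results. The transposed variant pays an extra O(n^2) copy in exchange for unit-stride access inside the O(n^3) loop nest. Whether that trade pays off, and by how much, depends on matrix size and the cache hierarchy of the machine.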