Superscalar programming 101 (Matrix Multiply) Part 4 of 5

In the last installment (Part 3) we saw the effects of the QuickThread Parallel Tag Team method of matrix multiplication on two single-processor systems:

The Intel Q6600 (4 cores, no HT), with two cores (two threads) sharing L1 and L2 caches, attained a 40x to 50x improvement over the serial method, while the Intel Core i7 920 (4 cores, with HT), with four cores (eight threads) sharing one L3 cache and each core (two threads) sharing L1 and L2 caches, attained a 70x to 80x improvement. Let’s see how this performs using two processors, each similar to the Core i7 920.

When run on a dual Xeon 5570 system with two sockets and two L3 caches, each shared by four cores (eight threads), and with each processor having four L2 and four L1 caches, each shared by one core (two threads), we find:

Fig 17

for a scaling over serial of 140x to 150x in the N = 700 to 1344 range. The performance is almost twice that of the Core i7 920, which was somewhat expected.

There are some interesting observations to be made about this performance profile. While the 2x speed-up was expected, note that the Parallel Transpose method performed as well as the Parallel Tag Team method from N = 700 to 1024, then dropped off precipitously. Its peak range is about half the peak performance range of the Parallel Tag Team method (N = 700 to 1344).

Why are the plateaus the same height?
Why do the drop-off points differ?

The plateaus are the same height for the same reason we saw in Fig 5 and Fig 6, where the Serial Transpose and the Parallel Transpose performance were essentially the same (the yellow and red lines in Fig 17 above): a resource bandwidth limitation. In Fig 5 and Fig 6 the limiting resource appeared to be memory bandwidth (since the Parallel Tag Team method had ample room to outperform Parallel Transpose). Given the near equality of the plateaus here (in N = 700 to 1024), some resource other than memory bandwidth appears to be the limiting factor. That leaves cache access overhead or an SSE floating-point bottleneck.
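For reference, a plain C++ sketch of a Transpose-method multiply is shown below (a hypothetical illustration, not the article's actual QuickThread code; the names `matmul_transpose`, `A`, `B`, `C` are my own). Transposing B once makes the inner dot product walk both operands with stride-1 accesses, but every element must still be streamed through the memory and cache hierarchy, which is why the method flattens against a shared-resource ceiling rather than scaling further:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a Transpose-method multiply (NOT the article's
// QuickThread code). B is transposed once so the inner dot product walks
// both operands with stride-1 accesses. Every element is still streamed
// from memory, so with all threads busy the method saturates a shared
// resource (bandwidth, cache, or the SSE floating-point units).
void matmul_transpose(const std::vector<double>& A,
                      const std::vector<double>& B,
                      std::vector<double>& C, std::size_t N)
{
    std::vector<double> Bt(N * N);
    for (std::size_t i = 0; i < N; ++i)       // one-time O(N^2) transpose
        for (std::size_t j = 0; j < N; ++j)
            Bt[j * N + i] = B[i * N + j];

    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)   // both rows stride-1
                sum += A[i * N + k] * Bt[j * N + k];
            C[i * N + j] = sum;
        }
}
```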

Both of these bottlenecks tend to clip the height of the performance curve but not its width. You can observe in the chart above that the two Parallel Tag Team methods managed to double the breadth of the peak performance curve, permitting larger matrices to be handled effectively by the program. The increase in breadth (larger matrices handled) is principally due to more effective reuse of cached data, which in turn comes from the solution path through the problem (the sequence in which the computations are made).

The insight from Fig 17 is: when your problem's working data set exceeds the capacity of the cache system, you may find some paths to the solution more efficient than a simple nested loop.
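To give one concrete (and hypothetical) example of such a path, here is a minimal sketch of a cache-blocked multiply. This is a generic tiling illustration, not the Parallel Tag Team code itself; `BS` and `matmul_blocked` are names I have invented for it. Each tile of A and B is reused many times while it is still resident in cache, which is the kind of reuse that widens the performance plateau for large N:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic cache-blocked (tiled) multiply sketch, not the article's code.
// Each BS x BS tile of A and B is reused across an entire tile of C
// while it is hot in cache.
constexpr std::size_t BS = 64;  // tile edge; tune so three tiles fit in cache

void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, std::size_t N)  // C must be zeroed
{
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                for (std::size_t i = ii; i < std::min(ii + BS, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, N); ++k) {
                        const double a = A[i * N + k];  // reused across j tile
                        for (std::size_t j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Note the i-k-j ordering inside the tile: the innermost loop walks a row of B and a row of C with stride 1, so no transpose is needed at all; the traversal order, not extra threads, is what buys the reuse.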

In the 5th article we will explore how to extend the performance curve to handle larger matrices. Will this involve more cores/CPUs, or a different solution path? You will have to wait for the next installment (Part 5) to find out.

Jim Dempsey
For more complete information about compiler optimizations, see our Optimization Notice.