# Superscalar programming 101 (Matrix Multiply) Part 5 of 5

In part 4 we saw the effects of the QuickThread Parallel Tag Team Transpose method of matrix multiplication performed on a dual Xeon 5570 system with 2 sockets and two L3 caches, each shared by four cores (8 threads), and with four L2 and four L1 caches per processor, each shared by one core (2 threads). We found:

Fig 18 (17 on part-4)

The Intel Many Core Testing Laboratory was kind enough to provide me some time using their systems.

Running the same method (sans Cilk++) on a 4-processor Intel Xeon 7560 system, each processor with 8 cores plus Hyper-Threading (32 cores, 64 threads in total), we observe:

Fig. 19

In this chart we do not see a plateau in the scaling. This is because the problem size at N=2048 is still fully contained within the system cache. Caution: keep in mind that the above chart represents scaling relative to the cache-insensitive Serial method.
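A rough footprint check makes the "fully contained within the system cache" observation concrete. Note that the 24 MB per-socket L3 figure is my recollection for the Xeon 7560; verify against the part's datasheet:

```cpp
// Working-set arithmetic behind the cache-containment observation:
// at N = 2048 with double-precision elements, each matrix occupies
// 2048 * 2048 * 8 bytes = 32 MiB.  The two input operands (m1 and the
// transposed m2t) total 64 MiB, which fits within the roughly 96 MiB
// of combined L3 on a 4-socket Xeon 7560 system (assuming 24 MiB of
// L3 per socket).
long matrix_bytes(long n)
{
    return n * n * 8L;  // 8 bytes per double-precision element
}
```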

When comparing this to the cache-sensitive Serial Transpose method we find a different set of results:

Fig 20

The sharp change in slope at N=1500-1700 is mainly due to the drop in performance of the reference Serial Transpose data, rather than to an improvement in PTTx.
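The Serial Transpose baseline was developed in the earlier parts of this series. As a reminder, a minimal sketch of the idea (transpose the second operand once, so the inner product walks both arrays with unit stride) might look like the following; the actual implementation used for these measurements differs in detail:

```cpp
#include <cstddef>
#include <vector>

// Cache-friendly "Serial Transpose" multiply sketch: m3 = m1 * m2.
// m2 is first copied into its transpose (m2t) so the innermost loop
// reads both operands sequentially instead of striding down columns.
// The names m1 and m2t follow this article's usage.
void serial_transpose_multiply(const std::vector<double>& m1,  // N x N, row-major
                               const std::vector<double>& m2,  // N x N, row-major
                               std::vector<double>& m3,        // N x N, row-major
                               std::size_t N)
{
    // One-time transpose: columns of m2 become rows of m2t.
    std::vector<double> m2t(N * N);
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            m2t[c * N + r] = m2[r * N + c];

    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k)
                sum += m1[r * N + k] * m2t[c * N + k];  // both unit stride
            m3[r * N + c] = sum;
        }
    }
}
```

The transpose costs one extra pass over m2, but every subsequent dot product streams through contiguous memory, which is what makes this baseline cache sensitive.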

The scaling factor (parallel performance divided by the number of hardware threads) is often used as a decision factor in making a purchase. Let's look at the scaling factor charts:

Fig 21

We find that as the problem size increases, the scaling factor shows a nice positive slope. This looks exceptionally good. Too good to be believed. It is important to remember that the Serial method is not cache sensitive and is not a valid baseline for comparison.
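For clarity, the scaling factor as I read it off these charts reduces to a one-line formula; which serial baseline supplies the numerator changes the verdict entirely:

```cpp
// Scaling factor as used in these charts (my reading of the text):
//   speedup        = serial_time / parallel_time
//   scaling_factor = speedup / number of hardware threads
// A factor above 1.0 indicates super-linear ("super") scaling.
double scaling_factor(double serial_time, double parallel_time, int hw_threads)
{
    return (serial_time / parallel_time) / hw_threads;
}
```

With the cache-insensitive Serial method supplying serial_time, the numerator is inflated at large N, which is why the factors in Fig 21 look too good to be believed.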

When we compute the scaling factor against the cache-friendly Serial Transpose method, we find a completely different picture:

Fig 22

This chart will really deflate your programmer's ego. After all this hard work, we find that the scaling factor against the cache-sensitive Serial Transpose method does not pay off (the factor crosses 1) until N = 1824.

Comparing the factors of the 2x Intel Xeon 5570, likewise taken against the Serial Transpose method, we find:

Fig 23

As expected, both systems can attain super scaling, but at different problem sizes. This is due to the different amounts of cache memory on each system.

Although the scaling factor provides a good perspective on return on investment with respect to the purchase of more processors, the scaling factor of one processor architecture is not meaningful when compared to that of a different processor architecture. A company ought to be interested in total return on investment, and this includes a time element.

When looking at the time element, we get a completely different picture. Comparing the fastest method (QuickThread Parallel Tag Team Transpose with SSE intrinsic functions) we find:

Fig 24

When including time as a determinant in the cost/benefit analysis, we find a rather drastic transition in the cost/benefit ratio as you cross a particular threshold in problem size (N=1400). The point being made here is to use appropriately sized test cases when making evaluations for purchase decisions. The cost/benefit and performance curves will not always be suitable for extrapolation.

When we run the fastest method (QuickThread Parallel Tag Team Transpose with SSE intrinsic functions) at larger matrix sizes we find:

Fig 25

Matrices up to N = 3000 to 4096 can be handled with 4 processors (32 cores / 64 threads); larger matrices may require additional processors and/or a revised method.

Conclusions up to this point:

The fastest method, Parallel Tag Team Transpose with SSE intrinsic functions, relies on QuickThread's ability to schedule affinity-pinned threads by cache-level proximity. The ability to coordinate work using cache sensitivity can pay off big in your optimization strategies.

Larger matrix sizes could be handled in an improved manner with the same number of processors (4x 8-core with HT) when combined with an additional tiling strategy, which introduces additional overhead. This is typically called the divide and conquer method, often used by parallel programmers.

Taking the matrix at N = 5200 and splitting it in two along both axes yields tiles of N = 2600, four tiles per matrix. Each of the four output tiles requires two tile multiplications, for 4 x 2 = 8 passes over the smaller tiles. The matrix at N = 2600 took approximately 0.33 seconds to compute, so the estimated computation time would be 0.33 x 8 = 2.64 seconds. That is an estimated 10x improvement over the un-tiled method, but it may not be fast enough for your purposes.
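The tile arithmetic above generalizes: splitting each axis into s pieces gives s-squared output tiles, each needing s tile-multiply passes, for s-cubed passes in total. A small sketch of the estimate (the 0.33 s per-tile figure is the article's measurement, an input here, not something this code derives):

```cpp
// Divide-and-conquer (tiling) time estimate for an N x N multiply
// split `splits` ways along each axis.  Illustrative arithmetic only;
// it ignores the accumulation and scheduling overhead the article
// mentions.
struct TileEstimate {
    long tile_n;           // edge size of one tile
    long tile_multiplies;  // total tile-multiply passes = splits^3
    double est_seconds;    // passes * measured seconds per tile
};

TileEstimate estimate_tiled(long n, long splits, double sec_per_tile)
{
    TileEstimate e;
    e.tile_n = n / splits;
    e.tile_multiplies = splits * splits * splits;  // 2*2*2 = 8 for splits = 2
    e.est_seconds = e.tile_multiplies * sec_per_tile;
    return e;
}
```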

Would divide and conquer be the best strategy to use?

This depends on the interpretation of best.

In terms of relative performance return for programming effort, this may be so. However, looking at Fig 18 (17 on part-4), and comparing Cilk++ to the QuickThread Parallel Tag Team XMM method, we have demonstrated that by paying particular attention to cache locality (specifically, what is in the L1, L2 and L3 caches, and when it is in those caches) you can attain an additional 1.4x to 2.5x performance boost.

I will attempt to lay out the strategy which I believe will make effective use of the system caches. While the sketch below won’t show the specific method, it will demonstrate the general plan of attack.

The current Parallel Tag Team (transpose) method divides the work by L3 region, then by L2 region, and then takes an L1-friendly path in producing the results. This strategy works exceptionally well until the matrix reaches a size at which the execution path begins to evict data from the L3 cache. It is my postulation that by following the same path as the Parallel Tag Team (transpose) method, but imposing a clipping technique on the distance from the diagonal, you can minimize L3 cache evictions. The chart below illustrates a clipped L2 path through the current L3 workspace.

Fig 26

In the above chart, the general execution path follows the arrow. The colored (red) cells indicate the output cells whose results have been computed. The white cells indicate those output cells that have yet to be computed.

In the current Parallel Tag Team (transpose) method all of the above cells would have been colored. In the proposed method for large matrices, a clipping technique limits the distance from the diagonal of the output cells to be computed while processing the diagonal. N.B. the above is a simplification for a 1P system.
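As a sketch of the clipping idea (my illustration of the postulated technique, not the QuickThread implementation): enumerate the L2-sized output blocks, but on the first pass keep only those within a fixed band of the diagonal, deferring the rest so the working set near the diagonal stays resident:

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// First-pass block schedule under the proposed clipping technique:
// of the blocks_per_side x blocks_per_side grid of L2-sized output
// blocks, select only those whose block coordinates (r, c) lie within
// `clip` blocks of the diagonal.  Blocks outside the band are deferred
// to a later pass (as in Fig 29).  Illustrative only.
std::vector<std::pair<int, int>> clipped_band(int blocks_per_side, int clip)
{
    std::vector<std::pair<int, int>> order;
    for (int r = 0; r < blocks_per_side; ++r)
        for (int c = 0; c < blocks_per_side; ++c)
            if (std::abs(r - c) <= clip)  // within the diagonal band
                order.push_back({r, c});
    return order;
}
```

With clip set wide enough to cover the whole grid, this degenerates to the current method, where every block is processed on the first pass.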

In processing a large matrix, had the computations included the empty cells, the computation would first suffer L2 cache evictions to L3, then, at some size, evictions from L3. This postulation is possibly confirmed by Fig 25.

Fig 27

In Fig 27 above (Fig 25 with arrows added), the red arrow depicts L2 evictions and the blue arrow depicts L3 evictions.

Back to Fig 26. Upon completion of the output cells in Fig 26 we find:

Fig 28

The X's mark the cells in the output L2 zone whose results have been completed. The blue cells represent columns (stored as rows) of the m2t array that are still residing in the L2 cache, and the red cells represent row cells of the m1 array that are in the L2 cache. Additionally (not depicted by colorization), some portion of the bottom row(s) and right-most column(s) is still residing in the L1 cache. The remaining un-X'd white cells may, or may not, be residing in the L3 cache.

The next computation sequence (subject to verification) ought to follow the path depicted by the arrows in Fig 29:

Fig 29

The red and blue ends of the output matrix should be processed in an alternating sequence as you progress along the arrows towards the first diagonal.

In the earlier-mentioned divide and conquer (tiling) method, you would process 4 smaller tiles twice each, or 8x the time of a smaller tile, presumably one of a size found optimal for the L2 cache. The tiling method might benefit from residual data in L2, resulting in 6x to 8x the run time of the smaller matrix, as opposed to the full 8x.

In the proposed method (call it cross diagonal), for the size range depicted above in Figs 28 and 29, and based upon my prior experience with the Parallel Tag Team Transpose technique, it is estimated that it may be possible to produce the result in 1.5x to 2x the time of the smaller matrix, potentially besting the divide and conquer (tiling) method by a factor of 4x. It should be stressed that the actual differences may vary from this estimate. Extrapolation, as mentioned earlier, often does not follow the curve established by present data.

I hope you have found my series of articles insightful. This article cannot convey the detail of the QuickThread Parallel Tag Team Transpose XMM method, whereas the code can. For those interested in obtaining a copy of the code and a demo license for QuickThread, feel free to contact me at my email address below. QuickThread runs on Windows and Linux systems: both 32-bit and 64-bit for Windows, but only 64-bit for Linux (Ubuntu and Red Hat).

Jim Dempsey