This year we posed a simple matrix multiplication problem to the students and set up an internal contest to obtain the fastest serial code. Many versions were submitted, and we eventually reached a 20x speedup over the most naïve implementation. The students learned a lot about compiler optimizations and, above all, about the effect of the caches on the performance of the code.
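To make the cache effect concrete, here is a minimal sketch in C. It is not one of the contest submissions. It assumes square, row-major matrices of doubles, and the function names `matmul_ijk` and `matmul_ikj` are made up for this illustration. Both functions compute the same product; they differ only in loop order, which decides whether the innermost loop reads memory contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only: n x n matrices stored row-major in flat arrays. */

/* Naive i-j-k order: the inner loop walks B down a column, so consecutive
   reads of B are n doubles apart and use little of each cache line. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* i-k-j order: the inner loop walks one row of B and one row of C
   contiguously, so every cache line that is fetched is fully used. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++) {
        /* Clear row i of C before accumulating into it. */
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            const double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Timing both functions on matrices too large to fit in cache is a simple way to see the difference on a given machine. How large the gap is depends on the hardware, the compiler and its flags.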
The objective of this exercise was to extend that work to a massively multicore architecture. With 32 cores performing the matrix multiplication over the QuickPath memory communication architecture, the scenario was complex enough to explore different solutions.