Matrix Multiplication, Performance, and Scalability in OpenMP: Student Challenge




High Performance Computing: Models, Methods and Means, first semester 2010, FaMAF, National University of Córdoba, Argentina.
Lic. Nicolás Wolovick

Motivation and Objectives
I am the teacher of the HPCMMM10 course at FaMAF, National University of Córdoba; this course is a satellite of Dr. Sterling's CSC7600 at LSU. In our second year, we have gained momentum in HPC education and training, as well as in consulting within our University.

At the beginning of this year, a simple, widely known and studied problem was posed to the students: matrix multiplication. We ran an internal contest to obtain the fastest serial code. Many versions were submitted, and we finally obtained a 20x improvement over the most naïve implementation. The students learned a lot about compiler optimizations and, above all, about the effect of caches on the performance of the code.

The objective was to extrapolate this exercise to a massive multicore architecture, like the one provided by the four Nehalem EX processors that Intel introduced in the first quarter of this year. Having 32 cores to perform the matrix multiplication under the QuickPath memory communication architecture provided a complex enough scenario to explore different solutions.

Rules
The students were introduced to the problem and given kickstart code with a naïve C/OpenMP implementation of the problem, together with a few rules to follow:

  • Use the 32 physical cores; Hyper-Threading (HT) was not allowed, since the compute nodes had HT disabled.
  • Any language accepting OpenMP (C, C++, Fortran) could be used, as long as the datatype storing the matrix elements was 32-bit single-precision IEEE 754 (float in C).
  • The winner would be the student obtaining the best time for OMP_NUM_THREADS=32 with reasonable scalability. If the 32-processor times differed by less than 5%, the winner would be the one with the lowest sum of walltimes over the 1, 2, 4, 8, 16, and 32 processor curve.
  • The execution had to be reproducible in the teacher's account. This implies using the compilers installed on the server, not hand-crafting code, and not using stochastic code with high variance.
  • The result had to be correct for OMP_NUM_THREADS={1,2,4,8,16,32} with respect to the trivial implementation given by the teacher (a verification sketch follows this list).
  • Submit all jobs using PBS.
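
Because the matrices hold single-precision floats and optimized kernels may reorder the additions, "correct" is naturally understood up to rounding. A minimal verification sketch is given below; the tolerance value and the helper name are illustrative assumptions, not part of the official harness.

#include <math.h>
#include <stdio.h>

/* Compares a candidate result against the C produced by the teacher's
 * trivial code, allowing a small relative error, since blocked or Strassen
 * variants change the summation order of the floats. */
static int check_result(const float *c, const float *c_ref, size_t n, float tol)
{
    for (size_t t = 0; t < n; t++) {
        float denom = fabsf(c_ref[t]) > 1.0f ? fabsf(c_ref[t]) : 1.0f;
        if (fabsf(c[t] - c_ref[t]) / denom > tol) {
            fprintf(stderr, "mismatch at %zu: %g vs %g\n", t, c[t], c_ref[t]);
            return 0;   /* differs beyond tolerance */
        }
    }
    return 1;           /* correct within tolerance */
}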

Hardware and Software
The hardware was a small cluster of a master node and two compute nodes, each featuring a motherboard with four Intel® Xeon® X7560 processors; each processor has 24MB of shared L3 cache, 256KB of per-core L2 cache, and 32KB of split instruction and data L1 caches per core. Each compute node had 64GB of RAM.

The operating system was RHEL* 5.4, with PBS Pro* version 10.2 as the resource manager. Three compiler suites were available: the Intel® Compiler Suite v11.1 and the GNU* Compiler Collection 4.1.1 and 4.4.3.

Student Activity
Twelve prospective students joined the challenge, some from the current year's HPC course and some from the 2009 edition.

The trial period was not optimal with respect to our academic calendar: many students were busy with mid-term exams between May 10 and May 28, when the core of the trial took place. This precluded a deeper engagement with the challenge, and only half of the students tried to improve on the one-second mark of the trivial kernel.

Kickstart Code
Students were provided with kickstart code that fixed some initial parameters. The multiplication C = A*B had to be done on 4096x4096 matrices, with A and B initialized with random elements and C zeroed. The elements were IEEE 754 single-precision floating-point numbers, and the walltime measurement included only the multiplication kernel.

The matrix multiplication kernel is given below.

// matrix multiplication, naïve
#pragma omp parallel for private(j,k) shared(a,b,c)
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[INDEX(i,j)] += a[INDEX(i,k)] * b[INDEX(k,j)];
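
For completeness, a minimal standalone harness in the spirit of the kickstart code is sketched below. The INDEX macro definition, the use of rand() for the random initialization, and omp_get_wtime() for the kernel-only walltime are assumptions about the original setup, not a copy of it.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4096
#define INDEX(i, j) ((size_t)(i) * N + (j))   /* row-major layout, assumed */

int main(void)
{
    float *a = malloc((size_t)N * N * sizeof *a);
    float *b = malloc((size_t)N * N * sizeof *b);
    float *c = calloc((size_t)N * N, sizeof *c);   /* C starts zeroed */
    if (!a || !b || !c) return 1;

    /* A and B initialized with random elements, as the rules require */
    for (size_t t = 0; t < (size_t)N * N; t++) {
        a[t] = (float)rand() / RAND_MAX;
        b[t] = (float)rand() / RAND_MAX;
    }

    int i, j, k;
    double t0 = omp_get_wtime();   /* walltime covers only the kernel */
    #pragma omp parallel for private(j,k) shared(a,b,c)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[INDEX(i,j)] += a[INDEX(i,k)] * b[INDEX(k,j)];
    printf("walltime: %.3f s\n", omp_get_wtime() - t0);

    free(a); free(b); free(c);
    return 0;
}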

For the standard implementation, the 32-core walltimes were:

Compiler, options            Walltime
gcc-4.4 -O3                  37.0s
icc-11 -O3                   5.17s
icc-11 -fast -xSSE4.2 -O3    1.07s

The gcc times could be greatly improved by changing the loop order to i,k,j. That way the compiler was able to vectorize the innermost loop and take advantage of SSE instructions. For the Intel compiler, simple optimizations such as loop interchange or transposing the B matrix did not affect the walltime at all.
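
The reordered version is sketched below against the same kickstart names; with k in the middle position, the innermost loop walks contiguous rows of c and b with unit stride, which is what lets the compiler emit SSE vector code (the hoisted aik temporary is our illustration of the reuse of a's element, not part of the original code).

// matrix multiplication, i,k,j loop order (vectorizable innermost loop)
#pragma omp parallel for private(j,k) shared(a,b,c)
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        float aik = a[INDEX(i,k)];             /* reused along the whole row */
        for (j = 0; j < N; j++)
            c[INDEX(i,j)] += aik * b[INDEX(k,j)];
    }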

Results
Two students obtained outstanding results: Carlos Bederián and Miguel Montes.

Miguel's approach was simple but effective: he took the base version and replaced the malloc'ed matrices with globally defined ones, avoiding all parameter passing and letting the compiler optimize the a[i][j] references. This way he obtained 0.41s under ICC with the flags "-fast -xSSE4.2" using the 32 cores. Miguel also tried the Strassen algorithm, obtaining 0.45s.
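
A sketch of that idea is given below; it is not Miguel's actual submission, just an illustration of replacing malloc'ed buffers with file-scope arrays so that the compiler sees the exact two-dimensional layout and knows the three matrices cannot alias.

#define N 4096

/* Global, statically sized matrices: no malloc, no pointer parameters,
 * so the a[i][j] references are fully visible to the compiler. */
static float a[N][N], b[N][N], c[N][N];

void matmul(void)
{
    int i, j, k;
    #pragma omp parallel for private(j,k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}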

Carlos explored the program space comprehensively, from reproducing Miguel's result to testing a mix of Strassen and direct multiplication. His best implementation reached 0.387s, using SSE intrinsics, implementing ideas from the GotoBLAS2 library for a block that fits into the L1 cache, an outer loop that distributes the B matrix among the processors to improve L3 cache efficiency, and KMP_AFFINITY=compact (a sketch of the blocking idea follows the table). The scaling table for the best code was:

Cores      1        2       4       8       16      32
Walltime   10.603s  5.404s  2.843s  1.598s  0.731s  0.387s
Rel. Eff.  100%     98%     95%     88%     109%    94%
Abs. Eff.  100%     98%     94%     82%     90%     85%

We remarked on the high efficiency of the solution (85% of linear speedup on 32 cores) and on the surprising super-linear step from 8 to 16 cores.
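
To make the structure behind these numbers concrete, a plain-C sketch of the general cache-blocking idea is given below. It is not Carlos's code (his inner block used SSE intrinsics and GotoBLAS2-style ideas); the 32-element tile size and the choice of parallelizing the column-block loop are illustrative assumptions.

#define N  4096
#define BS 32   /* three 32x32 float tiles take 12 KB, well inside the
                   32 KB per-core L1 data cache */

static float a[N][N], b[N][N], c[N][N];

/* Cache-blocked multiplication: each (jj, kk, ii) step multiplies a BSxBS
 * tile of A by a BSxBS tile of B into a BSxBS tile of C, so the working set
 * of the three inner loops stays L1-resident.  Parallelizing the outer jj
 * loop gives every thread its own slice of columns of B and C, which is one
 * way of spreading B over the L3 caches of the four sockets. */
void matmul_blocked(void)
{
    int ii, jj, kk, i, j, k;
    #pragma omp parallel for private(ii, kk, i, j, k) schedule(static)
    for (jj = 0; jj < N; jj += BS)
        for (kk = 0; kk < N; kk += BS)
            for (ii = 0; ii < N; ii += BS)
                for (i = ii; i < ii + BS; i++)
                    for (k = kk; k < kk + BS; k++) {
                        float aik = a[i][k];
                        for (j = jj; j < jj + BS; j++)
                            c[i][j] += aik * b[k][j];
                    }
}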

He also implemented Strassen at the outer level, with handmade SSE inner blocks. The summary table is:

Cores      1       2       4       8       16      32
Walltime   7.095s  4.198s  2.281s  1.290s  0.722s  0.426s
Rel. Eff.  100%    84%     92%     88%     89%     84%
Abs. Eff.  100%    84%     77%     68%     61%     52%

Although this version starts out faster than the previous one, its scalability is worse, reaching an absolute efficiency of only 52% on 32 cores.
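
For reference, Strassen's scheme splits each operand into 2x2 blocks and computes the product with 7 block multiplications instead of 8, at the price of extra block additions. That is what buys the better single-core time (the asymptotic cost drops from O(n^3) to roughly O(n^2.81)), while the extra, memory-bound additions may be part of why it scales worse:

\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_2 &= (A_{21}+A_{22})\,B_{11}\\
M_3 &= A_{11}(B_{12}-B_{22})          & M_4 &= A_{22}(B_{21}-B_{11})\\
M_5 &= (A_{11}+A_{12})\,B_{22}        & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}) & &\\
C_{11} &= M_1+M_4-M_5+M_7             & C_{12} &= M_3+M_5\\
C_{21} &= M_2+M_4                     & C_{22} &= M_1-M_2+M_3+M_6
\end{aligned}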

Two students with a physics background tested trivial Fortran 90 implementations, although these results were not submitted to the teacher.

One of them used the Intel® Fortran Compiler with the flags -xT -O3 -no-prec-div -opt-malloc-options=2 -openmp -Zp4 -align, and set the environment variables KMP_LIBRARY='throughput' and KMP_AFFINITY=granularity=fine,compact,1,0; he obtained a respectable 0.484s on 32 cores.

Conclusion
The contest was very interesting since the hardware was new; in fact, it had not been commercially released at the start of our research.

There were two major drawbacks. First, the students' workload prevented them from putting more effort into this challenge; second, matrix multiplication is one of the flagship applications that compiler optimizations are targeted at. The latter was evidenced by the meager speedup of the handmade SSE-intrinsics code (0.387s) over a trivial code appropriately tuned with compiler options and OpenMP environment variables (0.41s).

Nevertheless, the hand-tuning approach is not pointless: the fastest implementation tested was GotoBLAS2, with a walltime of 0.283s, and that library does use assembly code targeted at each specific architecture. Additionally, Carlos feels he could have obtained a greater speedup had he been given more time; his breakthrough in performance occurred only a few days before the deadline.
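
For comparison, the GotoBLAS2 measurement boils down to a single SGEMM call. A minimal sketch through the CBLAS interface shipped with GotoBLAS2 is shown below; the wrapper name is ours, and the number of library threads is set outside the code (typically through an environment variable such as GOTO_NUM_THREADS).

#include <cblas.h>

/* C = 1.0*A*B + 0.0*C for n x n row-major single-precision matrices;
 * the library blocks and parallelizes the computation internally. */
void multiply(int n, const float *a, const float *b, float *c)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, a, n,
                      b, n,
                0.0f, c, n);
}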

One of the initial motivations was to explore program patterns that take advantage of the Intel® QuickPath Interconnect and the three cache levels. Unfortunately, the Intel compiler is already so effective in this respect that there is little room left for manual tuning by non-experts in the microprocessor's instruction set and architecture.

All in all, the students felt this was a good opportunity. For the course it represented an important leap in quality and the chance to test HPC code on very recent hardware.

Bibliography
What Every Programmer Should Know About Memory, Ulrich Drepper, Red Hat, Inc., 2007.

A Case Study on High Performance Matrix Multiplication, André Moré, 2008.

GotoBLAS2 library, Texas Advanced Computing Center.

Anatomy of High-Performance Matrix Multiplication, Kazushige Goto, Robert A. van de Geijn, University of Texas at Austin, ACM TOMS, 2008.
