# Classroom challenge: Matrix Multiplication, Performance and Scalability in OpenMP

The class was given a simple, widely known and well-studied problem: matrix multiplication. We ran an internal contest to obtain the fastest serial code, in which the students learned a lot about compiler optimizations and, even more, about the effect of caches on code performance. The objective of the contest was to extrapolate this exercise to a massively multicore architecture. Students were given kickstart code, a naive C implementation of the problem using OpenMP, and a set of rules.
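The kickstart code itself is not reproduced here, so the following is only a minimal sketch of what a naive C/OpenMP matrix multiplication typically looks like. The function name `matmul_naive`, the row-major layout and the use of `double` are assumptions made for illustration, not details taken from the contest.

```c
#include <assert.h>
#include <stddef.h>

/* Naive C = A * B for square n x n matrices stored in row-major order.
 * Hypothetical helper, not the contest's kickstart code.
 * The pragma only takes effect when OpenMP is enabled (e.g. gcc -fopenmp);
 * otherwise it is ignored and the loop runs serially. */
void matmul_naive(int n, const double *a, const double *b, double *c)
{
    assert(n > 0 && a != NULL && b != NULL && c != NULL);

    /* Each thread gets a block of rows of C; rows are independent,
     * so no synchronization is needed inside the loop. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                /* b is walked down a column here, which is the
                 * cache-unfriendly access pattern of the naive version. */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The strided access to `b` in the inner loop is one likely reason cache effects matter so much for a version like this; interchanging the `j` and `k` loops or tiling are the usual first things to try.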

# Parallel algorithm for finding intersections of line segments in 3-D (Dmitry Vyukov)

The included source code implements a parallel search for intersections of input line segments within a 3-D space, as described in the included problem description text file. Three different methods of solution are initially considered. Complexity analysis and potential parallelization of the first two (brute force search, sweep-line algorithm) are considered and used to eliminate each from further consideration. The third method, Tree Decomposition, is chosen and explained in detail.

# SIMD tuning with ASM pt. 4 - Vectorization & ICC

If you remember, I presented a program in my first post. Stripping out the setup, the part we care about - the part that does the work - is this:

```
float x[PTS];
float y[PTS];

for (int i = 0; i < PTS; i++) { // line 13 in orig source
    x[i] += y[i];               // line 14 in orig source
}
```
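For reference, here is a rough sketch of what a vectorized form of that loop amounts to when written by hand with SSE intrinsics: four floats are added per iteration with a packed add, and a scalar loop handles the leftovers. The function name `add_arrays` and the explicit length parameter are assumptions for illustration, and this is not the code any particular compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Computes x[i] += y[i] for n floats. Hypothetical helper for illustration. */
void add_arrays(float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;

#if defined(__SSE__)
    /* Packed path: process 4 floats at a time (unaligned loads/stores,
     * so no alignment requirement is placed on x or y). */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        _mm_storeu_ps(&x[i], _mm_add_ps(vx, vy));
    }
#endif

    /* Scalar tail (and the whole loop on targets without SSE). */
    for (; i < n; i++) {
        x[i] += y[i];
    }
}
```

With the arrays from the snippet above, the call would be `add_arrays(x, y, PTS);`.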

# SIMD tuning with ASM pt. 3 - PS good, SS bad

If you recall where we left off in yesterday's post, we compiled a test program with gcc and saw this code for the 'working' part of the loop. (Yes, I will get to the Intel C++ compiler in the next post, but I'll stick with what I've got so far so we can take baby steps.)

```
.LBB52:

.loc 1 14 0

movss   (%rbp,%rax,4), %xmm0