# "Vanilla is the Absence of Chocolate" and Other Lies My Father Told Me

If you've ever taken a course on algorithms, you will recall that its goal is to explore different techniques for performing certain computations and to evaluate an algorithm's efficiency in a logical, mathematically rigorous way. I took such a course during my CS studies, and I've taught the algorithms courses at the University of Southern Mississippi (both the serial [undergraduate] and the parallel [graduate] versions).

In these courses, we indoctrinate our students with the idea that the best algorithm is the one with the lowest asymptotic upper bound (the big-O notation) on the growth of the number of algorithmic steps executed. Certainly, on average, Quicksort, O(n log n), will be faster than BubbleSort, O(n^2); binary search, O(log n), is better than linear search, O(n); and a hash table lookup, O(1), is better than either.

Now, however, with parallel programming, we may need to have our students rethink what the best (serial) algorithm is to be used for parallel execution. It may not be the one with the best asymptotic behavior, but the one that is easiest to code in parallel. For example, consider matrix multiplication. One of the best algorithms devised to date is Strassen's Algorithm. This algorithm quarters the matrices, recursively computes 7 different multiplications of these pieces, and combines the results into the final answer. The algorithm uses O(n^2.81) operations, where the traditional triple-nested loop algorithm is O(n^3). In parallel, Strassen's algorithm would raise some tricky synchronization and concurrency management issues. To parallelize the traditional algorithm, I only need to add a single OpenMP pragma.

Better serial algorithms may make use of some shared state to cut the number of steps needed to perform a computation. As the number of cores increases, the cost of computation goes down while the cost of synchronization goes up. Thus, relying on the shared information that makes a serial algorithm better can become a major cause of lower parallel performance, and running a less-than-optimal algorithm on a myriad of cores with minimal synchronization may be the better choice. Another example is to perform redundant work on each core in order to compute shared values locally, rather than stalling execution while one core does the computation before sharing the results with all the other cores.

I'm not predicting the end of Algorithms courses as we know them today. They are still a great place to teach students how to estimate the execution time of an algorithm they've never seen before, and a place where new algorithmic ideas can be presented to equip future programmers for when they need to write code. No, I'm just trying to give a gentle "heads up" for the time when a student gives you a blank stare as you proceed to use a sub-optimal algorithm in some parallel coding example and doesn't immediately see why. She may suspect that the whole Algorithms course is a sham and just an excuse to get more tuition from her. When that happens, you might suggest she go think it over while sipping a chocolate shake.

It's what my father would do. And then he'd tell me that vines would grow from my stomach and out my ears if I swallowed a watermelon seed.