This morning I took a rare break and attended a tutorial at Supercomputing. I'm glad I did.
The tutorial looked at the pros and cons of mixing MPI and OpenMP* in a single program, and was taught by Rolf Rabenseifner (University of Stuttgart), Georg Hager (University of Erlangen) and Gabriele Jost (Texas Advanced Computing Center/Naval Postgraduate School).
Not too long ago, early experiments with this approach didn't show it was worth the trouble. That has all changed: nodes continue to offer more and more SMP parallelism, through more sockets, more cores per socket, and even more hardware threads per core.
What really struck me was how much of the tutorial was about "little things" that are simply a fact of life now. Will they go away in the future? I think so - but it really tells us where we are today.
This fascinates me. Here is my list of the "little things" the tutorial covered, all of which prove to be big when thinking about mixing OpenMP and MPI:
- is your MPI implementation thread-safe? Using threads (OpenMP) inside an instance of MPI (on a node) might require a newer version, special link options, or special locking. There is a bewildering array of options in MPI and OpenMP to deal with this, many of them described only in the "system documentation" for your specific system.
- Amdahl's Law - when MPI sends/receives are done once per node, all the other threads are probably idle - hence the communication aspects of your program are a serial bottleneck. The tutorial covered some examples of how to reduce this, including having more MPI connections from each node, dividing up the communication across more threads, and doing work in parallel with communication. To quote the instructors: "very hard to do."
- OpenMP is known to the compiler - and this can reduce optimizations when you use it. Okay, this isn't what you probably expected! According to the talk, the IBM Power6 compiler does poorly unless you kick up all the way to optimization level 4 (-O4).
- not all programs need load-balancing - those that benefit from load-balancing often reward the programming effort with much better results; other programs may simply never benefit enough to make any effort worthwhile.
- trying to make MPI do it all automatically falls short - you can skip OpenMP, and hope that MPI will reduce unnecessary overhead. This works some of the time - but the result usually falls short of the optimum, sometimes spectacularly, because of so-called "mis-matches." Perfectly load-balanced applications lend themselves well to just using MPI everywhere and ignoring OpenMP.
- ccNUMA can be confusing - where memory is allocated (where it is truly local to) can have a profound impact on performance. The most common method is for memory to be allocated closest to whoever uses it first ("first touch") - which can be quite surprising when init functions and actual usage are separate. This was one example of a "surprise" leading to non-optimal allocation. As is to be expected, this leads to the request to have "specific APIs to control memory placement." Of course, this may help experts - but it simply punts the problem to more and more complex code.
- oversubscription - sometimes a program might indirectly create more threads than it should, but a more troubling reality was that the system might oversubscribe some cores while leaving others idle. That also leads to the request to have "specific APIs to control thread placement." Again, a short-term and urgently needed solution so a programmer can get where they need to go now.
- memory bandwidth available vs. threads to pound on memory - for some nodes, the number of threads needed to completely saturate memory may be less than the number of cores (or hardware threads) available. This leads to a desire to reduce the number of threads per node ("specific APIs to control threads created"). This seems easiest to understand and implement in an MPI/OpenMP hybrid program - done simply by reducing the number of threads in the OpenMP thread pool on a node.
- copying data, resulting in a much larger memory footprint (overall), may be richly rewarded or punished - a local copy helps you get closer to the ideal of "share nothing" but comes at the cost of copying and of consuming more memory - either of which may be worth it, or may be terrible. Picking the right balance is part of the "art" of computer science. This challenge tends to favor OpenMP + MPI, because duplication is per MPI instance, while OpenMP tends to share a single copy, and programmers often know how to deal with this in a reasonable way. This was an interesting (to me at least) side-effect of programming in a hybrid model - it implies or encourages a certain copy model.
- together all these add up to fragile performance - many times the results are very good sometimes, and very bad other times - depending on problem size, file buffer sizes, node count - all very frustrating. I think we are entering an era of more and more fragile performance until we figure this out. It reminds me of tuning code to caches, before cache-agnostic algorithms. Unfortunately, that was one problem - and solved with one technique. Based on the tutorial - we seem to be facing a flood of problems - and will need solutions for each one. Perhaps it is darkest before the dawn. It felt a bit dark today.
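
To put rough numbers on the Amdahl's Law item in the list above, here is a minimal sketch in C. It assumes a deliberately simplified model: a fixed fraction of each node's runtime is communication handled by a single thread while the other threads wait, and the rest parallelizes perfectly. The function name `amdahl_speedup` and the 10% figure used below are my own illustrative choices, not numbers from the tutorial.

```c
/* Amdahl's Law under a simplified model (an assumption, not a measurement):
 * `serial` is the fraction of runtime that only one thread can execute,
 * for example MPI communication funneled through one thread per node.
 * `n` is the number of threads on the node. */
double amdahl_speedup(double serial, int n)
{
    /* The serial part takes the same time regardless of n;
     * the parallel part shrinks by a factor of n. */
    return 1.0 / (serial + (1.0 - serial) / (double)n);
}
```

Under this model, if 10% of the time is single-threaded communication, `amdahl_speedup(0.10, 8)` is about 4.7 and `amdahl_speedup(0.10, 64)` is only about 8.8. That gap is what makes spreading communication over more threads, or overlapping it with computation, attractive despite being hard.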
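
The "first touch" surprise from the ccNUMA item can be sketched as follows. This assumes an operating system with a first-touch placement policy, threads that stay pinned to their cores, and the same thread count in both loops; none of that is guaranteed, so treat it as an illustration rather than a recipe. The helper names `alloc_first_touch` and `add_constant` are hypothetical. The idea is to initialize the array in parallel with the same static schedule the compute loop uses, so each thread first touches (and therefore tends to place locally) the pages it will work on later.

```c
#include <stdlib.h>

/* Allocate n doubles and initialize them in parallel.
 * malloc typically only reserves address space; on a first-touch system the
 * physical pages are placed when they are first written, i.e. in this loop. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* Same static schedule as the compute loop below (assumption: same
     * thread count and pinned threads, so thread i touches "its" chunk). */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;

    return a;
}

/* Compute loop: each thread updates the chunk it initialized above. */
void add_constant(double *a, size_t n, double c)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] += c;
}
```

If the initialization loop were serial instead (a separate init function run by one thread, as in the surprise described above), every page would tend to land near that one thread and the other sockets would reach across the interconnect for their data. Built without OpenMP support, the pragmas are ignored and the code simply runs serially.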
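
For the memory bandwidth item, reducing the number of threads in the OpenMP pool can be as small a change as the sketch below. It uses the standard OpenMP `num_threads` clause on a bandwidth-bound loop; the function name `sum_limited` is hypothetical, and the right thread count for a given node is something you would have to measure, not something this code can tell you.

```c
/* Sum an array using at most `threads` OpenMP threads (threads must be > 0).
 * A plain sum like this is limited by memory bandwidth rather than compute,
 * so on some nodes fewer threads than cores may already saturate memory. */
double sum_limited(const double *a, long n, int threads)
{
    double s = 0.0;

    /* num_threads caps the team size for this parallel region only;
     * reduction(+:s) gives each thread a private partial sum. */
    #pragma omp parallel for num_threads(threads) reduction(+:s)
    for (long i = 0; i < n; i++)
        s += a[i];

    return s;
}
```

The same effect is often achieved without touching the code by setting the `OMP_NUM_THREADS` environment variable per MPI process, which is one reason this particular knob is easier to live with in a hybrid program than the placement issues above.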