Mixing MPI and OpenMP*, hugging hardware and dealing with it

This morning I took a rare break and attended a tutorial at Supercomputing. I'm glad I did.

The tutorial looked at the pros and cons of mixing MPI and OpenMP* in a single program, and was taught by Rolf Rabenseifner (University of Stuttgart), Georg Hager (University of Erlangen) and Gabriele Jost (Texas Advanced Computing Center/Naval Postgraduate School).

Not too long ago, early experiments with this didn't show it was worth the trouble. But that has all changed, because nodes continue to offer more and more SMP parallelism: more sockets, more cores per socket, and even more hardware threads per core.

What really struck me was how much of the tutorial was about "little things" that are simply a fact of life now. Will they go away in the future? I think so - but it really tells us where we are today.

This fascinates me... based on the tutorial, here is my list of the "little things" that prove to be big when thinking about mixing OpenMP and MPI:
 

  • is your MPI implementation thread-safe? Using threads (OpenMP) inside an instance of MPI (on a node) might require a newer version, special link options, and special locking. There is a bewildering array of options in MPI and OpenMP to deal with this, many buried in the specific "system documentation" for your system (a minimal sketch of asking the MPI library for a thread level appears after this list).
  • Amdahl's Law - when MPI sends/receives are done once per node, all the other threads are probably idle - hence the communication aspects of your program become a serial bottleneck; the tutorial covered some examples of how to reduce this - including having more MPI connections from each node, dividing up the communication across more threads, and also doing computation in parallel with communication (a sketch of that overlap appears after this list). To quote the instructors: "very hard to do."
  • OpenMP is known to the compiler - and this can reduce optimization when you use it. Okay, this probably isn't what you expected! According to the talk, the IBM Power6 compiler does poorly unless you kick up all the way to optimization level 4 (-O4).
  • not all programs need load-balancing - those that benefit from load-balancing often reward the programming effort with much better results; other programs may simply never benefit enough to make any effort worthwhile.
  • trying to make MPI do it all automatically falls short - you can skip OpenMP and hope that MPI will avoid unnecessary overhead. This works some of the time - but it usually misses the optimum, sometimes spectacularly, because of so-called "mis-matches." Perfectly load-balanced applications lend themselves well to just using MPI everywhere and ignoring OpenMP.
  • ccNUMA can be confusing - where memory is allocated (which node it is truly local to) can have a profound impact on performance. The most common policy is to allocate memory local to whoever uses it first ("first touch") - which can be quite surprising when initialization functions and actual usage are separate. This was one example of a "surprise" leading to non-optimal allocation (the first-touch sketch after this list shows the usual workaround). As is to be expected, this leads to the request for "specific APIs to control memory placement." Of course, this may help experts - but it simply punts the problem to more and more complex code.
  • oversubscription - sometimes a program might indirectly create more threads than it should, but a more troubling reality was that the system might oversubscribe some cores while leaving others idle. That also leads to the request for "specific APIs to control thread placement." Again, a short-term and urgently needed solution so a programmer can get where they need to go now.
  • memory bandwidth available vs. threads pounding on memory - for some nodes, the number of threads needed to completely saturate memory may be smaller than the number of cores (or hardware threads) available. This leads to a desire to reduce the number of threads per node ("specific APIs to control threads created"). This seems easiest to understand and implement in an MPI/OpenMP hybrid program - done simply by reducing the number of threads in the OpenMP thread pool on a node (see the last sketch after this list).
  • copying data, resulting in a much larger memory footprint overall, may be richly rewarded or punished - a local copy gets you closer to the ideal of "share nothing," but comes at the cost of copying and consuming more memory - either of which may be worth it, or may be terrible. Picking the right balance is part of the "art" of computer science. This challenge tends to favor OpenMP + MPI, because duplication is per MPI instance, OpenMP tends to share a single copy, and programmers often know how to deal with this in a reasonable way. This was an interesting (to me at least) side-effect of programming in a hybrid model - it implies or encourages a certain copy model.
  • together all these add up to fragile performance - many programs end up with performance that is very good sometimes and very bad other times - depending on problem size, file buffer sizes, node count - all very frustrating. I think we are entering an era of more and more fragile performance until we figure this out. It reminds me of tuning code to caches, before cache-agnostic algorithms. Unfortunately, that was one problem - and solved with one technique. Based on the tutorial - we seem to be facing a flood of problems - and will need solutions for each one. Perhaps it is darkest before the dawn. It felt a bit dark today.
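On the thread-safety point, the usual starting move is to ask the MPI library which threading level it actually provides and check the answer, rather than assuming. A minimal sketch, requesting FUNNELED (the common choice when only the master thread makes MPI calls; adjust the requested level to what your code really needs):

    // Query the MPI threading level instead of assuming it.
    // MPI_THREAD_FUNNELED: only the thread that initialized MPI will call MPI.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        if (provided < MPI_THREAD_FUNNELED) {
            std::fprintf(stderr, "MPI library does not provide the requested thread support\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        // ... hybrid MPI + OpenMP work would go here ...

        MPI_Finalize();
        return 0;
    }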

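On the Amdahl's Law point, the "compute while you communicate" idea looks roughly like this: the master thread posts nonblocking sends and receives while the other OpenMP threads work on data that does not depend on the incoming messages. This is a sketch only - the buffer sizes, ring of neighbors, and interior loop are placeholders, and real overlap also depends on the MPI library making progress in the background, which is part of why the instructors called it "very hard to do":

    // Sketch: overlap a halo exchange with interior computation.
    // Assumes the MPI library provides at least MPI_THREAD_FUNNELED.
    #include <mpi.h>
    #include <omp.h>
    #include <vector>

    int main(int argc, char **argv)
    {
        int provided = 0, rank = 0, size = 1;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1 << 20, HALO = 1024;   // illustrative sizes
        std::vector<double> interior(N, 1.0), halo_send(HALO, 1.0), halo_recv(HALO);
        int left = (rank - 1 + size) % size, right = (rank + 1) % size;
        MPI_Request reqs[2];

        #pragma omp parallel
        {
            #pragma omp master
            {
                // Only the master thread touches MPI (that is what FUNNELED permits).
                MPI_Irecv(halo_recv.data(), HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
                MPI_Isend(halo_send.data(), HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
            }
            // No barrier after 'master': the other threads start interior work at once.
            #pragma omp for
            for (int i = 0; i < N; ++i)
                interior[i] *= 2.0;           // placeholder interior computation

            #pragma omp master
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            #pragma omp barrier
            // From here, every thread may safely use halo_recv for boundary work.
        }

        MPI_Finalize();
        return 0;
    }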

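On the ccNUMA / first-touch point, the usual workaround is to initialize arrays with the same parallel loop and schedule that the compute phase will use, so each page is first touched by the thread that will later work on it. A minimal sketch (the array size and arithmetic are just for illustration):

    // First-touch placement: initialize data with the same OpenMP loop and schedule
    // that will later compute on it, so pages land near the threads that use them.
    #include <omp.h>
    #include <cstdio>

    int main()
    {
        const long N = 50L * 1000 * 1000;  // illustrative size
        double *a = new double[N];         // raw allocation: nothing is touched yet
        double *b = new double[N];

        // Initialization is the "first touch" - do it in parallel, static schedule.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; ++i) {
            a[i] = 0.0;
            b[i] = 1.5 * i;
        }

        // The compute loop uses the same schedule, so threads mostly hit local memory.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; ++i)
            a[i] = 2.0 * b[i];

        std::printf("a[N-1] = %g\n", a[N - 1]);
        delete[] a;
        delete[] b;
        return 0;
    }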
 
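And on thread counts and placement: in a hybrid code the thread pool per MPI rank is easy to cap, either with OMP_NUM_THREADS in the job script or from the code itself; pinning threads to cores is runtime-specific (newer OpenMP runtimes offer OMP_PROC_BIND and OMP_PLACES, and vendors have their own affinity controls). A small sketch of capping the per-rank pool - the number 4 is purely illustrative, the right value comes from measuring your node:

    // Cap the OpenMP thread pool per MPI rank (useful when memory bandwidth
    // saturates before all cores are busy, or to avoid oversubscription).
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        int provided = 0, rank = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int threads_per_rank = 4;         // illustrative, not a recommendation
        omp_set_num_threads(threads_per_rank);  // or export OMP_NUM_THREADS=4 in the job script

        #pragma omp parallel
        {
            #pragma omp single
            std::printf("rank %d is using %d OpenMP threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }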

I enjoyed the tutorial - but like many things at the Supercomputing conference each year, I feel like I learn more about today by understanding what we cannot do easily. Mixing OpenMP/MPI is not bad, but it did highlight a lot of opportunities for improvement in the future.

 

The instructors summed it up: "you can do it now, it is not as easy as a pure MPI implementation" and "the efficiency of a hybrid solution is not free - it strongly depends on the amount of work in the source code development."

 

Full employment for experts, in other words.

 

I don't know about you, but I can live with that for now.

2 comments

jlperla:

Jim and James,

Thanks for the comments and ideas. I am a little scared to hear that mixing the two is difficult. I am especially scared by the first point - that I may not be able to launch threads or use TBB or boost::threads from within an MPI process. Is this an issue if I am using the latest Intel compilers, etc.?

A lot of the scientific computing I have is either embarrassingly parallel, or has long execution times per task compared to the amount of communication. I find a task-based, C++0x functional programming style (e.g. TBB or std::async) far more intuitive than anything in MPI.

Are there any good libraries or coding approaches that enable me to program in that style and still use MPI to launch nodes/etc. and behave as master-slave from my main program? I have boost::mpi up and running but it still involves unnatural programming even if it helps with marshaling objects. Will things like TBB ever work over MPI directly?

What about having the nodes launched through MPI, but having the slave processes listen in as servers for some sort of RPC mechanism? Is this defensible? What libraries should I be using for async RPCs with modern C++?

Thanks for your help,

-Jesse

James,

I think there may be more benefit in mixing MPI with task-based tools such as Threading Building Blocks (or QuickThread). In this model you would slightly bend your programming style on the task-based side. The MPI side would perform a little more setup work to discover the topology. The MPI control program would launch its companion programs on the other systems. These companion programs start their tasking system (TBB or QuickThread), and the tasking system reports back to the MPI control program the processing resources available to it. Once the statistics are returned, the MPI control program can disburse work to each MPI node in batches (batch size equal to the task pool size, or some multiple thereof).

Effectively you would be using the tasking system to virtualize MPI nodes. In doing so, and for applications with I/O or messaging overhead, you can virtualize more nodes than you have hardware threads.

You would have to pick the right applications to use this technique, but I think with prudent choices of division of work, you can reduce the messaging latencies and the quantity of passed data (one copy of static data and/or initial data passed to n tasks; n copies of modified data can be consolidated on the tasking system and returned as one larger data set).
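A rough sketch of that shape: each rank reports its local tasking width to the control rank, which could then size batches accordingly; all names, sizes, and the trivial batching below are illustrative, not a specific TBB or QuickThread API.

    // Sketch: MPI control rank plus on-node tasking (TBB) for the actual work.
    #include <mpi.h>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <thread>
    #include <vector>

    int main(int argc, char **argv)
    {
        int provided = 0, rank = 0, nranks = 1;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // Each companion program reports its local tasking width to the control rank.
        int width = static_cast<int>(std::thread::hardware_concurrency());
        std::vector<int> widths(nranks);
        MPI_Gather(&width, 1, MPI_INT, widths.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

        // The control rank could size each rank's batch in proportion to widths[r];
        // here every rank simply gets the same illustrative batch.
        const int BATCH = 1 << 16;
        std::vector<double> batch(BATCH, 1.0);
        MPI_Bcast(batch.data(), BATCH, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // On-node work goes to the tasking system rather than to more MPI ranks.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, batch.size()),
            [&](const tbb::blocked_range<size_t> &r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    batch[i] *= 2.0;   // placeholder computation
            });

        // Results are consolidated per node and returned as one larger message.
        std::vector<double> all;
        if (rank == 0) all.resize(static_cast<size_t>(BATCH) * nranks);
        MPI_Gather(batch.data(), BATCH, MPI_DOUBLE,
                   rank == 0 ? all.data() : nullptr, BATCH, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }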

Jim Dempsey

