Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing

There are two principal methods of parallel computing: distributed memory computing and shared memory computing. As more processor cores are dedicated to large clusters solving scientific and engineering problems, hybrid programming techniques combining the best of distributed and shared memory programs are becoming more popular. This trend has been accelerated with the Intel® Xeon Phi™ processor line; placing 244 virtual cores has encouraged many developers to transition to hybrid programming techniques.

This article begins by reviewing shared memory programming techniques, and then distributed memory MPI programming. Finally, it discusses hybrid shared memory/distributed memory programming and includes an example.

Shared Memory Computing

Large symmetric multi-processor systems offered more compute resources to solve large computationally intense problems. Scientists and engineers threaded their software to solve problems faster than they could on single processors systems. Multi-core processors made the advantages of threading ubiquitous. By applying two processors or two cores, a problem theoretically could be solved in half the time; with eight processors or cores, 1/8 the time would be the maximum attainable. The opportunities merited changes to software to take advantage of the compute resources in multi-processor and multi-core systems. Threading is the most popular shared memory programming technique. In the threading model, all the resources belong to the same process. Each thread has its own address pointer and stack, yet they share a common address space and system resources. The common shared memory access makes it easy for a developer to divide up work, tasks, and data. The disadvantage is that because all resources are available to all threads, this allows for data races.

A data race occurs when two or more threads access the same memory address and at least one of the threads alters the value in memory. The results of the computation can be altered depending on whether the writing thread completes its write before or after the reading thread reads the value. Mutexes, barriers, and locks were designed to control execution flow, protect memory, and prevent data races. This creates other problems as a deadlock can happen preventing any forward progression in the code, or contention for mutexes or locks restricts execution flow, which becomes a bottleneck. Mutexes and locks are not a simple cure-all. If not used correctly, there can still be data races. Placing locks around a code segment rather than around specific memory references is the most common error. In addition, tracking design of the thread flow through all the mutexes, locks, and barriers becomes complicated and difficult for developers to maintain and understand, especially with multiple shared objects or dynamically linked libraries. Threading abstractions were designed to ease the programming and control.

Threading Abstractions

The most popular higher-level threading abstraction in the engineering and science communities is OpenMP*. The original OpenMP design was built on a fork-join parallel structure where work was forked off over the thread pool in a parallel region and joined back together in a sequential region and possibly repeating the fork to a parallel region and joining again one or more times. This provided a common thread pool and avoided the overhead cost of creating and destroying new threads for every task.

In a thread pool, the threads are created and remain until the program ends. Do loops or for loops are the most common usage model in OpenMP. When a for or DO loop is marked as a parallel region, the OpenMP runtime library will automatically decompose the loop into tasks for each OpenMP thread to execute. An example is shown in Table 1. Threading abstractions such as OpenMP made threaded programming easier to track, understand, and maintain.

// A pragma defines beginning of a parallel region and specifies iterations of the following for loop should be spread across openmp threads and is the extent of the parallel region

#pragma omp parallel for
for (i=0; i < n; i++)
{
  . . . ;
  computations to be completed ;
  . . . ;
}

C$ directive defines beginning and end of a
C$ parallel region. Iterations (I) are
C$ spread across openmp threads

!$ OMP PARALLEL DO
DO I=1,N
  . . . 
  computation work is completed
  . . .
ENDDO
!$ OMP END PARALLEL DO

Table 1: Example of OpenMP* parallel region in C and Fortran*

Some of the newer OpenMP constructs include parallel tasks as well as the capability to assign work to a coprocessor or accelerator. Newer threading abstractions introduced include Intel® Threading Building Blocks (Intel® TBB). While OpenMP works with both C/C++ and Fortran*, Intel TBB is based on C++ templates, a generic programming model, so it is limited to C++. Because Intel TBB contains a rich feature set, while still providing a clear abstraction that is easy to follow, it has become quite popular with C++ programmers.

Another threading abstraction, Intel® Cilk™ Plus, relies on tighter compiler integration and as such offers compelling performance for some cases. Intel Cilk Plus relies on a parallel fork-join construct. Nested parallelism is natural for Intel Cilk Plus and Intel TBB, compared to OpenMP where nested parallelism must be identified and declared by the developer. This makes Intel TBB and Intel Cilk Plus ideal to include in libraries. Intel now offers both OpenMP (traditional) and Intel TBB versions of its Intel® Math Kernel Library (Intel® MKL) (Intel TBB was introduced as an option in Intel MKL 11.3).

Popular Threading AbstractionsGeneral Properties
OpenMP*

Structured fork-join parallel model, supports parallel for, tasks, sections
Supports offload
Nested parallelism to be explicitly identified
C, C++, and Fortran*

Intel® Threading Building Blocks

Supports parallel for, pipelines, general graphs and dependencies, tasks, optimized reader/writer locks and more
Nested/recursive parallelism natural
Template-based; C++ only

Intel® Cilk™ Plus

Structured fork-join parallel model
Supports parallel for and fork commands
Nested/recursive parallelism natural
C/C++

Table 2: Popular threading models.

Distributed Memory Programming

Many applications sought greater computational power than what was available in a single multi-processor system. This led to connecting several systems together to form clusters of computers to work together to solve a single computational workload. These systems frequently linked the systems together with a proprietary “fabric,” with each platform in the cluster having its own private memory area within the cluster. The program had to explicitly define the data to be shared with another platform and deliver it to the other platform in the cluster.

With the distributed computing approach, explicit message passing programs were written. In this approach a program explicitly packaged data and sent it to another system in the cluster. The other system had to explicitly request the data and bring the data into its process. An approach linking workstations called parallel virtual machines was developed, allowing programs to work across a network of workstations. Each vendor with a proprietary fabric supported its own message passing library.

The community quickly consolidated around the Message Passing Interface (MPI). MPI required programmers to explicitly handle the decomposition of the problem across the cluster as well as make sure messages were sent and received in the proper order. This approach increased the size of scientific problems that could be solved as there was no longer a single-system memory limitation, and the computational power of many systems combined allowed problems previously too large, too complex or too computationally demanding to be solved. A major addition to MPI was the support for one-sided data movement or remote direct memory access. This allows data movement so that the sending process and the receiving process do not need to synchronize over their calls to send messages and receive messages. The shared memory regions (called windows) for data access must be explicitly set up and defined.

MPI message passing does not lend itself well to incremental parallel programming. Once you begin distributing memory across remote systems there is no reason to go through the work to bring it back to a central platform for the next phase of the application, so MPI programs typically take more upfront work to get running than some other models like OpenMP. However, there is no indication that a performance-tuned fully parallel program is written faster in OpenMP than MPI. Some prefer to program first in MPI, and then convert the MPI to threads if they want a threaded version. Developing in MPI forces a developer to really think about the parallelism and consider the parallel architecture or parallel design of their application and make sure the parallel code is designed well from the beginning.

An advantage of MPI programming is that an application is no longer limited to the amount of memory on one system or the number of processors and cores on one system. An additional advantage is that it requires developers to create a good decomposition of the data and the program. MPI does not really experience data races as in threads, but a developer could still write an MPI program that deadlocks where all the MPI processes are waiting for an event or message that will not happen due to poorly planned dependencies. In MPI, the developer explicitly writes the send and receive messages. Figure 1 shows how this might be done.


Figure 1: MPI processes for send/receive data

Since MPI processes can execute as multiple processes on the same platform or spread across multiple platforms, it would seem a developer could just program in MPI and it would run whether on a single platform or on multiple platforms. MPI library memory consumption is an important consideration. The MPI runtime library maintains buffers for controlling messages being sent and received. As the number of MPI processes increases, the MPI library must be prepared to send or receive from any other process in the application. Configuring for ever-increasing numbers of processes means the runtime library consumes more memory and needs to track more message destinations and receipt locations.

As cluster sizes increased MPI developers recognized the memory expansion growth of the MPI runtime library. A further complication is that as memory consumed by the MPI library increases, memory consumption is replicated on every system in the cluster, as this is not shared space. The article in reference shows how MPI libraries were improved to minimize unnecessary memory consumption 1.

The Intel® Xeon Phi™ coprocessor (code-named KNC), with over 60 processors with four-way symmetric multithreading (SMT) is designed to have about 240 active threads or processes. If a developer only uses MPI and runs an MPI application across four cards, this will result in 960 MPI processes. Based on the charts in the referenced paper1, this will take about 50 MB of memory per MPI rank or about 48 GB total (or 12 GB/card). Granted the data used above are considered to be the top end of possible MPI memory consumption. So consider the case where the MPI library only needs to use 34 percent of the possible 50 MB—about 17 MB. Let’s consider that maybe an application only uses three of the possible SMT on KNC—720 MPI processes each requiring 17 MB of memory for a total of 12.2 GB of memory, or 3 GB per card. The top KNC cards only have 16 GB of memory. For problems well below the worst-case memory consumption, the MPI library will consume 3 GB out of 16 GB or over 1/8 of the memory and that doesn’t include the memory consumed by the OS, the binary, and other services. This greatly reduces the amount of memory available for data for the problem being solved. This memory consumption of the MPI libraries is one driving factor for the movement to hybrid programming. Pavan Balaji wrote As the amount of memory per core decreases, applications will be increasingly motivated to use a shared-memory programming model on multicore nodes, while continuing to use MPI for communication among address spaces.”

The community is well aware of this potential memory consumption and the steps required 3,4.

When combining threading and MPI into a single program, developers also need to be aware of thread safety and awareness of the MPI libraries. The MPI standard specifies four models for threaded software:  

  • MPI_THREAD_SINGLE. Code is sequential; only one thread is running (if all MPI calls are in sequential regions, this works with OpenMP).
  • MPI_THREAD_FUNNELED. Only one thread makes any calls into the MPI library. For OpenMP, this means that calls can be made inside a parallel region, but the OpenMP omp master directives/pragma should be used to ensure that the master thread makes all the MPI calls.
  • MPI_THREAD_SERIALIZED. All threads may make MPI library calls, but the developer placed controls so that only one thread is active in an MPI call at any given time.
  • MPI_THREAD_MULTIPLE. Any thread may make any MPI call at any time.

The hybrid code I have reviewed always uses the first model: MPI_THREAD_SINGLE. The MPI invocations occur in sequential regions of the MPI code. It is easier to use the first three models above. The model MPI_THREAD_MULTIPLE requires more consideration as messages are to MPI processes and not to threads. If two threads are sending and receiving messages to the same MPI processes, the code must be designed such that the order of the MPI messages will be correct regardless of which thread makes the send/receive call, in any order from either end of two points (receiver or sender).

There is extra overhead when MPI_THREAD_MULTIPLE is used for message passing. Data measured on Linux* clusters reports these difference may be minimal 5. If the MPI_THREAD_MULTIPLE has good design behind it, it may work well. Note that the report above is benchmark data, not application data.

Hybrid Example Code

The NAS Parallel Benchmarks* provide widely available reference implementations of sample parallel codes 7. The Multi-Zone versions of the NAS Parallel Benchmarks include hybrid MPI/OpenMP reference implementations. The goal of the Multi-Zone port was to better reflect fluid dynamics code in use such as OpenFlow*. The code was modified so that a solution is completed in each zone independently, and then zone boundary values are exchanged across all zones at each time step. The local zone solution boundary exchange is repeated for each time step. For this article, results were collected for the class C problem size on an Intel Xeon Phi coprocessor card with various configurations of MPI processes and OpenMP threads. The class C problem size is small enough to run on one process on a single KNC core and also large enough to run on 240 cores. Running one MPI rank and varying number of threads showed fair scalability. Holding the effort to one thread and increasing the number of MPI processes showed better scalability. This is not a reflection of shared memory versus distributed memory programming; it is an artifact of the parallelism level.

In a report using an SMP/MLP model (yet another parallel programming model using a multi-process shared memory region) comparing MPI with one OpenMP thread to SMP/MLP with one OpenMP thread, the SMP/MLP code gave slightly higher performance than the MPI for SP-MZ class C problems 6. Thus there is no inherent performance superiority of shared memory versus distributed memory programming styles. For the SP-MZ hybrid MPI/OpenMP decomposition running on an Intel Xeon Phi coprocessor, the results showed MPI scaling better than the hybrid until about 100 MPI processes. After this the hybrid MPI/OpenMP decompositions showed better results.

If the graph (see Figure 2) were to begin from 1 MPI/1 OpenMP thread, the time to run sequentially creates such a large range that the differences for the number of threads/processes are indistinguishable. For this reason, the displayed chart begins at 50 total threads. The total number of threads is equal to the number of MPI processes multiplied by the number of OpenMP threads per process.


Figure 2: Performance depending on ratio and number of MPI processes to OpenMP* threads.

The best performance was achieved for seven OpenMP threads with 32 MPI processes (224 threads total). The SP-MZ code allows for lots of independent computation with well-defined message interfaces. Applications with more message passing may have a different optimum ratio (1:2, 1:4, 1:7, 1:8, and so on). Developers should look for thread work load balance as well as MPI work load balance to collect data and measure to determine the best ratio. The important take away from the chart is at some point the code stopped running faster just by increasing MPI processes, but performance continued to improve performance using a hybrid model.

In this NAS SP-MZ code, the parallelism is at two different levels. That is, the OpenMP parallelism is nested below the MPI parallelism. An alternative would be to put the OpenMP parallelism at the same level as the MPI parallelism. It is anticipated that there exists cases for which one design or the other will be superior.  There may also be cases where some threads may be at the same level as the MPI processes for tasks and each of these higher level workers employs multiple worker threads at a low level.

Ideally there would be a set of rules and data measurements to guide the developer through hybrid programming. The development environment is not at that level yet. Developers are recommended to exercise good judgment and good design. Sometimes OpenMP was added to codes for parallelism in an opportunistic fashion rather than following a good design pattern (that is, OpenMP pragmas were placed where DO or for loops were found rather than by considering how to best parallelize the code).  Good parallel design will be required for good scaling. Good parallel design can be expressed in MPI or a threading method. There are many tools to assist the developer 8,9. Intel® Advisor XE provides a means to recognize, design, and model parallelism. Intel® VTune™ Amplifier XE measures the performance and behavior of OpenMP parallel regions. Intel® Trace Analyzer displays MPI performance analysis information to understand behavior of MPI code behavior. TAU Performance System* and ParaProf* also collect performance data and events for OpenMP and MPI and display performance data. All of these tools can help developers understand performance of their code to improve design and performance.

Summary

As developers continue to seek performance they are encouraged to explore hybrid programming techniques, which offer the opportunity to improve resource consumption, especially memory. Multiple levels of parallelism like that shown in SP-MZ are also likely to produce better performance. The ratio of MPI processes and threads may be application-dependent and will require testing and evaluation. The software should be written in a way that allows the number of threads to be controlled without recompiling (threading abstractions do this). OpenMP and Intel TBB are the most commonly used threading abstractions. Developers should adopt hybrid parallel programming models and move forward to continue delivering more performance.

References

1 Durnov, D. and Steyer, M. Intel MPI Memory Consumption.
2 Balaji, P. et al. MPI on a Million Processors.
3 Goodell, D., Gropp, W., Zhao, X., and Thakur, R. Scalable Memory Use in MPI: A Case Study with MPICH2.
4 Thakur, R. MPI at Exascale.
5 Thakur, R. and Gropp, W. Test Suite for Evaluating Performance of MPI Implementations That Support MPI THREAD MULTIPLE.
6 Jin, H. and Van der Wijngart, R. Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks.
NAS Parallel Benchmarks
8 Intel VTune Amplifier XE, Intel Advisor XE, Intel MKL, and Intel Trace Analyzer are all available in the Intel® Parallel Studio XE Cluster edition
9 TAU Performance System and ParaProf are available from tau.uoregon.edu

 

For more complete information about compiler optimizations, see our Optimization Notice.