Hybrid Parallelism: A MiniFE* Case Study

In my first article, Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing, I discussed the chief forms of parallelism: shared memory parallel programming and distributed memory message passing parallel programming. That article explained the basics of threading for shared memory programming and Message Passing Interface* (MPI) message passing. It included an analysis of a hybrid hierarchical version of one of the NAS Parallel Benchmarks*. In that case study the parallelism for threading was done at a lower level than the parallelism for MPI message passing.

This case study examines the situation where the problem decomposition is the same for threading as it is for MPI; that is, the threading parallelism is elevated to the same level as MPI parallelism. The reasons to elevate threading to the same parallel level as MPI message passing are to see whether performance improves because thread libraries carry less overhead than MPI calls, and to check whether memory consumption is reduced by using threads rather than large numbers of MPI ranks.

This paper provides a brief overview of threading and MPI, followed by a discussion of the changes made to miniFE*. Performance results are also shared. The performance gains are minimal, but threading consumed less memory, and data sets 15 percent larger could be solved using the threaded model, indicating that the main benefit is reduced memory usage. This article will be of most interest to those who want to optimize for memory footprint, especially those working on first-generation Intel® Xeon Phi™ coprocessors (code-named Knights Corner), where available memory on the card is limited.

Many examples of hybrid distributed/shared memory parallel programming follow a hierarchical approach: MPI distributed memory programming is done at the top, and shared memory parallel programming is introduced in multiple regions of the software underneath (OpenMP* is a popular choice). Many software developers design good problem decomposition for MPI programming. However, when using OpenMP, some developers fall back to simply placing pragmas around do or for loops without considering overall problem decomposition and data locality. For this reason, some say that if they wanted good parallel threaded code, they would write the MPI code first and then port it to OpenMP, confident that this would force them to implement good problem decomposition with good data locality.

This leads to another question: if good performance can be obtained either way, does it matter whether a threading model or MPI is used? There are two things to consider. One, of course, is performance: is MPI or threading inherently faster than the other? The second consideration is memory consumption. When an MPI job begins, an initial number of MPI processes, or ranks, that will cooperate to complete the overall work is specified. As ever larger problem sizes or data sets are run, the number of systems in a cluster dedicated to a particular job increases, and thus the number of MPI ranks increases. As the number of MPI ranks increases, the MPI runtime libraries consume more memory in order to be ready to handle a larger number of potential messages (see Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing).

This case study compares both performance and memory consumption. The code used in this case study is miniFE. MiniFE was developed at Sandia National Labs and is now distributed as part of the Mantevo project (see mantevo.org).

Shared Memory Threaded Parallel Programming

In the threading model, all the resources belong to the same process. Memory belongs to the process so sharing memory between threads can be easy. Each thread must be given a pointer to the common shared memory location. This means that there must be at least one common pointer or address passed into each thread so each thread can access the shared memory regions. Each thread has its own instruction pointer and stack.

A problem with threaded software is the potential for data race conditions. A data race occurs when two or more threads access the same memory address and at least one of the threads alters the value in memory. Whether the writing thread completes its write before or after the reading thread reads the value can alter the results of the computation. Mutexes, barriers, and locks were designed to control execution flow, protect memory, and prevent data races. These constructs create other problems: deadlock can occur, preventing any forward progress in the code, or contention for mutexes or locks can restrict execution flow and become a bottleneck. Mutexes and locks are not a simple cure-all; if not used correctly, data races can still exist. Placing locks that protect code segments rather than memory references is the most common error.
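The point about protecting memory rather than code can be illustrated with a minimal sketch in standard C++ threads (illustrative only, not code from miniFE): several threads increment one shared counter. Without the mutex, the read-modify-write on the counter is a data race; guarding the shared memory with a lock makes the result deterministic.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Several threads increment one shared counter.  The unsynchronized
// version of this read-modify-write would be a data race; the mutex
// guards the shared *memory* (the counter), not merely a code region.
long run_counter_demo(int nthreads, int increments_per_thread) {
    long counter = 0;
    std::mutex counter_mutex;                 // protects 'counter'
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(counter_mutex);
                ++counter;                    // safe: the lock serializes the update
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;                           // always nthreads * increments_per_thread
}
```

Removing the `lock_guard` would make the result nondeterministic on most hardware, which is exactly the kind of bug that is hard to reproduce and debug.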

Distributed Memory MPI Parallel Programming

MPI offers a range of distributed memory parallel programming methods. The most commonly used elements of MPI are its message passing constructs, and the discussion in this case study uses the traditional explicit message passing interface of MPI. Any data that one MPI rank has that may be needed by another MPI rank must be explicitly sent by the first MPI rank to the ranks that need it. In addition, the receiving MPI rank must explicitly request that the data be received before it can access and use the data sent. The developer must define the buffers used to send and receive data as well as pack or unpack them if necessary; if data is received into its desired location, it doesn't need to be unpacked.

Finite Element Analysis

In finite element analysis a physical domain whose behavior is modeled by partial differential equations is divided into very small regions called elements. A set of basis functions (often polynomials) is defined for each element. The parameters of the basis functions approximate the solution to the partial differential equations within each element. The solution phase is typically a step of minimizing the difference between the true value of the physical property and the value approximated by the basis functions. The operations form a linear system of equations for each finite element known as an element stiffness matrix. Each of these element stiffness matrices is added into a global stiffness matrix, which is solved to determine the values of interest. The solution values represent a physical property: displacement, stress, density, velocity, and so on.
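The assembly step can be sketched with a hypothetical 1-D analogue (two-node elements with a fixed element stiffness matrix, not miniFE's 3-D hex elements): each element's small matrix is accumulated into the global matrix at the rows and columns of that element's nodes.

```cpp
#include <cassert>
#include <vector>

// 1-D sketch of stiffness-matrix assembly: num_elems two-node elements
// in a line give num_elems+1 nodes.  Each element's 2x2 stiffness matrix
// is added into the global matrix at its nodes' rows/columns.
std::vector<std::vector<double>> assemble(int num_elems) {
    int n = num_elems + 1;                          // number of nodes
    std::vector<std::vector<double>> K(n, std::vector<double>(n, 0.0));
    double ke[2][2] = {{1.0, -1.0}, {-1.0, 1.0}};   // element stiffness matrix
    for (int e = 0; e < num_elems; ++e)             // element e spans nodes e, e+1
        for (int a = 0; a < 2; ++a)
            for (int b = 0; b < 2; ++b)
                K[e + a][e + b] += ke[a][b];        // accumulate into global matrix
    return K;
}
```

Interior nodes receive contributions from two elements, which is why the global matrix is sparse with overlap only between neighboring elements; the same accumulation pattern, in three dimensions, produces miniFE's global stiffness matrix.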

 

MiniFE is representative of more general finite element analysis packages. In a general finite element program the elements may be irregular, of varying size, and of different physical properties. Only one element type and one domain are used within miniFE. The domain selected is always a rectangular prism that is divided into an integral number of elements along the three major axes: x, y, and z. The rectangular prism is then sliced parallel to the principal planes recursively to create smaller sets of the finite elements. This is illustrated in Figure 1.


Figure 1: A rectangular prism divided into finite elements and then split into four subdomains.

 

The domain is divided into several regions or subdomains containing a set of finite elements. Figure 1 illustrates the recursive splitting. The dark purple line near the center of the prism on the left shows where the domain may be divided into two subdomains. Further splitting may occur as shown by the two green lines. The figure on the right shows the global domain split into four subdomains. This example shows the splitting occurring perpendicular to the z-axis. However, the splitting may be done parallel to any plane. The splitting is done recursively to obtain the desired number of subdomains. In the original miniFE code, each of these subdomains is assigned to an MPI rank: one subdomain per MPI rank. Each MPI rank determines a local numbering of its subdomain and maps that numbering to the numbering of the global domain. For example, an MPI rank may hold a 10×10×10 subdomain. Locally it would number this from 0–9 in the x, y, and z directions. Globally, however, this may belong to the region numbered 100–109 on the x-axis, 220–229 on the y-axis, and 710–719 along the z-axis. Each MPI rank determines the MPI ranks with which it shares edges or faces, and then initializes the buffers it will use to send and receive data during the conjugate gradient solver phase used to solve the linear system of equations. There are additional MPI communication costs; when a dot product is formed, each rank calculates its local contribution to the dot product, and then the value must be reduced across all MPI ranks.
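The local-to-global numbering in the example above is simple offset arithmetic, sketched here as a hypothetical helper (not miniFE's actual routine): a rank holding a subdomain whose global origin is (100, 220, 710) maps local coordinates to global ones by adding the per-axis offsets.

```cpp
#include <array>
#include <cassert>

// Map a local (i, j, k) coordinate inside a subdomain to its global
// coordinate by adding the subdomain's global origin along each axis.
std::array<int, 3> to_global(std::array<int, 3> local,
                             std::array<int, 3> origin) {
    return { local[0] + origin[0],
             local[1] + origin[1],
             local[2] + origin[2] };
}
```

For the 10×10×10 example, local (9, 0, 5) with origin (100, 220, 710) maps to global (109, 220, 715), inside the ranges 100–109, 220–229, and 710–719 given above.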

The miniFE code has options for using threading models to parallelize the for loops surrounding many activities. In this mode the MPI ranks do not further subdivide their subdomains recursively into multiple smaller subdomains for each thread. Instead, for loops within the calculations for each subdomain are divided into parallel tasks using OpenMP*, Intel® Cilk™ Plus, or qthreads*. Initial performance data on the original reference code showed that running calculations for a specified problem set on 10 MPI ranks without threading was much faster than running one MPI rank with 10 threads.

So the division among threads was not as efficient as the division between MPI ranks. Software optimizations should begin with software performance analysis, and I used the TAU* Performance System for this purpose. The data showed that the waxpy (vector axpy) operations consumed much more time in the hybrid thread/MPI version than in the MPI-only version. The waxpy operation is inherently parallel: it doesn't involve any reduction like a dot product, and there are no potential data-sharing problems that would complicate threading. The only reason for the waxpy operation to consume more time is that the threading models used are all fork-join models. That is, the work for each thread is forked off at the beginning of a for loop, and then all the threads join back again at the end. The effort to initiate computation at the fork and then synchronize at the end adds considerable overhead, which was not present in the MPI-only version of the code.
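A fork-join waxpy (here taken as w = alpha·x + y) can be sketched with standard C++ threads. This is an illustrative model, not miniFE's code (miniFE uses OpenMP, Intel Cilk Plus, or qthreads for these loops), but it makes the overhead visible: a thread team is created at the loop and joined at its end, and that per-operation fork and join is the cost TAU exposed.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Fork-join waxpy: w = alpha * x + y.  Threads are forked for the loop
// and joined at the end; paying this cost on every vector operation is
// the overhead absent from an MPI-only version.
void waxpy(double alpha, const std::vector<double>& x,
           const std::vector<double>& y, std::vector<double>& w,
           int nthreads) {
    std::vector<std::thread> team;
    size_t chunk = (x.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {                 // fork
        size_t lo = t * chunk;
        size_t hi = std::min(x.size(), lo + chunk);
        team.emplace_back([&, lo, hi] {
            for (size_t i = lo; i < hi; ++i)
                w[i] = alpha * x[i] + y[i];
        });
    }
    for (auto& th : team) th.join();                     // join (synchronization cost)
}
```

In the hybrid miniFE design described next, each thread instead owns a subdomain for the whole run, so no per-loop fork and join is needed.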

The original miniFE code divided the domain into a number of subdomains that matches the number of MPI ranks (this number is called numprocs). The identifier for each subdomain was the MPI rank number, called myproc. In the new hybrid code the original domain is divided into a number of subdomains that matches the total number of threads globally; this is the number of MPI ranks times the number of threads per rank (numthreads). This global count of subdomains is called idivs (idivs = numprocs * numthreads). Each thread is given a local identifier, mythread (beginning at zero, of course). The identifier of each subdomain changes from myproc to mypos (mypos = myproc * numthreads + mythread). When there is only one thread per MPI rank, mypos and myproc are equal. Code changes were implemented in each file to change references from numprocs to idivs and from myproc to mypos. A new routine was written between the main program and the routine driver. The main program forks off the number of threads indicated. Each thread begins execution of this new routine, which then calls the driver, and in turn each thread calls all of the subroutines that execute the full code path below the routine driver.
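The identifier arithmetic above can be written as a pair of small helpers (hypothetical names, mirroring the mypos formula in the text): mypos encodes both the owning MPI rank and the thread within that rank, and division and remainder recover them.

```cpp
#include <cassert>

// mypos = myproc * numthreads + mythread, as described in the text.
int subdomain_id(int myproc, int mythread, int numthreads) {
    return myproc * numthreads + mythread;
}
// Inverse mappings: which rank and which thread own a given subdomain.
int owner_rank(int mypos, int numthreads)   { return mypos / numthreads; }
int owner_thread(int mypos, int numthreads) { return mypos % numthreads; }
```

With one thread per rank (numthreads = 1), subdomain_id reduces to myproc, which is why the hybrid code degenerates cleanly to the original MPI-only numbering.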

The principle of dividing work into compact local regions, or subdomains, remains the same. For example, when a subdomain needs to share data with an adjacent subdomain, it loops through all its neighbors and shares the necessary data. The code snippets below show the loops for sharing data from one subdomain to another in the original code and the new code. In these code snippets each subdomain is sending data to its adjacent neighbors with which it shares faces or edges. In the original code each subdomain maps to an MPI rank. These code snippets come from the file exchange_externals.hpp. The original code is shown below in the first text box. Comments are added to increase clarity.

Original code showing sends for data exchanges:

// prepare data to send to neighbors by copying data into send buffers
for(size_t i=0; i<total_to_be_sent; ++i) {
  send_buffer[i] = x.coefs[elements_to_send[i]];
}
//loop over all adjacent subdomain neighbors – Send data to each neighbor
Scalar* s_buffer = &send_buffer[0];
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  MPI_Send(s_buffer, n_send, mpi_dtype, neighbors[i],
           MPI_MY_TAG, MPI_COMM_WORLD);
  s_buffer += n_send;
}


New code showing sends for data exchanges:

// loop over all adjacent subdomain neighbors – communicate data to each neighbor
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  if (neighbors[i]/numthreads != myproc)
  { // neighbor is in a different MPI rank – pack and send data
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      send_buffer[ij] = x.coefs[elements_to_send[ij]];
    MPI_Send(s_buffer, n_send, mpi_dtype,
             neighbors[i]/numthreads,
             MPI_MY_TAG+(neighbors[i]*numthreads)+mythread, MPI_COMM_WORLD);
  } else
  { // neighbor is another thread in this MPI rank – wait until the
    // recipient flags it is safe to write, then write
    while (sg_sends[neighbors[i]%numthreads][mythread]);
    stmp = (Scalar *) (sg_recvs[neighbors[i]%numthreads][mythread]);
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      stmp[ij-ibuf] = x.coefs[elements_to_send[ij]];
    // set flag that write completed
    sg_sends[neighbors[i]%numthreads][mythread] = 2;
  }
  s_buffer += n_send;
  ibuf += n_send;
}


In the new code each subdomain maps to a thread. So each thread now communicates with threads responsible for neighboring subdomains. These other threads may or may not be in the same MPI rank. The setup of communicating data remains nearly the same. When communication mapping is set up, a vector of pointers is shared within each MPI rank. When communication is between threads in the same MPI rank (process), a buffer is allocated and both threads have access to the pointer to that buffer. When it is time to exchange data, a thread loops through all its neighbors. If the recipient is in another MPI rank, the thread makes a regular MPI send call. If the recipient is in the same process as the sender, the sending thread writes the data to the shared buffer and marks a flag that it completed the write.
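The flag handshake in the intra-rank branch can be modeled with std::atomic in a simplified two-thread sketch (illustrative names sg_flag and sg_buffer, not miniFE's actual sg_sends/sg_recvs structures): the flag is 0 while the shared buffer is safe to overwrite and 2 once the sender's write is complete.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Model of the intra-rank exchange: the sender spins until the buffer is
// free, writes directly into the shared buffer, and raises the flag; the
// receiver spins until the data is ready, copies it, and clears the flag.
std::atomic<int> sg_flag{0};            // 0 = safe to write, 2 = data ready
std::vector<double> sg_buffer;

void thread_send(const std::vector<double>& data) {
    while (sg_flag.load() != 0) { }     // wait until recipient says it is safe
    sg_buffer = data;                   // write into the shared buffer
    sg_flag.store(2);                   // mark the write complete
}

std::vector<double> thread_recv() {
    while (sg_flag.load() != 2) { }     // wait for the sender's data
    std::vector<double> out = sg_buffer;
    sg_flag.store(0);                   // buffer may be reused
    return out;
}
```

The atomic flag gives the same publish/consume ordering the miniFE code relies on: the store of 2 is only observed after the buffer write, so the receiver never reads a half-written buffer.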

Additional changes were also required. By default, MPI assumes only one thread in a process or the MPI rank sends and receives messages. In this new miniFE thread layout each thread may send or receive data from another MPI rank. This required changing MPI_Init to MPI_Init_thread with the setting MPI_THREAD_MULTIPLE. This sets up the MPI runtime library to behave in a thread-safe manner. It is important to remember that MPI message passing is between processes (MPI ranks) not threads, so by default when a thread sends a message to a remote system there is no distinction made between threads on the remote system. One method to handle this would be to create multiple MPI communicators. If there were a separate communicator for each thread in an MPI rank, a developer could control which thread received the message in the other MPI rank by its selection of the communicator. Another method would be to use different tags for each thread so that the tags identify which thread should receive a particular message. The latter was used in this implementation; MPI message tags were used to control which thread received messages. The changes in MPI message tags can be seen in the code snippets as well. In miniFE the sender fills the send buffer in the order the receiver prefers. Thus the receiver does not need to unpack data on receipt and can use the data directly from the receive buffer destination.  
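The tag arithmetic can be isolated in a small helper (message_tag is a hypothetical name; the formula mirrors the snippet's MPI_MY_TAG+(neighbors[i]*numthreads)+mythread expression): because MPI delivers messages to a rank, not a thread, the tag encodes the destination subdomain and the sending thread so the right thread can post a matching receive.

```cpp
#include <cassert>

// Encode the destination subdomain and sending thread into the MPI tag,
// mirroring the tag expression used in the send snippet above.
int message_tag(int base_tag, int dest_subdomain, int numthreads, int mythread) {
    return base_tag + dest_subdomain * numthreads + mythread;
}
```

Each (destination subdomain, sending thread) pair produces a distinct tag, which is how the implementation avoids needing a separate MPI communicator per thread.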

Dot products are more complicated, because they are handled in a hierarchical fashion. First, a local sum is made by all threads in the MPI rank. One thread makes the MPI Allreduce call, while the other threads stop at a thread barrier waiting for the MPI Allreduce to be completed and the data recorded in an appropriate location for all threads to get a copy.

In this initial port all of the data collected here used Intel® Threading Building Blocks (Intel® TBB) threads. These closely match the C++ thread specification, so it should be trivial to test using standard C++ threads.

Optimizations

The initial threading port achieved the goal of matching vector axpy operation execution time. Even though this metric improved, the threading model was initially slower than the MPI version prior to some tuning. Three principal optimization steps were applied to improve the threaded code performance.

The first step was to improve parallel operations like dot products. In the initial port each thread accumulated results, such as dot product contributions, using simple locks. The first attempt replaced Posix* mutexes with Intel TBB locks and then with atomic operations as flags. These steps made no appreciable improvement. Although the simple lock method worked for reductions or gathers during quick development using four threads, it did not scale well to a couple of hundred threads. A simple tree was created to add some thread parallelism to reductions such as the dot products. Implementing a simple tree for a parallel reduction offered a significant performance improvement; further refinement may offer small incremental gains.

The second optimization was to give each thread a private copy of some global arrays (the data came from an MPI_Allgather). Because none of the threads alters the array values after that point, the arrays are read-only and there is no opportunity for race conditions. The initial port therefore shared one copy of each array among all threads, but this proved to be the wrong choice for performance; creating a private copy of the array for each thread improved performance. Even after these optimizations, performance of the code with many threads still lagged behind the case with only one thread per MPI rank.
This leads to the third and last optimization step. The slow region was in problem setup and initialization, and performance analysis pointed to dynamic memory allocation as the bottleneck. The default memory allocation libraries on Linux* do not scale well for numerous threads. Several third-party scalable memory allocation libraries are available to resolve this problem, and all of them work better than the default Linux memory allocation runtime libraries. I used the Intel TBB memory allocator because I am familiar with it and it can be adopted without any code modification: defining LD_PRELOAD at runtime substitutes the Intel TBB memory allocator, which is designed for parallel software, for all of the object and dynamic memory creation. This change of the memory allocation runtime libraries closed the performance gap and provided the biggest single performance improvement.

Performance

This new hybrid miniFE code ran on both the Intel® Xeon Phi™ coprocessor and the Intel® Xeon Phi™ processor. The data collected varied the number of MPI ranks and threads using a problem size that nearly consumed all of the system memory on each of the two platforms. For the first-generation Intel Xeon Phi coprocessor the MPI rank-to-thread ratio varied from 1:244 to 244:1. A problem size of 256×256×512 was used for the tests. The results are shown in Figure 2.

Figure 2. MiniFE* performance on an Intel® Xeon Phi™ coprocessor.

 

The results show performance variations across the different MPI-to-thread ratios. Each ratio was run at least twice, and the fastest time was selected for reporting. More runs were collected for the ratios with slower execution times. The differences in time proved repeatable. Figure 3 shows the same tests on the Intel Xeon Phi processor using a larger problem size.

Figure 3: MiniFE* performance on Intel® Xeon Phi™ Processor

 


The performance on the Intel Xeon Phi processor showed less variation than the performance of miniFE on the Intel Xeon Phi coprocessor. No explanation is offered for the runtime differences across the various ratios of MPI ranks to threads. It may be possible to close those differences by explicitly pinning threads and MPI ranks to specific cores; these tests left process and thread assignment to the OS.

There is much less variation in miniFE performance than was reported for the NAS SP-MZ* benchmark hybrid code discussed in Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing. The NAS benchmark code, though, did not create subdomains for each thread as was done in this investigation of miniFE, and the NAS SP-MZ code did not scale as well with threads as it did with MPI. This case study shows that, following the same decomposition, threads do as well as MPI ranks. On the Intel® Xeon Phi™ product family, miniFE performance was slightly better when using the maximum number of threads and only one MPI rank rather than the maximum number of MPI ranks with only one thread each. The best performance was achieved with a mixture of MPI ranks and threads.

Memory consumption proves to be the most interesting aspect. The Intel Xeon Phi coprocessor is frequently not set up with a disk to swap pages to virtual memory, which provides an ideal platform to evaluate the size of a problem that can be run with the associated runtime libraries. When running the miniFE hybrid code on the Intel Xeon Phi coprocessor, the largest problem size that ran successfully with one MPI rank for each core was 256×256×512. This is a problem size of 33,554,432 elements. The associated global stiffness matrix contained 908,921,857 nonzero entries. When running with only 1 MPI rank and creating a number of threads that match the number of cores, the same number of subdomains are created and a larger problem size—256×296×512—runs to completion. This larger problem contained 38,797,312 elements, and the corresponding global stiffness matrix had 1,050,756,217 nonzero elements. Based on the number of finite elements, the threading model allows developers to run a model 15 percent larger. Based on nonzero elements in the global stiffness matrix, the model solved a matrix that is 15.6 percent larger. The ability to run a larger problem size is a significant advantage that may appeal to some project teams.

There are further opportunities for optimization of the threaded software (for example, pinning threads and MPI ranks to specific cores and improving parallel reductions). However, the principal tuning has been done, and further tuning would probably yield minimal performance changes. The principal motivation for following the same problem decomposition for threading as for MPI is the improvement in memory consumption.

Summary

The effort to write code for both threads and MPI is time consuming. Projects such as the Multi-Processor Computing (MPC) framework (see mpc.hpcframework.paratools.com) may make writing code in MPI and running via threads just as efficient in the future. The one-sided communication features of MPI-3 may allow developers to write code more like the threaded version of miniFE, where one thread writes the necessary data to the other threads' desired locations, minimizing the need for MPI runtime libraries to hold so much memory in reserve. When adding threading to MPI code, remember best practices such as using scalable runtime libraries and watching for system calls that may not be thread-friendly by default, such as memory allocation or rand().

Threaded software performs comparably with MPI when both follow the same parallel layout: one subdomain per MPI rank or one subdomain per thread. In cases like miniFE, threading consumes less memory than the MPI runtime libraries and allows larger problem sizes to be solved on the same system. For this implementation of miniFE, problem sizes 15 percent larger could be run on the same platform. Those seeking to optimize for memory consumption should consider using the same parallel layout for both threading and MPI and will likely benefit from the transition.

Notes

Data collected using the Intel® C++ Compiler 16.0 and Intel® MPI Library 5.1.
