MPI One-Sided Communication

In this continuation of the blog, Hybrid MPI and OpenMP* Model, I will discuss the use of MPI one-sided communication and demonstrate running a one-sided application in symmetric mode on an Intel® Xeon® host and two coprocessors connected via PCIe.

The standard Message Passing Interface (MPI) has two-sided communication and collective communication models. In these communication models, both sender and receiver have to participate in data exchange operations explicitly, which requires synchronization between the processes.

In two-sided communication, memory is private to each process. When the sender calls the MPI_Send operation and the receiver calls the MPI_Recv operation, data in the sender memory is copied to a buffer then sent over the network, where it is copied to the receiver memory. One drawback of this approach is that the sender has to wait for the receiver to be ready to receive the data before it can send the data. This may cause a delay in sending data. Figure 1 illustrates this situation.

 

Figure 1 A simplified diagram of MPI two-sided communication send/receive. The sender calls MPI_Send but has to wait until the receiver calls MPI_Recv before data can be sent.

To overcome this drawback, the MPI 2 standard introduced Remote Memory Access (RMA), also called one-sided communication because it requires only one process to transfer data. One-sided communication decouples data transfer from system synchronization. The MPI 3.0 standard revised and added extensions to the one-sided communication, adding new functionality to improve the performance of MPI 2 RMA. The Intel® MPI Library 5.0 supports one-sided communication where a process can have direct access to the memory address space of a remote process (MPI_Get/MPI_Put/MPI_Accumulate) without the intervention of that remote process. One-sided communication operations are non-blocking. They benefit many applications because, while a process sends data to a remote process, the remote process can continue to compute (useful work) instead of waiting for the data.

In order to allow other processes to have access into its memory, a process has to explicitly expose its own memory to others.  It does this (MPI_Win_create) by declaring a shared memory region, also called a window. Synchronization in MPI one-sided communication can be achieved with MPI_Win_fence. In a simple word, between two MPI_Win_fence calls, all RMA operations are completed. 

To illustrate MPI one-sided communication, the below sample program shows the use of MPI_Get and MPI_Put using a memory window. Note that error checking is not implemented, since the program is intended only to show how one-sided communication works.

/*
// Copyright 2003-2014 Intel Corporation. All Rights Reserved.
// 
// The source code contained or described herein and all documents related 
// to the source code ("Material") are owned by Intel Corporation or its
// suppliers or licensors.  Title to the Material remains with Intel Corporation
// or its suppliers and licensors.  The Material is protected by worldwide
// copyright and trade secret laws and treaty provisions.  No part of the
// Material may be used, copied, reproduced, modified, published, uploaded,
// posted, transmitted, distributed, or disclosed in any way without Intel's
// prior express written permission.
// 
// No license under any patent, copyright, trade secret or other intellectual
// property right is granted to or conferred upon you by disclosure or delivery
// of the Materials, either expressly, by implication, inducement, estoppel
// or otherwise.  Any license under such intellectual property rights must
// be express and approved by Intel in writing.
*/
#include <stdio.h>
#include <mpi.h>

#define NUM_ELEMENT  4

int main(int argc, char** argv)
{
   int i, id, num_procs, len, localbuffer[NUM_ELEMENT], sharedbuffer[NUM_ELEMENT];
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Win win;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &id);
   MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
   MPI_Get_processor_name(name, &len);

   printf("Rank %d running on %s\n", id, name);

   MPI_Win_create(sharedbuffer, NUM_ELEMENT, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   for (i = 0; i < NUM_ELEMENT; i++)
   {
      sharedbuffer[i] = 10*id + i;
      localbuffer[i] = 0;
   }

   printf("Rank %d sets data in the shared memory:", id);

   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", sharedbuffer[i]);

   printf("\n");
 
   MPI_Win_fence(0, win);

   if (id != 0)
      MPI_Get(&localbuffer[0], NUM_ELEMENT, MPI_INT, id-1, 0, NUM_ELEMENT, MPI_INT, win);
   else
      MPI_Get(&localbuffer[0], NUM_ELEMENT, MPI_INT, num_procs-1, 0, NUM_ELEMENT, MPI_INT, win);

   MPI_Win_fence(0, win);

   printf("Rank %d gets data from the shared memory:", id);

   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", localbuffer[i]);

   printf("\n");

   MPI_Win_fence(0, win);

   if (id < num_procs-1)
      MPI_Put(&localbuffer[0], NUM_ELEMENT, MPI_INT, id+1, 0, NUM_ELEMENT, MPI_INT, win);
   else
      MPI_Put(&localbuffer[0], NUM_ELEMENT, MPI_INT, 0, 0, NUM_ELEMENT, MPI_INT, win);

   MPI_Win_fence(0, win);

   printf("Rank %d has new data in the shared memory:", id);
 
   for (i = 0; i < NUM_ELEMENT; i++)
      printf(" %02d", sharedbuffer[i]);

   printf("\n");

   MPI_Win_free(&win);
   MPI_Finalize();
   return 0;
}

In the sample code, MPI_Init initializes the MPI environment which allows the parallel code. MPI_Comm_rank and MPI_Comm_size return the MPI process identification and the number of MPI processes (or ranks) respectively. MPI_Get_processor_name returns the name and length of the processor name. Each process in the system has a shared memory region, called sharedbuffer, which is an array of 4 integers.  MPI_Win_create is called by all processes to create a window of shared memory; the window specifies all process memory which is available for remote operations. Each process then initializes its portion of the memory window allowing remote processes read/write access to the pre-defined memory. MPI_Put writes data into a memory window on a remote process. MPI_Get reads data from a memory window on a remote process. Between the first two MPI_Win_fence calls, each process reads data from the window on the preceding process and copies it to its local memory; between the second two MPI_Win_fence calls, each process copies data from its local memory to the window of the succeeding process. Each process finally checks its new data (which is already modified by a remote process). MPI_Win_free is called to terminate the memory window and MPI_Finalize to clean up all MPI states and MPI environment; the parallel code ends here.

In the following section, the sample code is compiled and run on an Intel Xeon host system equipped with two Intel® Xeon Phi™ coprocessors. First, we establish a proper environment for the Intel compiler (in this case, Intel® Composer XE 2013 Service Pack 1) and the Intel® MPI Library (in this case, Intel® MPI Library 5.0), then we copy the MPI libraries to the two coprocessors. Next the executables for host and coprocessor are built and the coprocessor executable copied into the directory /tmp on the coprocessors.

% source /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64

% source /opt/intel/impi/5.0.0.028/bin64/mpivars.sh

% scp /opt/intel/impi/5.0.0.028/mic/bin/* mic0:/bin/

% scp /opt/intel/impi/5.0.0.028/mic/lib/* mic0:/lib64/

% scp /opt/intel/impi/5.0.0.028/mic/bin/* mic1:/bin/

% scp /opt/intel/impi/5.0.0.028/mic/lib/* mic1:/lib64/

% mpiicc mpi_one_sided.c -o mpi_one_sided.host

% mpiicc -mmic mpi_one_sided.c -o mpi_one_sided.mic

% sudo scp ./mpi_one_sided.mic mic0:/tmp/mpi_one_sided.mic

% sudo scp ./mpi_one_sided.mic mic1:/tmp/mpi_one_sided.mic

 

Then we enable MPI communication between host and coprocessors, and activate coprocessor-coprocessor communication:

 

% export I_MPI_MIC=enable

% sudo /sbin/sysctl -w net.ipv4.ip_forward=1

At this point we are ready to launch the application in symmetric mode with one rank on the Intel Xeon host, two ranks on the first coprocessor mic0 and three ranks on the second coprocessor mic1:

% mpirun -host localhost -n 1 ./mpi_one_sided.host : -host mic0 -n 2 -wdir /tmp ./mpi_one_sided.mic : -host mic0 -n 3 -wdir /tmp ./mpi_one_sided.mic

The following figures show the data movement as the program runs. After the window of shared memory is created, each rank initializes its portion of the shared memory. The array is filled with integer numbers that combine the rank number and the element number. The tens digit is set to the originated rank number; the ones digit is set to the element number in the array. Thus, rank 0 places the values of 00, 01, 02, 03 for the first, second, third and fourth entries in its shared array; likewise, rank 3  places the values of 30, 31, 32, 33 for the first, second, third and fourth entries in its shared array, etc. Figure 2 shows the values of the shared memory after each of the 6 ranks fills its shared array sharedbuffer:

Figure 2 Each rank initializes its portion in the shared memory.

Since each process now can have access to the shared memory of others, each rank calls MPI_Get to copy data of the preceding rank’s shared memory to its local array. The local array of each rank now contains the values of the preceding rank.  Thus, since the preceding rank of rank 0 is rank 5, rank 0 gets the values of 50, 51, 52, 53 for the first, second, third and fourth entries in its local array; similarly, since the preceding rank of rank 3 is rank 2, rank 3 places the values of 20, 21, 22, 23 for the first, second, third and fourth entries in its local array, etc. Figure 3 shows the values of the local array localbuffer after each rank gets the values from shared memory sharedbuffer:

Figure 3 Each rank gets the data of the preceding rank from the shared memory.

Next, each rank calls MPI_Put to copy its local array to the succeeding rank’s shared memory. That is, the shared memory of the succeeding rank now contains the values of the local array of its preceding rank. Thus, since the succeeding rank of rank 0 is rank 1, rank 0 copies the values of 50, 51, 52, 53 to the first, second, third and fourth entries in rank 1’s shared array; similarly, since the succeeding rank of rank 3 is rank 4, rank 3 copies the values of 20, 21, 22, 23 to the first, second, third and fourth entries to rank 4’s shared memory of rank 4, etc. Figure 4 shows the values of the shared memory sharedbuffer when each rank copies the values from its local array to the shared memory sharedbuffer:

Figure 4 Each rank writes data (previously read) to the shared memory of the succeeding rank.

Finally, each rank reads its shared array. Thus, rank 0 reads the values of 40, 41, 42, 43 for the first, second, third and fourth entries in its shared array; rank 3  reads the values of 10, 11, 12, 13 for the first, second, third and fourth entries in its shared array, etc. Figure 5 shows the values of the shared memory after each of the 6 ranks reads its shared array sharedbuffer:

Figure 5 Data of each rank in the shared memory changes. 

The output generated by the application is shown below:

Rank 0 running on knightscorner1

Rank 3 running on knightscorner1-mic1

Rank 5 running on knightscorner1-mic1

Rank 1 running on knightscorner1-mic0

Rank 4 running on knightscorner1-mic1

Rank 2 running on knightscorner1-mic0

Rank 0 sets data in the shared memory: 00 01 02 03

Rank 2 sets data in the shared memory: 20 21 22 23

Rank 4 sets data in the shared memory: 40 41 42 43

Rank 1 sets data in the shared memory: 10 11 12 13

Rank 3 sets data in the shared memory: 30 31 32 33

Rank 5 sets data in the shared memory: 50 51 52 53

Rank 4 gets data from the shared memory: 30 31 32 33

Rank 0 gets data from the shared memory: 50 51 52 53

Rank 5 gets data from the shared memory: 40 41 42 43

Rank 3 gets data from the shared memory: 20 21 22 23

Rank 1 gets data from the shared memory: 00 01 02 03

Rank 2 gets data from the shared memory: 10 11 12 13

Rank 0 has new data in the shared memory: 40 41 42 43

Rank 4 has new data in the shared memory: 20 21 22 23

Rank 5 has new data in the shared memory: 30 31 32 33

Rank 1 has new data in the shared memory: 50 51 52 53

Rank 2 has new data in the shared memory: 00 01 02 03

Rank 3 has new data in the shared memory: 10 11 12 13

 

Besides MPI one-sided communication, there are other programing approaches that also support one-sided communication. For example, SHMEM (Symmetric Hierarchy Memory Access) is another approach where one process can have access to a global shared memory. SHMEM also takes advantage of hardware RDMA (Remote Direct Memory Access) to allow a local Processing Element (PE) to access a remote PE’s memory without interrupting the remote PE (the remote PE’s CPU is not involved). SHMEM resembles MPI one-sided communication: each host has the SHMEM library installed; each host runs a copy of the SHMEM application called PE; each host accesses shared memory through APIs such as shem_get(), shmem_put(), shmalloc(), shmem_barrier() to transfer data, allocate memory and synchronize the processes. Like MPI, SHMEM is very easy to use to take advantage of direct memory access; both MPI and SHMEM offer point to point communication and collective communication. Unlike MPI, SHMEM is not yet standardized, although there is an effort by many leading companies in the HPC area to support the OpenSHMEM standard.

CONCLUSION:

In summary, one-sided communication in MPI enables users to take advantage of DMA to access the data of remote processes; thus benefiting applications in which synchronization can be relaxed by reducing data movement. An example of using one-sided communication MPI_Get/MPI_Set was shown and run in the symmetric mode between an Intel Xeon host and Intel Xeon Phi coprocessors using the Intel® MPI Library 5.0. SHMEM programming resembles MPI one-sided communication but SHMEM is not standardized yet. There are ongoing efforts to standardize it (OpenSHMEM) in order to support multiple hardware vendors.

Note: By installing or copying all or any part of the software components in this page, you agree to the terms of the Intel Sample Source Code License Agreement.

 

 

 

 

 

 

 

 

 

 

 

 

For more complete information about compiler optimizations, see our Optimization Notice.