MPI+linked list

Dear All,

as suggested by Tim Prince, I ask the question about MPI here. 

I would like to know whether it is possible to send a linked list from one processor to another in Fortran with MPI.

Thanks a lot

MPI sends data based on several factors.  You provide the starting address (via variable name), length (via datatype and count), data layout if necessary (custom datatypes), and how the data is encoded (datatype).  Since a linked list is not a contiguous datatype, and doesn't necessarily have a pre-defined structure, you cannot do this in a simple manner.  You would need to send each element of the linked list individually.

Dear all,

this means that I have to change strategy, am I right?

This could be a problem because my program works with linked lists.

Do you have any suggestions?

Thanks a lot

You could send one element at a time and rebuild the linked list in the receiving rank.  Or you could convert the linked list to an array, send that, and build a new linked list in the receiver.  But there is no trivial way to send a linked list.
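To make the second option concrete, here is a minimal sketch in plain C (no MPI calls, so it compiles on its own).  The node type and the function names are hypothetical, since the thread does not show the real list layout.  The idea is that only the payload goes into a contiguous buffer, which is what you would hand to the send call, and the receiver rebuilds the pointers locally.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type: one int of payload per element. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Count the nodes so the caller can size the contiguous buffer. */
size_t list_length(const node *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Copy every payload into buf, which must hold list_length(head) ints.
   The next pointers are not copied: an address from one rank is
   meaningless in another rank's memory space. */
void list_to_array(const node *head, int *buf)
{
    size_t i = 0;
    assert(buf != NULL || head == NULL);
    for (; head != NULL; head = head->next)
        buf[i++] = head->value;
}

/* Release every node of a list. */
void list_free(node *head)
{
    while (head != NULL) {
        node *next = head->next;
        free(head);
        head = next;
    }
}

/* Rebuild a list from n received values, preserving their order.
   Returns NULL when n is 0 or when an allocation fails. */
node *array_to_list(const int *buf, size_t n)
{
    node *head = NULL;
    node **tail = &head;   /* where the next node gets linked in */
    for (size_t i = 0; i < n; i++) {
        node *p = malloc(sizeof *p);
        if (p == NULL) {
            list_free(head);
            return NULL;
        }
        p->value = buf[i];
        p->next = NULL;
        *tail = p;
        tail = &p->next;
    }
    return head;
}
```

On the sending rank you would call list_length and list_to_array, then send the buffer as an ordinary array.  The receiver needs to know the element count first (for example from a small message sent ahead of the data), then receives into a buffer and calls array_to_list.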

Dear James,

thanks a lot for the suggestion. I am totally new to MPI. Now I am learning how to divide the computational domain with MPI_CART_CREATE and MPI_CART_SHIFT. I have also learned MPI_CART_COORDS.

I think that MPI_CART_SHIFT is a problem with linked lists. Am I right?

Where could I learn how to send an array? For now I am only able to send a single variable.

Thanks a lot

A simple example of sending an array:

int array[10];
int rank;
// Initialize the array values however you prefer
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
if (rank==0)
{
  MPI_Send(array,10,MPI_INT,1,0,MPI_COMM_WORLD);
} else if (rank==1)
{
  MPI_Recv(array,10,MPI_INT,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
}

The MPI_Comm_rank call gets the rank number within MPI_COMM_WORLD.  Every rank will have a unique number within any given communicator.  MPI_COMM_WORLD is the default communicator, and will contain all ranks that were part of the initial job.  You can spawn additional ranks after starting which aren't part of MPI_COMM_WORLD, but that's a separate topic.

The conditional breaks execution into three possible paths, based on the rank number.  Any rank higher than rank 1 will not do anything here, and simply move to the next statement.  Rank 0 will send data, rank 1 will receive it.  Let's look at what the communication calls actually do.

The first argument of each is the initial buffer address.  For the sending call, this is where to find the data to send.  For the receiving call, this is where to put the received data.  In this case, the variable array is a pointer to the first element of the array.  Keep in mind that the call here expects an address, not a value.

The next argument is the count.  How many pieces of data will be sent?  The third argument defines the type of data.  Basic MPI datatypes are fairly simple, and sufficient for most cases.  Here, I'm using MPI_INT, which is the corresponding type for an int variable.  These two arguments define how much data to send and how to handle it.  For the most part, the "how to handle it" is irrelevant, but when you are using heterogeneous systems, this can become a factor.  I wouldn't worry about it at this time, other than to make sure you're sending the correct data type.  You can also make custom data types.  But these are still simple structures, boiling down to a combination of basic types, along with constant offsets.

Argument 4 is different between the two calls.  In the sender, this is the destination rank.  In the receiver, this is the sending rank.  Argument 5 is the tag.  This is useful when you have multiple communication calls and want to differentiate them.  Unless you specify otherwise, a sender and receiver must have matching tags.  Argument 6 is the communicator to use.  When you are using custom communicators, this identifies which one to use.

The final argument to MPI_Recv is the status.  Here, I just used MPI_STATUS_IGNORE.  This argument allows you to get more information about the message transmission, especially useful for error handling.

Now, to your second point.  MPI_CART_CREATE and MPI_CART_SHIFT.  These will not help with linked lists.  These functions are intended to make a Cartesian grid from your ranks.  For example, if you have a 2-dimensional program, with a 2-dimensional data set, and you are splitting your data by location within the data set.  If you have 64 ranks, you could create a grid of 8x8 ranks.  MPI_CART_CREATE allows you to more easily reference neighbor ranks in both directions, rather than having to maintain your own map of which rank is where on your grid.  There are many uses for this, but this is irrelevant to using linked lists.
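If it helps to see what the Cartesian functions save you from writing, here is a rough sketch in plain C of the bookkeeping for a 2-dimensional, non-periodic grid.  This is not how any MPI library implements it, and the function names and the NO_NEIGHBOR constant are made up for illustration (MPI itself reports a missing neighbor as MPI_PROC_NULL).  It only assumes the row-major rank ordering that the MPI standard specifies for Cartesian topologies.

```c
#include <assert.h>

#define NO_NEIGHBOR (-1)  /* hypothetical stand-in for MPI_PROC_NULL */

/* Row-major mapping from a rank to its (row, col) position,
   similar in spirit to what MPI_CART_COORDS reports. */
void grid_coords(int rank, int cols, int *row, int *col)
{
    assert(cols > 0 && rank >= 0);
    *row = rank / cols;
    *col = rank % cols;
}

/* Inverse mapping; positions outside the grid have no rank. */
int grid_rank(int row, int col, int rows, int cols)
{
    if (row < 0 || row >= rows || col < 0 || col >= cols)
        return NO_NEIGHBOR;
    return row * cols + col;
}

/* Neighbors one step away along dimension dim (0 = rows, 1 = cols),
   similar in spirit to MPI_CART_SHIFT with a displacement of 1:
   source is the neighbor at coordinate - 1, dest the one at + 1. */
void grid_shift(int rank, int rows, int cols, int dim,
                int *source, int *dest)
{
    int row, col;
    int drow = (dim == 0);
    int dcol = (dim == 1);
    grid_coords(rank, cols, &row, &col);
    *source = grid_rank(row - drow, col - dcol, rows, cols);
    *dest   = grid_rank(row + drow, col + dcol, rows, cols);
}
```

With 64 ranks on an 8x8 grid, rank 9 sits at row 1, column 1, and its neighbors along the column dimension are ranks 8 and 10.  A rank on the edge simply has no neighbor on that side.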

Dear J.,

I have another question, and I do not know whether I can post it here or should create a new post.

The question is this one:

When I start a MPI code with:

   CALL MPI_INIT(MPI%iErr)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MPI%myrank, MPI%iErr)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, MPI%nCPU,   MPI%iErr)

and then I allocate a vector with for example

  ALLOCATE(VECTOR(NX,NY))

 

how much space have I allocated? I mean, if I have for example four processors, the total number of allocated vectors is

4*VECTOR(NX,NY) and each vector has a different memory location, am I right?

Thanks a lot

 

Each rank is a completely separate program.  For example, if I run

mpirun -n 4 hostname

I will in reality run hostname 4 times.  Since this isn't an MPI program, there will be no communication between them.  They'll all be started separately, all run, and all finish.

When you call MPI_Init, that is when the ranks begin communicating with each other.  Here, the MPI communication space is initialized.  However, outside of MPI calls, the ranks are still independent of each other.  When you call allocate, each rank will allocate its own copy of vector.  There is no overlap between them, no way for one rank to directly interact with vector in a different rank.  So yes, with 4 ranks, each calling allocate, you will have 4 completely independent instances of vector.  They will not share the same memory space.

Side note.  There are MPI functions that allow direct access to another rank's memory space.  With MPI-3, you can even allocate a "shared" portion of memory.  I highly recommend you become more familiar with MPI before going here though.

Now, if you only wanted to allocate vector in one rank, that can be done fairly easily.  You've already called MPI_Comm_rank.  This gives you the rank number of the current rank within the communicator.  You can use this, along with logical statements, to control which rank does what.

call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
if (rank .eq. 0) then
  allocate(vector(nx,ny))
end if

Here, only rank 0 would allocate vector.

Dear J.,

perfect, that is what I need. Each rank has its own memory and data.

Now I have another question. My test program for learning MPI needs some processors to communicate. I have this simple program where neighboring cells communicate their values:

send_messageL = Q(MPI%iStart,MPI%jStart:MPI%jEnd)
CALL MPI_ISEND(send_messageL, MsgLength, MPI%AUTO_REAL, LCPU, 1, COMM_CART, send_request(1), MPI%iErr)

In my case I have to communicate not only Q but also other information, such as particle positions etc.

Is it possible to create my own variable type, such as:

type particle
      integer                 :: rx
      integer                 :: ry
      integer                 :: QQ
end type particle

and after that allocate a fixed number of particles, to avoid the use of a linked list, and then share them?

Does this slow down my program?

What do you suggest?

Thanks a lot

 

First question.  Is that possible?  Absolutely.  There are two ways to go about doing that.  The simplest is to send your data as 3*elems integers.

type (particle), dimension(:), allocatable :: parray
integer elems,req,ierr
elems=1000 ! Or whatever other method for setting it you prefer
allocate(parray(elems))
call MPI_ISEND(parray,3*elems,MPI_INTEGER,1,0,MPI_COMM_WORLD,req,ierr)

Each element of parray is 3 integers in size.  Sending 3*elems integers will send all of parray.

You can also create your own MPI datatype.  You can use this to select only portions of the parray type to send.  Then, send that datatype.

Second question, will it slow your performance?  It will actually improve your performance.  Every communication has some overhead.  It is almost always better to send all of your data at once, rather than a bit at a time.  If the dataset is large enough, it is possible that the MPI implementation or the fabric drivers could split the message.  And if the dataset is too large (2 GB), you need to do something else.  Either break it apart yourself or use derived datatypes to overcome internal limitations.  There could be some situations where sending data in pieces is better.  But if you just have an array, and it isn't huge, send it all at once.
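For the "break it apart yourself" case, the arithmetic is just splitting a total element count into pieces that each fit under some per-message limit.  A small sketch in plain C follows.  The function names are hypothetical, and max_count stands for whatever limit applies to you (in the C bindings the count argument is an int, so one call cannot describe more than INT_MAX elements).

```c
#include <assert.h>
#include <stddef.h>

/* Number of messages needed to move total elements when each
   message may carry at most max_count elements. */
size_t chunk_count(size_t total, size_t max_count)
{
    assert(max_count > 0);
    return total / max_count + (total % max_count != 0);
}

/* Length of chunk i (0-based).  Every chunk is full except
   possibly the last one, which carries the remainder. */
size_t chunk_length(size_t total, size_t max_count, size_t i)
{
    size_t start, rest;
    assert(i < chunk_count(total, max_count));
    start = i * max_count;      /* first element of this chunk */
    rest = total - start;       /* elements not yet sent */
    return rest < max_count ? rest : max_count;
}
```

The sender would loop over the chunks and send chunk_length elements starting at offset i * max_count, and the receiver would post matching receives in the same order.  For an array that fits in a single message this collapses to one chunk, which is the normal case.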

Dear J.,

Thanks a lot. Another question. If I have understood correctly, in case of the following variable:

type particle
 integer                 :: rx
 integer                 :: ry
 real                    :: QQ(4)
end type particle

Each element of parray has the size of 2 integers + 4 reals

and in this case I have to create my own MPI datatype. Am I correct?

Is the same true when I receive the message?

Correct.  You'll want a datatype of 2 integers followed by 4 reals.  The simplest way to do that would be MPI_TYPE_CREATE_STRUCT

type(particle) dummy ! Used for calculation of displacement
integer lengths(2), types(2), ierr
integer(kind=MPI_ADDRESS_KIND) displacements(2)
integer mpi_particle_type
types(1)=MPI_INTEGER
types(2)=MPI_REAL
lengths(1)=2
lengths(2)=4
displacements(1)=0
displacements(2)=sizeof(dummy%rx)+sizeof(dummy%ry)
call MPI_TYPE_CREATE_STRUCT(2,lengths,displacements,types,mpi_particle_type,ierr)
call MPI_TYPE_COMMIT(mpi_particle_type,ierr)

I tried to be very general in calculating the second element of displacements, so you can see what goes into it.  There are several ways to get the displacements, the method I showed is the most true to what is really needed.

Once you have defined the new type, you have to commit it before you can use it.  Once you're all done with it, call MPI_TYPE_FREE to release it and get the memory back.

To use it, simply use mpi_particle_type as the datatype when necessary.
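For comparison, here is how the same layout information could be worked out in C, as a sketch only.  The struct is a hypothetical C mirror of the Fortran type (a Fortran derived type without SEQUENCE or BIND(C) is not guaranteed to be laid out the same way), the function name is made up, and size_t stands in for MPI's address type so the snippet compiles without MPI.  It shows why there are two blocks rather than three: rx and ry are the same type and sit next to each other, so they form one block of length 2, and QQ forms a second block of length 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C mirror of the Fortran particle type. */
typedef struct {
    int   rx;
    int   ry;
    float qq[4];
} particle;

/* Fill in the block lengths and byte displacements that a struct
   datatype constructor such as MPI_Type_create_struct expects.
   Block 0: the two integers, starting at rx.
   Block 1: the four reals, starting at qq.
   offsetof is used instead of adding sizes by hand, so any padding
   the compiler inserts is accounted for. */
void particle_layout(int lengths[2], size_t displs[2])
{
    assert(lengths != NULL && displs != NULL);
    lengths[0] = 2;
    displs[0]  = offsetof(particle, rx);
    lengths[1] = 4;
    displs[1]  = offsetof(particle, qq);
}
```

The matching type list would be one integer type and one real type, in that order.  The displacement of each block is where that block starts, measured from the beginning of the structure, which is why the first one is 0.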

Dear J.,

thank you very much. Sorry, but I do not understand. If this is my variable:

type particle
 integer                 :: rx
 integer                 :: ry
 real                    :: QQ(4)
end type particle

I have 2 integers and 4 reals.

So why, in the call to MPI_TYPE_CREATE_STRUCT, do you use MPI_TYPE_CREATE_STRUCT(2,....

and not  MPI_TYPE_CREATE_STRUCT(3,....

The same goes for displacements. I expected:

displacements(1)=sizeof(dummy%rx)+sizeof(dummy%ry)
displacements(2)=sizeof(dummy%QQ)

What have I misunderstood?

Thanks a lot again

Dear J.,

I think I have understood. The DISPLACEMENTS variable gives the positions where rx and QQ start, so we have:

displacements(1)=0

because the integer part of the type variable starts at 0. And we have:

displacements(2)=sizeof(dummy%rx)+sizeof(dummy%ry)

because the real part starts right after the integer part.

Unfortunately, I have another question:

Why do you use the following?

integer mpi_particle_type

Why do you not use real mpi_particle_type, or are they the same thing?

Thanks again for your indispensable help

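In Fortran, there are two ways to declare MPI objects.  With mpif.h or the mpi module, every MPI object is an integer, which acts as a handle to the actual object (here, the datatype) held inside the MPI library.  It does not hold the particle data itself, so it has nothing to do with the real values in your type.  As of MPI-3, the mpi_f08 module provides typed handles, so you could instead use TYPE(MPI_Datatype).  I use integer out of familiarity.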

Dear J.,

OK, that is clear. Thank you very much for your explanation.

Diego
