MPI message rate scaling with number of peers

Hi.

I have some MPI code in which small messages (LEN = 1-128 bytes)
are sent from one [host] node to several peers. When I send messages
one per peer, like this:

for (i = 0; i < ITER_NUM; ++i)
{
    for (k = 1; k < NODES; ++k)
    {
        MPI_Isend(S_BUF, LEN, MPI_CHAR,
                  k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    }
    if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
    {
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        nreqs = 0;
    }
}

the message rate drops from 11.5 million messages/sec on a 5-node configuration (1 host and 4 peers)
to 6.5 million/sec on a 17-node setup (1 host and 16 peers). When I change the loop order like this:

for (k = 1; k < NODES; ++k)
{
    for (i = 0; i < ITER_NUM; ++i)
    {
        MPI_Isend(S_BUF, LEN, MPI_CHAR,
                  k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    }
    /* Note: after the inner loop i == ITER_NUM, so a test on
       "i == ITER_NUM - 1" here can never be true; the final flush
       has to key on the last peer instead. */
    if (nreqs / WINDOW > 0 || k == NODES - 1)
    {
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        nreqs = 0;
    }
}
it works well (stable scaling, 11.5 million/sec).
ITER_NUM is about 100 000, and WINDOW bounds the number of outstanding requests; there are MPI_Barrier() calls and time measurement around the loop.
Can someone help me understand the reasons for the message-rate degradation?
Please do not recommend message coalescing. What else should I try to
improve scaling? The eager protocol is used (in MPI), and switching to
rendezvous did not help.

Second question: I tried another test. All nodes form pairs ((0, 1), (2, 3), ..., (n - 2, n - 1)) and simple send-recv is used. When the number of node pairs grows large (256 and higher), both the message rate and the per-pair bandwidth degrade significantly. At the same time, one would expect a fat tree to scale nicely in this situation. Any ideas?

System config:

2 x Intel Xeon X5570
InfiniBand QDR (fat tree)
Intel MPI 4.0.1
Intel C++ compiler 12.0


Hi ingen,

What type of receive are you using? Are you using the MPD process manager, or Hydra?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi, James.

I am using MPI_Irecv() (but it worked the same way with MPI_Recv(), too).
Hydra (mpiexec.hydra) is used.

Hi ingen,

I'm trying to get some additional information on why this behavior is occurring. I believe you are seeing two effects. The change at 17 ranks is likely due to running on multiple nodes, whereas 16 ranks should run on a single node. This requires a change from shared memory to InfiniBand.

The second effect is possibly due to the network layer. Opening and closing a network connection takes time, and these connections may not stay open between communications. By sending one message to a process at a time, you are frequently opening and closing connections. Sending all messages to one process allows the connection to remain open.

I still need to look into the second issue with the node pairs.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Thanks for your reply, James.

Sorry, I was not explicit about the MPI process mapping: each process runs on a different node (I am sure of this). There are 16 processes in total, and the first process communicates with 15 peers.

In the Intel MPI reference they say that I_MPI_DYNAMIC_CONNECTION is set to "off" by default when using fewer than 64 MPI processes. So I think that's not the case here, but I believe there is something to this idea about connection management. Currently I have no access to the cluster; when I can try something, I will post the results here.
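When I get access back, one quick check is to force the connection mode explicitly rather than relying on the default. With Intel MPI this is controlled by the I_MPI_DYNAMIC_CONNECTION environment variable (it accepts enable/disable as well as 1/0); the binary name below is just a placeholder for my benchmark:

```shell
# Establish all connections at startup (static mode), to rule out
# connection setup/teardown as the cause of the degradation:
export I_MPI_DYNAMIC_CONNECTION=disable
mpiexec.hydra -n 17 ./msgrate_bench

# Then repeat with on-demand connections for comparison:
export I_MPI_DYNAMIC_CONNECTION=enable
mpiexec.hydra -n 17 ./msgrate_bench
```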
