I'm trying to get some additional information on why this behavior is occurring. I believe you are seeing two effects. The change at 17 ranks is likely due to running on multiple nodes, whereas 16 ranks should fit on a single node. Crossing that boundary requires a switch from shared memory to InfiniBand.
The second effect is possibly due to the network layer. Opening and closing a network connection takes time, and these connections may not stay open between communications. By sending one message to a process at a time, you are frequently opening and closing connections. Sending all messages to one process allows the connection to remain open.
I still need to look into the second issue with the node pairs.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Sorry, I was not explicit about the MPI process mapping: each process runs on a different node (I am sure of that). There are 16 processes in total, and the first process communicates with 15 peers.
The Intel MPI reference says that I_MPI_DYNAMIC_CONNECTION is "off" by default when fewer than 64 MPI processes are used. So I don't think that is the case here, but I believe there is something to this idea about connection management. I currently have no access to the cluster; once I can run some tests, I will post the results here.
MPI message rate scaling with number of peers
Hi.
I have some MPI code in which small messages (LEN = 1-128 bytes)
are sent from one [host] node to several peers. When I send messages
one per peer, like this:
    for (i = 0; i < ITER_NUM; ++i)
    {
        for (k = 1; k < NODES; ++k)
        {
            MPI_Isend(S_BUF, LEN, MPI_CHAR,
                      k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
        {
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
            nreqs = 0;
        }
    }
the message rate drops from 11.5 million messages/sec on a 5-node configuration (1 host and 4 peers)
to 6.5 million messages/sec on a 17-node setup (1 host and 16 peers). When I change the loop order like this:
    for (k = 1; k < NODES; ++k)
    {
        for (i = 0; i < ITER_NUM; ++i)
        {
            MPI_Isend(S_BUF, LEN, MPI_CHAR,
                      k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
        {
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
            nreqs = 0;
        }
    }
it works well (stable scaling, 11.5 million/sec).
ITER_NUM is about 100 000, and WINDOW there are MPI_Barrier() and time measurement.
Can someone help me understand why the message rate degrades?
Please do not recommend message coalescing. What should I try in order to
improve scaling? The eager protocol is used (in MPI), and switching to
rendezvous did not help.
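One thing that may be worth checking before comparing the two variants: as posted, the loops differ not only in destination order but also in how many requests are in flight. In the second variant the window check sits after the inner loop, where i already equals ITER_NUM, so all ITER_NUM sends to a peer are issued before MPI_Waitall is reached. The sketch below is plain C with no MPI calls. It only replays the counter logic of both loops, with a counter increment standing in for MPI_Isend and a reset standing in for MPI_Waitall. The function names are made up for illustration, and the window value used in the note afterwards is an assumption, since the post does not preserve the real one.

```c
#include <assert.h>
#include <stddef.h>

/* Peak number of outstanding requests for the first loop as posted
 * (i outer, k inner). Hypothetical helper, not part of the benchmark. */
size_t peak_outstanding_round_robin(size_t iter_num, size_t nodes, size_t window)
{
    assert(window > 0);
    size_t nreqs = 0, peak = 0;
    for (size_t i = 0; i < iter_num; ++i) {
        for (size_t k = 1; k < nodes; ++k) {
            ++nreqs;                    /* stands in for MPI_Isend */
            if (nreqs > peak)
                peak = nreqs;
        }
        if (nreqs / window > 0 || i == iter_num - 1)
            nreqs = 0;                  /* stands in for MPI_Waitall */
    }
    return peak;
}

/* Peak number of outstanding requests for the second loop as posted
 * (k outer, i inner). Hypothetical helper, not part of the benchmark. */
size_t peak_outstanding_per_peer(size_t iter_num, size_t nodes, size_t window)
{
    assert(window > 0);
    size_t nreqs = 0, peak = 0;
    for (size_t k = 1; k < nodes; ++k) {
        size_t i;
        for (i = 0; i < iter_num; ++i) {
            ++nreqs;                    /* stands in for MPI_Isend */
            if (nreqs > peak)
                peak = nreqs;
        }
        /* Here i == iter_num, so the second condition is never true. */
        if (nreqs / window > 0 || i == iter_num - 1)
            nreqs = 0;                  /* stands in for MPI_Waitall */
    }
    return peak;
}
```

With iter_num = 1000 and an assumed window of 64, the first function reports a peak of 64 for both 5 and 17 nodes, while the second reports 1000. If the real benchmark matches the posted code, reqs must hold at least ITER_NUM entries in the second variant, and the two measurements are not taken at the same window depth.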
Second question - I tried another test. All nodes form node pairs ((0, 1), (2, 3), ... (n - 2, n - 1)) and simple send-recv is used. When the number of node pairs grows large (256 and higher), both message rate and bandwidth per pair degrade significantly. At the same time, one can expect a fat tree to scale nicely in this situation. Any ideas?
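For reference, the pairing described above can be written down as a small helper. This is a minimal sketch in plain C, assuming that consecutive ranks are paired and that the even rank of each pair is the sender. The function names are hypothetical and are not taken from the original test.

```c
#include <assert.h>

/* Partner of `rank` when ranks are paired as (0,1), (2,3), ..., (n-2,n-1).
 * Hypothetical helper; requires an even number of ranks. */
int pair_partner(int rank, int nranks)
{
    assert(nranks > 0 && nranks % 2 == 0);
    assert(rank >= 0 && rank < nranks);
    return rank ^ 1;    /* flip the lowest bit: 0<->1, 2<->3, ... */
}

/* Assumption for this sketch: the even rank of each pair sends first. */
int pair_is_sender(int rank)
{
    return (rank % 2) == 0;
}

/* Zero-based index of the pair that `rank` belongs to. */
int pair_index(int rank)
{
    return rank / 2;
}
```

In an MPI program, each rank would pass the result of pair_partner() as the destination or source of its MPI_Send/MPI_Recv calls. Whether the two ranks of a pair end up under the same leaf switch depends on how ranks are placed on nodes, which the post does not state, so that placement may be worth checking when the per-pair rate drops.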
System config:
2 x Intel Xeon X5570
InfiniBand QDR (fat tree)
Intel MPI 4.0.1
Intel C++ compiler 12.0