Hi.
I have some MPI code in which small messages (LEN = 1-128 bytes)
are sent from one [host] node to several peers. When I send the messages
one per peer, like this:
    for (i = 0; i < ITER_NUM; ++i)
    {
        for (k = 1; k < NODES; ++k)
        {
            MPI_Isend(S_BUF, LEN, MPI_CHAR,
                      k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
        {
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
            nreqs = 0;
        }
    }
the message rate drops from 11.5 million messages/sec on a 5-node configuration (1 host and 4 peers)
to 6.5 million messages/sec on a 17-node setup (1 host and 16 peers). When I change the loop order like this:
    for (k = 1; k < NODES; ++k)
    {
        for (i = 0; i < ITER_NUM; ++i)
        {
            MPI_Isend(S_BUF, LEN, MPI_CHAR,
                      k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
        {
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
            nreqs = 0;
        }
    }
it works well (stable scaling, 11.5 million/sec).
ITER_NUM is about 100 000, and WINDOW there are MPI_Barrier() and time measurement.
Can someone help me understand why the message rate degrades?
Please do not recommend message coalescing. And what should I try in order to
improve scaling? The eager protocol is used (in MPI), and switching to rendezvous
did not help.
Second question: I also tried a different test. All nodes form node pairs ((0, 1), (2, 3), ... (n - 2, n - 1)), and simple send-recv is used. When the number of node pairs grows large (256 and higher), both message rate and bandwidth per pair degrade significantly. At the same time, one would expect a fat tree to scale nicely in this situation. Any ideas?
System config:
2 x Intel Xeon X5570
InfiniBand QDR (fat tree)
Intel MPI 4.0.1
Intel C++ compiler 12.0



