Intel MPI with IMB-MPI1 on all the nodes produces reg_mr Cannot allocate memory

Hi,

We have a small cluster with 8 nodes (12 cores each), spread over 2 blades with 4 nodes per blade. All the nodes are connected via InfiniBand.

Intel MPI is installed and configured with shm:ofa.

I'm starting the following test on all the cores of the cluster:
mpirun -np 96 IMB-MPI1

It generates "normal" results for all the sub-tests. But there is a problem with:
#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 96
#----------------------------------------------------------------

it gives:
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.11 0.15 0.12
1 1000 42.13 42.15 42.14
2 1000 43.61 43.62 43.62
4 1000 52.55 52.57 52.56
8 1000 62.75 62.78 62.77
16 1000 68.49 68.52 68.50
32 1000 80.11 80.13 80.12
64 1000 111.07 111.10 111.09
128 1000 181.19 181.25 181.23
256 1000 368.36 368.52 368.44
512 1000 328.78 328.83 328.80
1024 1000 602.03 603.65 602.17
2048 1000 5873.23 5873.65 5873.45
4096 1000 6000.28 6000.59 6000.43
8192 1000 6965.62 6965.84 6965.75
16384 943 10429.38 10429.66 10429.52
32768 400 25244.62 25245.83 25245.13
65536 223 44969.48 44972.04 44970.70
131072 118 84991.07 84997.68 84994.67
262144 60 167439.02 167466.40 167451.96
524288 31 330707.68 330769.06 330739.70
1048576 16 658785.06 659147.81 658966.23
2097152 8 1314571.62 1315755.52 1315313.50
n08:3914: reg_mr Cannot allocate memory
n08:3914: reg_mr Cannot allocate memory
n08:3915: reg_mr Cannot allocate memory
...

I'm seeing these "reg_mr Cannot allocate memory" for all the nodes...

What exactly is this problem, and how can I solve it?

Thx a lot!
Best regards

Hi Guillaume,

You are probably using Mellanox HCAs. This message usually means that there is not enough memory for buffers; whether you hit it depends on how much memory each node has. Alltoall requires a lot of memory for internal buffers, so the simplest fix is to limit the maximum message size for IMB.
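
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes each rank in the IMB Alltoall test holds one send and one receive buffer of (number of ranks x message size) bytes, and it ignores whatever the MPI library allocates internally, so treat the result as a lower bound rather than a measurement. The message size used is 4194304 bytes, the next size after the last line that completed in the output above.

```shell
#!/bin/sh
# Rough estimate of Alltoall user-buffer memory.
# Assumption: one send + one receive buffer per rank, each ranks * msg_bytes.
ranks=96            # total MPI processes (mpirun -np 96)
ranks_per_node=12   # cores per node
msg_bytes=4194304   # message size that follows the last completed line

per_buffer=$((ranks * msg_bytes))
per_rank=$((2 * per_buffer))
per_node=$((per_rank * ranks_per_node))

echo "per rank: $((per_rank / 1048576)) MiB"
echo "per node: $((per_node / 1048576)) MiB"
```

Under these assumptions this prints 768 MiB per rank and 9216 MiB per node, all of which would have to be registered with the HCA on top of the library's own buffers.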

You can also try the following trick: add this line to /etc/modprobe.conf:
options mlx4_core log_mtts_per_seg=5

It should reduce memory consumed by communication functions.
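
For context, a commonly cited rule of thumb for the mlx4 driver is that the amount of memory that can be registered is about 2^log_num_mtt x 2^log_mtts_per_seg x page size. The sketch below just evaluates that formula. The helper name max_reg_bytes is made up for illustration, and log_num_mtt=20 and a 4096-byte page are assumed example values, not values read from this cluster; the defaults differ between driver versions.

```shell
#!/bin/sh
# Hypothetical helper: evaluate the rule-of-thumb formula
#   max registerable bytes = 2^log_num_mtt * 2^log_mtts_per_seg * page_size
# Usage: max_reg_bytes LOG_NUM_MTT LOG_MTTS_PER_SEG PAGE_SIZE
max_reg_bytes() {
    echo $(( (1 << ($1 + $2)) * $3 ))
}

# Example values only (assumed, not taken from the cluster in this thread).
log_num_mtt=20
page_size=4096

for seg in 3 5; do
    bytes=$(max_reg_bytes "$log_num_mtt" "$seg" "$page_size")
    echo "log_mtts_per_seg=$seg -> $((bytes / 1073741824)) GiB registerable"
done
```

With these example values the loop reports 32 GiB for log_mtts_per_seg=3 and 128 GiB for log_mtts_per_seg=5. On a node with the driver loaded, the current setting can usually be read from /sys/module/mlx4_core/parameters/log_mtts_per_seg. On distributions without /etc/modprobe.conf, the same options line normally goes in a file under /etc/modprobe.d/, and the driver has to be reloaded before the change takes effect.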

Regards!
Dmitry

Hi!

Thx for your useful answer. I will try your ideas! But where can I find out how to limit the message size for IMB? I had the idea, but I couldn't find how... I'm too stupid to google correctly...

Best regards!
Guillaume

You need to provide a file with the explicit list of message lengths to include. I think the default behavior is to include all of them if no file is provided.

$ ./IMB-MPI1 -h
...
-msglen
the argument after -msglen is a lengths_file, an ASCII file, containing any set of nonnegative message lengths, 1 per line
...

For instance, Intel Cluster Checker uses the following list of msglen values to get a quick but still representative sample of results.

$ cat IMB_msglen
0
1
2
4
4194304

Note that you usually get the best latency with a zero payload, and the best bandwidth with a really big payload. As usual, I would recommend some experimentation to optimize those values.
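
As a starting point for that experimentation, a lengths file can be generated rather than typed by hand. This is only a sketch: the file name IMB_msglen_capped and the 1 MiB cap are arbitrary choices for illustration, picked to stay below the 4 MiB size at which the run above failed. The script prints the mpirun command instead of running it, so it can be checked first.

```shell
#!/bin/sh
# Build a lengths file for IMB's -msglen option: 0, then powers of two up to a cap.
lengths_file=IMB_msglen_capped   # hypothetical file name
max_bytes=1048576                # illustrative cap, below the size that failed

{
    echo 0
    n=1
    while [ "$n" -le "$max_bytes" ]; do
        echo "$n"
        n=$((n * 2))
    done
} > "$lengths_file"

# Print the command rather than executing it; Alltoall selects just that benchmark.
echo "mpirun -np 96 IMB-MPI1 -msglen $lengths_file Alltoall"
```

Raising max_bytes step by step should show roughly where the reg_mr errors start on a given cluster.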

Hi!

Great! thx a lot!
