application fails when more than 1 node is used

Hi,

I'm trying to run an application with 64 processes (4 nodes). With ofa I get errors like these:

[42] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x4678680 len=4857680
RTC entry - addr=0x4678680 len=4857680 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
internal ABORT - process 42
[44] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x4b059e0 len=4857680
RTC entry - addr=0x4b059e0 len=4857680 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1338: 0
internal ABORT - process 44
[54] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x44bda20 len=4857680
RTC entry - addr=0x44bda20 len=4857680 cnt=1

With dapl I get:

mn85:7977:4b489740: 22385958 us(22385958 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22387532 us(22387532 us!!!): reg_mr Cannot allocate memory
mn85:7980:b41fa740: 22387325 us(22387325 us!!!): reg_mr Cannot allocate memory
mn85:797c:cc9d3740: 22386986 us(22386986 us!!!): mn85:797b:70f7740: 22387216 us(22387216 us!!!): reg_mr Cannot allocate memory
reg_mr Cannot allocate memory
mn85:7974:b074b740: 22387763 us(22387763 us!!!): reg_mr Cannot allocate memory
mn85:7979:4ca6c740: 22389086 us(22389086 us!!!): reg_mr Cannot allocate memory
mn85:797e:c15a2740: 22388867 us(22388867 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22390126 us(2594 us): reg_mr Cannot allocate memory
mn85:7976:5bee3740: 22389524 us(22389524 us!!!): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391260 us(1134 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391539 us(279 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22391908 us(369 us): reg_mr Cannot allocate memory
mn85:7971:883f1740: 22392231 us(323 us): reg_mr Cannot allocate memory
mn82:7b6a:8d5c2740: 22402315 us(22402315 us!!!): reg_mr Cannot allocate memory
mn82:7b6a:8d5c2740: 22402582 us(267 us): reg_mr Cannot allocate memory

Any suggestions?

Thank you in advance,

Hi José Luis,

Have a look at this post: http://software.intel.com/en-us/forums/topic/329053

I had a similar problem some time ago, and I managed to solve it.

Let me know if it also works for you.

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain

By the way, here are some other modifications that I had to make:

I found that our InfiniBand switch has a limit on the maximum amount of registerable memory. In our case we have a Mellanox switch, and the people at Mellanox recommend setting the value of:

(2^log_num_mtt)*(2^log_mtts_per_seg)*PAGE_SIZE

to at least double the physical memory available on the nodes (link1, link2). You can check the values of these parameters with:

getconf PAGE_SIZE 
cat /sys/module/mlx4_core/parameters/log_num_mtt 
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
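
As a rough sketch, the three values above can be plugged into the formula and compared against twice the physical memory. The helper name max_reg_mem_bytes is made up for this example, and reading MemTotal from /proc/meminfo is an assumption about how to obtain the node's physical memory; neither comes from Mellanox.

```shell
#!/bin/sh
# Sketch: evaluate (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE.

# max_reg_mem_bytes LOG_NUM_MTT LOG_MTTS_PER_SEG PAGE_SIZE
# (hypothetical helper name; prints the registerable-memory limit in bytes)
max_reg_mem_bytes() {
    echo $(( (1 << $1) * (1 << $2) * $3 ))
}

params=/sys/module/mlx4_core/parameters
if [ -r "$params/log_num_mtt" ] && [ -r "$params/log_mtts_per_seg" ]; then
    limit=$(max_reg_mem_bytes "$(cat "$params/log_num_mtt")" \
                              "$(cat "$params/log_mtts_per_seg")" \
                              "$(getconf PAGE_SIZE)")
    # Assumption: MemTotal in /proc/meminfo (reported in kB) stands in
    # for the physical memory of the node.
    ram=$(( $(awk '/^MemTotal:/ {print $2}' /proc/meminfo) * 1024 ))
    echo "registerable limit: $limit bytes, physical RAM: $ram bytes"
    if [ "$limit" -lt $(( 2 * ram )) ]; then
        echo "limit is below twice the physical memory"
    fi
else
    echo "mlx4_core parameters not found on this host" >&2
fi
```

For example, with log_num_mtt=24, log_mtts_per_seg=3 and a 4096-byte page, the formula gives 2^39 bytes, i.e. 512 GiB.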

Mellanox recommends changing only log_num_mtt. To do so, edit the file /etc/modprobe.conf and add the following line at the end of the file: options mlx4_core log_num_mtt=24. Then restart the InfiniBand network by doing the following on all the nodes of the cluster:

Stop opensm service: /etc/init.d/opensmd stop 
Restart IB: /etc/init.d/openibd restart 
Start opensm: /etc/init.d/opensmd start 
Check the changes: cat /sys/module/mlx4_core/parameters/log_num_mtt
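
If you are unsure which value to use, here is a minimal sketch that searches for the smallest log_num_mtt satisfying the "at least double the physical memory" rule. The helper name min_log_num_mtt is invented for this example, the fallback log_mtts_per_seg=3 is an assumption used only when the sysfs file is not readable, and MemTotal from /proc/meminfo is assumed to represent the physical memory. It only prints a candidate options line; it does not edit /etc/modprobe.conf.

```shell
#!/bin/sh
# Sketch: smallest log_num_mtt for which
# (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE >= 2 * physical RAM.

# min_log_num_mtt RAM_BYTES LOG_MTTS_PER_SEG PAGE_SIZE
# (hypothetical helper name, not part of any Mellanox tool)
min_log_num_mtt() {
    target=$(( 2 * $1 ))
    n=0
    while [ $(( (1 << n) * (1 << $2) * $3 )) -lt "$target" ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

seg_file=/sys/module/mlx4_core/parameters/log_mtts_per_seg
if [ -r "$seg_file" ]; then
    seg=$(cat "$seg_file")
else
    seg=3   # assumption: fallback used when the driver parameter is not readable
fi

if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB.
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if [ -n "$ram_kb" ]; then
        n=$(min_log_num_mtt $(( ram_kb * 1024 )) "$seg" "$(getconf PAGE_SIZE)")
        echo "options mlx4_core log_num_mtt=$n"
    fi
fi
```

Under these assumptions, a node with 64 GiB of RAM, log_mtts_per_seg=3 and 4096-byte pages needs log_num_mtt=22, so the value 24 used above leaves some headroom.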

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain

Hi José Luis,

There must have been some problems with a previous post I made. In that post I just wrote this link:

http://software.intel.com/en-us/forums/topic/329053

In case it could be helpful.

Regards

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain

Dear Ivan,

thank you for your help. 

I changed log_num_mtt to 24 and the app is running fine now.

Regards,

José Luis

Dear José Luis,

I am glad to know that it also worked for you.

Regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
