Memory allocation error in running IMB3.2

Hello,

I am trying to run the Intel MPI Benchmarks (IMB) 3.2 in the following Linux cluster environment.

* Intel MPI version 4.0.3.008
* Intel mpiicc version 12.1.0 (gcc version 4.4.5 compatibility)
* Red Hat Enterprise Linux Server release 6.1 (Santiago)
* Compute node kernel: 2.6.32-131.0.15.el6.x86_64
* Compute node memory: 64GB
* InfiniBand: Mellanox FDR
* Intel MPI Benchmark 3.2
* Number of parallel processes: 64

When the 'Gather' test of IMB 3.2 runs, the following DAPL-related error message is displayed and the program terminates abnormally. The stdout output indicates that execution was terminated due to a lack of memory.

<<<<< stderr message >>>>>
[13:compute001] rtc_register failed 196608 [13] error(0x30000): unknown error
Assertion failed in file ../../dapl_send_rc.c at line 2515: 0
internal ABORT - process 13

<<<<< stdout message in gather test >>>>>
#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 64
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.08         0.08         0.08
            4         1000         7.34         7.65         7.52
            8         1000         7.71         8.04         7.90
           16         1000         7.72         8.04         7.90
           32         1000         7.84         8.17         8.02
           64         1000         8.33         8.71         8.54
          128         1000         9.05         9.43         9.26
          256         1000        10.69        11.13        10.94
          512         1000        13.82        14.41        14.15
         1024         1000        31.95        33.34        32.74
         2048         1000        41.31        43.19        42.43
         4096         1000        50.46        52.85        51.86
         8192         1000        69.83        73.02        71.56
        16384         1000       106.32       110.92       108.86
        32768         1000       318.69       321.51       320.61
        65536          640       680.75       691.92       689.43
       131072          320      1053.18      1162.70      1148.60
       262144          160      1647.16      2049.80      1987.91
       524288           80      4247.39      5140.53      5017.29
compute001:525:abea5b20: 88868811 us(88868811 us!!!): reg_mr Cannot allocate memory
compute001:525:abea5b20: 88868977 us(166 us): reg_mr Cannot allocate memory
compute001:525:abea5b20: 88869080 us(103 us): reg_mr Cannot allocate memory
compute001:525:abea5b20: 88869185 us(105 us): reg_mr Cannot allocate memory
 :
compute001:526:abea5b20: 100784924 us(120 us): reg_mr Cannot allocate memory
compute001:526:abea5b20: 100785037 us(113 us): reg_mr Cannot allocate memory
compute001:526:abea5b20: 100785149 us(112 us): reg_mr Cannot allocate memory
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

I am using the following command to run IMB under LSF.

---------------------------
IMBFLAGS="-npmin 64 -mem 1 -msglen Lengths Sendrecv Exchange Bcast Allgather \
 Allgatherv Gather Gatherv Scatter Scatterv Alltoall Alltoallv Reduce \
 Reduce_scatter Allreduce Barrier"
HOSTS="compute001 compute002 compute003 compute004"
$ bsub -m "$HOSTS" -n 64 -q normalq1 -o stdout -e stderr -J IMB -W 1:00 -r \
 mpirun -n 64 ./IMB_MPI1 $IMBFLAGS
---------------------------

With the "-msglen" option, the message sizes are read from the text file "Lengths", which contains the following.

---------------------------
0
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
---------------------------

Does anybody know how to deal with this situation?

Because the failure seems to be caused by a lack of memory, I tried changing the -mem 1 option to -mem 2 and -mem 3, but the situation didn't change. Would it be a possible solution to change the line

#define MAX_MEM_USAGE 1

in the header file 'IMB_mem_info.h'?

If anybody has any hints regarding this issue, please let me know. I would really appreciate it.

Best regards

Road
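
For scale, here is a rough sketch of the payload arithmetic for the message size at which the run stops. It assumes the root of a Gather receives one message of the listed size from each of the 64 ranks. It counts payload only, not whatever the MPI library or the DAPL layer registers internally. The function name gather_root_bytes is made up for illustration.

```shell
#!/bin/sh
# Payload gathered at the root: one message of msg_bytes from each of nprocs ranks.
# (Assumption: payload only; internal MPI/DAPL buffers are not counted.)
gather_root_bytes() {
    nprocs=$1
    msg_bytes=$2
    echo $(( nprocs * msg_bytes ))
}

gather_root_bytes 64 524288     # last size that completed  -> 33554432 (32 MiB)
gather_root_bytes 64 1048576    # next size in Lengths file -> 67108864 (64 MiB)
```

So the first size that fails corresponds to about 64 MiB of gathered payload, which is small next to 64GB of node memory. This would be consistent with a limit on memory registration rather than exhausted physical memory, although the output above does not prove that.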

Hi Road,

Can you run with I_MPI_DEBUG=5 and attach the output as a text file? While this should be working on Version 4.0 Update 3, have you tried using Version 4.1?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
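
One way the requested debug run could be captured is sketched below. It assumes Intel MPI's mpirun is on PATH and that its -genv option is used to pass the variable to all ranks. Restricting the run to Gather and redirecting to run.out.64.txt are choices made for this sketch, not part of the request above.

```shell
#!/bin/sh
# I_MPI_DEBUG=5 is the setting requested in the reply above.
export I_MPI_DEBUG=5

# Run only if mpirun and the benchmark binary are available on this machine.
# Assumption: -genv NAME VALUE sets the variable for every rank.
if command -v mpirun >/dev/null 2>&1 && [ -x ./IMB_MPI1 ]; then
    mpirun -genv I_MPI_DEBUG "$I_MPI_DEBUG" -n 64 ./IMB_MPI1 -npmin 64 Gather \
        > run.out.64.txt 2>&1
fi
```

Under LSF the same mpirun line would go after the bsub options, as in the original command.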

Hi James,

Thank you for the reply and the useful information.
Yes, I am running the job with the I_MPI_DEBUG=5 option.
Please refer to the attached output, which includes the error above.
Does it include any important messages?

Regarding the MPI version, I will run with Version 4 Update 3 and check if it works.

Regards

Attachment: run.out.64.txt (1.7 MB)
Road

Hi Road,

Please try setting

ulimit -l unlimited

and see if that helps.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

On the compute nodes, max. locked memory is already unlimited.

max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited

Is there anything else that could be causing the error?

Regards
Road
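
Limits reported by an interactive shell on a node can differ from those inherited by processes that a batch system starts, so it may be worth printing the limit from inside the job. A minimal sketch follows. The function name report_memlock and the script name check_memlock.sh are made up for illustration.

```shell
#!/bin/bash
# Print the locked-memory limit seen by this process, tagged with the node name.
report_memlock() {
    printf '%s: max locked memory = %s\n' "$(uname -n)" "$(ulimit -l)"
}

report_memlock
```

Saved as check_memlock.sh and launched through the same bsub and mpirun path as the benchmark (for example, mpirun -n 64 ./check_memlock.sh), every rank should print "unlimited" if the setting really reaches the MPI processes.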

Hi Road,

Does the error still occur with fewer processes per node?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

The runs with 16 cores (1 node) and 32 cores (2 nodes)
complete successfully without errors.

Regards
Road
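
Both runs above still place 16 processes on each node (16 / 1 and 32 / 2), the same density as the failing 64-process run on four nodes. A sketch of a run with fewer processes per node on the same four hosts is shown below. It assumes LSF's span[ptile=N] resource requirement is available to cap the number of ranks per host, and the value 8 is only an example.

```shell
#!/bin/sh
NODES=4
PPN=8                      # example value; the failing run used 64 / 4 = 16
NP=$(( NODES * PPN ))
echo "ranks: $NP ($PPN per node)"

# Submit only where LSF is installed.
# Assumption: span[ptile=N] limits the ranks placed on each host.
if command -v bsub >/dev/null 2>&1; then
    bsub -m "compute001 compute002 compute003 compute004" -n "$NP" \
         -R "span[ptile=$PPN]" -q normalq1 -o stdout -e stderr -J IMB -W 1:00 \
         mpirun -n "$NP" ./IMB_MPI1 -npmin "$NP" Gather
fi
```

If this 32-process run on four nodes passes while 64 processes fail, that would suggest the process count or the total number of connections matters more than the per-node density.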
