IMB-MPI1.4.0.3 fails with Signal 1 hangup errors

We have several new IBM iDataPlex systems. Some of our codes, compiled with the Intel 12.1 compilers and Intel MPI 4.0.3, intermittently fail with this error:

"APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)"

I can consistently reproduce this error with the Intel IMB-MPI1 4.0.3 benchmark suite on two nodes (32 cores).

The error happens in the Allgatherv benchmark with 32 processes, right after the 8192-byte message size (see below).

*BUT* if I run ONLY the Allgatherv benchmark, it completes with no problems. It appears that a previous MPI function call leaves the system in some state that causes Allgatherv to fail.

#----------------------------------------------------------------
# Benchmarking Allgatherv
# #processes = 32
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.18 0.19 0.18
1 1000 55.13 55.16 55.14
2 1000 55.58 55.60 55.59
4 1000 55.42 55.44 55.43
8 1000 54.99 55.02 55.01
16 1000 56.93 56.95 56.94
32 1000 60.37 60.37 60.37
64 1000 60.45 60.45 60.45
128 1000 59.13 59.14 59.13
256 1000 152.55 152.59 152.57
512 1000 152.85 152.90 152.88
1024 1000 92.38 92.39 92.39
2048 1000 198.94 199.08 198.98
4096 1000 244.89 245.09 244.97
8192 1000 323.58 323.74 323.70
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
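
The two scenarios described above (the full suite versus Allgatherv alone) can be sketched as the dry run below. This is only an illustration, not the exact command line used in this post: the process count comes from the post, while the binary path `./IMB-MPI1` and the omission of any host or machine-file options are assumptions. The script prints the commands instead of launching them.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the two command lines, does not start MPI.
# NP is taken from the post; IMB (the binary path) is an assumption.
NP=32
IMB=./IMB-MPI1

# Scenario 1: no benchmark name given, so IMB-MPI1 runs its whole default
# list in sequence. This is the case that fails in Allgatherv.
full_cmd="mpirun -np $NP $IMB"

# Scenario 2: name a single benchmark so that only Allgatherv runs.
# This is the case that completes without error.
single_cmd="mpirun -np $NP $IMB Allgatherv"

echo "full suite:      $full_cmd"
echo "Allgatherv only: $single_cmd"
```

Comparing the two runs with otherwise identical settings helps confirm that the failure depends on what ran before Allgatherv rather than on Allgatherv itself.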


Hi jianni,

Thanks for posting. Unfortunately, this is not enough information to determine the cause of the failure. It's a pretty generic error. Can you set I_MPI_DEBUG=5 and send us the output? That might provide more information on what the Intel MPI Library is doing before it hits the error.

Based on your description below, it seems like this happens when running 32 MPI ranks. Does it happen for any job over 32 ranks or specifically 32? How about below 32 ranks?

It would also be great to know your command line, and whether you're setting any Intel MPI-specific environment variables. Also, are you running over an InfiniBand network or just TCP? (The I_MPI_DEBUG output will give us some of this data.) If you are running over IB, it would be interesting to see whether using regular Ethernet improves the situation (you can do that by setting I_MPI_FABRICS=shm:tcp).
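
As a rough sketch of the two diagnostic runs suggested above, the variables can be set like this in a Bourne-style shell (in csh/tcsh, `setenv NAME value` is the equivalent). `I_MPI_DEBUG` and `I_MPI_FABRICS` are the variables named in this post; the process count, the binary path `./IMB-MPI1`, and the lack of host options are assumptions. The script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of launching MPI.
# NP and IMB are assumptions; adjust them to your own setup.
NP=32
IMB=./IMB-MPI1

# Ask the Intel MPI Library for more verbose debug output.
export I_MPI_DEBUG=5

# Run 1: leave I_MPI_FABRICS unset so the library chooses the fabric itself.
unset I_MPI_FABRICS
echo "I_MPI_DEBUG=$I_MPI_DEBUG mpirun -np $NP $IMB"

# Run 2: shared memory within a node, TCP between nodes (bypasses InfiniBand).
export I_MPI_FABRICS=shm:tcp
echo "I_MPI_DEBUG=$I_MPI_DEBUG I_MPI_FABRICS=$I_MPI_FABRICS mpirun -np $NP $IMB"
```

If the first run fails and the second completes, that points toward the InfiniBand path rather than the benchmark or the application code.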

Looking forward to hearing back soon.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Our system, Pershing, is an IBM iDataPlex. The login and compute nodes are populated with dual Intel Sandy Bridge 8-core processors, so each compute node has 16 cores sharing 32 GBytes of memory, and runs its own 64-bit Red Hat Enterprise Linux 6.2 (Santiago). Pershing uses the FDR-10 InfiniBand interconnect (Mellanox) in a fat-tree configuration. It uses IBM's General Parallel File System (GPFS) to manage its parallel file system, which targets IBM's IS4600 (Infinite Storage) RAID arrays.
The Intel MPI 4.0.3 benchmark runs to completion over TCP when I set "setenv I_MPI_FABRICS shm:tcp". However, the IB mode still fails. I've uploaded the output with I_MPI_DEBUG set to 5. I would like to emphasize that codes compiled with Open MPI and MPICH work fine. Thanks.

Attachment: imb-mpi1.4.0.3.txt (75.3 KB)
