MPI Error - Assertion failed

Leonardo Oliveira:

Hello everyone,
My name is Leonardo from Brazil and this is my first post.
I'm running an MPI program with Intel implementation (Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910).
The "program" is well-know algorithm called k-means. The algorithm is used to identify natural clusters within sets of data points; its input is a set of data points and an integer k, and its output is an assignment of each point to one of k clusters.
When I run it with 160 data points on 3 nodes, everything goes fine. It was OK with 1.6K as well, but when I run it with 160K data points the following error appears:

Assertion failed in file ../../dapl_module_poll.c at line 3608: *p_vc_unsignal_sr_before_read == 0
internal ABORT - process 1
[2:super3] unexpected disconnect completion event from [1:super3]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 2
[0:super3] unexpected disconnect completion event from [1:super3]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 0
srun: error: super3: tasks 0-2: Exited with exit code 1
srun: Terminating job step 75987.0

I have no idea what this could be...
Has anyone experienced this before? Or does anyone have an idea of what is going on?

Thanks, and sorry about my English.

Obrigado,
Leonardo Fernandes

James Tullos (Intel):

Hi Leonardo,

Additional information will be needed to determine the exact cause of this error message. What program are you using? What command line are you using to run the program? How are you compiling the program? What is your operating system version? What is your hardware configuration (processor, memory, node interconnect method, etc.)?

My immediate guess based on the behavior you are seeing is that you are overusing a system resource somewhere, since you are not experiencing this until you are at a larger number of data points.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Leonardo Oliveira:

Hi James, thank you for your reply.
I'll gather all that information about our cluster, but first let me give an update...
I changed the interface from InfiniBand to Ethernet (I_MPI_FABRICS=tcp), and after that I could increase the number of data points (and nodes)... but another problem appeared.

Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b99c1574108) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
slurmd[super10]: *** STEP 76081.0 CANCELLED AT 2012-01-31T11:19:04 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: super11: tasks 3-4: Terminated
srun: Terminating job step 76081.0
srun: error: super10: tasks 0-2: Exited with exit code 1

....

James Tullos (Intel):

Hi Leonardo,

Have you been able to get the cluster information? Is there a way I can get a copy of this program to run and attempt to reproduce the behavior you are seeing?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Leonardo Oliveira:

Hi James,
Sorry for the late reply (this is a master's thesis project... last month I needed to write instead of hacking).
Well, in my project I'm using a Haskell framework for distributed environments that was originally implemented with TCP sockets (very similar to Erlang), and I changed the transport layer to use MPI (to summarize briefly).
I can send you the code, but it will take some work on your end to run Haskell code.
-------
About our cluster:
There are 72 Bull NovaScale nodes running GNU/Linux 2.6.18-128, each with two quad-core Intel Xeon processors.
The MPI implementation is: Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved
-------
Returning to the problem...
I set I_MPI_DEBUG=100 and the output was:

[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[1] MPI startup(): tcp data transfer mode
[9] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[10] MPI startup(): tcp data transfer mode
[3] MPI startup(): tcp data transfer mode
[11] MPI startup(): tcp data transfer mode
[4] MPI startup(): tcp data transfer mode
[12] MPI startup(): tcp data transfer mode
[5] MPI startup(): tcp data transfer mode
[6] MPI startup(): tcp data transfer mode
[13] MPI startup(): tcp data transfer mode
[7] MPI startup(): tcp data transfer mode
[14] MPI startup(): tcp data transfer mode
[0] MPI startup(): tcp data transfer mode
[8] MPI startup(): tcp data transfer mode
[15] MPI startup(): tcp data transfer mode
[1] MPI startup(): Recognition level=1. Platform code=1. Device=4
[1] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[8] MPI startup(): Recognition level=1. Platform code=1. Device=4
[8] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[2] MPI startup(): Recognition level=1. Platform code=1. Device=4
[2] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[10] MPI startup(): Recognition level=1. Platform code=1. Device=4
[10] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[3] MPI startup(): Recognition level=1. Platform code=1. Device=4
[3] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[9] MPI startup(): Recognition level=1. Platform code=1. Device=4
[9] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[4] MPI startup(): Recognition level=1. Platform code=1. Device=4
[4] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[11] MPI startup(): Recognition level=1. Platform code=1. Device=4
[11] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[5] MPI startup(): Recognition level=1. Platform code=1. Device=4
[5] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[12] MPI startup(): Recognition level=1. Platform code=1. Device=4
[12] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[7] MPI startup(): Recognition level=1. Platform code=1. Device=4
[7] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[13] MPI startup(): Recognition level=1. Platform code=1. Device=4
[13] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[6] MPI startup(): Recognition level=1. Platform code=1. Device=4
[15] MPI startup(): Recognition level=1. Platform code=1. Device=4
[15] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[6] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[14] MPI startup(): Recognition level=1. Platform code=1. Device=4
[0] MPI startup(): Recognition level=1. Platform code=1. Device=4
[0] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)

Device_reset_idx=0
[0] MPI startup(): Allgather: 1: 0-128 & 16-511
[0] MPI startup(): Allgather: 1: 0-16 & 0-2147483647
[0] MPI startup(): Allgather: 4: 17-512 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-1024 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 1024-2048 & 32-511
[0] MPI startup(): Allgatherv: 1: 2048-4096 & 32-63
[0] MPI startup(): Allgatherv: 1: 2048-4096 & 256-511
[0] MPI startup(): Allgatherv: 2: 1024-16384 & 512-2147483647
[0] MPI startup(): Allgatherv: 2: 2048-4096 & 64-255
[0] MPI startup(): Allgatherv: 4: 4096-65536 & 256-511
[0] MPI startup(): Allgatherv: 4: 16384-262144 & 512-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 0-255 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 256-511 & 0-63
[0] MPI startup(): Allreduce: 1: 256-511 & 256-511
[0] MPI startup(): Allreduce: 2: 512-1048575 & 16-511
[0] MPI startup(): Allreduce: 2: 256-2097151 & 64-255
[0] MPI startup(): Allreduce: 2: 1024-2147483647 & 256-2147483647
[0] MPI startup(): Allreduce: 5: 256-1023 & 512-2147483647
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 1: 0-16 & 9-2147483647
[0] MPI startup(): Alltoall: 1: 17-256 & 17-2147483647
[0] MPI startup(): Alltoall: 1: 129-512 & 9-64
[0] MPI startup(): Alltoall: 2: 17-128 & 9-16
[0] MPI startup(): Alltoall: 2: 513-1024 & 0-16
[0] MPI startup(): Alltoall: 2: 1025-524288 & 0-8
[0] MPI startup(): Alltoall: 2: 2049-2147483647 & 9-16
[0] MPI startup(): Alltoall: 3: 4097-2147483647 & 33-2147483647
[0] MPI startup(): Alltoall: 3: 4097-16384 & 17-32
[0] MPI startup(): Alltoall: 3: 32769-1048576 & 17-32
[14] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[0] MPI startup(): Alltoall: 3: 2097153-2147483647 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 32-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 4: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 2: 0-255 & 32-2147483647
[0] MPI startup(): Gather: 2: 256-2048 & 32-127
[0] MPI startup(): Gather: 2: 256-1024 & 128-255
[0] MPI startup(): Gather: 2: 256-511 & 256-511
[0] MPI startup(): Gather: 2: 131072-262143 & 512-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 1024-65536 & 256-2147483647
[0] MPI startup(): Reduce_scatter: 1: 512-32768 & 16-511
[0] MPI startup(): Reduce_scatter: 1: 5-512 & 16-63
[0] MPI startup(): Reduce_scatter: 1: 128-512 & 64-127
[0] MPI startup(): Reduce_scatter: 1: 256-512 & 128-256
[0] MPI startup(): Reduce_scatter: 2: 32768-131072 & 16-31
[0] MPI startup(): Reduce_scatter: 2: 524288-2147483647 & 16-31
[0] MPI startup(): Reduce_scatter: 2: 1048576-2147483647 & 32-63
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 16-511
[0] MPI startup(): Reduce_scatter: 4: 131072-1048576 & 16-63
[0] MPI startup(): Reduce_scatter: 4: 262144-2147483647 & 64-127
[0] MPI startup(): Reduce_scatter: 4: 524288-2097152 & 128-255
[0] MPI startup(): Reduce_scatter: 4: 1048576-2147483647 & 256-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 1: 0-63 & 32-2147483647
[0] MPI startup(): Scatter: 1: 64-127 & 32-511
[0] MPI startup(): Scatter: 1: 128-255 & 32-255
[0] MPI startup(): Scatter: 1: 256-511 & 32-127
[0] MPI startup(): Scatter: 1: 512-1023 & 32-63
[0] MPI startup(): Scatter: 2: 128-255 & 256-511
[0] MPI startup(): Scatter: 2: 256-511 & 128-255
[0] MPI startup(): Scatter: 2: 512-2047 & 64-127
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647
[0] Rank    Pid      Node name  Pin cpu
[0] 0       21266    super2     n/a
[0] 1       21267    super2     n/a
[0] 2       21268    super2     n/a
[0] 3       21269    super2     n/a
[0] 4       21270    super2     n/a
[0] 5       21271    super2     n/a
[0] 6       21272    super2     n/a
[0] 7       21273    super2     n/a
[0] 8       12678    super3     n/a
[0] 9       12679    super3     n/a
[0] 10      12680    super3     n/a
[0] 11      12681    super3     n/a
[0] 12      12682    super3     n/a
[0] 13      12683    super3     n/a
[0] 14      12684    super3     n/a
[0] 15      12685    super3     n/a
[0] MPI startup(): I_MPI_DEBUG=100
[0] MPI startup(): I_MPI_FABRICS=tcp
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b5b75904468) failed
MPIDI_CH3I_Progress(401).......: 
MPID_nem_tcp_poll(2332)........: 
MPID_nem_tcp_connpoll(2582)....: 
state_commrdy_handler(2208)....: 
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2b1215f664b8) failed
MPIDI_CH3I_Progress(401).......: 
MPID_nem_tcp_poll(2332)........: 
MPID_nem_tcp_connpoll(2582)....: 
state_commrdy_handler(2208)....: 
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2b46d71664b8) failed
MPIDI_CH3I_Progress(401).......: 
MPID_nem_tcp_poll(2332)........: 
MPID_nem_tcp_connpoll(2582)....: 
state_commrdy_handler(2208)....: 
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b1216204468) failed
MPIDI_CH3I_Progress(401).......: 
MPID_nem_tcp_poll(2332)........: 
MPID_nem_tcp_connpoll(2582)....: 
state_commrdy_handler(2208)....: 
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2ba934a664b8) failed
MPIDI_CH3I_Progress(401).......: 
MPID_nem_tcp_poll(2332)........: 
MPID_nem_tcp_connpoll(2582)....: 
state_commrdy_handler(2208)....: 
MPID_nem_tcp_recv_handler(2081): socket closed
....
srun: error: super3: tasks 8-15: Exited with exit code 1
srun: Terminating job step 77374.0
srun: error: super2: tasks 1-7: Exited with exit code 1

-------------------------------------------------------------------------------

That was with 16 copies of the program.

James Tullos (Intel):

Hi Leonardo,

Unfortunately, we do not officially support Haskell. I will do what I can to help resolve your issue.

It looks like you are having a network connection issue. What network card(s) are in the nodes? What is the output from ifconfig?

Also, how much memory is your program using? How much is available per node? Have you tried running on more nodes, with fewer processes per node?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Leonardo Oliveira:

Hi James,
My Haskell code is just a binding to the MPI C library (Intel's, in the case of our cluster)...
Let me test what you suggested... more nodes with fewer processes per node. But I don't think that is the problem, because we have the same Haskell library using TCP sockets as the low-level communication... and it works fine.
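
To make it concrete: the error stack points at MPI_Probe with MPI_ANY_SOURCE, which is how the binding waits for incoming messages. A minimal standalone C sketch of that probe-then-receive pattern (my assumption of the shape, not the actual binding code) would be:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Collect one message from every other rank. */
        for (int i = 1; i < size; ++i) {
            MPI_Status status;
            int count;

            /* Block until a tag-0 message from any rank is available.
               This is the call that aborts with "socket closed" when a
               peer's TCP connection drops. */
            MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_BYTE, &count);

            char *buf = malloc((size_t)count);
            MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %d bytes from rank %d\n",
                   count, status.MPI_SOURCE);
            free(buf);
        }
    } else {
        char msg[] = "hello from a worker";
        MPI_Send(msg, (int)sizeof msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If even a small case like this fails on the same nodes, the problem is probably in the fabric or network setup rather than in my code.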

Leonardo Oliveira:

Hi James,
Some problems were resolved, but this one continues.
...and one message left me with doubts.

When I run the example with the variable set to:
export I_MPI_FABRICS=dapl
It shows me the error:
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=dapl
[0] MPI startup(): I_MPI_PLATFORM=auto
[1:super7] unexpected disconnect completion event from [0:super7]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 1
[12:super8] unexpected disconnect completion event from [0:super7]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 12
...

When I set:
export I_MPI_DAPL_UD=on
It shows me:
[7] dapl fabric is not available and fallback fabric is not enabled
[8] dapl fabric is not available and fallback fabric is not enabled
[15] dapl fabric is not available and fallback fabric is not enabled
...

When I run with:
export I_MPI_FABRICS=tcp
It shows me:
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_PLATFORM=auto
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2ba149084ef0) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed

It's difficult to understand these errors...
What can I do to find the root cause of this problem?

James Tullos (Intel):

Hi Leonardo,

The first and third look like network errors. Are the systems having any connectivity or stability issues?

The second could be due to an incorrect configuration. Can you please send the /etc/dat.conf file from the system?

Can you run with -verbose and send the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
