Trace Collector fails on big data

Hello,

I have the following problem with Trace Collector.

When I run my instrumented application, I get this message:

"
[0] Intel Trace Collector INFO: Writing tracefile Konraz29.run.stf in /home/users/vadim/2_new_Konraz/2_new_Konraz_ITAC
Assertion failed in file ../../dapl_module_poll.c at line 3473: rreq != ((void *)0)
internal ABORT - process 8
[24:node1-128-07] unexpected disconnect completion event from [8:node1-128-05]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 24
...
"
and so on (the whole error message is rather long, but it mostly repeats the same lines).

I run this application on 196 processes. When I use a smaller input (so the application runs for less time) and a smaller number of processes, it seems to work OK.

What can be the problem with it?

P.S. I use the impi-4.0.1 MPI library and the intel-12.0 compiler.

Hi Vadim,

This happens because of the "unexpected disconnect completion event". As you may guess, it comes from the DAPL module (communication). The reason why it happens is unclear. Can you try to run this application on another set of nodes (excluding node1-128-05)?

Also, it's quite important to know how you compile the application and how you run it.
BTW, which version of the Intel Trace Collector and Analyzer do you use?

Could you set I_MPI_FABRICS=shm:dapl and I_MPI_DAPL_UD=on and give it a try?
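
For reference, a minimal sketch of applying these two settings in the shell before submitting the job. It assumes a Bourne-style shell and that the batch system passes the submitting shell's environment on to the MPI processes, which depends on the site's `impi` wrapper script and is not confirmed in this thread.

```shell
# Shared memory within a node, DAPL between nodes,
# with the DAPL UD (unreliable datagram) mode switched on.
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_UD=on

# List the Intel MPI settings the job would inherit.
env | grep '^I_MPI_' | sort

# Then submit as usual (command as used in this thread, left commented out here):
# sbatch --partition=test -n 256 impi ./Konraz29.run
```

If the variables do not reach the compute nodes this way, they may need to be set inside the job script instead.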

Regards!
Dmitry

Hello Dmitry,

I've tried other nodes, but the same thing happened.

Also, I've tried to set I_MPI_FABRICS=shm:dapl and I_MPI_DAPL_UD=on, but got these error messages:
"
[4] dapl fabric is not available and fallback fabric is not enabled
...
"

This message was repeated for several other numbers, not only [4]. Does the number mean the process number? If so, the message is not shown for all processes.

Compile commands:
mpicxx -O3 -DUSE_MPI -c Konraz29.cpp
mpicxx Konraz29.o -L$VT_LIB_DIR -lVT $VT_ADD_LIBS -o Konraz29.run

Run command:
sbatch --partition=test -n 256 impi ./Konraz29.run

Version of ITAC is 8.0.1.009.

Hi Vadim,

Do you have the Intel Compiler?
Could you please compile your application in the following way:
mpiicpc -O3 -DUSE_MPI -trace -o Konraz29.run Konraz29.cpp
and run it as usual?

I hope that you are using Intel MPI Library.

To get additional information, you can set I_MPI_DEBUG=5. It's very strange that your first message shows an error from the DAPL library, while your previous message says that the DAPL fabric was not available.
Could you try to run your application with I_MPI_FABRICS=shm:tcp in this case?
And after that with I_MPI_FABRICS=shm:dapl. It may be that not all nodes in your cluster have InfiniBand cards.
Setting I_MPI_DEBUG may help you understand what is going on.
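
The two runs could be set up roughly like this. This is only a sketch: `try_fabric` is a hypothetical helper name, and the submit command is left commented out because how the environment reaches the job depends on the local `impi` wrapper.

```shell
# Hypothetical helper: apply the debug level and one fabric choice,
# then print the settings so they can be checked before submitting.
try_fabric() {
    export I_MPI_DEBUG=5
    export I_MPI_FABRICS="$1"
    echo "I_MPI_FABRICS=$I_MPI_FABRICS I_MPI_DEBUG=$I_MPI_DEBUG"
    # sbatch --partition=test -n 256 impi ./Konraz29.run
}

# First TCP between nodes, then DAPL, as suggested above.
try_fabric shm:tcp
try_fabric shm:dapl
```

Comparing the debug output of the two runs should show which fabric each process actually selected.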

Although I hope you understand that running 256 processes will produce a huge trace.

Regards!
Dmitry

Hello Dmitry,

I've tried to compile with

mpiicpc -O3 -DUSE_MPI -trace -o Konraz29.run Konraz29.cpp

But the same thing happened.

After that, I used I_MPI_DEBUG=5 and I_MPI_FABRICS=shm:tcp and got this error message:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(176)..........................: MPI_Send(buf=0xc175a0, count=65520, MPI_CHAR, dest=0, tag=1045, comm=0x84000000) failed
MPIDI_CH3I_Progress(401)...............:
MPID_nem_tcp_poll(2332)................:
MPID_nem_tcp_connpoll(2582)............:
state_commrdy_handler(2208)............:
MPID_nem_tcp_recv_handler(2098)........:
MPID_nem_tcp_handle_pkt(1821)..........:
MPIDI_CH3_PktHandler_EagerSend(618)....: failure occurred while posting a receive for message data (MPIDI_CH3_PKT_EAGER_SEND)
MPIDI_CH3U_Receive_data_unexpected(250): Out of memory
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(176)................: MPI_Send(buf=0xbd23a0, count=65520, MPI_CHAR, dest=8, tag=1045, comm=0x84000000) failed
MPIDI_CH3I_Progress(401).....:
MPID_nem_tcp_poll(2332)......:
MPID_nem_tcp_connpoll(2504)..:
state_commrdy_handler(2213)..:
MPID_nem_tcp_send_queued(122): writev to socket failed - Connection reset by peer
Fatal error in MPI_Send: Other MPI error, error stack:

Also, I now use 64 processes.

Contacting through e-mail to get more information.
