Intel MPI: MPI_Comm_connect with I_MPI_FABRICS tmi results in an error


Stefan S.

Hello,

I have two programs which are connected at runtime via MPI_Comm_connect.
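(For context, the two programs use the standard MPI port connect/accept pattern. Below is a minimal sketch of what I mean; the actual ping.exe/pong.exe sources are not included here, and how the port string gets from one program to the other is left out.)

/* Accept side -- roughly what pong.exe does (simplified sketch). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    /* Open a port and make its name available to the other program
     * (via a file, MPI_Publish_name, the command line, ...). */
    MPI_Open_port(MPI_INFO_NULL, port);
    printf("port name: %s\n", port);

    /* Block until the other program calls MPI_Comm_connect on this port. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    /* ... exchange data over the intercommunicator 'inter' ... */

    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* Connect side -- roughly what ping.exe does: obtain the port name, then
 *   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);     */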

If I use the dapl fabric, everything works fine:

mpirun -genv I_MPI_FABRICS dapl -np 1 ping.exe

mpirun -genv I_MPI_FABRICS dapl -np 1 pong.exe

If I use tmi instead of dapl, the programs crash:

MPID_nem_tmi_vc_connect: tmi_connect returns 45
Fatal error in MPI_Comm_connect: Other MPI error, error stack:
MPI_Comm_connect(131)............................: MPI_Comm_connect(port="tag#0$epaddr_size#16$epaddr#02000000000000000305020000000000$", MPI_INFO_NULL, root=0, MPI_COMM_WORLD, newcomm=0x7f1db04115f8) failed
MPID_Comm_connect(206)...........................:
MPIDI_Comm_connect(393)..........................:
MPIDI_Create_inter_root_communicator_connect(134):
MPIDI_CH3_Connect_to_root(274)...................:
MPID_nem_tmi_connect_to_root(813)................:
(unknown)(): Other MPI error

However, tmi works fine for regular MPI-1 calls, e.g. MPI_Send.

Is there any way to debug this case?

Gergana Slavova (Intel)
Best Reply

Hi Stefan,

There's a known limitation when using dynamic process spawning with the TMI fabric (another customer is having a similar issue), and I'm working with him on finding a solution.  Try setting TMI_PSM_JID=1 before running your job; that has worked in the past.

Looking forward to hearing back.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
Stefan S.

Thanks! That seems to connect, at least! It is running now... I'll get back to you once it runs through!

Thanks a lot!

Gergana Slavova (Intel)

Glad to hear that, Stefan :)

Just to give you some background on why this is happening.

PSM requires the same job ID to be set for all processes in the same job.  Unfortunately, when using dynamic process management (MPI_Comm_spawn, or MPI_Comm_connect/MPI_Comm_accept as in your case), there's no way to pass that job ID to the other processes automatically, which is why you're running into this error.

As you found out, the current workaround is setting the following environment variable:

$ export TMI_PSM_JID=1

In reality, any number will work here; it just has to be the same for all the ranks involved.  Also, please only set this environment variable for jobs that use process spawning.  It should be left unset for any other MPI jobs that use TMI and do not do dynamic process management.
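Applied to the launch commands from your first post, that would look roughly like this (just a sketch; exporting the variable in the shell before both mpirun calls, as above, works equally well, since the important part is that both sides see the same value):

mpirun -genv TMI_PSM_JID 1 -genv I_MPI_FABRICS tmi -np 1 ping.exe

mpirun -genv TMI_PSM_JID 1 -genv I_MPI_FABRICS tmi -np 1 pong.exe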

Let me know how that run finishes up.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
