cluster error: /mpi/intel64/bin/pmi_proxy: No such file or directory found

Hi,

I've installed Intel Parallel Studio XE Cluster Edition in the single-node installation configuration on the master node of a cluster of 8 nodes with 8 processors each. I performed the prerequisite steps before installation, verified shell connectivity by running the sshconnectivity script, and created the machines.LINUX file; the output suggests all 8 nodes were found:

*******************************************************************************
Node count = 8
Secure shell connectivity was established on all nodes.
See the log output listing "/tmp/sshconnectivity.aditya.log" for details.
Version number: $Revision: 259 $
Version date: $Date: 2012-06-11 23:26:12 +0400 (Mon, 11 Jun 2012) $
*******************************************************************************

The machines.LINUX file has the following hostnames:

octopus100.ubi.pt
compute-0-0.local 
compute-0-1.local 
compute-0-2.local 
compute-0-3.local 
compute-0-4.local 
compute-0-5.local 
compute-0-6.local 

I started the installation and installed all the modules into the /export/apps/intel directory, which can be accessed by all nodes, as suggested by the cluster administrator. After completing the installation I added the environment scripts psxevars.sh and mpivars.sh to my .bashrc, as advised in the Getting Started guide. I then prepared a hostfile with all the nodes of the cluster for running in the MPI environment, and verified shell connectivity again by running sshconnectivity from the installation directory; as before, it detected all nodes successfully.
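For reference, the .bashrc additions I mean look roughly like this. This is only a sketch: the psxevars.sh directory name is an assumption (it varies with the installed version), while the mpivars.sh path matches the I_MPI_ROOT shown later in this thread.

```shell
# Sourced at login so the Intel compilers and MPI tools are on PATH.
# The parallel_studio_xe_2017 directory name is hypothetical; check the
# actual versioned directory under /export/apps/intel.
source /export/apps/intel/parallel_studio_xe_2017/psxevars.sh intel64
source /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpivars.sh
```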

I wanted to check the cluster configuration, so I compiled and executed the test.c program from the mpi/test directory of the installation. It compiled fine, but when I executed myprog it returned the error "pmi_proxy: No such file or directory", as follows:

[aditya@octopus100 Desktop]$ mpiicc -o myprog test.c
[aditya@octopus100 Desktop]$ mpirun -n 2 -ppn 1 -f ./hostfile ./myprog
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
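A quick sanity check (my own sketch, not from the Intel documentation): verify that the exact path mpirun complains about actually exists and is executable, first on the master and then on each node. The path constant below is copied from the error output above; the helper only tests a path it is given, so the same check can also be run remotely via ssh.

```shell
#!/bin/sh
# Report whether the pmi_proxy binary mpirun expects is present and executable.
# PMI_PROXY is the exact path taken from the error message in this thread.
PMI_PROXY=/export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy

check_pmi_proxy() {
    # $1 = path to test; prints a one-line verdict so it is easy to run via ssh
    if [ -x "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING: $1"
    fi
}

check_pmi_proxy "$PMI_PROXY"
# Per compute node, e.g.: ssh compute-0-0.local "test -x $PMI_PROXY" && echo OK
```

If the file exists on the master but the remote check fails, the install tree is not visible at the same path on the compute nodes.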

Later I referred to the troubleshooting manual, which suggested running a non-MPI command (hostname); it returned the same error, as follows:

[aditya@octopus100 Desktop]$ mpirun -ppn 1 -n 2 -hosts compute-0-0.local,compute-0-1.local hostname
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

When I included the master node octopus100.ubi.pt in the hostfile, it worked, but only for that node; the remaining nodes are not able to run MPI commands. I think it may be an environment problem, since the cluster nodes are not able to perform MPI communication with the master node.

Please help me resolve this issue so that I can perform some simulations on the cluster.

Thanks,

Aditya

 


Hi Aditya, what is your OS and hardware environment, such as processors? And what is the output of "env | grep I_MPI"? Could you please refer to the thread https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog... to see if it helps. Thanks.

Hi,

The output for env | grep I_MPI is as follows:

[aditya@octopus100 ~]$ env | grep I_MPI

MPS_STAT_ENABLE_IDLE=I_MPI_PVAR_IDLE
I_MPI_ROOT=/export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi

The cluster is made of 8 Intel workstation nodes with 8 Intel Xeon processors in each node. Parallel Studio Cluster Edition is installed in the single-node configuration on the master, which runs CentOS 6.5. I referred to the other thread, but it was about the MIC architecture, and I don't think my case involves MIC.

I've set up the environment scripts psxevars.sh and mpivars.sh in the .bashrc of the user aditya@octopus100.ubi.pt, but the installation was performed as root@octopus100.ubi.pt into the shared directory /export/apps/intel, according to the cluster administrator.
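One possibility worth checking (my guess only, not something the manual states for this exact error): mpirun launches pmi_proxy on the remote nodes through a non-interactive ssh shell, and many stock .bashrc files return early for non-interactive shells, so anything sourced after that guard never takes effect for mpirun. A shell reports whether it is interactive via $-:

```shell
#!/bin/sh
# Print whether the current shell is interactive ($- contains 'i').
# mpirun's remote launches behave like the non-interactive case.
shell_mode() {
    case $- in
        *i*) echo interactive ;;
        *)   echo non-interactive ;;
    esac
}

shell_mode
```

Run as a script (as mpirun effectively does on each node), this prints non-interactive. If the guard is the problem, moving the two source lines above the interactivity check in .bashrc is a common workaround.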

Please kindly advise on how to resolve this issue.

Thanks,

Aditya 

Hi Aditya, what is the output of "mpirun -n 8 hostname"? Please also have a look at example 3 at https://software.intel.com/en-us/node/561777. Are there problems with the network shared drive, or is the mount present? Thanks.

Hi Si,

The output of mpirun -n 8 hostname is as follows:

[aditya@octopus100 Backward Step]$ mpirun -n 8 hostname
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt
octopus100.ubi.pt

I think the installation on the master is not able to command the slave nodes, and when mpirun is executed for a particular node it returns the pmi_proxy not found error. The shared file system is mounted and access should be good, because other software such as Fluent and OpenFOAM works fine. I am able to run a serial code by logging into a particular node and executing it there, but when I try to launch the run on that node from the master node it doesn't work. I can compile a code on the master and execute it on a slave by logging into the slave node, but I am not able to compile code from the slave, as that returns a "command not found" error. When I log into the slave node, the environment variables sourced in .bashrc are recognized and initialized, yet the slave cannot invoke the compiler to compile a code, although it can execute already-compiled binaries. Is this info helpful for decoding the configuration problem? I think it is some kind of environment issue!
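To rule out the shared-drive question Si raised, one sketch (assuming /export/apps is the NFS-shared prefix, as this thread suggests) is to confirm on each node that the directory is really a mount point, not just an empty local directory:

```shell
#!/bin/sh
# Report whether a directory is a mount point, using util-linux `mountpoint`.
check_mount() {
    if mountpoint -q "$1"; then
        echo "mounted: $1"
    else
        echo "not a mount point: $1"
    fi
}

check_mount /export/apps
# Per node: ssh compute-0-0.local 'mountpoint -q /export/apps && echo mounted'
```

A node where the check fails would see an empty /export/apps, which would explain both the missing pmi_proxy and the missing compilers there.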

Thanks,

Aditya

 
