MPI application fails to run from host machine on coprocessor


I am trying to run an application on the coprocessor from the host machine, but when I execute the command

mpirun -n 2 -host host-name /tmp/test.mic

it hangs on the command line and does not show any output.

However, when I run it directly on the coprocessor or the host, it works fine. What could be the issue?


Hello,
Are there any messages printed before the hang?
Can you please confirm that the environment variable I_MPI_MIC=1 is set before issuing mpirun?
What is the output of

$ mpirun -V

Could you try

$ mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 <your_mic_binary>

Could you try

$ mpirun -n 2 -host localhost -env I_MPI_DEBUG=3 <your_host_binary>

Thanks,
Leo.

 

Thanks for your reply.

mpirun -V

Intel(R) MPI Library for Linux* OS, Version 4.1 Update 3 Build 20131205
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.

mpirun -n 2 -host localhost -env I_MPI_DEBUG=3 <your_host_binary>

This works fine for localhost, but

mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 <your_host_binary>

after entering this command, it seems to wait / hang.

Hi Roshan,

I think Leo is suggesting that you run

% mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 <your_mic_binary>

but not

% mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 <your_host_binary>

You may need to compile your mic binary with the command

% mpiicc -mmic <source code> -o <your_mic_binary>

For example:

% mpiicc -mmic test.c -o test.mic

In addition, you need to transfer the MIC binary to your coprocessor (or use an NFS mount):

% scp test.mic mic0:/tmp/.

You also need pmi_proxy and all the MPI libraries:

% scp /opt/intel/impi/<version>/mic/bin/pmi_proxy mic0:/bin

% scp /opt/intel/impi/<version>/mic/lib/* mic0:/lib64/.

Then enable the environment variable I_MPI_MIC:

% export I_MPI_MIC=1

Now you should be able to run it:

% mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 /tmp/test.mic
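For reference, the whole sequence above can be collected into one small host-side sketch. This is only an illustration: it assumes the default install path /opt/intel/impi/<version>, a source file named test.c, and write access to /bin and /lib64 on the card, so adjust names and paths to your setup.

# sketch: build, stage, and launch a MIC-native MPI binary from the host
mpiicc -mmic test.c -o test.mic                            # cross-compile for the coprocessor
scp test.mic mic0:/tmp/                                    # stage the binary on the card
scp /opt/intel/impi/<version>/mic/bin/pmi_proxy mic0:/bin  # MIC build of the Hydra proxy
scp /opt/intel/impi/<version>/mic/lib/* mic0:/lib64/       # MIC builds of the MPI runtime libraries
export I_MPI_MIC=1                                         # allow MIC targets in mpirun
mpirun -n 2 -host mic0 -env I_MPI_DEBUG=3 /tmp/test.mic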

 

 

Perfect. I definitely recommend following Loc's guidelines step-by-step as described above.

If you still see the silent hang issue after trying these, I'd suggest stepping back and making sure that the environment is actually prepared to run MPI:

1. Would you confirm that it is possible to execute 'hostname' on mic0 via ssh?  (a failure here would be equivalent to "scp" failing in the guidelines above)

$ ssh mic0 hostname

2. If OK, then can you please try using mpirun to execute only 'hostname' on mic0?  (that is, without any user-compiled binary)

$ setenv I_MPI_MIC 1
$ mpirun -n 2 -host mic0 hostname

Thank you,
Leo.

Hi,

I followed the instructions given by Loc, but no success.

When I run

"ssh mic0 hostname"

I can see the hostname. Also, scp for copying the binary works.

When I run

"mpirun -n 2 -host mic0 hostname"

it hangs and does not show any output.

Did I miss setting any variables here? I doubt it, because I can run the application directly on mic0.

Let me take a look at your /etc/hosts files on the host and the coprocessor. Would you please display the output from the following commands:

% hostname
% cat /etc/hosts
% ssh mic0 hostname
% ssh mic0 cat /etc/hosts

 

hostname

gauss

cat /etc/hosts
127.0.0.1       localhost
127.0.1.1       gauss

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.31.1.1      gauss-mic0 mic0
172.31.1.254    hostmic0

 ssh mic0 hostname
gauss-mic0

 

ssh mic0 cat /etc/hosts
127.0.0.1       gauss-mic0 mic0 localhost.localdomain localhost
::1             gauss-mic0 mic0 localhost.localdomain localhost

172.31.1.254    host

Hi Roshan,

Looking at /etc/hosts on your host system, there are two additional lines that make me wonder:

127.0.1.1  gauss

and

172.31.1.254 hostmic0

I am not sure why you have these lines in your /etc/hosts.

And the /etc/hosts on the coprocessor looks like it is missing one line:

172.31.1.1   gauss-mic0 mic0

My suggestion is to remove the above two lines from /etc/hosts on your host system (save a backup first). Also, try the following commands on your host system and see if there is any output (a sketch of the resulting coprocessor /etc/hosts follows the commands):

mpirun -host mic0 hostname
mpirun -host 192.131.1.1 hostname
mpirun -host gauss-mic0 hostname
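For clarity, after adding the missing entry the coprocessor's /etc/hosts would look roughly like the sketch below (addresses taken from the listings earlier in this thread; treat it as an illustration rather than an exact file):

# /etc/hosts on the coprocessor (sketch)
127.0.0.1       gauss-mic0 mic0 localhost.localdomain localhost
::1             gauss-mic0 mic0 localhost.localdomain localhost
172.31.1.254    host                 # the host end of the mic0 link
172.31.1.1      gauss-mic0 mic0      # the coprocessor itself (the line that was missing)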

For all three commands, I get "gauss" as output. I removed the 2 lines from the host machine's /etc/hosts file and added "172.31.1.1   gauss-mic0 mic0" on the co-processor.

After doing this I am still not able to run it.

If you run

export I_MPI_MIC=enable; mpirun -host mic0 -n 1 hostname

 

It should respond with

gauss-mic0

 

mpirun -host mic0 -n 1 hostname

This doesn't give any output; it seems to hang, which is the same behaviour as my problem.

It might be worth verifying whether your mpirun command is at least issuing the ssh command for the connection. Would you please add the "-v" verbose option to the mpirun command as shown below and post the output here?

% export I_MPI_MIC=enable; mpirun -v -host mic0 -n 1 hostname

 

Also: can you confirm that you can ssh to mic0 without a password?

 

Thank you,

Leo.
I am attaching a file containing the output of the above command.

It does not terminate by itself; I had to kill it with Ctrl+Z.

And ssh does not work without a password.

Attachment: temp.txt (26.93 KB)

You need to set up passwordless SSH to the coprocessor.

Passwordless SSH is a prerequisite for MPI.
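As a minimal sketch, assuming a standard OpenSSH setup on the host (the card's home directory layout may differ, and MPSS can also propagate keys when the card is reconfigured):

# generate a key pair on the host if one does not exist yet (accept the defaults)
ssh-keygen -t rsa
# append the public key to the card's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh mic0 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# verify: this should now print the card's hostname without asking for a password
ssh mic0 hostname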

 

Now I have set up passwordless ssh. But when I run the mpirun command, I still get an error message:

"[proxy:0:0@gauss-mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "gauss-mic0" to "127.0.1.1" (Connection refused)
[proxy:0:0@gauss-mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server 127.0.1.1 at port 42947 (check for firewalls!)
^CCtrl-C caught... cleaning up processes
[mpiexec@gauss] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:239): assert (!closed) failed
[mpiexec@gauss] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:127): unable to send SIGUSR1 downstream
[mpiexec@gauss] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@gauss] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:435): error waiting for event
[mpiexec@gauss] main (./ui/mpich/mpiexec.c:900): process manager error waiting for completion"

 

Quote:

roshan c. wrote:

For all three commands, I get "gauss" as output. I removed the 2 lines from the host machine's /etc/hosts file and added "172.31.1.1   gauss-mic0 mic0" on the co-processor.

After doing this I am still not able to run it.

127.0.1.1 is the IP address of gauss according to your original /etc/hosts on the host.

I am guessing that removing these two lines may have caused this problem. Can you try putting them back into the host /etc/hosts and trying again?

Follow the advice in the output, "check for firewalls!"

It's likely a firewall is preventing the connection from the coprocessor to the host.

See also,

http://software.intel.com/en-us/articles/firewalls-and-mpi

http://software.intel.com/en-us/articles/using-intel-mpi-library-and-intel-xeon-phi-coprocessor-tips
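As an illustration only (the exact rules depend on your distribution and firewall tool; see the articles above for the full picture), checking and opening the host's virtual mic0 interface with iptables might look like the sketch below. The interface name mic0 is an assumption here; verify it with ip addr or ifconfig.

# list the current firewall rules on the host
iptables -L -n
# accept traffic arriving on the virtual interface that connects to the coprocessor
iptables -I INPUT -i mic0 -j ACCEPT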

 

Now I am getting a different error message.

When I run "mpirun -n 2 -host mic0 /tmp/test.mic":

sh: /opt/intel/impi/4.1.3.045/intel64/bin/pmi_proxy: not found

This binary is present on both machines.

Hello,

Can you please confirm that I_MPI_MIC is set to either "1" or "enable"?

Would you send the output of:

% export I_MPI_MIC=enable; mpirun -v -host mic0 -n 1 hostname |&  grep "Launch arguments"

 

thanks,

Leo.

 

It gives the output:

"[mpiexec@gauss] Launch arguments: /usr/bin/ssh -x -q mic0 sh -c 'export I_MPI_ROOT="/opt/intel/impi/4.1.3.045" ; export PATH="/opt/intel/impi/4.1.3.045/intel64/bin//../../mic/bin:${I_MPI_ROOT}:${I_MPI_ROOT}/mic/bin:${PATH}" ; exec "$0" "$@"' pmi_proxy --control-port 127.0.1.1:52573 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --enable-mic --i_mpi_base_path /opt/intel/impi/4.1.3.045/intel64/bin/ --i_mpi_base_arch 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 107981940 --proxy-id 0 "

 

When I enabled the above variable and ran the mpirun command, I got the output:

"/bin/pmi_proxy: line 2: syntax error: unexpected word (expecting ")")"

Thank you for the follow up.

The message "/bin/pmi_proxy: line 2: syntax error: unexpected word (expecting ")")" might indicate that the copy of pmi_proxy on the MIC card is from the "intel64" directory and not from the "mic" binary directory.

It might be worth trying these copies again:

% scp /opt/intel/impi/4.1.3.045/mic/bin/pmi_proxy mic0:/bin

% scp /opt/intel/impi/4.1.3.045/mic/lib/* mic0:/lib64/.

 

and then re-run your test.
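Before re-running, it may also help to check with the standard file utility that the staged binary really is the coprocessor build. This is just a quick sanity check, the exact architecture string printed can vary, and file may not be available on the card itself:

# the mic/ build should not be reported as a plain x86-64 executable
file /opt/intel/impi/4.1.3.045/mic/bin/pmi_proxy
file /opt/intel/impi/4.1.3.045/intel64/bin/pmi_proxy
# and check what actually ended up on the card (only if the card has the file utility)
ssh mic0 file /bin/pmi_proxy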

Best,

Leo.

Is Intel MPI visible on the coprocessor?

That is, is a directory such as /opt/intel/impi/4.1.3.045 mounted on the coprocessor?
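A quick check, sketched with the paths used earlier in this thread (adjust the version number to your installation):

# does the coprocessor see the Intel MPI installation directory (e.g. via an NFS mount)?
ssh mic0 ls /opt/intel/impi/4.1.3.045
# if it is not mounted, the manually copied files should at least be present
ssh mic0 ls /bin/pmi_proxy
ssh mic0 'ls /lib64/libmpi*'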

 

Thanks a lot, guys. Now I can run the app from the host on the coprocessor.

However, there is one problem: when I try to run on the host and the coprocessor in one command, I get an error message:

 mpirun  -n 3 -host gauss ./test.host : -iface mic0 -host mic0 -n 2 /tmp/test.mic 

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

This is good progress: great!

I believe the bad termination might also be related to the application you're attempting to run. Have you already tried running another MPI example and observing the behavior?
For example:

% cp /opt/intel/impi/4.1.3.045/test/test.c .
% mpiicc -mmic test.c -o test_hello.mic
% mpiicc test.c -o test_hello
% scp test_hello.mic mic0:/tmp
% mpirun -n 2 -host localhost ./test_hello : -n 2 -iface mic0 -host mic0 /tmp/test_hello.mic
Hello world: rank 0 of 4 running on Some-Host-Name
Hello world: rank 1 of 4 running on Some-Host-Name
Hello world: rank 2 of 4 running on Some-Host-Name-mic0
Hello world: rank 3 of 4 running on Some-Host-Name-mic0

Best,
Leo.

I tried with another application as well, but no success.

Here is the sample program I am trying to run:

#include <stdio.h>
#include <mpi.h>


int main (int argc, char *argv[])
{
  int rank, size;

  MPI_Init (&argc, &argv);	/* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);	/* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);	/* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}

 

Check your MPI setup.  That looks like an MPICH2 error message.

Similar topic:  http://software.intel.com/en-us/forums/topic/405183

But when I run it individually, it works fine. The problem persists only when it is run simultaneously on both machines.

Yes, that could happen.  Try some commands like "which mpirun" and "which mpiexec" to check whether perhaps you're picking up something from some other MPI.
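A minimal sketch of that check, assuming the install path shown earlier in this thread; mpivars.sh ships with Intel MPI and re-establishes the expected PATH and library paths:

# which launcher is first on the PATH, and which MPI does it report?
which mpirun
which mpiexec
mpirun -V
# if these do not point at the intended Intel MPI installation, re-source its environment script
source /opt/intel/impi/4.1.3.045/intel64/bin/mpivars.sh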

By running which mpirun I got the output

"/opt/intel/impi/4.1.3.045/intel64/bin/mpirun"

I also ran mpirun from the mic/bin directory and still got the same error message.

If you're convinced this message is from Intel MPI (which I'm not), then the message is telling you there's an error in your test program.

Hi,
It appears that I am having a similar problem. I followed the thread down to #22. It helped me improve my /etc/hosts settings. These are the outputs of some of the commands commenters were asking for:

% mpirun -host mic0 hostname
uhams02a.phys.hawaii.edu

% mpirun -host 192.131.1.1 hostname
uhams02a.phys.hawaii.edu

% mpirun -host gauss-mic0 hostname
uhams02a.phys.hawaii.edu

% hostname
uhams02a.phys.hawaii.edu

% cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.31.1.1    uhams02a-mic0.phys.hawaii.edu mic0 #Generated-by-micctrl
172.31.1.254    host uhams02a.phys.hawaii.edu #pvd

% ssh mic0 hostname
uhams02a-mic0.phys.hawaii.edu

% ssh mic0 cat /etc/hosts
127.0.0.1    localhost.localdomain localhost
::1        localhost.localdomain localhost

172.31.1.254    host uhams02a.phys.hawaii.edu
172.31.1.1    uhams02a-mic0.phys.hawaii.edu mic0

I copied:
% scp /opt/intel/impi/4.1.3.045/mic/bin/pmi_proxy mic0:/bin
% scp /opt/intel/impi/4.1.3.045/mic/lib/* mic0:/lib64/

After that I also restarted the card with:
% sudo service mpss stop
% sudo service mpss start

% export I_MPI_MIC=enable; mpirun -v -host mic0 -n 1 hostname |& grep "Launch arguments"
"Launch arguments: /usr/local/bin/ssh -x -q mic0 sh -c 'export I_MPI_ROOT="/opt/intel/impi/5.0.0.028" ; export PATH="/opt/intel/impi/5.0.0.028/intel64/bin//../../mic/bin:${I_MPI_ROOT}:${I_MPI_ROOT}/mic/bin:${PATH}" ; exec "$0" "$@"' pmi_proxy --control-port 172.31.1.254:48386 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --enable-mic --i_mpi_base_path /opt/intel/impi/5.0.0.028/intel64/bin/ --i_mpi_base_arch 0 --rmk user --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1728469499 --usize -2 --proxy-id 0"

This continues to hang. Any idea of how to proceed is appreciated.

 
