tahgroupiastate.edu
| October 21, 2009 9:18 AM PDT mpdboot fails to start nodes with different users. | ||||
I am trying to figure out why a few nodes in my cluster are acting differently. We are running Rocks 5.2 with RHEL 5, and we use torque/maui as our queuing system. Users submit jobs that use Intel MPI version 3.2.1.009.

When I start a job as a regular user with

mpdboot --rsh=ssh -d -v -n 16 -f /scr/username/testinput.nodes.mpd

I get the usual output for most nodes:

---

LAUNCHED mpd on compute-0-13 via compute-0-15
debug: launch cmd= ssh -x -n compute-0-13 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE HOST=$HOST OSTYPE=$OSTYPE /opt/intel/impi/3.2.1.009/bin64/mpd.py -h compute-0-15 -p 41983 --ifhn=10.1.3.241 --ncpus=1 --myhost=compute-0-13 --myip=10.1.3.241 -e -d -s 16
debug: mpd on compute-0-13 on port 58382
RUNNING: mpd on compute-0-13
debug: info for running mpd: {'ip': '10.1.3.241', 'ncpus': 1, 'list_port': 58382, 'entry_port': 41983, 'host': 'compute-0-13', 'entry_host': 'compute-0-15', 'ifhn': '', 'pid': 19147}

---

However, when it gets to compute-0-6, it fails:

---

LAUNCHED mpd on compute-0-6 via compute-0-11
debug: launch cmd= ssh -x -n compute-0-6 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE HOST=$HOST OSTYPE=$OSTYPE /opt/intel/impi/3.2.1.009/bin64/mpd.py -h compute-0-11 -p 51916 --ifhn=10.1.3.248 --ncpus=1 --myhost=compute-0-6 --myip=10.1.3.248 -e -d -s 16
debug: mpd on compute-0-6 on port 47012

---

mpdboot_compute-0-15.local (handle_mpd_output 828): Failed to establish a socket connection with compute-0-6:47012 : (111, 'Connection refused')
mpdboot_compute-0-15.local (handle_mpd_output 845): failed to connect to mpd on compute-0-6

---

I have tried taking compute-0-6 out of the system, and it then throws similar errors for compute-0-5, and so forth all the way down to compute-0-0.

When I run the same job as root with

mpdboot --rsh=ssh -d -v -n 16 -f /scr/username/testinput.nodes.mpd

it starts fine.
We have ssh set up so that it does not require a password to log in, and I have successfully attempted logging in without a password from the mpdboot node without any problems. I am a relatively new cluster administrator, and I was hoping someone could help point me towards the solution to this problem. | |||||
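
One way to look at the `(111, 'Connection refused')` error independently of mpdboot is a plain TCP probe, run once as the affected user and once as root from the mpdboot node. The following is a minimal sketch, assuming Python 3 is available on that node; `probe_port` is a hypothetical helper name and not part of Intel MPI or mpd. The host and port would have to be taken from the `debug: mpd on ... on port ...` line of a fresh run, since the port changes on every launch.

```python
import errno
import socket


def probe_port(host, port, timeout=3.0):
    """Attempt a TCP connection to host:port.

    Returns (ok, description). 'probe_port' is a hypothetical helper
    for illustration; it is not part of mpd or Intel MPI.
    """
    try:
        # create_connection resolves the host name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all within the timeout (packets possibly dropped).
        return False, "timed out"
    except OSError as exc:
        # Covers refused connections (errno 111 on Linux) and name
        # resolution failures; map the errno to its symbolic name if known.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc}"
```

For example, `probe_port("compute-0-6", 47012)` would be expected to return a failure with `ECONNREFUSED` if nothing is listening on that port at the moment of the call. A refusal only shows that no process accepted the connection at that time; it does not by itself say whether the mpd exited early or something on the node rejected the connection.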