mpd daemon prematurely terminating job

Hi everyone,

I am a little out of my depth here, so bear with me. I am trying to configure mpirun and mpiexec to run software called Materials Studio on a single-node cluster with 2 processors and 12 cores. The scheduler is PBS. I had everything set up properly and jobs were submitting and running well, but after a few days I started getting this sort of error:

mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)

It seemed like the mpd daemon was being started but eventually terminated. I had some luck adding this to my submission script:

export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH
export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib
mpdboot -n 1 -f ~/mpd.hosts
nohup mpd &
/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel

The job now submits and runs properly, but it times out after 30 minutes or so. I tried adding '-r ssh' (without quotes) to the end of the mpdboot line, but I am not sure that is the right strategy. I am also confused about why I need to start this daemon in the script, and why I need to supply a hosts file when I run mpdboot; I thought PBS created that when the job starts. Could anyone please give me some advice on where to go next? Basically, how can I prevent a running job from quitting because of something to do with the MPI daemon?
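(For reference on the hosts-file question: PBS does export the list of allocated nodes to each job via the $PBS_NODEFILE environment variable, so a hand-maintained ~/mpd.hosts should not be needed. A sketch of feeding it to mpdboot, using the same flags as the script above:)

```shell
# PBS writes the allocated node list to $PBS_NODEFILE inside the job,
# so it can replace a hand-maintained hosts file.
NHOSTS=$(sort -u "$PBS_NODEFILE" | wc -l)    # count unique hosts
mpdboot -n "$NHOSTS" -f "$PBS_NODEFILE" -r ssh
```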

Thanks so much for your help!

Hi Stephen,

Try using the following instead:

. /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpivars.sh
mpirun -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel

The first line will set up the PATH and LD_LIBRARY_PATH environment variables for you.  By using mpirun (or mpiexec.hydra) instead of mpiexec, you will use Hydra, which is simpler and more scalable than MPD.  Please let me know if this helps.
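Putting that together, a minimal PBS submission script along these lines might look like the following (paths taken from your script above; the #PBS resource directives are placeholders for your queue, and this assumes mpivars.sh exists in that bin directory):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=6     # placeholder resource request
#PBS -N vasp_job
cd "$PBS_O_WORKDIR"       # run from the submission directory

# Sets up PATH and LD_LIBRARY_PATH for the Intel MPI Library
. /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpivars.sh

# Hydra picks up the PBS node list automatically; no mpdboot or mpd needed
mpirun -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
```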

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

Thanks so much for the response! Unfortunately, I can't seem to locate mpivars.sh in that folder or in the lib folder. I think it might be due to the version number; here is what I am told the software was compiled with. I believe it is version 3.2.

http://software.intel.com/sites/default/files/m/d/4/1/d/8/Reference_Manu...

Thanks again,

Stephen

Hi Stephen,

Ok, try simply removing the calls to start the MPD within your script.  Start an MPD ring ahead of time, and use that ring for all of your jobs.
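As a sketch, starting a persistent ring once from the command line (before submitting any jobs, and assuming passwordless ssh between the hosts in ~/mpd.hosts) would look like:

```shell
# Start an MPD ring once, outside the job script
mpdboot -n 1 -f ~/mpd.hosts -r ssh

mpdtrace      # verify the ring: lists the hosts running an mpd

# ... submit jobs; they all reuse this ring ...

mpdallexit    # tear the ring down when you are finished
```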

Also, would it be possible to get a version compiled with a current version of the Intel® MPI Library and then try running with that version?

James.

Hi Everyone,

Okay, so I think I figured out the issue, and it ended up being in no way related to a failure in Intel MPI. My administrator had set up a cron task several years ago to kill any job matching a certain string, which this program by dumb luck happened to match. Everything now runs beautifully.
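(For anyone hitting something similar: a hypothetical illustration of how a broad kill pattern in a cron job can match an unrelated command line. The pattern "vasp" here is made up; tools like `pkill -f` match against the full command line.)

```shell
# A cron cleanup such as `pkill -f <pattern>` matches the full command
# line, so a short pattern can catch unintended processes.
CMDLINE="/data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel"
PATTERN="vasp"   # hypothetical pattern from the cron job
if echo "$CMDLINE" | grep -q "$PATTERN"; then
  echo "would be killed"
fi
```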

Thanks again for all the help,

Stephen

Hi Stephen,

I'm glad everything is working now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
