Problems running with MPI

Hello,

I've installed Intel MPI 3.2 on a C2Q 9550 with openSUSE 11.1. I've done this before on openSUSE 11.0 and never had any problems.

So, when I try to start my program, I get this:
============================START====================================
sda@abs:~/d.dppc> mpirun -n 4 mdrun_mpi -v
WARNING: Can't read mpd.hosts for list of hosts, start only on current
mpiexec_abs (mpiexec 841): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_sda log file on each node of the ring.
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_sda_090112.170048_5311); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
=============================END=====================================

The output of /tmp/mpd2.logfile_sda is the following

=============================START===================================
sda@abs:~/d.dppc> less /tmp/mpd2.logfile_sda_090112.170704_5653
logfile for mpd with pid 5688
abs_41924: mpd_uncaught_except_tb handling:
: list index out of range
/opt/intel/impi/3.2/bin64/mpd.py 132 pin_Join_list
list.append(l1[i]+l2[i]+l3[i])
/opt/intel/impi/3.2/bin64/mpd.py 421 pin_CpuList
ordids = pin_Join_list(info['pack_id'],info['core_id'],info['thread_id'],space)
/opt/intel/impi/3.2/bin64/mpd.py 2535 run_one_cli
self.PinList = pin_CpuList(gl_envvars, self.PinCase, self.PinSpace,self.CpuInfo,len(self.RanksToBeRunHere))
/opt/intel/impi/3.2/bin64/mpd.py 2369 do_mpdrun
rv = self.run_one_cli(lorank,msg)
/opt/intel/impi/3.2/bin64/mpd.py 1605 handle_console_input
self.do_mpdrun(msg)
/opt/intel/impi/3.2.0.011/bin64/mpdlib.py 613 handle_active_streams
handler(stream,*args)
/opt/intel/impi/3.2/bin64/mpd.py 1262 runmainloop
rv = self.streamHandler.handle_active_streams(timeout=8.0)
/opt/intel/impi/3.2/bin64/mpd.py 1231 run
self.runmainloop()
/opt/intel/impi/3.2/bin64/mpd.py 2762
mpd.run()
============================END=============================

Starting mpd with "mpd &" or "mpd -n &" didn't help.

I've created a ~/.mpd.conf file with 700 permissions and a secret word in it - didn't help.

These are the test responses:

========================START========================
mpdboot -n 2 -v -d
debug: starting
totalnum=2 numhosts=1
there are not enough hosts on which to start all processes
========================END===========================

========================START=========================
mpdboot -n 1 -v -d
debug: starting
running mpdallexit on abs
LAUNCHED mpd on abs via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2/bin64/mpd.py --ncpus=1 --myhost=abs -e -d -s 1
debug: mpd on abs on port 42445
RUNNING: mpd on abs
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 42445, 'entry_port': '', 'host': 'abs', 'entry_host': '', 'ifhn': ''}
=========================END==========================

========================START=========================
mpiexec -n 2 /bin/hostname
mpiexec_abs: cannot connect to local mpd (/tmp/mpd2.console_sda); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
=========================END==========================

Do you know how to fix this?

THANKS!


If you don't have mpd.hosts in the current directory, you need to point to one with the -f option, in order to start mpd on more than one node.

Quoting - tim18
If you don't have mpd.hosts in the current directory, you need to point to one with the -f option, in order to start mpd on more than one node.

Thank you for answering.

1. I have a single CPU with 4 cores, so in my understanding I should not need an mpd.hosts file. That defaults mpirun to running on the current node only - the only one I have.
2. I have created mpd.hosts file - I am not receiving any complaints about mpd.hosts any more, though this did not resolve the issue.
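For reference, a single-node mpd.hosts file is just the local hostname on one line. A minimal sketch of creating one (the filename and location here are assumptions, not taken from the thread):

```shell
# Sketch, assuming a single-node setup: mpd.hosts lists one hostname
# per line, so for one machine the local hostname is enough.
hostname > mpd.hosts
cat mpd.hosts   # prints the current host's name
```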

Hello genesup,

You're correct: if you don't specify an mpd.hosts file, mpirun will only run your program on the current node. The issue probably arises when running mpiexec.

Can you either remove the .mpd.conf file you have created (it's automatically set up for you when you first use either mpirun or mpdboot) or change the permissions of your file to 600 (which is what they should be)?
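A sketch of those two options, demonstrated on a stand-in file so the real config stays untouched (the filename and secret word below are placeholders, not from the thread):

```shell
# Stand-in for ~/.mpd.conf; do not overwrite your real file.
CONF=./mpd.conf.demo
printf 'secretword=changeme\n' > "$CONF"

# Option 1: remove the hand-made file and let mpirun/mpdboot recreate it.
# rm -f ~/.mpd.conf

# Option 2: keep the file but restrict it to owner read/write (600).
chmod 600 "$CONF"
stat -c '%a' "$CONF"   # prints 600 on GNU/Linux
```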

As noted in the Release Notes, the Intel MPI Library currently supports OpenSUSE 10.3. I'm wondering if there've been any major changes between 10.3 and 11.0.

Regards,
~Gergana

===================================
Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

I have exactly the same setup (i.e. Intel MPI 3.2, openSUSE 11.1) on a Dell Precision, and I see exactly the same problems (i.e. the same error messages in the /tmp/mpd2 logfiles).

Bernd

Hello bmohrpriv,

Are you using your own $HOME/.mpd.conf file? If yes, I would suggest removing it and letting Intel MPI Library take care of creation of that file.

If you believe this is a bug with the Intel MPI Library, feel free to submit a bug report at the Intel Premier site: https://premier.intel.com.

It would be great if you could let us know if you're able to run a simple MPI program on your cluster. For example, the Intel MPI Library provides test examples in the test directory of the installation. Can you copy one of those to your home dir and try a quick experiment:

$ cp /opt/intel/impi/3.2/test/test.c .
$ mpiicc test.c -o testc
$ mpirun -f mpd.hosts -n 2 ./testc

Or, even simpler, try to run mpirun with hostname instead:

$ mpirun -f mpd.hosts -n 2 /bin/hostname

Regards,
~Gergana


I finally found the real source of the problem: it is the program "cpuinfo" provided by Intel MPI. Executed on my openSUSE 11.1 system, it reports:

suse11% cpuinfo
Architecture : x86_64
Hyperthreading: disabled
Packages : 0
Cores : 0
Processors : 0
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 6 MB no sharing

The cpuinfo program is used by default by the MPD scripts to find out about the processor; as you can see, it reports 0 processors with 0 cores, which ultimately results in the problem reported in this thread.
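A quick way to cross-check what the bundled cpuinfo tool should have reported is to count the processor entries directly in /proc/cpuinfo (Linux-specific, and only a sanity check, not part of the original thread):

```shell
# Count logical processors the kernel itself reports; a healthy system
# prints a nonzero number, unlike the broken cpuinfo output above.
grep -c '^processor' /proc/cpuinfo
```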

A workaround is to set the environment variable I_MPI_CPUINFO to "proc", telling MPD to use the /proc/cpuinfo output instead of executing cpuinfo. With this setting, I can finally use Intel MPI on my openSUSE 11.1.
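A minimal sketch of that workaround (the mdrun_mpi command is just the example from earlier in the thread):

```shell
# Tell MPD to parse /proc/cpuinfo instead of running Intel's cpuinfo tool.
export I_MPI_CPUINFO=proc

# Then launch as usual, e.g.:
# mpirun -n 4 mdrun_mpi -v
```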

Bernd

Just a quick update on this issue. The root cause is that the system affinity mask size in openSUSE 11.1 has been increased. A fix will be included in the upcoming release of the Intel MPI Library.

Regards,
~Gergana

