Are there any plans for a version of Intel MPI with tight integration support for the Sun Grid Engine queuing system, much the same way OpenMPI supports it now?
Yes, we are considering including such functionality in our product.
Actually, I can provide some current recommendations on how to configure SGE to achieve tight integration with the Intel MPI Library. Just let me know if you are interested.
As Andrey mentioned, we do have a "manual", of sorts, on how to integrate Intel MPI with Sun Grid Engine. The set of instructions is now available online at:
Let us know if this helps, or if you have any questions or problems.
Sorry, I was out of town for a few days and am just getting back to this. Thanks, Andrey and Gergana! I will look over the manual instructions, give them a try, and let you know how it goes.
We followed the directions on the website and set up SGE as you suggested for tight integration with Intel MPI. One of the reasons we want this is so that SGE can properly clean up the MPD Python daemons that are left running on servers after a job is deleted or killed.
For example, with OpenMPI/SGE tight integration, all OpenMPI processes are forked as children of the SGE execd daemon. So when a job is deleted or killed, SGE has full control of the job and can terminate all of its OpenMPI children and clean up.
With Intel MPI, here is what I see when I submit a job:
grdadmin 4788 1 4788 4694 0 Mar30 ? 00:02:00 /hpc/SGE/bin/lx24-amd64/sge_execd
root 4789 4788 4788 4694 0 Mar30 ? 00:04:15 /bin/ksh /usr/local/bin/load.sh
grdadmin 16949 4788 16949 4694 0 09:33 ? 00:00:00 sge_shepherd-1712429 -bg
salmr0 17023 16949 17023 17023 1 09:33 ? 00:00:00 -csh /var/spool/SGE/hpcp7781/job_scripts/1712429
salmr0 17127 17023 17023 17023 0 09:33 ? 00:00:00 /bin/sh /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpirun -perhost 1 -env I
salmr0 17174 17127 17023 17023 1 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpiexec -perhost 1 -env
salmr0 17175 17174 17023 17023 1 09:33 ? 00:00:00 [sh] ...
salmr0 17166 1 17165 17165 0 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpcp7
salmr0 17176 17166 17176 17165 2 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpc
salmr0 17178 17176 17178 17165 87 09:33 ? 00:00:04 /bphpc7/vol0/salmr0/MPI-Bench/bin/x86_64/IMB-MPI1.intelmpi.3.1
As you can see, my MPI job is running as a forked child of sge_execd and is under full SGE control. However, the MPDs that got started are totally independent processes and are not forked children of SGE. The problem comes when I type qdel to delete my job, or kill it while it is running. At that point SGE will kill all of its forked children, but it knows nothing about the MPD daemons. As a result, after SGE deletes, kills, and cleans up my job, I still have this running on all the nodes that ran the MPI job:
salmr0 17166 1 17165 17165 0 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpcp7

Each time I submit and delete a job, I get a new orphaned Python process like the one above hanging around. Any ideas on how to get the cleanup of the MPDs working properly?
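Until tight integration handles this automatically, one possible stopgap (this is my sketch, not part of stock SGE or Intel MPI) is a queue epilog script that sweeps up any mpd.py daemons the job owner left behind on the node; the "mpd.py" pattern below matches the orphaned processes shown in the ps listing above:

```shell
#!/bin/sh
# Hypothetical SGE epilog sketch: after a job ends (or is qdel'ed),
# find any mpd.py daemons still owned by the job user on this node
# and terminate them.
user="${USER:-$(id -un)}"
pids=$(pgrep -u "$user" -f 'mpd\.py' 2>/dev/null)
if [ -n "$pids" ]; then
    for pid in $pids; do
        kill "$pid" 2>/dev/null      # polite SIGTERM; mpd exits cleanly on it
    done
    echo "epilog: killed leftover mpd daemons: $pids"
else
    echo "epilog: no leftover mpd daemons for $user"
fi
```

You would install this as the `epilog` of the queue (via qconf -mq). The obvious caveat: it kills all of that user's mpd daemons on the node, so it is only safe if a user never runs more than one MPI job per node at a time.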
Did you ever come up with a solution for this?
I have the same problem; has anyone found a solution?
I'm curious to know why Intel developed their MPI based on MPICH2/MVAPICH2. Why not based on OpenMPI?
OpenMPI was not well developed at the time the decision was made, had not yet supplanted LAM, and didn't support Windows until recently. Not all subsequent developments were foreseen. Are you suggesting that cooperative development between OpenMPI and SGE should have been foreseen? Do you know the future of SGE?
I have the same problem, too. What can I do?
I'm hoping this reply will reach everyone subscribed to this thread.
As a first point of business, I would suggest you give the new Intel MPI Library 4.0 a try. It came out last month and includes quite a few major changes. You can download it, if you still have a valid license, from the Intel Registration Center, or grab an eval copy from intel.com/go/mpi.
Secondly, we have plans to improve our tight integration support with SGE and other schedulers in future releases. So stay tuned.
Please have a look at:
for tight integration with correct accounting and control of all slave tasks by SGE. The Howto was originally written for MPICH2; since Intel MPI is based on MPICH2, the "mpd startup method" also applies to Intel MPI.
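For reference, the tight-integration setup in that Howto boils down to a parallel environment whose start/stop hooks bring the mpd ring up and down under SGE's control. A sketch of such a PE (the paths and the startmpich2.sh/stopmpich2.sh names follow the Howto's examples and will differ per site):

```
pe_name            mpich2_mpd
slots              999
start_proc_args    /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args     /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
```

The key settings are control_slaves TRUE, so slave tasks are started through SGE and stay under its accounting, and the stop_proc_args hook, which tears the mpd ring down even when the job is qdel'ed.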
Reuti -- Looks like a bad link ... maybe the new gridengine.org has it?
Not exactly what you're looking for, but you can hack the Intel "stock" mpirun script to do a better job of tight integration. A version that I hacked together is available at:
As was noted elsewhere, if the process detaches from sge_shepherd then you've lost tight integration. The script above should keep open connections to each child process -- so they all stay attached to sge_shepherd.
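The underlying trick, in sketch form: instead of letting mpirun rsh/ssh to remote nodes (where each mpd re-parents to init), the wrapper launches each daemon through SGE's qrsh -inherit, and the open qrsh connection keeps the daemon tied to that node's sge_shepherd. Roughly (pseudocode, not the actual hacked script; the mpd flags are illustrative):

```shell
# Pseudocode sketch: one mpd per granted host, started under SGE control.
# qrsh -inherit runs the command inside the job's existing allocation,
# so each mpd is a child of sge_execd/sge_shepherd on its node.
while read host slots rest; do
    qrsh -inherit "$host" mpd --ncpus="$slots" &
done < "$PE_HOSTFILE"
wait   # hold the connections open for the lifetime of the job
```

When the job is deleted, SGE tears down the shepherds, the qrsh connections close, and the mpds die with them -- exactly the cleanup that is missing when mpds daemonize on their own.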