Running parallel job on compute nodes with IB HCA from a master node "NOT" having IB HCA



Is it possible to run a task on compute nodes that have InfiniBand HCAs from a master node that lacks an IB HCA, using Torque/Grid Engine?

Please advise whether this is possible.

Intel MPI is installed on all cluster machines.

The network configuration is as follows:

Master Node: (2xXeon E5-2450/96GB/CentOS 6.2/NFS Services over Ethernet) - 1 No.

Compute Nodes: (2xXeon E5-2450/96GB/TrueScale QDR Dual-port QLE7342/CentOS 6.2/NFS Client over GbE) - 4 Nos.

IP Addresses (GbE) : Master Node: - Hostname: mnode; Compute Nodes: .. 225 - Hostnames: c00 .. c03

IP Address (ib0): Master Node: N/A; Compute Nodes: .. 225 - /etc/hosts -> c00-ib; c01-ib; c02-ib; c03-ib

Additionally, if mpiexec.hydra can be used, what is the command line to run directly from the master node, without Torque or Grid Engine?


Girish Nair <girishnairisonline at gmail dot com>

Director Supports
NJ Dataprint Private Limited

Hi Girish,

Whether you can run outside of the job scheduler depends on your system's policy. If it is set up to allow it, then there should be no problem using mpiexec.hydra (or mpirun, which will default to mpiexec.hydra) to run. Simply specify your hosts and ranks as you normally would. For some additional information, see the article Controlling Process Placement with the Intel® MPI Library.

If you are going to use InfiniBand* for your job nodes, but are launching from a system without IB, you will need to specify the network interface using either the -iface command line option or the I_MPI_HYDRA_IFACE environment variable.  You'll likely want to use eth0, but this can vary depending on your system configuration.

Also, do not use the IB host names to start your job.  Hydra will attempt to connect via ssh first, which needs to happen through the standard IP channel.  It will handle switching to the IB fabric for your job.  If you want to verify that it correctly launched using IB, run with I_MPI_DEBUG=2 to get fabric selection information.
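
As a rough sketch of the advice above, the launch line could be assembled as follows. The host names c00 to c03 are the Ethernet names from the question; the interface name eth0, the rank counts, and the binary name ./main.out are assumptions for illustration. The script only prints the command so it can be checked before running it on a system that permits launches outside the scheduler.

```shell
#!/bin/sh
# Sketch only: interface name, rank counts, and binary are assumptions.
hosts="c00,c01,c02,c03"   # Ethernet host names, not the -ib aliases
iface="eth0"              # interface Hydra uses to start the ranks

# I_MPI_DEBUG=2 makes Intel MPI report which fabric it selected.
launch_line="mpirun -n 16 -ppn 4 -hosts $hosts -iface $iface -genv I_MPI_DEBUG 2 ./main.out"

# Print the command instead of executing it.
echo "$launch_line"
```

If the debug output names the Ethernet fabric rather than the IB one, the fabric selection (not the -iface value) is the setting to revisit.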

James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,
Thanks for your response.

Please correct me if I have misunderstood your statement:

The machinefile would have the entries like:

n01:16 # hostname resolving to eth0 address
n02:16 # hostname resolving to eth0 address
n03:16 # hostname resolving to eth0 address
n04:16 # hostname resolving to eth0 address

while running from the master node not having an IB hardware:

mpiexec.hydra -np 16 -machinefile ./machine.cluster -iface ib0 ./main.out

I apologize if this is too much to ask. It would be great if you could provide an example.

Additionally, does the same command line accept the following along with the above options:
mpiexec.hydra ... -genv I_MPI_FABRICS shm:tmi ...

My understanding is that shm:dapl is the default, and I've found that shm:tmi gives me the best performance over IB. The master node obviously will not have an /etc/dat.conf file, since it lacks an IB HCA.

Thanks in advance for your expert advice.

Girish Nair

Director Supports
NJ Dataprint Private Limited

If tmi is always better than DAPL for you, you can set I_MPI_FABRICS=shm:tmi in your environment rather than having to pass it every time.  As for launching, unless you have an interface named ib0 on your master node, you'll want to use:

mpirun -n 16 -machinefile ./machine.cluster -iface eth0 ./main.out

The machinefile you have is correct.  Now, keep in mind, if you run this job with the machinefile you have, all 16 ranks will run on n01.  For more flexibility, I would use a hostfile instead.

$cat hostfile

And run with

mpirun -n <nranks> -ppn <ranks per node> -f hostfile ./main.out

This will run a total of <nranks> ranks, with <ranks per node> ranks placed on each of the nodes.  So, if I wanted to run 16 ranks, with 4 per node, that would be

mpirun -n 16 -ppn 4 -f hostfile ./main.out

This gives more flexibility in process placement. The article I linked shows several other options, and I'll add more information about the hostfile capability.
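
The hostfile listing did not survive in the post above, so the following is only a sketch of what it might look like, with the host names n01 to n04 borrowed from the machinefile earlier in the thread. It also works out where each rank would land under the block placement described above (ranks per node filled in hostfile order); the wrap-around to the first host is an assumption that only matters when there are more ranks than hosts times ranks per node.

```shell
#!/bin/sh
# Hypothetical hostfile: one Ethernet host name per line.
printf '%s\n' n01 n02 n03 n04 > hostfile

nranks=16
ppn=4
nhosts=$(wc -l < hostfile | tr -d ' ')

# Assign ranks in blocks of $ppn per host, in hostfile order
# (wrapping to the first host is an assumed model, not taken from the thread).
placement=""
rank=0
while [ "$rank" -lt "$nranks" ]; do
    idx=$(( (rank / ppn) % nhosts + 1 ))
    host=$(sed -n "${idx}p" hostfile)
    placement="$placement$rank:$host "
    rank=$((rank + 1))
done

echo "$placement"
# The matching launch line (printed, not executed):
echo "mpirun -n $nranks -ppn $ppn -f hostfile -iface eth0 ./main.out"
```

With these values, ranks 0 to 3 are placed on n01, 4 to 7 on n02, 8 to 11 on n03, and 12 to 15 on n04.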

Hi James,
Ah, that was a quick response from you. Thanks.

Please read my mpiexec.hydra command as -np 64 and not -np 16.

2 quick queries:
a) If -iface eth0 is used, would the job be run on IB on Compute Nodes?
b) Can the environment variable I_MPI_FABRICS be set on Master Node that lacks IB HCA hardware? If no, then should it be set on all Compute Nodes with IB HCA?

Girish Nair

Director Supports
NJ Dataprint Private Limited

Using -iface eth0 sets the interface to be used for launching the ranks, not the communication fabric to be used by MPI.

I_MPI_FABRICS needs to be set wherever you are launching the job.  Hydra will read this before launching and use it when launching the ranks.
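
A minimal sketch of the two settings side by side, assuming the launch node's Ethernet interface is eth0 and that a hostfile named hostfile already exists (both are illustrative). The rank counts follow the 64-rank, 4-node case discussed above. The commands are printed rather than executed.

```shell
#!/bin/sh
# Set the communication fabric on the launch (master) node, even though
# that node has no IB HCA; Hydra reads it before starting the ranks.
export I_MPI_FABRICS=shm:tmi

# -iface only selects the interface used to launch the ranks.
launch_line="mpirun -n 64 -ppn 16 -f hostfile -iface eth0 ./main.out"
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
echo "$launch_line"

# Per-job alternative: pass the fabric on the command line instead.
launch_line_genv="mpirun -genv I_MPI_FABRICS shm:tmi -n 64 -ppn 16 -f hostfile -iface eth0 ./main.out"
echo "$launch_line_genv"
```

Running with I_MPI_DEBUG=2, as suggested earlier in the thread, is the way to confirm which fabric the ranks actually selected.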

Thanks a ton James.

You've effectively cleared all my doubts on this. I'll wait for your notes on hostfile capability whenever you publish it.

Thanks once again.
~Girish Nair

Director Supports
NJ Dataprint Private Limited

The article was updated yesterday. If you can't see the updates, please let me know.

Thank you very much James.

Director Supports
NJ Dataprint Private Limited

Hi James,

Following up on this thread: according to your article and the Intel MPI 5.0 Linux Reference Manual, there are three ways to configure an MPMD launch on a cluster: -hostfile, -machinefile, and -configfile.

What is the difference between -hostfile and -machinefile?

Can we use per-rank HCA binding in -machinefile or -hostfile (as can be done with a single HCA in an MVAPICH mpihydra hostfile)? We want to control the HCA per rank, possibly with multiple HCAs per rank.

I believe machinefile/hostfile and configfile can be used in a single launch command. I would very much appreciate some references with detailed examples and explanations of how these three are used together.





