IMPI and DAPL fabrics on Infiniband cluster

Hello, I have been trying to submit a job on our cluster for an Intel 17-compiled, Intel MPI-enabled code. I keep running into trouble at startup when launching through PBS.

This is the submission script:

#PBS -N propane_XO2_ramp_dx_p3125cm(IMPI)
#PBS -W umask=0022
#PBS -e /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.err
#PBS -o /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.log
#PBS -l nodes=16:ppn=12
#PBS -l walltime=999:0:0
module purge
module load null modules torque-maui intel/17
export I_MPI_FABRICS=shm:dapl
export I_MPI_DEBUG=100
cd /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind
echo $PBS_O_HOME
echo `date`
echo "Input file: propane_XO2_ramp_dx_p3125cm.fds"
echo " Directory: `pwd`"
echo "      Host: `hostname`"
/opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec   -np 184 /home4/mnv/FIREMODELS_ISSUES/fds/Build/impi_intel_linux_64/fds_impi_intel_linux_64 propane_XO2_ramp_dx_p3125cm.fds

As you can see, I'm invoking DAPL, and OpenIB-cma (the first entry in /etc/dat.conf) gets picked up as the DAPL provider. This is what I see in /etc/dat.conf on my login node:

OpenIB-cma u1.2 nonthreadsafe default dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default dapl.1.2 "bond0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default dapl.2.0 "ib1 0" ""
ofa-v2-ib2 u2.0 nonthreadsafe default dapl.2.0 "ib2 0" ""
ofa-v2-ib3 u2.0 nonthreadsafe default dapl.2.0 "ib3 0" ""
ofa-v2-bond u2.0 nonthreadsafe default dapl.2.0 "bond0 0" ""

Now, logging in to the actual compute nodes, I don't see an /etc/dat.conf on them. I don't know if this is normal or if there is an issue there.
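For anyone wanting to reproduce the check: uDAPL reads /etc/dat.conf on every node where a rank runs, so a missing file on the compute nodes could explain the provider-load failures. A quick way to check the whole allocation from inside a job is a loop over the PBS node file; this is a sketch, assuming password-less ssh between nodes (which PBS normally provides):

```shell
# Check for /etc/dat.conf on every node in the current PBS allocation.
# (sketch; assumes password-less ssh to the compute nodes)
for node in $(sort -u "$PBS_NODEFILE"); do
    ssh "$node" '[ -r /etc/dat.conf ] \
        && echo "$(hostname): dat.conf present" \
        || echo "$(hostname): dat.conf MISSING"'
done
```

If the file really only exists on the login node, uDAPL's DAT_OVERRIDE environment variable can point every rank at a copy on a shared filesystem, though having the admins install it in /etc on each node is cleaner.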

Anyway, when I submit the job I get the attached stdout file, where it appears that some of the ranks fail to load OpenIB-cma (with no fallback fabrics available).

To be clear, some nodes on the cluster use QLogic InfiniBand cards and others use Mellanox.
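For reference, the card type on any given node can be confirmed directly with the standard InfiniBand diagnostics (a sketch, assuming the infiniband-diags package is installed):

```shell
# Identify the HCA on this node: QLogic adapters show up as qib*,
# Mellanox as mlx4_* / mlx5_*.
ibstat -l                      # list CA names, e.g. qib0 or mlx4_0
lspci | grep -i infiniband     # PCI view of the same adapters
```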

At this point I've tried several combinations, either specifying or not specifying IB fabrics, without success. I'd really appreciate your help troubleshooting this.
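For completeness, these are the Intel MPI 2017 environment variables I understand control fabric and provider selection; the values below are a sketch, not a known-good configuration for this cluster:

```shell
# Pin DAPL to a specific provider from /etc/dat.conf:
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0

# Alternatively, use the OFA (verbs) fabric instead of DAPL:
# export I_MPI_FABRICS=shm:ofa

# Let Intel MPI fall back to another fabric (e.g. tcp) if the selected
# one fails to initialize, instead of aborting at startup:
export I_MPI_FALLBACK=1
export I_MPI_DEBUG=100   # verbose fabric-selection logging
```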

Thank you,

Attachment: propane_XO2_ramp_dx_p3125cm.log (139.23 KB)

An extra note:

You can see in the attached file that the MPI processes that fail to load OpenIB-cma are not tied to nodes with a particular qib0:0 or mlx4_0:0 NUMA map. See, for example, process [57] or [108].

Thank you,


Hi Marcos. I have run Fire and Smoke simulations quite a few times, most recently on an Omni-Path fabric, but that is another story. I would suggest getting whoever runs your cluster to set a node property in PBS so that you can choose all-Mellanox or all-QLogic nodes. Also, can you run with I_MPI_FABRICS left unset, or with ofa?
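To sketch out what that node property would look like in the submission script: if the admins tag nodes (the property names `mlx` and `qib` here are hypothetical; your site would pick its own), the resource request can force a homogeneous allocation:

```shell
# Request only Mellanox-tagged nodes (property name is hypothetical,
# set by your cluster admins):
#PBS -l nodes=16:ppn=12:mlx

# ...or only QLogic-tagged nodes:
# #PBS -l nodes=16:ppn=12:qib
```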

Hi John, thank you for your reply! Yes, we do have dedicated queues for QLogic (24 nodes, I think) and Mellanox (12 nodes, I think). We have been trying for some time to run large jobs that span more than one dedicated queue, and have been somewhat successful with OpenMPI (though there have been other issues, like a constant memory leak we can't trace back to our source code).

I have noticed that Intel MPI (when it runs) is quite a bit faster than the OpenMPI we have available, hence the attempt to span Intel MPI jobs across both sets of nodes.

I did try running the job using ofa instead of dapl, and also dapl with ofa-v2-ib0 selected from the configuration list above. The problem is that the calculation randomly times out at different communication steps. I have also run the case using tcp; although it is extremely slow, it has run overnight without interruption.
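A minimal sketch of that tcp variant, for the record: routing Intel MPI's TCP traffic over the IPoIB interface at least keeps it off Ethernet (I_MPI_TCP_NETMASK accepts an interface name such as ib0):

```shell
# Slow but stable baseline: TCP over the IPoIB interface.
export I_MPI_FABRICS=shm:tcp
export I_MPI_TCP_NETMASK=ib0   # route MPI's TCP traffic over IPoIB
```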

Best Regards,

