System hangs at intel_mpi_rt test of Intel Cluster Checker.

Hello,

I ran into a problem when running Intel Cluster Checker 1.8. The system stops responding when it executes the intel_mpi_rt test module. The log file stops right after reporting the MPI library version.

The system configuration is:
OS: RHEL 5.5
Intel Cluster Runtime: 3.2-1
Intel MPI Library: 4.0.2.003
Intel MKL Library: 10.3.4.191
C++ compiler: mpicc

Is there any way to check what is causing the problem?

Thanks.
Best regards,
CT

CT,

Thanks for your report. What's your network setup?

The test module runs a local hello world with MPI, so chances are that the DAPL providers are wrongly configured if you are running OFED software.

I would first suggest running the dat_conf test module (it can also be added as a dependency of intel_mpi_rt).

Another option is to manually run an MPI hello world example to reproduce what the tool is doing; you can use I_MPI_DEBUG to get more details on what's missing.

I'm adding some background details below, just in case.

-- Andres

[man clck-intel_mpi_rt]
By default, the test module exercises 4 MPI processes over different network devices by using the shm and the sock I_MPI_DEVICES (or the shm and tcp I_MPI_FABRICS). Furthermore, if the /etc/dat.conf file or the DAT_OVERRIDE variable are present it also locally exercises the rdma (or dapl) fabric device. The I_MPI_FABRICS style is used if Intel MPI Library 4.x or later is detected.

[what the test is trying to do]
command: sh -c "source /opt/intel/impi/3.1//bin64/mpivars.sh; mpiexec -n 4 -env I_MPI_FALLBACK_DEVICE 0 -env I_MPI_DEVICE rdssm /tmp/clck-intel_mpi_rt.ic7884/test.impi"
output:
Hello world: rank 0 of 4 running on compute-00-00.local
Hello world: rank 1 of 4 running on compute-00-00.local
Hello world: rank 2 of 4 running on compute-00-00.local
Hello world: rank 3 of 4 running on compute-00-00.local

[running dat_conf]
/opt/intel/clck/1.8/cluster-check aic.xml --include_only dat_conf --verbose 5

[hello world example in a similar MPI installation]
/opt/intel/impi/4.0.2.003/test/test.c: ASCII C program text
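The manual hello world check suggested above could look roughly like the sketch below. It only prints the commands by default, so they can be reviewed before running. The install path and test.c location are taken from this thread; the bin64/mpivars.sh location under the 4.0.2.003 tree, the debug level 5, and the output path /tmp/hello_mpi are assumptions for illustration and should be adjusted to the actual installation.

```shell
#!/bin/sh
# Sketch: print the commands for a manual MPI hello world check.
# Assumptions: IMPI_ROOT layout (bin64/mpivars.sh, test/test.c), debug
# level 5, and the /tmp/hello_mpi output path are illustrative choices.
IMPI_ROOT="${IMPI_ROOT:-/opt/intel/impi/4.0.2.003}"

hello_commands() {
    fabrics="${1:-shm}"   # fabric to exercise, e.g. shm or shm:tcp
    # Load the Intel MPI environment.
    echo ". $IMPI_ROOT/bin64/mpivars.sh"
    # Build the hello world example shipped with the library.
    echo "mpicc -o /tmp/hello_mpi $IMPI_ROOT/test/test.c"
    # Run 4 local ranks with debug output on the chosen fabric.
    echo "mpirun -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS $fabrics -np 4 /tmp/hello_mpi"
}

# Dry run by default; to execute, pipe the output to sh:
#   hello_commands shm | sh
hello_commands "$@"
```

If the run hangs with one fabric but works with shm alone, that points at the fabric configuration rather than at the MPI installation itself.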

Hi Andres,

Thanks for your useful recommendations. I will try to fix the networking settings first.

Best regards,
CT

Hi Andres,

After checking the fabric settings, I can now pass the intel_mpi_rt test module. I just changed the setting from "rdssm" to "shm". Sorry that I didn't mention that I have set up only one machine (the head node) for testing; maybe that is why it has to be set to "shm" (I guess...).

I still have two questions:

1. Where can I find detailed information and definitions for the settings sock, shm, ssm, rdma, and rdssm?
2. If my cluster nodes are connected by Ethernet (no InfiniBand or iWARP devices) and no DAPL or OFED software is installed, how should I set up the dat.conf file to pass the dat_conf test?

Thanks.
Best regards,
CT

Dear CT,

It would be better to use I_MPI_FABRICS with Intel MPI Library 4.x instead of I_MPI_DEVICE.
The format is:
export I_MPI_FABRICS=shm:dapl
or
mpirun -genv I_MPI_FABRICS shm:dapl -np 222 ./a.out
So, the format is: Local_fabric:remote_fabric. Local_fabric can be any of: {shm, dapl, tcp, ofa, tmi}. Remote_fabric can be: {dapl, tcp, ofa, tmi}.

I hope that this format is more informative and doesn't require additional comments.

/etc/dat.conf lists all available providers on the node, and this list depends on the InfiniBand cards installed on that particular node.
By setting I_MPI_DAPL_PROVIDER you can select the provider you need from the list of available providers.

If there are no IB cards, or DAPL (OFED) was not installed, there will be no /etc/dat.conf file on the node and you need to use I_MPI_FABRICS=shm:tcp. And of course the dat_conf test will not apply.

Regards!
Dmitry
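
As a rough illustration of the rule described in this reply, the sketch below picks an I_MPI_FABRICS value depending on whether a DAT configuration is present. The function name choose_fabrics is hypothetical, and treating a non-empty /etc/dat.conf or a set DAT_OVERRIDE variable as "DAPL is available" is a simplifying assumption: the file being present does not guarantee that a provider actually works.

```shell
#!/bin/sh
# Sketch: choose I_MPI_FABRICS from the presence of a DAT configuration.
# choose_fabrics is a hypothetical helper, not part of Intel MPI.
choose_fabrics() {
    dat_conf="${1:-/etc/dat.conf}"
    # Assumption: a set DAT_OVERRIDE or a non-empty dat.conf means DAPL
    # providers are configured on this node.
    if [ -n "${DAT_OVERRIDE:-}" ] || [ -s "$dat_conf" ]; then
        echo "shm:dapl"
    else
        # No DAPL/OFED installed: fall back to TCP between nodes.
        echo "shm:tcp"
    fi
}

I_MPI_FABRICS="$(choose_fabrics)"
export I_MPI_FABRICS
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
```

On an Ethernet-only cluster without OFED this prints I_MPI_FABRICS=shm:tcp, matching the recommendation above.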

Dear Dmitry,

Thanks for your answer.

Although I have set "export I_MPI_FABRICS=shm:tcp" on my system, Cluster Checker still runs the "dat_conf" test automatically. Naturally the test fails, because no IB cards or DAPL are installed.

So, shall I skip this test with an exclude setting for dat_conf? Or are there other settings that should be changed in my XML file?

Best regards,
CT

Attachment: aic-20111026.104510.out (10.03 KB)

CT,

I think your exclude setting is the best approach to avoid running the test module; as you mention, it is not applicable in your setup.

As Dmitry mentions, the preferred syntax is the one with I_MPI_FABRICS. You can find more details here, on page 74.
