Unwanted output

I have a cluster with some E5410 and some E5-2660 nodes, all InfiniBand-connected, using Intel MPI (impi). Everything is working, but the E5410 nodes are producing a lot of unwanted output of the form (condensed, as there is one entry for every core):

node04.cluster:723a:f24164b0: 1094 us(1094 us): open_hca: device mlx4_0 not found

node04.cluster:723a:f24164b0: 28485 us(28485 us): open_hca: getaddr_netdev ERROR: No such file or directory. Is ib0 configured?

node04.cluster:723a:f24164b0: 52940 us(24455 us): open_hca: getaddr_netdev ERROR: Cannot assign requested address. Is ib1 configured?

The MPI tasks are running fine, so this output is more an annoyance than a problem, and there should be a way to avoid it. Suggestions?

Hi,

Please check the values of the environment variable I_MPI_DEBUG.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Running env | grep -e MPI (bash) shows only:

I_MPI_HYDRA_DEBUG=0

Hi,

That seems odd.  I_MPI_ROOT should usually be set, and setting I_MPI_HYDRA_DEBUG=0 is simply setting the default value.  Are you setting I_MPI_DEBUG on the command line?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi,

Actually, try unsetting DAPL_DBG.  Those messages are not coming from the Intel® MPI Library, but from the DAPL provider.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

I_MPI_ROOT is set, but since you did not ask, I did not mention it.

I am not setting I_MPI_DEBUG on the command line. Worth remembering: the output only occurs on the older E5410 machines, not on the newer E5-2660 ones.

If relevant, uname -a returns (for an older node, then a newer one):

Linux node01.cluster 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Linux node20.cluster 2.6.32-279.9.1.el6.x86_64 #1 SMP Tue Sep 25 21:43:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I set and exported the variables you suggested (the email you sent on the other thread):

env | grep -e I_MPI
I_MPI_DAPL_UD=enable
I_MPI_HYDRA_DEBUG=0
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
I_MPI_DAPL_UD_RDMA_MIXED=enable
I_MPI_ROOT=/opt/intel/impi/4.1.0.024

I still get the unwanted output. I did find one thing: if I use just one node, I do not get the output; it only appears when I use more than one of the older nodes. The unwanted output is easier to test with a short job (albeit still with the complicated code).

Hi,

Those settings were for the other thread, regarding the slowdown.  Here, try unsetting DAPL_DBG.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

DAPL_DBG is not set. Should it be set/unset on the command line?

I will try the other settings for the other thread, but the nodes are currently in use, so it will be some time (a day or more) before I can test that. Worse, the test itself takes 24 hours.

Hi,

It should not be set.  Is it possible that the DAPL providers on the older nodes are compiled with debug information?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

The vendor of the cluster compiled OFED (which I think is what would be relevant; I am not sure). Before using Intel MPI I was using MVAPICH (and also Open MPI), and with neither did I see anything similar, which suggests that debug information was not part of the compilation.

N.B., the older and newer nodes are on the same network with the same head node, although they are physically connected to different switches, with the newer switch daisy-chained to the older one.

Hi,

Ok.  I'll check with our DAPL developer to see if he has any ideas regarding why you would be getting these messages.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

The answer can be found at http://permalink.gmane.org/gmane.linux.drivers.rdma/4787; it appears that debug output was compiled in. I guess Intel MPI probes various options and, if they fail, moves on. Setting the environment variable DAPL_DBG_TYPE to 0 removes the output.

Understood.  I'm glad everything is working correctly now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
