Bug in Intel MPI?

Dear all,

I am trying to run my program on a cluster of 10 nodes, each running Windows 7 64-bit with Intel MPI 4.1.

I run my program with

mpiexec -n 12 test

or

mpiexec -wdir \\n01\mytest\ -hosts 10 n01 12 n02 12 n03 12 n04 12 n05 12 n06 12 n07 12 n08 12 n09 12 n10 12 \\n01\mytest\test

When only one Build Environment window is open, both command lines work. However, when two Build Environment windows are open, in one window the first command line still works but the second one fails with the following error message:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(659)......................:
MPID_Init(195).............................: channel initialization failed
MPIDI_CH3_Init(106)........................:
MPID_nem_tcp_post_init(344)................:
MPID_nem_newtcp_module_connpoll(3099)......:
recv_id_or_tmpvc_info_success_handler(1328): read from socket failed - No error
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(659)................:
MPID_Init(195).......................: channel initialization failed
MPIDI_CH3_Init(106)..................:
MPID_nem_tcp_post_init(344)..........:
MPID_nem_newtcp_module_connpoll(3099):
gen_read_fail_handler(1194)..........: read from socket failed - The specified network name is no longer available.

Is this a bug in Intel MPI, or do I need to write special code for the program to work in this situation?

Thanks,

Zhanghong Tang

Hi Zhanghong,

Please compare the environment variables in the two windows using "set".  Are you attempting to run in both simultaneously?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
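
For anyone repeating the comparison suggested above, here is a minimal Python sketch. It assumes the output of "set" from each window has been saved to a text file (for example with "set > window1.txt"); the function names and the sample variable values are hypothetical and only illustrate the idea.

```python
def parse_set_output(text):
    """Parse the output of the Windows `set` command into a dict.

    Variable names are upper-cased because Windows treats them
    case-insensitively.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name.strip().upper()] = value
    return env


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for every variable that differs.

    A value of None means the variable is missing on that side.
    """
    diffs = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            diffs[name] = (a.get(name), b.get(name))
    return diffs


# Hypothetical sample data standing in for the two saved dumps.
window1 = parse_set_output("PATH=C:\\tools\nI_MPI_ROOT=C:\\mpi\n")
window2 = parse_set_output("Path=C:\\tools\nI_MPI_ROOT=D:\\mpi\n")
for name, (left, right) in diff_env(window1, window2).items():
    print(name, "window 1:", left, "| window 2:", right)
```

An empty result means the two windows have identical environments, so the difference has to be looked for elsewhere.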

Dear Dr. Tullos,

Thank you very much for your kind reply. I compared the environment variables in the two windows and found that they are exactly the same.

Yes, I need to run my program in both windows simultaneously (with different parameters). But for now, the problem occurs when I open two windows and run the program in only one of them.

Thanks,

Zhanghong Tang

Hi Zhanghong,

I've looked over the verbose output you sent.  I think there is a problem with the network path.  Please try either mapping the network path to a local drive or making certain you have properly set up Active Directory*.  See section 3.2.1 of the Intel® MPI Library for Windows* Reference Manual for details on how to set up Active Directory*.

Please let me know if this helps.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Dear Dr. Tullos,

Thank you very much for your kind reply. As the Intel® MPI Library for Windows* Reference Manual suggests, I downloaded the Remote Server Administration Tools from here:

http://www.microsoft.com/en-us/download/details.aspx?id=7887

I installed them and then enabled 'Enable Active Directory Administrative Center' according to this page:

http://technet.microsoft.com/en-us/library/dd560652%28v=ws.10%29.aspx

However, when I then tried to 'Open the Computers list in the Active Directory Users and Computers administrative utility' as the Intel® MPI Library for Windows* Reference Manual suggests, the errors shown in the attached picture appeared.

I have Windows 7 64-bit installed on all nodes of the cluster, and I found that I can't create a domain from a Windows 7 64-bit system:

http://www.sevenforums.com/network-sharing/125929-how-create-domain-wind...

What should I do next?

Thanks,

Zhanghong Tang

Attachment: errinfo.jpg (38.37 KB)

Dear Dr. Tullos,

Thank you very much for your kind reply.

I tried what you suggested, but it failed. I can't 'Open the Computers list in the Active Directory Users and Computers administrative utility' as instructed by the manual, since there is no domain in the cluster, which has Windows 7 64-bit installed on every node. I searched the internet, and some people say that a domain cannot be set up on a Windows 7 64-bit system.

Do you have any suggestions? Can I set up Active Directory in the cluster without a domain?

Thanks,

Zhanghong Tang

Dear Dr. Tullos,

I also tried your other suggestion, mapping the network path to a local drive, but it also failed (even on the local node). The error message is as follows:

forrtl: severe (29): file not found, unit 1, file C:\Windows\system32\parainfo\polesize.dat

My program reads some data from the .\parainfo folder. The program works when the network path is given as the working directory.

Could you please help me take a look at it?

Thanks,

Zhanghong Tang
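
The path in the error above, C:\Windows\system32\parainfo\polesize.dat, suggests that .\parainfo was resolved against C:\Windows\system32 instead of the intended working directory. The actual program is Fortran; the Python sketch below only illustrates one way to make that kind of failure easier to diagnose, by resolving the data file against an explicitly chosen base directory and reporting the real working directory when the file is missing. The function name and its parameters are hypothetical.

```python
import os


def resolve_data_file(relative_path, base_dir=None):
    """Resolve a data file against base_dir (or the current directory).

    Raises FileNotFoundError with the current working directory in the
    message, so a wrong working directory is visible immediately.
    """
    base = base_dir if base_dir is not None else os.getcwd()
    full = os.path.normpath(os.path.join(base, relative_path))
    if not os.path.isfile(full):
        raise FileNotFoundError(
            f"{full} not found (current working directory: {os.getcwd()})"
        )
    return full
```

Passing the base directory as a command-line argument, rather than relying on the directory the launcher starts the process in, would make the program independent of whether the working directory setting is honored on every node.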

Hi Zhanghong,

When you are running on the mapped drive, are you specifying a working directory?  Is it the same on all of the systems?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Dear Dr. Tullos,

I mapped the network path as the Z: drive on all nodes, and the command line to run the program is as follows:

mpiexec -wdir Z:\debug\directional -hosts 10 n01 4 n02 4 n03 4 n04 4 n05 4 n06 4 n07 4 n08 4 n09 4 n10 4 Z:\debug\directional\fem

or (on a single node, with the current directory set to Z:\debug\directional):

mpiexec -n 4 fem

Then the same error message appears as I reported before.

Thanks,

Zhanghong Tang
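
Since the multi-node command above repeats the same host and process-count pair ten times, it is easy to mistype. The sketch below is a small Python helper that assembles the same argument list; it uses only the -wdir and -hosts options exactly as they appear in the command above, and the helper name is hypothetical.

```python
def build_mpiexec_command(hosts, procs_per_host, workdir, program):
    """Build an mpiexec argument list of the form used in this thread:

    mpiexec -wdir <workdir> -hosts <n> <host1> <count1> ... <program>
    """
    args = ["mpiexec", "-wdir", workdir, "-hosts", str(len(hosts))]
    for host in hosts:
        args += [host, str(procs_per_host)]
    args.append(program)
    return args


# Node names n01..n10 and the paths are taken from the command above.
hosts = [f"n{i:02d}" for i in range(1, 11)]
cmd = build_mpiexec_command(
    hosts, 4, r"Z:\debug\directional", r"Z:\debug\directional\fem"
)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run directly, which avoids quoting problems if a path ever contains spaces.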

Hi Zhanghong,

Please try using the mpiexec option -mapall.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Dear James,

Thank you very much for your kind reply. I added the -mapall option and the results are similar to before: sometimes it works, and on other runs the following errors are displayed:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(659)......................:
MPID_Init(195).............................: channel initialization failed
MPIDI_CH3_Init(106)........................:
MPID_nem_tcp_post_init(344)................:
MPID_nem_newtcp_module_connpoll(3099)......:
recv_id_or_tmpvc_info_success_handler(1328): read from socket failed - No error
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(659)................:
MPID_Init(195).......................: channel initialization failed
MPIDI_CH3_Init(106)..................:
MPID_nem_tcp_post_init(344)..........:
MPID_nem_newtcp_module_connpoll(3099):
gen_read_fail_handler(1194)..........: read from socket failed - The specified network name is no longer available.

and sometimes the following errors are displayed:

*********** Warning ************
Unable to map \\n01\Debug. (error 71)

*********** Warning ************
launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N09' failed, error 2 - The system cannot find the file specified.

launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N07' failed, error 2 - The system cannot find the file specified.

launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N02' failed, error 2 - The system cannot find the file specified.

*********** Warning ************
Unable to map \\n01\Debug. (error 71)

*********** Warning ************
forrtl: severe (29): file not found, unit 1, file C:\Windows\system32\parainfo\polesize.dat
Image              PC                Routine            Line        Source

fem.exe            000000014053DAE7  Unknown               Unknown  Unknown
fem.exe            00000001405390B3  Unknown               Unknown  Unknown
fem.exe            00000001404CB016  Unknown               Unknown  Unknown
fem.exe            00000001404A5635  Unknown               Unknown  Unknown
fem.exe            00000001404A4270  Unknown               Unknown  Unknown
fem.exe            0000000140482B39  Unknown               Unknown  Unknown
fem.exe            000000013FCAAEE7  READPOLE                   41  readdata.f90

fem.exe            000000013FCC6C02  MAIN__                     14  main.f90
fem.exe            000000014172267C  Unknown               Unknown  Unknown
fem.exe            0000000140510B37  Unknown               Unknown  Unknown
kernel32.dll       000000007738652D  Unknown               Unknown  Unknown
ntdll.dll          00000000774BC521  Unknown               Unknown  Unknown
launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N04' failed, error 2 - The system cannot find the file specified.

launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N07' failed, error 2 - The system cannot find the file specified.

launch failed: CreateProcess(\\n01\Debug\directional\fem) on 'N04' failed, error 2 - The system cannot find the file specified.

*********** Warning ************
Unable to map \\n01\Debug. (error 71)

*********** Warning ************

After running

smpd -restart

and then closing and reopening the MPI environment window, it works again.

Could you please help me take a look at it?

Thanks,

Zhanghong Tang

Hi Zhanghong,

Please check through your Windows* system logs and look for anything indicating a networking failure on the nodes.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
