MPI 4 multi nodes running problem on Windows

I just upgraded Intel MPI for Windows from 3.0.012 to 4.0.0.011. After the upgrade, I can run a parallel case on a single node without problems, but if I run a parallel case across multiple nodes, my program always stops. When I debugged the run, the processes were started in shm data transfer mode. If I set I_MPI_FABRICS to shm:tcp, the program also stops. If I set I_MPI_FABRICS to tcp, the program runs. If I set I_MPI_FABRICS to dapl and set I_MPI_FALLBACK to enable, the program also runs. But that is not what I want. We are developing commercial software, and we want MPI to select the fabric automatically; our users may not know the details of setting those environment variables. The problem happens on both Windows XP 64-bit and Windows 7 64-bit. Has anyone met the same problem? Thanks,


Hello,

Could you check the version of smpd running on your nodes? You get that information by running the following command in a command window:
smpd -get binary
It should be from version 4.0.

Could you also check the old environment - some environment variables might be left over from the previous version. At least there should be no I_MPI_DEVICE.
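As a quick cross-check, a sketch like the following lists any leftover I_MPI_* variables (shown as a POSIX shell one-liner; on the Windows nodes in this thread the cmd equivalent would be `set I_MPI_`):

```shell
# List any Intel MPI variables still set in the environment.
# I_MPI_DEVICE in particular should not appear after the upgrade.
env | grep '^I_MPI_' || echo "no I_MPI_ variables found"
```

On Windows, also look under System Properties -> Environment Variables, since machine-wide settings survive a reinstall.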

When running Intel MPI, fallback (I_MPI_FALLBACK) is enabled by default, so the library checks all existing fast fabrics and, if none are available, falls back to tcp. You can see which fast fabric has been selected by setting I_MPI_DEBUG=2 (or higher).

When you set I_MPI_FABRICS=shm:tcp and everything works fine, that means something prevents the library from running the same way in default mode.

BTW: You can upgrade 4.0.0.011 to 4.0.1 (and very soon to 4.0.2)

Regards!
Dmitry

I ran smpd -get binary; it shows:
C:\Program Files (x86)\Intel\MPI-RT\4.0.0.012\em64t\bin\smpd.exe
If I run smpd -V, it shows:
Intel MPI Library for Windows* OS, Version 4.0 Build 2/18/2010 1:00:47 PM
Copyright (C) 2007-2010, Intel Corporation. All rights reserved.
If I run smpd -version, it shows:
3.1
I installed our package on two freshly installed Windows 7 64-bit computers. I think there are some bugs in MPI version 4.0.0.012. I want to know if I need to install any other libraries.

>If I set I_MPI_FABRICS to shm:tcp, program also stopped.
Could you run your program like:
mpiexec -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts [your hosts and number of processes] ./app_name

And place the output here.

Regards!
Dmitry

Thanks, Dmitry. I ran 5 cases.
-------------------------------------------------------------------------------------------------------
First case: I ran without -genv I_MPI_FABRICS shm:tcp, and I set both host names to the same node, gems3. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems3 2 -pwdfile "mypassword" "myfile"
[3] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 672 gems3 n/a
[0] 1 3056 gems3 n/a
[0] 2 2960 gems3 n/a
[0] 3 2116 gems3 n/a
[0] 4 2700 gems3 n/a
Running time : 0:00:01.
-------------------------------------------------------------------------------------------------------
Second case: I ran with -genv I_MPI_FABRICS shm:tcp, and I again set both host names to the same node, gems3. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems3 2 -pwdfile "mypassword" "myfile"
[4] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 2348 gems3 n/a
[0] 1 2712 gems3 n/a
[0] 2 2568 gems3 n/a
[0] 3 2192 gems3 n/a
[0] 4 2408 gems3 n/a
Running time : 0:00:01.

The above two cases both work fine, because they start on the same computer.
-------------------------------------------------------------------------------------------------------
Third case: I ran without -genv I_MPI_FABRICS shm:tcp, and I set the host names to two different nodes, gems3 and gems4. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
job aborted:
rank: node: exit code[: error message]
0: gems3: -1073741819: process 0 exited without calling finalize
1: gems3: -1073741819: process 1 exited without calling finalize
2: gems3: 123
3: gems4: 123
4: gems4: 123
-------------------------------------------------------------------------------------------------------
Fourth case: I ran with -genv I_MPI_FABRICS shm:tcp, and I set the host names to two different nodes, gems3 and gems4. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
job aborted:
rank: node: exit code[: error message]
0: gems3: -1073741819: process 0 exited without calling finalize
1: gems3: -1073741819: process 1 exited without calling finalize
2: gems3: 123
3: gems4: 123
4: gems4: 123

The above two cases don't work, because they start on different computers.
-------------------------------------------------------------------------------------------------------
Fifth case: I ran with -genv I_MPI_FABRICS tcp, and I set the host names to two different nodes, gems3 and gems4. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[0] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[1] MPI startup(): tcp data transfer mode
[3] MPI startup(): tcp data transfer mode
[4] MPI startup(): tcp data transfer mode
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 2080 gems3 n/a
[0] 1 2908 gems3 n/a
[0] 2 920 gems3 n/a
[0] 3 1464 gems4 n/a
[0] 4 2096 gems4 n/a
Running time : 0:00:01.

This case works fine.

Well, it's not clear why the library doesn't work in the shm:tcp case. Do gems3 and gems4 have the same CPUs?
Could you try the following command:
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_PLATFORM 0 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"

If I_MPI_PLATFORM doesn't help, please download Intel MPI Library version 4.0 Update 1 and give it a try. Remember that it should be updated on all nodes.

Regards!
Dmitry

Dear Dmitry,
Thanks. I tried -genv I_MPI_PLATFORM 0; it still doesn't help. The two computers have the same CPUs. I also upgraded the MPI library from 4.0.0.012 to 4.0.1.007 and recompiled the program, but the problem is the same. If I use MPI library 3.2.012, my program works fine.
Our customers have different networks, like InfiniBand, Myrinet, and Ethernet, but many of them don't understand network settings. We want our program to work properly with the default settings. When we ran our program with MPI 3.2.012, it could select the network automatically. But with MPI version 4, it cannot even start between two computers.

Could you please check the environment on both computers? It is highly possible that something was left over from your previous installation; look especially for I_MPI_DEVICE. Your default run on 2 computers somehow starts in shm-only mode - that's not what we are expecting.

"Third case, I ran without -genv I_MPI_FABRICS shm:tcp, and I set the host names to two different nodes, gems3 and gems4. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"

[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode"

You don't need any other library - everything should work fine.
Do you use a script to run your application? Might you be making some settings there?

Regards!
Dmitry

Hi Dmitry,
The following is the environment setting. Both computers are almost the same. There is no I_MPI_DEVICE setting.
BTW, the two computers are fresh installations; no previous MPI was installed.

ALLUSERSPROFILE=C:\ProgramData
APPDATA=C:\Users\gems\AppData\Roaming
CommonProgramFiles=C:\Program Files\Common Files
CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
CommonProgramW6432=C:\Program Files\Common Files
COMPUTERNAME=GEMS3
ComSpec=C:\Windows\system32\cmd.exe
FP_NO_HOST_CHECK=NO
HOMEDRIVE=C:
HOMEPATH=\Users\gems
INTEL_LICENSE_FILE=C:\Program Files (x86)\Common Files\Intel\Licenses
I_MPI_FABRICS_LIST=dapl,tcp
I_MPI_FALLBACK=enable
I_MPI_ROOT=C:\Program Files (x86)\Intel\MPI\4.0.1.007\
LOCALAPPDATA=C:\Users\gems\AppData\Local
LOGONSERVER=\\GEMS3
NUMBER_OF_PROCESSORS=1
OS=Windows_NT
Path="C:\Program Files (x86)\Intel\MPI\4.0.1.007\em64t\bin";C:\Program Files (x86)\Intel\MPI\4.0.1.007\em64t\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PROCESSOR_ARCHITECTURE=AMD64
PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
PROCESSOR_LEVEL=16
PROCESSOR_REVISION=0503
ProgramData=C:\ProgramData
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)
ProgramW6432=C:\Program Files
PROMPT=$P$G
PSModulePath=C:\Windows\system32\WindowsPowerShell\v1.0\Modules\
PUBLIC=C:\Users\Public
SESSIONNAME=Console
SystemDrive=C:
SystemRoot=C:\Windows
TEMP=C:\Users\gems\AppData\Local\Temp
TMP=C:\Users\gems\AppData\Local\Temp
USERDOMAIN=gems3
USERNAME=gems
USERPROFILE=C:\Users\gems
windir=C:\Windows

I ran the cases using a batch file; we didn't make any settings in the batch file. But even if I run the case from the command line, the problems are the same.

Hi Yongjun,

It's not clear why these variables are in the list:
I_MPI_FABRICS_LIST=dapl,tcp
I_MPI_FALLBACK=enable
They can be removed from the environment.
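A minimal sketch of clearing them for the current session (POSIX shell shown; on Windows cmd the equivalent would be `set I_MPI_FABRICS_LIST=` and `set I_MPI_FALLBACK=`, plus removing them under System Properties for a permanent fix):

```shell
# Clear the leftover variables for this session only; a permanent fix
# means deleting them wherever they were defined (system-wide on Windows).
unset I_MPI_FABRICS_LIST I_MPI_FALLBACK
env | grep '^I_MPI_' || echo "no I_MPI_ variables set"
```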

Could you please compile your program (or the HelloWorld example from the test directory instead) with debug information and run it with I_MPI_FABRICS=shm:tcp on 2 nodes with I_MPI_DEBUG=50.
Please send me only lines with "business card" in them.

It looks like gems3 and gems4 are considered to have the same IP address. Could you please check that they have different IP addresses?

Regards!
Dmitry

Hi Dmitry,
I compiled the test case that comes with the Intel MPI package. The host names in ma.win are the IP addresses of the computers: 192.168.206.132 and 192.168.206.134. I am quite sure the two computers have different IP addresses. I ran two cases.

First, I ran with shm:tcp; I cannot find the business card output.

mpiexec -genv I_MPI_DEBUG 50 -genv I_MPI_FABRICS shm:tcp -n 2 -machinefile ma.win -pwdfile pa.win test
[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[0] MPI startup(): I_MPI_LIBRARY_VERSION: 4.0 Update 1
[0] MPI startup(): I_MPI_VERSION_DATE_OF_BUILD: 9/10/2010 2:02:16 PM
[0] MPI startup(): I_MPI_VERSION_MY_CMD_LINE: winconfigure.wsf
[0] MPI startup(): I_MPI_VERSION_MACHINENAME: SVLMPIBLD07
[0] MPI startup(): I_MPI_DEVICE_VERSION: 4.0 Update 1 9/10/2010
[1] MPID_nem_impi_init_shm_configuration(): shm topology: windows pinning is unavailable
[1] MPID_nem_impi_init_shm_configuration(): shm memcpy: cache bypass thresholds: 16384,2097152,-1,2097152,-1,2097152
[1] MPID_nem_impi_init_shm_configuration(): shm topology: pinning is unavailable
[0] MPID_nem_impi_init_shm_configuration(): shm topology: windows pinning is unavailable
[0] MPID_nem_impi_init_shm_configuration(): shm memcpy: cache bypass thresholds: 16384,2097152,-1,2097152,-1,2097152
[0] MPID_nem_impi_init_shm_configuration(): shm topology: pinning is unavailable
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(527).................: Initialization failed
MPID_Init(171)........................: channel initialization failed
MPIDI_CH3_Init(70)....................:
MPID_nem_init_ckpt(665)...............:
MPIDI_CH3I_Seg_commit(372)............:
MPIU_SHMW_Hnd_deserialize(362)........:
MPIU_SHMW_Seg_open(942)...............:
MPIU_SHMW_Seg_create_attach_templ(826): unable to allocate shared memory - OpenFileMapping The system cannot find the file specified.
job aborted:
rank: node: exit code[: error message]
0: 192.168.206.132: 123
1: 192.168.206.134: 1: process 1 exited without calling finalize

Second, I ran with the tcp option; the program ran OK. We can find the business card output.

mpiexec -genv I_MPI_DEBUG 50 -genv I_MPI_FABRICS tcp -n 2 -machinefile ma.win -pwdfile pa.win test
[0] MPID_nem_init_ckpt(): business card: description="gems3 gems3 " port=23235 ifname="" fabrics_list=tcp
[0] getConnInfoKVS(): got business card: description="gems4 gems4 " port=33985 ifname="" fabrics_list=tcp
[1] MPID_nem_init_ckpt(): business card: description="gems4 gems4 " port=33985 ifname="" fabrics_list=tcp

Hi Yongjun,

Well, it seems to me that you are using computer names without a DNS suffix. Please check the suffix in "My Computer" -> System Properties -> Computer Name (tab) -> "Change..." button.
The full computer name should have a DNS suffix. If it doesn't (the computer name looks like 'gems3'), please press the "More..." button and type a suffix in the "Primary DNS suffix" field.
If you don't have a domain name, you can try using 'local'.
You need to do this on each computer you are going to use.
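A rough way to see whether a node already reports a fully qualified name is to check for a dot in its host name (POSIX shell sketch; on the Windows nodes in this thread, `ipconfig /all` shows the "Primary Dns Suffix" directly):

```shell
# Print whether this machine's name carries a DNS suffix.
name=$(hostname)
case "$name" in
  *.*) echo "has DNS suffix: $name" ;;
  *)   echo "no DNS suffix: $name" ;;
esac
```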

Please do it and try to run a program with default parameters.

Regards!
Dmitry

Hi Dmitry,
Thanks for your help. I added the DNS suffix. Now our program can run with the default parameters; everything works fine.
I have a question. In Intel MPI version 3, the DNS suffix isn't needed; we can leave it empty. Is it a new requirement in MPI version 4? By default, the DNS suffix is empty if the computer doesn't join any domain. Does MPI version 4 require that it be set if the computer doesn't join any domain? If I don't have a domain name, how do I use "local"?
Regards,
Yongjun

Yongjun,
this is not a requirement, but it sometimes works in unexpected ways if there is no DNS suffix. We are investigating the issue. For now, just add a suffix - nothing else will be needed.

Regards!
Dmitry

Hi Dmitry:
Will this issue be fixed in an update of Intel MPI (e.g., 4.0.2.006)? From our experience on a cluster of Windows Server 2008 and Windows XP:
(1) The DNS suffix must be added to prevent the failure of OpenFileMapping when using Intel MPI.
(2) Once the DNS suffix is added, programs based on MPICH2 (which some of our customers already use) suffer a failure of gethostbyname(). Our customers are unhappy about this...
Regards,
Seifer

Hi Seifer,

This issue will be fixed in the upcoming 4.0 Update 3 release, which should be available to customers sometime in November. I hope that this fix will resolve the inconsistency between different MPI implementations.

Regards!
Dmitry
