How to run a program on a cluster?

Dear all,

I have a cluster of 10 nodes (named n01, n02, ..., n10). Windows 7 and Intel MPI 4.0.3.009 are installed on every node. On any node, I can run the program with:
mpiexec -n 24 \\n01\debug\test

and on node n01, I can run the program with:
mpiexec -hosts 2 n02 24 n03 24 \\n01\debug\test
or
mpiexec -hosts 9 n02 24 n03 24 n04 24 n05 24 n06 24 n07 24 n08 24 n09 24 n10 24 \\n01\debug\test

However, the program hangs if node n01 is included in the host list, for example:
mpiexec -hosts 2 n01 24 n02 24 \\n01\debug\test
or
mpiexec -hosts 2 n02 24 n01 24 \\n01\debug\test

Could anyone help me look into this problem? How should I launch the program so that it runs on every node?

Thanks,
Zhanghong Tang
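
For reference, the -hosts form used in the commands above takes a host count followed by host / process-count pairs, then the executable. The following is only a minimal Python sketch for assembling such a command line; the helper name build_mpiexec_command is hypothetical, and the node names and executable path are the ones quoted in this post.

```python
def build_mpiexec_command(hosts, procs_per_host, executable):
    """Assemble 'mpiexec -hosts <count> <host> <n> ... <exe>' as an argument list.

    Hypothetical helper for illustration; it only mirrors the command
    layout shown in the post and does not launch anything.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # The number after -hosts must equal the number of host/count pairs.
    args = ["mpiexec", "-hosts", str(len(hosts))]
    for host in hosts:
        args += [host, str(procs_per_host)]
    args.append(executable)
    return args


if __name__ == "__main__":
    nodes = ["n%02d" % i for i in range(2, 4)]  # n02, n03
    print(" ".join(build_mpiexec_command(nodes, 24, r"\\n01\debug\test")))
```

Generating the list this way keeps the count after -hosts in step with the number of nodes actually listed.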

Dear all,

Further tests show the following. Suppose the program is stored on node n0i (for example, at \\n03\debug\de) and mpiexec is run from node n0j (for example, n04), where i, j = 1, 2, ..., 10. If the host list includes either n0i or n0j, the program always hangs; if neither is in the host list, the program runs successfully. Could anyone tell me how I should configure the cluster and Intel MPI?

Thanks,
Zhanghong Tang

Hi Zhanghong,

Are you running the basic MPI test program? Please try running with the -delegate option to mpiexec. If this does not work, please run with

-env I_MPI_DEBUG 5

And give me the output from that.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

Thank you very much for your kind reply. I ran two tests on node n01:
1) run by
mpiexec -delegate -env I_MPI_DEBUG 5 -wdir \\n01\debug\ -hosts 3 n02 24 n01 24 n03 24 \\n01\debug\de >out1.txt

The program exited quickly, and the output is in out1.txt.

2) run by
mpiexec -env I_MPI_DEBUG 5 -wdir \\n01\debug\ -hosts 3 n02 24 n01 24 n03 24 \\n01\debug\de >out2.txt

The program hangs, and the output is in out2.txt.

Could you please help me to check it?

Thanks,
Zhanghong Tang

Attachments:

out1.txt (2.08 KB)
out2.txt (8.49 KB)

Hi Zhanghong,

How is your cluster set up? Are the nodes joined to a domain? Have you tried running with the file stored in a local folder on each computer (with the same path)? Is the program you are running the MPI test included in the installation?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

Thank you very much for your kind reply. The cluster is simply a collection of 10 nodes, each with 64-bit Windows 7 installed. It seems that I can't create a domain on a Windows 7 system (is there any way to create a domain for the cluster?).

I have also tried running the files from a local folder as you suggested; the results are exactly the same as described in my first two posts.

I have even tested a simple program, which has the same problem.

Thanks,
Zhanghong Tang

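PS: the simple code I tested:

program hello
implicit none
include 'mpif.h'

integer myid, numprocs, ierr

! Start MPI, then query this process's rank and the total number of processes
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
! Each process prints its own rank
write(*,*) myid
call MPI_FINALIZE(ierr)
end program hello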

Hi Zhanghong,

Creating a domain requires a domain controller. Don't worry about a domain; as long as the username and password match across all of the computers, that should be sufficient.

What are the firewall settings on the computers? If you are leaving the firewall on, each computer should have exceptions for mpiexec, smpd, and the program you are running. The easiest (though obviously much less secure) method is to turn the firewalls off completely.

What happens if you use mpiexec to run hostname?

What are your environment variables (specifically those starting with I_MPI)?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

Thank you very much for your kind reply.

As you suggested, I turned off the Windows Firewall, and the program now works fine.

Thanks,
Zhanghong Tang

Hi Zhanghong,

I'm glad to hear everything is working now.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

Now I have another problem running the program. I have tens of models to simulate at the same time, so I parallelized the code. The code runs successfully without mpiexec (as a single process). However, when I run it with the following command line:

mpiexec -wdir \\n01\debug -hosts 4 n01 24 n02 24 n03 24 n04 24 \\n01\debug\de

the following error is displayed:
job aborted:
rank: node: exit code[: error message]
0: N01: 123
1: N01: 123
2: N01: 123
3: N01: 123
4: N01: 123
5: N01: 123
6: N01: 123
7: N01: 123
8: N01: 123
9: N01: 123
10: N01: 123
11: N01: 123
12: N01: 123
13: N01: 123
14: N01: 123
15: N01: 123
16: N01: 123
17: N01: 123
18: N01: 123
19: N01: 123
20: N01: 123
21: N01: 123
22: N01: 123
23: N01: 123
24: n02: 123
25: n02: 123
26: n02: 123
27: n02: 123
28: n02: 123
29: n02: 123
30: n02: 123
31: n02: 123
32: n02: 123
33: n02: 123
34: n02: 123
35: n02: 123
36: n02: 123
37: n02: 123
38: n02: 123
39: n02: 123
40: n02: 123
41: n02: 123
42: n02: 123
43: n02: 123
44: n02: 123
45: n02: 123
46: n02: 123
47: n02: 123
48: n03: 123
49: n03: 123
50: n03: 123
51: n03: 123
52: n03: 123
53: n03: 123
54: n03: 123
55: n03: 123
56: n03: 123
57: n03: 123
58: n03: 123
59: n03: 123
60: n03: 123
61: n03: 123
62: n03: 123
63: n03: 123
64: n03: 123
65: n03: 123
66: n03: 123
67: n03: 123
68: n03: 123
69: n03: 123
70: n03: 123
71: n03: 123
72: n04: 123
73: n04: 123
74: n04: 123
75: n04: 123
76: n04: 123
77: n04: 123
78: n04: 123
79: n04: 123
80: n04: 123
81: n04: 123
82: n04: 123
83: n04: 123
84: n04: 123
85: n04: 123
86: n04: 123
87: n04: 123
88: n04: 123
89: n04: 123
90: n04: 123
91: n04: 123
92: n04: 123
93: n04: 123
94: n04: 123
95: n04: -1073740940: process 95 exited without calling finalize

I also checked with the following command line:
mpiexec -env I_MPI_DEBUG 5 -wdir \\n01\debug -hosts 4 n01 24 n02 24 n03 24 n04 24 \\n01\debug\de

The error output is the same.

Could you please help me look into the problem?

Thanks,
Zhanghong Tang

Hi James,

The problem is solved. It was caused by an out-of-bounds array access. However, it is strange that no further error message was displayed when the program crashed, even though I built the project with the /check:all option.

Thanks,
Zhanghong Tang

Hi Zhanghong,

I'm glad you found the error. For your reference, the final error you received (process 95 exited without calling finalize) indicates that process 95 died in some manner and did not call MPI_Finalize. Was the array bounds error related to MPI? What language are you using?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
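
As a side note on the exit code reported for process 95: on Windows, a large negative exit code is often an NTSTATUS value printed as a signed 32-bit integer. The sketch below (the function name to_ntstatus_hex is made up for illustration) reinterprets it as unsigned hexadecimal. The result, 0xC0000374, is the value Windows documents as STATUS_HEAP_CORRUPTION, which would be consistent with the out-of-bounds array access reported above, although the thread itself does not confirm that mapping.

```python
def to_ntstatus_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned hexadecimal.

    Hypothetical helper for illustration only.
    """
    # Masking with 0xFFFFFFFF maps negative values to their
    # two's-complement 32-bit representation.
    return "0x%08X" % (exit_code & 0xFFFFFFFF)


if __name__ == "__main__":
    print(to_ntstatus_hex(-1073740940))  # prints 0xC0000374
```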

Hi James,

Thank you very much for your kind reply.

I use Fortran in my program. The array bounds error is not related to MPI. I assign different blocks of an array to different processes, and some processes then access the array out of range. This also explains why the program runs correctly with only one process.

Thanks,
Zhanghong Tang
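
The kind of block assignment described above can be sketched as follows. This is not the code from the post (which is Fortran and was not shown); it is a minimal Python illustration, with a hypothetical function block_range, of one common way to split n elements across nprocs processes so that no rank's range runs past the end of the array. It uses 0-based, half-open ranges; a Fortran version with 1-based arrays would shift the bounds accordingly.

```python
def block_range(n, nprocs, rank):
    """Return the half-open index range [start, stop) owned by `rank`.

    Hypothetical helper: splits n elements as evenly as possible,
    giving the first (n mod nprocs) ranks one extra element each.
    """
    if not 0 <= rank < nprocs:
        raise ValueError("rank must be in [0, nprocs)")
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    n = 1000  # assumed array length, for illustration only
    for rank in (0, 95):
        print(rank, block_range(n, 96, rank))
```

With a single process the only range is the whole array, which is why an indexing mistake in the per-rank bounds may show up only when running under mpiexec with several processes.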

Hi Zhanghong,

Are your arrays statically sized at compile time? If they are dynamic, it can be difficult, if not impossible, for the compiler to determine if there will be an out-of-bounds error.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

The arrays are dynamic, so it took me some time to track down the problem.

Thanks,
Zhanghong Tang
