Installing ICTCE on WCCS and jobs distribution problem


toastnmaker:

Hello,


I have tried installing Intel Cluster Toolkit Compiler Edition (ICTCE) on Windows Compute Cluster Server 2003. I downloaded the installer and ran it. The installation steps seemed straightforward; it found the correct number of nodes and asked whether I wanted to install to the nodes as well. However, even when I checked yes, it did not install anything onto the nodes. On the master head node everything was installed properly, and e.g. MPI works correctly on the single master node.


Thus, I needed to install the ICTCE manually on each node.


I ran smpd on each node. clusrun smpd.exe -status returns 0 for each node as it should.


When running


mpiexec -hosts 2 master 1 node 1 prog.exe


(compiled with the Intel compiler within Visual Studio 2005), mpiexec does not distribute the binary onto the nodes. I need to copy the binary manually and specify its location with the -gwdir parameter.


Do you know what I am doing wrong, i.e. why mpiexec does not spread the binary to all (or all specified) nodes? I also have to give the computer names explicitly, otherwise it runs only locally.


Thank you


Martin


PS: I am a Windows newbie, please keep it simple ;-)

grigoryzagorodnev:

Hi Martin,


Mpiexec is not supposed to copy a binary across the network. You need to copy it manually, use a shared network drive mapped on each system, or use a UNC path.



Also, to get the list of nodes from the job scheduler, you need to pass an extra option to mpiexec: -hosts %CCP_NODES%



The following sample job submission command line addresses both problems (assuming \\master\share exists and both 'master' and 'node' are single-processor systems):



> job submit /numprocessors:2 /askednodes:master,node mpiexec -hosts %CCP_NODES% \\master\share\prog.exe



Best regards,


- Grigory

toastnmaker:

Thanks a lot, Grigory, it helped me!


Sharing works fine now, but the submitted job fails (job view says 'Status: Failed'), while the same job run with plain mpiexec, without "job submit", is OK. Is there a log somewhere where I can find out why?


I suspected stdin and stdout could be causing the difficulties, so I added the arguments I googled up:


/stdin:\\master\share_dir\input.txt /stdout:\\master\share_dir\output.txt


but the job still fails after submitting. How do I deal with this?


Best regards


Martin

toastnmaker:

Update: there was an incompatibility between WCCS's job command and Intel's mpiexec.exe. When submitted with WCCS's mpiexec.exe, it runs. What are the compatibility issues, and what is the proper way to combine WCCS and ICTCE?


Martin

grigoryzagorodnev:

Hi Martin,


You need to specify the full path to the Intel mpiexec in the job submit command line; otherwise the MS mpiexec is picked up.



Best regards,


- Grigory

toastnmaker:

Uff, still doesn't work.


I_MPI_ROOT is set to c:\program files (x86)\intel\ictce\3.1\mpi\3.1


C:\cluster_share>job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %CCP_NODES% \\master\cluster_share\prog.exe
Job has been submitted. ID: 731.
C:\cluster_share>job view 731
Job ID : 731
Status : Failed
Name : CLUSTER\Administrator:Jan 15 2008 11:20AM
Submitted by : CLUSTER\Administrator
Number of processors : 2-2
Allocated nodes : MASTER
Submit time : 1/15/2008 11:20:48 AM
Start time : 1/15/2008 11:20:48 AM
End time : 1/15/2008 11:20:49 AM
Number of tasks : 1
Not submitted : 0
Queued : 0
Running : 0
Finished : 0
Failed : 1
Cancelled : 0


Surprisingly, the allocated nodes don't include NODE1. Do you have any idea why?


Is there a more detailed error log somewhere?


prog.exe was compiled with Intel C++ Compiler. mpiexec is obviously also from Intel.


Best regards,


Martin

grigoryzagorodnev:

Hi Martin,


In fact, as the CCP_NODES variable needs to be expanded at the moment of job execution, not at submission time, extra escaping is required.


"job submit ... mpiexec -hosts %%CCP_NODES%%" works.




> Is there a more detailed error log somewhere?


You need the "/stderr:\\master\cluster_share\outerr.txt" job scheduler option to collect error messages. Add "-genv I_MPI_DEBUG 3" to the mpiexec command line to see verbose MPI logs.




> Surprisingly, allocated nodes don't include NODE1


Is your system a dual-processor (dual-core) one? The job scheduler allocates all available processors on the first node before any other. Try requesting more processors.



Best regards,


- Grigory


toastnmaker:

Thank you very much for your assistance, Phil. I am slowly drifting toward the right solution...


Now the nodes are recognized, but it still throws an error:


Error: You must specify the number of hosts after the -hosts option.
Unable to parse the mpiexec command arguments.


Which suggests to me that there is still some mess with the CCP_NODES variable expansion, most probably. Are there some ", ', % characters or an escape sequence like \" missing? How do I nest them?


I ran:


C:\cluster_share>job submit /numprocessors:8 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\outpu.txt /stderr:\\master\cluster_share\err.txt "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %%CCP_NODES%% -genv I_MPI_DEBUG 3 \\minimaster\cluster_share\prog.exe


Best regards


Martin

toastnmaker:

Oops, I am sorry, of course not Phil, but Grigory. The Windows command line is driving me mad :)

grigoryzagorodnev:

Martin,


Let's do some sanity check here...


Please make sure CCP_NODES is undefined in the shell you call "job submit" from. Run "echo %CCP_NODES%" and check that it comes out unexpanded, i.e. as the string "%CCP_NODES%".


Then check whether it is properly defined in the context of the executed job: run "job submit ... echo %CCP_NODES%" and check outpu.txt.


Check the whole environment with "job submit ... set" to see if the CCP_* variables are there.



Use %CCP_NODES% when running from interactive cmd line.


Use %%CCP_NODES%% within *.bat command scripts.
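To make the % vs. %% distinction concrete, here is a small sketch of a wrapper script. It is an illustration only; the mpiexec path and the share \\master\cluster_share are assumptions carried over from earlier posts in this thread.

```bat
:: submit.bat -- hypothetical wrapper around job submission.
:: Inside a .bat file, cmd consumes one level of %-expansion itself,
:: so the variable must be doubled to survive until job execution.
job submit /numprocessors:2 ^
    "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %%CCP_NODES%% ^
    \\master\cluster_share\prog.exe

:: From an interactive prompt, by contrast, a single % is enough:
::   job submit ... mpiexec -hosts %CCP_NODES% ...
:: because the interactive shell passes %CCP_NODES% through literally
:: as long as the variable is not defined there (the sanity check above).
```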



Please let me know the results.


- Grigory

toastnmaker:

> Run "echo %CCP_NODES%" to see it unexpanded,


> i.e. as "%CCP_NODES%"



In the submission shell it is undefined.



C:\cluster_share>echo %CCP_NODES%
%CCP_NODES%



C:\cluster_share>set CCP_NODES
Environment variable CCP_NODES not defined



> Then check if it is properly defined in the context of


> executed job. Run "job submit ... echo %CCP_NODES%" and


> check outpu.txt.



It is ok.



C:\cluster_share>job submit /numprocessors:8 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt /stderr:\\master\cluster_share\err.txt echo %CCP_NODES%

C:\cluster_share>more output.txt
2 MASTER 4 NODE1 4



> Check whole environment "job submit ... set" to see if CCP_*


>variables are there.



CCP_NODES is defined properly:



C:\cluster_share>job submit /numprocessors:8 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt /stderr:\\master\cluster_share\err.txt set

...
CCP_NODES=2 MINIMASTER 4 MININODE1 4
...



C:\cluster_share>"%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts 2 MASTER 4 NODE1 4 \\master\cluster_share\prog.exe



works well and runs on both computers using all 4 processors on each of them, but apparently the same binary executed through job submit fails:



C:\cluster_share>job submit /numprocessors:8 /askednodes:minimaster,mininode1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt /stderr:\\master\cluster_share\err.txt "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts 2 MASTER 4 NODE1 4 \\minimaster\cluster_share\prog.exe

Status: Failed


\\master\cluster_share\output.txt and \\master\cluster_share\err.txt are created but remain empty this time.



It slowly drives me mad...



Best regards


Martin

toastnmaker:

To avoid any misinterpretation: at first I removed 'mini' from the computer names minimaster and mininode1, for simplicity in this forum. As you can see, I forgot to keep that notation in the last post.


I suspect there is a problem with the rights. mpiexec alone runs it on the nodes with no problem, but job submit, which then starts mpiexec, may cause some permission problem. However, there are no messages like "permission denied"...


PS: Is this kind of setup and running of MPI tasks simpler in the Linux version?

toastnmaker:

# C++ sequential helloworld with job submit (works):


C:\cluster_share>job submit /stderr:\\minimaster\cluster_share\err.txt /stdout:\\minimaster\cluster_share\out.txt "c:\program files (x86)\intel\ictce\3.1\mpi\3.1\em64t\bin\mpiexec.exe" -hosts 2 minimaster 4 mininode1 4 -gwdir \\minimaster\cluster_share helloworld.exe
out.txt contains
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!


err.txt is empty


# C++ parallel MPI helloworld with job submit (doesn't work):


C:\cluster_share>job submit /stderr:\\minimaster\cluster_share\err.txt /stdout:\\minimaster\cluster_share\out.txt "c:\program files (x86)\intel\ictce\3.1\mpi\3.1\em64t\bin\mpiexec.exe" -hosts 2 minimaster 4 mininode1 4 -gwdir \\minimaster\cluster_share mpi_hello.exe
out.txt is empty
err.txt is empty



# C++ parallel MPI mpi_hello without job submitting (works):


C:\cluster_share>"c:\program files (x86)\intel\ictce\3.1\mpi\3.1\em64t\bin\mpiexec.exe" -hosts 2 minimaster 4 mininode1 4 -gwdir \\minimaster\cluster_share mpi_hello.exe
Master: Hello world: rank 0 of 8 running on minimaster.minicluster.local
Waiting for comp 1 Hello world: rank 1 of 8 running on minimaster.minicluster.local
Waiting for comp 2 Hello world: rank 2 of 8 running on minimaster.minicluster.local
Waiting for comp 3 Hello world: rank 3 of 8 running on minimaster.minicluster.local
Waiting for comp 4 Hello world: rank 4 of 8 running on mininode1.minicluster.local
Waiting for comp 5 Hello world: rank 5 of 8 running on mininode1.minicluster.local
Waiting for comp 6 Hello world: rank 6 of 8 running on mininode1.minicluster.local
Waiting for comp 7 Hello world: rank 7 of 8 running on mininode1.minicluster.local



# sequential C++ helloworld without job submitting (works):


C:\cluster_share>"c:\program files (x86)\intel\ictce\3.1\mpi\3.1\em64t\bin\mpiexec.exe" -hosts 2 minimaster 4 mininode1 4 -gwdir \\minimaster\cluster_share helloworld.exe
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!
Helloworld!


Why? What is wrong?


Best regards


Martin

grigoryzagorodnev:

Hi Martin,


> I suspect that there is a problem with the rights


Yes, it looks that way. Let's try one extra thing:


Call "mpiexec -register -user n" to save an encrypted user name and password in the Windows registry. Use any integer value 'n' to identify a user slot index.


Then launch the MPI application with the registered credentials from slot 'n' using the "mpiexec ... -user n" option.


For example:


> mpiexec -register -user 1


> job submit ... %I_MPI_ROOT%\em64t\bin\mpiexec -user 1 ...



Please let me know if use of credentials slot helps.


Best regards,


- Grigory

toastnmaker:

Before I start to experiment with "mpiexec -register -user", I would like to mention that Process Monitor showed that prog.exe, when the parallel MPI routine is run with job submit, can't find impi.dll. It looks for it everywhere, even in "%I_MPI_ROOT%/em64t/bin", but not in "%I_MPI_ROOT%/em64t/lib". When I manually added "%I_MPI_ROOT%/em64t/lib" to the general PATH environment setting on each node, it started to work.


It is strange: 1) I would have bet that it is linked statically! 2) With pure mpiexec, without job submit, it works well. Why?


Does it point to bad rights, too?


Is the solution to start mpiexec from a batch file where ictvars.bat is run and the environment is set? Why are some parts set on the nodes (e.g. I_MPI_ROOT, or the path %I_MPI_ROOT%/em64t/bin) and some not (%I_MPI_ROOT%/em64t/lib)?


Martin

grigoryzagorodnev:

Hi Martin,


Thank you for your patience and good questions!


The default usage model for the 3.1 release is to call mpivars.bat from the command line shell prior to mpiexec, which updates PATH with em64t/lib. This model assumes the initial PATH variable contains em64t/bin only, and that is what the installation process guarantees.


Intel MPI 3.1 for Windows does not yet fully support the Win CCS job scheduler. Further releases will.


Nevertheless, I will try to find a better solution for this case.



> 1) I would bet that it is linked statically!


In fact it is linked dynamically. There is no static linking available at the moment.



> Does it point to bad rights, too?


No. This is a configuration issue.



BTW, please note that the file specified in the "/stdin:" job scheduler option must exist; otherwise the job will not run.


- Grigory


grigoryzagorodnev:

Hi Martin,


Here is a way to resolve the DLL search issue:


1. Run command line shell


2. Update the environment with the Intel MPI specifics
> call "C:\Program Files (x86)\Intel\MPI\3.1\em64t\bin\mpivars.bat"


3. Make critical variables cluster-wide
> cluscfg setenvs PATH="%PATH%"


3'. Check cluster-wide variables if necessary
> cluscfg listenvs



After that, each time a job executes, this PATH value will be promoted into the job environment, resolving the dynamic library search issue.


4. Call "job submit ... mpiexec ..." as usual
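The steps above can be collected into a single batch sketch. The mpivars.bat path and the share \\master\cluster_share are assumptions taken from earlier posts in this thread; adjust them for the actual installation.

```bat
:: setup_and_submit.bat -- sketch of steps 2-4 above (assumed paths).
:: 2. Pull in Intel MPI settings (adds em64t\lib to PATH, among others)
call "C:\Program Files (x86)\Intel\MPI\3.1\em64t\bin\mpivars.bat"

:: 3. Promote the updated PATH to every job the scheduler starts
cluscfg setenvs PATH="%PATH%"

:: 3'. Check the cluster-wide variables if necessary
cluscfg listenvs

:: 4. Submit as usual; %%CCP_NODES%% is doubled because this is a .bat
job submit /numprocessors:8 ^
    "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %%CCP_NODES%% ^
    \\master\cluster_share\prog.exe
```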



- Grigory


toastnmaker:

That solved it!


The problem lurked in the fact that the same entry appeared twice in PATH, which broke the cluscfg setenvs command. Now it works well.


Thanks a lot Grigory!


(now moving to the debugging issues ;-) )
