Remote Access Questions

 Do you have questions about accessing the remote Intel® Xeon Phi™ cluster, compiling your code or submitting a job to the cluster? Post your questions and get help here!

It looks like cluster.colfaxinternational.com redirects to moderncodechallenge.intel.com. Am I accessing the VM incorrectly or is there an issue there?

You need to log in to cluster.colfaxinternational.com using an SSH client and the credentials that were sent to you by email upon registration. Please let us know if you have further questions!

Iman

Can I get git onto the cluster to share code with my dev environment via bitbucket.org?

Sure, I have just installed git, it should be available from the command line.

Thanks. I can't access the Bitbucket server now, though.

$ ssh -T git@bitbucket.org

ssh: connect to host bitbucket.org port 22: Connection timed out

Is this a problem with cfxcluster?

Craig,

Please know we've received your message and are looking into the issue.

We will update you as soon as we have a resolution ... probably tomorrow.

Thanks very much for your participation in the Challenge.

Kind regards,

- Richard

Hi again :)

I tried to run your supplied code on huge.cdc and got an error (below). I need to run a baseline so I can get a feel for how much improvement is made. Please help!

=>> PBS: job killed: walltime 626 exceeded limit 600
JOB 528.cfxcluster EPILOGUE REPORT: completion of this job was not clean. Killing stray processes 15369 15373

Thanks
Craig

Hi Craig,

The input specified in huge.cdc takes a very long time to run with the serial version of the code, so it will time out on the cluster. It will only run to completion on the cluster with optimized code (we have a note in the README about that). You should use small.cdc while optimizing your code until it can run efficiently on the larger input.

Please feel free to post any other questions you may have.

Thanks,

Iman

Craig, regarding the connection timeout to bitbucket.org: we have opened the ports for this, so please try again.

Running the original code with small.cdc, I get this in the STDIN.e file:

JOB 689.cfxcluster PROLOGUE REPORT: could not connect to Intel Xeon Phi coprocessor(s) in job host
ssh: connect to host mic0 port 22: No route to host
ssh: connect to host mic0 port 22: No route to host
ssh: connect to host mic0 port 22: No route to host
JOB 689.cfxcluster EPILOGUE REPORT: Internal error 0  255, please contact system administrator

 

Strange. Anyway, rebooted and online now.

Nothing works now :( 

On make -- sh: ./BUILD-HOST-GEN: Permission denied

and corrupt files:

-rw-------. 1 mcdc0031 mcdc0031     85 Sep 21 12:47 STDIN.e691
-rw-------. 1 mcdc0031 mcdc0031     86 Sep 21 12:50 STDIN.e692
-rw-------. 1 mcdc0031 mcdc0031     69 Sep 21 12:51 STDIN.e693
-rw-------. 1 mcdc0031 mcdc0031      0 Sep 21 12:47 STDIN.o691
-rw-------. 1 mcdc0031 mcdc0031      0 Sep 21 12:50 STDIN.o692
-rw-------. 1 mcdc0031 mcdc0031      0 Sep 21 12:51 STDIN.o693
-rw-rw-r--. 1 mcdc0031 mcdc0031  10753 Sep 21 12:04 util.cpp
-rw-rw-r--. 1 mcdc0031 mcdc0031   2836 Sep 21 12:04 util.hpp
[mcdc0031@cfxcluster cell_clustering]$ cat STDIN.e693
bash: /home/mcdc0031/cell_clustering/cell_clustering: Is a directory
[mcdc0031@cfxcluster cell_clustering]$

 

 

 

I do not see cell_clustering in your home folder. A minute ago, when it was still there, I was able to run "make" without error messages. If you are still having issues, could you please post the exact sequence of commands that causes the problem?
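
For anyone who runs into the same "Is a directory" message: it means the path passed as the program to run is a directory rather than the compiled executable. Below is a minimal sketch of a check one could run before submitting a job. The function name `check_job_binary` is made up for illustration and is not part of the cluster tooling.

```shell
# Hypothetical helper (not part of the cluster tooling): verify that a path
# points at an executable regular file before submitting it as a job.
check_job_binary() {
    path="$1"
    if [ -d "$path" ]; then
        echo "directory"                    # a folder was given instead of the binary
        return 1
    elif [ -f "$path" ] && [ -x "$path" ]; then
        echo "ok"
        return 0
    else
        echo "missing or not executable"
        return 1
    fi
}

# Example: check_job_binary ~/cell_clustering/cell_clustering
```

If the check reports "directory", listing the folder should show where the executable produced by make actually lives.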

It's gone again :(

JOB 711.cfxcluster PROLOGUE REPORT: could not connect to Intel Xeon Phi coprocessor(s) in job host
ssh: connect to host mic0 port 22: No route to host
ssh: connect to host mic0 port 22: No route to host
ssh: connect to host mic0 port 22: No route to host
JOB 711.cfxcluster EPILOGUE REPORT: Internal error 0  255, please contact system administrator

 

That's not good. I have set the scheduler to allocate a different coprocessor card for the runs. Hopefully this resolves the issue. We are going to roll out new hardware for computation fairly soon, so if this is a hardware issue, it will be stabilized.

How many coprocessor cards can I use for the final test? One or more?

You can only use one.

Thanks,

Hi,

   Can we run the Intel Analyzer tools such as Advisor XE, Inspector XE,  VTune Amplifier XE on the remote server?

Thanks

Nuwan

Hi Nuwan,

Sorry, Advisor, Inspector and VTune are not supported on this cluster. Nothing prevents you from using them on your own system, though.

Andrey

Hello,

This looks like an interesting challenge. I've been playing with the code. There are a couple of questions I want to ask.

1) There seems to be a CPU time limit of 60s (`ulimit -t`) on the machine provided. This is fine for running simulation with `small.cdc` without parallelization. Since there are 64 cores available, any trivial parallelization will make the process exceed the CPU time limit and therefore get killed by the kernel, making it impossible to test parallel code on it.

2) The code provided does not work on OS X out of the box. It is due to the unavailability of `clock_gettime`. There are workarounds but it will require us to modify `util.hpp`.

This brings me to 3) How much modification of the code is allowed? For instance, there could be a lot more potential for improving locality by merging different subroutines, such as `diffusion` and `decay`, or `cellMove` and `produce`. But this would make the simulation run completely differently, and the timing setup would probably be different.

It seems that we are allowed to modify anything in `cell_clustering.cpp`. But I know the more we modify, the harder it is for referees to see the *correctness of the code* instead of just the output. It seems that the logical choice here is to keep the subroutine semantics the same, i.e. each step consists of produce, diffusion, decay and cell move.

In conclusion, I'll make three suggestions:

1) An improved version of the base code should be provided, with `main` and the timing functions separated from the main simulation code, thereby fixing the interface of the simulation. As a result, you can declare any modification of the resulting `cell_clustering.cpp` "legal" without making it difficult for contestants and referees.

2) Set `ulimit -t` to unlimited on cluster.colfaxinternational.com. This is crucial for testing parallel code on the machine given.

3) (Optional) Fix timing code to work on OS X

Best,

Wei

Wei,

regarding your question 1) and suggestion 2), the README file in your home folder addresses these issues directly. You should not run on the head node, all computational jobs must be submitted into the queue.

For questions related to programming, one of my colleagues will respond shortly.

Andrey
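
A rough illustration of why the head node's limit bites so quickly for parallel code: `ulimit -t` caps CPU time summed over all threads of a process, so N fully busy threads use up the budget about N times faster than a serial run. The sketch below is only arithmetic. The 60 s and 64-thread figures come from Wei's post, and the function name is hypothetical.

```shell
# Hypothetical helper: approximate wall-clock seconds before a process with
# <threads> fully busy threads exhausts a CPU-time limit of <limit> seconds.
wall_seconds_until_limit() {
    limit="$1"
    threads="$2"
    awk -v l="$limit" -v t="$threads" 'BEGIN { printf "%.2f\n", l / t }'
}

# Show the CPU-time limit of the current shell ("unlimited" or seconds).
ulimit -t

# 60 s of CPU time shared by 64 busy threads lasts under one second.
wall_seconds_until_limit 60 64   # prints 0.94
```

This is one more reason to submit computational jobs to the queue rather than run them on the head node.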

Hi Andrey,

Thank you for the quick reply. Sorry, I didn't realize the machine is only the head node of the cluster; I should have looked closer at the README.

Wei

Hi Wei,

To respond to your last question: You can only modify the implementation of each function, but the framework of the simulation should be kept the same. The reason is that if developers change the functions and the structure of the implementation, it will not be feasible for us to verify the correctness of submissions in a systematic way.

As for supporting OS X and other platforms: you are right, you'll need to tweak the code to run on OS X. However, we are not providing different versions of the code for different platforms. It's up to the developers to make these tweaks based on their platforms. We are making sure, though, that the current code works on the provided cluster.

Hope we answered your questions. Please feel free to post any other questions you may have.

Thanks,

Iman

Hi Iman,

Thank you for the clarification. I do have a further question.

> You can only modify the implementation of each functions but the framework of the simulation should be kept the same.

Sorry, I think I need more clarification on this: I still don't really understand the extent of the changes we are allowed to make. Can we move code between functions? For instance, moving all of the `produceSubstance` code into `runDiffusionStep`, which then allows for further optimization? Or moving the two boundary condition checks to cellMove and runDiffusionCluster, which seems more semantically appropriate?

Thanks.

Best,

Wei

Hi Wei,

You should not move code between functions. We implemented a separate timing report for each function, so if you merge two functions together, for example, we cannot compare their performance to other submissions.

Hope that answers your question.

Thanks,

Iman

Hi Iman,

I understand now. No modifying the semantics of the functions. Thanks!

Wei

Does the STDIN/STDOUT have to be the same (in structure, if not in absolute values)? i.e. can we optimise away output to these streams?

Hi Craig,

Yes, please preserve the structure of the output, as we are using it to evaluate submissions. Your output needs to have the same structure/format as that generated by the provided code.

Thanks,

Iman

How do I check the status of the build and test running on the remote cluster?

How do I check the test run time on the remote cluster?

I got it:
- Upload the code through the portal;
- Access the Colfax cluster using PuTTY on Windows 10;
- Unzip the files into the "cell_clustering" folder;
- Compile the "cell_clustering" code (I used it without changes);
- Submit the job to the Colfax cluster;
- Check the current state of the job in the queue (Q or R);
- Read the STDIN.e??? and STDIN.o??? files.

[mcdc0396@cfxcluster cell_clustering]$ ls
BUILD-HOST-GEN   cell_clustering.cpp  LICENSE.md  README.md  STDIN.e2333  STDIN.e2335  STDIN.o2334  util.cpp
cell_clustering  huge.cdc             Makefile    small.cdc  STDIN.e2334  STDIN.o2333  STDIN.o2335  util.hpp

 

Uákiti,

Yes, you should check STDIN.exxxx for the runtime. Please let me know if you have further questions.

Thanks,

Iman
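
In case it helps others, here is a small sketch for picking out the error-stream file of the most recent job in a directory. It assumes only the STDIN.e<job number> naming visible in the directory listing earlier in this thread; `latest_job_log` is a made-up name, not a cluster command.

```shell
# Hypothetical helper: print the name of the STDIN.e<job number> file with
# the highest job number in the given directory (default: current directory).
latest_job_log() {
    dir="${1:-.}"
    # Extract the numeric job ids, sort them numerically, keep the largest.
    latest=$(ls "$dir" | sed -n 's/^STDIN\.e\([0-9][0-9]*\)$/\1/p' | sort -n | tail -n 1)
    [ -n "$latest" ] && echo "STDIN.e$latest"
}

# Example: cat "$(latest_job_log ~/cell_clustering)"
```

The numeric sort matters because a plain alphabetical listing would place job 999 after job 2335.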

Hi Iman,

How and when is the cell_clustering.zip file on the Colfax cluster updated?

I resubmitted my code through the portal (yesterday and today), but the copy on the server has not been updated.

Hi Uákiti,

The optimized code needs only to be submitted on the portal. The cluster is used for your own testing of the code but the copy on the cluster won't be updated. Hope that answers your question.

Thanks,

Iman

Hi, can we connect to the servers using "nomachine"? In the training videos this is presented as the recommended way to use a graphical user interface, but I'm unable to establish the connection.

Thanks :-)

Hi, I've just read this answer, and now all the work I've been doing over the last few days is useless :-(

Quote:

Iman Saleh (Intel) wrote:

Hi Wei,

To respond to your last question: You can only modify the implementation of each function, but the framework of the simulation should be kept the same. The reason is that if developers change the functions and the structure of the implementation, it will not be feasible for us to verify the correctness of submissions in a systematic way.

As for supporting OS X and other platforms: you are right, you'll need to tweak the code to run on OS X. However, we are not providing different versions of the code for different platforms. It's up to the developers to make these tweaks based on their platforms. We are making sure, though, that the current code works on the provided cluster.

Hope we answered your questions. Please feel free to post any other questions you may have.

Thanks,

Iman

Is this really a requirement? I can't even believe it. The level of optimization one can achieve while constrained by the function frames is just ridiculously low. I think this should be clearly stated in the presentation of the challenge, and not hidden in a forum. I think all the fun of optimizing the code is lost if we can't play with this. Now it's only about adding some pragmas and hoping for the best! :-(

Quote:

Pablo G. wrote:

Hi, can we connect to the servers using "nomachine" ?. In the training videos this is presented as the recommended way to use a graphical user interface, but I'm unable to establish the connection !.

Thanks :-)

Hi Pablo,

We recommend the x2go remote desktop client, available for download at http://wiki.x2go.org/doku.php. nomachine is not compatible with the OS version on the cluster.

This is specified here: https://moderncodechallenge.intel.com/local-machine-requirements/

Thanks,

Iman

 

Hi Pablo,

The optimized code should generate the same output as the serial version. This was specified in the communication you got about the contest. You are free to change the code as long as you generate the same output parameters. These parameters are used to assess the correctness and performance of code submissions relative to each other. I'm sure you can understand the need to set conditions on optimizations so that different submissions can be compared fairly.

Please feel free to post any other questions you may have.

Thanks,

Iman

 

Hi, I want to run VTune in the server. I've been able to log in via x2go (thanks :-) ).

However, when I run VTune, it says that it cannot detect a target system to analyse. Also, when I search for the kernel module on the host, I can't find anything that contains sep ( "lsmod | grep sep" returns nothing :-( ).

Maybe the environment is not configured properly to use VTune? I've sourced amplxe-vars.sh.

Thanks for the help :-)

Hi Pablo,

Glad you can connect!

Unfortunately, we don't support VTune on Xeon Phi in our cluster setup. 

Thanks,

Iman

Hi,

Is this normal? I just ran my code on the remote MIC server.

$ qmic ~/cell_clustering/cell_clustering ~/cell_clustering/small.cdc

JOB 5409.cfxcluster PROLOGUE REPORT: could not connect to Intel Xeon Phi coprocessor(s) in job host

Permission denied, please try again.

Permission denied, please try again.

Permission denied (publickey,password,keyboard-interactive).

 

Quote:

Iman Saleh (Intel) wrote:

To respond to your last question: You can only modify the implementation of each functions but the framework of the simulation should be kept the same. The reason is that if developers change the functions and the structure of the implementation, it will not be feasible for us to verify the correctness of submissions in a systematic way.

 

Sorry, but I don't think this is completely clear yet. Here, you clearly say that we have to preserve the call structure and functionality of every function. In the introduction video it sounds more like you can move your code around as you like as long as you preserve the output, i.e. the two phases. Can you please clearly state whether we can remove/merge functions within the two phases of the code?

Thanks a lot for your help.

Best regards,

Simon

Quote:

Petrie W. wrote:

Hi,

Is it normal? I just run my code in the MIC remote server.

$ qmic ~/cell_clustering/cell_clustering ~/cell_clustering/small.cdc

JOB 5409.cfxcluster PROLOGUE REPORT: could not connect to Intel Xeon Phi coprocessor(s) in job host

Permission denied, please try again.

Permission denied, please try again.

Permission denied (publickey,password,keyboard-interactive).

 

 

On my last test, the cluster behaved normally. Are you getting this consistently, or is it a one-time issue?

Hi Simon,

The output is not only at the level of the two phases; the code also outputs some parameters within each phase. Check your stdout and stderr outputs: as long as you preserve that same output, you can make changes.

Hope that clarifies it, please let me know if you have further questions.

Thanks,

Iman

Still doesn't work. My username is mcdc1417. I have generated a new pair of RSA keys on the remote server. Could that be related to this issue? Or is it possible to reset my account?

Petrie

Yes, if you overwrite the original keys, you will not be able to connect. There is an easy fix, though: adding the contents of id_rsa.pub to authorized_keys. I did this for you, so you should be able to use the cluster now.
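
For anyone who ends up in the same situation, the fix described here amounts to appending the public key to authorized_keys. A minimal sketch, assuming the default ~/.ssh layout and an RSA key named id_rsa.pub (the helper name is invented for illustration):

```shell
# Hypothetical helper: append <ssh_dir>/id_rsa.pub to <ssh_dir>/authorized_keys
# unless that exact key line is already present.
authorize_own_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    pub="$ssh_dir/id_rsa.pub"
    auth="$ssh_dir/authorized_keys"
    [ -f "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    touch "$auth"
    # -F: fixed strings, -x: whole-line match, -q: quiet, -f: patterns from file
    if ! grep -qxF -f "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 600 "$auth"   # sshd typically rejects key files with loose permissions
}

# Example: authorize_own_key            (uses ~/.ssh)
```

Running it twice is harmless, since the key is only appended when it is not already listed.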

When I run code that depends on the Intel Math Kernel Library on the cluster, I get the error:

error while loading shared libraries: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory.

The code with libraries worked perfectly until one hour ago.

Please fix this.

Thanks :-)

Fixed. Thank you for reporting this!
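
As a side note for similar "cannot open shared object file" errors: before reporting one, it can help to list which shared libraries the dynamic loader fails to resolve. The sketch below uses the standard ldd tool and assumes the binary was built for the machine where ldd is run. The helper name is hypothetical, and it only diagnoses; it cannot restore a library that is missing on the cluster side.

```shell
# Hypothetical helper: print the shared libraries that the dynamic loader
# cannot resolve for the given binary (empty output means none are missing).
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found"
}

# Example: missing_libs ~/cell_clustering/cell_clustering
# A "not found" entry for libmkl_intel_lp64.so would point at the library
# search path (LD_LIBRARY_PATH) or at the installation itself.
```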

Hi, sometimes my jobs end prematurely without producing STDIN.e* or STDIN.o*.

Is this a fault in the cluster or in my code? Any guess as to what the cause could be, or a way to debug it?

Thanks :-)

Hi, the server is broken again :-(.

Can anyone fix this, or teach us how to do it? Some of us can only work on this during weekends :-(

Regards.

I am sorry about that. I don't know why the Xeon Phi OS keeps losing the NFS mount from the host; possibly somebody's jobs are running out of memory. But I implemented a fix that should prevent this from happening in the future. The server should work now, and I will keep an eye on this issue.

Regarding jobs disappearing without a trace, if you notice it again, could you please post or send me via private message the job number?

Another example is 13876. It executed for a few minutes and disappeared. What happened to that job?

Thanks
