Problem operating the cluster

I am having trouble operating the cluster. It would help if someone could show me how to train my CNN (train.py) from this GitHub repository, so that I have a working example: https://github.com/prajjwal1/convolutional-neural-networks/tree/master/i...

The repository contains a train.py file that trains the RNN model. Could someone post exact step-by-step instructions on how to use the cluster, train my RNN model, and obtain the model.h5 file that contains the model's weights?

Hi Prajj - just letting you know that I can now see your forum post. Our team will get back to you soon.

Hi Prajjwal,
    Could you please give details on the server/cluster you are running on, the commands you executed, and the error you are getting? This would help us to reproduce the problem and provide help.

Thanks
Ravi Keron

Thanks for replying,

I am running on the u4511 cluster. Could you tell me the exact instructions (the commands to execute over SSH) for training my RNN model (train.py in the repo) on the cluster and obtaining the model.h5 file, which contains the weights of the neural network? Can you provide the steps to train a neural network using the cluster? How do I start from the beginning? That would be really helpful.

Dear Prajjwal,
     Create a qsub_rn.sh file (you can name it anything; qsub_rn.sh is just an example) with the code below to run train.py on a compute node:
#!/bin/bash
cd $PBS_O_WORKDIR
python train.py

Once you have created the file above, submit it with the following command:
qsub -e ./error.txt -o ./output.txt -V qsub_rn.sh

To monitor the job's status, use the command below:
qstat

Regards
Ravi Keron
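
For readers who prefer to script this, the job file described above can also be generated from Python. This is only a sketch under stated assumptions: the helper names `render_job_script` and `write_job_script` and the optional `env_name` parameter are hypothetical, and the `source activate` line assumes a conda environment exists and that conda is on the job's PATH. If train.py saves model.h5 with a relative path (an assumption about the script), the file should end up in the directory the job was submitted from, since the job changes to $PBS_O_WORKDIR first.

```python
from pathlib import Path
from typing import Optional


def render_job_script(script: str = "train.py", env_name: Optional[str] = None) -> str:
    """Return the text of a PBS job script that runs `script` on a compute node."""
    lines = ["#!/bin/bash"]
    if env_name is not None:
        # Assumption: a conda environment with this name already exists.
        lines.append(f"source activate {env_name}")
    # PBS sets PBS_O_WORKDIR to the directory where qsub was invoked.
    lines.append("cd $PBS_O_WORKDIR")
    lines.append(f"python {script}")
    return "\n".join(lines) + "\n"


def write_job_script(path: str = "qsub_rn.sh", **kwargs) -> Path:
    """Write the job script to disk and return its path."""
    target = Path(path)
    target.write_text(render_job_script(**kwargs))
    return target


if __name__ == "__main__":
    # Prints the same three-line script shown in the reply above.
    print(render_job_script(), end="")
```

The resulting file would then be submitted with the same qsub command given in the reply.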

How do I obtain the model.h5 file? After training a neural net, how do I get the file that contains the weights? In this case it is model.h5.

I also wanted to update a few libraries; Keras, for example, has been updated to Keras 2.0. How do I update it? Libraries like TensorFlow and Keras are updated frequently, and I can't update them without root privileges.

The error I am getting is due to the old library versions that are installed. How do I upgrade these libraries?

I ran

```
pip3 install keras --upgrade
pip3 install tensorflow --upgrade
```

but that fails because I don't have root privileges. How do I update them?

Hi Prajjwal,
For any upgrades on the Colfax cluster, please go to https://colfaxresearch.com/discussion/
and start a discussion thread listing all the required upgrades. The admin team will help.

Regards
Ravi Keron
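
A route that sometimes works without root, though nothing in this thread confirms it for this cluster, is a per-user install using pip's `--user` flag. The sketch below only builds the command line; `user_install_command` is a hypothetical helper name, and whether the upgraded packages are compatible with the cluster's Python is an open assumption.

```python
import site
import sys
from typing import List


def user_install_command(package: str, upgrade: bool = True) -> List[str]:
    """Build a pip command that installs into the per-user site-packages."""
    cmd = [sys.executable, "-m", "pip", "install", "--user"]
    if upgrade:
        cmd.append("--upgrade")
    cmd.append(package)
    return cmd


if __name__ == "__main__":
    # Directory where --user installs land for this interpreter.
    print(site.getusersitepackages())
    for name in ("keras", "tensorflow"):
        print(" ".join(user_install_command(name)))
        # To actually run it:
        # subprocess.run(user_install_command(name), check=True)
```

Using `sys.executable -m pip` rather than a bare `pip3` ties the install to the interpreter that is actually running, which matters when several Python installations are present.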

Hi Prajjwal,
Were you able to get the required upgrades installed?

Regards
Ravi Keron

Not at all. I mostly work in Jupyter Notebook, as it gives an interactive session and works really well for deep learning. I have certain dependencies that I want Jupyter Notebook to load from a virtual environment. I am able to use Jupyter Notebook through SSH tunneling, but not from within the virtual environment, so my dependencies are not getting loaded. I have written about my problem on the Colfax cluster Facebook page as well.

I have asked for sudo rights on the cluster so that I can install pip packages. Could you please grant me sudo rights? It would help me greatly, as I currently have neither the hardware nor any support to train my neural networks. I will use the sudo rights only to install and update certain pip packages. I have been an avid Linux user for four years and I am well aware of the UNIX environment. I won't do anything else with root rights, and if anything goes wrong you can revoke the access at any time.

The Jupyter notebook at https://c001.colfaxresearch.com/hub/login would serve my purpose if I had root rights to install and update pip packages. My TensorFlow code doesn't work because the version currently installed on the cluster is 0.13, while the latest release is now 1.3.
Regards

Hi Prajjwal,
Please try the following. It should let you install the latest packages from within Jupyter by running "!pip install packagename".
Once you log into the Colfax cluster, run the following to build your own virtual environment:
1. ssh colfax
2. conda create --name test_env jupyter
3. source activate test_env --- test_env is only an example; use whatever environment name you chose in step 2
4. conda install numpy pandas opencv scikit-learn matplotlib tensorflow keras jupyter

Generate and edit the config file:
To generate it, run the following:
jupyter notebook --generate-config

This creates a config file at ~/.jupyter/jupyter_notebook_config.py

Edit the config file as below:
vi ~/.jupyter/jupyter_notebook_config.py
....
# c.NotebookApp.port = 8888
c.NotebookApp.port = 8892 (choose any port you like, but it should be unique)

In PuTTY, for the colfax session above, set up the tunneling:
Connection -> SSH -> Tunnels
Source port - 8976
Destination - localhost:8892
Select the Local and Auto radio buttons
Click Add

Access Jupyter Notebook on a Colfax cluster compute node from your local machine's browser

1. Run Jupyter on a compute node
a. echo jupyter notebook | qsub

2. Check which compute node the job was started on
a. [u4336@c001 ~]$ qstat -f | grep exec_host
You will see output like
exec_host = c001-n036/0 (the output depends on which node your job is on)

3. Please refer to this link for the commands: https://colfaxresearch.com/discussion/topic/connecting-jupyter-notebook-...
Run the plink command from the command prompt on a Windows system:
plink -ssh -L 8976:localhost:8892 colfax ssh -L 8892:localhost:8892 c001-n036

c001-n036 - replace this with the node your job is running on

Regards
Ravi Keron
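
The last two steps above (reading the node name from qstat and assembling the two-hop tunnel) can be sketched in Python as follows. The function names `parse_exec_host` and `tunnel_command` are hypothetical, and the parsing assumes the `exec_host = <node>/<slot>` format shown in the reply; other PBS configurations may print it differently.

```python
import re


def parse_exec_host(qstat_output: str) -> str:
    """Extract the compute node name from `qstat -f` output."""
    match = re.search(r"exec_host\s*=\s*([^/\s]+)", qstat_output)
    if match is None:
        raise ValueError("no exec_host line found; is the job running yet?")
    return match.group(1)


def tunnel_command(local_port: int, jupyter_port: int, node: str,
                   login_host: str = "colfax", client: str = "ssh") -> str:
    """Build the two-hop forwarding command: laptop -> login node -> compute node."""
    return (
        f"{client} -L {local_port}:localhost:{jupyter_port} {login_host} "
        f"ssh -L {jupyter_port}:localhost:{jupyter_port} {node}"
    )


if __name__ == "__main__":
    node = parse_exec_host("    exec_host = c001-n036/0")
    # On Windows with PuTTY's plink, as in the reply above:
    print(tunnel_command(8976, 8892, node, client="plink -ssh"))
    # On Linux or macOS with plain ssh:
    print(tunnel_command(8976, 8892, node))
```

Here `jupyter_port` has to match the `c.NotebookApp.port` value set in the Jupyter config file, while `local_port` is the port opened in the local browser.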

I have tried all these methods. I use Ubuntu, and for SSH tunneling I use

ssh -L 8896:localhost:8896 colfax ssh -L 8896:localhost:8896 c001-n029

But this doesn't start Jupyter Notebook from the virtual environment. Jupyter Notebook loads all its dependencies from the root environment. I want to start Jupyter Notebook from the virtual environment.

Here's the problem I am facing.

When I do all this:

1. ssh colfax
2. conda create --name test_env jupyter
3. source activate test_env --- Here you can give any name of your choice for the environment
4. conda install numpy pandas opencv scikit-learn matplotlib tensorflow keras jupyter

jupyter notebook --generate-config
vi ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.port = 8892
echo jupyter notebook | qsub
qstat -f | grep exec_host

For SSH tunneling I run this:

ssh -L 8896:localhost:8896 colfax ssh -L 8896:localhost:8896 c001-n029

I have attached a screenshot that should help you visualize this.

When I run all these commands (on Ubuntu 16.04), Jupyter Notebook starts the server from the root environment and not from the virtual environment, so it loads all its dependencies from the root environment.

To demonstrate this, I have taken the example shown in the screenshot. The version of TensorFlow installed in the cluster's root environment is 0.12, and the version installed in the virtual environment is 1.12. When I run Jupyter Notebook and check which version of TensorFlow is being used, it displays 0.12, which clearly indicates that it is loading dependencies from the root environment. So no matter what dependencies I install in the conda environment, Jupyter Notebook will load dependencies from the root environment.

There are two ways to solve the problem:

1) Please tell me how to run Jupyter Notebook from the conda environment on the cluster, i.e. Jupyter Notebook should load its dependencies from the virtual environment.

2) Getting root access, which I won't get. So it all comes down to the first one.

I hope this explanation helps you understand my problem.

Regards

Prajjwal

Hi Prajjwal,

Please run the commands below before invoking Jupyter Notebook. This should map the virtual environment to Jupyter.

pip install ipykernel
python -m ipykernel install --user --name=intelfull3

Regards
Ravi Keron
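
To see what the `ipykernel install` command registered, one option is to read the kernel specs it writes and look at the interpreter each one launches. The sketch below assumes the usual per-user location on Linux (`~/.local/share/jupyter/kernels`); `kernel_interpreters` is a hypothetical helper name. As far as I understand, the interpreter recorded is the one that ran the install command, so the environment should be activated first.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def kernel_interpreters(kernels_dir: Optional[Path] = None) -> Dict[str, str]:
    """Map each registered kernel name to the interpreter its kernel.json launches."""
    if kernels_dir is None:
        # Assumed default location for `--user` kernel installs on Linux.
        kernels_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    found: Dict[str, str] = {}
    if not kernels_dir.is_dir():
        return found
    for spec in sorted(kernels_dir.glob("*/kernel.json")):
        argv = json.loads(spec.read_text()).get("argv", [])
        # argv[0] is the Python executable the kernel is started with.
        found[spec.parent.name] = argv[0] if argv else ""
    return found


if __name__ == "__main__":
    for name, python_path in kernel_interpreters().items():
        print(f"{name}: {python_path}")
```

A kernel whose interpreter path sits under the conda environment's directory should load that environment's packages; one pointing at the root installation will not.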

 

I ran the commands you gave, before running qsub; I have attached a screenshot. When I started Jupyter Notebook and checked the TensorFlow version, it printed 0.12, which indicates that it is still loading dependencies from the root environment. The version of TensorFlow installed in my conda environment is 1.12, so ideally it should have printed 1.12. This method isn't working. Could you suggest another way?

Hi Prajjwal,

Please follow the steps below to start Jupyter Notebook in your own environment and create a notebook there.

1. Activate the conda environment

[u4336@c001 ~]$ conda info -e
# conda environments:
#
my_root                  /home/u4336/.conda/envs/my_root
test_env                 /home/u4336/.conda/envs/test_env
root                  *  /opt/intel/intelpython35

[u4336@c001 ~]$ source activate test_env
(test_env) [u4336@c001 ~]$ conda info -e
# conda environments:
#
my_root                  /home/u4336/.conda/envs/my_root
test_env              *  /home/u4336/.conda/envs/test_env
root                     /opt/intel/intelpython35

(test_env) [u4336@c001 ~]$

 

2. Add the conda environment as a kernel so that it can be selected from Jupyter

pip install ipykernel
python -m ipykernel install --user --name=test_env (In your setup the conda environment is named ‘nn’, so use ‘nn’.)

3. Run Jupyter Notebook on a compute node in the conda environment

a. echo jupyter notebook | qsub    Returns <jobID>

Check which compute node the job was started on:

b. [u4336@c001 ~]$ qstat -f <jobID> | grep exec_host
exec_host = c001-n036/0

 

4. Run the plink command as below

Local browser port: X. Port configured in Jupyter: Y. The same ports should be used in the tunneling command.

a. plink -ssh -L X:localhost:Y colfax ssh -L Y:localhost:Y c001-n029 (use the compute node name from the qstat command)

5. You should see the conda environment name <test_env> (in your setup, ‘nn’) in the browser when you create a new notebook.

6. Create a new notebook with the conda environment kernel and verify the TensorFlow version. In the screenshots you posted I see “Python3”, which is the default environment; that is why you are not seeing the latest versions.
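
For the verification in step 6, a notebook cell along these lines shows which interpreter the selected kernel is using and which copy of a package it imports. This is a minimal sketch; `describe_kernel` is a hypothetical helper name, and it relies only on the standard library.

```python
import importlib
import sys
from typing import Dict, Optional


def describe_kernel(module_name: Optional[str] = None) -> Dict[str, str]:
    """Report the running interpreter and, optionally, a module's version and path."""
    info = {"executable": sys.executable, "prefix": sys.prefix}
    if module_name is not None:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            info["version"] = "not importable from this kernel"
        else:
            info["version"] = str(getattr(module, "__version__", "unknown"))
            info["location"] = str(getattr(module, "__file__", "unknown"))
    return info


if __name__ == "__main__":
    # In a notebook, call describe_kernel("tensorflow") in a cell instead.
    for key, value in describe_kernel("json").items():
        print(f"{key}: {value}")
```

If `executable` points into the conda environment's directory, the notebook is using that environment's packages; if it points at the root installation, the default kernel was selected instead.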

 

Attachments:
7428661.png (image/png, 56.33 KB)
742866-2.png (image/png, 49.77 KB)

The performance I am getting on the cluster is very poor. My laptop (Core i5, no GPU) computes one epoch in 5 seconds, while on the cluster it takes around 5000 seconds. How do I fix this?

If I have access to an Intel Xeon Phi, it should have significantly more computational power, right? So why is the ETA so large, and how do I reduce it to 5 seconds or less?

Hi Prajjwal,

Can you please confirm the points below, following the steps given yesterday?

1. Jupyter Notebook is running on the compute node. Log in to the compute node and check for the jupyter process.

[u4336@c001 ~]$ qstat -f 23358.c001 | grep exec_host (use job id from your setup instead of 23358.c001)
    exec_host = c001-n029/0
[u4336@c001 ~]$ ssh c001-n029
Last login: Mon Aug 28 05:24:38 2017 from c001
[u4336@c001-n029 ~]$ ps -eaf | grep jupyter
u4336    247110 247109  0 Sep17 ?        00:01:13 /opt/intel/intelpython35/bin/python /opt/intel/intelpython35/bin/jupyter-notebook
u4336    247164 247110  0 Sep17 ?        00:00:24 /home/u4336/.conda/envs/test_env/bin/python -m ipykernel -f /home/u4336/.local/share/jupyter/runtime/kernel-089c940a-04b2-4e43-be1d-03f966b9779d.json
u4336    247316 247110  0 Sep17 ?        00:00:24 /home/u4336/.conda/envs/test_env/bin/python -m ipykernel -f /home/u4336/.local/share/jupyter/runtime/kernel-c9c7e3ec-3cc2-42e1-ac84-51a12e0beaa7.json
u4336    247888 247110  0 01:49 ?        00:00:13 /home/u4336/.conda/envs/test_env/bin/python -m ipykernel -f /home/u4336/.local/share/jupyter/runtime/kernel-fbb6630f-09a8-4bce-a9ca-5d083033dfea.json
u4336    253035 252984  0 22:06 pts/0    00:00:00 grep --color=auto jupyter
[u4336@c001-n029 ~]$

 

2. The notebook is started in your own conda virtual environment instead of root. Please share a screenshot like the ones I have shared.

3. Are you able to see the latest TensorFlow through Jupyter Notebook?

Regards,

Rajeswari Ponnuru.


This is my output for the following commands

 

[u4511@c001 ~]$ source activate nn
(nn) [u4511@c001 ~]$ qstat
(nn) [u4511@c001 ~]$ echo jupyter notebook | qsub
23402.c001
(nn) [u4511@c001 ~]$ qstat -f 23402.c001 | grep exec_host
    exec_host = c001-n031/0
(nn) [u4511@c001 ~]$ ssh c001-n031
[u4511@c001-n031 ~]$ ps -eaf | grep jupyter
u4511    245144 245143  8 08:46 ?        00:00:06 /opt/intel/intelpython35/bin/python /opt/intel/intelpython35/bin/jupyter-notebook
u4511    245207 245154  0 08:48 pts/0    00:00:00 grep --color=auto jupyter

Now how do I start a Jupyter notebook that runs on the compute node and uses the virtual environment?
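
One way to read `ps -eaf | grep jupyter` output like the above is to separate the notebook server from the kernels it has launched. In the earlier reply's output the server runs from /opt/intel/intelpython35 while the ipykernel processes run from the test_env interpreter, which suggests that the kernel, not the server, determines which packages a notebook sees. The sketch below is illustrative only: `classify_jupyter_processes` is a hypothetical name, and it assumes the eight-column `ps -ef` layout shown in this thread.

```python
from typing import Dict, List


def classify_jupyter_processes(ps_output: str) -> Dict[str, List[str]]:
    """Split `ps -eaf | grep jupyter` lines into server and kernel interpreters."""
    result: Dict[str, List[str]] = {"server": [], "kernels": []}
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8:
            continue
        command = fields[7:]
        if command[0] == "grep":
            continue  # the grep process itself
        if "ipykernel" in command:
            result["kernels"].append(command[0])
        elif any(part.endswith("jupyter-notebook") for part in command):
            result["server"].append(command[0])
    return result


if __name__ == "__main__":
    sample = (
        "u4511    245144 245143  8 08:46 ?        00:00:06 "
        "/opt/intel/intelpython35/bin/python /opt/intel/intelpython35/bin/jupyter-notebook\n"
        "u4511    245207 245154  0 08:48 pts/0    00:00:00 grep --color=auto jupyter\n"
    )
    print(classify_jupyter_processes(sample))
```

For the output posted above, this reports one server and no kernels, which is what one would expect before any notebook has been opened in the browser.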

 

Hi Prajjwal,

If you open Jupyter in your local browser, whatever you execute will run on the compute node, because Jupyter was started on the compute node and you are tunneling to it with the command below. Access Jupyter Notebook the way you did earlier and confirm points 5 and 6 with screenshots. I attached my own screenshots to my previous reply in this thread; please download those attachments for reference.

4. Run the plink command as below (plink is for Windows; on Ubuntu it is not needed)

Local browser port: X. Port configured in Jupyter: Y. The same ports should be used in the tunneling command.

a. plink -ssh -L X:localhost:Y colfax ssh -L Y:localhost:Y c001-n029 (use the compute node name from the qstat command)

5. You should see the conda environment name <test_env> (in your setup, ‘nn’) in the browser when you create a new notebook.

6. Create a new notebook with the conda environment kernel and verify the TensorFlow version. In the screenshots you posted I see “Python3”, which is the default environment; that is why you are not seeing the latest versions.

Thanks,

Rajeswari Ponnuru

I've been able to see the latest version of TensorFlow using the ipykernel method you provided. It worked.
Could you give me the exact commands, from the beginning, for running Jupyter Notebook from a virtual environment on a compute node, for Ubuntu?

By default, Jupyter Notebook does not start that way. I linked my ipykernel to my virtual environment, logged in to the compute node over SSH, and made sure that TensorFlow was installed, but I still get an import error.

Please provide a solution, from the beginning, for running Jupyter Notebook from a virtual environment on a compute node, for Ubuntu. That would be really helpful for everyone. I will then share it on the Facebook group, where everyone can get it.

I was able to run Jupyter Notebook on the cluster via the virtual env.

Hi Prajjwal,

I am closing this thread as you are now able to run Jupyter Notebook on the compute node.

For the performance issue, please start a new thread.

Regards,

Rajeswari Ponnuru.

 
