Optimization for fully convolutional neural network

TCE Open Date: Tuesday, December 17, 2019 - 00:11


I am currently working on a Keras reimplementation of the Jasper speech-to-text network from NYU and NVIDIA labs. I am working from the information in their arXiv paper to reconstruct the network as faithfully as possible. I am using an Intel distribution of TensorFlow 1.14 on DevCloud with one GPU node to train the model and an Intel NUC for inference.

However, I have run into a large hurdle. When I tried training the smallest model (19 layers with several residual connections) on the smallest training set (~100 hours of speech), the estimated time per epoch was 40-45 hours. Given the maximum wall time of 24 hours on the DevCloud, training this network as is does not appear to be feasible. At this point I am unaware of which areas I could optimize to bring the training time down to something more manageable. Is this a situation where I should just throw more GPUs at it? If I were to upgrade to a more current version of TensorFlow, how much of a reduction in training time should I realistically expect?

Thanks for your time,

Nick

___

Training details:

  • 4 1D conv stacks (1D conv, batch norm, ReLU activation)
  • 5 residual blocks, each 3 deep (containing the described conv stack)
  • 19 total conv stacks
  • CTC loss
  • SGD optimizer
  • batches of 16 from generator
  • 20 epochs
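For reference, the figures quoted in this post imply the rough budget below. This is a back-of-the-envelope sketch in Python: it assumes the 40-45 hour estimate holds for every epoch and that training can be checkpointed and resumed between jobs. The function name `jobs_needed` is made up for illustration.

```python
import math

def jobs_needed(hours_per_epoch, epochs, walltime_hours):
    """Return (total training hours, number of walltime-limited jobs needed).

    Assumes training can be checkpointed and resumed between jobs.
    """
    total_hours = hours_per_epoch * epochs
    return total_hours, math.ceil(total_hours / walltime_hours)

# Numbers taken from the post: 40-45 h per epoch, 20 epochs, 24 h wall time.
for hours_per_epoch in (40, 45):
    total, jobs = jobs_needed(hours_per_epoch, epochs=20, walltime_hours=24)
    print(f"{hours_per_epoch} h/epoch: {total} h total, {jobs} jobs of 24 h")
```

Since a single epoch is longer than one 24-hour job, any resume scheme would also have to checkpoint within an epoch, not just between epochs.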

Edit: It appears that the Intel-optimized TensorFlow 1.14 package is not GPU-enabled. Is there an optimized version of TensorFlow that is GPU-enabled and available on the DevCloud?


Hi,

Thanks for contacting us. 

DevCloud nodes do not have a dedicated GPU; they have an integrated GPU (iGPU). An iGPU is different from a dedicated GPU and is not currently supported by Intel Optimized TensorFlow.

However, you can apply CPU optimizations to get improved performance.

Please see the URLs below for more details on optimizing TensorFlow workloads on CPU.

https://software.intel.com/en-us/articles/maximize-tensorflow-performanc...

https://software.intel.com/en-us/articles/tips-to-improve-performance-fo...

Distributed training with Horovod can also spread the workload across multiple cores on the same CPU node and extend to multiple nodes. See https://github.com/horovod/horovod for more details.
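The basic idea behind this kind of data-parallel training is that each worker process trains on its own slice of the data. The sketch below shows only that slicing step in plain Python; it is not Horovod code, and `shard_for_worker` and the file names are hypothetical. In a Horovod job, the `rank` and `size` values would come from `hvd.rank()` and `hvd.size()` after calling `hvd.init()`.

```python
def shard_for_worker(items, rank, size):
    """Return the slice of `items` that worker `rank` (0-based) out of `size` handles."""
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    # Every size-th item, starting at this worker's rank.
    return items[rank::size]

# Placeholder file names, for illustration only.
files = [f"utt_{i:05d}.wav" for i in range(10)]
for rank in range(4):
    print(rank, shard_for_worker(files, rank, size=4))
```

Each file ends up in exactly one shard, so the workers together still cover the whole dataset once per epoch.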

Hope this clarifies your query. Please feel free to reach out to us if you have any further queries. Thank You.

 


Hello Lakshmi,

Thanks for your response!

I went ahead and incorporated the suggestions made in the first two Intel articles you linked. Specifically, the settings I changed were as follows:

  • KMP_AFFINITY=granularity=fine,compact,1,0
  • KMP_BLOCKTIME=0
  • KMP_DUPLICATE_LIB_OK=True
  • KMP_SETTINGS=True
  • OMP_NUM_THREADS=6
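For anyone who prefers to set these from the training script rather than the shell, here is a minimal sketch. It assumes the variables are set before TensorFlow is imported, since as far as I know the OpenMP runtime reads them when it initializes. The helper name `apply_openmp_settings` is made up for illustration.

```python
import os

# Values copied from the list above.
OPENMP_SETTINGS = {
    "KMP_AFFINITY": "granularity=fine,compact,1,0",
    "KMP_BLOCKTIME": "0",
    "KMP_DUPLICATE_LIB_OK": "True",
    "KMP_SETTINGS": "True",
    "OMP_NUM_THREADS": "6",
}

def apply_openmp_settings(settings=OPENMP_SETTINGS):
    """Export each setting into the environment of the current process."""
    for name, value in settings.items():
        os.environ[name] = value

apply_openmp_settings()
# import tensorflow as tf  # assumption: import only after the variables are set
```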

I ran with these settings on an interactive node to get a time estimate and, unfortunately, the estimated training time is still excessive: it started at 40 hours per epoch and dropped to 32 hours after 14 batches of 32 items. The estimate is decreasing steadily, but not to anything reasonable.

Additionally, I tried packaging these directives up and submitting them as a job to the server. I have attached my error file, which says that the code was terminated after throwing an Xbyak error. I am unsure why I am able to run the code on an interactive node but not as a submitted job.

I am currently working on incorporating Horovod, but their readme doesn't make it clear how to use it with distributed CPUs, so that is taking some time.

--Nick

Attachments: 

Specs2Text_small.e448609.txt (text/plain, 8.04 KB)

Hi Nick,

 

Would it be possible to share the workload along with the steps you have followed to submit the job script for further debugging?

 

Thanks,

Lakshmi.

 

 

 


Sure thing!

Attached are the model and training files alongside the shell script that I submitted as a job.

The job was submitted via qsub:

qsub devcloud_Specs2text.sh

When this did not work, I switched to an interactive node via:

qsub -I -l nodes=1:gpu:ppn=2 -d .

and ran the training script from there after exporting the necessary environment variables.

I am not exactly sure what you mean by workload, but the network is currently being trained on 28,539 files taking up 11 GB of data across 20 epochs in batches of 16. I am currently batching through a generator to try to save some room in onboard memory.
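To put those numbers in per-epoch terms, here is a small Python sketch of the batching arithmetic. The function names are hypothetical and this is not the generator from my attached scripts; it only shows how 28,539 files split into batches of 16.

```python
import math

def steps_per_epoch(num_files, batch_size):
    """Number of batches needed to see every file once."""
    return math.ceil(num_files / batch_size)

def batch_indices(num_files, batch_size):
    """Yield (start, stop) index pairs, one per batch; the last batch may be short."""
    for start in range(0, num_files, batch_size):
        yield start, min(start + batch_size, num_files)

print(steps_per_epoch(28539, 16))          # 1784 batches per epoch
print(list(batch_indices(28539, 16))[-1])  # (28528, 28539): final batch has 11 files
```

Only one batch of audio needs to be loaded at a time this way, which is the point of going through a generator.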

Hopefully this helps!

Attachments: 

for_debug.tar.gz (application/x-gzip, 3.47 KB)

Best Reply

Hi Nick,

Thanks for sharing the model files and training files along with the shell script.

We are unable to recreate the Xbyak error that you mentioned previously, as we don't have the dataset and json file referenced in the train.py file.

Could you please provide a subset of the dataset along with the json file so that we can recreate the error on our end?

Meanwhile, could you please try submitting the script as below:

#PBS -l nodes=1:gpu
#PBS -l walltime=24:00:00
#PBS -N Specs2Text_small
cd $PBS_O_WORKDIR

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=6
export KMP_SETTINGS=TRUE

cd ~/NUC_project/Specs2text

echo Starting training...
python train.py

Thanks.

 

 


Identifying a GPU node as a PBS directive appears to have at least allowed me to run the job outside an interactive node, so that's good!

I have attached a folder containing a small dataset (388 files), the json, and all of the necessary scripts to run the model. You will need a couple uncommon packages, unidecode and inflect, which can be pip installed.

Given that you have the correct packages installed, you should only have to run the train.py file from the master directory.

Attachments: 

for_debug_smallset.tar.gz (application/x-gzip, 72.57 MB)

Hi Nick, 

Thanks for sharing the datasets along with the json file.

We created a conda environment and installed all the necessary packages. In the first epoch, after completing almost 22 iterations, we get the following error.

Traceback (most recent call last):
  File "train.py", line 41, in <module>
    verbose=1)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
    steps_name='steps_per_epoch')
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 220, in model_iteration
    batch_data = _get_next_batch(generator, mode)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 362, in _get_next_batch
    generator_output = next(generator)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 918, in get
    six.reraise(*sys.exc_info())
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 894, in get
    inputs = self.queue.get(block=True).get()
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/site-packages/tensorflow/python/keras/utils/data_utils.py", line 828, in next_sample
    return six.next(_SHARED_SEQUENCES[uid])
  File "/home/uXXXXX/for_debug/data_gen.py", line 91, in next_batch
    self.genshuffle()
  File "/home/uXXXXX/for_debug/data_gen.py", line 102, in genshuffle
    self.wavpath, self.transcript, self.finish = shuffle(self.wavpath,
AttributeError: 'BatchGen' object has no attribute 'wavpath'
Traceback (most recent call last):
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
    finalizer()
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 486, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/uXXXXX/.conda/envs/hvd_idz/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
    os.unlink(name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs000000130031880a000000cd'

As the error trace shows, the wavpath attribute is never initialized in the BatchGen class in data_gen.py. Please let us know whether we are missing any other code snippets.
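For illustration, the usual way to avoid this kind of AttributeError is to assign every attribute in `__init__` before any method reads it. The sketch below is a hypothetical stand-in written with the standard library only; it is not the code from data_gen.py, and it assumes `wavpath`, `transcript`, and `finish` are parallel lists that must be shuffled with the same permutation.

```python
import random

class BatchGen:
    """Hypothetical stand-in for the generator class; not the code from data_gen.py."""

    def __init__(self, wavpath, transcript, finish, seed=None):
        # Assign every attribute that later methods read, so genshuffle()
        # cannot fail with "object has no attribute 'wavpath'".
        self.wavpath = list(wavpath)
        self.transcript = list(transcript)
        self.finish = list(finish)
        self._rng = random.Random(seed)

    def genshuffle(self):
        """Shuffle the three parallel lists with one shared permutation."""
        order = list(range(len(self.wavpath)))
        self._rng.shuffle(order)
        self.wavpath = [self.wavpath[i] for i in order]
        self.transcript = [self.transcript[i] for i in order]
        self.finish = [self.finish[i] for i in order]

# Placeholder data, for illustration only.
gen = BatchGen(["a.wav", "b.wav", "c.wav"], ["a", "b", "c"], [1, 2, 3], seed=0)
gen.genshuffle()
print(list(zip(gen.wavpath, gen.transcript, gen.finish)))
```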

Thanks.


My mistake! It appears I had forgotten to update one of the functions in the generator script.

Attached is the updated script. You should just have to remove the _.txt after the py and then delete the old version.

Attachments: 

data_gen.py_.txt (text/plain, 5.61 KB)

Hi Nick,

Thanks for sharing the updated python file. We have submitted the script again with the new file and will post the result once training is complete.

Thank You.


Hi Nick,

We were able to run the python files without the Xbyak error that you mentioned earlier. We were also able to complete training in almost 9 hours.

However, we get an OS error at the end, after the model is created. The output model generated is 414 MB.

Please follow the steps given below to create a conda environment and install the necessary packages.

conda create -n env_speech -c intel python=3.6
source activate env_speech
pip install numpy
pip install matplotlib
pip install scipy
pip install scikit-learn
pip install tensorflow==1.14.0

Please find the attached script along with the output and error file generated.

You can try optimizing the same code by installing Intel Optimized TensorFlow instead of stock TensorFlow and by using other OMP/KMP settings as well.

Please feel free to get back to us if you run into any errors. Thank you.

Attachments: 

For Debug.zip (application/zip, 8.9 KB)

Hey Lakshmi,

Thanks for putting in time on this. The Xbyak error might have arisen because I had not created a new environment. I made a new one for my development with Horovod and have yet to see that error come up.

It doesn't appear that the files are attached. Would you mind trying to attach them again?

Thanks,

Nick


Hi Nick,

 

I have attached the zip file again to the previous post. Please let us know if you still face any issues.

 

Thanks.

 


Hi Nick,

Could you please confirm whether the solution provided was helpful?
