I am working on a Keras reimplementation of the Jasper speech-to-text network from NYU and NVIDIA labs, going off of the information in their arXiv paper to reconstruct the network as faithfully as possible. I am currently using an Intel distribution of TensorFlow 1.14 on the DevCloud with one GPU node to train the model, and an Intel NUC for inference.
However, I am running into quite a large hurdle. When I try to train the smallest version of the model (19 layers containing several residual connections) on the smallest training set (~100 hours of speech), I get an estimated time per epoch of 40-45 hours. Given the maximum wall time of 24 hours on the DevCloud, training this network as-is does not appear to be feasible. At this point I am unsure which areas I could optimize to bring the training time down to something more manageable. Is this just a situation where I should throw more GPUs at it? If I were to upgrade to a more recent version of TensorFlow, how much of a gain in training time should I realistically expect?
Thanks for your time,

Model and training details (see the sketch after this list):
- 4 standalone 1D conv stacks (1D conv, batch norm, ReLU activation)
- 5 residual blocks, each 3 conv stacks deep (same conv stack structure as above)
- 19 conv stacks in total
- CTC loss
- SGD optimizer
- batches of 16 from a generator
- 20 epochs
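
In case it helps to see the structure concretely, here is a rough sketch of how I am wiring the blocks together with tf.keras from TF 1.14. The filter counts, kernel sizes, feature dimension, and vocabulary size below are placeholders rather than the exact values from the Jasper paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, optimizers, backend as K

def conv_stack(x, filters, kernel_size, stride=1, dilation=1):
    """One conv stack: 1D conv -> batch norm -> ReLU."""
    x = layers.Conv1D(filters, kernel_size, strides=stride,
                      dilation_rate=dilation, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

def residual_block(x, filters, kernel_size, depth=3):
    """Residual block: `depth` conv stacks plus a projected skip connection."""
    skip = layers.Conv1D(filters, 1, padding='same')(x)   # match channel count
    skip = layers.BatchNormalization()(skip)
    for _ in range(depth):
        x = conv_stack(x, filters, kernel_size)
    return layers.Activation('relu')(layers.Add()([x, skip]))

n_features, n_classes = 64, 29   # placeholders: mel bins, characters + CTC blank

inputs = layers.Input(shape=(None, n_features), name='features')

# 1 prologue conv stack, 5 residual blocks of depth 3, 3 epilogue conv stacks
# = 19 conv stacks in total, matching the count in the list above.
x = conv_stack(inputs, 256, 11, stride=2)
for filters, kernel in [(256, 11), (384, 13), (512, 17), (640, 21), (768, 25)]:
    x = residual_block(x, filters, kernel)
x = conv_stack(x, 896, 29, dilation=2)
x = conv_stack(x, 1024, 1)
logits = layers.Conv1D(n_classes, 1, padding='same', name='logits')(x)

# CTC loss wired in as a Lambda layer, the usual Keras 1.x pattern.
# Note: input_length must reflect the time steps *after* the stride-2 prologue.
labels = layers.Input(shape=(None,), name='labels')
input_len = layers.Input(shape=(1,), name='input_length')
label_len = layers.Input(shape=(1,), name='label_length')

softmax = layers.Activation('softmax')(logits)
ctc = layers.Lambda(
    lambda args: K.ctc_batch_cost(args[0], args[1], args[2], args[3]),
    name='ctc')([labels, softmax, input_len, label_len])

model = Model([inputs, labels, input_len, label_len], ctc)
model.compile(optimizer=optimizers.SGD(lr=0.01, momentum=0.9),
              loss={'ctc': lambda y_true, y_pred: y_pred})
model.summary()
```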
Edit: It appears that the Intel-optimized TensorFlow 1.14 package is not GPU-enabled. Is there an optimized, GPU-enabled build of TensorFlow available for use on the DevCloud?
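
For anyone who wants to reproduce the check: a quick way to see whether a given TensorFlow 1.14 build was compiled with CUDA and can actually see a GPU on the node (the devices listed will of course depend on the node you land on):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

print('TensorFlow version:', tf.__version__)
print('Built with CUDA:   ', tf.test.is_built_with_cuda())
print('GPU available:     ', tf.test.is_gpu_available())

# List every device the runtime can see (CPU only, if the build is not GPU-enabled).
print([d.name for d in device_lib.list_local_devices()])
```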