- Visão geral
Modern automatic speech recognition systems incorporate tremendous amount of expert knowledge and a wide array of machine learning techniques. The promise of deep learning is to strip away much of this complexity in favor of the flexibility of neural networks. We will describe our efforts in implementing end-to-end speech recognition in neon by combining convolutional and recurrent neural networks to create an acoustic model followed by a graph-based decoding scheme. These types of models are trained to go directly from raw waveforms to transcribed speech without requiring any kind of explicit forced alignment. We will also discuss additional challenges that must be overcome to produce state-of-the-art results.