Accurate speech recognition systems are vital to many businesses, whether as a virtual assistant taking commands, a system that understands user feedback in video reviews, or a tool that improves customer service. However, today’s world-class speech recognition systems can only be built with user data from third-party providers or by recruiting graduates of the world’s top speech and language technology programs.
At Baidu Research, we have been working on developing a speech recognition system that can be built, debugged, and improved by a team with little to no experience in speech recognition technology (but with a solid understanding of machine learning). We believe a highly simplified speech recognition pipeline should democratize speech recognition research, just as convolutional neural networks revolutionized computer vision.
As part of this endeavor, we developed Deep Speech 1 as a proof of concept to show that a simple model can be highly competitive with state-of-the-art models. With Deep Speech 2, we showed that such models generalize well to different languages, and we deployed it in multiple applications. Today, we are excited to announce Deep Speech 3 – the next generation of speech recognition models, which further simplifies the model and enables end-to-end training while using a pre-trained language model.
In our paper, we perform an empirical comparison between three models for end-to-end speech recognition: CTC, which powered Deep Speech 2; attention-based Seq2Seq models, which powered Listen-Attend-Spell among others; and the RNN-Transducer. The RNN-Transducer can be thought of as an encoder-decoder model which assumes the alignment between input and output tokens is local and monotonic. This makes the RNN-Transducer loss a better fit for speech recognition (especially online recognition) than attention-based Seq2Seq models, removing the extra hacks applied to attentional models to encourage monotonicity.
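To make the local, monotonic alignment assumption concrete, here is a minimal sketch of the RNN-Transducer forward algorithm in numpy. It marginalizes over all monotonic alignment paths through a (time × label) lattice, where each step either emits a blank (advance one encoder frame) or a label (advance one target position). The function name and the layout of `log_probs` are illustrative, not from the paper; in practice the log-probabilities come from a trained joint network.

```python
import numpy as np

def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) for two scalars."""
    if a == -np.inf:
        return b
    if b == -np.inf:
        return a
    m = max(a, b)
    return m + np.log1p(np.exp(min(a, b) - m))

def rnnt_forward_logprob(log_probs, labels, blank=0):
    """Forward algorithm over the RNN-Transducer lattice.

    log_probs: shape (T, U+1, V), the joint network's log P(k | t, u),
    where T is the number of encoder frames, U the target length, and
    V the vocabulary size (index `blank` reserved for blank).
    Returns log P(labels | input), summed over all monotonic alignments.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive here by emitting blank at frame t-1
                alpha[t, u] = logsumexp2(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive here by emitting label u-1 at frame t
                alpha[t, u] = logsumexp2(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Every complete path ends with a final blank at (T-1, U).
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]
```

With uniform emission probabilities, the result reduces to (number of monotonic paths) × (1/V)^(T+U), which is a quick sanity check that the lattice recursion is correct.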
Additionally, unlike CTC, which requires an external LM to output meaningful results, the RNN-Transducer supports a purely neural decoder and has no mismatch between how the model is used at training and test time. Naturally, the RNN-Transducer model outperforms the CTC models, even without an external language model.
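The "purely neural decoder" can be illustrated with a greedy RNN-Transducer decoding loop: the prediction network conditions each emission on the labels emitted so far, playing the role CTC delegates to an external LM. The sketch below is illustrative; `toy_step` and `toy_joint` are stand-ins we made up for the trained prediction and joint networks.

```python
import numpy as np

def rnnt_greedy_decode(enc, step, joint, blank=0, max_symbols_per_frame=3):
    """Greedy RNN-Transducer decoding.

    `step(prev_label, state)` runs the prediction network, a purely
    neural decoder conditioned on previously emitted labels, and
    returns (features, new_state); `joint(enc_t, dec)` returns logits
    over the vocabulary, with index `blank` reserved for blank.
    """
    hyp, state = [], None
    for t in range(len(enc)):
        for _ in range(max_symbols_per_frame):
            dec, new_state = step(hyp[-1] if hyp else None, state)
            k = int(np.argmax(joint(enc[t], dec)))
            if k == blank:
                break          # blank: move on to the next encoder frame
            hyp.append(k)      # emit the label...
            state = new_state  # ...and advance the prediction network
    return hyp

# Toy stand-ins for the trained networks: this "prediction network"
# simply discourages emitting the same label twice in a row.
def toy_step(prev, state):
    dec = np.zeros(3)
    if prev is not None:
        dec[prev] = -10.0
    return dec, state

def toy_joint(enc_t, dec):
    return enc_t + dec
```

Because decoding only ever consults the encoder, prediction, and joint networks, the model is exercised at test time exactly as it was trained, with no external LM bolted on.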
The fact that Seq2Seq and RNN-Transducer models are successful without an external language model presents a challenge. Language models are vital to speech recognition for two reasons: first, they can be trained rapidly on much larger datasets; second, they can be used to specialize the speech recognition model to a context (user, geography, application, etc.) without a labeled speech corpus for each context. We learned the latter is especially vital in a production speech recognition system when we deployed Deep Speech 2.
To support these use cases, we developed Cold Fusion, which leverages a pre-trained language model when training a Seq2Seq model. We show that Seq2Seq models with Cold Fusion are able to better utilize language information, resulting in better generalization, faster convergence, and almost complete transfer to a new domain while using less than 10 percent of the labeled training data. Cold Fusion also gives us the ability to swap language models at test time to specialize to any context. While this work is on Seq2Seq models, it should apply equally well to RNN-Transducers.
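The heart of Cold Fusion is a learned gate that decides, per hidden unit, how much of the pre-trained LM's features to mix into the decoder state before predicting the next token; because the gate is trained jointly with the Seq2Seq model, the LM can later be swapped at test time. Below is a minimal numpy sketch of that gating computation; the class name, dimensions, and random initialization are our illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ColdFusionLayer:
    """Sketch of Cold Fusion gating for one decoder step."""
    def __init__(self, dec_dim, lm_vocab, hidden, vocab):
        # Project the LM's logits into a feature space (h^LM).
        self.W_lm = rng.normal(0, 0.1, (hidden, lm_vocab))
        # Fine-grained gate over [decoder state; LM features].
        self.W_g = rng.normal(0, 0.1, (hidden, dec_dim + hidden))
        # Output layer over the fused state.
        self.W_o = rng.normal(0, 0.1, (vocab, dec_dim + hidden))

    def __call__(self, dec_state, lm_logits):
        h_lm = np.tanh(self.W_lm @ lm_logits)        # LM features
        g = sigmoid(self.W_g @ np.concatenate([dec_state, h_lm]))
        s_cf = np.concatenate([dec_state, g * h_lm]) # gated fusion
        return softmax(self.W_o @ s_cf)              # next-token dist.
```

Gating on the LM's logits rather than a fixed LM hidden state is what lets a different language model be plugged in at test time: any LM that produces logits over the same vocabulary can drive the fused decoder.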
Together, the RNN-Transducer loss and Cold Fusion for leveraging language models present the next frontier for speech recognition. We are excited to look towards the future and explore what further advances these technologies will enable.
You can find the referenced papers here: