Baidu Research presents Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. The biggest obstacle to building such a system thus far has been the speed of audio synthesis: previous approaches have taken minutes or hours to generate only a few seconds of speech. We solve this challenge and show that we can do audio synthesis in real time, which amounts to a speedup of up to 400X over previous WaveNet inference implementations.
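To make the scale of that speedup concrete, here is a rough back-of-the-envelope sketch. It assumes that "real time" means a real-time factor (synthesis time divided by audio duration) of exactly 1.0 and that the full 400X figure applies; the function names and the 3-second clip length are hypothetical and chosen only for illustration, not taken from our system.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(rtf: float) -> bool:
    """Synthesis keeps up with playback when the factor is at most 1.0."""
    return rtf <= 1.0


def baseline_synthesis_seconds(
    audio_seconds: float, speedup: float, rtf_after: float = 1.0
) -> float:
    """Time a baseline would need if it is `speedup` times slower.

    Assumption: the sped-up system runs at `rtf_after` (1.0 = exactly real time).
    """
    return audio_seconds * rtf_after * speedup


# Hypothetical example: a 3-second clip and the upper-bound 400X speedup.
clip_seconds = 3.0
seconds_before = baseline_synthesis_seconds(clip_seconds, speedup=400.0)
minutes_before = seconds_before / 60.0

print(f"Baseline: {seconds_before:.0f} s ({minutes_before:.0f} min) for {clip_seconds:.0f} s of audio")
print("Real time after speedup:", is_real_time(real_time_factor(clip_seconds, clip_seconds)))
```

Under these assumptions, a clip that now takes about as long to synthesize as it does to play would previously have taken on the order of minutes, which matches the range described above.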
Synthesizing artificial human speech from text, commonly known as text-to-speech (TTS), is an essential component in many applications such as speech-enabled devices, navigation systems, and accessibility for the visually impaired. Fundamentally, it allows human-technology interaction without requiring visual interfaces.
Modern TTS systems are based on complex, multi-stage processing pipelines, each stage of which may rely on hand-engineered features and heuristics. Due to this complexity, developing new TTS systems can be very labor-intensive and difficult.
Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features. This makes our system more readily applicable to new datasets, voices, and domains without any manual data annotation or additional feature engineering.
Deep Voice lays the groundwork for truly end-to-end speech synthesis without a complex processing pipeline and without relying on hand-engineered features for inputs or pre-training.
Our current pipeline is not yet end-to-end, and consists of a phoneme model and an audio synthesis component. The clips below are synthesized from text with our entire pipeline. Here are two utterances chosen at random.
The robotic nature of the voice comes from the pipeline structure and the phoneme model; the audio synthesis component alone generates much more natural clips. The following clips were generated by the audio synthesis component using features taken from the ground truth audio instead of the output of the phoneme model.
These samples sound very close to the original audio, showing that our audio synthesis component can reproduce human voices very effectively. The following are the ground truth recordings for the utterances above.
Deep learning has revolutionized many fields such as computer vision and speech recognition, and we believe that text-to-speech is now at a similar tipping point. We’re excited to see what the deep learning community can come up with and hope to accelerate that by sharing our entire text-to-speech system in reproducible detail.
For more details, take a look at our paper.