Deep Voice 2: Multi-Speaker Neural Text-to-Speech



In February, Baidu Silicon Valley AI Lab published Deep Voice 1, a system for generating synthetic human voices entirely with deep neural networks. Unlike alternative neural text-to-speech (TTS) systems, Deep Voice 1 runs in real time, synthesizing audio as fast as it needs to be played – making it usable for interactive applications like media and conversational interfaces. By training deep neural networks capable of learning from large amounts of data and simple features (rather than custom-designed, hand-engineered pipelines), we created an incredibly flexible system for high-quality voice synthesis in real time.

Today, we’re excited to announce Deep Voice 2, the next iteration of the Deep Voice system. In only three months, we’ve been able to scale our system from 20 hours of speech and a single voice to hundreds of hours with hundreds of voices. Unlike traditional systems, which need dozens of hours of audio from a single speaker, Deep Voice 2 can learn hundreds of unique voices from less than half an hour of data per speaker and imitate them closely, while achieving high audio quality.

Deep Voice 2 learns to generate speech by finding shared qualities between different voices. Specifically, each voice corresponds to a single vector – about 50 numbers which summarize how to generate sounds that imitate the target speaker. Unlike all previous TTS systems, Deep Voice 2 learns these qualities from scratch, without any guidance about what makes voices distinguishable.
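To make the idea of "one vector per voice" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the actual Deep Voice 2 architecture: the names `speaker_embeddings`, `projection`, and `condition_on_speaker`, the hidden size of 8, and the additive tanh conditioning are all hypothetical choices made for this example. The only detail taken from the text is the embedding size of about 50 numbers. The vectors are randomly initialized here; in a trained system they would be learned together with the shared weights.

```python
import math
import random

EMBEDDING_DIM = 50  # "about 50 numbers" per speaker, as described in the text
HIDDEN_DIM = 8      # hypothetical size of a shared hidden layer (assumption)

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_vector(n, scale=0.1):
    """Return n small random values, standing in for trainable parameters."""
    return [rng.uniform(-scale, scale) for _ in range(n)]


# One vector per speaker. Training would adjust these; here they stay random.
speaker_embeddings = {speaker_id: random_vector(EMBEDDING_DIM)
                      for speaker_id in range(3)}

# Weights shared by every speaker: they project an embedding into hidden space.
projection = [random_vector(EMBEDDING_DIM) for _ in range(HIDDEN_DIM)]


def condition_on_speaker(hidden, speaker_id):
    """Shift a shared hidden state by a projection of the speaker's vector.

    This additive conditioning is an assumption chosen for illustration.
    """
    embedding = speaker_embeddings[speaker_id]
    conditioned = []
    for h, row in zip(hidden, projection):
        speaker_bias = sum(w * e for w, e in zip(row, embedding))
        conditioned.append(math.tanh(h + speaker_bias))
    return conditioned


# The same shared hidden state yields different activations per speaker.
hidden = [0.5] * HIDDEN_DIM
voice_a = condition_on_speaker(hidden, 0)
voice_b = condition_on_speaker(hidden, 1)
```

The design point the sketch tries to capture is that everything except the small per-speaker vector is shared, so adding a voice means learning only one more vector rather than a whole new model.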

Below is a set of samples from our system, trained on approximately 100 speakers. Each speaker has his or her own speech cadence, accent, pitch, and pronunciation habits – and our system can mirror those almost exactly.

“About half the people who are infected also lose weight.”

“We can continue to strengthen the education of good lawyers.”