Deep Voice 2: Multi-Speaker Neural Text-to-Speech

In February, Baidu Silicon Valley AI Lab published Deep Voice 1, a system for generating synthetic human voices entirely with deep neural networks. Unlike alternative neural text-to-speech (TTS) systems, Deep Voice 1 runs in real-time, synthesizing audio as fast as it needs to be played – making it usable for interactive applications like media and conversational interfaces. By training deep neural networks capable of learning from large amounts of data and simple features (rather than custom-designed hand-engineered pipelines), we created an incredibly flexible system for high-quality voice synthesis in real time.

Today, we’re excited to announce Deep Voice 2, the next iteration of the Deep Voice system. In only three months, we’ve scaled our system from 20 hours of speech and a single voice to hundreds of hours spanning hundreds of voices. Unlike traditional systems, which need dozens of hours of audio from a single speaker, Deep Voice 2 can learn hundreds of unique voices from less than half an hour of data per speaker while maintaining high audio quality.

Deep Voice 2 learns to generate speech by finding shared qualities between different voices. Specifically, each voice corresponds to a single vector of about 50 numbers that summarizes how to generate sounds imitating the target speaker. Unlike previous TTS systems, Deep Voice 2 learns these qualities from scratch, without any guidance about what makes voices distinguishable.
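To make the idea concrete, here is a minimal NumPy sketch of conditioning a shared model on a per-speaker vector. The dimensions, the additive-bias conditioning scheme, and all names (`condition_on_speaker`, `FEATURE_DIM`, etc.) are illustrative assumptions, not the actual Deep Voice 2 architecture; in practice the embeddings would be trained jointly with the rest of the network rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SPEAKERS = 100   # roughly matches the ~100 speakers mentioned above
EMBED_DIM = 50       # "about 50 numbers" per voice
FEATURE_DIM = 80     # hypothetical acoustic feature size

# One embedding row per speaker; in a real system these are learned
# parameters, updated by backpropagation along with the shared model.
speaker_embeddings = rng.normal(0.0, 0.1, size=(NUM_SPEAKERS, EMBED_DIM))

# Hypothetical projection mapping a speaker vector into feature space.
W = rng.normal(0.0, 0.1, size=(EMBED_DIM, FEATURE_DIM))

def condition_on_speaker(hidden, speaker_id):
    """Add a speaker-specific bias to shared hidden features.

    `hidden` is a (timesteps, FEATURE_DIM) array from the shared model;
    shifting it by a projection of the speaker's embedding lets one
    network produce different voices for different speaker IDs.
    """
    bias = speaker_embeddings[speaker_id] @ W   # shape (FEATURE_DIM,)
    return hidden + bias

hidden = np.zeros((5, FEATURE_DIM))
out_a = condition_on_speaker(hidden, speaker_id=3)
out_b = condition_on_speaker(hidden, speaker_id=7)
print(np.allclose(out_a, out_b))  # False: different speakers, different output
```

Because the speaker identity enters only through this small vector, the rest of the model's capacity is shared across all voices, which is what allows each new voice to be learned from so little data.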

Below is a set of samples from our system, trained on approximately 100 speakers. Each speaker has his or her own speech cadence, accent, pitch, and pronunciation habits – and our system can mirror those almost exactly.

“About half the people who are infected also lose weight.”

“We have the means to help ourselves.”

“We can continue to strengthen the education of good lawyers.”

For more information on Deep Voice 2, and to read the study in full, please see our paper.

May 24th, 2017


  1. Amit Barnwal May 25, 2017 at 10:54 am

    Difficult to understand what’s the real voice and what’s computer generated!!??

  2. Djeb May 25, 2017 at 11:43 am

    Will the code be released on github ?

  3. Sushil May 25, 2017 at 11:28 pm

    when can i use it?

  4. Chris May 26, 2017 at 12:50 am

    @Amit All these samples are computer generated!

  5. Sven May 30, 2017 at 1:32 am

    Will there be a API some Day? I want to use this technology within my Robot!

  6. Jason May 31, 2017 at 12:19 am

    I would love to try this for my own voice?! Any tutorials?
