Deep Speech presented an end-to-end neural architecture using the CTC loss for speech recognition in multiple languages. Today, we present Gram-CTC, which extends the CTC loss function to automatically discover and predict word pieces instead of characters.
Models using Gram-CTC achieve state-of-the-art results on the Fisher-Swbd benchmark with a single model, demonstrating that end-to-end learning with Gram-CTC outperforms context-dependent-phoneme based systems trained on the same data, while also speeding up training by 2x.
Consider the following two transcriptions of the same audio clip, both of which are phonetically feasible.
- recognize speech using common sense
- wreck a nice beach you sing calm incense
CTC predicts one character at a time, assuming each character is conditionally independent of the others given the input. To make both transcripts likely, CTC has to pick between two candidate characters at each position where they differ, which looks like this.
Using only candidates from option 1 to fill in the blanks, we get our first target, “recognize speech …”, and using only candidates from option 2 we get “wreck a nice beach …”. Many nonsensical transcripts can also be generated by mixing candidates from option 1 with candidates from option 2.
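A toy sketch can make this concrete. Assuming (for illustration only) that the audio breaks into three spans, each with a candidate from option 1 and a candidate from option 2, CTC's independence assumption means every combination of choices is a plausible transcript:

```python
from itertools import product

# Hypothetical per-span candidates: the first entry in each pair comes from
# option 1 ("recognize speech ..."), the second from option 2
# ("wreck a nice beach ...").  Under CTC's conditional-independence
# assumption, any mix of choices across spans is a valid output.
span_candidates = [
    ("recognize speech", "wreck a nice beach"),
    ("using", "you sing"),
    ("common sense", "calm incense"),
]

transcripts = [" ".join(choice) for choice in product(*span_candidates)]
print(len(transcripts))  # 2^3 = 8 transcripts, most of them nonsensical
for t in transcripts:
    print(t)
```

Only two of the eight combinations are the sensible transcripts above; the other six are the nonsensical mixtures the model cannot rule out.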
Word pieces are units between characters and words, such as “ing”, “eaux”, and “sch”. While the same character may take on different pronunciations depending on context, word pieces tend to have a consistent pronunciation in English. So our example may be predicted using word pieces as follows:
As we can see, far fewer nonsensical transcriptions are possible. Predicting word pieces has the following additional advantages:
- It is easier to model, since word pieces are more closely related to sound than characters.
- Since word pieces capture a larger context of sound, far fewer prediction steps are needed. This means our models can use a larger stride and run on only half as many time steps, speeding up both training and inference. On the same infrastructure, our training time fell from 9 hours per pass over a 2,000-hour dataset to 5 hours.
- The model can learn that different spellings correspond to the same sound. In the example above, “alm” and “omm” are very close in pronunciation; this is hard for CTC to model but easier for Gram-CTC.
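To show what decomposing text into word pieces looks like, here is a toy greedy longest-match segmenter over a small hypothetical vocabulary. (Gram-CTC learns the decomposition automatically during training; this sketch only illustrates the target units.)

```python
# Hypothetical word-piece vocabulary; a learned vocabulary would be
# much larger and chosen by the model itself.
VOCAB = {"recog", "nize", "speech", "ing", "us",
         "common", "sense", "s", "u", "i", "n", "g"}

def segment(word, vocab, max_len=6):
    """Greedily split `word` into the longest matching vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if piece in vocab:
                pieces.append(piece)
                i += length
                break
        else:
            # Fall back to a single character if no piece matches.
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("recognize", VOCAB))  # ['recog', 'nize']
print(segment("using", VOCAB))      # ['us', 'ing']
```

Because each piece like “ing” maps to a fairly stable pronunciation, a model predicting these units has fewer ambiguous choices to make per step than one predicting raw characters.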
More details can be found in our paper.