A Breakthrough in Speech Technology: Baidu Launched SMLTA, the First Streaming Multi-layer Truncated Attention Model for Large-scale Online Speech Recognition



On 16 January 2019, Baidu announced four major breakthroughs in speech technology. Among them, the most notable one is the streaming multi-layer truncated attention model (SMLTA) for online automatic speech recognition (ASR). SMLTA is a milestone in large-scale deployment of an attention model in the speech technology industry.

Attention models have been used in automatic speech recognition for several years. Their core idea is to learn, for each Chinese syllable (or Chinese character) in a sentence, which audio features in the continuous speech stream correspond to it. In this way, the model establishes a direct relationship between particular sounds in the audio and the words of the recognized sentence, and recognition becomes a word-by-word generation process. Such attention-based models abandon traditional speech recognition frameworks such as the hybrid deep neural network-hidden Markov model (DNN-HMM) acoustic model; instead, they model speech recognition end to end, an approach widely recognized to have better modeling ability.
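The mechanism described above can be sketched with a toy dot-product attention step. This is a generic illustration of attention over acoustic frames, not Baidu's actual architecture; the function name, dimensions, and random inputs are all invented for the example:

```python
import numpy as np

def attention_context(encoder_feats, query):
    """Toy dot-product attention: for one output token, pick a weighted
    mix of the audio frames most relevant to that token."""
    # encoder_feats: (T, d) acoustic features for T frames
    # query: (d,) decoder state for the token being generated
    scores = encoder_feats @ query               # (T,) relevance per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over ALL frames
    return weights @ encoder_feats               # (d,) context vector

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 8))            # 100 frames, 8-dim features
ctx = attention_context(feats, rng.standard_normal(8))
print(ctx.shape)                                 # (8,)
```

Note that the softmax normalizes over every frame of the utterance, which is exactly what makes the global variant hard to stream.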

In recent years, many researchers have tried attention models in speech recognition tasks and achieved a series of improvements in laboratory environments. Thus far, however, there have been few successful cases of large-scale deployment of attention models in online speech recognition systems, for two main reasons:

1. Streaming decoding. Most traditional attention models attend over the sentence as a whole; a typical example is the LAS (Listen, Attend and Spell) model proposed by Google. If such a global attention model is used in online speech recognition, the entire utterance needs to be uploaded to the server before the decoder can begin to compute. Real-time speech interaction is therefore impossible, and users inevitably wait longer for results, hurting the user experience. Some methods do achieve streaming attention, such as Google's Neural Transducer (NT) model, but their accuracy is worse than that of the global attention model, which again degrades the user experience.

2. Modeling long sentences. Traditional attention models consider all the information in the sentence and select the features that best match the portion currently being decoded. The longer the sentence, the harder this selection becomes, resulting in more errors.
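The streaming problem in point 1 can be made concrete in a few lines of NumPy. Because global attention normalizes its weights over the whole utterance, the weights assigned to early frames change once later audio arrives, so the decoder cannot safely emit tokens before the utterance ends. This is a toy illustration, not any production system:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
scores_300 = rng.standard_normal(300)                             # scores after 3 s of audio
scores_600 = np.concatenate([scores_300, rng.standard_normal(300)])  # same audio, 3 s more arrives

w_early = softmax(scores_300)        # weights if we tried to decode at 3 s
w_full = softmax(scores_600)[:300]   # weights on those same frames at 6 s

# The denominator of the softmax grew, so every early weight shrank:
# any token emitted at 3 s was based on weights that are now wrong.
print(np.allclose(w_early, w_full))  # False
```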

To solve these two problems, the industry needs a new attention model for ASR that can perform attention modeling and decoding concurrently with the streaming upload of speech data, reducing users' waiting time for recognition results and making real-time online speech interaction possible. At the same time, as the input speech grows longer, the continuous speech stream can be truncated so that the attention model stays focused, improving accuracy on long sentences.

This innovative streaming multi-layer truncated attention model, SMLTA, proposed by Baidu, is the first of its kind in the industry, and it powers the world's first large-scale online speech recognition service based on attention technology. This technological breakthrough reinforces Baidu's leadership in AI industrial applications.

The four major innovations of SMLTA are truncation, streaming, multi-layer modeling, and a combined CTC-and-attention approach. Specifically, the peak information of CTC (Connectionist Temporal Classification, a speech recognition training and scoring method) is used to truncate the continuous speech stream, and the attention model for the current modeling unit then runs on each truncated speech segment. In this way, the whole sentence is split into local speech segments, and local attention selects the suitable features. Meanwhile, to overcome the insertion and deletion errors that a CTC model inevitably makes, SMLTA introduces a special multi-layer attention mechanism that selects features progressively and more accurately layer by layer. Last but not least, while the recognition accuracy of this innovative model surpasses that of traditional global attention modeling, its online resource consumption, such as computation and decoding speed, stays the same as that of the traditional CTC model. The announcement of SMLTA also marks the first reported case of a local attention model outperforming a global attention model.
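A rough sketch of the truncation idea follows. The details of SMLTA are not public in this announcement, so the segmentation rule, shapes, and function names below are assumptions chosen only to illustrate the stated principle: CTC spikes bound local segments, and attention then runs within each segment rather than over the whole utterance:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctc_peak_segments(ctc_logits, blank=0):
    """Frames where a non-blank label has the highest CTC posterior act
    as 'peaks'; consecutive peaks bound the local segments.
    (Illustrative rule only; SMLTA's actual truncation is not public.)"""
    post = softmax(ctc_logits)                                # (T, vocab)
    peaks = [t for t in range(len(post)) if post[t].argmax() != blank]
    bounds, start = [], 0
    for p in peaks:
        bounds.append((start, p + 1))   # segment ends just after its peak
        start = p + 1
    return bounds

def local_attention(feats, query, seg):
    """Attention restricted to one truncated segment, not the whole sentence."""
    lo, hi = seg
    scores = feats[lo:hi] @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ feats[lo:hi]

rng = np.random.default_rng(2)
logits = rng.standard_normal((50, 6))   # 50 frames, 5 labels + blank
feats = rng.standard_normal((50, 8))    # matching acoustic features
segs = ctc_peak_segments(logits)
contexts = [local_attention(feats, rng.standard_normal(8), s) for s in segs]
print(len(segs), contexts[0].shape)
```

Restricting each attention computation to a bounded segment is what lets decoding proceed as audio streams in, and keeps the attention focused regardless of how long the full sentence grows.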

Baidu has successfully deployed this attention model online in Baidu's input method editor (IME) products, serving hundreds of millions of users; it is considered the world's first large-scale deployment of an attention model for online speech recognition. In terms of computing requirements, the model runs efficiently with all computation done on CPUs, with no additional GPUs needed, and it uses the same number of machines as the traditional CTC model. Finally, in terms of accuracy in the IME products, extensive tests show that SMLTA improves relative accuracy by 15% over the original Deep Peak 2 CTC system.

With high performance and low power consumption, Baidu’s SMLTA is undoubtedly another breakthrough in the history of Chinese online speech recognition.

In addition to online speech, Baidu's speech technology has achieved breakthroughs in offline speech, mixed Chinese-English input, and mixed Mandarin-dialect input. At present, the accuracy of offline speech input in Baidu IME products is 35% higher than the industry average, ensuring a smooth and fast user experience without a network connection. Technological innovation has enabled Baidu's IME products to surmount linguistic barriers, making them the only input method to achieve high-precision mixed Chinese-English speech input without affecting the accuracy of Chinese speech input at all. The recognition system also offers a "Dialect-Free Speech" feature that integrates Mandarin and six major Chinese dialects, so there is no need to press any button to switch between Mandarin and a dialect. Users can speak their preferred dialect anytime, anywhere.

Since 2012, Baidu has continuously explored and innovated in speech recognition technology. The company has not only improved recognition accuracy but also set technical directions with its AI technology. Last year, at the Baidu Input Method Conference, the Baidu speech team released the Deep Peak 2 model, which surpassed the traditional model that had been in use for more than ten years. Deep Peak 2 makes use of the advantages of neural network models and greatly improves recognition accuracy across different scenarios. Within a year, the Baidu speech technology team has again showcased a significant technological innovation.

"We believe that technology is real only when it is applied to products that are truly experienced by users. We will never create technology for the sake of creating it," said Liang Gao of the Speech Technology Division, AIG, Baidu.