Introducing Qian Yan, Baidu's New Plan to Build 100 Chinese NLP Datasets in Three Years



Since its creation in 2010, Baidu's natural language processing (NLP) unit has been hard at work powering some of our smartest linguistic products, from intelligent search to smart home devices. As Baidu NLP turns ten this year, we are thrilled to announce new plans and products intended to accelerate the large-scale adoption of NLP. We are also using the occasion to review Baidu's innovations in language and knowledge technologies.


Meet "Qian Yan"


One of the biggest barriers to developing language and knowledge technology is a shortage of data. That's why we have teamed up with the China Computer Federation and the Chinese Information Processing Society of China to launch Qian Yan, a plan to jointly develop the world's largest Chinese natural language processing database.


"In the future, we hope that more dataset developers can participate in the establishment of Qian Yan, jointly promote the progress of Chinese information processing technology, and build a world-wide Chinese information processing influence," says Hua Wu, chair of the Baidu Technical Committee. "In the next three years, we plan to collect and build no less than 100 Chinese natural language processing datasets for over 20 tasks, covering all areas of language and knowledge technology."


The first phase of Qian Yan, which means "Thousands of Words" in Chinese, covers seven major tasks, such as open-domain dialogue and reading comprehension. It includes more than 20 open-source Chinese datasets, jointly developed by dataset providers from 11 universities and enterprises.


Wu also announces a computing power-sharing initiative that offers computing resources to developers through Baidu AI Studio.

An overview of Baidu's language and knowledge technologies

CTO Haifeng Wang presents an overview of Baidu NLP's technology stack, which spans knowledge graphs, natural language understanding and generation, and downstream applications.


Knowledge graphs are a building block for computers to develop a cognitive understanding of the world. Baidu has built a large-scale knowledge graph with over five billion entities and 550 billion facts. The knowledge graph API handles over 40 billion calls each day.


In addition to the knowledge graph, Baidu NLP has advanced its language understanding capability. In 2019, we introduced ERNIE, a continual pre-training framework that builds and learns incrementally through sequential multi-task learning. ERNIE set a new state of the art on GLUE, a widely recognized multi-task benchmark and analysis platform for NLP, and became the first model to score over 90.


In terms of natural language generation (NLG), we have proposed ERNIE-GEN, an enhanced multi-flow sequence-to-sequence pre-training and fine-tuning framework that achieves state-of-the-art results on a range of NLG tasks. Additionally, we've presented GraphSum, an effective model to summarize long documents by incorporating explicit graph representations into the neural architecture.


This research has laid a solid foundation for downstream NLP applications like dialogue systems and machine translation. For instance, we have introduced PLATO-2, Baidu's newest open-domain chatbot, which can discuss any topic in both Chinese and English. Developers can use UNIT, our dialogue customization and service platform, to efficiently build intelligent dialogue systems. Baidu Translate now supports over 200 languages and serves 400,000 third-party applications. We have also incorporated new modeling methods like multi-agent joint learning and simultaneous semantic interpretation to improve translation quality.


In the past ten years, Baidu has won over 20 awards in language and knowledge, including the State Science and Technology Progress Award, and has taken first place in over 30 international competitions. Our researchers have published more than 300 academic papers and filed more than 2,000 patents.

Five new NLP products

Also announced are new releases and upgrades of five language and knowledge products:


The ERNIE semantic understanding platform, which is built on Baidu's deep learning platform PaddlePaddle and the ERNIE pre-training framework, provides a one-stop solution for developers to customize enterprise-level NLP models. Given labeled data, the platform can train and fine-tune a model and create an API for serving requests. Over 20,000 developers across the finance, telecommunications, education, and e-commerce industries have applied ERNIE to their businesses.


TextMind is an intelligent document analysis platform that offers document comparison and document review features, with the support of optical character recognition and NLP technology behind the scenes.


Baidu Brain's Intelligent Creation Platform helps content publishers create written and video content. In the four months since the platform went live, over 7,000 people have used its AI-powered video synthesis tool to create 150,000 videos. The platform has now received a significant upgrade that adds three new solutions: smart planning, smart editing, and smart review.


UNIT introduces three major features: smarter task-oriented dialogue understanding, easy-to-use form-based question answering, and a new general dialogue engine.


The newly released AI Simultaneous Interpretation Conferencing solution aims to act as a "conference interpreter" for users. Users can quickly set up a simultaneous interpretation service with only one computer and one mobile phone.