Baidu Research

Baidu at AAAI 2020: NLP, Machine Learning, and Computer Vision

2020-02-12

The 34th AAAI Conference on Artificial Intelligence (AAAI-20) is now underway in New York. One of the world’s leading conferences in artificial intelligence, AAAI-20 received over 8,800 submissions, of which 7,737 were reviewed and 1,591 were accepted, an acceptance rate of 20.6 percent.

This year, Baidu had a record 28 research papers accepted, covering topics from natural language processing and machine learning to computer vision. Although a number of our authors cannot attend the conference because of the recent coronavirus travel ban, we encourage attendees to stop by our booth and chat with our experts about our latest research projects and career opportunities.

In this blog post, we spotlight three of our accepted papers in more detail.

Pre-trained language model

Unsupervised pre-trained language models have driven significant progress across natural language processing tasks. However, existing models are typically trained on the co-occurrence of words and sentences alone, which leaves other valuable information in the training corpora untapped.

In the paper ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, our researchers proposed a continual pre-training framework named ERNIE 2.0, which incrementally builds pre-training tasks and learns them through continual multi-task learning. In this framework, models can learn different aspects of the knowledge in training corpora, including named entities, semantic closeness, and discourse relations.
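
To make the idea of incrementally adding tasks while still training on the earlier ones concrete, here is a minimal sketch of a training schedule. It is an illustration under our own assumptions, not the schedule used in the paper: the function name, the round-robin ordering, and the task labels are all hypothetical, and the real framework allocates training steps to tasks in a more elaborate way.

```python
from typing import List


def continual_multitask_schedule(tasks: List[str], steps_per_stage: int) -> List[List[str]]:
    """Build a toy schedule: stage k adds task k and trains on tasks 0..k together."""
    schedule = []
    for k in range(1, len(tasks) + 1):
        active = tasks[:k]  # every task introduced so far stays in the mix
        # Round-robin over the active tasks so earlier tasks keep being revisited.
        stage = [active[step % len(active)] for step in range(steps_per_stage)]
        schedule.append(stage)
    return schedule


# Hypothetical task labels, named after the kinds of knowledge mentioned above.
tasks = ["named_entity", "semantic_closeness", "discourse_relation"]
for stage_index, stage in enumerate(continual_multitask_schedule(tasks, 6), start=1):
    print(stage_index, stage)
```

Each stage prints the task chosen at every training step. The first stage contains only the first task, while later stages interleave every task introduced so far, which is how a continual multi-task setup aims to keep earlier knowledge from being forgotten.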

Experimental results show that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including the English tasks of the GLUE benchmark and several common Chinese tasks. Last year, ERNIE set a new state of the art on GLUE and became the world's first model to exceed 90 on the macro-average score (90.1), surpassing the human baseline by 3 points. Today, ERNIE is widely used in real-world applications, where it improves language understanding.

The paper was accepted for oral presentation, and the source code and pre-trained models have been released on GitHub.

Machine reading comprehension

Adversarial training has proved to be an effective way to train robust machine reading comprehension models. However, existing approaches rely on manually designed rules, which cannot systematically generate all possible adversarial samples.

In the paper A Robust Adversarial Training Approach to Machine Reading Comprehension, our researchers presented a model-driven approach that automatically identifies previously undetected adversarial samples and, in turn, improves the robustness of machine reading comprehension models.

Specifically, the researchers used an adversarial method to generate a perturbation vector for each training sample, with the aim of misleading the reading comprehension model. They then sampled from the lexical weights of the perturbation vectors to extract the corresponding discretized perturbation texts. These texts are used to construct adversarial samples, on which the reading comprehension model is trained. The steps are repeated until the model converges.
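
The loop described above can be sketched on a toy problem. The code below is a simplified illustration under our own assumptions, not the paper's implementation: the "reader" is a tiny linear classifier with a logistic loss, the perturbation vector is a softmax-weighted mix of vocabulary embeddings, and a fixed number of epochs stands in for "until the model converges". All function names and parameters are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix(weights, embeddings):
    """Weighted sum of vocabulary embeddings: the 'soft' perturbation vector."""
    dim = len(embeddings[0])
    return [sum(a * e[k] for a, e in zip(weights, embeddings)) for k in range(dim)]


def loss_and_grad(w, feats, label):
    """Logistic loss of a toy linear model and dLoss/dScore; label is +1 or -1."""
    s = dot(w, feats)
    return math.log1p(math.exp(-label * s)), -label / (1.0 + math.exp(label * s))


def attack(w, embeddings, x, label, steps=20, lr=1.0):
    """Gradient ascent on the loss w.r.t. lexical weights (logits over the vocabulary)."""
    logits = [0.0] * len(embeddings)
    v = [dot(w, e) for e in embeddings]  # each word's contribution to the score
    for _ in range(steps):
        a = softmax(logits)
        feats = [xi + pi for xi, pi in zip(x, mix(a, embeddings))]
        _, dloss_ds = loss_and_grad(w, feats, label)
        v_bar = dot(a, v)
        # Chain rule through the softmax mixture: dScore/dLogit_j = a_j * (v_j - v_bar)
        logits = [l + lr * dloss_ds * aj * (vj - v_bar) for l, aj, vj in zip(logits, a, v)]
    return logits


def discretize(logits, n_tokens, rng):
    """Sample discrete vocabulary indices in proportion to the lexical weights."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=n_tokens)


def adversarial_training(embeddings, data, epochs=5, lr=0.5, n_tokens=2, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(embeddings[0])
    for _ in range(epochs):  # fixed epoch count stands in for "until convergence"
        for x, label in data:
            tokens = discretize(attack(w, embeddings, x, label), n_tokens, rng)
            # Adversarial sample: original features plus the mean embedding of the sampled words
            pert = [sum(embeddings[t][k] for t in tokens) / n_tokens for k in range(len(x))]
            x_adv = [xi + pi for xi, pi in zip(x, pert)]
            for feats in (x, x_adv):  # train on both the clean and the adversarial sample
                _, dloss_ds = loss_and_grad(w, feats, label)
                w = [wi - lr * dloss_ds * fi for wi, fi in zip(w, feats)]
    return w
```

The key design point this sketch tries to capture is the split between a continuous attack (optimizing weights over the vocabulary, which is differentiable) and a discrete sampling step that turns those weights back into words, so that the adversarial samples used for training are text rather than raw vectors.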

The results showed that the proposed adversarial training technique significantly improved performance across different adversarial datasets and generated diverse adversarial samples. The paper also notes that the method needs further improvement, as a number of the generated adversarial samples were not natural language.

Computer vision

3D object detection plays an increasingly critical role in autonomous driving, but stereo imagery-based 3D detection methods are still no match for lidar-based ones. In the paper ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection, our researchers propose adaptive zooming, a technique by which distant cars are analyzed at a larger scale to achieve more accurate depth estimation.

The resulting architecture, named ZoomNet, surpassed all previous state-of-the-art methods by significant margins on the popular KITTI 3D detection benchmark. More importantly, ZoomNet is the first stereo imagery-based solution to reach performance comparable to current lidar-based methods at a relatively low threshold.

More specifically, ZoomNet performs a fine-grained analysis of 2D instances from the left and right bounding boxes. The foreground pixels in 2D are then projected into 3D space for pose regression. With adaptive zooming, ZoomNet resizes each 2D instance bounding box to a uniform resolution and adjusts the camera’s intrinsic parameters accordingly. As a result, ZoomNet obtains higher-quality disparity maps from the resized images and constructs point clouds of similar density for instances at different depths. In addition, the researchers introduce part locations, a generalized version of key-points, to better localize cars and improve resistance to occlusion.
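
The link between resizing a bounding box and adjusting the camera intrinsics can be illustrated with the standard pinhole camera model. The sketch below is our own simplification, not ZoomNet's code: the class and function names are hypothetical, the numbers are made up for illustration, and half-pixel conventions are ignored. Cropping a box and resizing it is treated as viewing the scene through a virtual camera whose focal lengths are scaled by the zoom factor and whose principal point is shifted and scaled.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    fx: float  # focal length in pixels, horizontal
    fy: float  # focal length in pixels, vertical
    cx: float  # principal point, horizontal
    cy: float  # principal point, vertical


def zoom_intrinsics(K, box, out_w, out_h):
    """Intrinsics of the virtual camera that sees `box` resized to out_w x out_h."""
    x0, y0, x1, y1 = box
    kx = out_w / (x1 - x0)  # horizontal zoom factor
    ky = out_h / (y1 - y0)  # vertical zoom factor
    return Intrinsics(K.fx * kx, K.fy * ky, (K.cx - x0) * kx, (K.cy - y0) * ky)


def project(K, point):
    """Pinhole projection of a 3D point (X, Y, Z) in camera coordinates."""
    X, Y, Z = point
    return (K.fx * X / Z + K.cx, K.fy * Y / Z + K.cy)


# Illustrative values only: a small box around a distant car, zoomed 4x.
K = Intrinsics(fx=720.0, fy=720.0, cx=620.0, cy=187.0)
box = (600.0, 170.0, 664.0, 202.0)
K_zoom = zoom_intrinsics(K, box, out_w=256, out_h=128)
print(K_zoom)
print(project(K_zoom, (1.0, 0.2, 40.0)))
```

Projecting a 3D point with the adjusted intrinsics gives the same pixel as projecting it with the original camera and then cropping and resizing. Because the horizontal zoom also scales pixel offsets between the left and right views, a small, distant car is analyzed at a finer disparity resolution while the recovered depth stays consistent.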

Our researchers also presented the KITTI Fine-Grained car (KFG) dataset, which extends KITTI with instance-wise 3D CAD models and pixel-wise fine-grained annotations. Both the KFG dataset and our code will be made publicly available soon.

Accepted papers

Generative Adversarial Regularized Mutual Information Policy Gradient Framework for Automatic Diagnosis

Yuan Xia, Jingbo Zhou, Zhenhui Shi, Chao Lu, Haifeng Huang

Capturing Sentence Relations for Answer Sentence Selection with Multi-Perspective Graph Encoder

Zhixing Tian, Yuanzhe Zhang, Xinwei Feng, Wenbin Jiang, Yajuan Lyu, Kang Liu, Jun Zhao

Distributed Primal-Dual Optimization for Online Multi-task Learning

Peng Yang, Ping Li

Meta-CoTGAN: A Meta Cooperative Training Paradigm for Improving Adversarial Text Generation

Haiyan Yin, Dingcheng Li, Xu Li, Ping Li

IVFS: Simple and Efficient Feature Selection for High Dimensional Topology Preservation

Xiaoyun Li, Chenxi Wu, Ping Li

ERNIE 2.0: A Continual Pre-training Framework for Language Understanding

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, Haifeng Wang

Knowledge Graph Grounded Goal Planning for Open-Domain Conversation Generation

Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

A Robust Adversarial Training Approach to Machine Reading Comprehension

Kai Liu, Xin Liu, An Yang, Jin Liu, Jinsong Su, Sujian Li, Qiaoqiao She

Multi-Label Classification with Label Graph Superimposing

Ya Wang, Dongliang He, Fu Li, Xiang Long, Zhichao Zhou, Jinwen Ma, Shilei Wen

ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection

Zhenbo Xu, Wei Zhang, Xiaoqing Ye, Xiao Tan, Wei Yang, Shilei Wen, Errui Ding, Ajin Meng, Liusheng Huang

Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification

Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, Shilei Wen

Dynamic Instance Normalization for Arbitrary Style Transfer

Yongcheng Jing, Xiao Liu, Yukang Ding, Xinchao Wang, Errui Ding, Mingli Song, Shilei Wen

SetRank: A Setwise Bayesian Approach for Collaborative Ranking from Implicit Feedback

Chao Wang, Hengshu Zhu, Chen Zhu, Chuan Qin, Hui Xiong

Relational Graph Neural Network with Hierarchical Attention for Knowledge Graph Completion

Zhao Zhang, Fuzhen Zhuang, Hengshu Zhu, Zhiping Shi, Hui Xiong, Qing He

Why We Go Where We Go: Profiling User Decisions on Choosing POIs

Renjun Hu, Xinjiang Lu, Chuanren Liu, Yanyan Li, Hao Liu, Shuai Ma, Hui Xiong

Semi-Supervised Hierarchical Recurrent Graph Neural Network for City-Wide Parking Availability Prediction

Weijia Zhang, Hao Liu, Yanchi Liu, Jingbo Zhou, Hui Xiong

Learning Conceptual-Contextual Embeddings for Medical Text

Xiao Zhang, Dejing Dou, Ji Wu

Ultrafast Photorealistic Style Transfer via Neural Architecture Search

Jie An*, Haoyi Xiong*, Jun Huan, Jiebo Luo

Person Tube Retrieval via Language Description

Hehe Fan, Yi Yang

Context Modulated Dynamic Networks for Actor and Action Video Segmentation From a Sentence

Hao Wang, Cheng Deng, Fan Ma, Yi Yang

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

Adversarial Localized Energy Network for Structured Prediction

Pingbo Pan, Ping Liu, Yan Yan, Tianbao Yang, Yi Yang

EEMEFN: Low-Light Image Enhancement via Edge-Enhanced Multi-Exposure Fusion Network

Minfeng Zhu, Pingbo Pan, Wei Chen, Yi Yang

Relation-Aware Pedestrian Attribute Recognition with Graph Convolutional Networks

Zichang Tan, Yang Yang, Jun Wan, Guodong Guo, Stan Z. Li

GBCNs: Genetic Binary Convolutional Networks for Enhancing the Performance of 1-bit DCNNs

Chunlei Liu, Wenrui Ding, Yuan Hu, Baochang Zhang, Jianzhuang Liu, Guodong Guo

AutoRemover: Automatic Object Removal for Autonomous Driving Videos

Rong Zhang, Wei Li, Peng Wang, Chenye Guan, Jin Fang, Yuhang Song, Jinhui Yu, Baoquan Chen, Weiwei Xu, Ruigang Yang

CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion

Xinjing Cheng, Peng Wang, Chenye Guan, Ruigang Yang