About IDL

Baidu launched the Institute of Deep Learning in 2013. The team’s focus areas include image recognition, machine learning, robotics, human-computer interaction, 3D vision and heterogeneous computing.

Visit the Baidu IDL Beijing website →

Technical Work

CFO: Conditional Focused Neural Question Answering with Large-scale Knowledge Bases
Zihang Dai, Lei Li and Wei Xu
Annual Meeting of the Association for Computational Linguistics (2016)

How can we enable computers to automatically answer questions such as “Who created the character Harry Potter”? Carefully built knowledge bases provide rich sources of facts. However, answering factoid questions posed in natural language remains a challenge, because a single question can be phrased in many ways. We focus on the most common questions: those that can be answered with a single fact in the knowledge base. We propose CFO, a Conditional Focused neural-network-based approach to answering factoid questions over knowledge bases. Our approach first zooms in on a question to find the most probable candidate subject mentions, then infers the final answer with a unified conditional probabilistic framework. Powered by deep recurrent neural networks and neural embeddings, CFO achieves an accuracy of 75.7% on a dataset of 108k questions, the largest public dataset to date, outperforming the previous state of the art by an absolute margin of 11.8%.
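
The factorization at the heart of CFO can be sketched in a few lines; the tiny knowledge base and scoring functions below are toy stand-ins for the paper's learned neural components, not its actual models.

```python
# Toy sketch of CFO's conditional factorization:
#   p(subject, relation | question) = p(subject | question) * p(relation | subject, question)
KB = {
    ("Harry Potter", "character.created_by"): "J. K. Rowling",
    ("Harry Potter", "character.portrayed_by"): "Daniel Radcliffe",
}

def p_subject(question, subject):
    # Toy focus model: prefer subject mentions that appear in the question.
    return 0.9 if subject.lower() in question.lower() else 0.1

def p_relation(question, subject, relation):
    # Toy relation model: keyword matching as a proxy for a neural scorer.
    asks_creator = "created" in question.lower()
    return 0.9 if asks_creator == relation.endswith("created_by") else 0.1

def answer(question):
    s, r = max(KB, key=lambda sr: p_subject(question, sr[0]) * p_relation(question, *sr))
    return KB[(s, r)]

print(answer("Who created the character Harry Potter?"))  # J. K. Rowling
```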


Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation
Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, Wei Xu
Transactions of the Association for Computational Linguistics (2016)

Neural machine translation (NMT) aims to solve machine translation (MT) purely with neural networks and has shown promising results in recent years. However, most existing NMT models have shallow topologies, and a performance gap remains between a single NMT model and the best conventional MT systems. In this work, we introduce a new type of linear connection, named fast-forward connections, based on deep Long Short-Term Memory (LSTM) networks, together with an interleaved bi-directional architecture for stacking the LSTM layers. Fast-forward connections play an essential role in propagating gradients and in building a deep topology of depth 16. On the WMT’14 English-to-French task, we achieve BLEU=37.7 with a single attention model, which outperforms the corresponding single shallow model by 6.2 BLEU points. This is the first time a single NMT model has achieved state-of-the-art performance, outperforming the best conventional model by 0.7 BLEU points. Even without the attention mechanism, we still achieve BLEU=36.3. After special handling of unknown words and model ensembling, we obtain the best score on this task, BLEU=40.4. Our models are also validated on the more difficult WMT’14 English-to-German task.
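
The paper's fast-forward connections are linear paths between stacked LSTM layers in an interleaved bi-directional arrangement; the residual-style numpy sketch below illustrates only the core idea of a linear route past each nonlinearity, under simplified assumptions.

```python
import numpy as np

# Toy illustration of a linear path through a deep stack. A plain tanh
# transform stands in for each LSTM layer; the paper's actual connections
# and bi-directional stacking differ in detail.
rng = np.random.default_rng(0)

def deep_stack(x, depth=16, dim=8):
    h = x
    for _ in range(depth):
        W = rng.normal(scale=0.1, size=(dim, dim))
        h = np.tanh(h @ W) + h   # linear connection bypasses the nonlinearity
    return h

x = rng.normal(size=(4, 8))      # (batch, dim)
print(deep_stack(x).shape)       # (4, 8); gradients can flow along the sum path
```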


Online Reconstruction of Indoor Scenes from RGB-D Streams
Hao Wang, Jun Wang, Liang Wang
Conference on Computer Vision and Pattern Recognition (2016), dataset

We present a system capable of robust online volumetric reconstruction of indoor scenes from input captured by a handheld RGB-D camera. Our system is powered by a two-pass reconstruction scheme. The first pass tracks camera poses at video rate and simultaneously constructs a pose graph on the fly. The tracker operates in real time, which allows the reconstruction results to be visualized during the scanning process. Live visual feedback makes the scanning operation fast and intuitive. Upon termination of scanning, a second pass handles loop closures and reconstructs the final model using globally refined camera trajectories. The system runs online with low delay and returns a dense model of sufficient accuracy. Its strength lies in its speed, accuracy, simplicity and ease of implementation compared to previous methods. We demonstrate its performance on several real-world scenes and quantitatively assess modeling accuracy against ground-truth models obtained from a LIDAR scanner.
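
The two-pass control flow can be sketched as below; every function is a trivial placeholder for illustration, not the authors' implementation.

```python
# Structural sketch of the two-pass scheme described above.
def track_camera(frame):
    return frame["pose_guess"]            # a real tracker estimates this at video rate

def optimize_pose_graph(pose_graph):
    # Placeholder for loop-closure handling and global trajectory refinement.
    return [pose for _, pose in pose_graph]

def fuse_volumetric_model(pose_graph, refined_poses):
    return {"frames": len(pose_graph), "poses": refined_poses}  # stand-in for fusion

def reconstruct(rgbd_stream):
    pose_graph = []
    for frame in rgbd_stream:                       # pass 1: online tracking
        pose_graph.append((frame, track_camera(frame)))
    refined = optimize_pose_graph(pose_graph)       # pass 2: after scanning ends
    return fuse_volumetric_model(pose_graph, refined)

print(reconstruct([{"pose_guess": i} for i in range(3)]))
```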


CNN-RNN: A Unified Framework for Multi-label Image Classification
Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu
Conference on Computer Vision and Pattern Recognition, Oral (2016)

While deep convolutional neural networks (CNNs) have shown great success in single-label image classification, real-world images generally contain multiple labels, corresponding to different objects, scenes, actions and attributes in an image. Traditional approaches to multi-label image classification learn independent classifiers for each category and apply ranking or thresholding to the classification results. Although these techniques work well, they fail to explicitly exploit the label dependencies in an image. In this paper, we use recurrent neural networks (RNNs) to address this problem. Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label embedding that characterizes both semantic label dependency and image-label relevance, and it can be trained end-to-end from scratch to integrate both kinds of information in a unified framework. Experimental results on public benchmark datasets demonstrate that the proposed architecture outperforms state-of-the-art multi-label classification models.
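
The decoding loop can be sketched as follows; all matrices are random, untrained stand-ins for the learned CNN, RNN and joint embedding, and greedy decoding is an assumption made for brevity.

```python
import numpy as np

# Toy sketch of a CNN-RNN style multi-label decoding loop.
rng = np.random.default_rng(0)
D, L = 16, 5                            # embedding dim, number of labels
label_emb = rng.normal(size=(L, D))     # label vectors in the joint space
W_img = rng.normal(size=(32, D))        # projects a CNN feature into the space
W_rec = rng.normal(size=(D, D))         # toy recurrent transition

def predict_labels(cnn_feature, max_labels=3):
    # Greedily emit labels; the recurrent state carries label dependencies.
    state, emitted = np.zeros(D), []
    for _ in range(max_labels):
        joint = cnn_feature @ W_img + state            # relevance + label history
        scores = label_emb @ joint                     # dot products in joint space
        scores[emitted] = -np.inf                      # never repeat a label
        k = int(np.argmax(scores))
        emitted.append(k)
        state = np.tanh(state @ W_rec + label_emb[k])  # update dependency state
    return emitted

print(predict_labels(rng.normal(size=32)))             # three distinct label indices
```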


Video Paragraph Captioning using Hierarchical Recurrent Neural Networks
Haonan Yu, Jiang Wang, Yi Yang, Zhiheng Huang, Wei Xu
Conference on Computer Vision and Pattern Recognition, Oral (2016)

We present an approach that exploits hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video. Our hierarchical framework contains a sentence generator and a paragraph generator. The sentence generator produces one simple short sentence that describes a specific short video interval. It exploits both temporal- and spatial-attention mechanisms to selectively focus on visual elements during generation. The paragraph generator captures the inter-sentence dependency by taking as input the sentential embedding produced by the sentence generator, combining it with the paragraph history, and outputting the new initial state for the sentence generator. We evaluate our approach on two large-scale benchmark datasets: YouTubeClips and TACoS-MultiLevel. The experiments demonstrate that our approach significantly outperforms the current state-of-the-art methods, with BLEU@4 scores of 0.499 and 0.305, respectively.
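
The control flow between the two generators can be sketched in a few lines; the following toy version uses random weights and trivial transforms in place of the paper's trained, attention-equipped RNNs.

```python
import numpy as np

# Toy sketch of the sentence/paragraph generator hierarchy.
rng = np.random.default_rng(0)
D = 8
W = rng.normal(scale=0.5, size=(D, 2 * D))

def sentence_generator(init_state, interval_feat):
    # Stand-in for the attention-equipped sentence RNN; returns a
    # sentential embedding of what was just described.
    return np.tanh(init_state + interval_feat)

def paragraph_generator(history, sentence_emb):
    # Combines the sentential embedding with the paragraph history and
    # emits the sentence generator's next initial state.
    return np.tanh(W @ np.concatenate([history, sentence_emb]))

history, state = np.zeros(D), np.zeros(D)
for interval_feat in rng.normal(size=(3, D)):      # three video intervals
    sent = sentence_generator(state, interval_feat)
    state = paragraph_generator(history, sent)
    history = 0.5 * history + 0.5 * sent           # toy history update
print(state.round(2))
```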


Attention to Scale: Scale-aware Semantic Image Segmentation
Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille
Conference on Computer Vision and Pattern Recognition (2016)

Incorporating multi-scale features into fully convolutional neural networks (FCNs) has been a key element in achieving state-of-the-art performance on semantic image segmentation. One common way to extract multi-scale features is to feed multiple resized input images to a shared deep network and then merge the resulting features for pixel-wise classification. In this work, we propose an attention mechanism that learns to softly weight the multi-scale features at each pixel location. We adapt a state-of-the-art semantic image segmentation model, which we jointly train with multi-scale input images and the attention model. The proposed attention model not only outperforms average- and max-pooling, but also allows us to diagnostically visualize the importance of features at different positions and scales. Moreover, we show that adding extra supervision to the output at each scale is essential to achieving excellent performance when merging multi-scale features. We demonstrate the effectiveness of our model with extensive experiments on three challenging datasets: PASCAL-Person-Part, PASCAL VOC 2012 and a subset of MS-COCO 2014.
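
Stripped of the network details, the merge step the abstract describes reduces to a per-pixel softmax over the scale dimension. A minimal numpy sketch, where the tensor shapes are assumptions:

```python
import numpy as np

def merge_multiscale(score_maps, attention_logits):
    # score_maps:       (S, C, H, W) per-scale score maps, already resized
    # attention_logits: (S, H, W)    learned per-pixel, per-scale weights
    w = np.exp(attention_logits - attention_logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)             # softmax over scales
    return (score_maps * w[:, None]).sum(axis=0)  # soft weighting of the scales

rng = np.random.default_rng(0)
maps = rng.normal(size=(3, 21, 4, 4))        # 3 scales, 21 classes
logits = rng.normal(size=(3, 4, 4))
print(merge_multiscale(maps, logits).shape)  # (21, 4, 4)
```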


ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, Ram Nevatia (2015)

We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and a natural language question about the image, VQA generates a natural language answer to the question. Generating the correct answer requires the model's attention to focus on the regions relevant to the question, because different questions inquire about the attributes of different image regions. We introduce an attention-based configurable convolutional neural network (ABC-CNN) to learn such question-guided attention. ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and the VQA dataset. The ABC-CNN model achieves significant improvements over state-of-the-art methods on these datasets, and the question-guided attention it generates is shown to reflect the regions that are highly relevant to the questions.
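
The kernel-configuration step can be sketched as follows (requires scipy). This toy version uses a single-channel feature map and random weights; the paper operates on multi-channel CNN feature maps.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy sketch of question-configured convolution: the question embedding is
# mapped to a kernel, which is convolved with the image feature map to give
# a spatial attention map.
rng = np.random.default_rng(0)
Q, K = 12, 3                                    # question dim, kernel size
W_q = rng.normal(scale=0.1, size=(Q, K * K))    # question semantics -> kernel

def question_guided_attention(feature_map, question_emb):
    kernel = (question_emb @ W_q).reshape(K, K)
    att = convolve2d(feature_map, kernel, mode="same")
    att = np.exp(att - att.max())
    return att / att.sum()                      # softmax over spatial positions

fmap = rng.normal(size=(8, 8))
print(question_guided_attention(fmap, rng.normal(size=Q)).shape)  # (8, 8)
```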


Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition
Xiao Liu, Tian Xia, Jiang Wang, and Yuanqing Lin (2016)

Fine-grained recognition is challenging mainly because inter-class differences between fine-grained classes are usually local and subtle, while intra-class differences can be large due to pose variations. To distinguish the former from the latter, it is essential to zoom in on highly discriminative local regions. In this work, we introduce a reinforcement learning-based fully convolutional attention localization network to adaptively select multiple task-driven visual attention regions. We show that zooming in on the selected attention regions significantly improves the performance of fine-grained recognition. Compared to previous reinforcement learning-based models, the proposed approach is noticeably more computationally efficient during both training and testing because of its fully convolutional architecture, and it is capable of simultaneously focusing its glimpse on multiple visual attention regions. The experiments demonstrate that the proposed method achieves notably higher classification accuracy on three benchmark fine-grained recognition datasets: Stanford Dogs, Stanford Cars, and CUB-200-2011.


Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition
Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, and Yuanqing Lin (2016)

A key challenge in fine-grained recognition is how to find and represent discriminative local regions. Recent attention models can learn discriminative region localizers from category labels alone with reinforcement learning. However, without any explicit part information, they are unable to accurately find multiple distinctive regions. In this work, we introduce an attribute-guided attention localization scheme in which local region localizers are learned under the guidance of part attribute descriptions. By designing a novel reward strategy, we are able to learn, with a reinforcement learning algorithm, to locate regions that are spatially and semantically distinctive. The scheme's attribute labeling requirement is far less demanding than the accurate part-location annotation required by traditional part-based fine-grained recognition methods. Experimental results on the CUB-200-2011 dataset demonstrate the superiority of the proposed scheme on both fine-grained recognition and attribute recognition.


SWIFT: Compiled Inference for Probabilistic Programs
Yi Wu, Lei Li and Stuart J. Russell
Neural Information Processing Systems, Workshop on Black Box Learning and Inference (2015)

One long-term goal of research on probabilistic programming languages (PPLs) is efficient inference using a single, generic inference engine. Many current inference engines incur significant interpretation overhead. This paper describes a PPL compiler, Swift, that generates model-specific and inference-algorithm-specific target code, with highly optimized data structures, from a given probabilistic program in the BLOG language.


On Optimization Algorithms for Recurrent Networks with Long Short-Term Memory
Hieu Pham, Zihang Dai and Lei Li
Bay Area Machine Learning Symposium (2015)

We compare the performance of RNN-LSTMs trained with various optimization algorithms, such as SGD, AdaGrad and momentum, and we further propose a novel optimization technique that achieves better performance on the evaluated tasks.
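
For reference, the baseline update rules being compared are standard; the paper's proposed technique is not reproduced here. A minimal numpy sketch:

```python
import numpy as np

def sgd(theta, grad, lr=0.1):
    return theta - lr * grad

def momentum(theta, grad, velocity, lr=0.1, mu=0.9):
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

def adagrad(theta, grad, accum, lr=0.1, eps=1e-8):
    accum = accum + grad ** 2                    # per-parameter gradient history
    return theta - lr * grad / (np.sqrt(accum) + eps), accum

# One step on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
print(sgd(theta, theta))
print(momentum(theta, theta, np.zeros(2))[0])
print(adagrad(theta, theta, np.zeros(2))[0])
```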


Twisted Recurrent Network for Named Entity Recognition
Zefu Lu, Lei Li and Wei Xu 
Bay Area Machine Learning Symposium (2015)

Given a sequence of text tokens, a named entity recognizer (NER) identifies the spans of tokens that belong to predefined categories of entities, such as persons and organizations. The NER problem is formulated as producing a sequence of entity labels, one for every token in the sentence.
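
As a toy illustration of this formulation (the BIO scheme shown is a common labeling convention, not necessarily the paper's exact label set):

```python
# One entity label per token: "Barack Obama" is a person span, "Baidu" an organization.
tokens = ["Barack", "Obama", "visited", "Baidu"]
labels = ["B-PER", "I-PER", "O", "B-ORG"]
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```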


A Deep Visual Correspondence Embedding Model for Stereo Matching Costs
Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, Chang Huang
International Conference on Computer Vision (2015)

This paper presents a data-driven matching cost for stereo matching. A novel deep visual correspondence embedding model is trained via a Convolutional Neural Network on a large set of stereo images with ground-truth disparities.
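
The idea reduces to scoring patch pairs by the similarity of their learned embeddings; in the toy numpy sketch below, a random linear map stands in for the trained CNN.

```python
import numpy as np

# Toy sketch: the matching cost of a patch pair is a distance between their
# embeddings, which plugs into standard stereo cost aggregation.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(25, 8))      # 5x5 patch -> 8-d embedding

def embed(patch):
    return np.tanh(patch.ravel() @ W)

def matching_cost(left_patch, right_patch):
    # Lower cost means more similar patches.
    a, b = embed(left_patch), embed(right_patch)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

left = rng.normal(size=(5, 5))
print(matching_cost(left, left + 0.01 * rng.normal(size=(5, 5))))  # near 0
print(matching_cost(left, rng.normal(size=(5, 5))))                # larger
```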


End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks
Jie Zhou, Wei Xu
Annual Meeting of the Association for Computational Linguistics (2015)

In this work, we propose a deep bi-directional recurrent network as an end-to-end system for semantic role labeling (SRL). We take only the original text as input features, without using any syntactic knowledge. The proposed algorithm is evaluated mainly on the CoNLL-2005 shared task, where it achieves an F1 score of 81.07.


Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu
arXiv.org (2015)

In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word.


Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille
arXiv.org (2015)


Learning from Massive Noisy Labeled Data for Image Classification
Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691-2699 (2015)


Deep Multiple Instance Learning for Image Classification and Auto-Annotation
Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3460-3469 (2015)


Multi-Objective Convolutional Learning for Face Labeling
Sifei Liu, Jimei Yang, Chang Huang, and Ming-Hsuan Yang
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3451-3459 (2015)


Explain Images with Multimodal Recurrent Neural Networks
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille
arXiv preprint arXiv:1410.1090 (2014)


Depth-based Hand Pose Estimation: Methods, Data, and Challenges
James Steven Supancic III, Gregory Rogez, Yi Yang, Jamie Shotton, and Deva Ramanan
arXiv preprint arXiv:1504.06378 (2015)


Conditional Random Fields as Recurrent Neural Networks
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr
arXiv preprint arXiv:1502.03240 (2015)


DenseBox: Unifying Landmark Localization with End to End Object Detection
Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu
arXiv preprint arXiv:1509.04874 (2015)


Look and Think Twice: Capturing Top-down Visual Attention with Feedback Convolutional Neural Networks
Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, and Yongzhen Huang
Proceedings of the IEEE International Conference on Computer Vision, pp. 2956-2964 (2015)


Maxios: Large Scale Nonnegative Matrix Factorization for Collaborative Filtering
Simon Shaolei Du, Yilin Liu, Boyi Chen and Lei Li
Neural Information Processing Systems, Workshop on Distributed Machine Learning and Matrix Computations (2014)

We propose Maxios, a novel approach to filling in missing values of large-scale, highly sparse matrices efficiently and accurately. We formulate the matrix-completion problem as weighted nonnegative matrix factorization, and we develop distributed update rules using the alternating direction method of multipliers (ADMM).
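
The objective is weighted (mask-aware) nonnegative matrix factorization. As a single-machine illustration, the sketch below uses the classical multiplicative updates for that objective; the paper instead derives distributed ADMM updates.

```python
import numpy as np

# Minimizes || M * (V - W H) ||_F^2 with W, H >= 0, where the mask M marks
# the observed entries of V (1 = observed, 0 = missing).
rng = np.random.default_rng(0)

def weighted_nmf(V, M, rank=2, iters=200, eps=1e-9):
    m, n = V.shape
    W, H = rng.random((m, rank)), rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ (M * V)) / (W.T @ (M * (W @ H)) + eps)
        W *= ((M * V) @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H

V = rng.random((6, 5))
M = (rng.random(V.shape) > 0.3).astype(float)   # ~70% of entries observed
W, H = weighted_nmf(V, M)
print(np.abs(M * (V - W @ H)).mean())           # small residual on observed cells
```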


BFiT: From Possible-World Semantics to Random-Evaluation Semantics in Open Universe
Yi Wu, Lei Li and Stuart J. Russell
Neural Information Processing Systems, Workshop on Probabilistic Programming (2014)

In this paper, we explicitly analyze the equivalence between possible-world semantics and random-evaluation semantics in the context of open-universe probability models (OUPMs). We propose a novel dynamic memoization technique for constructing OUPMs using procedural instructions in random-evaluation-based PPLs.


Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille
arXiv.org (2014)


Bidirectional LSTM-CRF Models for Sequence Tagging
Zhiheng Huang, Wei Xu, Kai Yu
arXiv.org (2015)