Baidu at AAAI 2021: Multimodal Transformer, Supervised Quantum Learning, Early Detection of COVID-19 High-Risk Areas

2021-02-06




The 35th annual AAAI Conference on Artificial Intelligence is underway from February 2 to February 9 as a fully virtual meeting due to the impact of COVID-19. This year, AAAI received a record-high 9,034 submissions, of which only 1,692 papers were accepted.


Baidu is presenting 24 research papers on a wide range of AI fields, from multimodal models to quantum machine learning to COVID-19 research. You can find more technical details on the most notable papers below.

 

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graph

Paper: https://arxiv.org/abs/2006.16934 


We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL constructs the detailed semantic connections (objects, attributes of objects, and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can model the joint representations characterizing the alignments of the detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. After fine-tuning, it achieves state-of-the-art performance on five vision-language downstream tasks. Furthermore, ERNIE-ViL ranked first on the VCR leaderboard with an absolute improvement of 3.7%.
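To make the Scene Graph Prediction tasks concrete, below is a minimal Python sketch of how masked-node prediction targets could be derived from a caption's parsed scene graph. The toy parser output, token positions, and masking rate are illustrative assumptions, not ERNIE-ViL's actual pre-training pipeline.

```python
# Build Scene Graph Prediction targets from a parsed caption (toy example).
import random

random.seed(0)

caption = "a brown dog chases a small cat"
tokens = caption.split()

# Assume an off-the-shelf scene graph parser produced these (token, position) nodes:
scene_graph = {
    "objects":    [("dog", 2), ("cat", 6)],
    "attributes": [("brown", 1), ("small", 5)],
    "relations":  [("chases", 3)],
}

def build_sgp_targets(tokens, scene_graph, mask_rate=0.3, mask_token="[MASK]"):
    """Mask scene-graph nodes in the token sequence; the model must recover
    them from the remaining text and the paired image."""
    masked = list(tokens)
    targets = {}  # position -> (node type, original token)
    for node_type, nodes in scene_graph.items():
        for token, pos in nodes:
            if random.random() < mask_rate:
                masked[pos] = mask_token
                targets[pos] = (node_type, token)
    return masked, targets

masked_tokens, targets = build_sgp_targets(tokens, scene_graph)
print(masked_tokens)  # e.g. ['a', 'brown', '[MASK]', 'chases', 'a', 'small', 'cat']
print(targets)        # e.g. {2: ('objects', 'dog')}
```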




Entity Structure Within and Throughout: Modeling Mention Dependencies for Document-Level Relation Extraction


Entities, as the essential elements in relation extraction tasks, exhibit certain structure. In this work, we formulate such entity structure as distinctive dependencies between mention pairs. We then propose SSAN, which incorporates these structural dependencies within the standard self-attention mechanism and throughout the overall encoding stage. Specifically, we design two alternative transformation modules inside each self-attention building block to produce attentive biases so as to adaptively regularize its attention flow. Our experiments demonstrate the usefulness of the proposed entity structure and the effectiveness of SSAN. It significantly outperforms competitive baselines, achieving new state-of-the-art results on three popular document-level relation extraction datasets. We further provide ablation studies and visualizations to show how the entity structure guides the model toward better relation extraction. Our code will be publicly released soon.
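As a rough illustration of the core mechanism, the numpy sketch below adds a dependency-type-indexed bias to the raw attention logits before the softmax. The per-type scalar bias is a simplification of the paper's transformation modules, and the dependency types here are made up for the example.

```python
# Structured self-attention: bias the attention logits by entity structure.
import numpy as np

def structured_attention(Q, K, V, structure, bias_table):
    """
    Q, K, V:    (seq_len, d) query/key/value matrices
    structure:  (seq_len, seq_len) ints; structure[i, j] is the dependency
                type between tokens i and j (e.g. 0 = none, 1 = same entity,
                2 = co-occurring mentions)
    bias_table: (num_dep_types,) learnable scalar bias per dependency type
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)             # standard scaled dot product
    logits = logits + bias_table[structure]   # structure-aware adjustment
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(6, 16))
structure = rng.integers(0, 3, size=(6, 6))
bias_table = rng.normal(size=3)
print(structured_attention(Q, K, V, structure, bias_table).shape)  # (6, 16)
```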




MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Paper: https://arxiv.org/abs/2012.06977


Conventionally, spatiotemporal modeling and its computational complexity are two of the most concentrated research topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on top of 2D CNN backbones, with model complexity kept well in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolutions for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that specializes into existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance with the complexity of a 2D CNN.
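The sketch below shows one plausible PyTorch realization of the multi-view idea: depthwise (separable) convolutions over the H-W, H-T, and W-T views of an (N, C, T, H, W) feature map, summed back into the input as a residual. The kernel orientations and the residual-sum fusion are assumptions for illustration, not the MVF module's exact design.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Fuse H-W, H-T and W-T views with cheap depthwise 3D convolutions."""
    def __init__(self, channels):
        super().__init__()
        # groups=channels makes each conv depthwise, keeping FLOPs low
        self.hw = nn.Conv3d(channels, channels, (1, 3, 3),
                            padding=(0, 1, 1), groups=channels)
        self.ht = nn.Conv3d(channels, channels, (3, 3, 1),
                            padding=(1, 1, 0), groups=channels)
        self.wt = nn.Conv3d(channels, channels, (3, 1, 3),
                            padding=(1, 0, 1), groups=channels)

    def forward(self, x):                  # x: (N, C, T, H, W)
        return x + self.hw(x) + self.ht(x) + self.wt(x)

x = torch.randn(2, 8, 4, 14, 14)           # toy clip features
print(MultiViewFusion(8)(x).shape)         # torch.Size([2, 8, 4, 14, 14])
```

Because the block preserves the input shape, it can be dropped between stages of any off-the-shelf 2D-CNN-based video model, which matches the plug-and-play claim above.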



 

PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network

 

The reading of arbitrarily-shaped text has received increasing research attention, but existing text spotters are mostly built on two-stage frameworks or character-based methods, which suffer from either Non-Maximum Suppression (NMS) and Region-of-Interest (RoI) operations or character-level annotations. In this paper, to address the above problems, we propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real time. PGNet is a single-shot text spotter, where the pixel-level character classification map is learned with the proposed PG-CTC loss, avoiding the use of character-level annotations. With the PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS or RoI operations involved, which guarantees high efficiency. Additionally, a graph refinement module (GRM) is proposed to reason about the relations between each character and its neighbors, optimizing the coarse recognition and further improving end-to-end performance. Experiments demonstrate that the proposed method achieves state-of-the-art or competitive accuracy while significantly improving running speed. In particular, on Total-Text, it runs at 46.7 FPS, surpassing previous spotters by a large margin.
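The decoding step is easy to picture: character classification vectors are gathered at points along a predicted text center line and collapsed with standard greedy CTC. The sketch below assumes the center-line points and the per-pixel class map are already available; the PG-CTC training loss itself is not shown.

```python
import numpy as np

def gather_and_ctc_decode(char_map, center_points, blank=0):
    """
    char_map:      (H, W, num_classes) per-pixel character probabilities
    center_points: ordered (row, col) points along the text center line
    """
    seq = [int(np.argmax(char_map[r, c])) for r, c in center_points]
    decoded, prev = [], None
    for s in seq:                 # greedy CTC: merge repeats, drop blanks
        if s != prev and s != blank:
            decoded.append(s)
        prev = s
    return decoded

rng = np.random.default_rng(1)
char_map = rng.random((32, 100, 5))               # 4 characters + blank
center = [(16, c) for c in range(0, 100, 10)]     # toy center line
print(gather_and_ctc_decode(char_map, center))    # decoded class indices
```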




VSQL: Variational Shadow Quantum Learning for Classification

Paper: https://arxiv.org/abs/2012.08288


Classification of quantum data is essential for quantum machine learning and near-term quantum technologies. In this paper, we propose a new hybrid quantum-classical framework for supervised quantum learning, which we call Variational Shadow Quantum Learning (VSQL). Our method utilizes the classical shadows of quantum data, which fundamentally represent the side information of quantum data with respect to certain physical observables. Specifically, we first use variational shadow quantum circuits to extract classical features in a convolution-like manner and then utilize a fully-connected neural network to complete the classification task. We show that this method can sharply reduce the number of parameters and thus better facilitate quantum circuit training. Simultaneously, less noise is introduced, since fewer quantum gates are employed in such shadow circuits. Moreover, we show that the barren plateau issue, a significant gradient-vanishing problem in quantum machine learning, can be avoided in VSQL. Finally, we demonstrate the efficiency of VSQL in quantum classification via numerical experiments on the classification of quantum states and the recognition of multi-labeled handwritten digits. In particular, our VSQL approach outperforms existing variational quantum classifiers in test accuracy in the binary case of handwritten digit recognition, while requiring far fewer parameters.
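A heavily simplified classical simulation of the sliding-window idea is given below: a small parameterized "shadow" circuit acts on each pair of adjacent qubits, a local observable is measured, and the collected expectation values become features for a classical classifier. The 2-qubit ansatz and the X⊗X observable are illustrative assumptions, not the paper's exact circuit.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def shadow_circuit(theta):
    # Toy 2-qubit ansatz: an RY rotation on each qubit, then a CNOT
    return CNOT @ np.kron(ry(theta[0]), ry(theta[1]))

def apply_on_pair(state, U, i, n):
    """Apply a 2-qubit unitary U on adjacent qubits (i, i+1)."""
    psi = np.moveaxis(state.reshape([2] * n), [i, i + 1], [0, 1]).reshape(4, -1)
    psi = (U @ psi).reshape([2, 2] + [2] * (n - 2))
    return np.moveaxis(psi, [0, 1], [i, i + 1]).ravel()

def expect_xx(state, i, n):
    """<psi| X(x)X |psi> measured on qubits (i, i+1)."""
    psi = np.moveaxis(state.reshape([2] * n), [i, i + 1], [0, 1]).reshape(4, -1)
    return float(np.real(np.vdot(psi, np.kron(X, X) @ psi)))

n = 4
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                                # |0000>
theta = np.array([0.3, 1.2])                  # parameters shared across windows
features = [expect_xx(apply_on_pair(state, shadow_circuit(theta), i, n), i, n)
            for i in range(n - 1)]
print(features)  # local shadow features, fed to a fully-connected classifier
```

Note how only two qubits are acted on at a time, so the number of circuit parameters stays constant as the system grows; this is the intuition behind both the parameter savings and the barren plateau avoidance mentioned above.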




C-Watcher: A Framework for Early Detection of High-Risk Neighborhoods Ahead of COVID-19 Outbreak

Paper: https://arxiv.org/abs/2012.12169


The novel coronavirus disease (COVID-19) has crushed daily routines and is still rampaging through the world. Existing solutions for nonpharmaceutical interventions usually need to select, in a timely and precise manner, a subset of residential urban areas for containment or even quarantine, where the spatial distribution of confirmed cases is considered a key criterion for the selection. While such containment measures have successfully stopped or slowed down the spread of COVID-19 in some countries, they are criticized for being inefficient or ineffective, as the statistics of confirmed cases are usually time-delayed and coarse-grained. To tackle these issues, we propose C-Watcher, a novel data-driven framework that aims at screening every neighborhood in a target city and predicting infection risks prior to the spread of COVID-19 from epicenters to the city. In terms of design, C-Watcher collects large-scale, long-term human mobility data from Baidu Maps, then characterizes every residential neighborhood in the city using a set of features based on urban mobility patterns. Furthermore, to transfer firsthand knowledge (witnessed in epicenters) to the target city before local outbreaks, we adopt a novel adversarial encoder framework to learn "city-invariant" representations from the mobility-related features for precise early detection of high-risk neighborhoods in the target city, even before any confirmed cases are known. We carried out extensive experiments on C-Watcher using real-world records from the early stage of COVID-19 outbreaks, and the results demonstrate the efficiency and effectiveness of C-Watcher for early detection of high-risk neighborhoods across a large number of cities.
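A minimal PyTorch sketch of one standard way to learn such "city-invariant" representations is shown below: a gradient-reversal layer lets a city discriminator train normally while pushing the encoder to erase city identity from the features. The network sizes, the risk target, and the use of gradient reversal are assumptions for illustration; the paper's exact adversarial setup may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad                      # flip gradients into the encoder

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
risk_head = nn.Linear(16, 1)              # predicts neighborhood infection risk
city_head = nn.Linear(16, 5)              # tries to identify the source city

feats = torch.randn(128, 32)              # mobility features per neighborhood
risk = torch.rand(128, 1)                 # toy risk labels (from epicenters)
city = torch.randint(0, 5, (128,))        # toy city labels

z = encoder(feats)
loss = F.mse_loss(risk_head(z), risk) \
     + F.cross_entropy(city_head(GradReverse.apply(z)), city)
loss.backward()                           # one adversarial training step
```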



Community-Aware Multi-Task Transportation Demand Prediction

 

Transportation demand prediction is of great importance to urban governance and has become an essential function in many online applications. While many efforts have been made on regional transportation demand prediction, predicting the diversified transportation demand of different communities (e.g., the aged, the juveniles) remains an unexplored problem. This task is challenging because of the joint influence of spatio-temporal correlation among regions and implicit correlation among different communities. To this end, we propose the Multi-task Spatio-Temporal Network with Mutually-supervised Adaptive task grouping (Ada-MSTNet) for community-aware transportation demand prediction. Specifically, we first construct a sequence of multi-view graphs from both spatial and community perspectives, and devise a spatio-temporal neural network to simultaneously capture the sophisticated correlations among regions and among communities. Then, we propose an adaptively clustered multi-task learning module, where the prediction of each region-community-specific transportation demand is regarded as a distinct task. Moreover, a mutually supervised adaptive task grouping strategy is introduced to softly cluster each task into different task groups, by leveraging the supervision signal from the other graph view. In this way, Ada-MSTNet is not only able to share common knowledge among highly related communities and regions, but also to shield against noise from unrelated tasks in an end-to-end fashion. Finally, extensive experiments on two real-world datasets demonstrate the effectiveness of our approach compared with seven baselines.
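To illustrate the soft task grouping, the sketch below gives each (region, community) task a learnable assignment over K groups and mixes group-specific prediction heads by the softmax of that assignment. The mutually supervised part, where one graph view supervises the other view's grouping, is omitted; all sizes are toy assumptions.

```python
import torch
import torch.nn as nn

num_tasks, num_groups, dim = 12, 3, 16      # 12 region-community tasks
assign_logits = nn.Parameter(torch.zeros(num_tasks, num_groups))
group_heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_groups)])

def predict(task_feats):                    # task_feats: (num_tasks, dim)
    w = assign_logits.softmax(dim=-1)       # soft group membership per task
    per_group = torch.stack([h(task_feats) for h in group_heads], dim=1)
    return (w.unsqueeze(-1) * per_group).sum(dim=1)   # (num_tasks, 1)

print(predict(torch.randn(num_tasks, dim)).shape)     # torch.Size([12, 1])
```

Soft (rather than hard) membership is what lets related tasks share a head's knowledge while keeping unrelated tasks' gradients from interfering.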



Out-of-Town Recommendation with Travel Intention Modeling


Out-of-town recommendation is designed for users who leave their home-town areas and visit areas they have never been to before. It is challenging to recommend Points-of-Interest (POIs) to out-of-town users, since out-of-town check-in behavior is determined not only by the user's home-town preference but also by the user's travel intention. Besides, users' travel intentions are complex and dynamic, which makes it difficult to understand such intentions precisely. In this paper, we propose a TRAvel-INtention-aware Out-of-town Recommendation framework, named TRAINOR. The proposed TRAINOR framework distinguishes itself from existing out-of-town recommenders in three aspects. First, graph neural networks are explored to represent users' home-town check-in preference and geographical constraints in out-of-town check-in behaviors. Second, a user-specific travel intention is formulated as an aggregation combining home-town preference and generic travel intention, where the generic travel intention is regarded as a mixture of inherent intentions that can be learned by a Neural Topic Model (NTM). Third, a non-linear mapping function and a matrix factorization method are employed to transfer users' home-town preference and estimate out-of-town POI representations, respectively. Extensive experiments on real-world datasets validate the effectiveness of the TRAINOR framework. Moreover, the learned travel intention can deliver meaningful explanations for understanding a user's travel purposes.
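The second point, forming a user-specific travel intention, can be sketched as below: a generic intention is a mixture over learned topic embeddings (standing in for the NTM output), and it is aggregated with the home-town preference through a gated non-linear layer. The gating form and all dimensions are assumptions for illustration, not the paper's exact aggregation.

```python
import torch
import torch.nn as nn

dim, num_topics = 16, 4
topic_emb = nn.Parameter(torch.randn(num_topics, dim))  # inherent intentions
gate = nn.Linear(2 * dim, dim)

def travel_intention(home_pref, topic_mix):
    """home_pref: (dim,) user embedding from the GNN;
    topic_mix: (num_topics,) mixture weights from the NTM."""
    generic = topic_mix @ topic_emb          # mixture of inherent intentions
    return torch.tanh(gate(torch.cat([home_pref, generic])))

out = travel_intention(torch.randn(dim),
                       torch.softmax(torch.randn(num_topics), dim=0))
print(out.shape)  # torch.Size([16])
```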



A Blind Block Term Decomposition of Higher Order Tensors


Tensor decompositions have found many applications in signal processing, data mining, machine learning, etc. In particular, the block term decomposition (BTD), which is a generalization of the CP decomposition and the Tucker decomposition/HOSVD, has been successfully used for the compression and acceleration of neural networks. However, computing the BTD is NP-hard, and optimization-based methods usually suffer from slow convergence or even fail to converge, which limits the applications of BTD. This paper considers a “blind” block term decomposition (BBTD) of high-order tensors, in which the block diagonal structure of the core tensor is unknown. Our contributions include: 1) We establish necessary and sufficient conditions for the existence of a BTD, characterize the condition under which a BTD solves the BBTD problem, and show that the BBTD is unique under a “low rank” assumption. 2) We propose an algebraic method to compute the BBTD. This method transforms the problem of determining the block diagonal structure of the core tensor into a clustering problem over complex numbers, solvable in polynomial time; once the clustering problem is solved, the BBTD can be obtained by computing several matrix decompositions. Numerical results show that our method is able to compute the BBTD, even in the presence of a moderate amount of noise, whereas optimization-based methods (e.g., MINF and NLS in TENSORLAB) may fail to converge.
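For reference, the standard third-order block term decomposition that the blind variant builds on can be written as follows; the notation is the common one from the tensor literature and is our gloss, not necessarily the paper's.

```latex
\mathcal{T} \;=\; \sum_{r=1}^{R} \mathcal{G}_r \times_1 A_r \times_2 B_r \times_3 C_r,
\qquad
\mathcal{G}_r \in \mathbb{C}^{L_r \times M_r \times N_r},\quad
A_r \in \mathbb{C}^{I \times L_r},\;
B_r \in \mathbb{C}^{J \times M_r},\;
C_r \in \mathbb{C}^{K \times N_r}.
```

Stacking the R small cores along the diagonal of a single core tensor recovers a Tucker-like form with a block diagonal core; the “blind” problem is precisely that this block structure (the number and sizes of the blocks) is not given in advance.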



FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion

Paper: https://arxiv.org/abs/2012.08270


Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate depth completion as a one-stage end-to-end learning task that outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is obtained using a residual learning strategy in the coarse-to-fine stage, with the coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from the color image and the coarse depth map, and an energy-based fusion operation is exploited to effectively fuse the features obtained by the channel shuffle operation, leading to more accurate and refined depth maps. We achieve state-of-the-art RMSE performance on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.
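The channel shuffle extraction step can be pictured with the short sketch below: color-branch and depth-branch feature maps are concatenated, split into groups, and shuffled so that every subsequent group sees channels from both modalities. The group count and surrounding layers are illustrative assumptions.

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channel groups, as popularized by ShuffleNet."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w) \
            .transpose(1, 2).reshape(n, c, h, w)

rgb_feat = torch.randn(1, 32, 64, 64)      # features from the color image
depth_feat = torch.randn(1, 32, 64, 64)    # features from the coarse depth map
mixed = channel_shuffle(torch.cat([rgb_feat, depth_feat], dim=1), groups=2)
print(mixed.shape)                          # torch.Size([1, 64, 64, 64])
```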



Modeling the Probabilistic Distribution of Unlabeled Data for One-shot Medical Image Segmentation


Existing image segmentation networks mainly leverage large-scale labeled datasets to attain high accuracy. However, labeling medical images is very expensive, since it requires sophisticated expert knowledge. Thus, it is more desirable to employ only a few labeled samples in pursuit of high segmentation performance. In this paper, we develop a data augmentation method for one-shot brain magnetic resonance imaging (MRI) segmentation which exploits only one labeled MRI image (called the atlas) and a few unlabeled images. In particular, we propose to learn the probability distributions of deformations (including shapes and intensities) of different unlabeled MRI images with respect to the atlas via 3D variational autoencoders (VAEs). In this manner, our method is able to exploit the learned distributions of image deformations to generate new authentic brain MRI images, and the number of generated samples is sufficient to train a deep segmentation network. Furthermore, we introduce a new segmentation benchmark to evaluate the generalization performance of a segmentation network through a cross-dataset setting (collected from different sources). Extensive experiments demonstrate that our method outperforms state-of-the-art one-shot medical segmentation methods. Our code has been released at https://github.com/dyh127/Modeling-the-Probabilistic-Distribution-of-Unlabeled-Data.
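The augmentation loop can be sketched as follows: the decoders of the learned VAEs are sampled for a spatial deformation and an intensity change, both are applied to the atlas, and the atlas labels follow the spatial warp, yielding a new labeled pair for free. The 2D toy shapes, linear decoders, and deformation scale below are schematic assumptions standing in for the paper's 3D VAEs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent, hw = 8, 32                             # toy 2D stand-in for 3D MRI
spatial_dec = nn.Linear(latent, 2 * hw * hw)    # latent -> dense flow field
intensity_dec = nn.Linear(latent, hw * hw)      # latent -> intensity change

def identity_grid():
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, hw),
                            torch.linspace(-1, 1, hw), indexing='ij')
    return torch.stack([xs, ys], dim=-1).unsqueeze(0)     # (1, hw, hw, 2)

def sample_training_pair(atlas_img, atlas_seg):
    """atlas_img, atlas_seg: (1, 1, hw, hw). Returns a synthesized pair."""
    z_spatial, z_intensity = torch.randn(latent), torch.randn(latent)
    flow = 0.05 * spatial_dec(z_spatial).view(1, hw, hw, 2)  # sampled warp
    grid = identity_grid() + flow
    new_img = F.grid_sample(atlas_img, grid, align_corners=False)
    new_img = new_img + intensity_dec(z_intensity).view(1, 1, hw, hw)
    new_seg = F.grid_sample(atlas_seg, grid, mode='nearest',
                            align_corners=False)   # labels move with the warp
    return new_img, new_seg

img, seg = sample_training_pair(torch.randn(1, 1, hw, hw),
                                torch.randint(0, 4, (1, 1, hw, hw)).float())
print(img.shape, seg.shape)    # torch.Size([1, 1, 32, 32]) for both
```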



TRQ: Ternary Neural Networks With Residual Quantization


Ternary neural networks (TNNs) hold great potential for network acceleration by reducing the full-precision weights in a network to ternary ones, e.g., {−1, 0, 1}. However, existing TNNs are mostly built on rule-of-thumb quantization methods using simple thresholding operations, which causes a significant accuracy loss. In this paper, we introduce a stem-residual framework, termed Ternary Residual Quantization (TRQ), which provides a new insight into ternary quantization and achieves more powerful TNNs. Rather than applying thresholding operations directly, TRQ recursively performs quantization on full-precision weights for a refined reconstruction, combining a binarized stem part with a binarized residual part. With such a quantization process, TRQ endows the quantizer with high flexibility and precision. Furthermore, TRQ is generic and can easily be extended to multiple bits through recursively encoded residuals for better recognition accuracy. Extensive experimental results demonstrate that the proposed method yields high recognition accuracy while enabling significant acceleration.
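The stem-residual view is simple enough to sketch in a few lines: binarize the weights once (the stem), binarize the reconstruction residual with the same scale, and sum; with a shared scale the result takes at most three values, i.e., a ternary weight. The mean-|w| scale below is a common binarization heuristic used here as an assumption, not necessarily TRQ's exact estimator.

```python
import numpy as np

def ternary_residual_quantize(w):
    alpha = np.abs(w).mean()                # shared scale for both parts
    stem = alpha * np.sign(w)               # first binarization (stem)
    residual = alpha * np.sign(w - stem)    # binarize what the stem missed
    return stem + residual                  # values in {-2a, 0, 2a}

w = np.random.randn(8)
q = ternary_residual_quantize(w)
print(np.unique(np.round(q, 6)))            # at most three distinct levels
```

Extending to more bits, as described above, amounts to repeating the residual step: quantize the new residual again with its own scale and add it in.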

