Globally Normalized Reader

We present the Globally Normalized Reader - an approach to extractive question answering with the same performance and significantly lower computational complexity than previous methods. Many popular models like Bidirectional Attention Flow \cite{seo2016bidirectional} use expensive attention mechanisms, and others like the Match-LSTM \cite{wang2016machine} explicitly score all possible answers. In constrast, the GNR casts question answering as a search problem and applies a learning to search framework to solve it. The resulting model is 24.7x faster than Bi-Directional Attention flow and is the 2nd highest ranked model on the Stanford Question Answering (SQAD) dev set \cite{rajpurkar2016squad}.

All state-of-the-art neural question answering systems suffer from overfitting. To help address these issues, we also present Type Swaps, the first successful data augmentation technique for neural question answering. Using augmented data from Type Swaps reduces the generalization error of our models and provides a 1% boost in EM on the SQuAD dev set.

Question Answering As Search

Suppose we want to answer the question "In what year did Nikola Tesla die?". We browse Wikipedia and find the article:

Nikola Tesla (Serbian Cyrillic: Никола Тесла; 10 July 1856 – 7 January 1943) was a Serbian American inventor, electrical engineer, mechanical engineer, physicist, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system.

Question answering (QA) and information extraction systems have proven to be invaluable in wide variety of applications such as medical information collection on drugs and genes \cite{quirk2016distant}, large scale health impact studies \cite{althoff2016influence}, or educational material development \cite{koedinger2015data}. Recent progress in neural-network based extractive question answering models are quickly closing the gap with human performance on several benchmark QA tasks such as SQuAD \cite{rajpurkar2016squad}, MS MARCO \cite{nguyen2016ms}, or NewsQA \cite{trischler2016newsqa}. However, current approaches to extractive question answering face several limitations:

  • Computation is allocated equally to the entire document, regardless of answer location, with no ability to ignore or focus computation on specific parts. This limits applicability to longer documents.
  • They rely extensively on expensive bi-directional attention mechanisms \cite{seo2016bidirectional} or must rank all possible answer spans \cite{lee2016learning}.
  • While data-augmentation for question answering have been proposed \cite{zhou2017neural}, current approaches still do not provide training data that can improve the performance of existing systems.

We propose instead to cast extractive QA as an iterative search problem: select the answer's sentence, start word, and end word. At each step, our choices prune the search space, and allow us to place computation only where it matters most: promising search paths.

We show that globally normalizing the decision process and back-propagating through beam search makes this representation viable and learning efficient. Our claims are empirically validated through experiments where we achieve the second highest single model performance on the Stanford Question Answering Dataset \cite{rajpurkar2016squad} (68.4 EM, 76.21 F1 dev) and is 24.7x faster than bi-attention-flow \cite{seo2016bidirectional}.

We also introduce a data-augmentation method to produce semantically valid examples by aligning named entities to a knowledge base and swapping them with new entities of the same type. This method improves the performance of all models considered in this work and is of independent interest for a variety of NLP tasks.

How does the Globally Normalized Reader read?

To describe our approach let us consider a simple example: "Who was first to recognize that the Analytical Engine had applications beyond pure calculation?". To answer this question we will read the following paragraph:

Ada Lovelace was known for her work on Charles Babbage's Analytical Engine. She was the first to recognize that the machine had applications beyond calculation. As a result, she is often regarded as the first to recognise the full potential of a "computing machine" and the first computer programmer.

Not every part of the document is relevant to the question. To reflect this, we can detect early on where the answer might show up. The Globally Normalized Reader translates this intuition by gradually selecting subportions of the document. We illustrate this below with vertical bars showing decision probabilities, and hovering over a node to highlight the portion of the document that is being considered.

The extractive question answering problem is to extract the death date, "7 January 1943", from the given passage. The Globally Normalized Reader casts question answering an search problem. First, find the sentence containing the correct answer. Then, find the starting word of the answer within the sentence. Finally, find the end word of the answer. This process is illustrated below.

Once the reader selects a relevant sentence in the document, it can now drill down further within that subportion of the document. In the diagram below we can see how it now focuses its attention on only one of the sentences, and then selects the right subset of words within the sentence:

There are may ways to parameterize a probability distribution over the sentence, start word, and end word choices. One of the key ingredients of the Globally Normalized Reader is parameterizing the distribution in a globally (as opposed to locally) normalized way.

In a globally normalized model, the distribution is normalized over all (sentence, start word, end word) tuples, whereas in a locally normalized model each choice of sentence, start word, and end word is separately normalized and multiplied together using the chain rule.

Global normalization makes the model strictly more expressive and allows it to more easily recover from search errors. In our work, we show globally normalizing the model provides a 1 % boost in EM and elevates it to near state of the art performance.

For more details on global normalization, refer to the paper or the excellent presentation in \cite{andor2016globally}.

Learning to Search

Although global normalized models have nice representational properties, they provide computational challenges. In particular, evaluating the probability of any particular (sentence, start word, end word) tuple requires an expensive summation over all such tuples to compute the normalization constant, i.e. a sum over a set of size # sentences * # starting words * # ending words. For a long document, this is prohibitive.

To overcome this challenge, we employ beam search. In particular, we approximate the summation over all tuples by only summing over the final beam candidates. This approach, also known as learning to search, requires us to backpropogate through beam search at training time.

At test time, the highest ranking candidate tuple is also obtained through beam search. This means the model only has to score O(beam size) candidate answer spans, rather than scoring all possible spans as is common in existing methods. This procedure reduces the discrepancy between how the model is trained and how it is evaluated, and it is key to providing the GNR's 20x speed improvements over existing methods.

Where has learning to search been successful?

Several approaches to learning to search have been proposed for various NLP tasks and conditional computation. Most recently, \cite{andor2016globally} and \cite{zhou2015neural} demonstrated the effectiveness of globally normalized networks and training with beam search for part of speech tagging and transition-based dependency parsing, while \citet{wiseman2016sequence} showed that these techniques could also be applied to sequence-to-sequence models in several application areas including machine translation. These works focus on parsing and sequence prediction tasks and have a fixed computation regardless of the search path, while we show that the same techniques can also be straightforwardly applied to question answering and extended to allow for conditional computation based on the search path.

Learning to search has also been used in context of modular neural networks with conditional computation in the work of \cite{andreas2016learning} for image captioning. In their work reinforcement learning was used to learn how to turn on and off computation, while we find that conditional computation can be easily learnt with maximum likelihood and the help of early updates \cite{andor2016globally,zhou2015neural,collins2004incremental} to guide the training process.

What's next?

We believe there a broad range of structured prediction problems (code generation, generative models for images, audio, or videos) where the size of original search space makes current techniques intractable, but if cast as learning-to-search problems with conditional computation, might be within reach.

How can we generate quasi-infinite data?

Virtually all of the state of the art neural approaches to question answering on SQuAD suffer from overfitting and must be heavily regularized to obtain good results. In other areas of machine learning, like image or speech recognition, researchers have been able to improve generalization via data augmentation. However, to date, no one has proposed a data augmentation strategy that improves performance in question answering. To address this issue, we present Type Swaps, a novel strategy to generate extremely large amounts of synthetic QA examples and validate that Type Swaps improves the performance of GNR.

Type Swaps works by identifying entities in the document and question, and then leverages WikiData to swap in new entities of the same type. Since Wikidata contains a very large number of entities, we can generate an astronomical number of new examples. Please refer to the figure for an example and the paper for more technical details.

We find augmenting the dataset with additional type-sensitive synthetic examples improves performance for all of the models we studied and increases the best GNR model by almost 2 % EM. Since this source of improvement is not tied to our architecture choices, these benefits are expected to carry over to different models \cite{seo2016bidirectional,weissenborn2017fastqa,xiong2016dynamic,rnet}, and perhaps more broadly in other natural language tasks that contain named entities and have limited supervised data.

Type Swaps, our data augmentation strategy, offers a way to incorporate the nature of the question and the types of named entities in the answers into the learning process of our model and reduce sensitivity to surface variation. Existing neural-network approaches to extractive QA have so far ignored this information. Augmenting the dataset with additional type-sensitive synthetic examples improves performance by providing better coverage of different answer types. Growing the number of augmented samples used improves the performance of all models under study.

Past a certain amount of augmentation, we observe performance degradation. This suggests that despite efforts to closely mimic the original training set, there is a train-test mismatch or excess duplication in the generated examples.


To provide a better sense for the model's behavior, we include below a variety of other example questions, documents, and search trees:


We would like to thank the anonymous EMNLP reviewers for their valuable feedback. In addition, we thank Adam Coates, Carl Case, Andrew Gibiansky, and Szymon Sidor for thoughtful comments and fruitful discussion. We also thank James Bradbury and Bryan McCann for running Type Swap experiments on the DCN.

If you use the dataset/code in your research, please cite our paper:

  title={Globally Normalized Reader},
  author={Raiman, Jonathan and Miller, John},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},


2018-01-17T16:20:53+00:00 September 26th, 2017|

Share This Story, Choose Your Platform!