Notes for "Query Regression Networks for Machine Comprehension" Paper

Query Regression Networks for Machine Comprehension


  • Machine Comprehension (MC) - given a natural language sentence, answer a natural language question.
  • End-To-End MC - cannot use language resources like dependency parsers. The only supervision during training is the correct answer.
  • Query Regression Network (QRN) - Variant of Recurrent Neural Network (RNN).

Related Work

  • Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices to model sequential data but perform poorly on end-to-end MC due to long-term dependencies.
  • Attention Models with shared external memory focus on single sentences in each layer but the models tend to be insensitive to the time step of the sentence being accessed.
  • Memory Networks (and MemN2N)
    • Add a time-dependent variable to the sentence representation.
    • Summarize the memory in each layer to control attention in the next layer.
  • Dynamic Memory Networks (and DMN+)
    • Combine RNN and attention mechanism to incorporate time dependency.
    • Uses two GRUs:
      • time-axis GRU - Summarize the memory in each layer.
      • layer-axis GRU - Control the attention in each layer.
  • QRN is a much simpler model without any memory-summarization node.


  • Single recurrent unit that updates its internal state through time and layers.
  • Inputs
    • q_t - local query vector
    • x_t - sentence vector
  • Outputs
    • h_t - reduced query vector
    • x_t - sentence vector without any modifications
  • Equations
    • z_t = α(x_t, q_t)
    • α is the update gate function to measure the relevance between input sentence and local query.
    • h̃_t = γ(x_t, q_t)
    • γ is the regression function to transform the local query into regressed query.
    • h_t = z_t * h̃_t + (1 - z_t) * h_{t-1}
  • To create a multi-layer model, the outputs of the current layer become the inputs to the next layer.
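
The update rule in the Equations bullets can be sketched in a few lines of plain Python. This is only an illustration: the bodies of alpha and gamma below are parameter-free stand-ins (the notes do not give their exact form), and the choices to feed the same question vector to every time step of the first layer and to read the answer from the last time step of the top layer are assumptions, not something these notes state.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def alpha(x, q):
    """Update gate (stand-in form): scalar relevance of sentence x to query q."""
    return sigmoid(sum(xi * qi for xi, qi in zip(x, q)))


def gamma(x, q):
    """Regression function (stand-in form): candidate regressed query."""
    return [math.tanh(xi + qi) for xi, qi in zip(x, q)]


def qrn_layer(xs, qs, h0=None):
    """Run one QRN layer over T time steps.

    xs, qs: lists of T vectors (lists of floats), all of dimension d.
    Returns (hs, xs): the reduced queries and the unmodified sentence vectors.
    """
    d = len(xs[0])
    h_prev = [0.0] * d if h0 is None else h0
    hs = []
    for x, q in zip(xs, qs):
        z = alpha(x, q)
        h_cand = gamma(x, q)
        # h_t = z_t * candidate + (1 - z_t) * h_{t-1}
        h = [z * hc + (1.0 - z) * hp for hc, hp in zip(h_cand, h_prev)]
        hs.append(h)
        h_prev = h
    return hs, xs


def qrn_stack(xs, question, num_layers=2):
    """Stack layers: each layer's reduced queries become the next layer's local queries."""
    qs = [question] * len(xs)  # assumption: first layer sees the question at every step
    for _ in range(num_layers):
        qs, xs = qrn_layer(xs, qs)
    return qs[-1]  # assumption: answer vector is the last reduced query of the top layer
```

Calling qrn_stack(xs, question, num_layers=2) on a list of sentence vectors returns one d-dimensional vector; the sentence vectors themselves pass through every layer unchanged.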


  • Reset gate function (r_t) to reset or nullify the regressed query h̃_t (inspired by GRU).
    • The new equation becomes h_t = z_t * r_t * h̃_t + (1 - z_t) * h_{t-1}
  • Vector gates - update and reset gate functions can produce vectors instead of scalar values (for finer control).
  • Bidirectional - QRN can look at both past and future sentences while regressing the queries.
    • q_t^{k+1} = h_t^{k, forward} + h_t^{k, backward}, where k is the layer index.
    • The parameters of the update and regression functions are shared between the two directions.
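
A rough sketch of how the reset gate and the bidirectional variant fit together is given below. The gate and candidate functions are hypothetical stand-ins with no trained parameters; the point is only the shape of the computation: the same functions are used in both directions, and the two directional outputs are summed to form the next layer's local queries.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Stand-in gate and candidate functions (hypothetical forms, shared by both directions).
def update_gate(x, q):
    return sigmoid(dot(x, q))


def reset_gate(x, q):
    return sigmoid(dot(x, q) - 1.0)


def candidate(x, q):
    return [math.tanh(a + b) for a, b in zip(x, q)]


def directional_pass(xs, qs, reverse=False):
    """One pass over time with update and reset gates, forward or backward."""
    T, d = len(xs), len(xs[0])
    steps = range(T - 1, -1, -1) if reverse else range(T)
    h_prev = [0.0] * d
    hs = [None] * T
    for t in steps:
        z = update_gate(xs[t], qs[t])
        r = reset_gate(xs[t], qs[t])
        h_cand = candidate(xs[t], qs[t])
        # h_t = z_t * r_t * candidate + (1 - z_t) * h_prev
        hs[t] = [z * r * c + (1.0 - z) * p for c, p in zip(h_cand, h_prev)]
        h_prev = hs[t]
    return hs


def bidirectional_layer(xs, qs):
    """Next-layer local queries: forward output plus backward output at each step."""
    fwd = directional_pass(xs, qs)
    bwd = directional_pass(xs, qs, reverse=True)
    return [[f + b for f, b in zip(hf, hb)] for hf, hb in zip(fwd, bwd)]
```

A vector-gate variant would have update_gate and reset_gate return one value per dimension and apply them elementwise; the scalar version is kept here for brevity.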


  • Unlike most RNN based models, recurrent updates in QRN can be computed in parallel across time.
  • For details and equations, refer to the paper.
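
One way to see why parallel computation is possible: the gate and the candidate at step t depend only on that step's sentence and local query, not on the previous hidden state, so the recurrence can be unrolled into a weighted sum. The sketch below checks this for a single dimension, assuming an initial state of zero. It illustrates the idea only and is not the formulation used in the paper.

```python
def sequential(zs, cands):
    """Apply h_t = z_t * cand_t + (1 - z_t) * h_{t-1} step by step, with h_0 = 0."""
    h, out = 0.0, []
    for z, c in zip(zs, cands):
        h = z * c + (1.0 - z) * h
        out.append(h)
    return out


def closed_form(zs, cands, t):
    """Compute h_t directly (0-based t) without using h_{t-1}.

    h_t = sum over i <= t of  z_i * prod_{i < j <= t} (1 - z_j) * cand_i
    """
    total = 0.0
    for i in range(t + 1):
        weight = zs[i]
        for j in range(i + 1, t + 1):
            weight *= 1.0 - zs[j]
        total += weight * cands[i]
    return total


zs = [0.5, 0.25, 0.8]
cands = [1.0, 2.0, -1.0]
# Each closed_form call is independent of the others, so all steps could run concurrently.
direct = [closed_form(zs, cands, t) for t in range(len(zs))]
```

Here direct matches sequential(zs, cands) up to floating-point rounding, which is the property that allows every time step to be evaluated at once.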

Module Details

Input Modules

  • A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
  • Position Encoder is used to obtain the sentence representation from the d-dimensional vectors.
  • Question vectors are also obtained in a similar manner.
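
A minimal sketch of the input module follows. The notes do not spell out the Position Encoder, so the weighting used here is an assumption: it follows the position encoding from End-To-End Memory Networks (MemN2N), where the weight for word j of J and dimension k of d is (1 - j/J) - (k/d)(1 - 2j/J) with 1-based indices. The function names and the tiny matrix A are illustrative only.

```python
def embed(word_ids, A):
    """Row lookup in A; equivalent to multiplying each one-hot word vector by A."""
    return [A[w] for w in word_ids]


def position_weights(J, d):
    """Assumed MemN2N-style weights l[j][k], stored 0-based, defined with 1-based j and k."""
    return [
        [(1 - j / J) - (k / d) * (1 - 2 * j / J) for k in range(1, d + 1)]
        for j in range(1, J + 1)
    ]


def encode_sentence(word_ids, A):
    """Sentence vector: position-weighted sum of the word embeddings."""
    vecs = embed(word_ids, A)
    J, d = len(vecs), len(vecs[0])
    l = position_weights(J, d)
    return [sum(l[j][k] * vecs[j][k] for j in range(J)) for k in range(d)]


# Toy embedding matrix: vocabulary of 3 words, d = 2 (values are made up).
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vec = encode_sentence([0, 2], A)
question_vec = encode_sentence([1], A)  # questions are encoded the same way
```

Unlike a plain bag-of-words sum, the position weights make the sentence vector depend on word order.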

Output Module

  • A V-way single-layer softmax classifier is used to map predicted answer vector y to a V-dimensional sparse vector v.
  • The natural language answer y is the arg max word in v.
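
The output step can be sketched as a single linear layer followed by a softmax and an arg max. The weight matrix W, the vocabulary, and the answer vector below are made-up values for illustration; in the model W would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_word(y, W, vocab):
    """Map a d-dimensional answer vector y to a word via a V-way softmax.

    W has one row of length d per vocabulary word (V rows in total).
    Returns the arg max word and the full V-dimensional distribution v.
    """
    scores = [sum(w_i * y_i for w_i, y_i in zip(row, y)) for row in W]
    v = softmax(scores)
    best = max(range(len(v)), key=v.__getitem__)
    return vocab[best], v


vocab = ["kitchen", "garden", "hallway"]       # hypothetical vocabulary, V = 3
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # hypothetical weights, d = 2
word, v = predict_word([0.2, 2.0], W, vocab)
```

With these toy values the highest score belongs to the second row of W, so the predicted word is "garden".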


  • The bAbI QA dataset is used for evaluation.
  • QRN on 1K dataset with '2rb' (2 layers + reset gate + bidirectional) model and on 10K dataset with '2rvb' (2 layers + reset gate + vector gate + bidirectional) outperforms MemN2N 1K and 10K models respectively.
  • Though DMN+ outperforms QRN by a small margin, QRN is simpler and faster to train (the paper comments on training speed without reporting the training time of the two models).
  • With very few layers, the model lacks reasoning ability while with too many layers, the model becomes difficult to train.
  • Using vector gates helps on large datasets but hurts on small datasets.
  • Unidirectional models perform poorly.
  • The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.
