Notes for "Query Regression Networks for Machine Comprehension" Paper

Query Regression Networks for Machine Comprehension

Introduction

  • Machine Comprehension (MC) - given a short natural language story (a sequence of sentences), answer a natural language question about it.
  • End-To-End MC - cannot use language resources like dependency parsers. The only supervision during training is the correct answer.
  • Query Regression Network (QRN) - Variant of Recurrent Neural Network (RNN).
  • Link to the paper

Related Work

  • Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices to model sequential data but perform poorly on end-to-end MC due to long-term dependencies.
  • Attention Models with shared external memory focus on single sentences in each layer but the models tend to be insensitive to the time step of the sentence being accessed.
  • Memory Networks (and MemN2N)
    • Add time-dependent variable to the sentence representation.
    • Summarize the memory in each layer to control attention in the next layer.
  • Dynamic Memory Networks (and DMN+)
    • Combine RNN and attention mechanism to incorporate time dependency.
    • Uses two GRUs:
      • time-axis GRU - Summarize the memory in each layer.
      • layer-axis GRU - Control the attention in each layer.
  • QRN is a much simpler model without any node that summarizes the memory.

QRN

  • Single recurrent unit that updates its internal state through time and layers.
  • Inputs
    • qt - local query vector
    • xt - sentence vector
  • Outputs
    • ht - reduced query vector
    • xt - sentence vector without any modifications
  • Equations (a code sketch follows this list)
    • zt = α(xt, qt)
    • α is the update gate function, measuring the relevance between the input sentence and the local query.
    • h̃t = γ(xt, qt)
    • γ is the regression function, transforming the local query into the regressed query.
    • ht = zt * h̃t + (1 − zt) * ht−1
  • To create a multi-layer model, the output of the current layer becomes the input to the next layer.
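
A minimal sketch of a single QRN layer in numpy, following the equations above. The concrete parameterisations of α (sigmoid over the concatenated sentence and query) and γ (tanh layer), and all weight names/shapes, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def qrn_layer(X, Q, Wz, bz, Wh, bh):
    """One QRN layer over a story of T sentences (scalar update gate).

    X : (T, d) sentence vectors x_t
    Q : (T, d) local query vectors q_t (for the first layer, every row is
        the question vector)
    Wz: (2d,), bz: scalar   -- update gate alpha (assumed form)
    Wh: (d, 2d), bh: (d,)   -- regression function gamma (assumed form)
    Returns H: (T, d), the reduced query h_t at every time step.
    """
    T, d = X.shape
    h_prev = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):
        xq = np.concatenate([X[t], Q[t]])               # (2d,)
        z_t = sigmoid(Wz @ xq + bz)                     # relevance of x_t to q_t
        h_tilde = np.tanh(Wh @ xq + bh)                 # candidate regressed query h~_t
        h_prev = z_t * h_tilde + (1.0 - z_t) * h_prev   # ht = zt*h~t + (1-zt)*ht-1
        H[t] = h_prev
    return H
```

To stack layers, the returned H of layer k is fed as Q of layer k+1, while X is passed through unchanged.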

Variants

  • Reset gate function (rt) to reset or nullify the regressed query h̃t (inspired by GRU; sketched in code after this list).
    • The update equation becomes ht = zt * rt * h̃t + (1 − zt) * ht−1.
  • Vector gates - update and reset gate functions can produce vectors instead of scalar values (for finer control).
  • Bidirectional - QRN can look at both past and future sentences while regressing the queries.
    • qt^(k+1) = ht^(k, forward) + ht^(k, backward), i.e. the next layer's local query is the sum of the forward and backward reduced queries.
    • The parameters of the update and regression functions are shared between the two directions.
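
A sketch of the reset-gate and bidirectional variants, reusing the qrn_layer conventions above; the reset-gate parameterisation (Wr, br) is again an illustrative assumption.

```python
def qrn_layer_reset(X, Q, Wz, bz, Wr, br, Wh, bh):
    """QRN layer with a GRU-style reset gate r_t that can nullify h~_t.
    Shapes mirror qrn_layer above; Wr: (2d,), br: scalar (assumed)."""
    T, d = X.shape
    h_prev = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):
        xq = np.concatenate([X[t], Q[t]])
        z_t = sigmoid(Wz @ xq + bz)        # update gate
        r_t = sigmoid(Wr @ xq + br)        # reset gate
        h_tilde = np.tanh(Wh @ xq + bh)
        # ht = zt * rt * h~t + (1 - zt) * ht-1
        h_prev = z_t * r_t * h_tilde + (1.0 - z_t) * h_prev
        H[t] = h_prev
    return H

def next_layer_queries(X, Q, params):
    """Bidirectional variant: qt^(k+1) = ht^(k, forward) + ht^(k, backward).
    The same parameters are used in both directions (shared weights)."""
    H_fwd = qrn_layer_reset(X, Q, *params)
    H_bwd = qrn_layer_reset(X[::-1], Q[::-1], *params)[::-1]
    return H_fwd + H_bwd
```

Vector gates would simply make z_t and r_t d-dimensional (e.g. by giving Wz and Wr shape (d, 2d)); the elementwise products above still apply.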

Parallelization

  • Unlike most RNN-based models, the recurrent updates in QRN can be computed in parallel across time, because the gates zt and candidate queries h̃t depend only on the inputs (xt, qt) and not on the previous state.
  • For details and equations, refer to the paper; a sketch of the idea is given below.
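
One way to see this: with h0 = 0, unrolling ht = zt * h̃t + (1 − zt) * ht−1 expresses ht as a weighted sum of the h̃j with weights built from cumulative products of (1 − z). Since all zt and h̃t can be computed independently, the whole layer reduces to cumulative sums/products. The sketch below is an illustration of that identity (scalar gates, no attention to numerical stability), not the paper's exact formulation.

```python
def qrn_parallel(Z, H_tilde):
    """Compute ht = zt*h~t + (1-zt)*ht-1 for all t without a sequential loop.

    Z       : (T,)   scalar update gates, precomputed from (xt, qt)
    H_tilde : (T, d) candidate regressed queries, precomputed from (xt, qt)
    Identity used (h0 = 0):
        ht = sum_{j<=t} zj * h~j * prod_{i=j+1..t} (1 - zi)
    """
    cp = np.concatenate([[1.0], np.cumprod(1.0 - Z)])   # cp[k] = prod_{i<k} (1 - z_i)
    scaled = (Z / cp[1:])[:, None] * H_tilde            # z_j * h~_j / cp[j+1]
    prefix = np.cumsum(scaled, axis=0)                  # running sum over j
    return cp[1:, None] * prefix                        # scale back by cp[t+1]
```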

Module Details

Input Modules

  • A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
  • A Position Encoder is used to combine the d-dimensional word vectors into the sentence representation (a sketch follows this list).
  • Question vectors are also obtained in a similar manner.
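
A sketch of the input encoding, assuming the position-encoding weights of End-to-End Memory Networks (Sukhbaatar et al.); the notes above only name a "Position Encoder", so this exact formula is an assumption.

```python
def position_encode(word_vectors):
    """Combine embedded words of one sentence into a single d-dim vector.

    word_vectors: (J, d) rows are A @ one_hot(word) for the J words.
    Assumed weighting: l[j, k] = (1 - (j+1)/J) - ((k+1)/d) * (1 - 2*(j+1)/J)
    """
    J, d = word_vectors.shape
    j = (np.arange(1, J + 1) / J)[:, None]     # (J, 1) relative word position
    k = (np.arange(1, d + 1) / d)[None, :]     # (1, d) relative dimension index
    l = (1.0 - j) - k * (1.0 - 2.0 * j)        # (J, d) position weights
    return (l * word_vectors).sum(axis=0)      # weighted sum over words
```

The question vector is obtained the same way from the question's word embeddings.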

Output Module

  • A V-way single-layer softmax classifier is used to map the predicted answer vector ŷ to a V-dimensional sparse vector v.
  • The natural language answer is the word with the maximum value in v (arg max), as sketched below.
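
A minimal sketch of the output module; the weight name W_out and the absence of a bias term are assumptions.

```python
def decode_answer(y_hat, W_out, vocab):
    """Map the predicted answer vector y_hat (d,) to a natural language word.

    W_out: (V, d) weights of the V-way single-layer softmax classifier.
    vocab: list of the V candidate answer words.
    """
    logits = W_out @ y_hat                   # (V,)
    v = np.exp(logits - logits.max())
    v = v / v.sum()                          # softmax over the vocabulary
    return vocab[int(np.argmax(v))]          # answer = arg max word in v
```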

Results

  • bAbI QA dataset used.
  • QRN with the '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and the '2rvb' model (2 layers + reset gate + vector gate + bidirectional) on the 10K dataset outperforms the MemN2N 1K and 10K models respectively.
  • Though DMN+ outperforms QRN by a small margin, QRNs are simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
  • With very few layers the model lacks reasoning ability, while with too many layers it becomes difficult to train.
  • Vector gates help on large datasets but hurt on small datasets.
  • Unidirectional models perform poorly.
  • The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.