@shagunsodhani
Created April 24, 2016 14:19
Notes for End-To-End Memory Networks paper

End-To-End Memory Networks

Introduction

  • A neural network with a recurrent attention model over a (possibly large) external memory.
  • A continuous form of the Memory Network, but trained end-to-end, so it requires much less supervision and can be applied to more domains.
  • Can also be seen as an extension of RNNSearch that performs multiple hops (computational steps) over the memory per output symbol.
  • Link to the paper.
  • Link to the implementation.

Approach

  • The model takes as input x1, ..., xn (to be stored in memory) and a query q, and outputs an answer a.

Single Layer

  • The input set (xi) is embedded in a d-dimensional space using embedding matrix A to obtain memory vectors (mi).
  • The query q is also embedded, using matrix B, to obtain the internal state u.
  • The match between u and each memory mi is computed in the embedding space, followed by a softmax, to obtain a probability vector p over the inputs: pi = Softmax(uT mi).
  • Each xi also maps to an output vector ci (using embedding matrix C).
  • The output o is the sum of the ci weighted by pi: o = sum(pi ci).
  • The sum of the output vector o and the internal state u is passed through weight matrix W followed by a softmax to produce the predicted answer: a = Softmax(W(o + u)).
  • A, B, C and W are learnt jointly by minimizing a cross-entropy loss. A minimal sketch of this forward pass is given below.
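A minimal NumPy sketch of the single-layer forward pass above. The matrix names (A, B, C, W) follow the notes; the vocabulary size, embedding dimension, number of memories, and the random bag-of-words inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, n = 50, 20, 10                    # vocab size, embedding dim, number of memories (assumed)
A = np.random.randn(d, V) * 0.1         # input (memory) embedding
B = np.random.randn(d, V) * 0.1         # query embedding
C = np.random.randn(d, V) * 0.1         # output embedding
W = np.random.randn(V, d) * 0.1         # answer prediction matrix

x = np.random.randint(0, 2, size=(n, V)).astype(float)  # inputs x1..xn as bag-of-words vectors
q = np.random.randint(0, 2, size=V).astype(float)       # query q as a bag-of-words vector

m = x @ A.T                      # memory vectors mi = A xi          (n, d)
c = x @ C.T                      # output vectors ci = C xi          (n, d)
u = B @ q                        # internal state u = B q            (d,)
p = softmax(m @ u)               # pi = Softmax(uT mi)               (n,)
o = p @ c                        # o = sum(pi ci)                    (d,)
a_hat = softmax(W @ (o + u))     # predicted answer distribution     (V,)
```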

Multiple Layers

  • For layers above the first, the input is uk+1 = uk + ok.
  • Each layer k has its own embedding matrices Ak and Ck, subject to the tying constraints described in the next section.
  • At the final layer K, the prediction is a = Softmax(W(oK + uK)). A multi-hop sketch follows below.
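A self-contained sketch of stacking K hops as described above, reusing the same shapes and softmax helper as the single-layer sketch. The per-hop matrices are kept untied here (tying schemes come next), and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, n, K = 50, 20, 10, 3                                # illustrative sizes; K hops
x = np.random.randint(0, 2, size=(n, V)).astype(float)    # inputs as bag-of-words vectors
q = np.random.randint(0, 2, size=V).astype(float)         # query
B = np.random.randn(d, V) * 0.1                           # query embedding
W = np.random.randn(V, d) * 0.1                           # answer prediction matrix
A_hops = [np.random.randn(d, V) * 0.1 for _ in range(K)]  # per-hop Ak
C_hops = [np.random.randn(d, V) * 0.1 for _ in range(K)]  # per-hop Ck

u = B @ q                            # u1 = B q
for k in range(K):
    m = x @ A_hops[k].T              # memories mi = Ak xi
    c = x @ C_hops[k].T              # output vectors ci = Ck xi
    p = softmax(m @ u)               # attention over memories at hop k
    o = p @ c                        # response ok
    u = u + o                        # uk+1 = uk + ok
a_hat = softmax(W @ u)               # equals Softmax(W(oK + uK))
```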

Constraints On Embedding Vectors

  • Adjacent

    • The output embedding of one layer is the input embedding of the next, i.e. Ak+1 = Ck.
    • WT = CK, i.e. the answer prediction matrix is tied to the final output embedding.
    • B = A1, i.e. the query embedding is tied to the input embedding of the first layer.
  • Layer-wise (RNN-like)

    • Same input and output embeddings across layers, i.e. A1 = A2 = ... = AK and C1 = C2 = ... = CK.
    • A linear mapping H is added to update of u between hops.
    • uk+1 = Huk + ok.
    • H is also learnt.
    • Think of this as a traditional RNN with two outputs:
      • Internal output - used for considering memories.
      • External output - the predicted result.
      • u becomes the hidden state.
      • p is the internal output which, combined with C, is used to update the hidden state.
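A small sketch of how the per-hop matrices could be constructed under the two tying schemes above. The function name and initialisation are hypothetical; only the shapes and the tying relations come from the notes.

```python
import numpy as np

def make_embeddings(scheme, K, d, V):
    """Build per-hop embedding lists for K hops under a given tying scheme (sketch)."""
    if scheme == "adjacent":
        # Ak+1 = Ck: K+1 distinct matrices, the first K used as A's, the last K as C's.
        # B = A1 and WT = CK then come from the same list.
        mats = [np.random.randn(d, V) * 0.1 for _ in range(K + 1)]
        A_hops, C_hops = mats[:K], mats[1:]
        B, W = A_hops[0], C_hops[-1].T
        H = None                                   # no extra mapping needed
    elif scheme == "layerwise":
        # A1 = ... = AK and C1 = ... = CK, plus a learned linear map H applied to u
        # between hops: uk+1 = H uk + ok.
        A = np.random.randn(d, V) * 0.1
        C = np.random.randn(d, V) * 0.1
        A_hops, C_hops = [A] * K, [C] * K
        B = np.random.randn(d, V) * 0.1
        W = np.random.randn(V, d) * 0.1
        H = np.random.randn(d, d) * 0.1
    else:
        raise ValueError("scheme must be 'adjacent' or 'layerwise'")
    return A_hops, C_hops, B, W, H
```

With the layer-wise scheme, the update inside the hop loop above becomes u = H @ u + o.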

Related Architectures

  • RNN - the memory is the state of the network, which is unstable and unusable over long temporal contexts.
  • LSTM - locks in the network state using local memory cells, but still fails over longer temporal contexts.
  • Memory Networks - use a global memory, but require supervision of the supporting memories during training.
  • Bidirectional RNN - uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemNN, performs only a single pass over the memory.

Sentence Representation for Question Answering Task

  • Bag-of-words representation

    • Input sentences and questions are embedded as a bag of words.
    • Cannot capture the order of the words.
  • Position Encoding

    • Takes the order of words within a sentence into account by weighting each word embedding by its position (see the position-encoding sketch after this list).
  • Temporal Encoding

    • Temporal information is encoded by a matrix TA and the memory vectors are modified as

    mi = sum_j(A xij) + TA(i), where TA(i) is the i-th row of TA.

  • Random Noise

    • Dummy Memories (empty memories) are added at training time to regularize TA.
  • Linear Start (LS) training

    • The softmax layers are removed at the start of training and re-inserted when the validation loss stops decreasing.
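A sketch of the position-encoding weights and the resulting memory vector (with the temporal term) described above; it follows the paper's formula l_kj = (1 - j/J) - (k/d)(1 - 2j/J), but the function and variable names here are illustrative.

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weights l of shape (d, J): l_kj = (1 - j/J) - (k/d)(1 - 2j/J)."""
    j = np.arange(1, J + 1)[None, :]          # word positions 1..J
    k = np.arange(1, d + 1)[:, None]          # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def memory_vector(words, A, TA_i):
    """mi = sum_j lj * (A xij) + TA(i) for one sentence given as a list of word vectors."""
    d, J = A.shape[0], len(words)
    l = position_encoding(J, d)                          # (d, J)
    embedded = np.stack([A @ w for w in words], axis=1)  # columns are A xij, shape (d, J)
    return (l * embedded).sum(axis=1) + TA_i             # position-weighted sum plus temporal term
```

With a plain bag-of-words representation, l would be all ones, recovering mi = sum_j(A xij) + TA(i).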

Observations

  • The best MemN2N models come close to the supervised MemNN models in performance.
  • Position Encoding improves over the bag-of-words representation.
  • Linear Start helps to avoid local minima.
  • Random Noise gives a small yet consistent boost in performance.
  • More computational hops lead to improved performance.
  • For the language modelling task, some hops concentrate on recent words while others have a broader attention span over all memory locations.