Notes for End-To-End Memory Networks paper

End-To-End Memory Networks


  • Neural Network with a recurrent attention model over a large external memory.
  • Continuous form of the Memory Network, but trained end-to-end, so it can be applied to more domains.
  • Extends RNNSearch to perform multiple hops (computational steps) over the memory per output symbol.
  • Link to the paper.
  • Link to the implementation.


  • The model takes inputs x_1, ..., x_n (to be stored in memory) and a query q, and outputs an answer a.

Single Layer

  • Each input x_i is embedded in a D-dimensional space using embedding matrix A to obtain a memory vector m_i.
  • The query is embedded using matrix B to obtain the internal state u.
  • The match between u and each memory m_i is computed as an inner product in the embedding space, followed by a softmax, giving a probability vector p over the inputs: p_i = Softmax(u^T m_i).
  • Each x_i also maps to an output vector c_i (using embedding matrix C).
  • Output o = sum of the output vectors c_i, weighted by p_i.
  • The sum of the output vector o and the query embedding u is passed through weight matrix W, followed by a softmax, to produce the predicted answer: a = Softmax(W(o + u)).
  • A, B, C and W are learnt jointly by minimizing the cross-entropy loss between the predicted and the true answer.
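
The single-layer forward pass above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`embed_bow`, `single_layer`, `random_table`), the toy sizes, the random initialisation, and the choice to store A, B, C and W as one row per vocabulary word are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embed_bow(word_ids, table):
    """Bag-of-words embedding: sum the table rows of the given word ids."""
    dim = len(table[0])
    return [sum(table[w][k] for w in word_ids) for k in range(dim)]

def single_layer(sentences, question, A, B, C, W):
    """One hop. A, B, C, W are lists of rows, one row per vocabulary word (assumed layout)."""
    memories = [embed_bow(s, A) for s in sentences]   # m_i
    outputs = [embed_bow(s, C) for s in sentences]    # c_i
    u = embed_bow(question, B)                        # internal state u
    p = softmax([dot(u, m) for m in memories])        # attention over the inputs
    dim = len(u)
    o = [sum(p_i * c[k] for p_i, c in zip(p, outputs)) for k in range(dim)]
    summed = [o_k + u_k for o_k, u_k in zip(o, u)]    # o + u
    answer = softmax([dot(row, summed) for row in W]) # distribution over the vocabulary
    return p, answer

def random_table(rows, cols, rng):
    """Hypothetical helper: small random weights for the toy example."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, D = 6, 4                                  # toy vocabulary size and embedding dimension
A, B, C, W = (random_table(V, D, rng) for _ in range(4))
sentences = [[0, 1, 2], [3, 1, 4]]           # two stored inputs, as lists of word ids
question = [5, 1]
p, answer = single_layer(sentences, question, A, B, C, W)
```

Here `p` has one entry per stored sentence and `answer` has one entry per vocabulary word. Training (minimizing cross-entropy over A, B, C, W) is left out of the sketch.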

Multiple Layers

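  • For layers above the first, the input is the sum of the previous layer's input and output: u^{k+1} = u^k + o^k.
  • Each layer has its own embedding matrices A^k and C^k, subject to constraints.
  • At the final layer, the predicted answer is a = Softmax(W(o^K + u^K)).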
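
As a rough sketch of how the hops stack, the snippet below applies the state update once per layer. To keep it short it assumes the memory vectors m_i and output vectors c_i of every layer have already been computed; the names `hop` and `multi_hop` and the toy numbers are illustrative only.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hop(u, memories, outputs):
    """Return the layer output o for state u, given memory vectors m_i and output vectors c_i."""
    p = softmax([sum(a * b for a, b in zip(u, m)) for m in memories])
    return [sum(p_i * c[j] for p_i, c in zip(p, outputs)) for j in range(len(u))]

def multi_hop(u, layers):
    """layers is a list of (memories, outputs) pairs, one pair per hop."""
    for memories, outputs in layers:
        o = hop(u, memories, outputs)
        u = [u_j + o_j for u_j, o_j in zip(u, o)]   # next state = current state + layer output
    return u   # after the last hop this is o^K + u^K, which goes through W and the final softmax

# Toy example with two hops over two memory slots.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]),
    ([[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 0.0]]),
]
u_final = multi_hop([0.0, 0.0], layers)
```

In this toy run both hops attend uniformly (the scores tie), so the state moves from [0, 0] to [1, 1] and then to [3, 1].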

Constraints On Embedding Vectors

  • Adjacent

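    • The output embedding of one layer is the input embedding of the next, i.e. A^{k+1} = C^k.
    • The answer prediction matrix is tied to the final output embedding: W^T = C^K.
    • The question embedding is tied to the first-layer input embedding: B = A^1.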
  • Layer-wise (RNN-like)

    • Same input and output embeddings across layers, i.e. A^1 = A^2 = ... = A^K and C^1 = C^2 = ... = C^K.
    • A linear mapping H is added to the update of u between hops:
    • u^{k+1} = H u^k + o^k.
    • H is also learnt.
    • Think of this as a traditional RNN with 2 outputs
      • Internal output - used for memory consideration
      • External output - the predicted result
      • u becomes the hidden state.
      • p is an internal output which, combined with C, is used to update the hidden state.
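
The RNN view of the layer-wise variant can be sketched as a step function that carries u as the hidden state and emits the attention p at every hop. This is an illustration under assumptions: `rnn_step` and `run_hops` are hypothetical names, the memory and output vectors are taken as precomputed and shared by all hops (which is what tying A and C across layers implies), and H is a hand-picked matrix rather than a learnt one.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(u, H, memories, outputs):
    """One hop of the layer-wise variant: returns (next hidden state, attention p)."""
    p = softmax([sum(a * b for a, b in zip(u, m)) for m in memories])  # internal output
    o = [sum(p_i * c[j] for p_i, c in zip(p, outputs)) for j in range(len(u))]
    Hu = matvec(H, u)
    return [h_j + o_j for h_j, o_j in zip(Hu, o)], p                   # H u + o

def run_hops(u, H, memories, outputs, hops):
    """Apply the same step `hops` times, reusing the shared memories and outputs."""
    trace = []
    for _ in range(hops):
        u, p = rnn_step(u, H, memories, outputs)
        trace.append(p)
    return u, trace

H = [[0.5, 0.0], [0.0, 0.5]]                 # illustrative values only
memories = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[2.0, 0.0], [0.0, 2.0]]
u_final, trace = run_hops([0.0, 0.0], H, memories, outputs, hops=2)
```

The external output (the predicted answer) would be read from the final state through W and a softmax; it is omitted here to keep the focus on the recurrence.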

Related Architectures

  • RNN - Memory is stored as the state of the network, which is latent and unstable over long temporal contexts.
  • LSTM - Locks in the network state using local memory cells, but still fails over longer temporal contexts.
  • Memory Networks - Use a global memory.
  • Bidirectional RNN - Uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemNN, performs only a single pass over the memory.

Sentence Representation for Question Answering Task

  • Bag-of-words representation

    • Input sentences and questions are embedded as a bag of words.
    • Cannot capture the order of the words.
  • Position Encoding

    • Takes into account the order of words.
  • Temporal Encoding

    • Temporal information is encoded by a matrix T_A, where T_A(i) is its i-th row, and the memory vectors are modified as

    m_i = \sum_j A x_{ij} + T_A(i)

  • Random Noise

    • Dummy (empty) memories are added at training time to regularize T_A.
  • Linear Start (LS) training

    • Removes the softmax in each memory layer at the start of training (keeping only the final answer softmax) and re-inserts them when the validation loss stops decreasing.
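
To make the sentence representations above concrete, here is a small sketch of position encoding combined with temporal encoding. The weighting l_kj = (1 - j/J) - (k/d)(1 - 2j/J), with 1-based word position j, embedding index k, sentence length J and embedding dimension d, is the position-encoding formula given in the paper. Everything else is an assumption for illustration: the function names, the row-per-word layout of A, the row-per-memory-slot layout of T_A, and the toy numbers.

```python
def position_weights(num_words, dim):
    """weights[j-1][k-1] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j and k."""
    J, d = num_words, dim
    return [[(1 - j / J) - (k / d) * (1 - 2 * j / J) for k in range(1, d + 1)]
            for j in range(1, J + 1)]

def memory_vector(word_ids, A, T_A, i, use_position=True):
    """m_i = sum_j l_j * (A x_ij) + T_A(i); l_j is all ones for plain bag-of-words."""
    dim = len(A[0])
    if use_position:
        l = position_weights(len(word_ids), dim)
    else:
        l = [[1.0] * dim for _ in word_ids]
    m = [sum(l[j][k] * A[w][k] for j, w in enumerate(word_ids)) for k in range(dim)]
    return [m_k + t_k for m_k, t_k in zip(m, T_A[i])]

A = [[1.0, 2.0], [3.0, 4.0]]      # toy embedding table: 2 words, dimension 2
T_A = [[10.0, 20.0]]              # toy temporal row for memory slot 0

forward = memory_vector([0, 1], A, T_A, 0)                        # [12.0, 25.0]
reverse = memory_vector([1, 0], A, T_A, 0)                        # [12.0, 24.0]
bow_fwd = memory_vector([0, 1], A, T_A, 0, use_position=False)    # [14.0, 26.0]
bow_rev = memory_vector([1, 0], A, T_A, 0, use_position=False)    # [14.0, 26.0]
```

Swapping the two words changes the position-encoded vector but leaves the bag-of-words vector unchanged, which is the difference the two representations are meant to capture.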


  • The best MemN2N models are close in performance to the strongly supervised Memory Network models.
  • Position Encoding improves over bag-of-words approach.
  • Linear Start helps to avoid local minima.
  • Random Noise gives a small yet consistent boost in performance.
  • More computational hops lead to improved performance.
  • For the Language Modelling task, some hops concentrate on recent words while other hops have a broader attention span over all memory locations.
