@shagunsodhani
Created April 24, 2016 14:19
Notes for End-To-End Memory Networks paper

End-To-End Memory Networks

Introduction

  • A neural network with a recurrent attention model over a (possibly large) external memory.
  • A continuous form of the Memory Network, but trained end-to-end, so it requires much less supervision and can be applied to more domains.
  • Can also be seen as an extension of RNNSearch that performs multiple hops (computational steps) over the memory per output symbol.
  • Link to the paper.
  • Link to the implementation.

Approach

  • The model takes as input x1, ..., xn (to be stored in memory) and a query q, and outputs an answer a.

Single Layer

  • The input set (xi) is embedded in a d-dimensional space using embedding matrix A to obtain memory vectors (mi).
  • The query q is also embedded, using matrix B, to obtain the internal state u.
  • The match between u and each memory mi is computed in the embedding space, followed by a softmax, to obtain a probability vector p over the inputs: pi = Softmax(uT mi).
  • Each xi also maps to an output vector ci (using embedding matrix C).
  • The output o is the sum of the ci weighted by pi: o = sum(pi ci).
  • The sum of the output vector o and the internal state u is passed through weight matrix W followed by a softmax to produce the predicted answer: a = Softmax(W(o + u)).
  • A, B, C and W are learnt jointly by minimizing a cross-entropy loss. A minimal sketch of this forward pass is given below.
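A minimal NumPy sketch of the single-layer forward pass above. The matrix names (A, B, C, W) follow the notes; the vocabulary size, embedding dimension, number of memories, and the random bag-of-words inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, n = 50, 20, 10                    # vocab size, embedding dim, number of memories (assumed)
A = np.random.randn(d, V) * 0.1         # input (memory) embedding
B = np.random.randn(d, V) * 0.1         # query embedding
C = np.random.randn(d, V) * 0.1         # output embedding
W = np.random.randn(V, d) * 0.1         # answer prediction matrix

x = np.random.randint(0, 2, size=(n, V)).astype(float)  # inputs x1..xn as bag-of-words vectors
q = np.random.randint(0, 2, size=V).astype(float)       # query q as a bag-of-words vector

m = x @ A.T                      # memory vectors mi = A xi          (n, d)
c = x @ C.T                      # output vectors ci = C xi          (n, d)
u = B @ q                        # internal state u = B q            (d,)
p = softmax(m @ u)               # pi = Softmax(uT mi)               (n,)
o = p @ c                        # o = sum(pi ci)                    (d,)
a_hat = softmax(W @ (o + u))     # predicted answer distribution     (V,)
```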

Multiple Layers

  • For layers above the first, the input is uk+1 = uk + ok.
  • Each layer k has its own embedding matrices Ak and Ck, subject to the tying constraints described in the next section.
  • At the final layer K, the prediction is a = Softmax(W(oK + uK)). A multi-hop sketch follows below.
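A self-contained sketch of stacking K hops as described above, reusing the same shapes and softmax helper as the single-layer sketch. The per-hop matrices are kept untied here (tying schemes come next), and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, n, K = 50, 20, 10, 3                                # illustrative sizes; K hops
x = np.random.randint(0, 2, size=(n, V)).astype(float)    # inputs as bag-of-words vectors
q = np.random.randint(0, 2, size=V).astype(float)         # query
B = np.random.randn(d, V) * 0.1                           # query embedding
W = np.random.randn(V, d) * 0.1                           # answer prediction matrix
A_hops = [np.random.randn(d, V) * 0.1 for _ in range(K)]  # per-hop Ak
C_hops = [np.random.randn(d, V) * 0.1 for _ in range(K)]  # per-hop Ck

u = B @ q                            # u1 = B q
for k in range(K):
    m = x @ A_hops[k].T              # memories mi = Ak xi
    c = x @ C_hops[k].T              # output vectors ci = Ck xi
    p = softmax(m @ u)               # attention over memories at hop k
    o = p @ c                        # response ok
    u = u + o                        # uk+1 = uk + ok
a_hat = softmax(W @ u)               # equals Softmax(W(oK + uK))
```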

Constraints On Embedding Vectors

  • Adjacent

    • The output embedding of one layer is the input embedding of the next, i.e. Ak+1 = Ck.
    • WT = CK, i.e. the answer prediction matrix is tied to the final output embedding.
    • B = A1, i.e. the query embedding is tied to the input embedding of the first layer.
  • Layer-wise (RNN-like)

    • Same input and output embeddings across layers, i.e. A1 = A2 = ... = AK and C1 = C2 = ... = CK.
    • A linear mapping H is added to update of u between hops.
    • uk+1 = Huk + ok.
    • H is also learnt.
    • Think of this as a traditional RNN with two outputs:
      • Internal output - used for considering memories.
      • External output - the predicted result.
      • u becomes the hidden state.
      • p is the internal output which, combined with C, is used to update the hidden state.
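A small sketch of how the per-hop matrices could be constructed under the two tying schemes above. The function name and initialisation are hypothetical; only the shapes and the tying relations come from the notes.

```python
import numpy as np

def make_embeddings(scheme, K, d, V):
    """Build per-hop embedding lists for K hops under a given tying scheme (sketch)."""
    if scheme == "adjacent":
        # Ak+1 = Ck: K+1 distinct matrices, the first K used as A's, the last K as C's.
        # B = A1 and WT = CK then come from the same list.
        mats = [np.random.randn(d, V) * 0.1 for _ in range(K + 1)]
        A_hops, C_hops = mats[:K], mats[1:]
        B, W = A_hops[0], C_hops[-1].T
        H = None                                   # no extra mapping needed
    elif scheme == "layerwise":
        # A1 = ... = AK and C1 = ... = CK, plus a learned linear map H applied to u
        # between hops: uk+1 = H uk + ok.
        A = np.random.randn(d, V) * 0.1
        C = np.random.randn(d, V) * 0.1
        A_hops, C_hops = [A] * K, [C] * K
        B = np.random.randn(d, V) * 0.1
        W = np.random.randn(V, d) * 0.1
        H = np.random.randn(d, d) * 0.1
    else:
        raise ValueError("scheme must be 'adjacent' or 'layerwise'")
    return A_hops, C_hops, B, W, H
```

With the layer-wise scheme, the update inside the hop loop above becomes u = H @ u + o.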

Related Architectures

  • RNN - the memory is the state of the network, which is unstable and unusable over long temporal contexts.
  • LSTM - locks in the network state using local memory cells, but still fails over longer temporal contexts.
  • Memory Networks - use a global memory, but require supervision of the supporting memories during training.
  • Bidirectional RNN - uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemNN, performs only a single pass over the memory.

Sentence Representation for Question Answering Task

  • Bag-of-words representation

    • Input sentences and questions are embedded as a bag of words.
    • Cannot capture the order of the words.
  • Position Encoding

    • Takes the order of words within a sentence into account by weighting each word embedding by its position (see the position-encoding sketch after this list).
  • Temporal Encoding

    • Temporal information is encoded by a matrix TA and the memory vectors are modified as

    mi = sum_j(A xij) + TA(i), where TA(i) is the i-th row of TA.

  • Random Noise

    • Dummy Memories (empty memories) are added at training time to regularize TA.
  • Linear Start (LS) training

    • The softmax layers are removed at the start of training and re-inserted when the validation loss stops decreasing.
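A sketch of the position-encoding weights and the resulting memory vector (with the temporal term) described above; it follows the paper's formula l_kj = (1 - j/J) - (k/d)(1 - 2j/J), but the function and variable names here are illustrative.

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weights l of shape (d, J): l_kj = (1 - j/J) - (k/d)(1 - 2j/J)."""
    j = np.arange(1, J + 1)[None, :]          # word positions 1..J
    k = np.arange(1, d + 1)[:, None]          # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def memory_vector(words, A, TA_i):
    """mi = sum_j lj * (A xij) + TA(i) for one sentence given as a list of word vectors."""
    d, J = A.shape[0], len(words)
    l = position_encoding(J, d)                          # (d, J)
    embedded = np.stack([A @ w for w in words], axis=1)  # columns are A xij, shape (d, J)
    return (l * embedded).sum(axis=1) + TA_i             # position-weighted sum plus temporal term
```

With a plain bag-of-words representation, l would be all ones, recovering mi = sum_j(A xij) + TA(i).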

Observations

  • The best MemN2N models come close to the supervised MemNN models in performance.
  • Position Encoding improves over the bag-of-words representation.
  • Linear Start helps to avoid local minima.
  • Random Noise gives a small yet consistent boost in performance.
  • More computational hops lead to improved performance.
  • For the language modelling task, some hops concentrate on recent words while others have a broader attention span over all memory locations.