Notes for "Memory Networks" paper

Memory Networks

Introduction

  • Memory Networks combine inference components with a long-term memory component.
  • Used in the context of Question Answering (QA), with the memory component acting as a (dynamic) knowledge base.
  • Link to the paper.

Related Work

  • Knowledge Base approach

    • Use information retrieval techniques to build a knowledge base and perform inference over it.
    • In contrast, Memory Networks store memory as raw text or embedded vectors and perform extraction on the fly.
  • Associative memory networks distribute memory across the whole network, while memory-based learning compartmentalizes it. In contrast, Memory Networks combine compartmentalized memory with neural network modules that learn how to read from and write to the memory.

  • The Neural Turing Machine (NTM) performs sequence prediction using a read-writeable "large, addressable memory" and can learn sorting, copy and recall operations on it. However, it uses a very small memory (128 slots) compared to Memory Networks (14M sentences).

  • RNNs and LSTMs can perform QA tasks, but they can only capture short-term dependencies.

Memory Networks

  • Memory m (an array of objects indexed by mi).
  • Four learnable components
    • I - Input feature map
      • Converts input to internal feature representation using standard processing techniques.
    • G - Generalization
      • Updates old memories given new input x
      • mi = G(mi, I(x), m) for all i.
      • The simplest form stores I(x) in a slot of memory, i.e., mH(x) = I(x), where H(x) selects the slot.
    • O - Output feature
      • Produce new output in feature space, given the input x and memory m
      • Output feature o = O(I(x), m)
    • R - Response
      • Convert output into desired format
      • r = R(o)
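A minimal sketch of how the four components fit together at inference time. The class and method bodies below are illustrative placeholders (the real I, G, O, R are learned modules), and all names are mine rather than the paper's.

```python
class MemoryNetwork:
    """Illustrative skeleton only; I, G, O, R stand in for learned modules."""

    def __init__(self):
        self.memory = []                      # m: array of memories m_i

    def I(self, x):
        # Input feature map: convert raw input to an internal representation
        # (e.g. a bag-of-words vector); identity here for illustration.
        return x

    def G(self, features):
        # Generalization: the simplest form stores I(x) in the next free
        # slot, i.e. mH(x) = I(x) with H(x) = next empty index.
        self.memory.append(features)

    def O(self, features):
        # Output: find supporting memories for the input. A learned scoring
        # function would pick the argmax memories; here we return them all.
        return [features] + self.memory

    def R(self, output_features):
        # Response: convert output features into the desired answer format.
        return output_features

    def forward(self, x):
        features = self.I(x)
        self.G(features)
        return self.R(self.O(features))
```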

Memory Neural Networks (MemNN)

Basic Model

  • Input (x) to I is a sentence and is stored in the next available memory slot.

  • O module produces output features by finding k supporting memories for the given input.

  • For k = 1, the highest scoring supporting memory is:

    o1 = O1(x, m) = argmax over i of sO(x, mi)

  • For k = 2,

    o2 = O2(x, m) = argmax over i of sO([x, mo1], mi)

  • The final input to R is o = [x, mo1, mo2].

  • R either returns mok or uses an RNN to perform true sentence generation.

  • Alternatively, rank all the words seen by the model and return the highest scoring word:

  • r = argmax over w in W of sR([x, mo1, mo2], w), where W is the set of all words in the dictionary.

  • Both scoring functions sO and sR take the form of an embedding model (see the sketch after this list):

    s(x, y) = φx(x)^T U^T U φy(y)

    • U : n x D matrix
    • n : embedding dimension
    • D : number of features
    • φx(x), φy(y) : map text to the D-dimensional feature space.
  • Training

    • Fully supervised.
    • Margin ranking loss.
    • Stochastic Gradient Descent.
    • For the exact loss function, refer to the paper.
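A minimal sketch of the embedding score s(x, y) and the k = 2 supporting-memory selection. The names U and phi, and the choice to represent [x, mo1] as summed bag-of-words vectors, are assumptions for illustration, not the paper's released code.

```python
import numpy as np

def score(U, phi_x_vec, phi_y_vec):
    # s(x, y) = φx(x)^T U^T U φy(y): compare x and y in the embedding space.
    return (U @ phi_x_vec) @ (U @ phi_y_vec)

def select_supporting_memories(U, phi, x, memories):
    # k = 2: pick o1 against x alone, then o2 against [x, m_o1].
    phi_x = phi(x)
    o1 = int(np.argmax([score(U, phi_x, phi(m)) for m in memories]))
    phi_x_mo1 = phi_x + phi(memories[o1])   # [x, m_o1] as summed bags of words
    o2 = int(np.argmax([score(U, phi_x_mo1, phi(m)) for m in memories]))
    return o1, o2
```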

Word Sequences As Inputs

  • When the input is a stream of words rather than pre-segmented sentences, learn a segmentation function to find breakpoints (see the sketch below).
  • seg(c) = Wseg^T US φseg(c)
    • Wseg : parameters of linear classifier in embedded space.
    • c : sequence of input words as a bag of words using a separate dictionary.
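A minimal sketch of the segmenter decision, assuming a breakpoint is declared when the linear classifier's score in the embedded space exceeds a margin gamma; Wseg, US, phi_seg_c and gamma are placeholder names and values.

```python
import numpy as np

def is_breakpoint(W_seg, U_S, phi_seg_c, gamma=0.0):
    # seg(c) = Wseg^T US φseg(c); declare a segment boundary when the
    # classifier score on the bag-of-words features exceeds the margin.
    return float(W_seg @ (U_S @ phi_seg_c)) > gamma
```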

Efficient Memory via Hashing

  • Hash the input I(x) into buckets and only score memories in the same bucket.
  • Hashing via clustering word embeddings performs better than just hashing words.
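A minimal sketch of the clustering-based hashing idea, using scikit-learn KMeans as a stand-in for the paper's clustering step. A memory (a list of words here) is placed in every bucket one of its words falls into; at query time only memories sharing a bucket with the input are scored.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_buckets(word_vectors, memories, n_buckets=10):
    # Cluster the word embeddings; each cluster id becomes a hash bucket.
    words = list(word_vectors)
    kmeans = KMeans(n_clusters=n_buckets).fit(np.stack([word_vectors[w] for w in words]))
    word_to_bucket = dict(zip(words, kmeans.labels_))
    buckets = {b: set() for b in range(n_buckets)}
    for idx, memory_words in enumerate(memories):
        for w in memory_words:
            if w in word_to_bucket:
                buckets[word_to_bucket[w]].add(idx)
    return word_to_bucket, buckets

def candidate_memories(query_words, word_to_bucket, buckets):
    # Only memories sharing a bucket with the query are scored by sO.
    candidates = set()
    for w in query_words:
        if w in word_to_bucket:
            candidates |= buckets[word_to_bucket[w]]
    return candidates
```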

Modelling Write Time

  • Extend the model to account for the time at which each memory was written, so that questions involving temporal context can be answered.
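This is not the paper's exact mechanism (the paper compares candidate memories pairwise using write-time features); the sketch below is only a simplified illustration of exposing write time to the scorer, with phi_vec, write_step and total_steps as assumed names.

```python
import numpy as np

def phi_with_time(phi_vec, write_step, total_steps):
    # Append a normalised write-time feature so the scoring function can
    # take recency into account when answering temporal questions.
    return np.concatenate([phi_vec, [write_step / max(1, total_steps)]])
```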

Modelling Previously Unseen Words

  • In the case of a new word, use the neighboring words to predict the embedding of the new word.
  • Use a dropout-like approach: d% of the time, assume the word is new and use its context to derive its embedding (see the sketch below).
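A minimal sketch of this trick, assuming a simple fallback that averages the embeddings of neighbouring words; d, window and the function name are illustrative assumptions.

```python
import random
import numpy as np

def word_embedding(sentence, i, embeddings, d=0.1, window=2):
    # With probability d (or if the word is genuinely unseen), ignore the
    # word itself and fall back to the mean embedding of its context words.
    word = sentence[i]
    if word not in embeddings or random.random() < d:
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        vecs = [embeddings[w] for w in context if w in embeddings]
        if vecs:
            return np.mean(vecs, axis=0)
        return np.zeros_like(next(iter(embeddings.values())))
    return embeddings[word]
```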

Exact Matches And Unseen Words

  • Embedding models cannot efficiently use exact word matches because of the low embedding dimension n.
  • Add "bag of words" matching scores to the learned embedding scores.
  • Alternatively, stay in n-dimensional embedding space but extend the feature representation D with matching features to indicate if a word occurs in both x and y.
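A minimal sketch of the first variant: combine the learned embedding score with a raw bag-of-words matching term. The mixing weight lam is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def score_with_matching(U, phi_x_vec, phi_y_vec, lam=1.0):
    # Learned embedding score plus a bag-of-words overlap term that
    # rewards exact word matches between x and y.
    embedding_term = (U @ phi_x_vec) @ (U @ phi_y_vec)
    matching_term = phi_x_vec @ phi_y_vec
    return embedding_term + lam * matching_term
```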

Observations

  • RNNs perform poorly on tasks that require encoding long-term memory.
  • LSTMs are an improvement over RNNs but are still inadequate.
  • Memory Networks outperform both LSTMs and RNNs in long-term memory encoding.
  • Modelling write time and unseen words further improves performance.

Limitations

  • The hard (argmax) memory lookups make the model difficult to train end-to-end via backpropagation.
  • Requires fully supervised training (including labelled supporting memories), which limits applicability to new domains.