
Key-Value Memory Networks for Directly Reading Documents

Introduction

  • Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to fixed schema) and too sparse (due to limitations of Information Extraction (IE) systems).
  • The paper proposes Key-Value Memory Networks, a neural network architecture based on Memory Networks that can leverage both KBs and raw data for QA.
  • The paper also introduces MOVIEQA, a new QA dataset whose questions can be answered using a perfect KB, using Wikipedia pages, or using an imperfect KB obtained via IE techniques, thereby allowing a comparison between systems built on any of the three sources.
  • [Link to the paper](https://arxiv.org/abs/1606.03126).

Related Work

  • TRECQA and WIKIQA are two benchmarks where systems need to select the sentence containing the correct answer, given a question and a list of candidate sentences.
  • These datasets are small and make it difficult to compare systems that use different knowledge sources.
  • The best results on these benchmarks are reported by CNNs and RNNs with attention mechanisms.

Key-Value Memory Networks

  • An extension of the Memory Networks model.
  • Generalises the way context is stored in memory.
  • Comprises a memory made of slots in the form of vector pairs (k1, v1), ..., (kM, vM) that encode long-term and short-term context.

Reading the Memory

  • Key Hashing - the question x is used to preselect a subset of memories (kh1, vh1), ..., (khN, vhN) whose key shares at least one word with x, considering only words with frequency below 1000 (to avoid matching on stop words).
  • Key Addressing - each candidate memory is assigned a relevance probability:
    • p_hi = softmax(AφX(x) · AφK(khi))
    • φ denotes a feature map of dimension D, and A is a d×D matrix.
  • Value Reading - the values of the memories are read by taking their weighted sum using the addressing probabilities, and a vector o is returned:
    • o = Σi p_hi AφV(vhi)
  • The memory access process is conducted by a "controller" neural network using q = AφX(x) as the query.
  • The query is then updated:
    • q2 = R1(q + o), where R1 is a d×d matrix.
  • The addressing and reading steps are repeated with new matrices Ri to retrieve more pertinent information on each subsequent access.
  • After a fixed number of hops H, the resulting state of the controller is used to compute a final prediction over the candidate outputs:
    • a = argmax_i softmax(qH+1^T BφY(yi)), where yi are the possible candidate outputs and B is a d×D matrix.
  • The network is trained end-to-end using a cross-entropy loss, backpropagation and stochastic gradient descent.
  • End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks by setting key and value to be the same for all the memories.
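
To make the read pipeline above concrete, here is a minimal numpy sketch of one forward pass (key hashing, H hops of addressing and reading, and the final candidate scoring). The parameter values, toy vocabulary and function names are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, H = 50, 16, 2      # feature (vocab) dim D, embedding dim d, number of hops H

A = 0.1 * rng.standard_normal((d, D))                      # d x D embedding matrix
B = 0.1 * rng.standard_normal((d, D))                      # d x D output embedding matrix
R = [0.1 * rng.standard_normal((d, d)) for _ in range(H)]  # per-hop d x d update matrices

def bow(word_ids):
    """Fixed bag-of-words feature map (φX = φY = BoW, as in the paper)."""
    x = np.zeros(D)
    for w in word_ids:
        x[w] += 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def key_hash(question, keys):
    """Preselect slots whose key shares at least one word with the question.
    (The paper additionally restricts to words with frequency < 1000; omitted here.)"""
    q = set(question)
    return [i for i, k in enumerate(keys) if q & set(k)]

def answer(question, keys, values, candidates):
    idx = key_hash(question, keys)
    K = np.stack([A @ bow(keys[i]) for i in idx])      # AφK(khi), one row per slot
    V = np.stack([A @ bow(values[i]) for i in idx])    # AφV(vhi)
    q = A @ bow(question)                              # q = AφX(x)
    for h in range(H):
        p = softmax(K @ q)                             # key addressing
        o = p @ V                                      # value reading: Σi p_hi AφV(vhi)
        q = R[h] @ (q + o)                             # controller update: q_{h+1} = R_h(q + o)
    scores = softmax(np.stack([B @ bow(y) for y in candidates]) @ q)
    return int(np.argmax(scores))

# Toy usage: word ids index a vocabulary of size D.
print(answer(question=[3, 7, 12],
             keys=[[3, 7, 20], [5, 9]], values=[[21], [9]],
             candidates=[[21], [9], [4]]))
```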

Variants of Key-Value Memories

  • φX and φY - the feature maps corresponding to the query and the answer - are fixed as bag-of-words representations.

KB Triple

  • Triples of the form "subject relation object" can be represented in Key-Value Memory Networks with the subject and relation as the key and the object as the value.
  • In standard Memory Networks, the whole triplet would have to be embedded in the same memory slot.
  • The reversed relations "object is_related_to subject" can also be stored.
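
A minimal sketch of this slot construction, using a hypothetical triple; the "_reversed" suffix is an illustrative stand-in for however the reversed relation is actually named:

```python
def triple_to_slots(subject, relation, obj):
    """One KB triple yields two memory slots:
    key = subject + relation, value = object, plus the reversed relation."""
    return [
        (subject + " " + relation, obj),
        (obj + " " + relation + "_reversed", subject),
    ]

slots = triple_to_slots("Blade Runner", "directed_by", "Ridley Scott")
# [("Blade Runner directed_by", "Ridley Scott"),
#  ("Ridley Scott directed_by_reversed", "Blade Runner")]
```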

Sentence Level

  • A document can be split into sentences, with each sentence encoded as a bag-of-words in both the key and the value of a memory slot (see the sketch after the Window + Title variant below).

Window Level

  • Split the document into windows of W words and represent each window as a bag-of-words.
  • The window becomes the key and the central word becomes the value.

Window + Centre Encoding

  • Instead of mixing the window centre with the rest of the words, double the size of the dictionary and encode the centre of the window and the value using the second dictionary.

Window + Title

  • Since the title of a document can contain useful information, the word window can be encoded as the key and the document title as the value.
  • The key could be augmented with features like "window" and "title" to distinguish between different cases.
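
The document-based variants above (sentence level through Window + Title) can be sketched as follows; the "centre:" prefix standing in for a doubled dictionary and the "_window_"/"_title_" marker features are illustrative assumptions:

```python
def sentence_slots(sentences):
    """Sentence level: key and value are the same bag of words."""
    return [(s.split(), s.split()) for s in sentences]

def window_slots(words, W=5):
    """Window level: key = the W-word window, value = its centre word."""
    half = W // 2
    return [(words[c - half:c + half + 1], [words[c]])
            for c in range(half, len(words) - half)]

def centre_encoded_slots(words, W=5):
    """Window + Centre Encoding: the centre word and the value come from a
    second dictionary (sketched by a 'centre:' prefix rather than doubling
    the vocabulary size)."""
    out = []
    for key, (centre,) in window_slots(words, W):
        key = [("centre:" + w) if w == centre else w for w in key]
        out.append((key, ["centre:" + centre]))
    return out

def window_title_slots(words, title, W=5):
    """Window + Title: key = window plus a marker feature, value = the title."""
    return [(key + ["_window_"], title.split() + ["_title_"])
            for key, _ in window_slots(words, W)]
```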

MOVIEQA Benchmark

Knowledge Representation

  • Doc - Raw documents (from Wikipedia) related to movies.
  • KB - Graph based KB made of entities and relations.
  • IE - Performing Information Extraction on Wikipedia to create a KB.
  • The QA pairs should be answerable using both the raw documents and the KB, so that the three approaches can be compared and the gap between them can be closed.
  • The dataset has more than 100,000 QA pairs, making it much larger than most existing datasets.

Experiments

MOVIEQA

Systems Compared

Observations

  • Key-Value Memory Networks outperform all other methods on all data sources.
  • Performance by knowledge source: KB > Doc > IE.
  • The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title".

KB vs Synthetic Document Analysis

  • Given KB triples, construct synthetic "Wikipedia" articles using templates, conjunctions and coreference, to determine the causes of the performance gap between using the KB and using documents.
  • The loss on one-template sentences is due to the difficulty of extracting the subject, relation and object from the artificial documents.
  • Using multiple templates does not deteriorate performance much, but conjunctions and coreference cause a dip in performance.
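
A minimal sketch of this kind of template-based article generation; the templates themselves are invented for illustration:

```python
import random

TEMPLATES = [
    "{s} was directed by {o}",
    "the film {s} is a work of director {o}",
]

def synthesize(triples, conjunctions=False):
    """Render KB triples as a synthetic 'Wikipedia' article."""
    # A real generator would choose templates per relation; the relation is
    # ignored here for brevity.
    facts = [random.choice(TEMPLATES).format(s=s, o=o) for (s, _r, o) in triples]
    if conjunctions and len(facts) > 1:
        # Join neighbouring facts with "and", mimicking the harder setting.
        facts = [" and ".join(facts[i:i + 2]) for i in range(0, len(facts), 2)]
    return ". ".join(facts) + "."
```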

WIKIQA

Comments

@ylytju commented on Mar 7, 2018:

At the sentence level, what plays the role of the key and what plays the role of the value? In that case, what is the difference between a Key-Value Memory Network and a traditional Memory Network?

@shagunsodhani (author) replied:

In cases where the key is the same as the value, there is no difference between key-value and traditional memory networks. In the sentence-level case, if the key and the value are both the same bag of words, the two models coincide; the key-value formulation simply also allows the key to differ from the value.
