
Table of Contents

Convolutional Neural Networks for NLP

Question-Answering

  • The field has a long history. Reading comprehension and finding answers in a passage started with the Yale A.I. Project (Schank, Abelson, Lehnert et al. 1977) and was revived by Hirschman (1999). Burges (2013) gave a more formal definition of the problem.
    • Open-domain question answering (Simmons et al.) and factoid question answering (where the answer is a named entity) have also been studied.
    • IBM's Jeopardy system is another QA system that became famous in 2011.
  • At NeurIPS 2015, DeepMind built a dataset from CNN/Daily Mail news stories. Then Percy Liang's group produced the SQuAD dataset (EMNLP 2016).
    • SQuAD v1.1 gives a passage and a question; the answer is a substring of the passage, labelled by 3 humans. If your answer matches one of the 3 human answers, you are correct (exact-match accuracy). The F1 metric is preferred over exact match because it is more robust to word-choice differences.
    • People started building neural networks to solve this, and they started working well. In 2018, Google's BERT topped the leaderboard with better-than-human accuracy (note that using only 3 humans to estimate human-level performance is not a strong baseline).
    • In this dataset, all answers are in the paragraph. Thus, a good mechanism that ranks candidate substrings from the paragraph can work well.
  • SQuAD 2.0 addresses this limitation and has many questions whose answer is not in the paragraph, so a system may say that there is no answer. A simple approach: if your confidence in any answer is below a threshold, predict no answer.
    • Example: a convoluted passage and a question (about when someone was killed) that asks for a year as the answer. The passage mentions several years, one of which appears near the word destroyed, and the machine (MS nlnet) picks that up. Unfortunately, there is no answer; this shows that the neural network didn't really understand the question or the passage.
    • They showed that the best-performing systems on SQuAD 1.1 did not perform that well on SQuAD 2.0.
    • Unfortunately, it did not turn out to be as difficult as the authors expected; present models have also reached human-level performance on it.
    • There are no yes/no, counting, or implicit "why" questions.
    • The dataset was built via human studies in which annotators were asked to come up with questions that either had answers in the passage or that the passage could not answer. This biases the annotators toward questions closely related to the passage. People have since tried to collect more natural questions (eg. how people ask questions on Google).
  • DrQA -- simple end-to-end baseline
    • Question to a vector-- embed each word; run a biLSTM; concatenate the end vectors to get a vector representing the question.
    • Do the same for the words in the passage.
    • Use the question representation (query) to compute a (bilinear) attention over the vectors in the passage. Use the softmax of these attention weights to predict the start of the answer, and use another network to find the end. Why not look at the middle? Maybe an important part of the answer is in the middle, but the biLSTM transfers information from the middle to the two ends (which might be why this works). (A minimal sketch of this bilinear start/end prediction appears at the end of this section.)
  • Stanford Attentive Reader++ -- Why is this more complex?
    • Uses a weighted sum of the question word vectors to represent the question (not only the end vectors). What is the query vector that helps determine the weights? It is a random vector that is also trained end-to-end.
    • Linguistic features (POS & NER) and word frequency are concatenated to the passage vectors.
    • For each word in the passage, if it appears in the question, use that explicitly in the representation.
  • Why do these systems do better than traditional systems? They are good at semantic matching (Chen, Bolton, Manning 2016).
  • BiDAF by Seo et al. 2017
    • Map question words to passage words and vice versa (using attention for the mapping).
    • The similarity score is built on top of a tuple containing the context word, the question word, and their element-wise product (a DNN purist would assume the network learns the latter by itself).
      • Context2Question attention-- an attention-weighted view of the question for each context position in the passage.
      • Question2Context attention-- the reverse: find the question position that is best aligned to the different context parts. A softmax over these attention weights provides an attention map over the passage.
      • The output of these attentions is input to BiLSTM layers, and the resulting representations are used in a linear layer to predict the start and end.
  • FusionNet, 2017-- uses the additive (tanh) version of attention, but one could use a multiplicative (product) form to calculate the attention. FusionNet factors the weight matrix into low-dimensional matrices; when computing tons of attentions, this is cheaper to do.
  • Most recent research amounts to making bigger networks and training them.
  • ELMo and BERT-- contextual word representations, built using a language-modeling technique. BERT is used by all models at the top of the leaderboard; using it as a sub-module and building on top of it apparently helps. Link that tries to explain ELMo, BERT and co.
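
A minimal sketch (in PyTorch, not the authors' code) of the bilinear start/end span prediction described in the DrQA bullet above; `BilinearSpanPredictor` and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearSpanPredictor(nn.Module):
    """Score each passage position against a question vector with a bilinear
    form, then softmax to get start/end distributions over the passage."""
    def __init__(self, hidden_dim):
        super().__init__()
        # Separate bilinear weights for the start and the end of the span.
        self.w_start = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_end = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, passage_states, question_vec):
        # passage_states: (batch, passage_len, hidden_dim), e.g. biLSTM outputs
        # question_vec:   (batch, hidden_dim), a summary vector of the question
        start_scores = torch.bmm(self.w_start(passage_states),
                                 question_vec.unsqueeze(2)).squeeze(2)
        end_scores = torch.bmm(self.w_end(passage_states),
                               question_vec.unsqueeze(2)).squeeze(2)
        # Softmax over passage positions gives the start/end distributions.
        return F.softmax(start_scores, dim=1), F.softmax(end_scores, dim=1)

# Random tensors stand in for encoder outputs: 2 passages of 50 tokens each.
predictor = BilinearSpanPredictor(hidden_dim=128)
p_start, p_end = predictor(torch.randn(2, 50, 128), torch.randn(2, 128))
```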

Machine Translation, Seq-2-seq, Attention

  • MT history-- translating a sentence x in one language to a sentence y in another language. Began in the 1950s with Russian-to-English translation, mostly due to the Cold War. The systems were rule-based, using multilingual dictionaries.

  • 1990s-2010s-- Statistical MT. Find argmax_y P(y|x) = argmax_y P(x|y) P(y). P(x|y) is the translation model and P(y) is a target-language model. This breaks the problem into translating faithfully and writing good sentences in the target language, as opposed to learning both together (the decomposition is written out after this list). One needs a lot of parallel data; the Rosetta Stone is a parallel corpus that was used to understand Ancient Egyptian.

    • To learn alignment, introduce a latent alignment variable a and model P(x, a|y). There can be many-to-one alignments (many words in the source language combine to one word in the target language), one-to-many alignments (a fertile word in the source language translates to many words in the target language), and many-to-many alignments (a phrase-to-phrase alignment that cannot be broken down into word-to-word alignments).
    • SMT uses a search algorithm for translation (decoding) and rejects sequences that have low probability (recall that you need an argmax over y).
    • This was a huge research field that produced extremely complex systems. People needed to maintain tables of equivalent phrases. A lot of human effort was required, and more was needed for each new language.
  • Neural Machine Translation-- a sequence-to-sequence architecture that involves two RNNs. Word embeddings are passed into the encoder RNN; the encoded hidden state is passed to the decoder, and the decoder then behaves like a language model to generate the target sentence. (Note that one needs word embeddings for both the source and target languages.)
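
Written out, the statistical MT decomposition from the bullets above (the standard noisy-channel formulation, with a latent alignment a):

```latex
\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(x \mid y)\, P(y),
\qquad
P(x \mid y) = \sum_a P(x, a \mid y)
```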

Seq-2-seq

  • Seq2seq is a conditional language model that encodes some information based on which the generation is supposed to work. Thus, it can be used for summarization, dialog (utterance to utterance), parsing, code generation (eng to py) etc.

  • While training, the decoder gets as input the previous hidden state and the actual word from the given translation in the parallel corpus. End-to-end learning means backpropagating the gradients obtained at each step of the decoder all the way back to the start of the encoder.

  • Greedy decoding chooses the argmax at each step and uses that as the input to the next step of the decoder. Note that taking the argmax at each step might not result in the argmax for the entire sequence. With no way of backtracking, a bad choice at an intermediate step will result in a bad translation.

    • Exhaustive search decoding is too expensive. The branching factor is of the order of the vocabulary.
    • People use beam search-- maintain a beam of fixed size that keeps track of the k most likely hypotheses (k = 5-10 in practice). The score of a sequence decomposes into the sum of the log-probabilities of each word given the sequence before it. (Beam search is not optimal, but exhaustive search, which is optimal, is too expensive.) The fringe has k * k options; discard all but the top k. (See the sketch after this list.)
    • Stopping criterion-- once a hypothesis generates the end token, set it aside; the beam search still continues. Two options to stop-- (1) after T time steps (eg. 30 words) or (2) after you have collected n complete hypotheses. Obviously, longer hypotheses will have lower scores, which prioritizes shorter generations. A simple solution is to normalize the score by the length of the sequence.
    • Pros-- Fluent, better context usage, phrase similarity identification and usage, single end-to-end optimization needed; less human effort required (no feature engineering and same method for any language pair).
    • Cons-- Less interpretable and thus harder to debug (compared to SMT); difficult to control (if you see a particular kind of error, it is difficult to program around it), leading to safety concerns (eg. ensuring the system never says certain kinds of things). (People do take SMT research ideas to address some of these.)
    • Evaluation metric-- Bilingual Evaluation Understudy (BLEU). N-gram precision plus a brevity penalty for too-short system translations. It is useful but pretty imperfect (if the human translation used rare words while your model uses common words, or vice versa, your model gets a low score even if it produces a good-quality translation).
    • Out of Vocabulary (OOV) -- people use sub-word modeling
    • Domain mismatch -- a model trained on formal text does not work that well on informal text (say, in a social media context).
    • Translating a long text that needs context from previous sentences is difficult.
    • NMT is difficult in the context of low resource language pairs
    • Common sense is difficult-- eg. paper jam (paper stuck in a printer) -> Mermelada de papel (Jam made of paper)
    • Biases-- gender-neutral sentences in a source language get translated to show biases in the target language (সে একজন ডাক্তার --> He is a doctor; the Bengali pronoun সে is gender-neutral).
  • Attention-- In vanilla seq2seq, only the last hidden state of the source sentence is used for translation. This is an information bottleneck. Attention addresses this by providing a soft weighting over the encoder hidden states.
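
A minimal sketch of the beam-search decoding with length normalization described in the bullets above; `step_log_probs` here is a toy stand-in for the decoder's next-token log-probabilities, not a real NMT model:

```python
import math

def beam_search(step_log_probs, vocab, k=5, max_len=30, end_token="<END>"):
    beam = [([], 0.0)]            # (prefix, summed log-probability)
    finished = []                 # completed hypotheses (ending in <END>)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for word, logp in step_log_probs(prefix, vocab).items():
                candidates.append((prefix + [word], score + logp))
        # Keep only the k best hypotheses on the fringe.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:k]:
            (finished if prefix[-1] == end_token else beam).append((prefix, score))
        if not beam:
            break
    # Length-normalize so longer hypotheses are not unfairly penalized.
    finished = finished or beam
    return max(finished, key=lambda c: c[1] / len(c[0]))

# Toy "decoder": prefers "b" for two steps, then strongly prefers <END>.
def step_log_probs(prefix, vocab):
    if len(prefix) < 2:
        return {w: math.log(0.6 if w == "b" else 0.2) for w in vocab}
    return {"<END>": math.log(0.9), "a": math.log(0.05), "b": math.log(0.05)}

best, score = beam_search(step_log_probs, vocab=["a", "b", "<END>"], k=2)
print(best)   # ['b', 'b', '<END>']
```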

Attention diagram

Mathematically, it can be described as follows,

Attention equations
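
A standard formulation (dot-product scores assumed), with encoder hidden states h_1, ..., h_N and decoder state s_t:

```latex
e^t_i = s_t^\top h_i \quad (i = 1, \dots, N), \qquad
\alpha^t = \mathrm{softmax}(e^t), \qquad
a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i, \qquad
\tilde{h}_t = [a_t ; s_t]
```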

  • Pros: improves NMT performance, solves the bottleneck problem, helps with the vanishing gradient problem, and provides some interpretability. SMT learned alignment explicitly; attention learns this alignment for free (it learns the structure in an unsupervised way).

  • Generally, attention is a technique to compute a weighted sum of (vector) values, dependent on a query (vector). Often we say the query attends to the values. In seq2seq, the decoder hidden state is the query and the encoder hidden states are the values. Doing attention involves three steps-- (1) compute attention scores (eg. the dot product of the query, i.e. the decoder hidden state, with each value, i.e. each encoder hidden state; multiplicative attention-- the score is a bilinear function with a learnable weight matrix; additive attention-- learn weights applied to the query and the values, pass the sum through a non-linearity like tanh, and project to a scalar score), (2) get an attention distribution (eg. a softmax over the attention scores), (3) obtain the attention output (called the context vector) by taking a weighted sum of the values (i.e. an expected value under the attention distribution). (See the sketch after this list.)

  • Other Resources on Attention
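
A minimal sketch (illustrative, not from the lecture) of the three steps above, with the three common scoring functions; the function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def attention(values, query, scoring="dot", W=None, W1=None, W2=None, v=None):
    """values: (N, d) encoder hidden states; query: (d,) decoder hidden state.
    W, W1, W2, v are learnable parameters for the non-dot variants."""
    # (1) attention scores
    if scoring == "dot":                 # e_i = h_i . s
        scores = values @ query
    elif scoring == "multiplicative":    # e_i = h_i^T W s (bilinear)
        scores = values @ (W @ query)
    elif scoring == "additive":          # e_i = v^T tanh(W1 h_i + W2 s)
        scores = torch.tanh(values @ W1.T + query @ W2.T) @ v
    # (2) attention distribution
    alpha = F.softmax(scores, dim=0)
    # (3) attention output: weighted sum (expectation) of the values
    context = (alpha.unsqueeze(1) * values).sum(dim=0)
    return context, alpha

# Example: 6 encoder states of size 8, dot-product scoring.
context, alpha = attention(torch.randn(6, 8), torch.randn(8))
```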

Tips and Tricks

  • Gated Recurrent Units (GRUs)-- relation with attention: we want to create shortcut connections that can help amplify the backprop signal, thereby preventing vanishing gradients.
  • <UNK> Unknown word token. UNKs in the source vocabulary are not a problem unless you are using pre-trained embeddings that may not have embeddings for these words (eg. Indian names). For the output, the softmax can become slow. Use a hierarchical softmax-- a tree structure that speeds up the computation. Noise-contrastive estimation speeds up training in general. Work with a vocabulary subset (Jean, Cho, Memisevic, Bengio ACL 2015); copy words over directly (eg. names, FedEx IDs).
  • BLEU (Bilingual Evaluation Understudy)-- at the initial stage, it correlated with human judgments. Of late, people have gamed the score and built models whose BLEU scores make it appear that translation quality is at par with human translation. This is clearly not the case (when one directly optimizes a metric, it stops being useful).
    • Many alternatives have come up-- TER, METEOR, MaxSim, SEPIA, and RTE-MT.
    • TERpA can handle variation in word choice and is good.
    • ROUGE is also a decent metric; it has been used in summarization.
  • Evaluation (other tips)-- never use test data as validation data (to tune hyperparameters). Overfitting to training data, to an extent, can be good for deep neural networks, but too much memorization can lead to bad test-time performance; use a dev/validation set to strike this balance. Realize that manually changing your model based on accuracy on the dev data (or test data) can make you overfit to them. Thus, competitions on Kaggle have a secret test data set.
  • Debugging is the most challenging part. Incremental work is a good way to go-- start with a minute dataset and smaller models.
  • Even on a large dataset, train your model till it reaches (close to) 100% training accuracy. If it does not, increase your model size. Then regularize it (eg. using dropout) till validation accuracy levels off.

Training GRU tips

Vanishing Gradients and Fancy RNNs

  • Vanishing Gradient -- What is the gradient dJ / dh_1 (loss w.r.t. an early hidden state)? By the chain rule it contains a product of one Jacobian per unrolled step, each involving the recurrent weight matrix W. If W has small values (norm less than 1), this repeated multiplication shrinks exponentially, so the gradients reaching earlier hidden states become smaller and smaller. Over the RNN's unrolled architecture, the gradient vanishes.

    • The magnitude of the gradient signal from a nearby loss is much larger than that from a far-away loss. Thus, the long-term effects of modifying weights are lost.
    • Hence, the vanishing gradient is a problem when we need to learn whether a dependency exists between steps t and t+n; a vanished gradient cannot help the model answer that question.
    • The writer of the books ___ (is or are). Syntactic recency (writer) vs sequential recency (books).
  • Exploding Gradients-- too-large steps can move the parameters to a bad region of the loss surface (and some of them might become Inf or NaN). To solve this, use gradient clipping: if the norm of the gradient is greater than some threshold, scale it down (multiply by threshold / ||g||). Think of steps near a cliff (Deep Learning textbook, Section 10.11.1).

    • Is this similar to regularization of some kind? In regularization, one wants to reduce the weights of the parameters (and hopefully this helps address overfitting); here, the reason to keep the gradient norm from becoming large is to be careful and avoid taking huge steps when the gradient value is large (not clear that this addresses overfitting). In comparison to momentum: while gradient clipping cares about not taking large steps when gradients are large, momentum encourages large steps if successive gradients point in the same (or a similar) direction and discourages them if the direction changes.
  • Fixing vanishing gradients

    • In a vanilla RNN, the hidden state is constantly being rewritten. Thus, preserving information over many time steps becomes hard.
    • Can we use a separate memory? That is the motivating idea behind the LSTM (Hochreiter and Schmidhuber, 1997).
    • At each time step, there is a hidden state and a cell state, the latter being a memory cell which can be erased, written to, and read from.
    • Gates are also vectors of length n. Each element of a gate can be open (1), closed (0), or somewhere in between. Gates are dynamic and can take different values based on the current context. Formally (from one of the best slides about LSTMs):

LSTM Cell Equations
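
For reference, the standard LSTM cell equations (what the slide presumably shows), with forget, input, and output gates f_t, i_t, o_t:

```latex
f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f), \quad
i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i), \quad
o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)
\\
\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t)
```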

Its relation to the mandatory picture in the popular post on understanding LSTMs:

LSTM Cell Visualization

(Why two tanh? Does it give some extra expressivity?) (Why compute the forget gate only from the hidden state and input but not from the content of the cell state? Probably because you want to learn a general function, though I am not completely convinced; also, the hidden state is computed using the previous cell state, so it has some indirect influence.) Note that LSTMs can still have the vanishing and exploding gradient problems; it has just been shown that they are more robust to them and can handle long-term dependencies. They were state of the art in handwriting recognition, speech recognition, machine translation, parsing, image captioning, etc. (2013-15). Currently (i.e. 2019), Transformers have become more dominant.

  • Gated Recurrent Units (GRUs) [Cho et al. 2014] -- an update gate replaces the forget and input gates of the LSTM cell (it does not seem like a precise analogy, though). GRUs don't have an explicit memory cell, but the update gate helps retain long-term information, helping to address (to an extent) the vanishing gradient problem.

GRU Cell Equations
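
For reference, the standard GRU equations, with update gate u_t and reset gate r_t:

```latex
u_t = \sigma(W_u h_{t-1} + U_u x_t + b_u), \qquad
r_t = \sigma(W_r h_{t-1} + U_r x_t + b_r)
\\
\tilde{h}_t = \tanh\big(W_h (r_t \odot h_{t-1}) + U_h x_t + b_h\big), \qquad
h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t
```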

  • LSTM vs. GRUs -- GRUs are quicker to compute, but there is no conclusive evidence that one consistently outperforms the other. LSTM is a good default choice (there is a little evidence that LSTMs might be better for longer dependencies). Go with LSTMs; if efficiency is a concern, try GRUs, and if performance stays the same or improves, stick with GRUs for that particular task.

  • Vanishing gradients also occur in CNNs and feed-forward networks, making the lower layers very hard to train. Hence, people add direct connections (from lower to higher layers). ResNets have residual connections (identity skip connections) to address this; these networks are much easier to train. The idea of DenseNets is to add skip connections from all layers to all layers. HighwayNets add highway connections, basically residual connections controlled by a dynamic gate (inspired by LSTMs).

    • RNNs are particularly susceptible to vanishing gradients because the same weight matrix is multiplied in at every unrolled step (Bengio et al. 1994).
  • Bidirectional RNNs -- One can lose information by only considering the left context (eg. "the movie was terribly exciting!" for sentiment classification). Using the right context (i.e. exciting, to interpret the word terribly) might be important to conclude that the sentence indicates positive sentiment. Use a forward RNN (encodes the sentence left to right) and a backward RNN (encodes the sentence right to left); using the concatenated hidden states from both RNNs as the final representation works better in many cases (eg. sentiment classification).

    • Note that bidirectional RNNs may not apply to all scenarios, eg. language modeling (because you don't have the right context at prediction time). BERT (Bidirectional Encoder Representations from Transformers) is a powerful contextual representation system built on bidirectionality (later in the course).

Bidirectional RNNs

  • Multi-layer RNNs (also called Stacked RNNs)-- stacking multiple RNN layers can help learn more complex representations. PyTorch probably computes all hidden states of layer 1, then layer 2, etc., but you can do it in a different order (i.e. hidden state 1 of all layers, then hidden state 2 of all layers, and so on). Note that this flexibility goes away if you have a bidirectional RNN.
    • Since RNNs need sequential rather than parallel computation, multi-layer RNNs are expensive to compute if they become too deep (skip/dense connections are needed by around 8 layers). Transformer-based nets (eg. BERT) go up to 24 layers. (See the sketch after the diagram below.)

Multi-layer RNNs
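
A minimal sketch (assumed PyTorch usage, not from the notes) of a stacked bidirectional LSTM combining the two ideas above:

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(
    input_size=100,      # word-embedding dimension
    hidden_size=256,     # hidden size per direction
    num_layers=3,        # stacked (multi-layer) RNN
    bidirectional=True,  # forward + backward RNNs
    batch_first=True,
)

x = torch.randn(8, 20, 100)        # batch of 8 sentences, 20 tokens each
outputs, (h_n, c_n) = encoder(x)
print(outputs.shape)  # (8, 20, 512): forward/backward states concatenated per token
print(h_n.shape)      # (6, 8, 256): num_layers * num_directions final hidden states
```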

Giggles- Clip or you fall off a cliff, Deep multi-layer RNNs need deep pockets

Language Modeling

  • Given a phrase, what is the next word? You can also ask what is the probability of a piece of text? (Google search, Gmail, mobile keyboard).
  • N-gram LMs have to ignore context beyond the previous n-1 words. What if a particular n-gram or (n-1)-gram was not seen in the corpus (sparsity problem)? When the numerator is zero (the n-gram was not seen), this is generally fixed by adding a small epsilon to the probability of all words (smoothing). When the denominator is zero (the context was not seen), back off to an (n-2)-gram context, and so on. Larger values of n exacerbate the sparsity problem. Also, the number of counts to store increases as n increases.
  • Fixed-window neural language model-- take a window of words, concatenate their embeddings, and treat language modeling as a classification task. No sparsity problem, but the fixed window might still not be large enough, and enlarging the window needs more network capacity. A subtle point is that a Multi-Layer Perceptron (MLP) over the window learns weights that may be redundant (columns of the weight matrix corresponding to different positions share commonalities).
  • Recurrent Neural Networks -- replace the per-position weights with a single weight matrix that is used for all input words. A word is input at a time, along with the present state (called the hidden state). The RNN unit outputs a new state, which is used in conjunction with the next word to generate the next state, and so on, until the last hidden state is generated. The final hidden state is then used to generate a softmax over the vocabulary.
    • How to backprop in this case? Does one add gradients over every unrolling of the recurrent layer? Yes. Each step has a loss associated with it, i.e. between the word predicted from the hidden state at step t and the actual word at step t. Averaging all the losses over the entire corpus is too big, hence the sequence is a shorter unit of text such as a sentence (or an utterance, for chatbots). This is called backpropagation through time. Realize that you have to hold on to all the intermediate hidden states for backpropagation through time to work.
    • Learns a general weight matrix that can be used with any prefix of text to generate a meaningful next word. It probably works because of the high dimensionality of the hidden state.
    • How well does it capture the history of words seen? Apparently, vanilla RNN's hidden state does not capture a long history.
    • One can combine an RNN with handwritten, hacky rules (eg. in beam search), but in general this can be difficult to do.
    • Perplexity is a standard evaluation metric for language models. It is equal to the exponential of the cross-entropy loss; lower is better. (See the formula after this list.)
    • Why is language modeling important? (1) It is a benchmark task for measuring progress in understanding language (grammar, syntax, real-world knowledge, logic, etc.); (2) it is a sub-component of many NLP tasks that seek to generate text or estimate the probability of text-- predictive typing, speech recognition, handwriting recognition, spelling/grammar correction, authorship identification (train a separate LM for each author; given a text, which model gives it the highest probability?), machine translation, summarization, dialog, sentiment classification, question answering.
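
The perplexity formula referred to above: the exponential of the average per-token cross-entropy,

```latex
\mathrm{PP} = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T} \log P\big(x_t \mid x_1, \dots, x_{t-1}\big)\Big) = \exp(\text{average cross-entropy loss})
```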

Dependency Parsing

  • Syntactic Structure - Constituency and Dependency
    • Phrase structure -- sentences are built out of units that can be nested (words -> phrases -> sentences), described with context-free grammars (CFGs). (Idea: using CFGs as a validator.) This view is dominant among linguists.
  • Dependency Grammar and Treebanks -- which words depend on or modify which other words. Prepositional-phrase attachment ambiguity (eg. Scientists count whales from space). Coordination scope ambiguity (eg. Doctor: No heart and cognitive issues). Adjective modifier ambiguity (eg. Students get first hand job experience). Verb-phrase ambiguity (eg. Mutilated body washes up on Rio beach to be used for the Olympics beach volleyball).
    • There is ambiguity in convention about which way the arrows point.
    • Here: start at the head and point the arrow to the dependent.
    • Treebanks -- Universal Dependencies
    • When constructing a dependency tree, each word is the dependent of exactly one other word or of ROOT. Avoid cycles, because you want a tree. Generally, dependencies are not allowed to cross.
    • Transition-based (arc) dependency parsing. Dynamic-programming approaches date to the 1960s; then, in 2005, Nivre and Hall showed that one can use ML for fast, linear-time parsing.
    • Evaluating dependency parsing: human labels are called gold arcs, and parsed arcs are the ones generated by the algorithm. The unlabelled attachment score (UAS) ignores the labels (subj, root, det, obj, etc.) and is just the ratio of correct arcs to the total number of dependency arcs; the labelled attachment score (LAS) also requires the label to match. (See the sketch at the end of this section.)
  • Neural Dependency Parsing [Chen and Manning 2014]
    • Hand-engineered features are very sparse and thus incomplete.
    • Computing these hand-engineered features was difficult/costly (about 95% of parsing time).
    • Use a neural network to choose the correct action (left/right arc or shift) used in Nivre's parser. Compared to graph-based and shift-reduce parsers, it achieves comparable accuracy but is much faster.
    • Words represented as a d-dimensional dense vector (word embedding).
    • They used part-of-speech and dependency labels also represented as d-dimensional vectors.
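
A minimal sketch (illustrative; the sentence and labels are made up) of UAS/LAS evaluation as described above:

```python
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, label) pairs, one per word."""
    assert len(gold) == len(pred)
    uas = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    las = sum(1 for g, p in zip(gold, pred) if g == p)   # head AND label match
    return uas / len(gold), las / len(gold)

# "I saw her" with hypothetical arcs (head index 0 = ROOT, words indexed from 1):
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]   # correct head, wrong label
uas, las = attachment_scores(gold, pred)          # uas = 1.0, las ≈ 0.67
```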

Backpropagation

  • Backprop
    • Regularization -- L2 regularization
    • Vectorization helps speed. (CPU: 600 vs 52 nano-sec for loop vs. matrix-based dot product)
    • Non-linearities -- hard tanh is fast to calculate; the exponentiations in tanh slow you down. Hard tanh has been shown to work well, so people tried simpler things like ReLU (negative values yield zero gradient; positive values pass through as the identity). You can do a bit better by putting a little slope on the negative side (Leaky ReLU, Parametric ReLU).
    • Parameter Initialization -- Xavier initialization uses a variance that is inversely proportional to the sum of the previous layer size (fan-in) and the next layer size (fan-out). (See the formula after this list.)
    • Decay the learning rate as epochs increase (eg. halve it after every x epochs); cyclic learning rates are another option.
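
The Xavier initialization rule mentioned above, in formula form:

```latex
\mathrm{Var}(W_{ij}) = \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}
```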

Continuous representation of words

  • Understanding word vectors
  • GloVe Embeddings
  • For most neural models used in NLP, you will need to embed words in a continuous vector space. If your training corpus is not large, using pre-trained embeddings generally works best (basically, do not propagate the gradients back to the input embedding layer). If your training corpus is large, fine-tuning works wonders. For tasks with lots of data (eg. Chinese-to-English translation), learning the embeddings from scratch may work. (See the sketch below.)
  • Note that the vector space represents contextual information. Thus, antonyms, eg. good and bad, that are found in the same contexts may occur close to one another. Work on counter-fitted word embeddings uses an after-the-fact optimization to inject synonym and antonym relations by adjusting the existing vector space (cuts close to work done by me in 2018).
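
A minimal sketch (assumed PyTorch usage) of the freeze-vs-fine-tune advice above; the `pretrained` matrix is a random placeholder for real GloVe vectors:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # placeholder for a real GloVe matrix

# Small training corpus: freeze, i.e. do not propagate gradients into the embeddings.
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large training corpus: fine-tune the embeddings with the rest of the model.
finetuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[1, 42, 7]])
vectors = frozen_emb(token_ids)        # shape (1, 3, 300)
```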