atinsood/summary.md

## summary.md

      
A Deep Dive into the Wonderful World of Preprocessing in NLP

https://mlexplained.com/2019/11/06/a-deep-dive-into-the-wonderful-world-of-preprocessing-in-nlp/
Many thanks to the author for writing in such depth and taking the time to explain these topics.
Normalization(Cleaning) → Segmentation(tokenization) → Numericalization(vocab matching)
features

can be dense or sparse
word counts etc. are sparse features. Rasa has a good post about all the sparse features it uses for training
what are the dense features?? word embeddings are the usual example (bag of words is a sparse feature, not a dense one)

1. normalization


for example, changing case to lowercase
Unicode normalization. For instance, the character ë can be represented as a single Unicode character "ë" or as two Unicode characters: the character "e" and a combining accent. Unicode normalization maps both of these to a single, canonical form.
if you have large amounts of data you can possibly skip normalization, since the model can learn some of these patterns from the data itself
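
To make the ë example concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalize` and the choice of the NFC form plus lowercasing are assumptions for illustration, not something the original post prescribes.

```python
import unicodedata

composed = "\u00eb"     # "ë" as a single code point
decomposed = "e\u0308"  # "e" followed by a combining diaeresis

def normalize(text: str) -> str:
    """Apply NFC Unicode normalization, then lowercase (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).lower()

# The two spellings render the same but compare as different strings...
print(composed == decomposed)                        # False
# ...until both are mapped to the same canonical form.
print(normalize(composed) == normalize(decomposed))  # True
```

NFKC is a more aggressive alternative that also folds compatibility characters (for example full-width letters); which form is appropriate depends on the data.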

2. tokenization


2.1 rule-based tokenizers


the spaCy tokenizer first splits on spaces and then goes through each substring, applying additional tokenization per its defined rules.
all of spaCy's built-in rules can be overridden by custom rules
hits limitations with languages like Chinese, which require a lot of sophisticated rules
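
The "split on spaces, then apply rules to each substring" idea can be sketched in plain Python. This is not spaCy's implementation; the special-case table and the prefix/suffix character sets below are hypothetical and only illustrate the shape of such a tokenizer.

```python
# Hypothetical rules; a real tokenizer ships far larger, language-specific tables.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.S.": ["U.S."]}
PREFIXES = set('(["\'')
SUFFIXES = set(')]"\'.,!?;:')

def tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        trailing = []  # suffixes are collected and emitted after the word
        while chunk:
            if chunk in SPECIAL_CASES:
                # Special cases win over every other rule.
                tokens.extend(SPECIAL_CASES[chunk])
                chunk = ""
            elif chunk[0] in PREFIXES:
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:
                trailing.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                tokens.append(chunk)
                chunk = ""
        tokens.extend(trailing)
    return tokens

print(tokenize("I don't live in the U.S. (yet)!"))
```

Every new language or domain means more entries in these tables, which is where the limitation noted above comes from.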


2.2 subword tokenization


more frequent words should be given unique ids, while less frequent words should be decomposed into subword units that best retain their meaning. The idea is that common words appear often enough in the dataset for the model to learn their meaning.
4 major subword tokenization algorithms

byte-pair encoding
wordpiece
unigram lang model
sentencepiece


limitations of subword tokenizer

better off using a rule-based tokenizer when data is scarce (why??)
training the subword tokenizer model can be expensive, even though the tokenization itself might be cheap
tokenizers that operate at the character/byte level might allocate vocab space to variations of the same word, e.g. dog, dog!, dog?


2.2.1 BPE

Byte Pair Encoding


derives roots from information theory and compression


bottom-up (because we split everything down to characters first and then build up) tokenization algorithm that learns a subword vocab of a certain size (vocab size is a hyperparameter)

assumes that some pre-tokenization technique has already given you a list of words in a sentence/document
start by splitting all words into Unicode characters; each Unicode character corresponds to a symbol in the final vocab. This is our minimal vocab and we expand on it. (This is the second part of subword tokenization: less frequent words should be decomposed into subword units that best retain their meaning.)
while we still have room in the vocab:

find the most frequent symbol bigram
merge those two symbols to create a new symbol and add it to the vocab.
this is the first part of subword tokenization: more frequent words should be given unique ids
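
The merge loop above can be sketched in a few lines of Python. This is a toy version under stated assumptions: `learn_bpe`, `word_counts`, and `num_merges` are hypothetical names, the input is an already pre-tokenized word-to-count mapping, and pair counts are recomputed from scratch each round rather than updated incrementally.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Start from the minimal vocab: every word is a tuple of characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return vocab, merges

vocab, merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
```

With these toy counts the frequent suffix "est" ends up as a single symbol after two merges, while rarer material stays split into characters.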
dealing with unseen tokens in vocab

one approach is to assign an UNKNOWN token
another approach is to assign a token to every Unicode character, irrespective of whether the character was in the dataset. But this is unrealistic considering the number of possible Unicode characters
the GPT-2 paper's approach is to treat the input as a sequence of bytes instead of Unicode characters and allocate an id to every possible byte (256 values). Since Unicode characters are represented by a variable number of bytes, even an unknown character can be broken into its constituent bytes and tokenized accordingly.
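
A small sketch of the byte-level idea, assuming UTF-8 as the encoding. It only shows the 256-symbol base vocabulary; GPT-2 additionally learns BPE merges on top of these bytes, which is omitted here. The function names are hypothetical.

```python
def byte_ids(text: str) -> list[int]:
    """Map text to ids in range(256), one id per UTF-8 byte."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids: list[int]) -> str:
    """Invert byte_ids; no character can ever be out of vocabulary."""
    return bytes(ids).decode("utf-8")

ids = byte_ids("dog 🐶")
print(ids)                 # [100, 111, 103, 32, 240, 159, 144, 182]
print(from_byte_ids(ids))  # dog 🐶
```

The emoji was never "seen" by any vocabulary, yet it decomposes into four known byte ids and round-trips without an UNKNOWN token.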


2.2.2 Wordpiece (need to read more)

Wordpiece

most famous because of its usage in BERT
nearly identical to BPE. The only difference is that instead of merging the most frequent symbol bigram, the model merges the bigram that, when merged, would most increase the likelihood of a unigram language model trained on the training data.
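
To see how this selection differs from BPE, here is a sketch that scores each pair as count(ab) / (count(a) * count(b)). This formula is a commonly cited way of expressing the likelihood criterion and is an assumption here, not something taken from the original post; `best_wordpiece_pair` and the toy corpus are hypothetical.

```python
from collections import Counter

def best_wordpiece_pair(corpus):
    """Pick the pair to merge; `corpus` maps a tuple of symbols to a word count."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, count in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += count

    def score(pair):
        # Assumed score: pair frequency normalized by the frequency of its parts.
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

    return max(pair_counts, key=score)

corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
print(best_wordpiece_pair(corpus))  # ('g', 's')
```

Plain BPE would merge ("u", "g") here because it is the most frequent pair (20 occurrences); the normalized score instead prefers ("g", "s"), whose parts rarely occur apart from each other.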

2.2.3 Unigram language model (need to read more)

2.2.4 Sentencepiece


all the algorithms so far require pre-tokenization (they operate at the word level, so the sentence still needs to be tokenized before being fed to them), which makes them harder to use for languages like Chinese that cannot simply be split on whitespace.


their output can be hard to de-tokenize


SentencePiece resolves both of these by treating the input as a raw stream of Unicode characters and then using either BPE or the unigram LM at the character level to construct the vocab. This means that whitespace is also included in the tokenization.


For example, depending on the trained model, "I like natural language processing" might be tokenized like
"I", "_like", "_natural", "_lang", "uage", "_process", "ing"
where the whitespace character is replaced with the underscore ("_") for clarity. Note the distinction with a word-level subword tokenizer such as BPE or WordPiece, where the above sequence with the same subwords might be tokenized as
"I", "like", "natural", "lang", "##uage", "process", "##ing"
where word-internal subwords are prepended with a special marker (the "##" shown here is the convention used by BERT's WordPiece).


Prepending subwords with a special marker only makes sense with a model that pretokenizes, since the sentencepiece model does not know anything about word boundaries.
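
A short sketch of why keeping whitespace inside the pieces makes de-tokenization trivial. It uses "▁" (U+2581), the character SentencePiece uses to stand in for a space; the piece lists are hand-written to match the example above rather than produced by a trained model, and both function names are hypothetical.

```python
SPACE = "\u2581"  # "▁", the visible stand-in for a space

def detokenize(pieces):
    """Whitespace is part of the pieces, so joining them restores the text."""
    return "".join(pieces).replace(SPACE, " ")

def detokenize_marked(pieces):
    """'##'-style pieces: glue continuations, then guess one space per word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)

sp_pieces = ["I", "▁like", "▁natural", "▁lang", "uage", "▁process", "ing"]
marked_pieces = ["I", "like", "natural", "lang", "##uage", "process", "##ing"]

print(detokenize(sp_pieces))             # I like natural language processing
print(detokenize_marked(marked_pieces))  # I like natural language processing
```

Both recover this sentence, but the second version has to assume exactly one space between words, so any original spacing that differs from that assumption is lost.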


why can sentencepiece afford to treat the input as a single stream of characters, given that finding the most frequent symbol bigram is a prohibitively expensive operation in BPE? The reason is that sentencepiece uses a priority queue-based algorithm, reducing the asymptotic runtime from O(N^2) to O(N log N).


sentencepiece also applies Unicode normalization.


3. Numericalization (vocab mapping)


it's good practice to periodically check which words are being treated as out of vocab, as a sanity check. GPT-2's trick of treating the input as bytes is an interesting way of handling unknown inputs.
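
One way to run that sanity check is sketched below. `oov_report`, `token_lists`, and `top_k` are hypothetical names, and the input is assumed to be already tokenized text plus a fixed vocabulary set.

```python
from collections import Counter

def oov_report(token_lists, vocab, top_k=10):
    """Return the out-of-vocab rate and the most common OOV tokens."""
    total = 0
    oov = Counter()
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token not in vocab:
                oov[token] += 1
    rate = sum(oov.values()) / total if total else 0.0
    return rate, oov.most_common(top_k)

vocab = {"the", "cat", "sat"}
docs = [["the", "cat", "sat"], ["the", "dgo", "sat"], ["dgo"]]
rate, worst = oov_report(docs, vocab)
print(f"OOV rate: {rate:.2%}", worst)
```

Looking at the most frequent offenders often surfaces normalization problems (typos, casing, stray punctuation) rather than genuinely new words.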

Open vocabularies


the vocab is not created ahead of time; tokens are mapped to ids on the fly
especially useful when there is a need to handle continuous streams of data and building the vocab over and over is expensive and error-prone.
uses the hashing trick to map tokens to ids based on their hash value. Since the vocab is determined solely by the hash function, it never needs to be rebuilt. There's an obvious chance of hash collisions.
the vocab can overfit to the training set if it is extremely large. One potential solution is to train a subword tokenizer on a huge, unlabeled corpus so that it extracts subwords relevant to the language as a whole and not to a particular dataset. Another approach is to build the vocab from a subset of the training set, so that the model encounters unknown tokens during training and learns how to deal with them.
interesting paper on using a student-teacher model to learn a new vocab and/or reduce the size of the vocab.
data augmentation techniques can possibly be used as a pre-processing step and might help in strengthening the vocab
stemming and lemmatization might not be of much use in neural network based models, since the features they extract will already be addressed by subword tokenization
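
The hashing trick mentioned above for open vocabularies can be sketched as follows. The bucket count and the use of MD5 are assumptions for illustration; the point is only that a stable hash is needed, because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib

NUM_BUCKETS = 2 ** 18  # hypothetical id space; larger means fewer collisions

def token_to_id(token: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map any token to an id without ever building a vocabulary."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# The same token always gets the same id, across runs and machines.
print(token_to_id("dog"), token_to_id("never-seen-before-token"))

# With a tiny id space collisions are guaranteed: 5 tokens, 4 buckets.
print([token_to_id(t, num_buckets=4) for t in ["a", "b", "c", "d", "e"]])
```

Colliding tokens share an embedding row, so the bucket count trades memory against how often unrelated tokens get conflated.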