Last active Jun 30, 2020
Summary for the article "A Deep Dive into the Wonderful World of Preprocessing in NLP"

A Deep Dive into the Wonderful World of Preprocessing in NLP

Many thanks to the author for writing in such depth and taking the time to explain these topics.

Normalization(Cleaning) → Segmentation(tokenization) → Numericalization(vocab matching)


  • the resulting features can be dense or sparse
  • word counts etc. are sparse features. Rasa has a good post on all the sparse features it uses for training
  • what are the dense features? (e.g. word embeddings; note that bag of words is count-based and usually sparse)

1. normalization

  • for example, converting text to lowercase
  • unicode normalization. For instance, the character ë can be represented as a single unicode character "ë" or two unicode characters: the character "e" and an accent. Unicode normalization maps both of these to a single, canonical form.
  • with large amounts of data you can possibly skip normalization, since the model can learn some of these patterns from the data itself
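
A minimal sketch of the unicode normalization point above, using Python's standard `unicodedata` module. The helper name `normalize` and the choice of the NFC form plus lowercasing are assumptions for illustration, not something the article prescribes.

```python
import unicodedata

composed = "\u00eb"      # "ë" stored as a single code point
decomposed = "e\u0308"   # "e" followed by a combining diaeresis

def normalize(text, form="NFC"):
    """Map equivalent unicode sequences to one canonical form, then lowercase."""
    return unicodedata.normalize(form, text).lower()

# The raw strings render the same but compare as different.
print(composed == decomposed)
# After normalization both spellings collapse to the same string.
print(normalize(composed) == normalize(decomposed))
```

Without this step the two spellings of "ë" would end up as different tokens in the vocab.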

2. tokenization

  2.1 rule-based tokenizers
  • the spaCy tokenizer first splits on spaces and then looks through each of the substrings, applying additional tokenization per the rules defined.
  • all the built-in rules for spaCy can be overridden by custom rules
  • hits limitations with languages like Chinese, which require a lot of sophisticated rules
  2.2 subword tokenization
  • more frequent words should be given unique ids while less frequent words should be decomposed into subword units that best retain their meaning. the idea is that common words appear often enough in the dataset for the model to learn their meanings.
  • 4 major subword tokenization algs
    • byte-pair encoding
    • wordpiece
    • unigram lang model
    • sentencepiece
  • limitations of subword tokenizer
    • better off using a rule-based tokenizer when there is little data (why??)
    • learning the subword tokenizer model can be expensive, even though the tokenization itself might be cheap
    • tokenizers that operate at the character/byte level might allocate vocab space to variations of the same word, e.g. dog, dog!, dog?

2.2.1 BPE

Byte Pair Encoding

  • derives roots from information theory and compression

  • bottom-up tokenization algorithm (bottom-up because we first split everything down to characters and then merge upward) that learns a subword vocab of a fixed size (vocab size is a hyperparameter)

    • assumes that some pre-tokenization technique has already given you a list of words in a sentence/document
    • start by splitting all words into unicode characters; each unicode character corresponds to a symbol in the final vocab. this is our minimal vocab and we expand on it. (this is the second part of subword tokenization: less frequent words should be decomposed into subword units that best retain their meaning)
    • while we still have room in the vocab:
      • find the most frequent symbol bigram
      • merge those two symbols to create a new symbol and add it to the vocab
      • this is the first part of subword tokenization: more frequent words should be given unique ids
    • dealing with tokens not seen in the vocab
      • one approach is to assign an UNKNOWN token
      • another approach is to assign a token to every unicode character, irrespective of whether it appeared in the dataset. but this is unrealistic considering the number of possible unicode characters
      • the GPT-2 paper's approach is to treat the input as a sequence of bytes instead of unicode characters and allocate an id to every possible byte. since unicode characters are represented by a variable number of bytes, even an unknown character can be broken into its constituent bytes and tokenized accordingly.
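
The merge loop described above can be sketched roughly as follows. This is a toy illustration, not a production trainer: the function name `learn_bpe`, the toy corpus, the use of a merge count instead of a target vocab size, and the tie-breaking (first pair seen wins) are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of pre-tokenized words."""
    # Each distinct word becomes a tuple of single-character symbols with a count.
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for word in corpus for symbol in word}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, vocab

# Toy corpus (an assumption): frequent endings like "est" get merged early.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, num_merges=3)
print(merges)
```

Recounting all pairs on every iteration is the naive approach; it keeps the sketch short but is slow on real corpora.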

    2.2.2 Wordpiece (need to read more)


    • most famous because of its usage in BERT
    • almost identical to BPE. The only difference is that instead of merging the most frequent symbol bigram, the model merges the bigram that, when merged, would most increase the likelihood of a unigram language model trained on the training data.

    2.2.3 Unigram language model (need to read more)

    2.2.4 Sentencepiece

    • all the algs so far require pre-tokenization (they operate at the word level, so the sentence still needs to be split into words before being fed to the algorithm), which is a problem for languages like Chinese that cannot simply be split on whitespace.

    • their output can also be hard to de-tokenize

    • sentencepiece resolves both of these by treating the input as a raw stream of unicode characters and then using either BPE or a unigram LM at the character level to construct the vocab. this means that whitespace is also included in the tokenization.

    • For example, depending on the trained model, "I like natural language processing" might be tokenized like

      "I", "_like", "_natural", "_lang", "uage", "_process", "ing"

      where the whitespace character is replaced with the underscore ("_") for clarity. Note the distinction with BPE, where the above sequence with the same subwords might be tokenized as

      "I", "like", "natural", "lang", "##uage", "process", "##ing"

      where subwords are prepended with a special marker.

    • Prepending subwords with a special marker only makes sense with a model that pretokenizes, since the sentencepiece model does not know anything about word boundaries.

    • why can sentencepiece afford to treat the input as a single stream of characters, given that finding the most frequent symbol bigram is a prohibitively expensive operation in BPE? The reason is that sentencepiece uses a priority-queue-based algorithm, reducing the asymptotic runtime from O(N^2) to O(N log N).

    • sentencepiece also applies unicode normalization.
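
A small sketch of why keeping whitespace inside the tokens makes de-tokenization easy, compared with the prefix-marker style shown above. The helper names `detokenize` and `detokenize_marked` are hypothetical; sentencepiece itself uses the "▁" character (U+2581) where the example above writes an underscore, and the pieces below simply mirror that example.

```python
MARKER = "\u2581"  # "▁", the visible stand-in for a space

def detokenize(pieces):
    """Whitespace lives inside the pieces, so joining them is lossless."""
    return "".join(pieces).replace(MARKER, " ")

def detokenize_marked(pieces, prefix="##"):
    """Prefix-marker style: glue continuation pieces, then guess single spaces."""
    words = []
    for piece in pieces:
        if piece.startswith(prefix) and words:
            words[-1] += piece[len(prefix):]
        else:
            words.append(piece)
    return " ".join(words)

pieces = ["I", MARKER + "like", MARKER + "natural", MARKER + "lang", "uage",
          MARKER + "process", "ing"]
marked = ["I", "like", "natural", "lang", "##uage", "process", "##ing"]

print(detokenize(pieces))
print(detokenize_marked(marked))
```

The second function has to assume exactly one space between words, so any original spacing that differed from that is lost, which is the de-tokenization problem noted above.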

3. Numericalization (vocab mapping)

  • it's good practice to periodically check which words are being treated as out of vocab, as a sanity check. GPT-2's trick of treating the input as a sequence of bytes is an interesting way of handling unknown inputs.
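
A rough sketch of the two ideas in this section: a plain vocab lookup with an unknown token, and a byte-level fallback in the spirit of the GPT-2 trick. The tiny `vocab`, the `<unk>` name, and the helper names are assumptions for illustration; the real GPT-2 tokenizer additionally runs BPE merges on top of the bytes.

```python
vocab = {"<unk>": 0, "dog": 1, "cat": 2}  # toy vocab, assumed for the example

def numericalize(tokens):
    """Map tokens to ids, sending anything out of vocab to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

def to_byte_ids(text):
    """Encode text as UTF-8 and use each byte value (0-255) as a token id."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Invert to_byte_ids by decoding the bytes back to text."""
    return bytes(ids).decode("utf-8")

# Sanity check: list which tokens fall out of vocab.
tokens = ["dog", "axolotl", "cat"]
print([t for t in tokens if t not in vocab])

# With 256 byte ids, no input is ever unknown, and the mapping is reversible.
ids = to_byte_ids("dog \u00eb")
print(ids, from_byte_ids(ids))
```

The trade-off is sequence length: a character outside ASCII costs several ids instead of one.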

Open vocabularies

  • vocab is not created ahead of time, but tokens are mapped to ids on the fly
  • especially useful when there is a need to handle continuous streams of data and building vocab over and over is expensive and error prone.
  • uses the hashing trick to map tokens to ids based on their hash value. since the vocab is determined solely by the hash function, it never needs to be rebuilt. there is an obvious chance of hash collisions.
  • there is a risk of overfitting the vocab to the training set if the vocab is extremely large. One potential solution is to train a subword tokenizer on a huge, unlabeled corpus so that it extracts subwords relevant to the language as a whole and not to a particular dataset. another approach is to build the vocab from a subset of the training set, so that the model encounters unknown tokens during training and learns how to deal with them.
  • there is an interesting paper on using a student-teacher model to learn a new vocab and/or reduce the size of the vocab.
  • data augmentation techniques could possibly be used as a pre-processing step and might help in strengthening the vocab
  • stemming and lemmatization might not be of much use in neural network based models since the features they will extract will already be addressed by subword tokenization
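
The hashing trick mentioned under open vocabularies can be sketched like this. The bucket count, the helper name `token_to_id`, and the choice of MD5 are assumptions for illustration; a stable hash is used because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib

NUM_BUCKETS = 2 ** 18  # assumed size of the id space

def token_to_id(token, num_buckets=NUM_BUCKETS):
    """Map a token to an id using only a hash function, with no stored vocab."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# The same token always maps to the same id, even for words never seen before.
print(token_to_id("dog"), token_to_id("axolotl"))

# With too few buckets, distinct tokens are forced to share ids (collisions).
tokens = ["dog", "cat", "bird", "fish", "frog"]
tiny_ids = {token_to_id(token, num_buckets=4) for token in tokens}
print(len(tokens), "tokens ->", len(tiny_ids), "distinct ids")
```

A larger bucket count lowers the collision rate at the cost of a larger embedding table.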