@Astroneko404
Last active October 15, 2019 21:03
INFSCI 2420 Midterm Review Notes (Part 1)

Text Normalization

Regular Expression

Example
grep "elect" news.txt
  • Matching strings that we should not have matched (False Positives) -> reducing these increases accuracy or precision
  • Not matching things that we should have matched (False Negatives) -> reducing these increases coverage or recall
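
The two error types can be sketched in Python. This is a minimal illustration, not the course's own example: the sample lines, the gold labels in `relevant`, and the helper `evaluate` are all assumptions standing in for news.txt.

```python
import re

# Hypothetical sample lines standing in for news.txt (not from the notes).
lines = [
    "The election is next week",    # about elections
    "Voters elect a new mayor",     # about elections
    "She selected a new laptop",    # unrelated, but contains "elect"
    "Electoral maps were redrawn",  # about elections, but capital "E"
]
relevant = {0, 1, 3}  # assumed gold labels: indices of lines we want to match


def evaluate(pattern, flags=0):
    """Return (matched line indices, precision, recall) for a pattern."""
    matched = {i for i, line in enumerate(lines) if re.search(pattern, line, flags)}
    true_positives = len(matched & relevant)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(relevant)
    return matched, precision, recall


# Plain substring match: hits "selected" (false positive), misses "Electoral" (false negative).
naive = evaluate(r"elect")

# Word boundary removes the false positive; ignoring case removes the false negative.
refined = evaluate(r"\belect", re.IGNORECASE)

print(naive)
print(refined)
```

On this toy data the plain pattern has one false positive and one false negative, while the refined pattern matches exactly the relevant lines.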

Text Normalization

Purpose
  • It makes the input consistent before any other operations are applied to it
Categories
  • Word Tokenization
  • Case Folding
  • Stemming
  • Lemmatization
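
The first two categories can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the regular expression and the function names `tokenize` and `case_fold` are hypothetical choices, and real tokenizers handle many more cases (abbreviations, hyphens, URLs).

```python
import re


def tokenize(text):
    # Keep runs of letters/digits, allowing one internal apostrophe (e.g. "don't").
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def case_fold(tokens):
    # Map every token to a case-insensitive form so "Voters" and "voters" compare equal.
    return [token.casefold() for token in tokens]


tokens = case_fold(tokenize("The voters don't ELECT mayors. Voters do!"))
print(tokens)
```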
Stemming vs. Lemmatization

Reference:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Similarity

  • Both of them will reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Difference

  • Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes
  • Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma
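
The contrast can be sketched with a toy example. Neither function below is a real stemmer or lemmatizer: `SUFFIXES`, `LEMMA_TABLE`, `crude_stem`, and `lemmatize` are hypothetical names, the suffix list is a stand-in for a heuristic rule set, and the tiny lookup table is a stand-in for a vocabulary with morphological analysis.

```python
# Toy suffix list, checked in order (an assumption, not a published algorithm).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Toy vocabulary mapping inflected forms to their dictionary form (lemma).
LEMMA_TABLE = {
    "was": "be",
    "studies": "study",
    "operating": "operate",
}


def crude_stem(word):
    # Chop the first matching suffix, as long as at least 3 characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    # Look the word up in the vocabulary; fall back to the word itself.
    return LEMMA_TABLE.get(word, word)


for word in ("studies", "operating", "was"):
    print(word, crude_stem(word), lemmatize(word))
```

The stemmer returns fragments such as "stud" and "operat" that need not be words, while the lookup returns dictionary forms and can handle irregular cases such as "was", but only for words the vocabulary covers.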

Language Models with N-grams

Intuition

  • Word prediction

Probabilistic Language Models

Goal
  • Assign a probability to a sentence
Definition

A model that computes either

  • the probability of a sentence (or a word sequence), or
  • the probability of an upcoming word,

is a language model

Applications
  • Machine Translation
  • Spell Correction
  • Speech Recognition
  • etc.
Reminder: The Chain Rule
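
In its standard textbook form, the chain rule of probability applied to a word sequence can be written as follows. The notation ($w_i$ for the $i$-th word, $n$ for the sentence length) is an assumption and is not taken from the notes.

```latex
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \ldots w_{i-1})
```

For example, $P(\text{its water is}) = P(\text{its}) \cdot P(\text{water} \mid \text{its}) \cdot P(\text{is} \mid \text{its water})$.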

Markov Assumption
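  • Condition each word on only the previous $k$ words: $P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$
  • or, for the whole sequence: $P(w_1 \ldots w_n) \approx \prod_{i} P(w_i \mid w_{i-k} \ldots w_{i-1})$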

Unigram
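
A unigram model is usually taken to be the simplest case, where each word is treated as independent of its history. A sketch in the same assumed notation ($w_i$ is the $i$-th word of an $n$-word sentence):

```latex
P(w_1 w_2 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i)
```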

Bigram
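
A bigram model conditions each word on the single preceding word. The sketch below also gives the usual maximum-likelihood estimate, where $c(\cdot)$ denotes a count in the training corpus; this notation is an assumption, not taken from the notes.

```latex
P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
```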

Sentence probabilities
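
Under a bigram model, a sentence probability is the product of the bigram probabilities along the sentence. The sketch below assumes unsmoothed maximum-likelihood estimates and a tiny toy corpus with `<s>` and `</s>` boundary markers; the corpus and the names `bigram_prob` and `sentence_prob` are illustrative only.

```python
from collections import Counter

# Tiny toy corpus (an assumption for illustration), with sentence boundary markers.
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
    ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

unigram_counts = Counter(word for sentence in corpus for word in sentence)
bigram_counts = Counter(
    pair for sentence in corpus for pair in zip(sentence, sentence[1:])
)


def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev), no smoothing.
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_prob(sentence):
    # Multiply the bigram probabilities of each adjacent word pair.
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob(["<s>", "i", "am", "sam", "</s>"]))
```

Without smoothing, any sentence containing an unseen bigram gets probability zero.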

Notes: Using log space to avoid underflow
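
The underflow problem can be shown with a short sketch. Because log(p1 · p2 · ... · pn) = log p1 + log p2 + ... + log pn, summing log probabilities gives the same ranking as multiplying probabilities while staying within floating-point range. The per-word probabilities below are hypothetical values chosen only to trigger underflow.

```python
import math

# Hypothetical per-word probabilities for a 100-word sentence (illustrative only).
probs = [1e-5] * 100

# Direct multiplication: the true value 1e-500 is too small for a 64-bit float.
product = 1.0
for p in probs:
    product *= p

# Log space: add log probabilities instead of multiplying probabilities.
log_prob = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_prob)  # a finite negative number
```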

Evaluation