Astroneko404/INFSCI 2420 Midterm Review Notes (Part 1).md

## INFSCI 2420 Midterm Review Notes (Part 1).md

      
Text Normalization

Regular Expression

Example

grep "elect" news.txt


False positives: matching strings that we should not have matched. Reducing them increases accuracy or precision.
False negatives: not matching strings that we should have matched. Reducing them increases coverage or recall.
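The two error types can be illustrated with a small sketch using Python's `re` module. The sample lines are hypothetical stand-ins for `news.txt`, and the sketch assumes the goal is to find the word "elect" itself in any capitalization.

```python
import re

# Hypothetical sample lines standing in for news.txt.
lines = [
    "The election results were announced today.",
    "Voters will elect a new mayor.",
    "She selected a red dress.",
    "Elect the candidate you trust.",
]

# Naive pattern, comparable to grep "elect": matches the substring anywhere.
naive = re.compile(r"elect")
# Stricter pattern: whole word only, ignoring case.
strict = re.compile(r"\belect\b", re.IGNORECASE)

naive_hits = [line for line in lines if naive.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(naive_hits)   # includes "election" and "selected" (false positives), misses "Elect" (false negative)
print(strict_hits)  # only the lines containing the word "elect" or "Elect"
```

The word boundaries remove the false positives and the case-insensitive flag removes the false negative. In practice the two goals often pull against each other.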

Text Normalization

Purpose


It makes the input consistent before any other operations are applied to it

Categories


Word Tokenization
Case Folding
Stemming
Lemmatization
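As a rough illustration of the first two categories, here is a minimal sketch of word tokenization and case folding using only the standard library. The regular expression and the sample sentence are assumptions chosen for illustration; real tokenizers handle many more cases (abbreviations, URLs, numbers with punctuation).

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

def case_fold(tokens):
    """Map every token to a case-insensitive form."""
    return [t.casefold() for t in tokens]

text = "Dr. Smith didn't vote in the 2016 Election."
tokens = case_fold(tokenize(text))
print(tokens)
```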

Stemming vs. Lemmatization

Reference:

https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html


Similarity

Both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Difference

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
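The contrast can be sketched with two toy functions. Both are hypothetical illustrations: the suffix list is not the Porter algorithm, and the lookup table only stands in for a real vocabulary with morphological analysis.

```python
# Toy suffix-stripping stemmer: a crude heuristic, NOT the Porter algorithm.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Chop the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written table standing in for a real dictionary.
LEMMAS = {"saw": "see", "mice": "mouse", "running": "run", "studies": "study"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

words = ["studies", "running", "saw", "mice"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer produces non-words such as "stud" and "runn" and cannot relate "saw" to "see", while the lookup returns proper dictionary forms but only for words it knows.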

Language Models with N-grams

Intuition


Word prediction

Probabilistic Language Models

Goal


Assign a probability to a sentence

Conception

A language model is a model that computes either

the probability of a sentence (or a word sequence), or
the probability of an upcoming word.

Applications


Machine Translation
Spell Correction
Speech Recognition
etc.

Reminder: The Chain Rule
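
In its standard form, the chain rule of probability decomposes the joint probability of a word sequence into a product of conditional probabilities. The notation $w_1, \dots, w_n$ for the words of a sentence is an assumption here, since the notes do not fix one:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

For a three-word sequence this reads $P(w_1, w_2, w_3) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)$.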


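Markov Assumption

Condition each word on only the previous $k$ words instead of the full history, where $w_1, \dots, w_n$ are the words of the sentence:

$$P(w_1, w_2, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \dots, w_{i-1})$$

or, for each factor in the product,

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})$$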

Unigram
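
In the usual formulation, the unigram model is the simplest case: it drops the history entirely and treats words as independent. With $w_1, \dots, w_n$ denoting the words of a sentence (notation assumed for illustration):

$$P(w_1, w_2, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i)$$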


Bigram
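
In the usual formulation, the bigram model conditions each word on the single preceding word. With $w_1, \dots, w_n$ denoting the words of a sentence (notation assumed for illustration):

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$

A common way to estimate these probabilities, not stated in the notes, is the maximum likelihood estimate from corpus counts, where $c(\cdot)$ is a count:

$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$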


Sentence probabilities


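Note: compute in log space to avoid underflow, adding log probabilities instead of multiplying probabilities, since $\log(p_1 \times p_2 \times p_3) = \log p_1 + \log p_2 + \log p_3$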
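
A minimal sketch of a sentence probability under a bigram model, computed in log space. The toy corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions for illustration. The estimates are unsmoothed counts, so any unseen bigram gives probability zero.

```python
import math
from collections import Counter

# Hypothetical toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Unsmoothed count-based estimate of P(word | prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_logprob(sentence):
    """Sum of log bigram probabilities; -inf if any bigram is unseen."""
    words = sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        p = bigram_prob(prev, word)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

print(sentence_logprob("<s> i am sam </s>"))

# Why log space: multiplying many small probabilities underflows to zero.
probs = [1e-5] * 100
naive_product = math.prod(probs)            # underflows in double precision
log_sum = sum(math.log(p) for p in probs)   # stays a finite negative number
print(naive_product, log_sum)
```

Working with sums of logs keeps long sentences comparable even when their raw probabilities are too small to represent as floating-point numbers.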
Evaluation