grep "elect" news.txt
- Matching strings that we should not have matched (false positives) -> reducing these increases accuracy or precision
- Not matching strings that we should have matched (false negatives) -> reducing these increases coverage or recall
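
The two error types can be made concrete with a small sketch using Python's `re` module. The word list and the set of "wanted" matches are invented for illustration and do not come from news.txt; the refined pattern is only one possible fix.

```python
import re

# Hypothetical sample: we only want words about elections.
words = ["elect", "election", "Elected", "select", "electricity", "vote"]
wanted = {"elect", "election", "Elected"}

def evaluate(pattern, flags=0):
    """Return (false positives, false negatives, precision, recall)."""
    matched = {w for w in words if re.search(pattern, w, flags)}
    false_pos = matched - wanted   # matched, but should not have been
    false_neg = wanted - matched   # should have matched, but were missed
    hits = len(matched & wanted)
    precision = hits / len(matched) if matched else 0.0
    recall = hits / len(wanted)
    return false_pos, false_neg, precision, recall

# Plain substring search, like grep "elect"
naive = evaluate(r"elect")

# Anchored, case-insensitive pattern with a few allowed endings
refined = evaluate(r"^elect(s|ed|ion|ions)?$", re.IGNORECASE)

print(naive)
print(refined)
```

The naive pattern picks up "select" and "electricity" (false positives) and misses "Elected" (a false negative); anchoring the pattern and ignoring case removes both kinds of error on this sample.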
- Text normalization makes the input consistent before any other operations are applied to it
- Word Tokenization
- Case Folding
- Stemming
- Lemmatization
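
As a rough illustration of the first two steps in the list above, here is a minimal sketch of word tokenization and case folding. The regular expression is a simplifying assumption (it keeps runs of letters and digits, with an optional internal apostrophe) and is not a general-purpose tokenizer.

```python
import re

def tokenize(text):
    # Word tokenization: runs of letters/digits, optionally followed by
    # an apostrophe and more letters (so "isn't" stays one token).
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

def case_fold(tokens):
    # Case folding: str.casefold() is a stronger form of lower(),
    # intended for caseless comparison.
    return [t.casefold() for t in tokens]

tokens = case_fold(tokenize("The election isn't over; the Election continues!"))
print(tokens)
```

After both steps, "election" and "Election" become the same token, so later counting or matching treats them as one word.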
Reference:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
Similarity
- Both stemming and lemmatization reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
Difference
- Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes
- Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma
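
The contrast can be sketched with two toy functions. Both are deliberately simplified assumptions: `toy_stem` uses a handful of made-up suffix rules (it is not the Porter stemmer), and `toy_lemmatize` looks words up in a tiny hand-written dictionary standing in for a real vocabulary with morphological analysis.

```python
# Crude suffix rules, tried in order (toy rules for illustration only).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least 3 characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written vocabulary mapping word forms to their lemma.
LEMMAS = {
    "studies": "study",
    "operating": "operate",
    "mice": "mouse",
    "are": "be",
}

def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["studies", "operating", "mice", "are"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer chops endings and can produce non-words such as "studi" and "operat", while leaving irregular forms like "mice" untouched; the dictionary lookup returns actual lemmas but only for words it knows.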
- Word prediction
- Assign a probability to a sentence
A model that computes either
- the probability of a sentence (or a word sequence), or
- the probability of an upcoming word,
is a language model
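
Both quantities in this definition can be sketched with a very small model. The example below assumes a bigram approximation with maximum-likelihood counts, which is only one possible choice and is not specified in these notes; the three-sentence corpus is invented, and `<s>` / `</s>` are assumed sentence-boundary markers.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "<s> i like green eggs </s>",
    "<s> i like ham </s>",
    "<s> sam likes ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def next_word_prob(prev, word):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    # Probability of a sentence as a product of bigram probabilities.
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= next_word_prob(prev, word)
    return prob

print(next_word_prob("i", "like"))
print(sentence_prob("i like ham"))
print(sentence_prob("sam like ham"))
```

Without smoothing, any sentence containing an unseen bigram (here "sam like") gets probability zero, which is one reason practical models add smoothing.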
- Machine Translation
- Spell Correction
- Speech Recognition
- etc.
Note: Compute in log space (add log probabilities instead of multiplying probabilities) to avoid numerical underflow
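
A short sketch of why this matters, assuming a hypothetical sequence of 100 words that each have probability 1e-5 (numbers chosen only to trigger the problem in 64-bit floats):

```python
import math

# Hypothetical per-word probabilities for a 100-word sequence.
probs = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is too small for a
# 64-bit float, so the product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: log(p1 * p2 * ... * pn) = log p1 + ... + log pn
log_prob = sum(math.log(p) for p in probs)

print(product)
print(log_prob)
```

The direct product collapses to zero and loses all information, while the log probability stays a finite negative number that can still be compared across sentences.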