Instantly share code, notes, and snippets.

# ttezel/gist:4138642

Last active July 27, 2024 14:46
Show Gist options
• Save ttezel/4138642 to your computer and use it in GitHub Desktop.
Natural Language Processing Notes

#A Collection of NLP notes

##N-grams

###Calculating unigram probabilities:

P( wi ) = count ( wi ) ) / count ( total number of words )

In english..

Probability of wordi = Frequency of word (i) in our corpus / total number of words in our corpus

##Calcuting bigram probabilities:

P( wi | wi-1 ) = count ( wi-1, wi ) / count ( wi-1 )

In english..

Probability that wordi-1 is followed by wordi = [Num times we saw wordi-1 followed by wordi] / [Num times we saw wordi-1]

## Example

• s = beginning of sentence
• /s = end of sentence

####Given the following corpus:

s I am Sam /s

s Sam I am /s

s I do not like green eggs and ham /s

We can calculate bigram probabilities as such:

• P( I | s ) = 2/3

=> Probability that an s is followed by an `I`

= [Num times we saw `I` follow s ] / [Num times we saw an s ] = 2 / 3

• P( Sam | am ) = 1/2

=> Probability that `am` is followed by `Sam`

= [Num times we saw `Sam` follow `am` ] / [Num times we saw `am`] = 1 / 2

###Calculating trigram probabilities:

Building off the logic in bigram probabilities,

P( wi | wi-1 wi-2 ) = count ( wi, wi-1, wi-2 ) / count ( wi-1, wi-2 )

In english...

Probability that we saw wordi-1 followed by wordi-2 followed by wordi = [Num times we saw the three words in order] / [Num times we saw wordi-1 followed by wordi-2]

## Example

• P( Sam | I am ) = count( Sam I am ) / count(I am) = 1 / 2

###Interpolation using N-grams

We can combine knowledge from each of our n-grams by using interpolation.

E.g. assuming we have calculated unigram, bigram, and trigram probabilities, we can do:

P ( Sam | I am ) = Θ1 x P( Sam ) + Θ2 x P( Sam | am ) + Θ3 x P( Sam | I am )

Using our corpus and assuming all lambdas = 1/3,

P ( Sam | I am ) = (1/3)x(2/20) + (1/3)x(1/2) + (1/3)x(1/2)

In web-scale applications, there's too much information to use interpolation effectively, so we use Stupid Backoff instead.

In Stupid Backoff, we use the trigram if we have enough data points to make it seem credible, otherwise if we don't have enough of a trigram count, we back-off and use the bigram, and if there still isn't enough of a bigram count, we use the unigram probability.

###Smoothing Algorithms

Let's say we've calculated some n-gram probabilities, and now we're analyzing some text. What happens when we encounter a word we haven't seen before? How do we know what probability to assign to it?

We use `smoothing` to give it a probability.

=> Use the count of things we've only seen once in our corpus to estimate the count of things we've never seen.

This is the intuition used by many smoothing algorithms.

###Good-Turing Smoothing

Notation:

Nc = the count of things with frequency `c` - how many things occur with frequency `c` in our corpus.

Good Turing modifies our:

• n-gram probability function for things we've never seen (things that have count 0)
• count for things we have seen (since all probabilites add to 1, we have to modify this count if we are introducing new probabilities for things we've never seen)

Modified Good-Turing probability function:

P* ( things with 0 count ) = N1 / N

=> [Num things with frequency 1] / [Num things]

Modified Good-Turing count:

count* = [ (count + 1) x Nc+1 ] / [ Nc ]

## Example

Assuming our corpus has the following frequency count:

carp: 10 perch: 3 whitefish: 2 trout: 1 salmon: 1 eel: 1

Calculating the probability of something we've never seen before:

P ( catfish ) = N1 / N = 3 / 18

Calculating the modified count of something we've seen:

count* ( trout )

= [ (1 + 1) x N2 ] / [ N1 ] = [ 2 x 1 ] / [ 3 ] = 2 / 3

Calculating the probability of something we've seen:

P* ( trout ) = count ( trout ) / count ( all things ) = (2/3) / 18 = 1/27

What happens if we don't have a word that occurred exactly Nc+1 times?

=> Once we have a sufficient amount of training data, we generate a best-fit curve to make sure we can calculate an estimate of Nc+1 for any `c`.

###Kneser-Ney Smoothing

A problem with Good-Turing smoothing is apparent in analyzing the following sentence, to determine what word comes next:

``````I can't see without my reading ___________
``````

The word `Francisco` is more common than the word `glasses`, so we may end up choosing `Francisco` here, instead of the correct choice, `glasses`.

The Kneser-Ney smoothing algorithm has a notion of `continuation probability` which helps with these sorts of cases. It also saves you from having to recalculate all your counts using `Good-Turing` smoothing.

Here's how you calculate the K-N probabilty with bigrams:

Pkn( wi | wi-1 ) = [ max( count( wi-1, wi ) - `d`, 0) ] / [ count( wi-1 ) ] + Θ( wi-1 ) x Pcontinuation( wi )

Where:

Pcontinuation( wi )

represents the continuation probability of wi. This is the number of bigrams where wi followed wi-1, divided by the total number of bigrams that appear with a frequency > 0. It gives an indication of the probability that a given word will be used as the second word in an unseen bigram (such as `reading ________`)

Θ( ) This is a `normalizing constant`; since we are subtracting by a `discount weight` d, we need to re-add that probability mass we have discounted. It gives us a weighting for our Pcontinuation.

We calculate this as follows:

Θ( wi-1 ) = { d * [ Num words that can follow wi-1 ] } / [ count( wi-1 ) ]

###Kneser-Ney Smoothing for N-grams

The Kneser-Ney probability we discussed above showed only the bigram case.

For N-grams, the probability can be generalized as follows:

Pkn( wi | wi-n+1i-1) = [ max( countkn( wi-n+1i ) - `d`, 0) ] / [ countkn( wi-n+1i-1 ) ] + Θ( wi-n+1i-1 ) x Pkn( wi | wi-n+2i-1 )

Where:

ckn(•) =

• the actual count(•) for the highest order n-gram

or

• continuation_count(•) for lower order n-gram

=> continuation_count = Number of unique single word contexts for •

##Spelling Correction

We can imagine a noisy channel model for this (representing the keyboard).

`original word` ~~~~~~~~~Noisy Channel~~~~~~~~> `noisy word`

Our decoder receives a `noisy word`, and must try to guess what the `original` (intended) word was.

So what we can do is generate N possible `original words`, and run them through our noisy channel and see which one looks most like the `noisy word` we received.

The corrected word, w*, is the word in our vocabulary (`V`) that has the maximum probability of being the correct word (`w`), given the input `x` (the misspelled word).

w* = argmaxw∈V P( w | x )

Using Bayes' Rule, we can rewrite this as:

w* = argmaxw∈V P( x | w ) x P( w )

P( x | w ) is determined by our channel model. P( w ) is determined by our language model (using N-grams).

The first thing we have to do is generate candidate words to compare to the misspelled word.

###Confusion Matrix

This is how we model our noisy channel. A confusion matrix gives us the probabilty that a given spelling mistake (or word edit) happened at a given location in the word. We use the Damerau-Levenshtein edit types (deletion, insertion, substitution, transposition). These account for 80% of human spelling errors.

• del[a,b]: count( ab typed as a )
• ins[a,b]: count( a typed as ab )
• sub[a,b]: count( a typed as b )
• trans[a,b]: count( ab typed as ba )

Our confusion matrix keeps counts of the frequencies of each of these operations for each letter in our alphabet, and from this matrix we can generate probabilities.

We would need to train our confusion matrix, for example using wikipedia's list of common english word misspellings.

After we've generated our confusion matrix, we can generate probabilities.

Let wi denote the ith character in the word w.

p( x | w ) =

• if deletion, [ del( wi-1, wi ) ] / [ count(wi-1 wi) ]
• if insertion, [ ins( wi-1, xi ) ] / [ count(wi-1 ]
• if substitution, [ sub( xi, wi ) ] / [ count(wi ]
• if transposition, [ trans( wi, wi+1 ) ] / [ count(wi wi+1 ]

Suppose we have the misspelled word x = acress

We can generate our channel model for acress as follows:

actress

=> Correct letter : `t`

=> Error letter : `-`

=> x | w : `c` | `ct` (probability of deleting a `t` given the correct spelling has a `ct`)

=> P( x | w ) : 0.000117

cress

=> Correct letter : `-`

=> Error letter : `a`

=> x | w : `a` | `-`

=> P( x | w ) : 0.00000144

caress

=> Correct letter : `ca`

=> Error letter : `ac`

=> x | w : `ac` | `ca`

=> P( x | w ) : 0.00000164

... and so on

We would combine the information from out channel model by multiplying it by our n-gram probability.

###Real-Word Spelling Correction

What happens when a user misspells a word as another, valid english word?

Eg. I have fifteen minuets to leave the house.

We find valid english words that have an edit distance of 1 from the input word.

Given a sentence w1, w2, w3, ..., wn

Generate a set of candidate words for each wi

• Candidate( w1 ) = { w1, w1', w1'', ... }
• Candidate( w2 ) = { w2, w2', w2'', ... }
• Candidate( wn ) = { wn, wn', wn'', ... }

Note that the candidate sets include the original word itself (since it may actually be correct!)

Then we choose the sequence of candidates W that has the maximal probability.

## Example

Given the sentence `two of thew`, our sequences of candidates may look like:

• two of thew
• two of the
• to off threw
• to off the
• to on threw
• to on the
• to of threw
• to of the
• too of threw
• too of the

Then we ask ourselves, of all possible sentences, which has the highest probability?

In practice, we simplify by looking at the cases where only 1 word of the sentence was mistyped (note that above we were considering all possible cases where each word could have been mistyped). So we look at all possibilities with one word replaced at a time. This changes our run-time from O(n2) to O(n).

Where do we get these probabilities?

• Our language model (unigrams, bigrams, ..., n-grams)
• Our Channel model (same as for non-word spelling correction)

Our Noisy Channel model can be further improved by looking at factors like:

• The nearby keys in the keyboard
• Letters or word-parts that are pronounced similarly (such as `ant`->`ent`)

##Text Classification

Text Classification allows us to do things like:

• determining if an email is spam
• determining who is the author of some piece of text
• determining the likelihood that a piece of text was written by a man or a woman
• Perform sentiment analysis on some text

Let's define the Task of Text Classification

Given:

• a document d
• a fixed set of classes C = { c1, c2, ... , cn }

Determine:

• the predicted class c ∈ C

Put simply, we want to take a piece of text, and assign a class to it.

###Classification Methods

We can use Supervised Machine Learning:

Given:

• a document d
• a fixed set of classes C = { c1, c2, ... , cn }
• a training set of m documents that we have pre-determined to belong to a specific class

We train our classifier using the training set, and result in a learned classifier.

We can then use this learned classifier to classify new documents.

Notation: we use Υ(d) = C to represent our classifier, where Υ() is the classifier, d is the document, and c is the class we assigned to the document.

(Google's `mark as spam` button probably works this way).

####Naive Bayes Classifier

This is a simple (naive) classification method based on Bayes rule. It relies on a very simple representation of the document (called the bag of words representation)

Imagine we have 2 classes ( positive and negative ), and our input is a text representing a review of a movie. We want to know whether the review was positive or negative. So we may have a bag of positive words (e.g. `love`, `amazing`, `hilarious`, `great`), and a bag of negative words (e.g. `hate`, `terrible`).

We may then count the number of times each of those words appears in the document, in order to classify the document as positive or negative.

This technique works well for topic classification; say we have a set of academic papers, and we want to classify them into different topics (computer science, biology, mathematics).

####Bayes' Rule applied to Documents and Classes

For a document d and a class c, and using Bayes' rule,

P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]

The class mapping for a given document is the class which has the maximum value of the above probability.

Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and simply compare the different values of the numerator:

P( c | d ) = P( d | c ) x P( c )

Now, what do we mean by the term P( d | c ) ?

Let's represent the document as a set of features (words or tokens) x1, x2, x3, ...

We can then re-write P( d | c ) as:

P( x1, x2, x3, ... , xn | c )

What about P( c ) ? How do we calculate it?

=> P( c ) is the total probability of a class. => How often does this class occur in total?

E.g. in the case of classes positive and negative, we would be calculating the probability that any given review is positive or negative, without actually analyzing the current input document.

This is calculated by counting the relative frequencies of each class in a corpus.

E.g. out of 10 reviews we have seen, 3 have been classified as positive.

=> P ( positive ) = 3 / 10

Now let's go back to the first term in the Naive Bayes equation:

P( d | c ), or P( x1, x2, x3, ... , xn | c ).

## How do we actually calculate this?

We use some assumptions to simplify the computation of this probability:

• the Bag of Words assumption => assume the position of the words in the document doesn't matter.
• Conditional Independence => Assume the feature probabilities P( xi | cj ) are independent given the class c.

It is important to note that both of these assumptions aren't actually correct - of course, the order of words matter, and they are not independent. A phrase like `this movie was incredibly terrible` shows an example of how both of these assumptions don't hold up in regular english.

However, these assumptions greatly simplify the complexity of calculating the classification probability. And in practice, we can calculate probabilities with a reasonable level of accuracy given these assumptions.

## So..

To calculate the Naive Bayes probability, P( d | c ) x P( c ), we calculate P( xi | c ) for each xi in d, and multiply them together.

Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and choose the class that has the maximum overall value.

## ###How do we learn the values of P ( c ) and P ( xi | c ) ?

=> We can use Maximum Likelihood estimates.

Simply put, we look at frequency counts.

P ( ci ) = [ Num documents that have been classified as ci ] / [ Num documents ]

In english..

Out of all the documents, how many of them were in class i ?

P ( wi | cj ) = [ count( wi, cj ) ] / [ Σw∈V count ( w, cj ) ]

In english...

The probability of word i given class j is the count that the word occurred in documents of class j, divided by the sum of the counts of each word in our vocabulary in class j.

So for the denominator, we iterate thru each word in our vocabulary, look up the frequency that it has occurred in class j, and add these up.

####Problems with Maximum-Likelihood Estimate.

What if we haven't seen any training documents with the word fantastic in our class positive ?

In this case, P ( fantastic | positive ) = 0

Since we are calculating the overall probability of the class by multiplying individual probabilities for each word, we would end up with an overall probability of 0 for the positive class.

So how do we fix this issue?

We can use a Smoothing Algorithm, for example Add-one smoothing (or Laplace smoothing).

####Laplace Smoothing

We modify our conditional word probability by adding 1 to the numerator and modifying the denominator as such:

P ( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V( count ( w, cj ) + 1 ) ]

This can be simplified to

P ( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V( count ( w, cj ) ) + |V| ]

where |V| is our vocabulary size (we can do this since we are adding 1 for each word in the vocabulary in the previous equation).

####So in Summary, to Machine-Learn your Naive-Bayes Classifier:

Given:

• an input document
• the category that this document belongs to

We do:

• increment the count of total documents we have learned from N.
• increment the count of documents that have been mapped to this category Nc.
• if we encounter new words in this document, add them to our vocabulary, and update our vocabulary size |V|.
• update count( w, c ) => the frequency with which each word in the document has been mapped to this category.
• update count ( c ) => the total count of all words that have been mapped to this class.

## So when we are confronted with a new document, we calculate for each class:

P( c ) = Nc / N

=> how many documents were mapped to class c, divided by the total number of documents we have ever looked at. This is the overall, or prior probability of this class.

Then we iterate thru each word in the document, and calculate:

P( w | c ) = [ count( w, c ) + 1 ] / [ count( c ) + |V| ]

=> the count of how many times this word has appeared in class c, plus 1, divided by the total count of all words that have ever been mapped to class c, plus the vocabulary size. This uses the Laplace-Smoothing, so we don't get tripped up by words we've never seen before. This equation is used both for words we have seen, as well as words we haven't seen.

=> we multiply each P( w | c ) for each word w in the new document, then multiply by P( c ), and the result is the probability that this document belongs to this class.

####Some Ways that we can tweak our Naive Bayes Classifier

Depending on the domain we are working with, we can do things like

• Collapse Part Numbers or Chemical Names into a single token
• Upweighting (counting a word as if it occurred twice)
• Feature selection (since not all words in the document are usually important in assigning it a class, we can look for specific words in the document that are good indicators of a particular class, and drop the other words - those that are viewed to be semantically empty)

=> If we have a sentence that contains a title word, we can upweight the sentence (multiply all the words in it by 2 or 3 for example), or we can upweight the title word itself (multiply it by a constant).

##Sentiment Analysis

###Scherer Typology of Affective States

• Emotion

Brief, organically synchronized.. evaluation of a major event => angry, sad, joyful, fearful, ashamed, proud, elated

• Mood

diffuse non-caused low-intensity long-duration change in subjective feeling => cheerful, gloomy, irritable, listless, depressed, buoyant

• Interpersonal Stances

Affective stance towards another person in a specific interaction => friendly, flirtatious, distant, cold, warm, supportive, contemtuous

• Attitudes

Enduring, affectively colored beliefs, disposition towards objects or persons => liking, loving, hating, valuing, desiring

• *Personality Traits

Stable personality dispositions and typical behavior tendencies => nervous, anxious, reckless, morose, hostile, jealous

Sentiment Analysis is the detection of attitudes (2nd from the bottom of the above list).

## We want to know:

• The Holder (source) of the attitude

• The Target (aspect) of the attitude

• The Type of the attitude from a set of types (like, love, hate, value, desire, etc.). Or, more commonly, simply the weighted polarity (positive, negative, neutral, together with strength).

###Baseline Algorithm for Sentiment Analysis

Given a piece of text, we perform:

• Tokenization
• Feature Extraction
• Classification using different classifiers
• Naive Bayes
• MaxEnt
• SVM

####Tokenization Issues

Depending on what type of text we're dealing with, we can have the following issues:

• Dealing with HTML or XML markup
• Twitter Markup (names, hash tags)
• Capitalization
• Phone Numbers, dates, emoticons

Some useful code for tokenizing:

• Christopher Potts Sentiment Tokenizer

####Classification

We will have to deal with handling negation:

`I didn't like this movie` vs `I really like this movie`

####So, how do we handle negation?

One way is to prepend NOT_ to every word between the negation and the beginning of the next punctuation character.

E.g. I didn't really like this movie, but ...

=> I didn't NOT_really NOT_like NOT_this NOT_movie, but ...

This doubles our vocabulary, but helps in tokenizing negative sentiments and classifying them.

####Hatzivassiloglou and McKeown intuition for identifying word polarity

• Adjectives conjoined by and have the same polarity

=> Fair and legitimate, corrupt and brutal

• Adjectives conjoined by but do not

=> Fair but brutal

We can use this intuition to learn new adjectives.

Imagine we have a set of adjectives, and we have identified the polarity of each adjective. Whenever we see a new word we haven't seen before, and it is joined to an adjective we have seen before by an and, we can assign it the same polarity.

For example, say we know the poloarity of nice.

When we see the phrase `nice and helpful`, we can learn that the word helpful has the same polarity as the word nice. In this way, we can learn the polarity of new words we haven't encountered before.

So we can expand our seed set of adjectives using these rules. Then, as we count the frequency that but has occurred between a pair of words versus the frequency with which and has occurred between the pair, we can start to build a ratio of buts to ands, and thus establish a degree of polarity for a given word.

####What about learning the polarity of phrases?

• Take a corpus, and divide it up into phrases.

Then run through the corpus, and extract the first two words of every phrase that matches one these rules:

• 1st word is adjective, 2nd word is noun_singular or noun_plural, 3rd word is anything
• 1st word is adverb, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
• 1st word is adjective, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
• 1st word is noun_singular or noun_plural, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
• 1st word is adverb, 2nd word is verb, 3rd word is anything

Note: To do this, we'd have to run each phrase through a Part-of-Speech tagger.

Then, we can look at how often they co-occur with positive words.

• Positive phrases co-occur more with excellent
• Negative phrases co-occur more with poor

But how do we measure co-occurrence?

We can use Pointwise Mutual Information:

How much more do events x and y occur than if they were independent?

PMI( word1, word2 ) = log2 { [ P( word1, word2 ] / [ P( word1 ) x P( word2 ) ] }

Then we can determine the polarity of the phrase as follows:

Polarity( phrase ) = PMI( phrase, excellent ) - PMI( phrase, poor )

= log2 { [ P( phrase, excellent ] / [ P( phrase ) x P( excellent ) ] } - log2 { [ P( phrase, poor ] / [ P( phrase ) x P( poor ) ] }

## Another way to learn polarity (of words)

Then, look them up in a thesaurus, and:

• add synonyms of each of the positive words to the positive set

• add antonyms of each of the positive words to the negative set

• add synonyms of each of the negative words to the negative set

• add antonyms of each of the negative words to the positive set

and.. repeat.. with the new set of words we have discovered, to build out our lexicon.

###Summary on learning Lexicons

• Start with a seed set* of words ( `good`, `poor`, ... )
• Find other words that have similar polarity:
• using and and but
• using words that appear nearby in the same document
• using synonyms and antonyms

###Sentiment Aspect Analysis

What happens if we get the following phrase:

`The food was great, but the service was awful.`

This phrase doesn't really have an overall sentiment; it has two separate sentiments; great food and awful service. So sometimes, instead of trying to tackle the problem of figuring out the overall sentiment of a phrase, we can instead look at finding the target of any sentiment.

## How do we do this?

=> We look at frequent phrases, and rules

• Find all highly frequent phrases across a set of reviews (e.g. `fish tacos`) => this can help identify the targets of different sentiments.
• Filter these highly frequent phrases by rules like occurs right after a sentiment word

=> `... great fish tacos ...` means that fish tacos is a likely target of sentiment, since we know great is a sentiment word.

Let's say we already know the important aspects of a piece of text. For example, if we are analyzing restaurant reviews, we know that aspects we will come across include food, decor, service, value, ...

Then we can train our classifier to assign an aspect to a given sentence or phrase.

"Given this sentence, is it talking about food or decor or ..."

=> This only applies to text where we KNOW what we will come across.

So overall, our flow could look like:

Text (e.g. reviews) --> Text extractor (extract sentences/phrases) --> Sentiment Classifier (assign a sentiment to each sentence/phrase) --> Aspect Extractor (assign an aspect to each sentence/phrase) --> Aggregator --> Final Summary

##Conditional Models

Naive Bayes Classifiers use a joint probability model. We evaluate probabilities P( d, c ) and try to maximize this joint likelihood.

=> maximizing P( text, class )

rather than a conditional probability model

-> maximizing P( class | text )

If we instead try to maximize the conditional probability of P( class | text ), we can achieve higher accuracy in our classifier.

A conditional model gives probabilities P( c | d ). It takes the data as given and models only the conditional probability of the class.

##MaxEnt Classifiers (Maximum Entropy Classifiers)

We define a feature as an elementary piece of evidence that links aspects of what we observe ( d ), with a category ( c ) that we want to predict.

So a feature is a function that maps from the space of classes and data onto a Real Number (it has a bounded, real value).

ƒ: C x D --> R

Models will assign a weight to each feature:

• A positive weight votes that the configuration is likely correct
• A negative weight votes that the configuration is likely incorrect

## What do features look like?

Here is an example feature:

• ƒ1(c,d) ≡ [ c = LOCATION & w-1="in" & isCapitalized(w) ]

This feature picks out from the data cases where the class is LOCATION, the previous word is "in" and the current word is capitalized.

This feature would match the following scenarios:

• class = LOCATION, data = "in Quebec"
• class = LOCATION, data = "in Arcadia"

Another example feature:

• ƒ2(c,d) ≡ [ c = DRUG & ends(w, "c") ]

This feature picks out from the data cases where the class is DRUG and the current word ends with the letter c.

This feature would match:

• class = DRUG, data = "taking Zantac"

Features generally use both the bag of words, as we saw with the Naive-Bayes Classifier, as well as looking at adjacent words (like the example features above).

###Feature-Based Linear Classifiers

Feature-Based Linear Classifiers:

• Are a linear function from feature sets {ƒi} to classes {c}.

• Assign a weight λi to each feature ƒi

• We consider each class for an observed datum d

• For a pair (c,d), features vote with their weights:

vote(c) = Σ λiƒi(c,d)

• Choose the class c which maximizes vote(c)

As you can see in the equation above, the vote is just a weighted sum of the features; each feature has its own weight. So we try to find the class that maximizes the weighted sum of all the features.

MaxEnt Models make a probabilistic model from the linear combination Σ λiƒi(c,d).

Since the weights can be negative values, we need to convert them to positive values since we want to calculating a non-negative probability for a given class. So we use the value as such:

exp Σ λiƒi(c,d)

This way we will always have a positive value.

We make this value into a probability by dividing by the sum of the probabilities of all classes:

[ exp Σ λiƒi(c,d) ] / [ ΣC exp Σ λiƒi(c,d) ]

##Named Entity Recognition

Named Entity Recognition (NER) is the task of extracting entities (people, organizations, dates, etc.) from text.

###Machine-Learning sequence model approach to NER

####Training

• Collect a set of representative Training Documents
• Label each token for its entity class, or Other (O) if no match
• Design feature extractors appropriate to the text and classes
• Train a sequence classifier to predict the labels from the data

####Testing

• Get a set of testing documents
• Run the model on the document to label each token
• Output the recognized entities

### hanifa2102 commented Aug 2, 2015

Nice Concise Summarization of NLP in one page.

### tristang commented Jun 5, 2016

Thanks for the excellent notes

### nerzid commented Apr 2, 2017

Thank you, this is useful.

Thanks

### hailiang-wang commented Jun 24, 2017

Thanks, great notes.

### Madhivarman commented Jan 17, 2018

Thanks for great tutorial

### morrieinmaas commented Apr 19, 2018

Thanks Tolga, great and very useful notes!

I have a question about the conditional probabilities for n-grams pretty much right at the top. You write:
P( Sam | I am ) = count( Sam I am ) / count(I am) = 1 / 2
I might be wrong here, but I thought that this means in English: Probability of getting Sam given I am so the equation would change slightly to (note: count(I am Sam) instead of count(Sam I am)):
P( Sam | I am ) = count(I am Sam) / count(I am) = 1 / 2
In your example case this doesn't change the result anyhow. I just got a bit confused because my lecture notes/slides state the proposed change. In case yours is correct, I'd appreciate it if you could clarify why.

Either way, great summary and thanks a bunch!

Thanks

### suganthan24 commented Aug 13, 2018

I'm going to calculate laplace smoothing. so should I consider s and /s for count N and V?

### chinusharma123 commented May 21, 2020

can you please provide code for finding out the probability of bigram..
mail- sharmachinu4u@gmail.com

### samibulti commented Apr 24, 2023

Great thanks! It is best note.

### samibulti commented Apr 24, 2023

I got what I challenged with it!