eunicekokor/gist:ae351283cc8d2dcdea45

## gistfile1.md

      
    Raw
  

              gistfile1.md
            
          
    #A Collection of NLP notes
##N-grams
###Calculating unigram probabilities:
P( w_i ) = count ( w_i ) ) / count ( total number of words )
In english..
Probability of word_i =
Frequency of word (i) in our corpus / total number of words in our corpus
##Calcuting bigram probabilities:
P( w_i | w_i-1 ) = count ( w_i-1, w_i ) / count ( w_i-1 )
In english..
Probability that word_i-1 is followed by word_i =
[Num times we saw word_i-1 followed by word_i] / [Num times we saw word_i-1]
Example


s = beginning of sentence
/s = end of sentence

####Given the following corpus:
s I am Sam /s
s Sam I am /s
s I do not like green eggs and ham /s
We can calculate bigram probabilities as such:

P( I | s ) = 2/3

=> Probability that an s is followed by an I
= [Num times we saw I follow s ] / [Num times we saw an s ]
= 2 / 3

P( Sam | am ) = 1/2

=> Probability that am is followed by Sam
= [Num times we saw Sam follow am ] / [Num times we saw am]
= 1 / 2
###Calculating trigram probabilities:
Building off the logic in bigram probabilities,
P( w_i | w_i-1 w_i-2 ) = count ( w_i, w_i-1, w_i-2 ) / count ( w_i-1, w_i-2 )
In english...
Probability that we saw word_i-1 followed by word_i-2 followed by word_i = [Num times we saw the three words in order] / [Num times we saw word_i-1 followed by word_i-2]
Example


P( Sam | I am )
= count( Sam I am ) / count(I am)
= 1 / 2

###Interpolation using N-grams
We can combine knowledge from each of our n-grams by using interpolation.
E.g. assuming we have calculated unigram, bigram, and trigram probabilities, we can do:
P ( Sam | I am ) = Θ₁ x P( Sam ) + Θ₂ x P( Sam | am ) + Θ₃ x P( Sam | I am )
Using our corpus and assuming all lambdas = 1/3,
P ( Sam | I am ) = (1/3)x(2/20) + (1/3)x(1/2) + (1/3)x(1/2)
In web-scale applications, there's too much information to use interpolation effectively, so we use Stupid Backoff instead.
In Stupid Backoff, we use the trigram if we have enough data points to make it seem credible, otherwise if we don't have enough of a trigram count, we back-off and use the bigram, and if there still isn't enough of a bigram count, we use the unigram probability.
###Smoothing Algorithms
Let's say we've calculated some n-gram probabilities, and now we're analyzing some text. What happens when we encounter a word we haven't seen before? How do we know what probability to assign to it?
We use smoothing to give it a probability.
=> Use the count of things we've only seen once in our corpus to estimate the count of things we've never seen.
This is the intuition used by many smoothing algorithms.
###Good-Turing Smoothing
Notation:
N_c = the count of things with frequency c - how many things occur with frequency c in our corpus.
Good Turing modifies our:

n-gram probability function for things we've never seen (things that have count 0)
count for things we have seen (since all probabilites add to 1, we have to modify this count if we are introducing new probabilities for things we've never seen)

Modified Good-Turing probability function:
P^* ( things with 0 count ) = N₁ / N
=> [Num things with frequency 1] / [Num things]
Modified Good-Turing count:
count^* = [ (count + 1) x N_c+1 ] / [ N_c ]
Example

Assuming our corpus has the following frequency count:
carp:       10
perch:      3
whitefish:  2
trout:      1
salmon:     1
eel:        1
Calculating the probability of something we've never seen before:
P ( catfish ) = N₁ / N = 3 / 18
Calculating the modified count of something we've seen:
count^* ( trout )
= [ (1 + 1) x N₂ ] / [ N₁ ]
= [ 2 x 1 ] / [ 3 ]
= 2 / 3
Calculating the probability of something we've seen:
P^* ( trout ) = count ( trout ) / count ( all things ) = (2/3) / 18 = 1/27
What happens if we don't have a word that occurred exactly N_c+1 times?
=> Once we have a sufficient amount of training data, we generate a best-fit curve to make sure we can calculate an estimate of N_c+1 for any c.
###Kneser-Ney Smoothing
A problem with Good-Turing smoothing is apparent in analyzing the following sentence, to determine what word comes next:
I can't see without my reading ___________

The word Francisco is more common than the word glasses, so we may end up choosing Francisco here, instead of the correct choice, glasses.
The Kneser-Ney smoothing algorithm has a notion of continuation probability which helps with these sorts of cases. It also saves you from having to recalculate all your counts using Good-Turing smoothing.
Here's how you calculate the K-N probabilty with bigrams:
P_kn( w_i | w_i-1 ) = [ max( count( w_i-1, w_i ) - d, 0) ] / [ count( w_i-1 ) ] + Θ( w_i-1 ) x P_continuation( w_i )
Where:
P_continuation( w_i )
represents the continuation probability of w_i. This is the number of bigrams where w_i followed w_i-1, divided by the total number of bigrams that appear with a frequency > 0. It gives an indication of the probability that a given word will be used as the second word in an unseen bigram (such as reading ________)
Θ( )
This is a normalizing constant; since we are subtracting by a discount weight d, we need to re-add that probability mass we have discounted. It gives us a weighting for our P_continuation.
We calculate this as follows:
Θ( w_i-1 ) = { d * [ Num words that can follow w_i-1 ] } / [ count( w_i-1 ) ]
###Kneser-Ney Smoothing for N-grams
The Kneser-Ney probability we discussed above showed only the bigram case.
For N-grams, the probability can be generalized as follows:
P_kn( w_i | w_i-n+1^i-1) = [ max( count_kn( w_i-n+1ⁱ ) - d, 0) ] / [ count_kn( w_i-n+1^i-1 ) ] + Θ( w_i-n+1^i-1 ) x P_kn( w_i | w_i-n+2^i-1 )
Where:
c_kn(•) =

the actual count(•) for the highest order n-gram

or

continuation_count(•) for lower order n-gram

=> continuation_count = Number of unique single word contexts for •
##Spelling Correction
We can imagine a noisy channel model for this (representing the keyboard).
original word ~~~~~~~~~Noisy Channel~~~~~~~~> noisy word
Our decoder receives a noisy word, and must try to guess what the original (intended) word was.
So what we can do is generate N possible original words, and run them through our noisy channel and see which one looks most like the noisy word we received.
The corrected word, w^*, is the word in our vocabulary (V) that has the maximum probability of being the correct word (w), given the input x (the misspelled word).
w^* = argmax_w∈V P( w | x )
Using Bayes' Rule, we can rewrite this as:
w^* = argmax_w∈V P( x | w ) x P( w )
P( x | w ) is determined by our channel model.
P( w ) is determined by our language model (using N-grams).
The first thing we have to do is generate candidate words to compare to the misspelled word.
###Confusion Matrix
This is how we model our noisy channel. A confusion matrix gives us the probabilty that a given spelling mistake (or word edit) happened at a given location in the word. We use the Damerau-Levenshtein edit types (deletion, insertion, substitution, transposition). These account for 80% of human spelling errors.

del[a,b]: count( ab typed as a )
ins[a,b]: count( a typed as ab )
sub[a,b]: count( a typed as b )
trans[a,b]: count( ab typed as ba )

Our confusion matrix keeps counts of the frequencies of each of these operations for each letter in our alphabet, and from this matrix we can generate probabilities.
We would need to train our confusion matrix, for example using wikipedia's list of common english word misspellings.
After we've generated our confusion matrix, we can generate probabilities.
Let w_i denote the i^th character in the word w.
p( x | w ) =

if deletion, [ del( w_i-1, w_i ) ] / [ count(w_i-1 w_i) ]
if insertion, [ ins( w_i-1, x_i ) ] / [ count(w_i-1 ]
if substitution, [ sub( x_i, w_i ) ] / [ count(w_i ]
if transposition, [ trans( w_i, w_i+1 ) ] / [ count(w_i w_i+1 ]

Suppose we have the misspelled word x = acress
We can generate our channel model for acress as follows:
actress
=> Correct letter   : t
=> Error letter     : -
=> x | w        : c | ct (probability of deleting a t given the correct spelling has a ct)
=> P( x | w )       : 0.000117
cress
=> Correct letter   : -
=> Error letter     : a
=> x | w        : a | -
=> P( x | w )       : 0.00000144
caress
=> Correct letter   : ca
=> Error letter     : ac
=> x | w        : ac | ca
=> P( x | w )       : 0.00000164
... and so on
We would combine the information from out channel model by multiplying it by our n-gram probability.
###Real-Word Spelling Correction
What happens when a user misspells a word as another, valid english word?
Eg. I have fifteen minuets to leave the house.
We find valid english words that have an edit distance of 1 from the input word.
Given a sentence w₁, w₂, w₃, ..., w_n
Generate a set of candidate words for each w_i

Candidate( w₁ ) = { w₁, w₁^', w₁^'', ... }
Candidate( w₂ ) = { w₂, w₂^', w₂^'', ... }
Candidate( w_n ) = { w_n, w_n^', w_n^'', ... }

Note that the candidate sets include the original word itself (since it may actually be correct!)
Then we choose the sequence of candidates W that has the maximal probability.
Example

Given the sentence two of thew, our sequences of candidates may look like:

two of thew
two of the
to off threw
to off the
to on threw
to on the
to of threw
to of the
too of threw
too of the

Then we ask ourselves, of all possible sentences, which has the highest probability?
In practice, we simplify by looking at the cases where only 1 word of the sentence was mistyped (note that above we were considering all possible cases where each word could have been mistyped). So we look at all possibilities with one word replaced at a time. This changes our run-time from O(n²) to O(n).
Where do we get these probabilities?

Our language model (unigrams, bigrams, ..., n-grams)
Our Channel model (same as for non-word spelling correction)

Our Noisy Channel model can be further improved by looking at factors like:

The nearby keys in the keyboard
Letters or word-parts that are pronounced similarly (such as ant->ent)

##Text Classification
Text Classification allows us to do things like:

determining if an email is spam
determining who is the author of some piece of text
determining the likelihood that a piece of text was written by a man or a woman
Perform sentiment analysis on some text

Let's define the Task of Text Classification
Given:

a document d
a fixed set of classes C = { c₁, c₂, ... , c_n }

Determine:

the predicted class c ∈ C

Put simply, we want to take a piece of text, and assign a class to it.
###Classification Methods
We can use Supervised Machine Learning:
Given:

a document d
a fixed set of classes C = { c₁, c₂, ... , c_n }
a training set of m documents that we have pre-determined to belong to a specific class

We train our classifier using the training set, and result in a learned classifier.
We can then use this learned classifier to classify new documents.
Notation: we use Υ(d) = C to represent our classifier, where Υ() is the classifier, d is the document, and c is the class we assigned to the document.
(Google's mark as spam button probably works this way).
####Naive Bayes Classifier
This is a simple (naive) classification method based on Bayes rule. It relies on a very simple representation of the document (called the bag of words representation)
Imagine we have 2 classes ( positive and negative ), and our input is a text representing a review of a movie. We want to know whether the review was positive or negative. So we may have a bag of positive words (e.g. love, amazing, hilarious, great), and a bag of negative words (e.g. hate, terrible).
We may then count the number of times each of those words appears in the document, in order to classify the document as positive or negative.
This technique works well for topic classification; say we have a set of academic papers, and we want to classify them into different topics (computer science, biology, mathematics).
####Bayes' Rule applied to Documents and Classes
For a document d and a class c, and using Bayes' rule,
P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]
The class mapping for a given document is the class which has the maximum value of the above probability.
Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and simply compare the different values of the numerator:
P( c | d ) = P( d | c ) x P( c )
Now, what do we mean by the term P( d | c ) ?
Let's represent the document as a set of features (words or tokens) x₁, x₂, x₃, ...
We can then re-write P( d | c ) as:
P( x₁, x₂, x₃, ... , x_n | c )
What about P( c ) ? How do we calculate it?
=> P( c ) is the total probability of a class.
=> How often does this class occur in total?
E.g. in the case of classes positive and negative, we would be calculating the probability that any given review is positive or negative, without actually analyzing the current input document.
This is calculated by counting the relative frequencies of each class in a corpus.
E.g. out of 10 reviews we have seen, 3 have been classified as positive.
=> P ( positive ) = 3 / 10
Now let's go back to the first term in the Naive Bayes equation:
P( d | c ), or P( x₁, x₂, x₃, ... , x_n | c ).
How do we actually calculate this?

We use some assumptions to simplify the computation of this probability:

the Bag of Words assumption => assume the position of the words in the document doesn't matter.
Conditional Independence => Assume the feature probabilities P( x_i | c_j ) are independent given the class c.

It is important to note that both of these assumptions aren't actually correct - of course, the order of words matter, and they are not independent. A phrase like this movie was incredibly terrible shows an example of how both of these assumptions don't hold up in regular english.
However, these assumptions greatly simplify the complexity of calculating the classification probability. And in practice, we can calculate probabilities with a reasonable level of accuracy given these assumptions.
So..

To calculate the Naive Bayes probability, P( d | c ) x P( c ), we calculate P( x_i | c ) for each x_i in d, and multiply them together.
Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and choose the class that has the maximum overall value.
###How do we learn the values of P ( c ) and P ( x_i | c ) ?

=> We can use Maximum Likelihood estimates.
Simply put, we look at frequency counts.
P ( c_i ) = [ Num documents that have been classified as c_i ] / [ Num documents ]
In english..
Out of all the documents, how many of them were in class i ?
P ( w_i | c_j ) = [ count( w_i, c_j ) ] / [ Σ_w∈V count ( w, c_j ) ]
In english...
The probability of word i given class j is the count that the word occurred in documents of class j, divided by the sum of the counts of each word in our vocabulary in class j.
So for the denominator, we iterate thru each word in our vocabulary, look up the frequency that it has occurred in class j, and add these up.
####Problems with Maximum-Likelihood Estimate.
What if we haven't seen any training documents with the word fantastic in our class positive ?
In this case, P ( fantastic | positive ) = 0
=> This is BAD
Since we are calculating the overall probability of the class by multiplying individual probabilities for each word, we would end up with an overall probability of 0 for the positive class.
So how do we fix this issue?
We can use a Smoothing Algorithm, for example Add-one smoothing (or Laplace smoothing).
####Laplace Smoothing
We modify our conditional word probability by adding 1 to the numerator and modifying the denominator as such:
P ( w_i | c_j ) = [ count( w_i, c_j ) + 1 ] / [ Σ_w∈V( count ( w, c_j ) + 1 ) ]
This can be simplified to
P ( w_i | c_j ) = [ count( w_i, c_j ) + 1 ] / [ Σ_w∈V( count ( w, c_j ) ) + |V| ]
where |V| is our vocabulary size (we can do this since we are adding 1 for each word in the vocabulary in the previous equation).
####So in Summary, to Machine-Learn your Naive-Bayes Classifier:
Given:

an input document
the category that this document belongs to

We do:

increment the count of total documents we have learned from N.
increment the count of documents that have been mapped to this category N_c.
if we encounter new words in this document, add them to our vocabulary, and update our vocabulary size |V|.
update count( w, c ) => the frequency with which each word in the document has been mapped to this category.
update count ( c ) => the total count of all words that have been mapped to this class.

So when we are confronted with a new document, we calculate for each class:

P( c ) = N_c / N
=> how many documents were mapped to class c, divided by the total number of documents we have ever looked at. This is the overall, or prior probability of this class.
Then we iterate thru each word in the document, and calculate:
P( w | c ) = [ count( w, c ) + 1 ] / [ count( c ) + |V| ]
=> the count of how many times this word has appeared in class c, plus 1, divided by the total count of all words that have ever been mapped to class c, plus the vocabulary size.
This uses the Laplace-Smoothing, so we don't get tripped up by words we've never seen before. This equation is used both for words we have seen, as well as words we haven't seen.
=> we multiply each P( w | c ) for each word w in the new document, then multiply by P( c ), and the result is the probability that this document belongs to this class.
####Some Ways that we can tweak our Naive Bayes Classifier
Depending on the domain we are working with, we can do things like

Collapse Part Numbers or Chemical Names into a single token
Upweighting (counting a word as if it occurred twice)
Feature selection (since not all words in the document are usually important in assigning it a class, we can look for specific words in the document that are good indicators of a particular class, and drop the other words - those that are viewed to be semantically empty)

=> If we have a sentence that contains a title word, we can upweight the sentence (multiply all the words in it by 2 or 3 for example), or we can upweight the title word itself (multiply it by a constant).
##Sentiment Analysis
###Scherer Typology of Affective States

Emotion

Brief, organically synchronized.. evaluation of a major event
=> angry, sad, joyful, fearful, ashamed, proud, elated

Mood

diffuse non-caused low-intensity long-duration change in subjective feeling
=> cheerful, gloomy, irritable, listless, depressed, buoyant

Interpersonal Stances

Affective stance towards another person in a specific interaction
=> friendly, flirtatious, distant, cold, warm, supportive, contemtuous

Attitudes

Enduring, affectively colored beliefs, disposition towards objects or persons
=> liking, loving, hating, valuing, desiring

*Personality Traits

Stable personality dispositions and typical behavior tendencies
=> nervous, anxious, reckless, morose, hostile, jealous
Sentiment Analysis is the detection of attitudes (2nd from the bottom of the above list).
We want to know:


The Holder (source) of the attitude


The Target (aspect) of the attitude


The Type of the attitude from a set of types (like, love, hate, value, desire, etc.).
Or, more commonly, simply the weighted polarity (positive, negative, neutral, together with strength).


###Baseline Algorithm for Sentiment Analysis
Given a piece of text, we perform:

Tokenization
Feature Extraction
Classification using different classifiers

Naive Bayes
MaxEnt
SVM


####Tokenization Issues
Depending on what type of text we're dealing with, we can have the following issues:

Dealing with HTML or XML markup
Twitter Markup (names, hash tags)
Capitalization
Phone Numbers, dates, emoticons

Some useful code for tokenizing:

Christopher Potts Sentiment Tokenizer
Brendan O'Connor Twitter Tokenizer

####Classification
We will have to deal with handling negation:
I didn't like this movie vs I really like this movie
####So, how do we handle negation?
One way is to prepend NOT_ to every word between the negation and the beginning of the next punctuation character.
E.g. I didn't really like this movie, but ...
=> I didn't NOT_really NOT_like NOT_this NOT_movie, but ...
This doubles our vocabulary, but helps in tokenizing negative sentiments and classifying them.
####Hatzivassiloglou and McKeown intuition for identifying word polarity

Adjectives conjoined by and have the same polarity

=> Fair and legitimate, corrupt and brutal

Adjectives conjoined by but do not

=> Fair but brutal
We can use this intuition to learn new adjectives.
Imagine we have a set of adjectives, and we have identified the polarity of each adjective. Whenever we see a new word we haven't seen before, and it is joined to an adjective we have seen before by an and, we can assign it the same polarity.
For example, say we know the poloarity of nice.
When we see the phrase nice and helpful, we can learn that the word helpful has the same polarity as the word nice. In this way, we can learn the polarity of new words we haven't encountered before.
So we can expand our seed set of adjectives using these rules. Then, as we count the frequency that but has occurred between a pair of words versus the frequency with which  and has occurred between the pair, we can start to build a ratio of buts to ands, and thus establish a degree of polarity for a given word.
####What about learning the polarity of phrases?

Take a corpus, and divide it up into phrases.

Then run through the corpus, and extract the first two words of every phrase that matches one these rules:

1st word is adjective, 2nd word is noun_singular or noun_plural, 3rd word is anything
1st word is adverb, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
1st word is adjective, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
1st word is noun_singular or noun_plural, 2nd word is adjective, 3rd word is NOT noun_singular or noun_plural
1st word is adverb, 2nd word is verb, 3rd word is anything

Note: To do this, we'd have to run each phrase through a Part-of-Speech tagger.
Then, we can look at how often they co-occur with positive words.

Positive phrases co-occur more with excellent
Negative phrases co-occur more with poor

But how do we measure co-occurrence?
We can use Pointwise Mutual Information:
How much more do events x and y occur than if they were independent?
PMI( word₁, word₂ ) = log₂ { [ P( word₁, word₂ ] / [ P( word₁ ) x P( word₂ ) ] }
Then we can determine the polarity of the phrase as follows:
Polarity( phrase ) = PMI( phrase, excellent ) - PMI( phrase, poor )
= log₂ { [ P( phrase, excellent ] / [ P( phrase ) x P( excellent ) ] } - log₂ { [ P( phrase, poor ] / [ P( phrase ) x P( poor ) ] }
Another way to learn polarity (of words)

Start with a seed set of positive and negative words.
Then, look them up in a thesaurus, and:


add synonyms of each of the positive words to the positive set


add antonyms of each of the positive words to the negative set


add synonyms of each of the negative words to the negative set


add antonyms of each of the negative words to the positive set


and.. repeat.. with the new set of words we have discovered, to build out our lexicon.
###Summary on learning Lexicons

Start with a seed set* of words ( good, poor, ... )
Find other words that have similar polarity:

using and and but
using words that appear nearby in the same document
using synonyms and antonyms


###Sentiment Aspect Analysis
What happens if we get the following phrase:
The food was great, but the service was awful.
This phrase doesn't really have an overall sentiment; it has two separate sentiments; great food and awful service. So sometimes, instead of trying to tackle the problem of figuring out the overall sentiment of a phrase, we can instead look at finding the target of any sentiment.
How do we do this?

=> We look at frequent phrases, and rules

Find all highly frequent phrases across a set of reviews (e.g. fish tacos) => this can help identify the targets of different sentiments.
Filter these highly frequent phrases by rules like occurs right after a sentiment word

=> ... great fish tacos ... means that fish tacos is a likely target of sentiment, since we know great is a sentiment word.
Let's say we already know the important aspects of a piece of text. For example, if we are analyzing restaurant reviews, we know that aspects we will come across include food, decor, service, value, ...
Then we can train our classifier to assign an aspect to a given sentence or phrase.
"Given this sentence, is it talking about food or decor or ..."
=> This only applies to text where we KNOW what we will come across.
So overall, our flow could look like:
Text (e.g. reviews) --> Text extractor (extract sentences/phrases) --> Sentiment Classifier (assign a sentiment to each sentence/phrase) --> Aspect Extractor (assign an aspect to each sentence/phrase) --> Aggregator --> Final Summary
##Conditional Models
Naive Bayes Classifiers use a joint probability model. We evaluate probabilities P( d, c ) and try to maximize this joint likelihood.
=> maximizing P( text, class )
rather than a conditional probability model
-> maximizing P( class | text )
If we instead try to maximize the conditional probability of P( class | text ), we can achieve higher accuracy in our classifier.
A conditional model gives probabilities P( c | d ). It takes the data as given and models only the  conditional probability of the class.
##MaxEnt Classifiers (Maximum Entropy Classifiers)
We define a feature as an elementary piece of evidence that links aspects of what we observe ( d ), with a category ( c ) that we want to predict.
So a feature is a function that maps from the space of classes and data onto a Real Number (it has a bounded, real value).
ƒ: C x D --> R
Models will assign a weight to each feature:

A positive weight votes that the configuration is likely correct
A negative weight votes that the configuration is likely incorrect

What do features look like?

Here is an example feature:

ƒ₁(c,d) ≡ [ c = LOCATION & w_-1="in" & isCapitalized(w) ]

This feature picks out from the data cases where the class is LOCATION, the previous word is "in" and the current word is capitalized.
This feature would match the following scenarios:

class = LOCATION, data = "in Quebec"
class = LOCATION, data = "in Arcadia"

Another example feature:

ƒ₂(c,d) ≡ [ c = DRUG & ends(w, "c") ]

This feature picks out from the data cases where the class is DRUG and the current word ends with the letter c.
This feature would match:

class = DRUG, data = "taking Zantac"

Features generally use both the bag of words, as we saw with the Naive-Bayes Classifier, as well as looking at adjacent words (like the example features above).
###Feature-Based Linear Classifiers
Feature-Based Linear Classifiers:


Are a linear function from feature sets {ƒ_i} to classes {c}.


Assign a weight λ_i to each feature ƒ_i


We consider each class for an observed datum d


For a pair (c,d), features vote with their weights:
vote(c) = Σ λ_iƒ_i(c,d)


Choose the class c which maximizes vote(c)


As you can see in the equation above, the vote is just a weighted sum of the features; each feature has its own weight. So we try to find the class that maximizes the weighted sum of all the features.
MaxEnt Models make a probabilistic model from the linear combination Σ λ_iƒ_i(c,d).
Since the weights can be negative values, we need to convert them to positive values since we want to calculating a non-negative probability for a given class. So we use the value as such:
exp Σ λ_iƒ_i(c,d)
This way we will always have a positive value.
We make this value into a probability by dividing by the sum of the probabilities of all classes:
[ exp Σ λ_iƒ_i(c,d) ] / [ Σ_C exp Σ λ_iƒ_i(c,d) ]
##Named Entity Recognition
Named Entity Recognition (NER) is the task of extracting entities (people, organizations, dates, etc.) from text.
###Machine-Learning sequence model approach to NER
####Training

Collect a set of representative Training Documents
Label each token for its entity class, or Other (O) if no match
Design feature extractors appropriate to the text and classes
Train a sequence classifier to predict the labels from the data

####Testing

Get a set of testing documents
Run the model on the document to label each token
Output the recognized entities