dannguyen/nltk-play.md

## nltk-play.md

      
Using the Natural Language Toolkit to Classify Text the Bayesian way

Via the NLTK book, Chapter 6:

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

Deciding whether an email is spam or not.
Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.


Bayesian Gender Classification with the Natural Language Toolkit in Python

This is an example of how to use naive Bayes to train a classifier that guesses the gender of a name without relying on Social Security Administration statistics. A lookup against those statistics is likely to be more accurate for known names, but this kind of classifier can work for specialized use cases, such as making up a totally new name and estimating whether it seems more "male" or "female".
The phases of machine learning for this scenario:

Get a dataset: a list of human names
Label the dataset: For each name, label it as being "male" or "female"
Decide on the "features": In other words, think about what makes the text of a name male or female. It could be something as simple as, "the kinds of letters at the end of the name", e.g. "ah" for "Sarah"
"Vectorize" the features: Naive Bayes, like pretty much every other machine learning algorithm, deals with numbers (here, counts and probabilities). Whatever we decide are the "features", we need to express them in a way that can be put into a mathematical formula.
Create a training and test set: from the labeled data, split it so that some names are used to train the classifier. The test set is then used to see if the classifier can guess the labels correctly. Obviously, you don't want names to show up in both the training and test set.
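
The phases above can be sketched end to end on a tiny hand-written list. This is only an illustration under stated assumptions: the names are made up, the helper names `last_letter` and `one_hot` are hypothetical, the names are assumed to end in plain ASCII letters, and the 75/25 split and fixed seed are arbitrary choices. NLTK's classifier accepts feature dictionaries directly, so the one-hot step is shown only to make "vectorize" concrete.

```python
import random

# Toy data, made up for illustration; the real example uses NLTK's names corpus.
male = ["Daniel", "Marco", "Peter", "Glen"]
female = ["Sarah", "Anna", "Maria", "Claire"]

# Phases 1-2: a dataset plus a label for every name.
labeled = [(n, "male") for n in male] + [(n, "female") for n in female]

# Phase 3: a feature, here the last letter of the name.
def last_letter(name):
    return {"last_letter": name[-1].lower()}

# Phase 4: one way to "vectorize" it: one-hot encode the letter as 26 numbers.
def one_hot(features):
    vec = [0] * 26
    vec[ord(features["last_letter"]) - ord("a")] = 1
    return vec

# Phase 5: shuffle, then split so that no name lands in both sets.
rng = random.Random(0)  # fixed seed so the split is repeatable
rng.shuffle(labeled)
cut = len(labeled) * 3 // 4
train, test = labeled[:cut], labeled[cut:]

print(one_hot(last_letter("Sarah")).index(1))  # 7, the position of 'h'
print(len(train), len(test))                   # 6 2
```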

NLTK in action: Gender identification

This example is basically taken verbatim from the NLTK book, Chapter 6:
Get a dataset

The dataset is part of the NLTK Python library:
import nltk
from nltk.corpus import names
# nltk.download('names')  # run once if the names corpus is not installed yet
print(names.raw())
Abagael
Abagail
Abbe
Abbey
Abbi
Abbie
Abby
Abigael
Abigail
Abigale
Abra
...
Zeus
Zippy
Zollie
Zolly
Zorro

Label a dataset

The NLTK library does this as a convenience for us: the names are separated into male.txt and female.txt:
m_names = names.words('male.txt')
f_names = names.words('female.txt')

print(m_names[1000:1005])
# ['Giovanni', 'Giraldo', 'Giraud', 'Giuseppe', 'Glen']

print(f_names[1000:1005])
# ['Claresta', 'Clareta', 'Claretta', 'Clarette', 'Clarey']
Set a variable labeled_names to hold both the male and female names. To keep them distinct, each name is first paired with its label ('male' or 'female') before the two lists are combined:
import random

labeled_names = ([(n, 'male') for n in m_names] +
                 [(n, 'female') for n in f_names])

# To make sure any slice samples across all the names,
# shuffle them so they are no longer grouped by label and alphabetized
random.shuffle(labeled_names)
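
Why the shuffle matters can be seen on a small stand-in list: as built, every male name comes before every female name, so any slice taken from the front would contain a single label. The sketch below assumes hypothetical three-name lists in place of the corpus, and the `random.seed(42)` call is an optional addition that only makes the order repeatable between runs.

```python
import random

# Stand-in lists (hypothetical); the real ones come from names.words(...).
m_names = ["Giovanni", "Giraldo", "Glen"]
f_names = ["Claresta", "Clareta", "Clarey"]

labeled_names = ([(n, 'male') for n in m_names] +
                 [(n, 'female') for n in f_names])
original = list(labeled_names)  # copy kept only for the comparison below

# Before shuffling, the first half carries a single label.
labels_before = [label for _, label in labeled_names[:3]]
print(labels_before)  # ['male', 'male', 'male']

# Seeding is optional; it makes the shuffled order repeatable.
random.seed(42)
random.shuffle(labeled_names)

# Shuffling reorders the pairs but never separates a name from its label.
print(sorted(labeled_names) == sorted(original))  # True
```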
Come up with a "feature"

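This is the more "creative" part: deciding, from the text of a name alone, what might signal its gender. A deliberately simple starting point is to use only the first letter:
def dumb_features(word):
  return { 'first': word[0] }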

dumb_features('daniel')
# {'first': 'd'}
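
To see how a feature dictionary like this feeds a naive Bayes decision, here is a from-scratch sketch. It is not NLTK's implementation: the training names are invented, the `classify` helper is hypothetical, add-one (Laplace) smoothing over 26 possible letters is an assumption, and the code targets Python 3 (it relies on true division).

```python
import math
from collections import Counter, defaultdict

def dumb_features(word):
    return {'first': word[0]}

# Toy training data, invented for illustration.
train = [("Daniel", "male"), ("David", "male"), ("Mark", "male"),
         ("Anna", "female"), ("Amy", "female"), ("Dora", "female")]

label_counts = Counter()
feature_counts = defaultdict(Counter)  # label -> counts of (feature name, value)
for name, label in train:
    label_counts[label] += 1
    for item in dumb_features(name).items():
        feature_counts[label][item] += 1

def classify(word, n_values=26):
    """Return the label maximizing log P(label) + sum of log P(feature | label)."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for item in dumb_features(word).items():
            # Add-one smoothing so an unseen letter does not produce log(0).
            seen = feature_counts[label][item]
            score += math.log((seen + 1) / (count + n_values))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("Dennis"))  # male
print(classify("Alice"))   # female
```

With only a first-letter feature the classifier simply learns which initials were more common for each label in the training data, which is why a richer feature (such as the ending letters mentioned earlier) is worth trying next.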