Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:
- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.
This is an example of how to use naive Bayes to train a classifier to guess the gender of a name without relying on Social Security Administration statistics. While the statistical method is likely to be more accurate, this kind of classifier can work for specialized use cases, such as making up a totally new name and estimating whether it seems more "male" or "female".
The phases of machine learning for this scenario:
- Get a dataset: a list of human names
- Label the dataset: For each name, label it as being "male" or "female"
- Decide on the "features": In other words, think about what makes the text of a name male or female. It could be something as simple as, "the kinds of letters at the end of the name", e.g. "ah" for "Sarah"
- "Vectorize" the features: Naive bayesian, and pretty much every other mathematical algorithm, deals with numbers. Whatever we decide are the "features", we need to express them in a way that can be put into a mathematical formula.
- Create training and test sets: split the labeled data so that some names are used to train the classifier, and the rest are held out to see whether the classifier can guess their labels correctly. Obviously, you don't want names to show up in both the training and test set.
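The split-and-shuffle step above can be sketched in a few lines. This is a minimal illustration with a tiny made-up list of labeled names, not the real corpus:

```python
import random

# Hypothetical labeled data: (name, label) pairs.
labeled = [("Sarah", "female"), ("Daniel", "male"), ("Abigail", "female"),
           ("Zorro", "male"), ("Claresta", "female"), ("Glen", "male")]

random.shuffle(labeled)  # avoid any alphabetical ordering bias
cutoff = len(labeled) // 2
train_set, test_set = labeled[:cutoff], labeled[cutoff:]

# No name should show up in both the training and test set.
assert not set(n for n, _ in train_set) & set(n for n, _ in test_set)
```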
This example is basically taken verbatim from the NLTK book, Chapter 6:
The names corpus is part of the NLTK Python library:
import nltk
from nltk.corpus import names  # may require a one-time nltk.download('names')
print(names.raw())
Abagael
Abagail
Abbe
Abbey
Abbi
Abbie
Abby
Abigael
Abigail
Abigale
Abra
...
Zeus
Zippy
Zollie
Zolly
Zorro
The NLTK library does this as a convenience for us: the names are separated into male.txt and female.txt:
m_names = names.words('male.txt')
f_names = names.words('female.txt')
print(m_names[1000:1005])
# ['Giovanni', 'Giraldo', 'Giraud', 'Giuseppe', 'Glen']
print(f_names[1000:1005])
# ['Claresta', 'Clareta', 'Claretta', 'Clarette', 'Clarey']
Set a variable labeled_names to contain both male and female names. To keep them distinct, though, we label each name individually (as male or female) before throwing them together into labeled_names:
import random

labeled_names = ([(n, 'male') for n in m_names] +
                 [(n, 'female') for n in f_names])
# and to make sure we are sampling across all the names,
# we shuffle them so they aren't alphabetical
random.shuffle(labeled_names)
This is the more "creative" part: deciding on features by looking just at the text of a name.
def dumb_features(word):
    return {'first': word[0]}

dumb_features('daniel')
# {'first': 'd'}
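Continuing in the spirit of the NLTK book, these feature dictionaries can be fed straight to NLTK's built-in Naive Bayes classifier. Here is a minimal sketch using a tiny made-up sample; the real pipeline would train on the full shuffled labeled_names list and score accuracy on a held-out test set:

```python
import nltk

def dumb_features(word):
    return {'first': word[0]}

# A tiny hypothetical labeled sample; the real code uses the full names corpus.
labeled_names = [('Sarah', 'female'), ('Abigail', 'female'), ('Anna', 'female'),
                 ('Daniel', 'male'), ('David', 'male'), ('Glen', 'male')]

# Vectorize: turn each name into a feature dict paired with its label.
featuresets = [(dumb_features(n), label) for (n, label) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Most likely 'male' here, since 'D' appears only among the male examples.
print(classifier.classify(dumb_features('Dora')))
```

With the full corpus, nltk.classify.accuracy(classifier, test_set) gives a sense of how well even a feature this dumb performs.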