@cmgerber
Created October 22, 2014 05:30
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nfrom nltk.corpus import names\nimport random",
"prompt_number": 2,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** A feature recognition function **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features(word):\n return {'last_letter': word[-1]}\ngender_features('Samantha')",
"prompt_number": 3,
"outputs": [
{
"text": "{'last_letter': 'a'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 3
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Create name datasets ** "
},
{
"metadata": {},
"cell_type": "code",
"input": "def create_name_data():\n male_names = [(name, 'male') for name in names.words('male.txt')]\n female_names = [(name, 'female') for name in names.words('female.txt')]\n allnames = male_names + female_names\n \n # Randomize the order of male and female names, and de-alphabatize\n random.shuffle(allnames)\n return allnames\n\nnames_data = create_name_data()",
"prompt_number": 4,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** First Pass at Training and Testing Data **"
},
{
"metadata": {},
"cell_type": "code",
"input": "\n# This function allows experimentation with different feature definitions\n# items is a list of (key, value) pairs from which features are extracted and training sets are made\ndef create_training_sets (feature_function, items):\n # Create the features sets. Call the function that was passed in.\n # For names, key is the name, and value is the gender\n featuresets = [(feature_function(key), value) for (key, value) in items]\n \n # Divided training and testing in half. Could divide in other proportions instead.\n halfsize = int(float(len(featuresets)) / 10.0)\n train_set, test_set = featuresets[halfsize:], featuresets[:halfsize]\n return train_set, test_set",
"prompt_number": 5,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Train the classifier on the training data, with the first definition of features **"
},
{
"metadata": {},
"cell_type": "code",
"input": "# pass in a function name\ntrain_set, test_set = create_training_sets(gender_features, names_data)\ncl = nltk.NaiveBayesClassifier.train(train_set)",
"prompt_number": 6,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Test the classifier on some examples **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl.classify(gender_features('Carl'))\nprint cl.classify(gender_features('Carla'))\nprint cl.classify(gender_features('Carly'))\nprint cl.classify(gender_features('Carlo'))\nprint cl.classify(gender_features('Carlos'))\n",
"prompt_number": 7,
"outputs": [
{
"output_type": "stream",
"text": "male\nfemale\nfemale\nmale\nmale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl.classify(gender_features('Carli'))\nprint cl.classify(gender_features('Carle'))\nprint cl.classify(gender_features('Charles'))\nprint cl.classify(gender_features('Carlie'))\nprint cl.classify(gender_features('Charlie'))",
"prompt_number": 8,
"outputs": [
{
"output_type": "stream",
"text": "female\nfemale\nmale\nfemale\nfemale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Run the NLTK evaluation function on the test set **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print \"%.3f\" % nltk.classify.accuracy(cl, test_set)",
"prompt_number": 9,
"outputs": [
{
"output_type": "stream",
"text": "0.781\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Run the NLTK feature inspection function on the classifier **"
},
{
"metadata": {},
"cell_type": "code",
"input": "cl.show_most_informative_features(15)",
"prompt_number": 10,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last_letter = u'a' female : male = 34.0 : 1.0\n last_letter = u'k' male : female = 29.3 : 1.0\n last_letter = u'f' male : female = 15.9 : 1.0\n last_letter = u'v' male : female = 11.2 : 1.0\n last_letter = u'p' male : female = 9.8 : 1.0\n last_letter = u'd' male : female = 9.5 : 1.0\n last_letter = u'o' male : female = 8.4 : 1.0\n last_letter = u'm' male : female = 7.5 : 1.0\n last_letter = u'r' male : female = 6.5 : 1.0\n last_letter = u'g' male : female = 5.3 : 1.0\n last_letter = u'w' male : female = 4.8 : 1.0\n last_letter = u'z' male : female = 4.3 : 1.0\n last_letter = u's' male : female = 4.1 : 1.0\n last_letter = u't' male : female = 4.0 : 1.0\n last_letter = u'i' female : male = 3.5 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Let's add some more features to improve results **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features2(word):\n features = {}\n word = word.lower()\n features['last'] = word[-1]\n features['first'] = word[:1]\n features['second'] = word[1:2] # get the 'h' in Charlie?\n return features\ngender_features2('Samantha') ",
"prompt_number": 11,
"outputs": [
{
"text": "{'first': 's', 'last': 'a', 'second': 'a'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 11
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We wrote the code so that we can easily pass in the new feature function. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "train_set2, test_set2 = create_training_sets(gender_features2, names_data)\ncl2 = nltk.NaiveBayesClassifier.train(train_set2)\nprint \"%.3f\" % nltk.classify.accuracy(cl2, test_set2)",
"prompt_number": 12,
"outputs": [
{
"output_type": "stream",
"text": "0.801\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Let's hand check some of the harder cases ... oops some are right but some are now wrong. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl2.classify(gender_features2('Carli'))\nprint cl2.classify(gender_features2('Carle'))\nprint cl2.classify(gender_features2('Charles')) #oops ... gets this wrong now!\nprint cl2.classify(gender_features2('Carlie'))\nprint cl2.classify(gender_features2('Charlie'))",
"prompt_number": 13,
"outputs": [
{
"output_type": "stream",
"text": "female\nfemale\nmale\nfemale\nfemale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We can see the influence of some of the new features **"
},
{
"metadata": {},
"cell_type": "code",
"input": "cl2.show_most_informative_features(15)",
"prompt_number": 14,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last = u'a' female : male = 34.0 : 1.0\n last = u'k' male : female = 29.3 : 1.0\n last = u'f' male : female = 15.9 : 1.0\n last = u'v' male : female = 11.2 : 1.0\n last = u'p' male : female = 9.8 : 1.0\n last = u'd' male : female = 9.5 : 1.0\n last = u'o' male : female = 8.4 : 1.0\n last = u'm' male : female = 7.5 : 1.0\n second = u'k' male : female = 6.5 : 1.0\n last = u'r' male : female = 6.5 : 1.0\n second = u'z' male : female = 5.8 : 1.0\n last = u'g' male : female = 5.3 : 1.0\n last = u'w' male : female = 4.8 : 1.0\n first = u'w' male : female = 4.6 : 1.0\n last = u'z' male : female = 4.3 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We really need a development set to test our features on before testing on the real test set. So let's redo our division of the data. In this case we do the dividing up before applying the feature selection so we can keep track of the names **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def create_training_sets3 (feature_function, items):\n # Create the features sets. Call the function that was passed in.\n # For names, key is the name, and value is the gender\n featuresets = [(feature_function(key), value) for (key, value) in items]\n \n # Divide data into thirds\n third = int(float(len(featuresets)) / 3.0)\n return items[0:third], items[third:third*2], items[third*2:], featuresets[0:third], featuresets[third:third*2], featuresets[third*2:]\n \ntrain_items, dev_items, test_items, train_features, dev_features, test_features = create_training_sets3(gender_features2, names_data)\n",
"prompt_number": 15,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "cl3 = nltk.NaiveBayesClassifier.train(train_features)\n# This is code from the NLTK chapter\nerrors = []\nfor (name, tag) in dev_items:\n guess = cl3.classify(gender_features2(name))\n if guess != tag:\n errors.append( (tag, guess, name) )",
"prompt_number": 16,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Print out the correct vs. the guessed answer for the errors, in order to inspect those that were wrong. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "for (tag, guess, name) in sorted(errors[:10]): \n print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)",
"prompt_number": 20,
"outputs": [
{
"output_type": "stream",
"text": "correct=female guess=male name=Hester \ncorrect=female guess=male name=Jesselyn \ncorrect=female guess=male name=Lark \ncorrect=female guess=male name=Linet \ncorrect=female guess=male name=Sybyl \ncorrect=male guess=female name=Alfonse \ncorrect=male guess=female name=Anthony \ncorrect=male guess=female name=Chelton \ncorrect=male guess=female name=Darrel \ncorrect=male guess=female name=Iggie \n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Exercise** Rewrite the functions above to add some additional features, and then rerun the classifier to evaluate if they improve or degrade results.\n\nIdeas for features:\n* name length\n* other position information\n* pairs of letters\n* your idea goes here"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Below is my features.\n\nThe features that are commented out made the results worse"
},
{
"metadata": {},
"cell_type": "code",
"input": "w = Counter('hello')\nw.",
"prompt_number": 62,
"outputs": [
{
"text": "Counter({'l': 2, 'h': 1, 'e': 1, 'o': 1})",
"output_type": "pyout",
"metadata": {},
"prompt_number": 62
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "from collections import Counter\ndef multiLetter(word):\n '''Definition to count the number of multiple letters in a word.\n It sums the counts of all letter that occur more than once'''\n multi_count = 0\n count = Counter(word)\n for key, value in count.items():\n if value > 1:\n multi_count += value\n return multi_count",
"prompt_number": 65,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features3(word):\n features = {}\n word = word.lower()\n features['last'] = word[-1]\n features['second'] = word[1:2] # get the 'h' in Charlie?\n features['length'] = len(word)\n features['first_two'] = word[:2]\n features['last_two'] = word[-2:]\n features['pal'] = word==word[::-1]\n# features['multi_count'] = multiLetter(word)\n# features['vowel_count'] = len([letter for letter in word if\n# letter in ['a', 'e', 'i', 'o', 'u']])\n return features\ngender_features3('Hannah')",
"prompt_number": 74,
"outputs": [
{
"text": "{'first_two': 'ha',\n 'last': 'h',\n 'last_two': 'ah',\n 'length': 6,\n 'pal': True,\n 'second': 'a'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 74
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "train_items, dev_items, test_items, train_features, dev_features, test_features = create_training_sets3(gender_features3, names_data)\ncl3 = nltk.NaiveBayesClassifier.train(train_features)\nprint \"%.3f\" % nltk.classify.accuracy(cl3, test_features)",
"prompt_number": 77,
"outputs": [
{
"output_type": "stream",
"text": "0.799\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "cl3.show_most_informative_features(15)",
"prompt_number": 76,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last_two = u'ia' female : male = 34.4 : 1.0\n last_two = u'rd' male : female = 32.2 : 1.0\n last = u'k' male : female = 27.7 : 1.0\n last_two = u'la' female : male = 26.7 : 1.0\n last = u'a' female : male = 25.4 : 1.0\n last_two = u'ta' female : male = 23.0 : 1.0\n first_two = u'ka' female : male = 22.6 : 1.0\n last_two = u'ra' female : male = 18.9 : 1.0\n last_two = u'er' male : female = 18.0 : 1.0\n last = u'd' male : female = 10.9 : 1.0\n last = u'r' male : female = 10.7 : 1.0\n last_two = u'on' male : female = 10.7 : 1.0\n first_two = u'we' male : female = 10.4 : 1.0\n last_two = u'tt' male : female = 10.4 : 1.0\n last_two = u'os' male : female = 10.4 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "incorrect = []\nfor (name, tag) in dev_items:\n guess = cl3.classify(gender_features3(name))\n if guess != tag:\n incorrect.append( (tag, guess, name) )",
"prompt_number": 69,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print 'tag, guess, name'\nincorrect[:15]",
"prompt_number": 70,
"outputs": [
{
"output_type": "stream",
"text": "tag, guess, name\n",
"stream": "stdout"
},
{
"text": "[('female', 'male', u'Lark'),\n ('female', 'male', u'Linet'),\n ('female', 'male', u'Meghan'),\n ('female', 'male', u'Sybyl'),\n ('male', 'female', u'Glen'),\n ('male', 'female', u'Anthony'),\n ('female', 'male', u'Hester'),\n ('male', 'female', u'Alfonse'),\n ('female', 'male', u'Charmion'),\n ('male', 'female', u'Terrel'),\n ('female', 'male', u'Easter'),\n ('female', 'male', u'Tory'),\n ('female', 'male', u'Chrysler'),\n ('female', 'male', u'Barry'),\n ('female', 'male', u'Diamond')]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 70
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Result on test_set: 0.799**"
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:8bd223ef85e410f067def3cd5069b4451d206c90afc4124576a6d9278d594bc7"
},
"nbformat": 3
}
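
The train/dev/test workflow in the notebook above can be tried without downloading the NLTK names corpus. The following is a minimal sketch under stated assumptions, not the notebook's own code: `toy_items` is a hand-made list, `toy_classify` is a hypothetical stand-in for `cl3.classify`, and `split_three_ways` and `collect_errors` are illustrative helper names. It shows the two ideas the notebook relies on: splitting the raw `(name, label)` pairs so the names stay available, and collecting misclassified names from the development set for inspection.

```python
def last_letter_features(name):
    # Same idea as gender_features in the notebook: a single feature, the final letter.
    return {'last_letter': name[-1].lower()}


def split_three_ways(items):
    # Split the raw (name, label) pairs into train/dev/test thirds.
    # Splitting the raw items keeps the names available for error analysis.
    third = len(items) // 3
    return items[:third], items[third:2 * third], items[2 * third:]


def collect_errors(classify, feature_function, labeled_items):
    # Return (correct_label, guessed_label, name) for every misclassified item.
    errors = []
    for name, tag in labeled_items:
        guess = classify(feature_function(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return errors


def toy_classify(features):
    # Hypothetical stand-in for a trained classifier's classify() method:
    # guess 'female' when the name ends in a, e, i or y, otherwise 'male'.
    return 'female' if features['last_letter'] in {'a', 'e', 'i', 'y'} else 'male'


# Hand-made toy data (an assumption for illustration, not the NLTK corpus).
toy_items = [
    ('Carla', 'female'), ('Carl', 'male'), ('Charlie', 'male'),
    ('Hester', 'female'), ('Samantha', 'female'), ('Carlos', 'male'),
    ('Carly', 'female'), ('Anthony', 'male'), ('Lark', 'female'),
]

train_items, dev_items, test_items = split_three_ways(toy_items)

# Inspect mistakes on the development set only; the test set stays untouched
# until the feature function is final.
for tag, guess, name in collect_errors(toy_classify, last_letter_features, dev_items):
    print('correct=%-8s guess=%-8s name=%s' % (tag, guess, name))
```

With a real classifier, `toy_classify` would be replaced by the `classify` method of the trained model, and the feature function by whichever variant is under evaluation.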