Created October 20, 2014 19:14
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Let's compare stemmers!\nFirst load in the libraries and set up the stemmers."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import nltk\nfrom nltk.corpus import stopwords\nenglish_stopwords = stopwords.words('english')\n\nfrom nltk.corpus import gutenberg\nimport re\npstemmer = nltk.PorterStemmer()\nlstemmer = nltk.LancasterStemmer()\nwnlemmatizer = nltk.WordNetLemmatizer()\n",
     "prompt_number": 1,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "We'll also need a bit of tokenization and normalization here. Add some code here to remove punctuation and stopwords."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def tokenize_text(text): \n pattern = r'''(?x)\n ([A-Z]\\.)+\n |\\w+([-']\\w+)*\n |\\$?\\d+(\\.\\d+)?%?\n |\\.\\.\\.\n |[.,?;]+\n '''\n tokens = nltk.regexp_tokenize(text,pattern)\n # add code to remove punctuation and stopwords\n return tokens",
     "prompt_number": 2,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
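The tokenizer cell above leaves a placeholder for removing punctuation and stopwords. A minimal sketch of one way to fill it in, using a tiny hard-coded stand-in stopword set (the notebook itself uses `nltk.corpus.stopwords.words('english')`); `filter_tokens` is a hypothetical helper name, not from the notebook:

```python
import string

# Tiny stand-in for nltk.corpus.stopwords.words('english')
ENGLISH_STOPWORDS = {'the', 'of', 'and', 'a', 'my', 'in', 'to'}

def filter_tokens(tokens):
    """Drop stopword tokens and tokens made entirely of punctuation."""
    return [t for t in tokens
            if t.lower() not in ENGLISH_STOPWORDS
            and not all(ch in string.punctuation for ch in t)]

print(filter_tokens(['Come', ',', 'said', 'my', 'soul']))  # ['Come', 'said', 'soul']
```

The `all(... in string.punctuation ...)` test keeps multi-character punctuation tokens like `'...'` out, which a simple `t not in string.punctuation` check would miss.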
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Let's tokenize good ol' Walt; make sure it worked."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "tokens = tokenize_text(gutenberg.raw('whitman-leaves.txt'))\nprint len(tokens)\ntokens[0:10]",
     "prompt_number": 3,
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "143541\n"
      },
      {
       "output_type": "pyout",
       "prompt_number": 3,
       "metadata": {},
       "text": "['Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', 'Come', ',', 'said']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Here is one example of running the stemmer; you can fill in the other two below."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import re\nfrom string import punctuation\npstemmed_tokens = [pstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in nltk.corpus.stopwords.words('english') and word.lower() not in punctuation]\n",
     "prompt_number": 21,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pstemmed_tokens[:10]",
     "prompt_number": 22,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 22,
       "metadata": {},
       "text": "['leav',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'vers',\n 'bodi']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "lstemmer_tokens = [lstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
     "prompt_number": 25,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "lstemmer_tokens[:10]",
     "prompt_number": 26,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 26,
       "metadata": {},
       "text": "['leav',\n 'grass',\n 'walt',\n 'whitm',\n '1855',\n 'com',\n 'said',\n 'soul',\n 'vers',\n 'body']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "wnlemmatizer_tokens = [wnlemmatizer.lemmatize(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
     "prompt_number": 27,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "wnlemmatizer_tokens[:10]",
     "prompt_number": 28,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 28,
       "metadata": {},
       "text": "['leaf',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'verse',\n 'body']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
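The cells above show the three normalizers disagreeing on the same tokens (Porter's 'bodi' vs. Lancaster's 'body', for example). A small standalone sketch contrasting the two stemmers on words the notebook flags, assuming only that NLTK is installed (the stemmers need no corpus downloads); expected stems are taken from the notebook's own outputs:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()

# Words the notebook's analysis calls out; e.g. Lancaster clips
# 'mother' to 'moth' and 'death' to 'dea'.
for word in ['leaves', 'body', 'mother', 'death']:
    print('%-8s porter=%-8s lancaster=%s'
          % (word, porter.stem(word), lancaster.stem(word)))
```

Running a side-by-side like this on a handful of words is a cheap sanity check before committing to one stemmer for a whole corpus.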
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now we want to make 3 FreqDists to count up how common each word is now that we've done the stemming."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "p_freq = nltk.FreqDist(pstemmed_tokens)",
     "prompt_number": 29,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "l_freq = nltk.FreqDist(lstemmer_tokens)",
     "prompt_number": 30,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "w_freq = nltk.FreqDist(wnlemmatizer_tokens)",
     "prompt_number": 31,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
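A `FreqDist` is essentially a counter keyed by token. A toy sketch of what the three cells above build, on hand-made tokens rather than the Whitman corpus (assumes NLTK is installed):

```python
import nltk

# FreqDist counts occurrences; the notebook builds one per stemmer output.
freq = nltk.FreqDist(['leav', 'grass', 'leav', 'soul', 'leav'])

print(freq['leav'])         # 3
print(freq.most_common(1))  # [('leav', 3)]
```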
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now we want to compare the three outputs. What is a good function that quickly lets you do side-by-side comparisons of vertical lists?"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for i in range(100):\n print '%-20s %-20s %-20s' %(p_freq.keys()[i], l_freq.keys()[i], w_freq.keys()[i]) \n \n #could also use zip so that you get tuples of 3",
     "prompt_number": 34,
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "o o o \nsee see see \none lov one \nlove on old \nold ear life \nshall old love "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nyet the shall \nsoul com yet \nthee shal soul \ncome yet thee \nlife soul day "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nday long earth \nlong lif long \nearth day thy \nthi man come \nnight ev night "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nland thy thou \nthou night land \nman pass man \nknow sing know \nhand land time "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ntime thou hand \nsong know song \nface gre death \ndeath hand face \nsea tim sea "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nmen song men \neveri fac every \nsing wom woman \ngreat men great \nciti sea city "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nupon every upon \nword dea word \ngive giv ever \nhear city world \never upon body \nworld joy go "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nbodi word year \nlike hear many \npass us give \nyear world hear \ngo body like "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nmani lik good \nlook year never \ngood many joy \njoy look thing \nnever nat eye "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nstate good voice \nthing stat ship \neye nev last \nlast rest child \nleav thing young \nship real well "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nvoic ey sun \nthought last war \nrest leav thought \nstand ship rest \nyoung mak air "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nwell voic look \nsun stand u \nwar young sing \nbeauti thought state \nmake light stand "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nair war nothing \nlive wel think \nthink sun dead \nus beauty far \nnoth go mother "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ndead liv star \nfar sil water \nsound think new \nwork air little \ntake tak make \nmother sail shore "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nstar new back \nwait noth others \nwater chant would \nwomen dead house \nlight far head "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nnew friend work \npart part forth \nwalk ris light \nback sound poem \nchant work alone "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nhous moth past \nlittl past let \nrise star pas \nshore wait saw \nother wat take "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nwould walk may \nclose clos part \ntoward fal strong \nhead back around \nhold sleep much \nsilent hous place "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nforth littl sound \npoem shor toward \nreturn would away \nalon let age \narm oth side "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\n"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
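The comparison cell's comment suggests `zip` as an alternative way to line up the three columns. A minimal sketch of that idea, using stand-in top-word lists copied from the notebook's printed columns; note also that indexing `p_freq.keys()[i]` only works in Python 2 era NLTK, and on Python 3 the equivalent would be iterating over `p_freq.most_common(100)`:

```python
# Stand-in top-word lists; in the notebook these come from the three FreqDists.
porter_top = ['o', 'see', 'one', 'love']
lancaster_top = ['o', 'see', 'lov', 'on']
wordnet_top = ['o', 'see', 'one', 'old']

# zip yields one 3-tuple per rank, so each printed row compares like with like.
for row in zip(porter_top, lancaster_top, wordnet_top):
    print('%-20s %-20s %-20s' % row)
```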
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now it's time to analyze your data. How do the outputs differ? How are they similar?"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "* Nouns that aren't also verbs seem to do better than other kinds of nouns (e.g., night).\n* The Porter Stemmer converts words whose English base forms end in -y to stems ending in -i (with mixed results: it works for vary and city, but not for every or thy).\n* The Lancaster Stemmer yields the worst quality. Truncating the 'th' in death??? When does that make sense???\n* Because Lancaster truncates too aggressively, it yields quite different frequencies (usual, use, and us conflated? Nat probably came from national).\n* In the WordNet Lemmatizer output, 'pas' shows up. We wonder whether 'pass' got lemmatized that way because there was a 'pas' in the dictionary. That seems inaccurate.\n* The frequencies diverge more and more as you go down the list.\n* Mother stemmed to moth??\n* Has anyone thought of stemming words only if the resulting base form has more than 6 characters, or at least two syllables? That would avoid things like building -> build and silent -> sil.\n\n"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:44d3ebc72cd17a20a5c793640e033b420d29c00bd0f19908484f3eda291ef08e"
 },
 "nbformat": 3
}