Created October 20, 2014 19:14
{
 "worksheets": [
  {
   "cells": [
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "## Let's compare stemmers!\nFirst load in the libraries and set up the stemmers."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import nltk\nfrom nltk.corpus import stopwords\nenglish_stopwords = stopwords.words('english')\n\nfrom nltk.corpus import gutenberg\nimport re\npstemmer = nltk.PorterStemmer()\nlstemmer = nltk.LancasterStemmer()\nwnlemmatizer = nltk.WordNetLemmatizer()\n",
     "prompt_number": 1,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "We'll also need a bit of tokenization and normalization here. Add some code here to remove punctuation and stopwords."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "def tokenize_text(text): \n pattern = r'''(?x)\n ([A-Z]\\.)+\n |\\w+([-']\\w+)*\n |\\$?\\d+(\\.\\d+)?%?\n |\\.\\.\\.\n |[.,?;]+\n '''\n tokens = nltk.regexp_tokenize(text,pattern)\n # add code to remove punctuation and stopwords\n return tokens",
     "prompt_number": 2,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
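The tokenizer cell above leaves a placeholder for removing punctuation and stopwords. A minimal sketch of one way to fill it in, using a tiny hard-coded stand-in stopword set (the notebook itself uses `nltk.corpus.stopwords.words('english')`); `filter_tokens` is a hypothetical helper name, not from the notebook:

```python
import string

# Tiny stand-in for nltk.corpus.stopwords.words('english')
ENGLISH_STOPWORDS = {'the', 'of', 'and', 'a', 'my', 'in', 'to'}

def filter_tokens(tokens):
    """Drop stopword tokens and tokens made entirely of punctuation."""
    return [t for t in tokens
            if t.lower() not in ENGLISH_STOPWORDS
            and not all(ch in string.punctuation for ch in t)]

print(filter_tokens(['Come', ',', 'said', 'my', 'soul']))  # ['Come', 'said', 'soul']
```

The `all(... in string.punctuation ...)` test keeps multi-character punctuation tokens like `'...'` out, which a simple `t not in string.punctuation` check would miss.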
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Let's tokenize good ol' Walt; make sure it worked."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "tokens = tokenize_text(gutenberg.raw('whitman-leaves.txt'))\nprint len(tokens)\ntokens[0:10]",
     "prompt_number": 3,
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "143541\n"
      },
      {
       "output_type": "pyout",
       "prompt_number": 3,
       "metadata": {},
       "text": "['Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', 'Come', ',', 'said']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Here is one example of running the stemmer; you can fill in the other two below."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "import re\nfrom string import punctuation\npstemmed_tokens = [pstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in nltk.corpus.stopwords.words('english') and word.lower() not in punctuation]\n",
     "prompt_number": 21,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "pstemmed_tokens[:10]",
     "prompt_number": 22,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 22,
       "metadata": {},
       "text": "['leav',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'vers',\n 'bodi']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "lstemmer_tokens = [lstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
     "prompt_number": 25,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "lstemmer_tokens[:10]",
     "prompt_number": 26,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 26,
       "metadata": {},
       "text": "['leav',\n 'grass',\n 'walt',\n 'whitm',\n '1855',\n 'com',\n 'said',\n 'soul',\n 'vers',\n 'body']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "wnlemmatizer_tokens = [wnlemmatizer.lemmatize(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
     "prompt_number": 27,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "wnlemmatizer_tokens[:10]",
     "prompt_number": 28,
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 28,
       "metadata": {},
       "text": "['leaf',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'verse',\n 'body']"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
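The cells above show the three normalizers disagreeing on the same tokens (Porter's 'bodi' vs. Lancaster's 'body', for example). A small standalone sketch contrasting the two stemmers on words the notebook flags, assuming only that NLTK is installed (the stemmers need no corpus downloads); expected stems are taken from the notebook's own outputs:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()

# Words the notebook's analysis calls out; e.g. Lancaster clips
# 'mother' to 'moth' and 'death' to 'dea'.
for word in ['leaves', 'body', 'mother', 'death']:
    print('%-8s porter=%-8s lancaster=%s'
          % (word, porter.stem(word), lancaster.stem(word)))
```

Running a side-by-side like this on a handful of words is a cheap sanity check before committing to one stemmer for a whole corpus.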
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now we want to make 3 FreqDists to count up how common each word is now that we've done the stemming."
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "p_freq = nltk.FreqDist(pstemmed_tokens)",
     "prompt_number": 29,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "l_freq = nltk.FreqDist(lstemmer_tokens)",
     "prompt_number": 30,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "w_freq = nltk.FreqDist(wnlemmatizer_tokens)",
     "prompt_number": 31,
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
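A `FreqDist` is essentially a counter keyed by token. A toy sketch of what the three cells above build, on hand-made tokens rather than the Whitman corpus (assumes NLTK is installed):

```python
import nltk

# FreqDist counts occurrences; the notebook builds one per stemmer output.
freq = nltk.FreqDist(['leav', 'grass', 'leav', 'soul', 'leav'])

print(freq['leav'])         # 3
print(freq.most_common(1))  # [('leav', 3)]
```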
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now we want to compare the three outputs. What is a good function that quickly lets you do side-by-side comparisons of vertical lists?"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "for i in range(100):\n print '%-20s %-20s %-20s' %(p_freq.keys()[i], l_freq.keys()[i], w_freq.keys()[i]) \n \n #could also use zip so that you get tuples of 3",
     "prompt_number": 34,
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "o o o \nsee see see \none lov one \nlove on old \nold ear life \nshall old love "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nyet the shall \nsoul com yet \nthee shal soul \ncome yet thee \nlife soul day "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nday long earth \nlong lif long \nearth day thy \nthi man come \nnight ev night "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nland thy thou \nthou night land \nman pass man \nknow sing know \nhand land time "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ntime thou hand \nsong know song \nface gre death \ndeath hand face \nsea tim sea "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nmen song men \neveri fac every \nsing wom woman \ngreat men great \nciti sea city "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nupon every upon \nword dea word \ngive giv ever \nhear city world \never upon body \nworld joy go "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nbodi word year \nlike hear many \npass us give \nyear world hear \ngo body like "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nmani lik good \nlook year never \ngood many joy \njoy look thing \nnever nat eye "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nstate good voice \nthing stat ship \neye nev last \nlast rest child \nleav thing young \nship real well "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nvoic ey sun \nthought last war \nrest leav thought \nstand ship rest \nyoung mak air "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nwell voic look \nsun stand u \nwar young sing \nbeauti thought state \nmake light stand "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nair war nothing \nlive wel think \nthink sun dead \nus beauty far \nnoth go mother "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ndead liv star \nfar sil water \nsound think new \nwork air little \ntake tak make \nmother sail shore "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nstar new back \nwait noth others \nwater chant would \nwomen dead house \nlight far head "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nnew friend work \npart part forth \nwalk ris light \nback sound poem \nchant work alone "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nhous moth past \nlittl past let \nrise star pas \nshore wait saw \nother wat take "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nwould walk may \nclose clos part \ntoward fal strong \nhead back around \nhold sleep much \nsilent hous place "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nforth littl sound \npoem shor toward \nreturn would away \nalon let age \narm oth side "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\n"
      }
     ],
     "language": "python",
     "trusted": false,
     "collapsed": false
    },
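The comparison cell's comment suggests `zip` as an alternative way to line up the three columns. A minimal sketch of that idea, using stand-in top-word lists copied from the notebook's printed columns; note also that indexing `p_freq.keys()[i]` only works in Python 2 era NLTK, and on Python 3 the equivalent would be iterating over `p_freq.most_common(100)`:

```python
# Stand-in top-word lists; in the notebook these come from the three FreqDists.
porter_top = ['o', 'see', 'one', 'love']
lancaster_top = ['o', 'see', 'lov', 'on']
wordnet_top = ['o', 'see', 'one', 'old']

# zip yields one 3-tuple per rank, so each printed row compares like with like.
for row in zip(porter_top, lancaster_top, wordnet_top):
    print('%-20s %-20s %-20s' % row)
```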
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "Now it's time to analyze your data. How do the outputs differ? How are they similar?"
    },
    {
     "metadata": {},
     "cell_type": "markdown",
     "source": "* Nouns that aren't also verbs seem to do better than other kinds of nouns (e.g., night).\n* The Porter Stemmer converts words whose English base forms end in -y to stems ending in -i (with mixed results: it works for vary and city, but not for every or thy).\n* The Lancaster Stemmer yields the worst quality. Truncating the 'th' in death??? When does that make sense???\n* Because Lancaster truncates too aggressively, it yields quite different frequencies (usual, use, and us conflated? Nat probably came from national).\n* In the WordNet Lemmatizer output, 'pas' shows up. We wonder whether 'pass' got lemmatized that way because there was a 'pas' in the dictionary. That seems inaccurate.\n* The frequencies diverge more and more as you go down the list.\n* Mother stemmed to moth??\n* Has anyone thought of stemming words only if the resulting base form has more than 6 characters, or at least two syllables? That would avoid things like building -> build and silent -> sil.\n\n"
    },
    {
     "metadata": {},
     "cell_type": "code",
     "input": "",
     "outputs": [],
     "language": "python",
     "trusted": false,
     "collapsed": false
    }
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "name": "",
  "signature": "sha256:44d3ebc72cd17a20a5c793640e033b420d29c00bd0f19908484f3eda291ef08e"
 },
 "nbformat": 3
}