@jasonost
Created October 20, 2014 19:14
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Let's compare stemmers!\nFirst load in the libraries and set up the stemmers."
},
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nfrom nltk.corpus import stopwords\nenglish_stopwords = stopwords.words('english')\n\nfrom nltk.corpus import gutenberg\nimport re\npstemmer = nltk.PorterStemmer()\nlstemmer = nltk.LancasterStemmer()\nwnlemmatizer = nltk.WordNetLemmatizer()\n",
"prompt_number": 1,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We'll also need a bit of tokenization and normalization here. Add some code here to remove punctuation and stopwords."
},
{
"metadata": {},
"cell_type": "code",
"input": "def tokenize_text(text): \n pattern = r'''(?x)\n ([A-Z]\\.)+\n |\\w+([-']\\w+)*\n |\\$?\\d+(\\.\\d+)?%?\n |\\.\\.\\.\n |[.,?;]+\n '''\n tokens = nltk.regexp_tokenize(text,pattern)\n # add code to remove punctuation and stopwords\n return tokens",
"prompt_number": 2,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
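{
"metadata": {},
"cell_type": "markdown",
"source": "One possible way to fill in that step (just a sketch; the helper name `clean_tokens` is ours, and you could also fold this logic into `tokenize_text` itself):"
},
{
"metadata": {},
"cell_type": "code",
"input": "from string import punctuation\n\n# build the stopword set once so membership tests are fast\nstopset = set(english_stopwords)\n\ndef clean_tokens(tokens):\n    # keep tokens that are neither stopwords nor pure punctuation\n    return [w for w in tokens if w.lower() not in stopset and w not in punctuation]",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},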
{
"metadata": {},
"cell_type": "markdown",
"source": "Let's tokenize good ol' Walt; make sure it worked."
},
{
"metadata": {},
"cell_type": "code",
"input": "tokens = tokenize_text(gutenberg.raw('whitman-leaves.txt'))\nprint len(tokens)\ntokens[0:10]",
"prompt_number": 3,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "143541\n"
},
{
"output_type": "pyout",
"prompt_number": 3,
"metadata": {},
"text": "['Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', 'Come', ',', 'said']"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Here is one example of running the stemmer; you can fill in the other two below."
},
{
"metadata": {},
"cell_type": "code",
"input": "import re\nfrom string import punctuation\npstemmed_tokens = [pstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in nltk.corpus.stopwords.words('english') and word.lower() not in punctuation]\n",
"prompt_number": 21,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "pstemmed_tokens[:10]",
"prompt_number": 22,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 22,
"metadata": {},
"text": "['leav',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'vers',\n 'bodi']"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "lstemmer_tokens = [lstemmer.stem(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
"prompt_number": 25,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "lstemmer_tokens[:10]",
"prompt_number": 26,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 26,
"metadata": {},
"text": "['leav',\n 'grass',\n 'walt',\n 'whitm',\n '1855',\n 'com',\n 'said',\n 'soul',\n 'vers',\n 'body']"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "wnlemmatizer_tokens = [wnlemmatizer.lemmatize(word.lower()) for word in tokens if word.lower()\n not in set(nltk.corpus.stopwords.words('english')) and word.lower() not in punctuation]",
"prompt_number": 27,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "wnlemmatizer_tokens[:10]",
"prompt_number": 28,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 28,
"metadata": {},
"text": "['leaf',\n 'grass',\n 'walt',\n 'whitman',\n '1855',\n 'come',\n 'said',\n 'soul',\n 'verse',\n 'body']"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now we want to make 3 FreqDists to count up how common each word is now that we've done the stemming."
},
{
"metadata": {},
"cell_type": "code",
"input": "p_freq = nltk.FreqDist(pstemmed_tokens)",
"prompt_number": 29,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "l_freq = nltk.FreqDist(lstemmer_tokens)",
"prompt_number": 30,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "w_freq = nltk.FreqDist(wnlemmatizer_tokens)",
"prompt_number": 31,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now we want to compare the three outputs. What is a good function that quickly lets you do side-by-side comparisons of vertical lists?"
},
{
"metadata": {},
"cell_type": "code",
"input": "for i in range(100):\n print '%-20s %-20s %-20s' %(p_freq.keys()[i], l_freq.keys()[i], w_freq.keys()[i]) \n \n #could also use zip so that you get tuples of 3",
"prompt_number": 34,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "o o o \nsee see see \none lov one \nlove on old \nold ear life \nshall old love "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nyet the shall \nsoul com yet \nthee shal soul \ncome yet thee \nlife soul day "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nday long earth \nlong lif long \nearth day thy \nthi man come \nnight ev night "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nland thy thou \nthou night land \nman pass man \nknow sing know \nhand land time "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\ntime thou hand \nsong know song \nface gre death \ndeath hand face \nsea tim sea "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nmen song men \neveri fac every \nsing wom woman \ngreat men great \nciti sea city "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nupon every upon \nword dea word \ngive giv ever \nhear city world \never upon body \nworld joy go "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nbodi word year \nlike hear many \npass us give \nyear world hear \ngo body like "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nmani lik good \nlook year never \ngood many joy \njoy look thing \nnever nat eye "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nstate good voice \nthing stat ship \neye nev last \nlast rest child \nleav thing young \nship real well "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nvoic ey sun \nthought last war \nrest leav thought \nstand ship rest \nyoung mak air "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nwell voic look \nsun stand u \nwar young sing \nbeauti thought state \nmake light stand "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nair war nothing \nlive wel think \nthink sun dead \nus beauty far \nnoth go mother "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\ndead liv star \nfar sil water \nsound think new \nwork air little \ntake tak make \nmother sail shore "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nstar new back \nwait noth others \nwater chant would \nwomen dead house \nlight far head "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nnew friend work \npart part forth \nwalk ris light \nback sound poem \nchant work alone "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nhous moth past \nlittl past let \nrise star pas \nshore wait saw \nother wat take "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nwould walk may \nclose clos part \ntoward fal strong \nhead back around \nhold sleep much \nsilent hous place "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nforth littl sound \npoem shor toward \nreturn would away \nalon let age \narm oth side "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
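{
"metadata": {},
"cell_type": "markdown",
"source": "The `zip` approach mentioned in the comment above would look something like this (a sketch):"
},
{
"metadata": {},
"cell_type": "code",
"input": "# zip gives tuples of three, one per rank, which print naturally side by side\nfor p, l, w in zip(p_freq.keys(), l_freq.keys(), w_freq.keys())[:100]:\n    print '%-20s %-20s %-20s' % (p, l, w)",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},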
{
"metadata": {},
"cell_type": "markdown",
"source": "Now it's time to analyze your data. How do the outputs differ? How are they similar?"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "* Nouns that aren't also verbs seem to do better than other kinds of nouns (night, !\n* Porter Stemmer converts to base form ending in -i for words that have English orthography baseforms ending in -y (with mixed results: works for vary, city, but not every, thy)\n* Lancaster Stemmer yields worst quality. Truncating 'th' in death??? When does that make sense???\n* Because the Lancaster truncates too aggressively, it yields pretty differently frequency (usual, use, us conflated? Nat came from national, maybe)\n* In the WordNet Lemmatizer, 'pas' shows up. We're wondering if 'pass' got stemmed because there was a 'pas' in the dictionary. That seems inaccurate. \n* The frequencies get differenter and differenter as you get lower in the list.\n* Mother stemmed to moth??\n* Has anyone thought of stemming words only if the resulting base form has more than 6 characters, or at least two syllables? To avoid things like building -> build, silent -> sil. \n\n"
},
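{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sketch of that last idea: accept a stem only when the resulting base form stays long enough (the name `guarded_stem` is ours, and the 6-character threshold is just the one floated above):"
},
{
"metadata": {},
"cell_type": "code",
"input": "def guarded_stem(word, stemmer, min_len=6):\n    # use the stem only if the resulting base form is longer than min_len;\n    # otherwise keep the original word, so 'silent' is not cut down to 'sil'\n    stem = stemmer.stem(word)\n    return stem if len(stem) > min_len else word\n\nprint guarded_stem('silent', lstemmer)\nprint guarded_stem('building', pstemmer)",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},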
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:44d3ebc72cd17a20a5c793640e033b420d29c00bd0f19908484f3eda291ef08e"
},
"nbformat": 3
}