Last active
November 23, 2015 05:38
{ | |
"metadata": { | |
"name": "" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Word Frequency\n", | |
"\n", | |
"This exercise in counting the occurrences of words in a file serves to illustrate some important iteration and data structure concepts in Python.\n", | |
"Here we compare the number of times words occur in Shakespeare's Hamlet (http://www.gutenberg.org/cache/epub/1524/pg1524.txt) with the number of times words occur in Mark Twain's Adventures of Huckleberry Finn (http://www.gutenberg.org/cache/epub/76/pg76.txt) to see how the English language has changed over time." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Download Content\n", | |
"\n", | |
"We use the Requests library to download the plain-text edition of the play." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import requests\n", | |
"hamlet_url = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt' \n", | |
"hamlet_page = requests.get(hamlet_url)\n", | |
"hamlet_text = hamlet_page.text" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Explore/Verify Content" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"len(hamlet_text)\n", | |
"hamlet_text[:120]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 6, | |
"text": [ | |
"u\"\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare\\r\\nPG has multiple editions of William Shakespeare's Complete Works\\r\\n\\r\\n\\r\"" | |
] | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The content is all in one big string, so it needs to be split into lines." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"hamlet_lines = hamlet_text.splitlines() \n", | |
"hamlet_lines[:5]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 7, | |
"text": [ | |
"[u'\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare',\n", | |
" u\"PG has multiple editions of William Shakespeare's Complete Works\",\n", | |
" u'',\n", | |
" u'',\n", | |
" u'Copyright laws are changing all over the world, be sure to check']" | |
] | |
} | |
], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's remove the header and footer content, which is irrelevant to our word count." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"hamlet_start = 290\n", | |
"hamlet_lines[hamlet_start]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 14, | |
"text": [ | |
"u'HAMLET, PRINCE OF DENMARK'" | |
] | |
} | |
], | |
"prompt_number": 14 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Figure out where the play ends and the footer begins\n", | |
"hamlet_lines[-10:]\n", | |
"hamlet_end = len(hamlet_lines) - 7 \n", | |
"hamlet_lines[hamlet_end:]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 20, | |
"text": [ | |
"[u'',\n", | |
" u'',\n", | |
" u'',\n", | |
" u'',\n", | |
" u'',\n", | |
" u'The End of Project Gutenberg Etext of Hamlet by Shakespeare',\n", | |
" u\"PG has multiple editions of William Shakespeare's Complete Works\"]" | |
] | |
} | |
], | |
"prompt_number": 20 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us now loop through and extract words." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Use the string split() method to break a line into words\n", | |
"hamlet_lines[hamlet_start].split()\n", | |
"\n", | |
"# How do we track the number of occurrences of each word? We use a dictionary.\n", | |
"counts = {}\n", | |
"for word in hamlet_lines[hamlet_start].split():\n", | |
"    counts[word] += 1\n", | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"ename": "KeyError", | |
"evalue": "u'HAMLET,'", | |
"output_type": "pyerr", | |
"traceback": [ | |
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", | |
"\u001b[0;32m<ipython-input-21-84964c5b301c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mword\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mhamlet_lines\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mhamlet_start\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mcounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m#take 2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
"\u001b[0;31mKeyError\u001b[0m: u'HAMLET,'" | |
] | |
} | |
], | |
"prompt_number": 21 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Take 2: give each new word a starting count of 0 before incrementing it\n", | |
"counts = {}\n", | |
"for line in hamlet_lines[hamlet_start:hamlet_end]:\n", | |
"    for word in line.split():\n", | |
"        if word not in counts:\n", | |
"            counts[word] = 0\n", | |
"        counts[word] += 1" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 46 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"###Analyze" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# What's the most frequent word?\n", | |
"max(counts.values())\n", | |
"\n", | |
"# OK, we know how often the most frequent word occurred, but what was it?\n", | |
"reverse = [(count, word) for word, count in counts.items()] \n", | |
"max(reverse)\n", | |
"\n", | |
"# Duh. What about the top 20 most common words?\n", | |
"sorted(reverse)[:20]\n", | |
"\n", | |
"# No, an ascending sort puts the rarest words (single occurrences) first.\n", | |
"# Sort in descending order instead.\n", | |
"sorted(reverse, reverse=True)[:20]\n", | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 47, | |
"text": [ | |
"[(989, u'the'),\n", | |
" (696, u'and'),\n", | |
" (625, u'of'),\n", | |
" (604, u'to'),\n", | |
" (510, u'I'),\n", | |
" (448, u'a'),\n", | |
" (444, u'my'),\n", | |
" (384, u'in'),\n", | |
" (363, u'you'),\n", | |
" (358, u'Ham.'),\n", | |
" (296, u'is'),\n", | |
" (278, u'his'),\n", | |
" (269, u'it'),\n", | |
" (255, u'not'),\n", | |
" (247, u'And'),\n", | |
" (225, u'that'),\n", | |
" (224, u'your'),\n", | |
" (222, u'with'),\n", | |
" (203, u'this'),\n", | |
" (186, u'be')]" | |
] | |
} | |
], | |
"prompt_number": 47 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exercise\n", | |
"1. Do the same analysis for Huckleberry Finn (http://www.gutenberg.org/cache/epub/76/pg76.txt) and compare the results.\n", | |
"2. Replace the dictionary data structure with the Counter data structure (http://docs.python.org/2/library/collections.html#collections.Counter).\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
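
A possible starting point for exercise 2, shown as a minimal sketch and not as the notebook author's solution: `collections.Counter` treats missing keys as zero, so the "take 2" membership check is no longer needed. The names `count_words` and `sample_lines` are hypothetical and do not appear in the notebook; the two sample lines stand in for `hamlet_lines[hamlet_start:hamlet_end]` so the block runs without a download.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of every word in the list
        counts.update(line.split())
    return counts


# Hypothetical stand-in for hamlet_lines[hamlet_start:hamlet_end]
sample_lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]

counts = count_words(sample_lines)

# most_common(n) returns the n highest (word, count) pairs, largest first
top_three = counts.most_common(3)
print(top_three)
```

As in the dictionary version, splitting on whitespace keeps punctuation and capitalization, so `be,` and `To` are counted as distinct from `be` and `to`. `most_common` replaces the manual build-and-sort of `(count, word)` tuples.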