@sente
Last active November 23, 2015 05:38
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Frequency\n",
"\n",
"This exercise in counting the occurrences of words in a file illustrates some important iteration and data structure concepts in Python.\n",
"Here we count how often each word occurs in Shakespeare's Hamlet (http://www.gutenberg.org/cache/epub/1524/pg1524.txt). Comparing the result with Mark Twain's Adventures of Huckleberry Finn (http://www.gutenberg.org/cache/epub/76/pg76.txt), which is left as an exercise, gives a glimpse of how the English language has changed over time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download Content\n",
"\n",
"We use the Requests library to download the plain-text edition from Project Gutenberg."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"hamlet_url = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt' \n",
"hamlet_page = requests.get(hamlet_url)\n",
"hamlet_text = hamlet_page.text"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore/Verify Content"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(hamlet_text)\n",
"hamlet_text[:120]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"u\"\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare\\r\\nPG has multiple editions of William Shakespeare's Complete Works\\r\\n\\r\\n\\r\""
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The content needs to be split into lines; right now it is all in one big string."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hamlet_lines = hamlet_text.splitlines() \n",
"hamlet_lines[:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"[u'\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare',\n",
" u\"PG has multiple editions of William Shakespeare's Complete Works\",\n",
" u'',\n",
" u'',\n",
" u'Copyright laws are changing all over the world, be sure to check']"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's remove the header and footer content, which is irrelevant to our word count."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hamlet_start = 290\n",
"hamlet_lines[hamlet_start]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"u'HAMLET, PRINCE OF DENMARK'"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# figure out the end\n",
"hamlet_lines[-10:]\n",
"hamlet_end = len(hamlet_lines) - 7 \n",
"hamlet_lines[hamlet_end:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 20,
"text": [
"[u'',\n",
" u'',\n",
" u'',\n",
" u'',\n",
" u'',\n",
" u'The End of Project Gutenberg Etext of Hamlet by Shakespeare',\n",
" u\"PG has multiple editions of William Shakespeare's Complete Works\"]"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now loop through and extract words."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# use the string's split() method\n",
"hamlet_lines[hamlet_start].split()\n",
"\n",
"# How do we track the number of occurrences of each word? We use a dictionary.\n",
"counts = {}\n",
"for word in hamlet_lines[hamlet_start].split():\n",
"    counts[word] += 1\n",
" \n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "u'HAMLET,'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-21-84964c5b301c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mword\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mhamlet_lines\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mhamlet_start\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mcounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m#take 2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: u'HAMLET,'"
]
}
],
"prompt_number": 21
},
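{
"cell_type": "code",
"collapsed": false,
"input": [
"# Take 2: the first attempt raised KeyError because a new word has no entry yet,\n",
"# so start each new word at 0 before incrementing.\n",
"counts = {}\n",
"for line in hamlet_lines[hamlet_start:hamlet_end]:\n",
"    for word in line.split():\n",
"        if word not in counts:\n",
"            counts[word] = 0\n",
"        counts[word] += 1"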
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# How often does the most frequent word occur?\n",
"max(counts.values())\n",
"\n",
"# OK, we know how often the most frequent word occurred, but which word was it?\n",
"# Build (count, word) tuples so that max() and sorted() compare by count first.\n",
"reverse = [(count, word) for word, count in counts.items()]\n",
"max(reverse)\n",
"\n",
"# What about the top 20 most common words?\n",
"sorted(reverse)[:20]\n",
"\n",
"# No, ascending order puts the single occurrences first; sort in descending order instead.\n",
"sorted(reverse, reverse=True)[:20]\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 47,
"text": [
"[(989, u'the'),\n",
" (696, u'and'),\n",
" (625, u'of'),\n",
" (604, u'to'),\n",
" (510, u'I'),\n",
" (448, u'a'),\n",
" (444, u'my'),\n",
" (384, u'in'),\n",
" (363, u'you'),\n",
" (358, u'Ham.'),\n",
" (296, u'is'),\n",
" (278, u'his'),\n",
" (269, u'it'),\n",
" (255, u'not'),\n",
" (247, u'And'),\n",
" (225, u'that'),\n",
" (224, u'your'),\n",
" (222, u'with'),\n",
" (203, u'this'),\n",
" (186, u'be')]"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"1. Repeat the analysis for Huckleberry Finn (http://www.gutenberg.org/cache/epub/76/pg76.txt) and compare the results with Hamlet.\n",
"2. Replace the plain dictionary with the Counter data structure: \n",
" http://docs.python.org/2/library/collections.html#collections.Counter\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
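
The second exercise item in the notebook above can be sketched roughly as follows. `collections.Counter` treats a missing key as a count of zero, so the explicit `if word not in counts` check from the notebook's second attempt is no longer needed. The helper name `count_words` and the two sample lines are illustrative assumptions, not part of the notebook; in the notebook the input would be `hamlet_lines[hamlet_start:hamlet_end]`.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Counter.update() adds one to the count of every item it is given.
        counts.update(line.split())
    return counts


# Hypothetical stand-in for hamlet_lines[hamlet_start:hamlet_end].
sample_lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]

counts = count_words(sample_lines)

# most_common(n) returns the n (word, count) pairs with the highest counts,
# replacing the manual (count, word) list and reverse sort.
top = counts.most_common(3)
print(top)
```

As in the notebook, words are split on whitespace only, so `To` and `to` (or `be,` and `be`) are still counted separately; lowercasing and stripping punctuation before counting would be a natural next step.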