sente/Data Analysis With Python.ipynb

## README.md

      
    View on nbviewer.ipython.org:

Word Frequency.ipynb
Data Analysis With Python.ipynb
Multiple Regression with Python.ipynb


## Data Analysis With Python.ipynb

      
## Multiple Regression with Python.ipynb

      
## Word Frequency.ipynb
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Word Frequency\n",
      "\n",
      "This exercise in counting the occurrences of words in a file serves to illustrate some important iteration and data structure concepts in Python.\n",
      "Here we compare the number of times words occur in Shakespeare's Hamlet (http://www.gutenberg.org/cache/epub/1524/pg1524.txt) with the number of times words occur in Mark Twain's Adventures of Huckleberry Finn (http://www.gutenberg.org/cache/epub/76/pg76.txt) to see how the English language has changed over time."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Download Content\n",
      "\n",
      "We use the Requests library to download the plain-text files over HTTP."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import requests\n",
      "hamlet_url = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt' \n",
      "hamlet_page = requests.get(hamlet_url)\n",
      "hamlet_text = hamlet_page.text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Explore/Verify Content"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(hamlet_text)\n",
      "hamlet_text[:120]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "u\"\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare\\r\\nPG has multiple editions of William Shakespeare's Complete Works\\r\\n\\r\\n\\r\""
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The content is all in one big string, so it needs to be split into lines."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "hamlet_lines = hamlet_text.splitlines() \n",
      "hamlet_lines[:5]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "[u'\\ufeffProject Gutenberg Etext of Hamlet by Shakespeare',\n",
        " u\"PG has multiple editions of William Shakespeare's Complete Works\",\n",
        " u'',\n",
        " u'',\n",
        " u'Copyright laws are changing all over the world, be sure to check']"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's remove the header and footer content, which is irrelevant to our word count."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "hamlet_start = 290\n",
      "hamlet_lines[hamlet_start]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "u'HAMLET, PRINCE OF DENMARK'"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# figure out where the play ends\n",
      "hamlet_lines[-10:]\n",
      "hamlet_end = len(hamlet_lines) - 7 \n",
      "hamlet_lines[hamlet_end:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "[u'',\n",
        " u'',\n",
        " u'',\n",
        " u'',\n",
        " u'',\n",
        " u'The End of Project Gutenberg Etext of Hamlet by Shakespeare',\n",
        " u\"PG has multiple editions of William Shakespeare's Complete Works\"]"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us now loop through and extract words."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# use the string's split method\n",
      "hamlet_lines[hamlet_start].split()\n",
      "\n",
      "# how do we track the number of occurrences of each word? we use a dictionary\n",
      "counts = {}\n",
      "for word in hamlet_lines[hamlet_start].split():\n",
      "    counts[word] += 1\n",
      "    \n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "KeyError",
       "evalue": "u'HAMLET,'",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-21-84964c5b301c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mword\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mhamlet_lines\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mhamlet_start\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m     \u001b[0mcounts\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0;31m#take 2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;31mKeyError\u001b[0m: u'HAMLET,'"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# take 2: start each new word at 0 so the increment no longer raises KeyError\n",
      "counts = {}\n",
      "for line in hamlet_lines[hamlet_start:hamlet_end]:\n",
      "    for word in line.split(): \n",
      "        if word not in counts:\n",
      "            counts[word] = 0 \n",
      "        counts[word] += 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Analyze"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What's the most frequent word?\n",
      "max(counts.values())\n",
      "\n",
      "# OK, we know how often the most frequent word occurred, but what was it?\n",
      "reverse = [(count, word) for word, count in counts.items()]\n",
      "max(reverse)\n",
      "\n",
      "# Duh. What about the top 20 most common words?\n",
      "sorted(reverse)[:20]\n",
      "\n",
      "# No, those are single occurrences. Sort in descending order instead.\n",
      "sorted(reverse, reverse=True)[:20]\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 47,
       "text": [
        "[(989, u'the'),\n",
        " (696, u'and'),\n",
        " (625, u'of'),\n",
        " (604, u'to'),\n",
        " (510, u'I'),\n",
        " (448, u'a'),\n",
        " (444, u'my'),\n",
        " (384, u'in'),\n",
        " (363, u'you'),\n",
        " (358, u'Ham.'),\n",
        " (296, u'is'),\n",
        " (278, u'his'),\n",
        " (269, u'it'),\n",
        " (255, u'not'),\n",
        " (247, u'And'),\n",
        " (225, u'that'),\n",
        " (224, u'your'),\n",
        " (222, u'with'),\n",
        " (203, u'this'),\n",
        " (186, u'be')]"
       ]
      }
     ],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Exercise\n",
      "1. Do the same analysis for Huckleberry Finn and compare http://www.gutenberg.org/cache/epub/76/pg76.txt\n",
      "2. Replace the dictionary data structure with the Counter data structure \n",
      "   http://docs.python.org/2/library/collections.html#collections.Counter\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}
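
Exercise 2 in the notebook above suggests replacing the dictionary with `collections.Counter`. The following is a minimal sketch of one way that might look, not the notebook author's solution. The helper name `count_words`, the sample lines, and the normalization step (lowercasing and stripping surrounding punctuation) are all assumptions added here; the notebook itself counts raw whitespace-separated tokens, which is why `and` and `And` appear as separate entries in its output.

```python
from collections import Counter
import string


def count_words(lines):
    """Count words in an iterable of text lines.

    Hypothetical helper. The lowercasing and punctuation stripping are
    assumptions; the notebook counts tokens exactly as split() returns them.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(string.punctuation).lower()
            if word:
                # A Counter treats missing keys as 0, so no KeyError here.
                counts[word] += 1
    return counts


# Small stand-in for hamlet_lines[hamlet_start:hamlet_end].
sample_lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]

counts = count_words(sample_lines)

# most_common(n) returns (word, count) pairs, highest count first,
# which replaces the manual (count, word) reversal and sort.
print(counts.most_common(3))
```

With the real text, passing `hamlet_lines[hamlet_start:hamlet_end]` to the helper and calling `most_common(20)` should give a list comparable to the top-20 output in the notebook, although the exact counts will differ because of the normalization.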