"Documents as data" notebook. (Python 2.7) Click the "Raw" button to download! Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Documents as data\n",
"\n",
"In this tutorial, we're going to show you how some basic text analysis tasks work in Python. We don't have time to go over a ton of Python basics, so we're just going to point out how you can modify the code in small ways to make it do different things.\n",
"\n",
"This is a \"Jupyter Notebook,\" which consists of text and \"cells\" of code. After you've loaded the notebook, you can execute the code in a cell by highlighting it and hitting Ctrl+Enter. In general, you need to execute the cells from top to bottom, but you can usually run a cell more than once without messing anything up. Experiment!\n",
"\n",
"If things start acting strange, you can interrupt the Python process by selecting \"Kernel > Interrupt\"—this tells Python to stop doing whatever it was doing. Select \"Kernel > Restart\" to clear all of your variables and start from scratch.\n",
"\n",
"We'll start with a very simple task: getting all of the words from a text file.\n",
"\n",
"## Getting all of the words from a text file\n",
"\n",
"The first thing you'll want to do is get a [plain text](http://air.decontextualize.com/plain-text/) file! One place to look is [Project Gutenberg](http://www.gutenberg.org), which is a repository of books in English that are in the public domain.\n",
"\n",
"Once you've found a plain text file, save it to the same folder as the folder that contains this Jupyter Notebook file. Replace `pg84.txt` in the cell below with the filename of your plain text file and execute the cell."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"words = open(\"pg84.txt\").read().decode('utf8').split()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! If you got an error, make sure that the file name is correct (keep the quotes!) and run the cell again. You've created a variable `words` that contains a list of all the words in your text file. The `len()` function tells you how many words are in the list:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"77986"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below uses Python's `random` module to print out 25 words at random. (You can change the number if you want more or less.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"me:\n",
"that\n",
"me,\n",
"was\n",
"shutters\n",
"distributed\n",
"perceived\n",
"his\n",
"a\n",
"did\n",
"here\n",
"come\n",
"learned\n",
"will\n",
"augmented\n",
"was\n",
"I\n",
"multitude\n",
"You\n",
"to\n",
"a\n",
"into\n",
"mind\n",
"your\n",
"and\n"
]
}
],
"source": [
"import random\n",
"for word in random.sample(words, 25):\n",
" print word"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are some weirdnesses here, especially the punctuation that you see at the end of some of the strings!\n",
"\n",
"But the real question is: what is a word? Consider, in English:\n",
"\n",
"* \"Basketball\" and is one word, but \"player piano\" is two (?). Why \"basketball net\" and not \"basketballnet\"?\n",
"* \"Particleboard\" or \"particle board\"?\n",
"* \"Mr. Smith\"\n",
"* \"single-minded,\" \"rent-a-cop,\" \"abso-f###ing-lutely\"\n",
"* \"power drill\" is two words in English, whereas the equivalent in German is one: \"Schnellschrauber\"\n",
"* In Mowhawk: \"Sahonwanhotónkwahse\"; one word, roughly translated, \"she opened the door for him again.\"\n",
"* Likewise, one word in Turkish: \"Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesineyken\" meaning \"As though you are from those whom we may not be able to easily make into a maker of unsuccessful ones\"\n",
"\n",
"So in order to turn a text into words, you need to know something about how that language works. (And you have to be willing to accept a little squishiness in how accurate the results are.)"
]
},
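{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how squishy this gets in practice, compare two rough tokenizations of the same sentence: whitespace splitting (what we did above) versus pulling out runs of letters with a regular expression. Neither is \"correct\"; they just make different trade-offs. (This is just a sketch, and the sentence is an arbitrary example.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import re\n",
"sample = \"Mr. Smith's player piano is single-minded.\"\n",
"# whitespace splitting keeps punctuation attached to words\n",
"print sample.split()\n",
"# grabbing runs of letters (and apostrophes) splits hyphenated words instead\n",
"print re.findall(r\"[A-Za-z']+\", sample)"
]
},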
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Counting words\n",
"\n",
"One of the most common tasks in text analysis is counting how many times every word in a text occurs. The easiest way to do this in Python is with the `Counter` object, contained in the `collections` module. Run the following cell to create a `Counter` object to count your words."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"word_count = Counter(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using syntax like what you see in the cell below, you can check to see how often particular words occur in the text:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"9"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"terrible\"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"21"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"monster\"]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"breakfast\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One strange thing you'll notice is that upper-case and lower-case versions of the same word are counted separately:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"heaven\"]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"Heaven\"]"
]
},
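{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quick way to fold these together is to lower-case every word before counting. (This is just a sketch; note that it also conflates genuinely different words, like \"Polish\" and \"polish.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# count the lower-cased form of every word\n",
"lower_count = Counter(w.lower() for w in words)\n",
"lower_count[\"heaven\"]"
]
},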
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll figure out a way to mitigate this problem later on.\n",
"\n",
"The following cell prints out the twenty most common words in the text, along with the number of times they occur. (Again, you can change the number if you want more or less.)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the 4056\n",
"and 2972\n",
"of 2741\n",
"I 2720\n",
"to 2142\n",
"my 1629\n",
"a 1394\n",
"in 1126\n",
"was 993\n",
"that 987\n",
"with 696\n",
"had 679\n",
"which 547\n",
"but 542\n",
"me 529\n",
"his 500\n",
"not 498\n",
"as 486\n",
"by 464\n",
"for 449\n"
]
}
],
"source": [
"for word, number in word_count.most_common(20):\n",
" print word, number"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stopwords\n",
"\n",
"Intuitively, it seems strange to count these words like \"the\" and \"and\" among the \"most common,\" because words like these are presumably common across *all* texts, not just this text in particular. To solve this problem, we can use \"stopwords\": a list of commonly-occurring English words that shouldn't be counted for the purpose of word frequency. No one exactly agrees on what this list should be, but here's one attempt (from [here](https://gist.github.com/sebleier/554280)). Make sure to execute this cell before you continue! You can add or remove items from the list if you want; just make sure to put quotes around the word you want to add, and add a comma at the end of the line (outside the quotes)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"stopwords = [\n",
" \"i\",\n",
" \"me\",\n",
" \"my\",\n",
" \"myself\",\n",
" \"we\",\n",
" \"our\",\n",
" \"ours\",\n",
" \"ourselves\",\n",
" \"you\",\n",
" \"your\",\n",
" \"yours\",\n",
" \"yourself\",\n",
" \"yourselves\",\n",
" \"he\",\n",
" \"him\",\n",
" \"his\",\n",
" \"himself\",\n",
" \"she\",\n",
" \"her\",\n",
" \"hers\",\n",
" \"herself\",\n",
" \"it\",\n",
" \"its\",\n",
" \"itself\",\n",
" \"they\",\n",
" \"them\",\n",
" \"their\",\n",
" \"theirs\",\n",
" \"themselves\",\n",
" \"what\",\n",
" \"which\",\n",
" \"who\",\n",
" \"whom\",\n",
" \"this\",\n",
" \"that\",\n",
" \"these\",\n",
" \"those\",\n",
" \"am\",\n",
" \"is\",\n",
" \"are\",\n",
" \"was\",\n",
" \"were\",\n",
" \"be\",\n",
" \"been\",\n",
" \"being\",\n",
" \"have\",\n",
" \"has\",\n",
" \"had\",\n",
" \"having\",\n",
" \"do\",\n",
" \"does\",\n",
" \"did\",\n",
" \"doing\",\n",
" \"a\",\n",
" \"an\",\n",
" \"the\",\n",
" \"and\",\n",
" \"but\",\n",
" \"if\",\n",
" \"or\",\n",
" \"because\",\n",
" \"as\",\n",
" \"until\",\n",
" \"while\",\n",
" \"of\",\n",
" \"at\",\n",
" \"by\",\n",
" \"for\",\n",
" \"with\",\n",
" \"about\",\n",
" \"against\",\n",
" \"between\",\n",
" \"into\",\n",
" \"through\",\n",
" \"during\",\n",
" \"before\",\n",
" \"after\",\n",
" \"above\",\n",
" \"below\",\n",
" \"to\",\n",
" \"from\",\n",
" \"up\",\n",
" \"down\",\n",
" \"in\",\n",
" \"out\",\n",
" \"on\",\n",
" \"off\",\n",
" \"over\",\n",
" \"under\",\n",
" \"again\",\n",
" \"further\",\n",
" \"then\",\n",
" \"once\",\n",
" \"here\",\n",
" \"there\",\n",
" \"when\",\n",
" \"where\",\n",
" \"why\",\n",
" \"how\",\n",
" \"all\",\n",
" \"any\",\n",
" \"both\",\n",
" \"each\",\n",
" \"few\",\n",
" \"more\",\n",
" \"most\",\n",
" \"other\",\n",
" \"some\",\n",
" \"such\",\n",
" \"no\",\n",
" \"nor\",\n",
" \"not\",\n",
" \"only\",\n",
" \"own\",\n",
" \"same\",\n",
" \"so\",\n",
" \"than\",\n",
" \"too\",\n",
" \"very\",\n",
" \"s\",\n",
" \"t\",\n",
" \"can\",\n",
" \"will\",\n",
" \"just\",\n",
" \"don\",\n",
" \"should\",\n",
" \"now\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make use of this list, we'll create a new list that only includes those words that are *not* in the stopwords list. The Python code to do this is in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clean_words = [w for w in words if w.lower() not in stopwords]"
]
},
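{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick aside on speed: `not in stopwords` scans the whole list once for every word in the text. For a long text, converting the stopwords to a `set` first makes the lookup much faster. A sketch (it produces exactly the same `clean_words`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# set membership tests are much faster than list membership tests\n",
"stopword_set = set(stopwords)\n",
"clean_words = [w for w in words if w.lower() not in stopword_set]"
]
},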
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking the length of this list:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"38137"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(clean_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're left with far fewer words! But if we create a `Counter` object with this list of words, our list of the most common words is a bit more interesting:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_count = Counter(clean_words)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"could 188\n",
"would 177\n",
"one 175\n",
"me, 147\n",
"upon 125\n",
"yet 109\n",
"me. 107\n",
"may 107\n",
"might 107\n",
"every 103\n",
"first 102\n",
"shall 99\n",
"towards 93\n",
"saw 91\n",
"even 82\n",
"found 80\n",
"Project 77\n",
"man 76\n",
"time 75\n",
"father 73\n",
"felt 72\n",
"must 72\n",
"\"I 71\n",
"said 68\n",
"many 66\n"
]
}
],
"source": [
"for word, count in word_count.most_common(25):\n",
" print word, count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still not perfect, but it's a step forward."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural language processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our word counts would be more interesting if we could reason better about the *language* in the text, not just the individual characters. For example, if we knew the parts of speech of individual words, we could exclude words that are determiners, conjunctions, etc. from the count. If we knew what kinds of things the words were referring to, we could count how many times particular characters or settings are referenced.\n",
"\n",
"To do this, we need to do a bit of Natural Language Processing. [More notes and opinions on this.](https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3)\n",
"\n",
"Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to import it:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from __future__ import unicode_literals\n",
"import spacy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load in your text using the following line of code! (Remember to replace `pg84.txt` with the filename of your own text file.)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# replace \"1400-0.txt\" with the name of your own text file, then run this cell with CTRL+Enter.\n",
"text = open(\"pg84.txt\").read().decode('utf8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nlp = spacy.load('en')\n",
"doc = nlp(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Right off the bat, the spaCy library gives us access to a number of interesting units of text:\n",
"\n",
"* All of the sentences (`doc.sents`)\n",
"* All of the words (`doc`)\n",
"* All of the \"named entitites,\" like names of places, people, #brands, etc. (`doc.ents`)\n",
"* All of the \"noun chunks,\" i.e., nouns in the text plus surrounding matter like adjectives and articles\n",
"\n",
"The cell below, we extract these into variables so we can play around with them a little bit."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sentences = list(doc.sents)\n",
"words = [w for w in list(doc) if w.is_alpha]\n",
"noun_chunks = list(doc.noun_chunks)\n",
"entities = list(doc.ents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this information in hand, we can answer interesting questions like: how many sentences are in the text?"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3474"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `random.sample()`, we can get a small, randomly-selected sample from these lists. Here are five random sentences:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"I remember, the first time that I did this, the young woman, when she opened the door in the morning, appeared greatly astonished on seeing a great pile of wood on the outside.\n",
"---\n",
"I cannot see him; for God's sake, do not let him enter!\"\n",
"---\n",
"Study had before secluded me from the intercourse of my fellow-creatures, and rendered me unsocial; but Clerval called forth the better feelings of my heart; he again taught me to love the aspect of nature, and the cheerful faces of children.\n",
"---\n",
"The trial began, and after the advocate against her had stated the charge, several witnesses were called.\n",
"---\n",
"\"How kind and generous you are!\n",
"---\n"
]
}
],
"source": [
"for item in random.sample(sentences, 5):\n",
" print item.text.strip().replace(\"\\r\\n\", \" \")\n",
" print \"---\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random words:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in\n",
"see\n",
"enemy\n",
"expire\n",
"the\n",
"he\n",
"said\n",
"should\n",
"closer\n",
"copy\n"
]
}
],
"source": [
"for item in random.sample(words, 10):\n",
" print item.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random noun chunks:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the idea\n",
"whom\n",
"my height\n",
"other format\n",
"the\r\n",
"opposite banks\n",
"a guide\n",
"my sufferings\n",
"the mountain\n",
"His feelings\n",
"the latter\n"
]
}
],
"source": [
"for item in random.sample(noun_chunks, 10):\n",
" print item.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random entities:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Justine\n",
"Paris\n",
"Switzerland\n",
"Felix\n",
"Elizabeth\n",
"Lucerne\n",
"Elizabeth\n",
"Lynn Hanninen\n",
"Caroline Beaufort\n",
"Safie\n"
]
}
],
"source": [
"for item in random.sample(entities, 10):\n",
" print item.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parts of speech\n",
"\n",
"The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create three different lists—`nouns`, `verbs`, and `adjs`—that contain only words of the specified parts of speech. ([There's a full list of part of speech tags [here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english))."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nouns = [w for w in words if w.pos_ == \"NOUN\"]\n",
"verbs = [w for w in words if w.pos_ == \"VERB\"]\n",
"adjs = [w for w in words if w.pos_ == \"ADJ\"]"
]
},
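{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're curious which part-of-speech tags spaCy found in your text, and how often each occurs, you can count the `.pos_` attribute directly. A quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# tally the part-of-speech tag of every word\n",
"Counter(w.pos_ for w in words).most_common()"
]
},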
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we can print out a random sample of any of these:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"son\n",
"condescension\n",
"hold\n",
"glimmer\n",
"law\n",
"chemists\n",
"apartment\n",
"famine\n",
"confessor\n",
"heart\n"
]
}
],
"source": [
"for item in random.sample(nouns, 10): # change \"nouns\" to \"verbs\" or \"adjs\" to sample from those lists!\n",
" print item.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Entity types\n",
"\n",
"The parser in spaCy not only identifies \"entities\" but also assigns them to a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"people = [e for e in entities if e.label_ == \"PERSON\"]\n",
"locations = [e for e in entities if e.label_ == \"LOC\"]\n",
"times = [e for e in entities if e.label_ == \"TIME\"]"
]
},
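{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not every text mentions every entity type. To see which labels spaCy actually assigned in your text, you can count them the same way we counted words; a quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# tally the label of every entity\n",
"Counter(e.label_ for e in entities).most_common()"
]
},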
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then you can print out a random sample:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"night\n",
"a few minutes\n",
"nearly two hours\n",
"several hours\n",
"eight o'clock\n",
"morning\n",
"that night\n",
"several hours\n",
"a few minutes\n",
"a few moments\n"
]
}
],
"source": [
"for item in random.sample(times, 10):\n",
" print item.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding the most common\n",
"\n",
"So let's repeat the task of finding the most common words, this time using the words parsed from the text using spaCy. The code looks mostly the same:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import Counter\n",
"word_count = Counter([w.text for w in words])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"18"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count['heaven']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can even filter these with the stopwords list, as in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"filtered_words = [w.text for w in words if w.text.lower() not in stopwords]"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_count = Counter(filtered_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see about the list of the most common words:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"could 192\n",
"one 187\n",
"would 183\n",
"father 133\n",
"man 133\n",
"upon 126\n",
"yet 115\n",
"life 113\n",
"may 108\n",
"first 108\n",
"might 108\n",
"eyes 104\n",
"every 104\n",
"said 102\n",
"shall 99\n",
"time 97\n",
"saw 94\n",
"towards 93\n",
"Elizabeth 92\n",
"found 90\n"
]
}
],
"source": [
"for word, count in word_count.most_common(20):\n",
" print word, count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's actually a little bit better! Because spaCy knows enough about language to not include punctuation as part of the words, we're not getting as many \"noisy\" counts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Writing to a file\n",
"\n",
"The following cell defines a function for writing data from a `Counter` object to a file. The file is in \"tab-separated values\" format, which you can open using most spreadsheet programs. Execute it before you continue:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def cleanstr(s):\n",
" import re\n",
" s = re.sub(r\"[\\n\\r]\", \" \", s).strip()\n",
" return s\n",
"def save_counter_tsv(filename, counter, limit=1000):\n",
" with open(filename, \"w\") as outfile:\n",
" outfile.write(\"key\\tvalue\\n\")\n",
" for item, count in counter.most_common():\n",
" outfile.write(cleanstr(item) + \"\\t\" + str(count) + \"\\n\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"save_counter_tsv(\"100_common_words.tsv\", word_count, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try opening this file in Excel or Google Docs or Numbers!\n",
"\n",
"If you want to write the data from another `Counter` object to a file:\n",
"\n",
"* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)\n",
"* Replace `word_count` with the name of any of the `Counter` objects we've made in this sheet and use it in place of `word_count`\n",
"* Change the number to the number of rows you want to include in your spreadsheet."
]
},
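{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, to save the fifty most common nouns (reusing the `nouns` list we built earlier), you could write something like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# count just the nouns, then save the top 50 to a new file\n",
"noun_count = Counter([w.text for w in nouns])\n",
"save_counter_tsv(\"50_common_nouns.tsv\", noun_count, 50)"
]
},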
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### When do things happen in this text?\n",
"\n",
"Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular \"times\" (durations, times of day, etc.) are mentioned in the text."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"time_counter = Counter([e.text.lower() for e in times])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"save_counter_tsv(\"time_count.tsv\", time_counter, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Semantic similarity with word vectors\n",
"\n",
"Every word in a spaCy parse is associated with a 300-dimensional vector. (A vector is just a fancy word for a \"list of numbers.\") This vector is based on a machine learning algorithm (called [GloVe](https://nlp.stanford.edu/projects/glove/)) that assigns the value to a word based on the frequency of the contexts it's found in. The math is complex, but the way it works out is that two words that have similar vectors are usually also similar in *meaning*.\n",
"\n",
"The following cell defines a function `cosine()` that returns a measure of \"distance\" between two vectors."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from numpy import dot\n",
"from numpy.linalg import norm\n",
"\n",
"# cosine similarity\n",
"def cosine(v1, v2):\n",
" return dot(v1, v2) / (norm(v1) * norm(v2))"
]
},
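{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what `cosine()` tells us, try comparing a few word vectors directly. A related pair of words should score noticeably higher than an unrelated pair. (The words below are arbitrary examples; substitute your own.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# related pair: expect a relatively high similarity\n",
"print cosine(nlp.vocab[\"day\"].vector, nlp.vocab[\"night\"].vector)\n",
"# unrelated pair: expect a lower similarity\n",
"print cosine(nlp.vocab[\"day\"].vector, nlp.vocab[\"banana\"].vector)"
]
},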
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this function (and some fancy code), we can take any arbitrary word and then find its closest synonyms in the text. Change the word \"grumpy\" below to whatever you want and then run the cell:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:7: RuntimeWarning: invalid value encountered in float_scalars\n"
]
},
{
"data": {
"text/plain": [
"[obnoxious,\n",
" impatient,\n",
" industrious,\n",
" affectionate,\n",
" obedient,\n",
" uneducated,\n",
" emaciated,\n",
" untamed,\n",
" barbarous,\n",
" uncontrollable]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_to_check = \"grumpy\"\n",
"sorted(words, key=lambda x: cosine(nlp.vocab[word_to_check].vector, x.vector), reverse=True)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This works not just for individual words, but for *entire sentences*. To get the vector for a sentence, we simply average its component vectors. The following function takes a sentence as a string and then returns the ten sentences closest in meaning from the text:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def similar_sentences(input_str, num=10):\n",
" input_vector = np.mean([w.vector for w in nlp(input_str)], axis=0)\n",
" return sorted(sentences,\n",
" key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vector),\n",
" reverse=True)[:num]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try it out! Replace the string in `sentence_to_check` below with your own sentence, and run the cell. (It might take a while, depending on how big your source text file is.)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipykernel/__main__.py:7: RuntimeWarning: invalid value encountered in float_scalars\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"My food is not that of man; I do not destroy the lamb and the kid to glut my appetite; acorns and berries afford me sufficient nourishment.\n",
"\n",
"It was a most beautiful season; never did the fields bestow a more plentiful harvest or the vines yield a more luxuriant vintage, but my eyes were insensible to the charms of nature.\n",
"\n",
"She continued with her foster parents and bloomed in their rude abode, fairer than a garden rose among dark-leaved brambles.\n",
"\n",
"My dear Victor, do not waste your time upon this; it is sad trash.\"\n",
"\n",
"My mother's tender caresses and my father's smile of benevolent pleasure while regarding me are my first recollections.\n",
"\n",
"I wish you could see him; he is very tall of his age, with sweet laughing blue eyes, dark eyelashes, and curling hair.\n",
"\n",
"I will melt the stony hearts of your enemies by my tears and prayers.\n",
"\n",
"His jaws opened, and he muttered some inarticulate sounds, while a grin wrinkled his cheeks.\n",
"\n",
"The four others were dark-eyed, hardy little vagrants; this child was thin and very fair.\n",
"\n",
"\"One night during my accustomed visit to the neighbouring wood where I collected my own food and brought home firing for my protectors, I found on the ground a leathern portmanteau containing several articles of dress and some books.\n",
"\n"
]
}
],
"source": [
"sentence_to_check = \"My favorite food is strawberry ice cream\"\n",
"for item in similar_sentences(sentence_to_check):\n",
" print cleanstr(item.text)\n",
" print \"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is great for poetry but also for things like classifying documents, stylistics, etc."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}