aparrish/itp-camp-nlproc.ipynb Secret

## itp-camp-nlproc.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A Reasonable Introduction to Natural Language Processing & the Vectorized Word\n",
    "\n",
    "By [Allison Parrish](http://www.decontextualize.com/)\n",
    "\n",
    "In this tutorial, we're going to show you how some basic text analysis tasks work in Python. We don't have time to go over a ton of Python basics, so we're just going to point out how you can modify the code in small ways to make it do different things.\n",
    "\n",
    "This is a \"Jupyter Notebook,\" which consists of text and \"cells\" of code. After you've loaded the notebook, you can execute the code in a cell by highlighting it and hitting Ctrl+Enter. In general, you need to execute the cells from top to bottom, but you can usually run a cell more than once without messing anything up. Experiment!\n",
    "\n",
    "If things start acting strange, you can interrupt the Python process by selecting \"Kernel > Interrupt\"—this tells Python to stop doing whatever it was doing. Select \"Kernel > Restart\" to clear all of your variables and start from scratch.\n",
    "\n",
    "We'll start with a very simple task: getting all of the words from a text file.\n",
    "\n",
    "## Getting all of the words from a text file\n",
    "\n",
    "The first thing you'll want to do is get a [plain text](http://air.decontextualize.com/plain-text/) file! One place to look is [Project Gutenberg](http://www.gutenberg.org), which is a repository of books in English that are in the public domain.\n",
    "\n",
    "Once you've found a plain text file, save it to the same folder as the folder that contains this Jupyter Notebook file. Replace `pg84.txt` in the cell below with the filename of your plain text file and execute the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = open(\"pg84.txt\").read().split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! If you got an error, make sure that the file name is correct (keep the quotes!) and run the cell again. You've created a variable `words` that contains a list of all the words in your text file. The `len()` function tells you how many words are in the list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "77986"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below uses Python's `random` module to print out 25 words at random. (You can change the number if you want more or less.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "country\n",
      "to\n",
      "merely\n",
      "thousand\n",
      "when\n",
      "enter\n",
      "Jura\n",
      "If\n",
      "my\n",
      "forward,\n",
      "methods\n",
      "supposed\n",
      "whom\n",
      "departure\n",
      "fate\n",
      "weather\n",
      "of\n",
      "free\n",
      "happiness\n",
      "could\n",
      "bitterest\n",
      "their\n",
      "preparing\n",
      "now\n",
      "to\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "for word in random.sample(words, 25):\n",
    "    print(word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some weirdnesses here, especially the punctuation that you see at the end of some of the strings!\n",
    "\n",
    "But the real question is: what is a word? Consider, in English:\n",
    "\n",
    "* \"Basketball\" and is one word, but \"player piano\" is two (?). Why \"basketball net\" and not \"basketballnet\"?\n",
    "* \"Particleboard\" or \"particle board\"?\n",
    "* \"Mr. Smith\"\n",
    "* \"single-minded,\" \"rent-a-cop,\" \"abso-f###ing-lutely\"\n",
    "* \"power drill\" is two words in English, whereas the equivalent in German is one: \"Schnellschrauber\"\n",
    "* In Mowhawk: \"Sahonwanhotónkwahse\"; one word, roughly translated, \"she opened the door for him again.\"\n",
    "* Likewise, one word in Turkish: \"Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesineyken\" meaning \"As though you are from those whom we may not be able to easily make into a maker of unsuccessful ones\"\n",
    "\n",
    "So in order to turn a text into words, you need to know something about how that language works. (And you have to be willing to accept a little squishiness in how accurate the results are.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Counting words\n",
    "\n",
    "One of the most common tasks in text analysis is counting how many times every word in a text occurs. The easiest way to do this in Python is with the `Counter` object, contained in the `collections` module. Run the following cell to create a `Counter` object to count your words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "word_count = Counter(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using syntax like what you see in the cell below, you can check to see how often particular words occur in the text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "448"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"he\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "172"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"she\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"fourth-meal\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One strange thing you'll notice is that upper-case and lower-case versions of the same word are counted separately:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"heaven\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"Heaven\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll figure out a way to mitigate this problem later on.\n",
    "\n",
    "The following cell prints out the twenty most common words in the text, along with the number of times they occur. (Again, you can change the number if you want more or less.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the 4056\n",
      "and 2972\n",
      "of 2741\n",
      "I 2720\n",
      "to 2142\n",
      "my 1629\n",
      "a 1394\n",
      "in 1126\n",
      "was 993\n",
      "that 987\n",
      "with 696\n",
      "had 679\n",
      "which 547\n",
      "but 542\n",
      "me 529\n",
      "his 500\n",
      "not 498\n",
      "as 486\n",
      "by 464\n",
      "for 449\n"
     ]
    }
   ],
   "source": [
    "for word, number in word_count.most_common(20):\n",
    "    print(word, number)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stopwords\n",
    "\n",
    "Intuitively, it seems strange to count these words like \"the\" and \"and\" among the \"most common,\" because words like these are presumably common across *all* texts, not just this text in particular. To solve this problem, we can use \"stopwords\": a list of commonly-occurring English words that shouldn't be counted for the purpose of word frequency. No one exactly agrees on what this list should be, but here's one attempt (from [here](https://gist.github.com/sebleier/554280)). Make sure to execute this cell before you continue! You can add or remove items from the list if you want; just make sure to put quotes around the word you want to add, and add a comma at the end of the line (outside the quotes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "stopwords = [\n",
    "    \"i\",\n",
    "    \"me\",\n",
    "    \"my\",\n",
    "    \"myself\",\n",
    "    \"we\",\n",
    "    \"our\",\n",
    "    \"ours\",\n",
    "    \"ourselves\",\n",
    "    \"you\",\n",
    "    \"your\",\n",
    "    \"yours\",\n",
    "    \"yourself\",\n",
    "    \"yourselves\",\n",
    "    \"he\",\n",
    "    \"him\",\n",
    "    \"his\",\n",
    "    \"himself\",\n",
    "    \"she\",\n",
    "    \"her\",\n",
    "    \"hers\",\n",
    "    \"herself\",\n",
    "    \"it\",\n",
    "    \"its\",\n",
    "    \"itself\",\n",
    "    \"they\",\n",
    "    \"them\",\n",
    "    \"their\",\n",
    "    \"theirs\",\n",
    "    \"themselves\",\n",
    "    \"what\",\n",
    "    \"which\",\n",
    "    \"who\",\n",
    "    \"whom\",\n",
    "    \"this\",\n",
    "    \"that\",\n",
    "    \"these\",\n",
    "    \"those\",\n",
    "    \"am\",\n",
    "    \"is\",\n",
    "    \"are\",\n",
    "    \"was\",\n",
    "    \"were\",\n",
    "    \"be\",\n",
    "    \"been\",\n",
    "    \"being\",\n",
    "    \"have\",\n",
    "    \"has\",\n",
    "    \"had\",\n",
    "    \"having\",\n",
    "    \"do\",\n",
    "    \"does\",\n",
    "    \"did\",\n",
    "    \"doing\",\n",
    "    \"a\",\n",
    "    \"an\",\n",
    "    \"the\",\n",
    "    \"and\",\n",
    "    \"but\",\n",
    "    \"if\",\n",
    "    \"or\",\n",
    "    \"because\",\n",
    "    \"as\",\n",
    "    \"until\",\n",
    "    \"while\",\n",
    "    \"of\",\n",
    "    \"at\",\n",
    "    \"by\",\n",
    "    \"for\",\n",
    "    \"with\",\n",
    "    \"about\",\n",
    "    \"against\",\n",
    "    \"between\",\n",
    "    \"into\",\n",
    "    \"through\",\n",
    "    \"during\",\n",
    "    \"before\",\n",
    "    \"after\",\n",
    "    \"above\",\n",
    "    \"below\",\n",
    "    \"to\",\n",
    "    \"from\",\n",
    "    \"up\",\n",
    "    \"down\",\n",
    "    \"in\",\n",
    "    \"out\",\n",
    "    \"on\",\n",
    "    \"off\",\n",
    "    \"over\",\n",
    "    \"under\",\n",
    "    \"again\",\n",
    "    \"further\",\n",
    "    \"then\",\n",
    "    \"once\",\n",
    "    \"here\",\n",
    "    \"there\",\n",
    "    \"when\",\n",
    "    \"where\",\n",
    "    \"why\",\n",
    "    \"how\",\n",
    "    \"all\",\n",
    "    \"any\",\n",
    "    \"both\",\n",
    "    \"each\",\n",
    "    \"few\",\n",
    "    \"more\",\n",
    "    \"most\",\n",
    "    \"other\",\n",
    "    \"some\",\n",
    "    \"such\",\n",
    "    \"no\",\n",
    "    \"nor\",\n",
    "    \"not\",\n",
    "    \"only\",\n",
    "    \"own\",\n",
    "    \"same\",\n",
    "    \"so\",\n",
    "    \"than\",\n",
    "    \"too\",\n",
    "    \"very\",\n",
    "    \"s\",\n",
    "    \"t\",\n",
    "    \"can\",\n",
    "    \"will\",\n",
    "    \"just\",\n",
    "    \"don\",\n",
    "    \"should\",\n",
    "    \"now\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make use of this list, we'll create a new list that only includes those words that are *not* in the stopwords list. The Python code to do this is in the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_words = [w for w in words if w.lower() not in stopwords]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the length of this list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38137"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(clean_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're left with far fewer words! But if we create a `Counter` object with this list of words, our list of the most common words is a bit more interesting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "word_count = Counter(clean_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"he\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count[\"she\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "could 188\n",
      "would 177\n",
      "one 175\n",
      "me, 147\n",
      "upon 125\n",
      "yet 109\n",
      "may 107\n",
      "might 107\n",
      "me. 107\n",
      "every 103\n",
      "first 102\n",
      "shall 99\n",
      "towards 93\n",
      "saw 91\n",
      "even 82\n",
      "found 80\n",
      "Project 77\n",
      "man 76\n",
      "time 75\n",
      "father 73\n",
      "must 72\n",
      "felt 72\n",
      "\"I 71\n",
      "said 68\n",
      "many 66\n"
     ]
    }
   ],
   "source": [
    "for word, count in word_count.most_common(25):\n",
    "    print(word, count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still not perfect, but it's a step forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Natural language processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our word counts would be more interesting if we could reason better about the *language* in the text, not just the individual characters. For example, if we knew the parts of speech of individual words, we could exclude words that are determiners, conjunctions, etc. from the count. If we knew what kinds of things the words were referring to, we could count how many times particular characters or settings are referenced.\n",
    "\n",
    "To do this, we need to do a bit of Natural Language Processing. [More notes and opinions on this.](https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3)\n",
    "\n",
    "Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to import it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_md')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load in your text using the following line of code! (Remember to replace `pg84.txt` with the filename of your own text file.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace \"pg84.txt\" with the name of your own text file, then run this cell with CTRL+Enter.\n",
    "text = open(\"pg84.txt\").read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "doc = nlp(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Right off the bat, the spaCy library gives us access to a number of interesting units of text:\n",
    "\n",
    "* All of the sentences (`doc.sents`)\n",
    "* All of the words (`doc`)\n",
    "* All of the \"named entitites,\" like names of places, people, #brands, etc. (`doc.ents`)\n",
    "* All of the \"noun chunks,\" i.e., nouns in the text plus surrounding matter like adjectives and articles\n",
    "\n",
    "The cell below, we extract these into variables so we can play around with them a little bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentences = list(doc.sents)\n",
    "words = [w for w in list(doc) if w.is_alpha]\n",
    "noun_chunks = list(doc.noun_chunks)\n",
    "entities = list(doc.ents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this information in hand, we can answer interesting questions like: how many sentences are in the text?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3668"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `random.sample()`, we can get a small, randomly-selected sample from these lists. Here are five random sentences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I have resolved every night, when I am not imperatively occupied by my duties, to record, as nearly as possible in his own words, what he has related during the day.\n",
      "---\n",
      "\"This reading had puzzled me extremely at first, but by degrees I discovered that he uttered many of the same sounds when he read as when he talked.\n",
      "---\n",
      "It is a scene terrifically desolate.\n",
      "---\n",
      "You raise me from the dust by this kindness; and I trust that, by your aid, I shall not be driven from the society and sympathy of your fellow creatures.'  \"'Heaven forbid!\n",
      "---\n",
      "When his children had departed, he took up his guitar and played several mournful but sweet airs, more sweet and mournful than I had ever heard him play before.\n",
      "---\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(sentences, 5):\n",
    "    print(item.text.strip().replace(\"\\n\", \" \"))\n",
    "    print(\"---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ten random words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this\n",
      "months\n",
      "should\n",
      "family\n",
      "the\n",
      "the\n",
      "fraught\n",
      "it\n",
      "was\n",
      "small\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(words, 10):\n",
    "    print(item.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ten random noun chunks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the mountains\n",
      "no sentiment\n",
      "my retreat\n",
      "my labours\n",
      "whose joint wickedness\n",
      "the person\n",
      "him\n",
      "any assistance\n",
      "peril\n",
      "it\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(noun_chunks, 10):\n",
    "    print(item.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ten random entities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "one\n",
      "several hours\n",
      "America\n",
      "one\n",
      "May 18th\n",
      "Louisa Biron\n",
      "the morning\n",
      "first\n",
      "Christianity\n",
      "Greek\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(entities, 10):\n",
    "    print(item.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parts of speech\n",
    "\n",
    "The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists—`nouns`, `verbs`, `adjs` and `advs`—that contain only words of the specified parts of speech. ([There's a full list of part of speech tags here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nouns = [w for w in words if w.pos_ == \"NOUN\"]\n",
    "verbs = [w for w in words if w.pos_ == \"VERB\"]\n",
    "adjs = [w for w in words if w.pos_ == \"ADJ\"]\n",
    "advs = [w for w in words if w.pos_ == \"ADV\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can print out a random sample of any of these:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sufferings\n",
      "promise\n",
      "sea\n",
      "meadows\n",
      "scenes\n",
      "speech\n",
      "care\n",
      "shipping\n",
      "friends\n",
      "cheerfulness\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(nouns, 10): # change \"nouns\" to \"verbs\" or \"adjs\" or \"advs\" to sample from those lists!\n",
    "    print(item.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Entity types\n",
    "\n",
    "The parser in spaCy not only identifies \"entities\" but also assigns them to a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "people = [e for e in entities if e.label_ == \"PERSON\"]\n",
    "locations = [e for e in entities if e.label_ == \"LOC\"]\n",
    "times = [e for e in entities if e.label_ == \"TIME\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then you can print out a random sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "night\n",
      "accustomed hour\n",
      "night\n",
      "One night\n",
      "that night\n",
      "several hours\n",
      "night\n",
      "morning\n",
      "this night\n",
      "several hours\n"
     ]
    }
   ],
   "source": [
    "for item in random.sample(times, 10): # change \"times\" to \"people\" or \"locations\" to sample those lists\n",
    "    print(item.text.strip())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding the most common\n",
    "\n",
    "So let's repeat the task of finding the most common words, this time using the words parsed from the text using spaCy. The code looks mostly the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "word_count = Counter([w.text for w in words])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "18"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_count['heaven']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can even filter these with the stopwords list, as in the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "filtered_words = [w.text for w in words if w.text.lower() not in stopwords]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "word_count = Counter(filtered_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see about the list of the most common words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "could 192\n",
      "one 188\n",
      "would 183\n",
      "man 133\n",
      "father 133\n",
      "upon 126\n",
      "yet 115\n",
      "life 113\n",
      "may 108\n",
      "first 108\n",
      "might 108\n",
      "every 104\n",
      "eyes 104\n",
      "said 102\n",
      "shall 99\n",
      "time 97\n",
      "saw 94\n",
      "towards 93\n",
      "Elizabeth 92\n",
      "found 90\n"
     ]
    }
   ],
   "source": [
    "for word, count in word_count.most_common(20):\n",
    "    print(word, count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's actually a little bit better! Because spaCy knows enough about language to not include punctuation as part of the words, we're not getting as many \"noisy\" counts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing to a file\n",
    "\n",
    "The following cell defines a function for writing data from a `Counter` object to a file. The file is in \"tab-separated values\" format, which you can open using most spreadsheet programs. Execute it before you continue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def save_counter_tsv(filename, counter, limit=1000):\n",
    "    with open(filename, \"w\") as outfile:\n",
    "        outfile.write(\"key\\tvalue\\n\")\n",
    "        for item, count in counter.most_common():\n",
    "            outfile.write(item.strip() + \"\\t\" + str(count) + \"\\n\")    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_counter_tsv(\"100_common_words.tsv\", word_count, 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try opening this file in Excel or Google Docs or Numbers!\n",
    "\n",
    "If you want to write the data from another `Counter` object to a file:\n",
    "\n",
    "* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)\n",
    "* Replace `word_count` with the name of any of the `Counter` objects we've made in this sheet and use it in place of `word_count`\n",
    "* Change the number to the number of rows you want to include in your spreadsheet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### When do things happen in this text?\n",
    "\n",
    "Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular \"times\" (durations, times of day, etc.) are mentioned in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "time_counter = Counter([e.text.lower().strip() for e in times])\n",
    "save_counter_tsv(\"time_count.tsv\", time_counter, 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do the same thing, but with people:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "people_counter = Counter([e.text.lower() for e in people])\n",
    "save_counter_tsv(\"people_count.tsv\", people_counter, 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semantic similarity with word vectors\n",
    "\n",
    "Every word in a spaCy parse is associated with a 300-dimensional vector. (A vector is just a fancy word for a \"list of numbers.\") This vector is based on a machine learning algorithm (called [GloVe](https://nlp.stanford.edu/projects/glove/)) that assigns the value to a word based on the frequency of the contexts it's found in. The math is complex, but the way it works out is that two words that have similar vectors are usually also similar in *meaning*. [More notes on the concept of word vectors here](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469).\n",
    "\n",
    "The following cell defines a function `cosine()` that returns a measure of \"distance\" between two vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from numpy import dot\n",
    "from numpy.linalg import norm\n",
    "\n",
    "# cosine similarity\n",
    "def cosine(v1, v2):\n",
    "    if norm(v1) > 0 and norm(v2) > 0:\n",
    "        return dot(v1, v2) / (norm(v1) * norm(v2))\n",
    "    else:\n",
    "        return 0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to do a little bit of speculative text analysis. We'll start with a list of all unique words in our text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "unique_words = list(set([w.text for w in words]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function defined in the cell below checks each word in a source text, compares it to the vector of the specified word (which can be any English word), and returns the words with the highest cosine similarity from the source text. You can think of it as sort of a conceptual \"translator,\" translating a word from any domain and register of English to its closest equivalent in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def similar_words(word_to_check, source_set):\n",
    "    return sorted(source_set,\n",
    "                  key=lambda x: cosine(nlp.vocab[word_to_check].vector, nlp.vocab[x].vector),\n",
    "                  reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try it out by running the cell below. Replace `grumpy` with a word of your choice, and `10` with the number of results you want:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['annoyed',\n",
       " 'angry',\n",
       " 'impatient',\n",
       " 'sullen',\n",
       " 'gruff',\n",
       " 'Unhappy',\n",
       " 'unhappy',\n",
       " 'exasperated',\n",
       " 'rude',\n",
       " 'obnoxious']"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# change \"kitten\" to a word of your choice and 10 to the number of results you want\n",
    "similar_words(\"grumpy\", unique_words)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's the closest thing to \"baseball\" in a text that doesn't mention baseball?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['league',\n",
       " 'leagues',\n",
       " 'sport',\n",
       " 'game',\n",
       " 'coach',\n",
       " 'bat',\n",
       " 'college',\n",
       " 'Homer',\n",
       " 'season',\n",
       " 'ball']"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "similar_words(\"baseball\", unique_words)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works not just for individual words, but for *entire sentences*. To get the vector for a sentence, we simply average its component vectors. The `sentence_vector` function in the cell below takes a spaCy-parsed sentence and returns the averaged vector of the words in the sentence. The `similar_sentences` function takes an arbitrary string as a parameter and returns the sentences in our text closest in meaning (using the list of sentences assigned to the `sentences` variable further up the notebook):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sentence_vector(sent):\n",
    "    vec = np.array([w.vector for w in sent if w.has_vector and np.any(w.vector)])\n",
    "    if len(vec) > 0:\n",
    "        return np.mean(vec, axis=0)\n",
    "    else:\n",
    "        raise ValueError(\"no words with vectors found\")   \n",
    "def similar_sentences(input_str, num=10):\n",
    "    input_vector = sentence_vector(nlp(input_str))\n",
    "    return sorted(sentences,\n",
    "                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vector),\n",
    "                  reverse=True)[:num]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try it out! Replace the string in `sentence_to_check` below with your own sentence, and run the cell. (It might take a while, depending on how big your source text file is.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "My food is not\n",
      "that of man; I do not destroy the lamb and the kid to glut my appetite;\n",
      "acorns and berries afford me sufficient nourishment.\n",
      "\n",
      "You wish to eat me and tear me to pieces.\n",
      "\n",
      "I greedily devoured the\n",
      "remnants of the shepherd's breakfast, which consisted of bread, cheese,\n",
      "milk, and wine; the latter, however, I did not like.\n",
      "\n",
      "\"I lay on my straw, but I could not sleep.\n",
      "\n",
      "I wish you could see him; he is very tall of his age, with\n",
      "sweet laughing blue eyes, dark eyelashes, and curling hair.\n",
      "\n",
      "I am surrounded by mountains of ice which admit of no escape and\n",
      "threaten every moment to crush my vessel.\n",
      "\n",
      "Yet I would die to make\n",
      "her happy.\n",
      "\n",
      "I\n",
      "had first, however, provided for my sustenance for that day by a loaf\n",
      "of coarse bread, which I purloined, and a cup with which I could drink\n",
      "more conveniently than from my hand of the pure water which flowed by\n",
      "my retreat.\n",
      "\n",
      "I could now almost fancy myself among the\n",
      "Swiss mountains.\n",
      "\n",
      "For some\n",
      "time I sat upon the rock that overlooks the sea of ice.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sentence_to_check = \"I love to eat strawberry ice cream.\"\n",
    "for item in similar_sentences(sentence_to_check):\n",
    "    print(item.text.strip())\n",
    "    print(\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is great for poetry but also for things like classifying documents, stylistics, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifying sentences\n",
    "\n",
    "Speaking of classification, averaged word vectors make it easy to classify sentences into categories. For classification, we'll use an algorithm called a [Support Vector Machine](http://scikit-learn.org/stable/modules/svm.html). Here's how it works: you give the Support Vector Machine classifier a list of \"examples\" in the form of vectors, along with a label for each example. Then, you can pass a \"test\" vector to the classifier and it will return the label it thinks belongs with that vector—even if that exact vector never occurred in the training data. Nice!\n",
    "\n",
    "In the code below, we're going to make a classifier that predicts whether a sentence belongs to *Frankenstein* or *Dracula*. You can follow along with this example by downloading the plain text versions of [*Dracula*](http://www.gutenberg.org/cache/epub/345/pg345.txt) and [*Frankenstein*](http://www.gutenberg.org/cache/epub/84/pg84.txt) from Project Gutenberg. Or pick your own two texts!\n",
    "\n",
    "In the cell below, we read in both documents and parse them using spaCy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "doc1 = nlp(open(\"pg84.txt\").read()) # Frankenstein\n",
    "doc2 = nlp(open(\"pg345.txt\").read()) # Dracula"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function defined in the following cell creates sentence vectors for every sentence in a spaCy document and returns a list of those vectors. (It also normalizes the data, which makes our classifier work better.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import preprocessing\n",
    "def sentence_vectors(doc):\n",
    "    all_vectors = []\n",
    "    for sent in doc.sents:\n",
    "        try:\n",
    "            sv = preprocessing.scale(sentence_vector(sent))\n",
    "        except ValueError:\n",
    "            continue\n",
    "        all_vectors.append(sv)\n",
    "    return all_vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now, we'll create two lists: one with sentence vectors from the first document (*Frankenstein*) and one with sentence vectors from the second document (*Dracula*)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/allison/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:177: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0. \n",
      "  warnings.warn(\"Numerical issues were encountered \"\n"
     ]
    }
   ],
   "source": [
    "doc1_sent_vecs = sentence_vectors(doc1)\n",
    "doc2_sent_vecs = sentence_vectors(doc2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to create the data. The code in the cell below makes one big list of all of the sentence vectors, and then another big list of labels that go with each data. We're using `1` as the label for *Frankenstein* and `2` as the label for *Dracula*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array(doc1_sent_vecs + doc2_sent_vecs)\n",
    "y = np.array([1]*len(doc1_sent_vecs) + [2]*len(doc2_sent_vecs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell actually creates the classifier and fits the classifier model to the training data. This will take a while, maybe a long while, based on the size of the texts you're using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=10.0, cache_size=500, class_weight='balanced', coef0=0.0,\n",
       "  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False)"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import svm\n",
    "\n",
    "classifier = svm.SVC(class_weight='balanced', C=10.0, cache_size=500)\n",
    "classifier.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick and dirty test, the code in the cell below uses the classifier to predict the category of the existing training data. (Technically, we'd want to test on data that wasn't included in the training, but this is good enough for demonstration purposes.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc1_predictions = Counter([classifier.predict([vec])[0] for vec in doc1_sent_vecs])\n",
    "doc2_predictions = Counter([classifier.predict([vec])[0] for vec in doc2_sent_vecs])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({1: 3308, 2: 355})"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc1_predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we'll print out the percentage of predictions that the classifier got right:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text #1 accuracy: 0.9030849030849031\n",
      "Text #2 accuracy: 0.9212613473483039\n"
     ]
    }
   ],
   "source": [
    "print(\"Text #1 accuracy:\", (doc1_predictions[1] / sum(doc1_predictions.values())))\n",
    "print(\"Text #2 accuracy:\", (doc2_predictions[2] / sum(doc2_predictions.values())))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy in the range of 90% is actually not bad, considering the small amount of data we're working with!\n",
    "\n",
    "The function below takes an arbitrary sentence (whether or not it's found in either original source text) and predicts its category:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def predict_sentence(classifier, text):\n",
    "    sv = sentence_vector(nlp(text))\n",
    "    return classifier.predict([preprocessing.scale(sv)])[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the sake of curiousity, we'll see if the classifier correctly predicts lines of dialogue from the Mel Brooks parody movies based on the works in question. First, a quote from [*Young Frankenstein*](http://www.imdb.com/title/tt0072431/quotes):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict_sentence(classifier,\n",
    "    \"You and I are going to make the greatest single contribution to science since the creation of fire.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now predict a sentence from *[Dracula: Dead And Loving It](http://www.imdb.com/title/tt0112896/trivia?tab=qt&ref_=tt_trv_qu)*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict_sentence(classifier, \"I tell you I saw you snatch a spider right of the air and eat it!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not bad!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}