Workshop notebook. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A Reasonable Introduction to Natural Language Processing & the Vectorized Word\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"In this tutorial, we're going to show you how some basic text analysis tasks work in Python. We don't have time to go over a ton of Python basics, so we're just going to point out how you can modify the code in small ways to make it do different things.\n",
"\n",
"This is a \"Jupyter Notebook,\" which consists of text and \"cells\" of code. After you've loaded the notebook, you can execute the code in a cell by highlighting it and hitting Ctrl+Enter. In general, you need to execute the cells from top to bottom, but you can usually run a cell more than once without messing anything up. Experiment!\n",
"\n",
"If things start acting strange, you can interrupt the Python process by selecting \"Kernel > Interrupt\"—this tells Python to stop doing whatever it was doing. Select \"Kernel > Restart\" to clear all of your variables and start from scratch.\n",
"\n",
"We'll start with a very simple task: getting all of the words from a text file.\n",
"\n",
"## Getting all of the words from a text file\n",
"\n",
"The first thing you'll want to do is get a [plain text](http://air.decontextualize.com/plain-text/) file! One place to look is [Project Gutenberg](http://www.gutenberg.org), which is a repository of books in English that are in the public domain.\n",
"\n",
"Once you've found a plain text file, save it to the same folder as the folder that contains this Jupyter Notebook file. Replace `pg84.txt` in the cell below with the filename of your plain text file and execute the cell."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"words = open(\"pg84.txt\").read().split()"
]
},
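{
"cell_type": "markdown",
"metadata": {},
"source": [
"(One common snag: if the cell above raised a `UnicodeDecodeError`, your operating system's default text encoding doesn't match the file's. Project Gutenberg plain text files are usually UTF-8, so the variant below, which names the encoding explicitly, often fixes it. This is a sketch that assumes your file is UTF-8; adjust if it isn't.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same as the cell above, but with the file's encoding given explicitly (assumes UTF-8)\n",
"words = open(\"pg84.txt\", encoding=\"utf-8\").read().split()"
]
},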
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! If you got an error, make sure that the file name is correct (keep the quotes!) and run the cell again. You've created a variable `words` that contains a list of all the words in your text file. The `len()` function tells you how many words are in the list:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"75043"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below uses Python's `random` module to print out 25 words at random. (You can change the number if you want more or less.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Godwin)\n",
"I\n",
"good\n",
"did\n",
"burn.\n",
"drawing\n",
"all\n",
"may\n",
"kindly\n",
"eye\n",
"visits\n",
"piny\n",
"to\n",
"of\n",
"only\n",
"but\n",
"with\n",
"of\n",
"dreams\n",
"to\n",
"different\n",
"to\n",
"the\n",
"innocence.”\n",
"the\n"
]
}
],
"source": [
"import random\n",
"for word in random.sample(words, 25):\n",
" print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are some weirdnesses here, especially the punctuation that you see at the end of some of the strings!\n",
"\n",
"But the real question is: what is a word? Consider, in English:\n",
"\n",
"* \"Basket ball\" and is one word, but \"playerpiano\" is two (?). Why \"basketball net\" and not \"basketballnet\"?\n",
"* \"Particleboard\" or \"particle board\"?\n",
"* \"Mr. Smith\"\n",
"* \"single-minded,\" \"rent-a-cop,\" \"abso-f###ing-lutely\"\n",
"* \"power drill\" is two words in English, whereas the equivalent in German is one: \"Schnellschrauber\"\n",
"* In Mowhawk: \"Sahonwanhotónkwahse\"; one word, roughly translated, \"she opened the door for him again.\"\n",
"* Likewise, one word in Turkish: \"Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesineyken\" meaning \"As though you are from those whom we may not be able to easily make into a maker of unsuccessful ones\"\n",
"\n",
"So in order to turn a text into words, you need to know something about how that language works. (And you have to be willing to accept a little squishiness in how accurate the results are.)"
]
},
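{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how much the definition of \"word\" matters in practice, here's a small sketch of an alternative tokenizer. (It's an assumption-laden one: it treats a word as a run of letters, optionally joined by internal apostrophes or hyphens.) Punctuation no longer sticks to the words, but contractions, dashes and curly quotes all become judgment calls:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"# a naive regular-expression tokenizer; a sketch, not the approach used later in this notebook\n",
"re_words = re.findall(r\"[A-Za-z]+(?:['’-][A-Za-z]+)*\", open(\"pg84.txt\").read())\n",
"print(len(re_words))  # compare with len(words) above\n",
"print(re_words[:10])"
]
},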
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Counting words\n",
"\n",
"One of the most common tasks in text analysis is counting how many times every word in a text occurs. The easiest way to do this in Python is with the `Counter` object, contained in the `collections` module. Run the following cell to create a `Counter` object to count your words."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"word_count = Counter(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using syntax like what you see in the cell below, you can check to see how often particular words occur in the text:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"446"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"he\"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"172"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"she\"]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"298"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"this\"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"974"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"that\"]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"yonder\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One strange thing you'll notice is that upper-case and lower-case versions of the same word are counted separately:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"heaven\"]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"Heaven\"]"
]
},
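{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one mitigation: lower-case every word before counting. (This folds \"Heaven\" into \"heaven\", but it also conflates proper nouns with common ones, so treat it as a quick fix rather than the approach used in the rest of this notebook.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# case-insensitive counts: every word is lower-cased before it's tallied\n",
"word_count_folded = Counter(w.lower() for w in words)\n",
"word_count_folded[\"heaven\"]"
]
},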
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll figure out a way to mitigate this problem later on.\n",
"\n",
"The following cell prints out the twenty most common words in the text, along with the number of times they occur. (Again, you can change the number if you want more or less.)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the 3898\n",
"and 2903\n",
"I 2719\n",
"of 2634\n",
"to 2072\n",
"my 1631\n",
"a 1338\n",
"in 1071\n",
"was 992\n",
"that 974\n",
"had 679\n",
"with 654\n",
"which 540\n",
"but 538\n",
"me 529\n",
"his 500\n",
"not 479\n",
"as 477\n",
"for 463\n",
"he 446\n"
]
}
],
"source": [
"for word, number in word_count.most_common(20):\n",
" print(word, number)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stopwords\n",
"\n",
"Intuitively, it seems strange to count these words like \"the\" and \"and\" among the \"most common,\" because words like these are presumably common across *all* texts, not just this text in particular. To solve this problem, we can use \"stopwords\": a list of commonly-occurring English words that shouldn't be counted for the purpose of word frequency. No one exactly agrees on what this list should be, but here's one attempt (from [here](https://gist.github.com/sebleier/554280)). Make sure to execute this cell before you continue! You can add or remove items from the list if you want; just make sure to put quotes around the word you want to add, and add a comma at the end of the line (outside the quotes)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"stopwords = [\n",
" \"i\",\n",
" \"me\",\n",
" \"my\",\n",
" \"myself\",\n",
" \"we\",\n",
" \"our\",\n",
" \"ours\",\n",
" \"ourselves\",\n",
" \"you\",\n",
" \"your\",\n",
" \"yours\",\n",
" \"yourself\",\n",
" \"yourselves\",\n",
" \"he\",\n",
" \"him\",\n",
" \"his\",\n",
" \"himself\",\n",
" \"she\",\n",
" \"her\",\n",
" \"hers\",\n",
" \"herself\",\n",
" \"it\",\n",
" \"its\",\n",
" \"itself\",\n",
" \"they\",\n",
" \"them\",\n",
" \"their\",\n",
" \"theirs\",\n",
" \"themselves\",\n",
" \"what\",\n",
" \"which\",\n",
" \"who\",\n",
" \"whom\",\n",
" \"this\",\n",
" \"that\",\n",
" \"these\",\n",
" \"those\",\n",
" \"am\",\n",
" \"is\",\n",
" \"are\",\n",
" \"was\",\n",
" \"were\",\n",
" \"be\",\n",
" \"been\",\n",
" \"being\",\n",
" \"have\",\n",
" \"has\",\n",
" \"had\",\n",
" \"having\",\n",
" \"do\",\n",
" \"does\",\n",
" \"did\",\n",
" \"doing\",\n",
" \"a\",\n",
" \"an\",\n",
" \"the\",\n",
" \"and\",\n",
" \"but\",\n",
" \"if\",\n",
" \"or\",\n",
" \"because\",\n",
" \"as\",\n",
" \"until\",\n",
" \"while\",\n",
" \"of\",\n",
" \"at\",\n",
" \"by\",\n",
" \"for\",\n",
" \"with\",\n",
" \"about\",\n",
" \"against\",\n",
" \"between\",\n",
" \"into\",\n",
" \"through\",\n",
" \"during\",\n",
" \"before\",\n",
" \"after\",\n",
" \"above\",\n",
" \"below\",\n",
" \"to\",\n",
" \"from\",\n",
" \"up\",\n",
" \"down\",\n",
" \"in\",\n",
" \"out\",\n",
" \"on\",\n",
" \"off\",\n",
" \"over\",\n",
" \"under\",\n",
" \"again\",\n",
" \"further\",\n",
" \"then\",\n",
" \"once\",\n",
" \"here\",\n",
" \"there\",\n",
" \"when\",\n",
" \"where\",\n",
" \"why\",\n",
" \"how\",\n",
" \"all\",\n",
" \"any\",\n",
" \"both\",\n",
" \"each\",\n",
" \"few\",\n",
" \"more\",\n",
" \"most\",\n",
" \"other\",\n",
" \"some\",\n",
" \"such\",\n",
" \"no\",\n",
" \"nor\",\n",
" \"not\",\n",
" \"only\",\n",
" \"own\",\n",
" \"same\",\n",
" \"so\",\n",
" \"than\",\n",
" \"too\",\n",
" \"very\",\n",
" \"s\",\n",
" \"t\",\n",
" \"can\",\n",
" \"will\",\n",
" \"just\",\n",
" \"don\",\n",
" \"should\",\n",
" \"now\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make use of this list, we'll create a new list that only includes those words that are *not* in the stopwords list. The Python code to do this is in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clean_words = [w for w in words if w.lower() not in stopwords]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checking the length of this list:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"36422"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(clean_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're left with far fewer words! But if we create a `Counter` object with this list of words, our list of the most common words is a bit more interesting:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_count = Counter(clean_words)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"he\"]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count[\"she\"]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"could 187\n",
"would 177\n",
"one 174\n",
"me, 148\n",
"upon 123\n",
"yet 109\n",
"might 107\n",
"me. 107\n",
"every 103\n",
"first 102\n",
"shall 98\n",
"towards 93\n",
"may 92\n",
"saw 91\n",
"even 81\n",
"found 77\n",
"man 76\n",
"time 75\n",
"father 73\n",
"felt 72\n",
"“I 71\n",
"said 68\n",
"made 66\n",
"many 65\n",
"life 65\n"
]
}
],
"source": [
"for word, count in word_count.most_common(25):\n",
" print(word, count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still not perfect, but it's a step forward.\n",
"\n",
"If you're interested in keyword extraction, [I made a notebook that explains in detail a well-motivated method for extracting keywords from arbitrary (English) texts](https://github.com/aparrish/rwet/blob/master/quick-and-dirty-keywords.ipynb) that has links to a number of interesting and relevant papers!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural language processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our word counts would be more interesting if we could reason better about the *language* in the text, not just the individual characters. For example, if we knew the parts of speech of individual words, we could exclude words that are determiners, conjunctions, etc. from the count. If we knew what kinds of things the words were referring to, we could count how many times particular characters or settings are referenced.\n",
"\n",
"To do this, we need to do a bit of Natural Language Processing. I cover just the bare essentials in this notebook. [Here's a more in-depth tutorial that I wrote](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb).\n",
"\n",
"Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to install it (i.e., download the code and put it in a place where Python can find it) and download the language model. (The language model contains statistical information about a particular language that makes it possible for spaCy to do things like parse sentences into their constituent parts.)\n",
"\n",
"To install spaCy, [follow the instructions here](https://spacy.io/usage/). If you're using Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type\n",
"\n",
" conda install -c conda-forge spacy\n",
"\n",
"This line installs the library. You'll also need to download a language model. For that, type:\n",
"\n",
" python -m spacy download en_core_web_md\n",
"\n",
"(Replace en with the language code for your desired language, if there's a model available for it.) The language model contains the statistical information necessary to parse text into sentences and sentences into parts of speech. Note that this download is several hundred megabytes, so it might take a while!\n",
"\n",
"Once you've installed the library and downloaded the model, you should be able to load the model in the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import spacy\n",
"nlp = spacy.load('en_core_web_md')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(This could also take a while–the model is very large and your computer needs to load it from your hard drive and into memory. When you see a `[*]` next to a cell, that means that your computer is still working on executing the code in the cell.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can load in your text using the following line of code! (Remember to replace `pg84.txt` with the filename of your own text file.)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# replace \"pg84.txt\" with the name of your own text file, then run this cell with CTRL+Enter.\n",
"text = open(\"pg84.txt\").read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code in the following cell \"cleans up\" the text a little bit by replacing certain characters that spaCy doesn't cope well with:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re\n",
"text = re.sub(r\"[“”]\", '\"', text)\n",
"text = re.sub(r\"[’‘]\", \"'\", text)\n",
"text = text.replace(\"\\n\", \" \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"doc = nlp(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Right off the bat, the spaCy library gives us access to a number of interesting units of text:\n",
"\n",
"* All of the sentences (`doc.sents`)\n",
"* All of the words (`doc`)\n",
"* All of the \"named entitites,\" like names of places, people, #brands, etc. (`doc.ents`)\n",
"* All of the \"noun chunks,\" i.e., nouns in the text plus surrounding matter like adjectives and articles\n",
"\n",
"The cell below, we extract these into variables so we can play around with them a little bit."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sentences = list(doc.sents)\n",
"words = [w for w in list(doc) if w.is_alpha]\n",
"noun_chunks = list(doc.noun_chunks)\n",
"entities = list(doc.ents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this information in hand, we can answer interesting questions like: how many sentences are in the text?"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3435"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `random.sample()`, we can get a small, randomly-selected sample from these lists. Here are five random sentences:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"To you first entering on life, to whom care is new and agony unknown, how can you understand what I have felt and still feel?\n",
"---\n",
"\"My dear Frankenstein,\" exclaimed Henry, when he perceived me weep with bitterness, \"are you always to be unhappy?\n",
"---\n",
"Yet some feelings, unallied to the dross of human nature, beat even in these rugged bosoms.\n",
"---\n",
"\"Autumn passed thus.\n",
"---\n",
"With this deep consciousness of what they owed towards the being to which they had given life, added to the active spirit of tenderness that animated both, it may be imagined that while during every hour of my infant life I received a lesson of patience, of charity, and of self-control, I was so guided by a silken cord that all seemed but one train of enjoyment to me.\n",
"---\n"
]
}
],
"source": [
"for item in random.sample(sentences, 5):\n",
" print(item.text.strip().replace(\"\\n\", \" \"))\n",
" print(\"---\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random words:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"place\n",
"that\n",
"My\n",
"events\n",
"appeared\n",
"reflection\n",
"received\n",
"the\n",
"the\n",
"but\n"
]
}
],
"source": [
"for item in random.sample(words, 10):\n",
" print(item.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random noun chunks:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the several empires\n",
"my favourite authors\n",
"I\n",
"the truth\n",
"him\n",
"the full extent\n",
"she\n",
"the language\n",
"Elizabeth\n",
"him\n"
]
}
],
"source": [
"for item in random.sample(noun_chunks, 10):\n",
" print(item.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ten random entities:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Felix\n",
"one\n",
"thousand\n",
"21\n",
"Switzerland\n",
"Some hours\n",
"last night\n",
"first\n",
"Rhine\n",
"first\n"
]
}
],
"source": [
"for item in random.sample(entities, 10):\n",
" print(item.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parts of speech\n",
"\n",
"The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists—`nouns`, `verbs`, `adjs` and `advs`—that contain only words of the specified parts of speech. ([There's a full list of part of speech tags here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english))."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nouns = [w for w in words if w.pos_ == \"NOUN\"]\n",
"verbs = [w for w in words if w.pos_ == \"VERB\"]\n",
"adjs = [w for w in words if w.pos_ == \"ADJ\"]\n",
"advs = [w for w in words if w.pos_ == \"ADV\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we can print out a random sample of any of these:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ages\n",
"sailors\n",
"joy\n",
"religion\n",
"environs\n",
"night\n",
"space\n",
"balminess\n",
"air\n",
"lie\n",
"men\n",
"day\n",
"resolutions\n",
"child\n",
"whom\n",
"affection\n",
"prey\n",
"wonders\n",
"persecutor\n",
"prospect\n"
]
}
],
"source": [
"for item in random.sample(nouns, 20): # change \"nouns\" to \"verbs\" or \"adjs\" or \"advs\" to sample from those lists!\n",
" print(item.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Entity types\n",
"\n",
"The parser in spaCy not only identifies \"entities\" but also assigns them to a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"people = [e for e in entities if e.label_ == \"PERSON\"]\n",
"locations = [e for e in entities if e.label_ == \"LOC\"]\n",
"times = [e for e in entities if e.label_ == \"TIME\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then you can print out a random sample:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"night\n",
"about nine o'clock\n",
"a few sad hours\n",
"the sixth hour\n",
"about an hour\n",
"the hour\n",
"midnight\n",
"About two hours\n",
"the night\n",
"the next morning\n",
"that hour\n",
"midnight\n",
"night\n",
"one in the morning\n",
"a few minutes\n",
"the preceding night\n",
"night\n",
"one evening\n",
"eleven o'clock\n",
"last night\n"
]
}
],
"source": [
"for item in random.sample(times, 20): # change \"times\" to \"people\" or \"locations\" to sample those lists\n",
" print(item.text.strip())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding the most common\n",
"\n",
"So let's repeat the task of finding the most common words, this time using the words parsed from the text using spaCy. The code looks mostly the same:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"word_count = Counter([w.text for w in words])"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"15"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_count['heaven']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can even filter these with the stopwords list, as in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"filtered_words = [w.text for w in words if w.text.lower() not in stopwords]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"word_count = Counter(filtered_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see about the list of the most common words:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"could 191\n",
"one 186\n",
"would 183\n",
"man 133\n",
"father 133\n",
"upon 124\n",
"yet 115\n",
"life 112\n",
"first 108\n",
"might 108\n",
"every 104\n",
"eyes 104\n",
"said 102\n",
"shall 99\n",
"time 97\n",
"saw 94\n",
"towards 93\n",
"may 93\n",
"Elizabeth 92\n",
"night 89\n"
]
}
],
"source": [
"for word, count in word_count.most_common(20):\n",
" print(word, count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's actually a little bit better! Because spaCy knows enough about language to not include punctuation as part of the words, we're not getting as many \"noisy\" counts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Writing to a file\n",
"\n",
"The following cell defines a function for writing data from a `Counter` object to a file. The file is in \"tab-separated values\" format, which you can open using most spreadsheet programs. Execute it before you continue:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def save_counter_tsv(filename, counter, limit=1000):\n",
" with open(filename, \"w\") as outfile:\n",
" outfile.write(\"key\\tvalue\\n\")\n",
" for item, count in counter.most_common():\n",
" outfile.write(item.strip() + \"\\t\" + str(count) + \"\\n\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"save_counter_tsv(\"100_common_words.tsv\", word_count, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try opening this file in Excel or Google Docs or Numbers!\n",
"\n",
"If you want to write the data from another `Counter` object to a file:\n",
"\n",
"* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)\n",
"* Replace `word_count` with the name of any of the `Counter` objects we've made in this sheet and use it in place of `word_count`\n",
"* Change the number to the number of rows you want to include in your spreadsheet."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### When do things happen in this text?\n",
"\n",
"Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular \"times\" (durations, times of day, etc.) are mentioned in the text."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"time_counter = Counter([e.text.lower().strip() for e in times])\n",
"save_counter_tsv(\"time_count.tsv\", time_counter, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same thing, but with people:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"people_counter = Counter([e.text.lower() for e in people])\n",
"save_counter_tsv(\"people_count.tsv\", people_counter, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Semantic similarity with word vectors\n",
"\n",
"Every word in a spaCy parse is associated with a 300-dimensional vector. (A vector is just a fancy word for a \"list of numbers.\") This vector is based on a machine learning algorithm (called [GloVe](https://nlp.stanford.edu/projects/glove/)) that assigns the value to a word based on the frequency of the contexts it's found in. The math is complex, but the way it works out is that two words that have similar vectors are usually also similar in *meaning*. [More notes on the concept of word vectors here](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb).\n",
"\n",
"The following cell defines a function `cosine()` that returns a measure of \"similarity\" between two vectors. (Note that you'll need the [scipy](https://www.scipy.org/) library installed for this to work; fortunately, Anaconda comes with the library pre-installed!)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.spatial.distance import cosine as cosine_distance\n",
"def cosine(v1, v2):\n",
" if np.any(v1) and np.any(v2):\n",
" return 1 - cosine_distance(v1, v2)\n",
" else:\n",
" return 0"
]
},
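{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (the word choices here are arbitrary, and the exact numbers depend on the model version), a related pair of words should score noticeably higher than an unrelated pair:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a related pair vs. an unrelated pair; expect the first number to be larger\n",
"print(cosine(nlp.vocab[\"dog\"].vector, nlp.vocab[\"cat\"].vector))\n",
"print(cosine(nlp.vocab[\"dog\"].vector, nlp.vocab[\"sandwich\"].vector))"
]
},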
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to do a little bit of speculative text analysis. We'll start with a list of all unique words in our text:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"unique_words = list(set([w.text for w in words]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function defined in the cell below checks each word in a source text, compares it to the vector of the specified word (which can be any English word), and returns the words with the highest cosine similarity from the source text. You can think of it as sort of a conceptual \"translator,\" translating a word from any domain and register of English to its closest equivalent in the text."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def similar_words(word_to_check, source_set):\n",
" return sorted(source_set,\n",
" key=lambda x: cosine(nlp.vocab[word_to_check].vector, nlp.vocab[x].vector),\n",
" reverse=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try it out by running the cell below. Replace `grumpy` with a word of your choice, and `10` with the number of results you want:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"['gruff',\n",
" 'annoyed',\n",
" 'exasperated',\n",
" 'angry',\n",
" 'impetuous',\n",
" 'impatient',\n",
" 'embittered',\n",
" 'Unhappy',\n",
" 'dejected',\n",
" 'discontented']"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# change \"kitten\" to a word of your choice and 10 to the number of results you want\n",
"similar_words(\"grumpy\", unique_words)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the closest thing to \"baseball\" in a text that doesn't mention baseball?"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['league',\n",
" 'leagues',\n",
" 'sport',\n",
" 'game',\n",
" 'coach',\n",
" 'bat',\n",
" 'college',\n",
" 'Homer',\n",
" 'season',\n",
" 'ball']"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similar_words(\"baseball\", unique_words)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This works not just for individual words, but for *entire sentences*. To get the vector for a sentence, we simply average its component vectors. The `sentence_vector` function in the cell below takes a spaCy-parsed sentence and returns the averaged vector of the words in the sentence. The `similar_sentences` function takes an arbitrary string as a parameter and returns the sentences in our text closest in meaning (using the list of sentences assigned to the `sentences` variable further up the notebook):"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def sentence_vector(sent):\n",
" vec = np.array([w.vector for w in sent if w.has_vector and np.any(w.vector)])\n",
" if len(vec) > 0:\n",
" return np.mean(vec, axis=0)\n",
" else:\n",
" raise ValueError(\"no words with vectors found\") \n",
"def similar_sentences(input_str, num=10):\n",
" input_vector = sentence_vector(nlp(input_str))\n",
" return sorted(sentences,\n",
" key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vector),\n",
" reverse=True)[:num]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try it out! Replace the string in `sentence_to_check` below with your own sentence, and run the cell. (It might take a while, depending on how big your source text file is.)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"My food is not that of man; I do not destroy the lamb and the kid to glut my appetite; acorns and berries afford me sufficient nourishment.\n",
"\n",
"I greedily devoured the remnants of the shepherd's breakfast, which consisted of bread, cheese, milk, and wine; the latter, however, I did not like.\n",
"\n",
"I read and reread her letter, and some softened feelings stole into my heart and dared to whisper paradisiacal dreams of love and joy; but the apple was already eaten, and the angel's arm bared to drive me from all hope.\n",
"\n",
"You wish to eat me and tear me to pieces.\n",
"\n",
"Follow me; I seek the everlasting ices of the north, where you will feel the misery of cold and frost, to which I am impassive.\n",
"\n",
"For an instant I dared to shake off my chains and look around me with a free and lofty spirit, but the iron had eaten into my flesh, and I sank again, trembling and hopeless, into my miserable self.\n",
"\n",
"I love you very tenderly.\n",
"\n",
"The vegetables in the gardens, the milk and cheese that I saw placed at the windows of some of the cottages, allured my appetite.\n",
"\n",
"I wish you could see him; he is very tall of his age, with sweet laughing blue eyes, dark eyelashes, and curling hair.\n",
"\n",
"I am surrounded by mountains of ice which admit of no escape and threaten every moment to crush my vessel.\n",
"\n"
]
}
],
"source": [
"sentence_to_check = \"I love to eat strawberry ice cream.\"\n",
"for item in similar_sentences(sentence_to_check):\n",
" print(item.text.strip())\n",
" print(\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is great for poetry but also for things like classifying documents, stylistics, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Rewriting a text with semantic similarity\n",
"\n",
"Now for a bit of creativity. Below, I show how to \"rewrite\" a source text by replacing its individual units (whether words or sentences) with semantically similar units from our source text. The text we'll be rewriting is Frost's _The Road Not Taken_, which I've pasted below in and assigned to a variable `frost`. (Feel free to put in a text of your own choosing! Shorter is better for this exercise.)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"frost = \"\"\"Two roads diverged in a yellow wood,\n",
"And sorry I could not travel both\n",
"And be one traveler, long I stood\n",
"And looked down one as far as I could\n",
"To where it bent in the undergrowth;\n",
"\n",
"Then took the other, as just as fair,\n",
"And having perhaps the better claim,\n",
"Because it was grassy and wanted wear;\n",
"Though as for that the passing there\n",
"Had worn them really about the same,\n",
"\n",
"And both that morning equally lay\n",
"In leaves no step had trodden black.\n",
"Oh, I kept the first for another day!\n",
"Yet knowing how way leads on to way,\n",
"I doubted if I should ever come back.\n",
"\n",
"I shall be telling this with a sigh\n",
"Somewhere ages and ages hence:\n",
"Two roads diverged in a wood, and I—\n",
"I took the one less travelled by,\n",
"And that has made all the difference.\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code in the cell below replaces each *word* in the text with the second-most semantically similar word in our source text. (Second-most semantically similar to encourage the appearance of words that aren't just the same as the original word in the text.)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"original: Two roads diverged in a yellow wood,\n",
"two impassable developed in a red planks and\n",
"\n",
"original: And sorry I could not travel both\n",
"and forgot guess could not travelling Well\n",
"\n",
"original: And be one traveler, long I stood\n",
"and be One adventurer and short guess stood\n",
"\n",
"original: And looked down one as far as I could\n",
"and seemed up One As Much As guess could\n",
"\n",
"original: To where it bent in the undergrowth;\n",
"To Where It bending in the bushes and\n",
"\n",
"original: \n",
"\n",
"\n",
"original: Then took the other, as just as fair,\n",
"then went the other and As actually As reasonable and\n",
"\n",
"original: And having perhaps the better claim,\n",
"and Having perhaps the Good claims and\n",
"\n",
"original: Because it was grassy and wanted wear;\n",
"because It Was sod and Did wearing and\n",
"\n",
"original: Though as for that the passing there\n",
"But As For That the passes There\n",
"\n",
"original: Had worn them really about the same,\n",
"had wear they actually About the one and\n",
"\n",
"original: \n",
"\n",
"\n",
"original: And both that morning equally lay\n",
"and Well That Morning quite laid\n",
"\n",
"original: In leaves no step had trodden black.\n",
"in leaf No steps had tread Black so\n",
"\n",
"original: Oh, I kept the first for another day!\n",
"oh and guess keeping the second For Another day thanks\n",
"\n",
"original: Yet knowing how way leads on to way,\n",
"Yet knew how ways ultimately On To ways and\n",
"\n",
"original: I doubted if I should ever come back.\n",
"guess fancying If guess Should Ever come Again so\n",
"\n",
"original: \n",
"\n",
"\n",
"original: I shall be telling this with a sigh\n",
"guess hereafter be tell This With a groans\n",
"\n",
"original: Somewhere ages and ages hence:\n",
"around age and age Thus details\n",
"\n",
"original: Two roads diverged in a wood, and I—\n",
"two impassable developed in a planks and and upon\n",
"\n",
"original: I took the one less travelled by,\n",
"guess went the One than traversed By and\n",
"\n",
"original: And that has made all the difference.\n",
"and That has making all the mean so\n",
"\n"
]
}
],
"source": [
"for line in frost.split(\"\\n\"):\n",
" print(\"original:\", line)\n",
" replaced_words = [similar_words(word.text, unique_words)[1] for word in nlp(line)]\n",
" print(\" \".join(replaced_words))\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neat! And in the cell below, each *line* of the poem is replaced with the sentence from our source text that is most semantically similar:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"original: Two roads diverged in a yellow wood,\n",
"replacement: \"I continued to wind among the paths of the wood, until I came to its boundary, which was skirted by a deep and rapid river, into which many of the trees bent their branches, now budding with the fresh spring.\n",
"\n",
"original: And sorry I could not travel both\n",
"replacement: I regret that I am taken from you; and, happy and beloved as I have been, is it not hard to quit you all?\n",
"\n",
"original: And be one traveler, long I stood\n",
"replacement: I then reflected, and the thought made me shiver, that the creature whom I had left in my apartment might still be there, alive and walking about. \n",
"\n",
"original: And looked down one as far as I could\n",
"replacement: If I looked up, I saw scenes which were familiar to me in my happier time and which I had contemplated but the day before in the company of her who was now but a shadow and a recollection.\n",
"\n",
"original: To where it bent in the undergrowth;\n",
"replacement: \"I continued to wind among the paths of the wood, until I came to its boundary, which was skirted by a deep and rapid river, into which many of the trees bent their branches, now budding with the fresh spring.\n",
"\n",
"\n",
"\n",
"original: Then took the other, as just as fair,\n",
"replacement: \"Presently two countrymen passed by, but pausing near the cottage, they entered into conversation, using violent gesticulations; but I did not understand what they said, as they spoke the language of the country, which differed from that of my protectors. \n",
"\n",
"original: And having perhaps the better claim,\n",
"replacement: If, instead of this remark, my father had taken the pains to explain to me that the principles of Agrippa had been entirely exploded and that a modern system of science had been introduced which possessed much greater powers than the ancient, because the powers of the latter were chimerical, while those of the former were real and practical, under such circumstances I should certainly have thrown Agrippa aside and have contented my imagination, warmed as it was, by returning with greater ardour to my former studies.\n",
"\n",
"original: Because it was grassy and wanted wear;\n",
"replacement: We returned again, with torches; for I could not rest, when I thought that my sweet boy had lost himself, and was exposed to all the damps and dews of night; Elizabeth also suffered extreme anguish. \n",
"\n",
"original: Though as for that the passing there\n",
"replacement: The prospect of such an occupation made every other circumstance of existence pass before me like a dream, and that thought only had to me the reality of life. \n",
"\n",
"original: Had worn them really about the same,\n",
"replacement: And the same feelings which made me neglect the scenes around me caused me also to forget those friends who were so many miles absent, and whom I had not seen for so long a time. \n",
"\n",
"\n",
"\n",
"original: And both that morning equally lay\n",
"replacement: Everything was silent except the leaves of the trees, which were gently agitated by the wind; the night was nearly dark, and the scene would have been solemn and affecting even to an uninterested observer. \n",
"\n",
"original: In leaves no step had trodden black.\n",
"replacement: He had apparently been strangled, for there was no sign of any violence except the black mark of fingers on his neck. \n",
"\n",
"original: Oh, I kept the first for another day!\n",
"replacement: I had not a moment to lose, but seizing the hand of the old man, I cried, 'Now is the time! \n",
"\n",
"original: Yet knowing how way leads on to way,\n",
"replacement: But I fear, from what you have yourself described to be his properties, that this will prove impracticable; and thus, while every proper measure is pursued, you should make up your mind to disappointment.\" \n",
"\n",
"original: I doubted if I should ever come back.\n",
"replacement: I cannot pretend to describe what I then felt. \n",
"\n",
"\n",
"\n",
"original: I shall be telling this with a sigh\n",
"replacement: I was bound by a solemn promise which I had not yet fulfilled and dared not break, or if I did, what manifold miseries might not impend over me and my devoted family! \n",
"\n",
"original: Somewhere ages and ages hence:\n",
"replacement: Chapter 2 We were brought up together; there was not quite a year difference in our ages. \n",
"\n",
"original: Two roads diverged in a wood, and I—\n",
"replacement: \"I continued to wind among the paths of the wood, until I came to its boundary, which was skirted by a deep and rapid river, into which many of the trees bent their branches, now budding with the fresh spring.\n",
"\n",
"original: I took the one less travelled by,\n",
"replacement: \"I generally rested during the day and travelled only when I was secured by night from the view of man. \n",
"\n",
"original: And that has made all the difference.\n",
"replacement: Chemistry is that branch of natural philosophy in which the greatest improvements have been and may be made; it is on that account that I have made it my peculiar study; but at the same time, I have not neglected the other branches of science.\n",
"\n"
]
}
],
"source": [
"for line in frost.split(\"\\n\"):\n",
" if len(line) > 0:\n",
" print(\"original:\", line)\n",
" print(\"replacement:\", similar_sentences(line)[0].text)\n",
" else:\n",
" print()\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I've many more examples of similarity replacement in [this notebook](https://gist.github.com/aparrish/4f4f35a046ac1d954a02fc1ffbed9dcb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
name: rwet
channels:
- conda-forge
dependencies:
- python
- spacy
python -m spacy download en_core_web_md