@AllenDowney
Created March 18, 2019 14:49
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text analysis with Python\n",
"\n",
"\n",
"Copyright 2019 Allen Downey\n",
"\n",
"[MIT License](https://opensource.org/licenses/MIT)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word Frequencies\n",
"----------------\n",
"\n",
"Let's look at frequencies of words, bigrams and trigrams in a text.\n",
"\n",
"The following function reads lines from a local file and splits them into words:"
]
},
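{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def iterate_words(filename):\n",
"    \"\"\"Read lines from a file and split them into words.\"\"\"\n",
"    # Use a context manager so the file is closed when iteration ends.\n",
"    with open(filename) as fp:\n",
"        for line in fp:\n",
"            # split() with no arguments already strips surrounding whitespace.\n",
"            for word in line.split():\n",
"                yield word"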
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's an example using a book from Project Gutenberg. `wc` is a Counter of words, that is, a dictionary that maps from each word to the number of times it appears:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# FAIRY TALES\n",
"# By The Brothers Grimm\n",
"# http://www.gutenberg.org/cache/epub/2591/pg2591.txt\n",
"wc = Counter(iterate_words('pg2591.txt'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the 20 most common words:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 6507),\n",
" ('and', 5250),\n",
" ('to', 2707),\n",
" ('a', 1932),\n",
" ('he', 1817),\n",
" ('of', 1450),\n",
" ('was', 1337),\n",
" ('in', 1080),\n",
" ('she', 1049),\n",
" ('that', 1021),\n",
" ('his', 1014),\n",
" ('you', 941),\n",
" ('it', 881),\n",
" ('her', 880),\n",
" ('had', 827),\n",
" ('I', 755),\n",
" ('they', 751),\n",
" ('for', 721),\n",
" ('with', 720),\n",
" ('as', 718)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wc.most_common(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word frequencies in natural languages follow a predictable pattern called Zipf's law: a word's frequency is roughly inversely proportional to its rank. (Zipf's law is an instance of Stigler's law, which is itself an instance of Stigler's law.)\n",
"\n",
"We can see the pattern by lining up the words in descending order of frequency and plotting their counts (6507, 5250, 2707, ...) versus ranks (1st, 2nd, 3rd, ...):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def counter_ranks(wc):\n",
"    \"\"\"Returns ranks and counts as tuples, most frequent word first.\"\"\"\n",
"    # Ranks start at 1 so that every point can be shown on a log scale.\n",
"    return zip(*enumerate(sorted(wc.values(), reverse=True), start=1))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ranks, counts = counter_ranks(wc)\n",
"plt.plot(ranks, counts, linewidth=3)\n",
"plt.xlabel('Rank')\n",
"plt.ylabel('Count')\n",
"plt.title('Word count versus rank, linear scale');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Huh. Maybe that's not so clear after all. The problem is that the counts drop off very quickly. If we use the highest count to scale the figure, most of the other counts are indistinguishable from zero.\n",
"\n",
"Also, there are more than 10,000 words, but most of them appear only a few times, so we are wasting most of the space in the figure in a regime where nothing is happening.\n",
"\n",
"This kind of thing happens a lot. A common way to deal with it is to compute the log of the quantities or to plot them on a log scale:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ranks, counts = counter_ranks(wc)\n",
"plt.plot(ranks, counts, linewidth=4)\n",
"plt.xlabel('Rank')\n",
"plt.ylabel('Count')\n",
"plt.xscale('log')\n",
"plt.yscale('log')\n",
"plt.title('Word count versus rank, log-log scale');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This (approximately) straight line is characteristic of Zipf's law.\n",
"\n",
"n-grams\n",
"-------\n",
"\n",
"On to the next topic: bigrams and trigrams."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from itertools import tee\n",
"\n",
"def pairwise(iterator):\n",
"    \"\"\"Iterates through a sequence in overlapping pairs.\n",
"\n",
"    If the sequence is 1, 2, 3, 4, the result is (1, 2), (2, 3), (3, 4).\n",
"    \"\"\"\n",
"    a, b = tee(iterator)\n",
"    # Advance the second copy by one so the pairs overlap.\n",
"    next(b, None)\n",
"    return zip(a, b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`bigrams` is the histogram of word pairs:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"bigrams = Counter(pairwise(iterate_words('pg2591.txt')))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here are the 20 most common:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(('to', 'the'), 444),\n",
" (('in', 'the'), 399),\n",
" (('of', 'the'), 369),\n",
" (('and', 'the'), 349),\n",
" (('into', 'the'), 294),\n",
" (('said', 'the'), 251),\n",
" (('on', 'the'), 199),\n",
" (('and', 'when'), 168),\n",
" (('he', 'had'), 164),\n",
" (('he', 'was'), 164),\n",
" (('to', 'be'), 163),\n",
" (('it', 'was'), 152),\n",
" (('Then', 'the'), 151),\n",
" (('I', 'will'), 149),\n",
" (('that', 'he'), 143),\n",
" (('at', 'the'), 142),\n",
" (('came', 'to'), 138),\n",
" (('and', 'he'), 135),\n",
" (('she', 'was'), 129),\n",
" (('all', 'the'), 125)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bigrams.most_common(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can iterate over the trigrams:"
]
},
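{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def triplewise(iterator):\n",
"    \"\"\"Iterates through a sequence in overlapping triples.\"\"\"\n",
"    a, b, c = tee(iterator, 3)\n",
"    # Advance b by one and c by two; the default avoids StopIteration\n",
"    # on sequences that are too short.\n",
"    next(b, None)\n",
"    next(c, None)\n",
"    next(c, None)\n",
"    return zip(a, b, c)"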
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And make a Counter:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"trigrams = Counter(triplewise(iterate_words('pg2591.txt')))\n",
"\n",
"# Uncomment this line to run the analysis with Elvis Presley lyrics\n",
"#trigrams = Counter(triplewise(iterate_words('lyrics-elvis-presley.txt')))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the 20 most common:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(('came', 'to', 'the'), 65),\n",
" (('and', 'when', 'he'), 50),\n",
" (('out', 'of', 'the'), 50),\n",
" (('said', 'to', 'the'), 34),\n",
" (('he', 'came', 'to'), 33),\n",
" (('and', 'when', 'she'), 33),\n",
" (('went', 'into', 'the'), 32),\n",
" (('went', 'to', 'the'), 31),\n",
" (('and', 'said', 'to'), 31),\n",
" (('one', 'of', 'the'), 30),\n",
" (('came', 'to', 'a'), 30),\n",
" (('and', 'as', 'he'), 29),\n",
" (('they', 'came', 'to'), 29),\n",
" (('he', 'did', 'not'), 28),\n",
" (('there', 'was', 'a'), 28),\n",
" (('that', 'he', 'had'), 28),\n",
" (('and', 'I', 'will'), 27),\n",
" (('that', 'it', 'was'), 25),\n",
" (('and', 'at', 'last'), 24),\n",
" (('and', 'when', 'the'), 24)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trigrams.most_common(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Markov analysis\n",
"\n",
"And now for a little fun. I'll make a dictionary that maps from each word pair to a Counter of the words that can follow."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"# Map each (first, second) word pair to a Counter of the words that follow it.\n",
"d = defaultdict(Counter)\n",
"for (a, b, c), count in trigrams.items():\n",
"    d[a, b][c] += count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can look up a pair and see what might come next:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'ran': 2,\n",
" 'on': 1,\n",
" 'of': 2,\n",
" 'that': 1,\n",
" 'came,': 1,\n",
" 'streamed': 1,\n",
" 'fell': 1,\n",
" 'might': 1,\n",
" 'ran.': 1})"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['the', 'blood']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the most common words that follow \"into the\":"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('forest', 15),\n",
" ('forest,', 13),\n",
" ('garden', 9),\n",
" ('kitchen,', 8),\n",
" ('cellar', 8),\n",
" ('wide', 7),\n",
" ('room,', 7),\n",
" ('water,', 7),\n",
" ('wood', 6),\n",
" ('kitchen', 6)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['into', 'the'].most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the words that follow \"said the\":"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('old', 13),\n",
" ('man,', 12),\n",
" ('little', 10),\n",
" ('fisherman,', 8),\n",
" ('father,', 7),\n",
" ('ass,', 6),\n",
" ('other;', 5),\n",
" ('wife,', 5),\n",
" ('fish;', 5),\n",
" ('fish.', 5)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['said', 'the'].most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following function chooses a random word from the suffixes in a Counter, with probability proportional to each suffix's count:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def choice(counter):\n",
"    \"\"\"Chooses a random element, weighted by its count.\"\"\"\n",
"    # elements() repeats each key count times, so frequent keys are likelier.\n",
"    return random.choice(list(counter.elements()))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'fox,'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"choice(d['said', 'the'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given a prefix, we can choose a random suffix:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'fisherman,'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prefix = 'said', 'the'\n",
"suffix = choice(d[prefix])\n",
"suffix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can shift the words and compute the next prefix:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('the', 'fisherman,')"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prefix = prefix[1], suffix\n",
"prefix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeating this process, we can generate random new text that has the same correlation structure between words as the original:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'how happily we shall be your waiting-maid any longer.' So they went up to the forest. Ah! what a blockhead that brother of the sick; the virtues of all one daughter. Although the little birds are singing; you walk gravely along as if they have a lad who takes care of this agreement violates the law of the mill went 'Click clack, click clack, click clack, click clack.' The bird settled on the ground, he thought to find the way homewards free from the roof with his hand into his ear and tell all she had scarcely touched her sister, "
]
}
],
"source": [
"for i in range(100):\n",
"    suffix = choice(d[prefix])\n",
"    print(suffix, end=' ')\n",
"    prefix = prefix[1], suffix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a prefix of two words, we typically get text that flirts with sensibility."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}