Predictive text and text generation notebook. Written for Code Societies at SFPC, summer 2018. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Predictive text and text generation\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"This notebook is a whirlwind tour of how certain kinds of predictive text generation work! By \"predictive text generation\" what I mean is any text generation method that is based around a statistical model that, given a certain stretch of text, \"predicts\" which bit of text should come next, based on probabilities learned from an existing corpus of text.\n",
"\n",
"The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with text files\n",
"\n",
"Before we get started, we'll first need some text! Grab two [plain text files from Project Gutenberg](http://www.gutenberg.org/) (or from another source of your choice) and save them to the same directory as this notebook. (I suggest working with two files because we'll be running some code explicitly to \"compare\" two texts. Also, I think seeing two different outputs from the text generation methods discussed in this notebook will help you better understand how those methods work.) The code in the following cell loads into Python variables the contents of *two plain text files*, assigned to variables `text_a` and `text_b`. You'll need to replace the filenames with the names of the files that you downloaded, keeping the quotation marks (`\"`) intact."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"text_a = open(\"1342-0.txt\").read()\n",
"text_b = open(\"84-0.txt\").read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These variables are *strings*, which are essentially just long lists of the characters that occur in the text, in the order that they occur. The code in the following cell shows the first two hundred characters of text A:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\n",
"\n",
"This eBook is for the use of anyone anywhere at no cost and with\n",
"almost no restrictions whatsoever. You may copy it, give it away \n"
]
}
],
"source": [
"print(text_a[:200])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can change `text_a` to `text_b` to see the output from your second text, or change `200` to a number of your choosing.\n",
"\n",
"The `random.sample()` function gives us a random sampling of the contents of a variable (as long as that variable is a sequence of things, like a string or a list). So, for example, to see twenty random characters from text B:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['i',\n",
" 'o',\n",
" 'r',\n",
" 'a',\n",
" 'r',\n",
" 'd',\n",
" 't',\n",
" 'h',\n",
" 'r',\n",
" 'o',\n",
" ' ',\n",
" ' ',\n",
" ' ',\n",
" ' ',\n",
" 'c',\n",
" 'u',\n",
" 'u',\n",
" 'a',\n",
" 'e',\n",
" ' ']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import random\n",
"random.sample(text_b, 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This isn't incredibly helpful on its own, but you'll notice that the characters it drew (probably) more or less follow the expected letter distribution for English (i.e., lots of `e`s and `n`s and `t`s)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perhaps more interesting would be to see a randomly sampled list of *words*. To do this, we'll make separate variables for the words in the text, using a Python string method called `.split()`, which takes a string and turns it into a list of the whitespace-separated words contained in that string. The following cell makes two new variables that contain the words from each text:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"a_words = text_a.split()\n",
"b_words = text_b.split()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, ten random words from both text A and text B:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['Darcy',\n",
" 'Sir',\n",
" 'silence',\n",
" 'he',\n",
" 'consolation',\n",
" 'lady',\n",
" 'suit',\n",
" 'power',\n",
" 'family',\n",
" 'discovered']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.sample(a_words, 10)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['my',\n",
" 'of',\n",
" 'of',\n",
" 'and',\n",
" 'the',\n",
" 'minutes',\n",
" 'that',\n",
" 'tempest,',\n",
" 'fixed',\n",
" 'terms']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.sample(b_words, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code in the following cell uses Python's `Counter` object to count the *most common* characters (letters, spaces and so on) in the first of these texts:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[(' ', 113941),\n",
" ('e', 70344),\n",
" ('t', 47283),\n",
" ('a', 42156),\n",
" ('o', 41138),\n",
" ('n', 38430),\n",
" ('i', 36273),\n",
" ('h', 33883),\n",
" ('r', 33293),\n",
" ('s', 33292),\n",
" ('d', 22247),\n",
" ('l', 21282)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"Counter(text_a).most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specifying the `a_words` variable gives the most frequent *words* instead:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 4205),\n",
" ('to', 4121),\n",
" ('of', 3662),\n",
" ('and', 3309),\n",
" ('a', 1945),\n",
" ('her', 1858),\n",
" ('in', 1813),\n",
" ('was', 1795),\n",
" ('I', 1740),\n",
" ('that', 1419),\n",
" ('not', 1356),\n",
" ('she', 1306)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter(a_words).most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare these to the most common words in text B:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 4056),\n",
" ('and', 2971),\n",
" ('of', 2741),\n",
" ('I', 2719),\n",
" ('to', 2142),\n",
" ('my', 1631),\n",
" ('a', 1394),\n",
" ('in', 1125),\n",
" ('was', 993),\n",
" ('that', 987),\n",
" ('with', 700),\n",
" ('had', 679)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter(b_words).most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we're comparing here is called *unigram frequency*. (\"Unigram\" means a sequence of length one; more on this below.) For most English texts, the most frequent words in any given text will correspond closely to the most common words in English as a whole."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Frequency significance\n",
"\n",
"One thing you can do with unigram frequencies is figure out which words are *most distinctive* in a given text. This would allow you to do things like [extract keywords from a text](https://github.com/aparrish/rwet/blob/master/quick-and-dirty-keywords.ipynb), or make a simple classifier that could tell you, for some stretch of text, whether that text is more similar to a given text A or text B. The code in the cell below implements a simple \"distinctiveness\" check. If you give it the `Counter` objects for two texts (like the ones built from `a_words` and `b_words` above), it will give you a list of the words whose frequencies differ most between text A and text B.\n",
"\n",
"Don't worry about understanding this code—just execute it to make the functions available and then move on to the next cell. (For the curious, the code uses a [G-test](https://en.wikipedia.org/wiki/G-test) to determine significance.)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"from scipy.stats import chi2_contingency\n",
"\n",
"def compare_counts(a_count, b_count, count_threshold=1):\n",
"    a_total = sum(a_count.values())\n",
"    b_total = sum(b_count.values())\n",
"    sigs = []\n",
"    for k, v in a_count.items():\n",
"        if v <= count_threshold:\n",
"            continue\n",
"        sigs.append(\n",
"            (k,\n",
"             chi2_contingency([[v, a_total], [b_count[k], b_total]], lambda_=0)[0],\n",
"             \"a\" if v > b_count[k] else \"b\"))\n",
"    sigs.sort(key=lambda x: x[1], reverse=True)\n",
"    return sigs\n",
"\n",
"def count_report(a_count, b_count, n=10, glue=\"\"):\n",
"    compared = compare_counts(a_count, b_count)\n",
"    print(\"most significant\")\n",
"    print(\"----------------\")\n",
"    for item in compared[:n]:\n",
"        print(glue.join(item[0]), \" (text \", item[2], \")\", sep=\"\")\n",
"    print()\n",
"    print(\"least significant\")\n",
"    print(\"-----------------\")\n",
"    for item in compared[-n:]:\n",
"        print(glue.join(item[0]), \" (text \", item[2], \")\", sep=\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the cell below will now give you the list of the most significant and least significant words in terms of their frequency in the given texts. The \"most significant\" words are the words that are most distinctive to either text. The \"least significant\" are words that occur at more or less the same frequency in both texts, and which therefore can't be considered distinctive."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"most significant\n",
"----------------\n",
"my (text b)\n",
"I (text b)\n",
"Mr. (text a)\n",
"her (text a)\n",
"she (text a)\n",
"the (text a)\n",
"Mrs. (text a)\n",
"me (text b)\n",
"Miss (text a)\n",
"Darcy (text a)\n",
"and (text a)\n",
"Elizabeth (text a)\n",
"though (text a)\n",
"be (text a)\n",
"very (text a)\n",
"\n",
"least significant\n",
"-----------------\n",
"sentiment (text a)\n",
"away (text a)\n",
"care (text a)\n",
"once (text a)\n",
"In (text a)\n",
"At (text a)\n",
"should (text a)\n",
"information (text a)\n",
"All (text a)\n",
"open (text a)\n",
"stood (text a)\n",
"society (text a)\n",
"you, (text a)\n",
"understand (text a)\n",
"sat (text a)\n"
]
}
],
"source": [
"count_report(Counter(a_words), Counter(b_words), 15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## N-grams\n",
"\n",
"Unigrams are great, but there's only so much they can tell us about a text, since a unigram-based analysis throws out all information about *the order* in which words occur. So the first kind of text analysis that we’ll look at today is an n-gram model. An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. For convenience, we'll call the unit of the n-gram its *level*; the length of the n-gram is called its *order*. For example, the following is a list of all unique character-level order-2 n-grams in the word `condescendences`:\n",
"\n",
"    co\n",
"    on\n",
"    nd\n",
"    de\n",
"    es\n",
"    sc\n",
"    ce\n",
"    en\n",
"    nc\n",
"\n",
"And the following is an excerpt from the list of all unique word-level order-5 n-grams in *The Road Not Taken*:\n",
"\n",
"    Two roads diverged in a\n",
"    roads diverged in a yellow\n",
"    diverged in a yellow wood,\n",
"    in a yellow wood, And\n",
"    a yellow wood, And sorry\n",
"    yellow wood, And sorry I\n",
"\n",
"N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to stylometrics to generative text. They can be used as the basis of a Markov chain algorithm, and in fact that’s one of the applications we’ll be using them for later in this lesson.\n",
"\n",
"### Finding and counting word pairs\n",
"\n",
"So how would we go about writing Python code to find n-grams? We'll start with a simple task: finding *word pairs* in a text. A word pair is essentially a word-level order-2 n-gram; once we have code to find word pairs, we’ll generalize it to handle n-grams of any order.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"def ngrams_for_sequence(n, seq):\n",
"    return [tuple(seq[i:i+n]) for i in range(len(seq)-n+1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this function, here are random character-level n-grams of order 9 from text A:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[(' ', 'C', 'o', 'l', 'l', 'i', 'n', 's', ' '),\n",
" ('v', 'e', '.', ' ', 'T', 'h', 'e', ' ', 'w'),\n",
" ('t', '\\n', 'w', 'a', 's', ' ', 'a', 'n', ' '),\n",
" ('t', ',', ' ', 'u', 'n', 'a', 'b', 'l', 'e'),\n",
" ('v', 'e', 'n', 't', ' ', 'h', 'i', 's', ' '),\n",
" ('l', ' ', 'l', 'i', 'g', 'h', 't', '\\n', 'm'),\n",
" (' ', 'h', 'e', 'r', ' ', 'a', 's', '\\n', 'h'),\n",
" ('b', 'e', 'l', 'i', 'e', 'v', 'e', ' ', 'm'),\n",
" ('e', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 's'),\n",
" ('c', 'o', 'm', 'p', 'r', 'e', 'h', 'e', 'n')]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import random\n",
"a_9grams = ngrams_for_sequence(9, text_a)\n",
"random.sample(a_9grams, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or ten random word-level 5-grams from text B:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('to', 'state', 'those', 'facts', 'which'),\n",
" ('for', 'my', 'letters', 'with', 'feverish'),\n",
" ('me', 'inexpressible', 'pleasure.', 'But', 'a'),\n",
" ('upright', 'in', 'it.', 'No', 'wood,'),\n",
" ('the', 'pursuit', 'of', 'knowledge', 'is'),\n",
" ('presented', 'a', 'thousand', 'objects', 'that'),\n",
" ('become', 'acquainted', 'with', 'any', 'of'),\n",
" ('and', 'their', 'idol,', 'and', 'something'),\n",
" ('still', 'proceed', 'over', 'the', 'untamed'),\n",
" ('be', 'friendless', 'is', 'indeed', 'to')]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b_word_5grams = ngrams_for_sequence(5, text_b.split())\n",
"random.sample(b_word_5grams, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the bigrams (n-grams of order 2) from the string `condescendences`, including repeats:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('c', 'o'),\n",
" ('o', 'n'),\n",
" ('n', 'd'),\n",
" ('d', 'e'),\n",
" ('e', 's'),\n",
" ('s', 'c'),\n",
" ('c', 'e'),\n",
" ('e', 'n'),\n",
" ('n', 'd'),\n",
" ('d', 'e'),\n",
" ('e', 'n'),\n",
" ('n', 'c'),\n",
" ('c', 'e'),\n",
" ('e', 's')]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ngrams_for_sequence(2, \"condescendences\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And of course we can use it in conjunction with a `Counter` to find the most common n-grams in a text:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from collections import Counter\n",
"a_count = Counter(ngrams_for_sequence(3, text_a.split()))\n",
"b_count = Counter(ngrams_for_sequence(3, text_b.split()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common 3-grams from text A:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[(('as', 'soon', 'as'), 45),\n",
" (('she', 'could', 'not'), 42),\n",
" (('I', 'do', 'not'), 39),\n",
" (('that', 'he', 'had'), 35),\n",
" (('I', 'am', 'sure'), 35),\n",
" (('could', 'not', 'be'), 30),\n",
" (('it', 'would', 'be'), 28),\n",
" (('as', 'well', 'as'), 27),\n",
" (('by', 'no', 'means'), 26),\n",
" (('would', 'have', 'been'), 26),\n",
" (('that', 'it', 'was'), 25),\n",
" (('one', 'of', 'the'), 25)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a_count.most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and the most common 3-grams from text B:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[(('which', 'I', 'had'), 38),\n",
" (('I', 'could', 'not'), 31),\n",
" (('I', 'did', 'not'), 31),\n",
" (('that', 'I', 'had'), 25),\n",
" (('that', 'I', 'was'), 22),\n",
" (('me,', 'and', 'I'), 19),\n",
" (('that', 'I', 'might'), 18),\n",
" (('Project', 'Gutenberg-tm', 'electronic'), 18),\n",
" (('that', 'I', 'should'), 17),\n",
" (('I', 'do', 'not'), 16),\n",
" (('the', 'Project', 'Gutenberg'), 15),\n",
" (('one', 'of', 'the'), 15)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b_count.most_common(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same statistical significance test we used on unigrams can also give us the most distinctive 3-grams:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"most significant\n",
"----------------\n",
"she could not (text a)\n",
"I am sure (text a)\n",
"me, and I (text b)\n",
"as soon as (text a)\n",
"I could not (text b)\n",
"Mrs. Bennet was (text a)\n",
"by no means (text a)\n",
"could not be (text a)\n",
"Mr. and Mrs. (text a)\n",
"that I had (text b)\n",
"that I should (text b)\n",
"that she had (text a)\n",
"\n",
"least significant\n",
"-----------------\n",
"when they were (text a)\n",
"if you are (text a)\n",
"soon as I (text a)\n",
"it is the (text a)\n",
"of the day (text a)\n",
"the particulars of (text a)\n",
"the loss of (text a)\n",
"a mixture of (text a)\n",
"it was to (text a)\n",
"but there was (text a)\n",
"was able to (text a)\n",
"But it was (text a)\n"
]
}
],
"source": [
"count_report(a_count, b_count, 12, \" \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Markov models: what comes next?\n",
"\n",
"Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what is most likely to come next?\n",
"\n",
"We can imagine the kind of algorithm we’ll need to extract this information from the text. It will look very similar to the code to find n-grams above, but it will need to keep track not just of the n-grams but also a list of all units (word, character, whatever) that *follow* those n-grams.\n",
"\n",
"Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:\n",
"\n",
"| n-grams |\tnext? |\n",
"| ------- | ----- |\n",
"|co| n|\n",
"|on| d|\n",
"|nd| e, e|\n",
"|de| s, n|\n",
"|es| c, (end of text)|\n",
"|sc| e|\n",
"|ce| n, s|\n",
"|en| d, c|\n",
"|nc| e|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this table, we can determine that while the n-gram `co` is followed by `n` 100% of the time, and while the n-gram `on` is followed by `d` 100% of the time, the n-gram `de` is followed by `s` 50% of the time, and `n` the rest of the time. Likewise, the n-gram `es` is followed by `c` 50% of the time, and followed by the end of the text the other 50% of the time.\n",
"\n",
"Exercise: Imagine (or even better, write out) what this table might look like if you were analyzing words instead of characters, with a source text of your choice."
]
},
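{
"cell_type": "markdown",
"metadata": {},
"source": [
"The by-hand table for `condescendences` can also be built with a few lines of Python. The cell below is only a sketch and is not how Markovify stores its model: `build_model` and `con_table` are names made up for this illustration, and `None` stands in for \"end of text\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"def build_model(n, seq):\n",
"    # map each n-gram (as a tuple) to the list of units that follow it\n",
"    model = defaultdict(list)\n",
"    for i in range(len(seq) - n + 1):\n",
"        # None marks the case where the n-gram sits at the very end of the text\n",
"        following = seq[i+n] if i + n < len(seq) else None\n",
"        model[tuple(seq[i:i+n])].append(following)\n",
"    return dict(model)\n",
"\n",
"con_table = build_model(2, \"condescendences\")\n",
"for gram, following in con_table.items():\n",
"    print(\"\".join(gram), following)"
]
},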
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Markov chains: Generating text from a Markov model\n",
"\n",
"The Markov model we created above doesn't just give us interesting statistical probabilities. It also allows us to generate a *new* text with those probabilities by *chaining together predictions*. Here’s how we’ll do it, starting with the order 2 character-level Markov model of `condescendences`: (1) start with the initial n-gram (`co`); those are the first two characters of our output. (2) Now, look at the last *n* characters of output, where *n* is the order of the n-grams in our table, and find those characters in the “n-grams” column. (3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with `co`, there’s only one possibility). (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:\n",
"\n",
"    co\n",
"    con\n",
"    cond\n",
"    conde\n",
"    conden\n",
"    condend\n",
"    condende\n",
"    condendes\n",
"    condendesc\n",
"    condendesce\n",
"    condendesces\n",
" \n",
"As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input. This kind of algorithm—moving from one state to the next, according to a list of probabilities—is known as a Markov chain generator.\n",
"\n",
"### Generating with Markovify\n",
"\n",
"Fortunately, with the invention of digital computers, you don't have to perform this algorithm by hand! In fact, Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. It comes with a lot of extra niceties that will make our lives easier, but underneath the hood, it implements an algorithm very similar to the one we just did by hand above.\n",
"\n",
"To install Markovify on your computer, run the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: markovify in /Users/allison/anaconda/lib/python3.6/site-packages\n",
"Requirement already satisfied: unidecode in /Users/allison/anaconda/lib/python3.6/site-packages (from markovify)\n",
"\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install markovify"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then run this cell to make the library available in your notebook:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"import markovify"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `generator_a`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"generator_a = markovify.Text(text_a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can then call the `.make_sentence()` method to generate a sentence from the model:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mr. Darcy could not have happened; but poor consolation to think our friend mercenary.”\n"
]
}
],
"source": [
"print(generator_a.make_sentence())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I cannot consider your daughters.\n"
]
}
],
"source": [
"print(generator_a.make_short_sentence(50))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the `tries` parameter:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"“But why should _we_?”\n"
]
}
],
"source": [
"print(generator_a.make_short_sentence(40, tries=100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or by disabling the check altogether with `test_output=False`:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"It could not mention before me.”\n"
]
}
],
"source": [
"print(generator_a.make_short_sentence(40, test_output=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Changing the order\n",
"\n",
"When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two models with different orders and compare:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"gen_a_1 = markovify.Text(text_a, state_size=1)\n",
"gen_a_4 = markovify.Text(text_a, state_size=4)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"order 1\n",
"“What a countenance, such an explanation of each with a ring at least it would scarcely needed an attachment.\n",
"\n",
"order 4\n",
"She wrote cheerfully, seemed surrounded with comforts, and mentioned nothing which she could not have formed a very pleasing opinion of conjugal felicity or domestic comfort.\n"
]
}
],
"source": [
"print(\"order 1\")\n",
"print(gen_a_1.make_sentence(test_output=False))\n",
"print()\n",
"print(\"order 4\")\n",
"print(gen_a_4.make_sentence(test_output=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, the higher the order, the more the sentences will seem \"coherent\" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Changing the level\n",
"\n",
"Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"class SentencesByChar(markovify.Text):\n",
" def word_split(self, sentence):\n",
" return list(sentence)\n",
" def word_join(self, words):\n",
" return \"\".join(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.\n",
"\n",
"The following cell implements a character-level Markov text generator for the word \"condescendences\":"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"con_model = SentencesByChar(\"condescendences\", state_size=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the cell below to see the output—it'll be a lot like what we implemented by hand earlier!"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'condencescencesces'"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"con_model.make_sentence()"
]
},
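{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the by-hand procedure from earlier can be sketched in plain Python, without Markovify. This is an illustration only: `build_model`, `generate`, and the `max_units` cap are made up for this sketch (the cap just guards against very long outputs), and `None` marks \"end of text\". Unlike Markovify, the sketch makes no attempt to check whether its output differs from the source text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from collections import defaultdict\n",
"\n",
"def build_model(n, seq):\n",
"    # map each n-gram (as a tuple) to the list of units that follow it\n",
"    model = defaultdict(list)\n",
"    for i in range(len(seq) - n + 1):\n",
"        following = seq[i+n] if i + n < len(seq) else None\n",
"        model[tuple(seq[i:i+n])].append(following)\n",
"    return dict(model)\n",
"\n",
"def generate(model, start, max_units=100):\n",
"    n = len(start)\n",
"    output = list(start)\n",
"    while len(output) < max_units:\n",
"        # look up the last n units and pick one of their recorded followers\n",
"        choice = random.choice(model[tuple(output[-n:])])\n",
"        if choice is None:\n",
"            break  # we drew \"end of text\"\n",
"        output.append(choice)\n",
"    return output\n",
"\n",
"con_table = build_model(2, \"condescendences\")\n",
"print(\"\".join(generate(con_table, \"co\")))"
]
},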
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from text A:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"gen_a_char = SentencesByChar(text_a, state_size=7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the cell below prints out a random sentence from this generator. (The `.replace()` is to get rid of any newline characters in the output.)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There, Mrs. Bennet one day.\n"
]
}
],
"source": [
"print(gen_a_char.make_sentence(test_output=False).replace(\"\\n\", \" \"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining models\n",
"\n",
"Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call `markovify.combine()` to combine them."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"generator_a = markovify.Text(text_a)\n",
"generator_b = markovify.Text(text_b)\n",
"combo = markovify.combine([generator_a, generator_b], [0.5, 0.5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bit of code `[0.5, 0.5]` controls the \"weights\" of the models, i.e., how much to emphasize the probabilities of any model. You can change this to suit your tastes. (E.g., if you want mostly text A with but a *soupçon* of text B, you would write `[0.9, 0.1]`. Try it!) \n",
"\n",
"Then you can create sentences using the combined model:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The hall, the dining-room, which fronted the lane, and were assisted in their native country.\n"
]
}
],
"source": [
"print(combo.make_sentence())"
]
},
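{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to picture what the weights do: for each n-gram, multiply the follower counts from each model by that model's weight, then add the results together. The cell below sketches that idea with made-up numbers. `combine_counts` and the example counts are hypothetical, and this is an assumption about the general approach rather than Markovify's actual code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def combine_counts(count_a, count_b, weight_a, weight_b):\n",
"    # scale each model's follower counts by its weight, then sum them\n",
"    combined = Counter()\n",
"    for follower, count in count_a.items():\n",
"        combined[follower] += count * weight_a\n",
"    for follower, count in count_b.items():\n",
"        combined[follower] += count * weight_b\n",
"    return combined\n",
"\n",
"# hypothetical counts of the words following one n-gram in each text\n",
"after_in_a = Counter({\"her\": 9, \"the\": 1})\n",
"after_in_b = Counter({\"my\": 8, \"the\": 2})\n",
"\n",
"# mostly text A, with a little of text B\n",
"combined = combine_counts(after_in_a, after_in_b, 0.9, 0.1)\n",
"print(combined.most_common())"
]
},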
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bringing it all together\n",
"\n",
"I've pre-written some code below to make it easy for you to experiment and produce output from Markovify. Just make adjustments to the values assigned to the variables in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"# change to \"word\" for a word-level model\n",
"level = \"char\"\n",
"# controls the length of the n-gram\n",
"order = 7\n",
"# controls the number of lines to output\n",
"output_n = 14\n",
"# weights between the models; text A first, text B second.\n",
"# if you want to completely exclude one model, set its corresponding value to 0\n",
"weights = [0.5, 0.5]\n",
"# limit sentence output to this number of characters\n",
"length_limit = 280"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The lines beginning with `#` are \"comments\"—they don't do anything, they're just there to explain what's happening in the code.)\n",
"\n",
"After making your changes above, run the cell below to generate text according to your parameters. Repeat as necessary until you get something you really like!"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"She resource, her minutes, you know, when my father to the convenience of natural hideous monsters that Lydia's for her too in its execution.”\n",
"\n",
"“Oh, yes.\n",
"\n",
"A man might lower, but followed myself a little; but I have more than his affectionate praise, and circumstance of his; she was “very glad to for the satisfied, though well-bred, were ever with the happiness.\n",
"\n",
"And with a bow to Mr. Darcy called from what patriot fell.\n",
"\n",
"I have still was all that he was.\n",
"\n",
"I felt that the length the while he was reflection for her.\n",
"\n",
"“After the felicity of his maker owe him not; call out of the morning without report; and there are so strange and trademark, my dear sister had been unfolded against him.\n",
"\n",
"He absolutely settled upon it.\n",
"\n",
"As the mockery and felt that _you_ are dancing with his daughters uncommonly well repaid my fair creation; I imagined; his concerning she been imposed.\n",
"\n",
"I have describe and praises my weak and fourth with eagerly desired information, till her grateful to write such a number the family in Geneva.\n",
"\n",
"Anything in my father, might have felt that remained first suppliant, and least of pity.\n",
"\n",
"These feelings, all her friends.”\n",
"\n",
"But again.\n",
"\n",
"Soon the opportunity of the place and spirits which Jane had sent abroad?”\n",
"\n"
]
}
],
"source": [
"model_cls = markovify.Text if level == \"word\" else SentencesByChar\n",
"gen_a = model_cls(text_a, state_size=order)\n",
"gen_b = model_cls(text_b, state_size=order)\n",
"gen_combo = markovify.combine([gen_a, gen_b], weights)\n",
"for i in range(output_n):\n",
" out = gen_combo.make_short_sentence(length_limit, test_output=False)\n",
" out = out.replace(\"\\n\", \" \")\n",
" print(out)\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Neural network text prediction with `textgenrnn`\n",
"\n",
"Like a [Markov chain](ngrams-and-markov-chains.ipynb), a recurrent neural network (RNN) is a way to make predictions about what will come next in a sequence. For our purposes, the sequence in question is a sequence of characters, and the prediction we want to make is *which character will come next*. Both Markov models and recurrent neural networks do this by using statistical properties of text to make a *probability distribution* for what character will come next, given some information about what comes before. The two procedures work very differently internally, and we're not going to go into the gory details about implementation here. (But if you're interested in the gory details, [here's a good place to start](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).) For our purposes, the main *functional* difference between a Markov chain and a recurrent neural network is the *portion* of the sequence used to make the prediction. A Markov model uses a fixed window of history from the sequence, while an RNN (theoretically) uses the *entire history* of the sequence.\n",
"\n",
"## Start with Markov\n",
"\n",
"To illustrate, here's that Markov model of the word \"condescendences.\" In a Markov model based on bigrams from this string of characters, you'd make a list of bigrams and the characters that follow those bigrams, like so:\n",
"\n",
"| n-grams |\tnext? |\n",
"| ------- | ----- |\n",
"| co | n |\n",
"| on | d |\n",
"| nd | e, e |\n",
"| de | s, n |\n",
"| es | c, (end of text) |\n",
"| sc | e |\n",
"| ce | n, s |\n",
"| en | d, c |\n",
"| nc | e |\n",
"\n",
"You could also write this as a probability distribution, with one column for each bigram. The value in each cell indicates the probability that the character following the bigram in a given row will be followed by the character in a given column:\n",
"\n",
"| n-grams | c | o | n | d | e | s | END |\n",
"| ------- | - | - | - | - | - | - | --- |\n",
"| co | 0 | 0 | 1.0 | 0 | 0 | 0 | 0 | \n",
"| on | 0 | 0 | 0 | 1.0 | 0 | 0 | 0 | \n",
"| nd | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 | \n",
"| de | 0 | 0 | 0.5 | 0 | 0 | 0.5 | 0 |\n",
"| es | 0.5 | 0 | 0 | 0 | 0 | 0 | 0.5 |\n",
"| sc | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 |\n",
"| ce | 0 | 0 | 0.5 | 0 | 0 | 0.5 | 0 |\n",
"| en | 0.5 | 0 | 0 | 0.5 | 0 | 0 | 0 |\n",
"| nc | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 |\n",
"\n",
"Each row of this table is a *probability distribution*, meaning that it shows how probable a given letter is to follow the n-gram in the original text. In a probability distribution, all of the values add up to 1.\n",
"\n",
"Fitting a Markov model to the data is a matter of looking at each sequence of characters in a given text, and updating the table of probability distributions accordingly. To make a prediction from this table, you can \"sample\" from the probability distribution for a given n-gram (i.e., sampling from the distribution for the bigram `de`, you'd have a 50% chance of picking `n` and a 50% chance of picking `s`).\n",
"\n",
"Another way of thinking about this Markov model is as a (hypothetical!) function *f* that takes a bigram as a parameter and returns a probability distribution for that bigram:\n",
"\n",
"    f(\"ce\") → [0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]\n",
" \n",
"(Note that the values at each index in this distribution line up with the columns in the table above.)\n",
" \n",
"The items in the list returned from this function correspond to the probability for the corresponding next character, as given in the table. To sample from this list, you'd pick randomly among the indices according to their probabilities, and then look up the corresponding character by its position in the table.\n",
"\n",
"To generate new text from this model:\n",
"\n",
"1. Set your output string to a randomly selected n-gram\n",
"2. Sample a letter from the probability distribution associated with the n-gram at the end of the output string\n",
"3. Append the sampled letter to the end of the string\n",
"4. Repeat from (2) until the END token is reached\n",
"\n",
"Of course, you don't write this function by hand! When you're creating a Markov model from your data (or \"training\" the model), you're essentially asking the computer to write this function *for you*. In this sense, a Markov model is a very simple kind of machine learning, since the computer \"learns\" the probability distribution from the data that you feed it.\n",
"\n",
"## A (very) simplified explanation of RNNs\n",
"\n",
"The mechanism by which a recurrent neural network \"learns\" probability distributions from sequences is much more sophisticated than the mechanism used in a Markov model, but functionally they're very similar: you give the computer some data to \"train\" on, and then ask it to automatically create a function that will return a probability distribution of what comes next, given some input. An RNN differs from a Markov chain in that to predict the next item in the sequence, you pass in *the entire sequence* instead of just the most recent n-gram.\n",
"\n",
"In other words, you can (again, hypothetically) think of an RNN as a way of automatically creating a function *f* that takes a sequence of characters of arbitrary length and returns a probability distribution for which character comes next in the sequence. Unlike a Markov chain, it's possible to *improve* the accuracy of the probability distribution returned from this function by training on the same data multiple times.\n",
"\n",
"Let's say that we want to train the RNN on the string \"condescendences\" to learn this function, and we want to make a prediction about what character is most likely to follow the sequence of characters \"condescendence\". When training a neural network, the process of learning a function like this works iteratively: you start off with a function that essentially gives a uniform probability distribution for each outcome (i.e., no one outcome is considered more likely than any other):\n",
"\n",
"    f(\"condescendence\") → [0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14] (after zero passes through the data)\n",
" \n",
"... and as you iterate over the training data (in this case, the word \"condescendences\"), the probability distribution gradually improves, ideally until it comes to accurately reflect the actual observed distribution (in the parlance, until it \"converges\"). After some number of passes through the data, you might expect the automatically-learned function to return a distribution like this:\n",
"\n",
"    f(\"condescendence\") → [0.01, 0.02, 0.01, 0.03, 0.01, 0.9, 0.02] (after n passes through the data)\n",
"\n",
"A single pass through the training data is called an \"epoch.\" When it comes to any neural network, and RNNs in particular, more epochs is almost always better.\n",
"\n",
"To generate text from this model:\n",
"\n",
"1. Initialize your output string to an empty string, or a random character, or a starting \"prefix\" that you specify;\n",
"2. Sample the next letter from the distribution returned for the current output string;\n",
"3. Append that character to the end of the output string;\n",
"4. Repeat from (2)\n",
"\n",
"Of course, in a real life application of both a Markov model and an RNN, you'd normally have more than seven items in the probability distribution! In fact, you'd have one element in the probability distribution for every possible character that occurs in the text. (Meaning that if there were 100 unique characters found in the text, the probability distribution would have 100 items in it.)\n",
"\n",
"## Markov chains vs RNNs \n",
"\n",
"The primary benefit of an RNN over a Markov model for text generation is that an RNN takes into account *the entire history* of a sequence when generating the next character. This means that, for example, an RNN can theoretically learn how to close quotes and parentheses, which a Markov chain will never be able to reliably do (at least for pairs of quotes and parentheses longer than the n-gram of the Markov chain).\n",
"\n",
"The drawback of RNNs is that they are *computationally expensive*, from both a processing and memory perspective. This is (again) a simplification, but internally, RNNs work by \"squishing\" information about the training data down into large matrices, and make predictions by performing calculations on these large matrices. That means that you need a lot of CPU and RAM to train an RNN, and the resulting models (when stored to disk) can be very large. Training an RNN also (usually) takes a lot of time.\n",
"\n",
"Another consideration is the size of your corpus. Markov models will give interesting and useful results even for very small datasets, but RNNs require large amounts of data to train—the more data the better.\n",
"\n",
"So what do you do if you *don't* have a very large corpus? Or if you don't have a lot of time to train on your corpus?\n",
"\n",
"## RNN generation from pre-trained models\n",
"\n",
"Fortunately for us, developer and data scientist [Max Woolf](https://github.com/minimaxir) has made a Python library called [textgenrnn](https://github.com/minimaxir/textgenrnn) that makes it really easy to experiment with RNN text generation. This library includes a model (according to the documentation) \"trained on hundreds of thousands of text documents, from Reddit submissions (via BigQuery) and Facebook Pages (via my Facebook Page Post Scraper), from a very diverse variety of subreddits/Pages,\" and allows you to use this model as a starting point for your own training.\n",
"\n",
"To install textgenrnn, you'll probably want to install [Keras](http://keras.io/) first. With Anaconda:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Solving environment: done\n",
"\n",
"# All requested packages already installed.\n",
"\n"
]
}
],
"source": [
"!conda install -y keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then install textgenrnn with `pip`:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already up-to-date: textgenrnn in /Users/allison/anaconda/lib/python3.6/site-packages\n",
"Requirement already up-to-date: h5py in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)\n",
"Requirement already up-to-date: scikit-learn in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)\n",
"Requirement already up-to-date: keras>=2.1.5 in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)\n",
"Requirement already up-to-date: numpy>=1.7 in /Users/allison/anaconda/lib/python3.6/site-packages (from h5py->textgenrnn)\n",
"Requirement already up-to-date: six in /Users/allison/anaconda/lib/python3.6/site-packages (from h5py->textgenrnn)\n",
"Requirement already up-to-date: keras-applications==1.0.2 in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)\n",
"Requirement already up-to-date: scipy>=0.14 in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)\n",
"Requirement already up-to-date: keras-preprocessing==1.0.1 in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)\n",
"Requirement already up-to-date: pyyaml in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)\n",
"\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install --upgrade textgenrnn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once it's installed, import the `textgenrnn` class from the package:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
}
],
"source": [
"from textgenrnn import textgenrnn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And create a new `textgenrnn` object like so. (The `name` parameter controls the filename used when automatically saving the model to disk, so pick something descriptive!)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"textgen = textgenrnn(name=\"text_a\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This object has a `.generate()` method which will, by default, generate text from the pre-trained model only."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The most of the story of sisters was a digital to be any adopting the first time and it was to wear one of the working on the side of the face of this picture of my games on its phone?\n",
"\n"
]
}
],
"source": [
"textgen.generate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `textgenrnn` library needs a data structure called a \"list of strings\" as its source text for training. We'll use Markovify's `split_into_sentences` method to turn our plain-text input files into lists of sentences like so:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"from markovify.splitters import split_into_sentences\n",
"text_a_sentences = split_into_sentences(text_a)\n",
"text_b_sentences = split_into_sentences(text_b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are five random sentences from both texts:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['But your\\n_family_ owe me nothing.',\n",
" 'Mr. Darcy bowed.',\n",
" \"Mrs. Reynolds informed them that it had been taken in his father's\\nlifetime.\",\n",
" '“I should not be surprised,” said Darcy, “if he were to give it up as\\nsoon as any eligible purchase offers.”',\n",
" 'I now give it to _you_, if you are resolved on\\nhaving him.']"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.sample(text_a_sentences, 5)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['“He did not succeed.',\n",
" 'I found that the\\nsparrow uttered none but harsh notes, whilst those of the blackbird and\\nthrush were sweet and enticing.',\n",
" 'I trod\\nheaven in my thoughts, now exulting in my powers, now burning with the idea\\nof their effects.',\n",
" 'The old man, I could perceive,\\noften endeavoured to encourage his children, as sometimes I found that\\nhe called them, to cast off their melancholy.',\n",
" 'As he said this his countenance became expressive of a calm, settled\\ngrief that touched me to the heart.']"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.sample(text_b_sentences, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To train a text generator on your own text, use the `.train_on_texts()` method, passing in a list of strings. The `num_epochs` parameter allows you to indicate how many epochs (i.e., passes over the data) should be performed. The more epochs the better, especially for shorter texts, but you'll get okay results even with just a few.\n",
"\n",
"Training a neural network usually takes a really long time! So it makes sense to \"try out\" a text before committing to the many hours it might take to train the network on the full text. The following example trains the neural network on just the first 100 lines from text A, which lets you get an idea of what the output will look like when training on its entire contents. You'll notice that the `train_on_texts()` function prints output as it goes, showing what the generated text is likely to look like."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training on 8,345 character sequences.\n",
"Epoch 1/3\n",
"65/65 [==============================] - 39s 597ms/step - loss: 2.1617\n",
"####################\n",
"Temperature: 0.2\n",
"####################\n",
"“I the time of the competuting of the own dear make it internot the how to take the time of the fortne and wife one of the mary to make the own the such and the own the how one of the woman of the woman introduce of the notese of the many own wife internet intended one of the sourh, that have a ma\n",
"\n",
"“I am anywhe that will be make it internet internet and will the notesself internet,” you have a compand of the wors of the worst to go one of the may to take the surright.”\n",
"\n",
"“I am anywhan the may will be make the may or inter the fortnet internet, and make that the may was such and the how of the such and the own compenting the serville intended the time of the day of the woman of the woman of the woman of the notesisd of the mary of the competungare of the day of the\n",
"\n",
"####################\n",
"Temperature: 0.5\n",
"####################\n",
"“I not will the time?”\n",
"\n",
"“Doley would do not one into of my disableth.”\n",
"\n",
"“Whith haindney have a servied to the wome “be Mr..Bennet comperany.”\n",
"\n",
"####################\n",
"Temperature: 1.0\n",
"####################\n",
"“What for bennet mr.\n",
"\n",
"“Bestne own resow; it has that that you fouthan wife yug wife, and the doug to make that you love machriod persion a moder.”\n",
"\n",
"“I a compare noisT he no losing onlying as or the argulation of the Ep. Bise vimerse look nos him of it, “got mr.\n",
"\n",
"Epoch 2/3\n",
"65/65 [==============================] - 33s 507ms/step - loss: 1.4411\n",
"####################\n",
"Temperature: 0.2\n",
"####################\n",
"“I see the surrounding of the sallicling in some see and may have a sure it is not so not be acquainted the serving of the serving the same is has a such a sure it are much to say the surrounding it and I am so have may a such an and see you take the surrounding of the something and the surrey is \n",
"\n",
"“I am such a surrounding Mr. Bingley and the surrunnogg and have not said that it anywhere and I am so have mean that it is not sure the surrounding of the same and I am such a surrey in the such a man of the sales.\n",
"\n",
"“I am such a surrey and I am such a serving the surrey and I am and so any of them, and may so she is not have a sure it is and I am such a surrey it and I am a sourh of my dear, you must be is about when I am the surrounding of his married is a sure it and will be the serving the surrunnoghomous \n",
"\n",
"####################\n",
"Temperature: 0.5\n",
"####################\n",
"“I am no not know you much, and she such a surround Mrs.\n",
"\n",
"“I am not go any of my dear, my dear, and I have a nerves.\n",
"\n",
"“I assue me for the mary in my single man or may be may acquist.”\n",
"\n",
"####################\n",
"Temperature: 1.0\n",
"####################\n",
"No tere you cop know the off to be Mr. ! “any,ferus, and a know.\n",
"\n",
"“How can you he setlying a ferthe add I mean Banners see send and eveenderservation henly to fercow thing, sailbreagg st comes next to would me.\n",
"\n",
"She deplanced; preind this mind are who she died Prides of the hears of mind, and we_rensersenger statua.”\n",
"\n",
"Epoch 3/3\n",
"65/65 [==============================] - 31s 482ms/step - loss: 1.1898\n",
"####################\n",
"Temperature: 0.2\n",
"####################\n",
"“I am so hard to see the surrounding in my dear Mr. Bennet was see you and she was a surrund of the for his many of my dear, you may be a surrey in the for them, you must be a mass and I am so hard to see the surrounding in the fortune of the surrounding in a shashuberson and the surrounding of th\n",
"\n",
"“I am a she are a few of the forthorries, that he should may have a such a surcemperage of the surrounding of the surrunnowast in the for them.\n",
"\n",
"“I have a servence server her sure it is a surrey to see the surrounding of the surrounding of the surrounding of the surrounding of the surrund of the surround of the surround of the surround of the day and I am such a surrey to be must anyone on my dear, you and she are any of them, that he said\n",
"\n",
"####################\n",
"Temperature: 0.5\n",
"####################\n",
"“I no send some see my daughters.\n",
"\n",
"“My dear Mr. Bennet is not will be seen to the howadatherow for the forthorries of them of her knight in a wife on inforting them.”\n",
"\n",
"“Dope a you see you assure it in the Mr. Bennet as in schology,” know her own disconderson, and a such an Mr. Bennet, that the own married; the beging of the end of the same for your betwas will be in them of single on my daughters.\n",
"\n",
"####################\n",
"Temperature: 1.0\n",
"####################\n",
"“Waken and I be none liked.\n",
"\n",
"Marfeicher_tw_ novia compenture, what aspeds “everdiand that at Kittle vasment, when much perce, threat vr_agh, _2elly on for intended freethefight I own I am not emp each lit how week for them, where, this.\n",
"\n",
"“contwast them, feensind with; you do only sums they, and do novound him, shew siling mys wine of my cousing child.\n",
"\n"
]
}
],
"source": [
"textgen.train_on_texts(text_a_sentences[:100], num_epochs=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After training, you can generate new text using the `.generate()` method again:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"“What do you may go; that he may be mind it all of the surrounding and I am so sure that you him dead them as married; “sen a should but I conservent upon marrying the man and she will go before!\n",
"\n"
]
}
],
"source": [
"textgen.generate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results aren't very interesting because by default the generator is very conservative in how it samples from the probability distribution. You can use the `temperature` parameter to make the sampling a bit more likely to pick improbable outcomes. The higher the value, the weirder the results. The default is 0.2, and going above 1.0 is likely to produce unacceptably strange results:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bennet, so you shoul not acquise her that a few that the same is will be must the such a man of offired in a base of the same, that I am so when I comes will be must be into a little his consider and the pacefomms delight it will be you as “singed to make a let inters.”\n",
"\n"
]
}
],
"source": [
"textgen.generate(temperature=0.5)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"“What do you feeling you must all with!\n",
"\n"
]
}
],
"source": [
"textgen.generate(temperature=0.9)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"“My pc; ifew nightit relengingfockedbed, af.\n",
"\n"
]
}
],
"source": [
"textgen.generate(temperature=1.5)"
]
},
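{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why temperature has this effect, here's a minimal sketch of the usual technique: divide the log of each probability by the temperature, then renormalize so the values add up to 1 again. This illustrates the general idea only and isn't taken from textgenrnn's source code; the function names are hypothetical, and the code assumes every probability is greater than zero.\n",
"\n",
"```python\n",
"import math\n",
"import random\n",
"\n",
"def apply_temperature(probs, temperature):\n",
"    # Rescale log-probabilities by 1/temperature, then renormalize.\n",
"    scaled = [math.log(p) / temperature for p in probs]\n",
"    exps = [math.exp(s) for s in scaled]\n",
"    total = sum(exps)\n",
"    return [e / total for e in exps]\n",
"\n",
"def sample(probs, temperature=1.0):\n",
"    # Pick an index at random according to the adjusted probabilities.\n",
"    adjusted = apply_temperature(probs, temperature)\n",
"    return random.choices(range(len(probs)), weights=adjusted)[0]\n",
"\n",
"# The hypothetical distribution from earlier: c, o, n, d, e, s, END\n",
"probs = [0.01, 0.02, 0.01, 0.03, 0.01, 0.9, 0.02]\n",
"for t in [0.2, 0.5, 1.0, 1.5]:\n",
"    print(t, [round(p, 3) for p in apply_temperature(probs, t)])\n",
"```\n",
"\n",
"At a temperature of 1.0 the distribution comes back unchanged. Lower values pile even more probability onto `s`, and higher values spread it out across the unlikely characters, which is where the weirdness comes from."
]
},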
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you pass a number *n* to the `.generate()` method as its first parameter, `.generate()` will print out *n* instances of text generation from the model. The code in the following cell prints out ten examples from the specified temperature:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You may fought her for one of them, you must do not sure what a man detils that have the same; the them of my daughters, the mary in the asus, that in the hate of before?\n",
"\n",
"“I do not for your surrung to make may be the time of me, Mr. Bennet, “the dear that so adjusting the four Mr. Bingley has seen the sort in the party.”\n",
"\n",
"“It is not you be said _fine _.\n",
"\n",
"“What can is see you must the most them?”\n",
"\n",
"“But I am sure for my daughters.\n",
"\n",
"“A shell has a new for my pic a surchabers.\n",
"\n",
"“I am sure if no returned to have see you much to said her dear Jane to see the man of marrying that he comes in them on his wife what I am a part the littler of the new woman for her man introduce that would so for my daughters what I be for a surrey it, I mean thinking them.\n",
"\n",
"“But the times indest and see you all little advised on the own bennet, or what she was been a mass as that it are you ampressed it affect to know and what a surrund her was a surround the visit, and the respost of a surrey to be more of my dear, you must see you to get _theaks and what a knowleth\n",
"\n",
"How can you have a conservating daughters.\n",
"\n",
"“I do not know him.\n",
"\n"
]
}
],
"source": [
"textgen.generate(10, temperature=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(This may take a little while.)\n",
"\n",
"When you're satisfied with the results and you're ready to train on all of the sentences, just remove the `[:100]` from the call to `.train_on_texts()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"textgen.train_on_texts(text_a_sentences, num_epochs=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The textgenrnn library automatically saves the model to disk after each epoch in the same directory as this notebook. You can load a model you've previously trained by passing its filename to the `textgenrnn` function:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"textgen = textgenrnn(\"text_a_weights.hdf5\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then you can call the `.generate()` method as normal:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"“You have the dear Mrs.\n",
"\n",
"“I am fortune forthned or that I make it will be the nerves.”\n",
"\n",
"“My fried in some news.\n",
"\n",
"“Why did not such a marry in the introduce of the day.”\n",
"\n",
"“Deleagly that it use to a few of my dear his monthord forthorress need and not may said you may have a mass her cried of the visit imponsire advise the store of them.\n",
"\n",
"“But the compand of themselves have her theres her not one of them of her fining her daughters.\n",
"\n",
"“It was see with you hear that you and make in her friend to some adsently to be will do not of her use of the nerves.\n",
"\n",
"“I have cliefer that you have my consider forthorries that he should hear them to make it about them.”\n",
"\n",
"“Derucate have a serving to taken the best forthorrowing him of Mr. Bingley was or it in sourh, that he sure that he has to so be for the empt and a man for your have my daughters when has no will like the daughters.\n",
"\n",
"_Houghthere and will be you that was the them.\n",
"\n"
]
}
],
"source": [
"textgen.generate(10, temperature=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating with shorter texts\n",
"\n",
"I've found that `textgenrnn` works especially well with very short, word-length texts. For example, download [this file of human moods](https://github.com/dariusk/corpora/blob/master/data/humans/moods.json) from Corpora Project, and put it in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then load the JSON file and grab just the list of words naming moods:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"mood_data = json.loads(open(\"./moods.json\").read())\n",
"moods = mood_data['moods']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And create another textgenrnn object:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"mood_gen = textgenrnn(name=\"moods\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, train the RNN on these moods. One epoch will do the trick:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training on 6,651 character sequences.\n",
"Epoch 1/1\n",
"51/51 [==============================] - 31s 607ms/step - loss: 2.2394\n",
"####################\n",
"Temperature: 0.2\n",
"####################\n",
"intatted\n",
"\n",
"contrent\n",
"\n",
"reserted\n",
"\n",
"####################\n",
"Temperature: 0.5\n",
"####################\n",
"disnarned\n",
"\n",
"piped\n",
"\n",
"noest\n",
"\n",
"####################\n",
"Temperature: 1.0\n",
"####################\n",
"shmedians\n",
"\n",
"ermssty innockuistic\n",
"\n",
"teguis\n",
"\n"
]
}
],
"source": [
"mood_gen.train_on_texts(moods, num_epochs=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now generate a list of new moods:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"deantenic\n",
"\n",
"resertly\n",
"\n",
"accorruative\n",
"\n",
"taped\n",
"\n",
"asteriated\n",
"\n",
"abled\n",
"\n",
"distressed\n",
"\n",
"expload\n",
"\n",
"patenced\n",
"\n",
"resuped\n",
"\n",
"intranted\n",
"\n",
"unbullet\n",
"\n",
"disafted\n",
"\n",
"adventy\n",
"\n",
"enthely\n",
"\n",
"discarned\n",
"\n",
"dislansed\n",
"\n",
"cured\n",
"\n",
"contive\n",
"\n",
"samiden\n",
"\n",
"enforrems\n",
"\n",
"leaged\n",
"\n",
"justened\n",
"\n",
"devellated\n",
"\n",
"wristent\n",
"\n"
]
}
],
"source": [
"mood_gen.generate(25, temperature=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further reading\n",
"\n",
"* [This notebook from the creator of textgenrnn](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) covers everything about the library that I covered in this tutorial—and much more, including how to start generation from a particular \"seed\" and how to save and load models (useful if you spent an afternoon training a model on your own corpus and don't want to have to do it again!)\n",
"* Take a look at [Janelle Shane's wonderful overview of how she uses RNNs in her process](http://aiweirdness.com/faq). And then take a look at her [wonderful creative work with RNNs](http://aiweirdness.com/).\n",
"* Hayes, Brian. “Computer recreations.” Scientific American, vol. 249, no. 5, 1983, pp. 18–31. JSTOR, http://www.jstor.org/stable/24969024. (Original column from Scientific American that described how Markov chain text generation works—very readable! I can send a PDF, hit me up.)\n",
"* [A Travesty Generator for Micros](https://elmcip.net/critical-writing/travesty-generator-micros) is a follow-up to Hayes' article that has some more theory and an actual Pascal listing (which is now mostly of only historical interest).\n",
"* [This notebook](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) shows how to implement a Markov chain generator from scratch in Python, if you're interested in such things!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}