@tra38
Last active May 18, 2020 15:04
Slaying the Great White Whale...um, I mean Novel

August

Introduction

From a 2017 comment:

Usual critiques of story generators tend to be some variant of "Yes, you can produce short evocative text snippets, but without structure, you can't scale your generator upwards". It seems approaches that require planning and outlines (story compilers, simulations, etc.) are an inverse of that, providing the structure to scale upwards, but are unable to generate the short evocative text snippets on their own.

This may not actually be a problem to worry about though - a human probably needs to be in the loop somewhere (at least, if the story is to be enjoyable or useful to other humans).

But if it does become an issue, maybe we need to have two generators -- use a Markov chain or RNNs to generate short evocative paragraphs for individual topics, and then a large-scale planner to pick paragraphs from those individual topics.

No, I'm not going to build a Markov chain or an RNN to generate short evocative paragraphs. What I'm going to do instead is develop an AI-Driven Pipeline™ to harvest human-readable text from the Internet (mostly en.wikiquote.com, stackexchange.com, and dariusk's corpora), and use that human-readable text to generate the evocative paragraphs. Then we just use a large-scale planner (like the "Track Method") to assemble those evocative paragraphs into a novel. The large-scale planner has proven to work for >10K words, so the emphasis is on getting the AI-Driven Pipeline™ working. If it fails, everything else fails along with it.

The Pipeline

To be honest, the AI-Driven Pipeline™ should just be called a "pipeline"...and it involves the "human element" to a significant degree. You may look at a flowchart of the Pipeline here (or look at the Mermaid Live Editor version of the flowchart), but it really consists of three steps...

  1. Getting the corpus from sites like en.wikiquote.com and the StackExchange network.
  2. Splitting the corpus up into evocative paragraphs.
  3. Sending it off to the Story Compiler.

Getting the Corpus (or the use of Modern Works)

My approach is very similar to what Isaac Karth did in his Pirate Novel entry - searching Gutenberg for evocative quotes that he could wind up using, selecting those quotes manually, and plugging them into the large-scale planner. I was also able to use Gutenberg successfully to find certain evocative paragraphs that could be plugged into the Track Method.

However, the problem with Gutenberg is that the public-domain text is fairly old and outdated, which makes it hard for modern readers to parse. This is bad for development (as we have to interpret old texts), and it is bad in production (as readers also need to spend time deciphering it). The publisher who read my computer-generated novel described the experience as "reading Moby Dick...for fun", which isn't exactly high praise.

What would be nice is text that shares our modern understanding of the world. Then we could reuse that text instead. Development would be quicker, and the final reading experience would be more enjoyable. Reusing 19th century Victorian novels is nice, but we would rather reuse "modern works" instead.

We could, of course, handwrite the templates and evocative paragraphs...thereby generating our own modern works that we can plug into the machine. But handwriting stuff doesn't scale effectively (as any human writer can tell you). What we want to do is to reuse other people's modern works.

But you can't just copy and paste other people's modern works into your novel, as the NaNoGenMo README helpfully advises:

Please try to respect copyright. We're not going to police it, as ultimately it's on your head if you want to just copy/paste a Stephen King novel or whatever, but the most useful/interesting implementations are going to be ones that don't engender lawsuits.

en.wikiquote.com and the StackExchange network, however, adopt a copyleft approach to text. All their text is licensed under CC-BY-SA - meaning that one can freely share and adapt the text, so long as one gives attribution and grants the same rights to other people. This is somewhat problematic for code, but for text, this is incredibly awesome. We can use coherent, modern texts in our own works. All we have to do is provide the attribution and license the final work under CC-BY-SA ourselves.

Wikiquote is a special case though as it's a collection of quotes, and quotes themselves can fall under copyright law. While you can use quotes under a "fair-use" doctrine, it can get dicey, especially if you use too many quotes from a single source. You can read more about copyright policies regarding quotes here, but I think that I'll be okay so long as I don't limit myself to copying from a single "work" (example: copying all my quotes from one single book). Instead, I should use Wikiquote to find various quotes from multiple different works (either based on the same topic or from the same author).

As a side-note, I'm more in favor of permissive licenses (like CC0 or CC-BY). However, since corpus collection is hard, I'll bite the bullet and use the copyleft corpuses for now. In any event, permissive licenses can be relicensed to CC-BY-SA.

Note that while the flowchart doesn't mention it, the human programmer is expected to rewrite or add additional text to the gathered corpus to some degree. I call this approach "glue text", since it's being used to 'glue' the selected paragraph to everything else. I also expect to handwrite the introduction and conclusion of the novel.

The StackExchange network will be incredibly useful for me because one of their sites is "worldbuilding.stackexchange.com". This means that it is very easy to "build" up plausible and engaging settings simply by copying and pasting worldbuilding ideas from that site. The catch is that it will take some time to adapt the online content into something more suitable for a novel.

Splitting the Text

Some of the text that would be gathered will be treated as "major scenes" that will play a role in the central narrative. The remaining text will be used as quotes that the characters in the novel will wind up speaking. Here's a very micro-zoomed-in version of the story outline:

  • Major Narrative #1 Starts
  • Character #1 Talks
  • Major Narrative #1 Continues
  • Character #2 Talks
  • Major Narrative #1 Ends
  • Character #3 Talks
  • Major Narrative #2 Starts
  • ...
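The alternation above can be sketched by zipping narrative beats with character quotes. This is just a structural sketch - the placeholder strings stand in for the harvested corpus text:

```ruby
# Placeholder beats and quotes standing in for harvested corpus text.
beats  = ["Major Narrative #1 Starts", "Major Narrative #1 Continues", "Major Narrative #1 Ends"]
quotes = ["Character #1 Talks", "Character #2 Talks", "Character #3 Talks"]

# zip pairs each beat with a quote; flatten interleaves them in order,
# and compact drops nils if the lists are of unequal length.
outline = beats.zip(quotes).flatten.compact
outline.each { |line| puts line }
```

This reproduces the outline exactly: beat, quote, beat, quote, and so on.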

Now, you can't just have Character #1, Character #2, and Character #3 pull from the same corpus of quotes! They're different characters, with different ideas and beliefs about the world. Their quotes should reflect that.

That's where the AI part of the AI-Driven Pipeline™ comes in. I'll use a Bayesian classifier to split up the corpus of quotes into three different corpuses. This allows me to process the corpus much faster.

This of course requires me to do some data labeling up-front, as I find some evocative quotes and manually assign them to Character #1, Character #2, and Character #3. But once I provide enough data points, I can let the Bayesian classifier take over.
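A minimal sketch of such a classifier is below - a naive Bayes over word counts with add-one smoothing. The training quotes are hypothetical stand-ins for the manually labeled data described above (only two characters shown for brevity; the real system would have three):

```ruby
# Minimal naive Bayes text classifier. Training quotes are hypothetical
# placeholders, not the actual labeled corpus.
class QuoteClassifier
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    @doc_counts  = Hash.new(0)
    @vocab       = {}
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |w|
      @word_counts[label][w] += 1
      @vocab[w] = true
    end
  end

  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      # Log prior plus log likelihood with add-one (Laplace) smoothing
      score = Math.log(@doc_counts[label] / total_docs)
      words.each do |w|
        score += Math.log((@word_counts[label][w] + 1.0) /
                          (total_words + @vocab.size))
      end
      score
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

clf = QuoteClassifier.new
clf.train("Character #1", "The Computer is your friend. Trust the Computer.")
clf.train("Character #2", "Question everything, even the Computer itself.")
puts clf.classify("You must trust your friend the Computer.")
# => Character #1
```

Once enough labeled quotes are fed in via `train`, the rest of the corpus can be routed through `classify` automatically.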

Sending It Off To the Story Compiler

Once you have a YAML file containing all the text that you want to reuse, you just need to hand the text off to the compiler to process and generate a Markdown file. That should usually be the end of it, except for the fact that we're reliant on CC-BY-SA works, and those works require "attribution" of some sort. So the compiler also needs to provide attribution to the modern works we're reusing.

This was not a problem we had to worry about before. If we used Gutenberg sources, well, they're in the public domain, so you don't need to provide attribution. If we used our own handwritten works, well, we already own the copyright (so we can waive the attribution requirement). But when we're dealing with works that are not in the public domain, we need to provide attribution, yet the idea of shoving quotation marks and footnotes right in the middle of a fictional novel doesn't sit well with me.

I saw an elegant solution to the 'attribution problem' in an editorial that was assembled from other modern sources (as a way to talk about plagiarism and appropriation art). Ironically, I don't remember who wrote that editorial, so I can't give that person any sort of attribution. Sorry. I think I saw it in some very classy website like "The New Yorker" though, so it's probably a pretty legit solution that a higher-up editor signed off on.

Their approach to citation was like this:

We’re changing the way people share around the world with our Global Community and 1.4 billion pieces of content under our simple, easy-to-use open licenses. It's critical that we give 110% when intelligently aligning drivers. Strategically touching base about strategizing enterprises will make us leaders in the next-generation dot-bomb industry.

But our company is very limited in one important way: we do not know “everything about the human being,” because that is impossible. The largest libraries in the world do not contain “everything.” The quantity of anthropological data discovered by scientists now exceeds any individual’s ability to assimilate it. The division of labor, including intellectual labor, begun thirty thousand years ago in the Paleolithic, has become an irreversible phenomenon, and there is nothing that can be done about it.

...

Attributions:

The attribution is a legal formality and comes at the end of the article, in case someone wants to look at it. It's also easy to find where a quote is used in the body: just CTRL-F for the "start of the quote".

For me to use this solution though, I will need to write some code that can generate the attribution text, using the following attribution template:

"{start of quote} ... {end of quote}" is from {source} ({license_info})."

This may be non-trivial. On the plus side, it would increase word count.
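A first sketch of filling in that template might look like this. The quote, source, and license values are made-up examples, and the "first five / last five words" heuristic is my own assumption (it presumes quotes longer than ten words):

```ruby
# Build an attribution line matching the template above.
# Assumes the quote is longer than ten words, so the start and
# end snippets don't overlap.
def attribution(quote, source, license_info)
  words = quote.split
  start_of_quote = words.first(5).join(" ")
  end_of_quote   = words.last(5).join(" ")
  "\"#{start_of_quote} ... #{end_of_quote}\" is from #{source} (#{license_info})."
end

quote = "The quick brown fox jumps over the lazy dog near the riverbank"
puts attribution(quote, "Example Modern Work", "CC-BY-SA 4.0")
# => "The quick brown fox jumps ... lazy dog near the riverbank" is from Example Modern Work (CC-BY-SA 4.0).
```

The compiler would run this over every reused passage and append the results to the "Attributions" section at the end of the novel.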

September

Update, 9/16/2018

So I was able to write up a proof of concept for attributions. You can see this software in action here. I'll probably rewrite it from scratch for NaNoGenMo 2018, but at least I have proven that it is possible. It also wound up being non-trivial, so I'm glad to have gotten the experience.

Another issue that I was able to resolve was determining what the novel would actually be about. I already had some inkling of what I wanted to write in August (a novel set in Alpha Complex, the setting of the PARANOIA tabletop RPG). But now it has been formalized, with three characters and an excuse plot to drive the story. Knowledge of how the novel is "supposed" to go is useful because I want to be able to 'churn' out as much copy as possible. I plan on reusing the vast amount of content that has already been written, but it will take time to adapt that content...so the less I need to think about how to adapt it, the better.

As a side-note, there is a divide between "writing-with-structure" novelists and "writing-from-the-seat-of-your-pants" (EDIT: also known as "exploration writing") novelists, and people can debate over which approach is superior. Back when I engaged in manual writing, I tended to lean towards "writing-from-the-seat-of-your-pants" because that tends to be more entertaining for me as a writer (if I had a pre-planned structure, then I would already know what happened -- writing the novel would be pointless and I should just hand out the outline to everyone), but I also acknowledge that many people prefer stories written with some 'planning' beforehand. I remember reading one novelist claim that it is possible to write from the seat of your pants if you have planned out other aspects of your story. According to that novelist, a novel can be broken up into three sections:

  • The Plot
  • The Characters
  • The Setting

If you are able to get two of those bulletpoints nailed down, then the third bulletpoint is incredibly easy to generate text for (so you can write from the seat of your pants). I have successfully nailed down the characters and the setting...and thus can use my knowledge of the characters and the setting to help me generate text for the plot.

As a test of how effective I am at adapting content to match the context, I wrote the ending to the story (the secondary 'bookend'), using a short story ("The Machine Stops") as my corpus. I found this short story almost by accident, on Wikisource. This short story is immensely readable, although it is fairly old - it was written in 1909. However, "The Machine Stops" seems to fit in thematically with PARANOIA, so I decided to use its material.

It took me three hours to "write" the ending to the novel (which really meant copying the ending of The Machine Stops and then making modifications), meaning I "wrote" ~1000 words per hour. That...actually doesn't sound good. Not good at all. Sure, 1000 words per hour means you can produce a NaNoWriMo novel in a mere 50 hours, but it sounds stupid in NaNoGenMo if I have to handwrite the whole corpus in 50 hours and let the machine simply rearrange the order of the corpus. I should be able to generate words much faster than that.

The only other time I timed how many words I can generate per hour was when I was conducting a "writing exercise" using the Mythic GM Emulator - I would define the characters and the setting, and let the Mythic GM Emulator decide what each character does within the setting. I was able to churn out copy at 1000 words per hour, just like now. While I liked the output, test readers hated it, due to many reasons, but mostly because it seemed incredibly random (the main character died in the third chapter).

The main reason I'm writing this outline is to actually gain metrics on how many "words per hour" I normally do, without copying from other sources and modifying them (EDIT: or using a mechanical aid like the Mythic GM Emulator). If this number is lower than 1000 words per hour, then we know that this approach has advantages over handwriting a novel. If this number is the same, or even higher, then something has gone horribly, horribly wrong.

Time to test.

word_count = string.split(" ").length # `string` holds the text of this outline
# word_count is 723

time = 28.to_f / 60
# time is 0.4666666666666667 hours

words_per_hour = word_count / time
p words_per_hour
# => 1549.2857142857142

Something has indeed gone horribly wrong. I should not get 1549 words per hour just through handwriting. I'll have to think carefully about why this happened.

Update, 9/19/2018

Currently, I suspect the reason that I'm getting a higher wordcount through handwriting is the "latent heat effect" - according to computational creativity scholars, the quality of the work decreases as you give away more autonomy and power to some other tool (like the Mythic GM Emulator or the AI-Driven Pipeline™). The hope of computational creativity scholars, though, is that this decrease in quality is temporary; humans become better at harnessing their favorite tools, and the work soon reaches equivalent quality (or even higher). I have not had practice with adapting other people's works, that's all. Give me more time, the computational creativity scholars say, and all will be well...eventually.

And I actually do like reusing other people's work. It is pure flattery to think that you have an idea that nobody has ever come up with before[1], and there's no point reinventing the wheel when somebody has already built an awesome wheel for you to use and abuse. In fact, just having the text there can inspire me to improve it further, avoiding the "blank page" problem that manual writers constantly face.

I'm not an optimist though (if I was, I would be working with Markov chains and RNNs, not copying and pasting other people's works), so I'm scaling down my ambitions. My goal now is just to generate a novella of 10,000 words or more, which is very easy to do, considering a 3,000-word ending has already been pre-written. The remaining ~40,000 words (needed to reach NaNoGenMo's 50,000-word target) can just be me regenerating the novella again and again or just printing out 'meows'.

I'm also dropping the "AI-Driven" part of the AI-Driven Pipeline™. The Bayesian classifier here doesn't seem to be adding any value to this system, except as an idea generator (give it a random quote and it can tell me which one of my characters could say it). I can dispense with idea generators though; they're a dime-a-dozen in the real-world. Here is the updated flow chart and Mermaid Live Editor version.

To be honest, I'm already in the process of treating this project as a failure, even though I came in with such high hopes that "copying and pasting other peoples' work" would be the "silver bullet" that will save the day.

Of course, maybe there's no such silver bullet, and maybe writing a computer-generated novel will take as much labor and time as handwriting out said novel. In an article I wrote about NaNoGenMo 2016, I predicted that "The Goal Of The Programmer Will Be To 'Scale' Novel Experiences, Not To Save Money", and quoted Orteil, the developer of Cookie Clicker, who tweeted:

thanks to procedural generation, I can produce twice the content in double the time

If this is the case, then maybe this approach is the way to slay the Great White Whale. Find tons of paragraphs, and then adapt them for your purposes. But I think I'm happy with it being a proof-of-concept rather than as a full-fledged novel.

(Fun Fact: 501 words/30 minutes = 1002 words per hour. About the same rate as copy-paste. Huh. So maybe it's possible for a human to write faster than someone who copies, pastes, and adapts...but not consistently? It's certainly less stressful to copy/paste stuff though.)

[1] Interestingly, this is an unintentional plagiarism of "Gulliver's Travels". During one of his adventures, Gulliver meets a programmer who planned on generating all human knowledge through a machine that would randomly generate text. He would hire a few people to run the machine, a few more people to read the output and write down any output that sounds "evocative", and then assign himself the goal of compiling the "evocative" outputs into readable books. The line about "pure flattery" comes straight from the programmer (who claimed that he used all his thoughts in childhood as a 'corpus' of sorts for the machine). I have placed the full text of this passage in this gist, for your reading pleasure.

Update, 9/23/2018

I have completed the proof-of-concept and uploaded it onto GitHub. I think I'll stand by my statements in the previous update - this approach is indeed the way to slay the Great White Whale, but it is just as time-consuming as handwriting a novel (and to be honest, I probably don't have the time or inclination to pursue such an approach). I saw some minor improvement in my copy/paste abilities (at one point, I got 1.3K words per hour), but even a minor increase in writing speed might not lead to anything worthwhile in the grand scheme of things.

I also got rid of the separation between quotes and missions (it might have been a great idea, but I just didn't have the time to implement it) and switched from YAML to JSON (I didn't really like YAML's confusing indentation rules and was willing to accept JSON's verbosity). Here is the final flow chart and the Mermaid Live Editor version. The final outline of the story is pretty much:

  • Beginning Paragraphs
  • Mission
  • Mission
  • ...
  • Mission
  • Ending Paragraph
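As an illustration, a JSON story file following that outline could look like the sketch below. The field names (`beginning`, `missions`, `ending`) are hypothetical - the actual compiler's schema may differ:

```ruby
require "json"

# Hypothetical story file matching the outline above; field names are
# made up for illustration, not taken from the real compiler.
story = JSON.parse(<<~STORY)
  {
    "beginning": ["The Computer is your friend."],
    "missions": ["Mission one text.", "Mission two text."],
    "ending": ["The Machine stops."]
  }
STORY

# Assemble the sections in outline order: beginning, missions, ending.
novel = story["beginning"] + story["missions"] + story["ending"]
puts novel.join("\n\n")
```

The compiler's job then reduces to walking these arrays in order and emitting Markdown (plus the attributions section).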

I suppose that, in retrospect, if I came up with a good frame story to justify grabbing and displaying random paragraphs (without the need for writing glue-text to justify the existence of those paragraphs within the frame story), I could then spend more time harvesting paragraphs, which would boost my words/hour count. But even this approach would likely lead to a marginal improvement, and wouldn't really address the main problems associated with paragraph typology.

A friend of mine did raise an interesting idea - even if it takes the same amount of time to 'generate' a novel as to 'handwrite' it, a computer can still produce a novel much more efficiently than a human. Let us assume a constant writing speed of 1K words per hour. It will take 50 hours to write a novel. Now, a human working full-time on this novel can only dedicate 8 hours a day to writing, so the human will take 6.25 days to churn out a novel. But a machine, able to dedicate 24 hours a day, only needs 2.08 days to write that same novel. That's a 66.72% reduction! My friend goes on to say that if you have the machine generate hundreds of novels beforehand, you could then 'recommend' certain novels to humans, and they would be able to read them instantaneously. (You can even treat this as a realistic plan to create a 'fully automated novel generator' - let the human write a natural language prompt, and then the computer finds an existing computer-generated novel that is most similar to that writing prompt. Done.)
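The friend's arithmetic checks out. Note the 66.72% figure comes from rounding the day counts first; the unrounded reduction is 66.67%:

```ruby
# Verify the friend's arithmetic: a 50K-word novel at 1K words/hour,
# written 8 hours/day by a human vs. 24 hours/day by a machine.
hours_needed = 50_000 / 1_000.0              # 50.0 hours of writing
human_days   = hours_needed / 8              # 6.25 days
machine_days = (hours_needed / 24).round(2)  # 2.08 days (rounded, as in the text)
reduction    = ((1 - machine_days / human_days) * 100).round(2)
p reduction
# => 66.72
```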

Of course, computers can generate text fast, but we're asking for human-readable text here, which is considerably harder. Our systems are not advanced enough to generate 1K human-readable words/hour on their own. If they were, I would assume NaNoGenMo would be very different, with more "reinforcement learning" entries and fewer "paragraph typology" entries. I would argue though that the Pipeline™ is advanced enough to generate 1K human-readable words/hour (though I may be overestimating its ability; if we want a reusable novel generator, we will naturally need more passages to draw from). It does suggest to me that big dreams like, say, writing a Lovecraft novel generator using paragraph typology, may still be worthwhile.

It will take a lot of effort though, and I think future programmers who want to follow in this line will likely need to figure out a good way of automatically harvesting paragraphs to copy and paste. I manually grabbed quotes from Wikiquote, but I used https://repl.it/@tra38/LittleNiftySubweb-Experimental to effectively search worldbuilding.stackexchange.com (and get the links to Worldbuilding's posts for citation generation). Future programmers will, naturally, find better ways of harvesting and processing a corpus.

Postscript, 5/18/2020

I found out the source for the "citation style" mentioned in the "Sending It Off To the Story Compiler" section. It turns out this approach wasn't published by The New Yorker, but by Harper's Magazine. The "citation style" was invented by Jonathan Lethem, in the article "The Ecstasy of Influence", published in February 2007.

This, I suppose, showcases why keeping track of sources is essential - you do not want to falsely attribute an idea to someone who didn't come up with it (which is a real problem with no good solution).
