
How to tokenize and create n-grams in Shakespeare from the command-line

Creating Shakespearean n-grams with just the command-line and regexes

This is a quick example showing how to use regexes to find tri-grams in Shakespeare...well, 570,872 of them, anyway, if we do some basic filtering of non-dialogue.

Though tokenization and n-grams should typically be done using a proper natural language processing framework, it can be done in a jiffy from the command-line, using standard Unix tools and ack, the better-than-grep utility.

What are n-grams?

As Wikipedia says:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

This exercise shows how to build tri-grams from Shakespeare, and it's easier seen than explained, so keep on reading. For practical purposes, n-grams are a useful way to determine statistically common (or rare) phrases in a given block of text, in a more specific way than simple word-counts.
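For instance, the tri-grams of "to be or not to be" are "to be or", "be or not", "or not to", and "not to be".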

You may have seen Google Books' interactive n-gram viewer, which charts how often phrases appear over time across the Google Books corpus.

If you're unfamiliar with n-grams, a great place to start is this book excerpt from Peter Norvig. That excerpt is linked from Norvig's page about n-grams, which contains datasets and other real-world exercises.

Making n-grams from the command-line

n-grams are pretty ubiquitous for language analysis and are a common part of NLP frameworks. So the fun of this walkthrough is seeing how far you can get with just the command-line and standard Unix tooling, which is much quicker for experimenting than jumping into IPython or RStudio.

It's something I just discovered myself after digging around with the ack tool and remembering a basic concept about regex lookaheads.

The ack tool supports full Perl-compatible regexes, and it has an --output flag that lets you print out capture groups:

$ echo "Nov 9, 2014" | ack '(\d{4})' --output 'The year is $1'
The year is 2014
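
You can capture and rearrange more than one group at a time; here's a quick sketch of the same idea (the 'month='/'day=' labels are just made up for illustration):

$ echo "Nov 9, 2014" | ack '(\w+) (\d+),' --output 'month=$1 day=$2'
month=Nov day=9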

But how do we use regexes to create n-grams? We use the zero-width property of lookaheads: the lookahead captures the next two words without consuming them, so the next match can begin at the very next word:

$ echo "do re me fa so la ti do" |
     ack '(\w+) (?=(\w+) (\w+))' --output '$1 $2 $3'

The output:

do re me
re me fa
me fa so
fa so la
so la ti
la ti do     

Looks pretty good!
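
To see why the lookahead matters, compare with a plain three-word capture. Without the zero-width lookahead, each match consumes all three words, so the tri-grams can't overlap and you only get the non-overlapping triples (a minimal sketch for comparison):

$ echo "do re me fa so la ti do" |
     ack '(\w+) (\w+) (\w+)' --output '$1 $2 $3'
do re me
fa so la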

Download the data

mkdir -p 'tempshakespeare' && cd tempshakespeare
curl -s '' \
   | tar xvzf -

The unpacking process creates a tree structure like this:

  ├── README
  ├── comedies
  │   ├── allswellthatendswell
  │   ├── asyoulikeit
  │   ├── comedyoferrors
  │   ├── cymbeline
  │   ├── loveslabourslost
  │   ├── measureforemeasure
  │   ├── merchantofvenice
  │   ├── merrywivesofwindsor
  │   ├── midsummersnightsdream
  │   ├── muchadoaboutnothing
  │   ├── periclesprinceoftyre
  │   ├── tamingoftheshrew
  │   ├── tempest
  │   ├── troilusandcressida
  │   ├── twelfthnight
  │   ├── twogentlemenofverona
  │   └── winterstale
  ├── glossary
  ├── histories
  │   ├── 1kinghenryiv
  │   ├── 1kinghenryvi
  │   ├── 2kinghenryiv
  │   ├── 2kinghenryvi
  │   ├── 3kinghenryvi
  │   ├── kinghenryv
  │   ├── kinghenryviii
  │   ├── kingjohn
  │   ├── kingrichardii
  │   └── kingrichardiii
  ├── poetry
  │   ├── loverscomplaint
  │   ├── rapeoflucrece
  │   ├── sonnets
  │   ├── various
  │   └── venusandadonis
  └── tragedies
      ├── antonyandcleopatra
      ├── coriolanus
      ├── hamlet
      ├── juliuscaesar
      ├── kinglear
      ├── macbeth
      ├── othello
      ├── romeoandjuliet
      ├── timonofathens
      └── titusandronicus
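
A quick way to sanity-check the unpacking is to count the files (run from inside tempshakespeare; the exact count depends on the archive you downloaded):

$ find . -type f | wc -l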

Tokenize and make n-grams

cat */* | 
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" | 
  # tokenize, and use lookahead+capture to perform 0-width matching
  ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
  # sort, then unique count, then reverse sort numerically
  sort | uniq -c | sort -rn

Here are the top results. Because this was a raw-text grep, many of the top tri-grams are character names and act/scene headings (though honestly, I had never heard of Sir Toby Belch):

 297 king henry vi
 247 i pray you
 217 i will not
 188 king henry v
 185 king richard iii
 175 act iv scene
 172 sir toby belch
 160 i do not
 157 i know not
 154 act iii scene
 146 act ii scene
 142 i am a
 140 i am not
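
The same lookahead trick works for any n: drop a group for bi-grams, add one for 4-grams. Here's a bi-gram variant of the pipeline above (same assumptions about the corpus layout), with a head tacked on so only the 20 most frequent pairs print:

cat */* |
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" |
  # tokenize into bi-grams: one captured word, one looked-ahead word
  ack '(\S+) +(?=(\S+))' --output '$1 $2' |
  # sort, then unique count, then reverse sort numerically, keep the top 20
  sort | uniq -c | sort -rn | head -n 20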

We can refine the process by filtering for just text that is dialogue:

$ cat tragedies/hamlet | ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)'

Uh...I'm not even going to try to explain that regex, except that it relies on how dialogue starts either a single tab in from the beginning of a line, or a tab after a speaker's all-caps name, which itself always begins at the start of the line. We also have to ignore the "[Aside]" that sometimes starts a block of dialogue.
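
As a quick illustration, here's that pattern run against a couple of made-up lines in the same tab-separated format (the sample lines are hypothetical, not from the corpus; note the all-caps 'ACT I' line is dropped):

$ printf 'HAMLET\tTo be, or not to be: that is the question.\n\tACT I\n' |
     ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1'
To be, or not to be: that is the question.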

Meh, it's good enough for a command-line exploration...parsing Shakespeare is probably best done in a real scripting environment.

Using that regex pattern to filter out non-dialogue text:

cat */* | 
  # just capture dialogue, aside from '[Aside]'
  # takes advantage of the fact that this text uses tabs to separate dialogue
  # from speaker
  ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1' |
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" | 
  # tokenize, and use lookahead+capture to perform 0-width matching
  ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
  # sort, then unique count, then reverse sort numerically
  sort | uniq -c | sort -rn

And now we have more pertinent results, rather than character names like "Some Duke's Name". On my laptop, it takes about half a minute to generate, sort, and group the tri-grams. Not bad!

 207 i pray you
 180 i will not
 143 my lord i
 137 i do not
 131 i know not
 115 my good lord
 112 i am not
 107 this is the
 105 and i will
 103 the duke of
 103 i am a
 100 i would not
  95 my lord of
  93 there is no
  91 that i have
  87 it is a
  81 i have a
  80 that i am
  80 good my lord
  79 it is not
  75 my lord and
  73 i thank you
  73 a room in
  72 i will be
  71 it is the
  71 and all the
  68 what's the matter
  68 thou art a
  68 i pray thee
  68 i have done
  66 as i am
  65 if it be
  63 you my lord
  63 what is the
  62 my lord the
  62 and in the
  61 i beseech you

Note: Peter Norvig has a prepped Shakespeare file stripped of all the non-dialogue, which makes the tokenizing easy. But the point of quick tokenizing/n-gramming is being able to do it on any text corpus of your choosing: it's worth getting comfortable with processing raw text if you want to do text analyses specific to your own work and research.
