
How to tokenize and create n-grams in Shakespeare from the command-line

Creating Shakespearean n-grams with just the command-line and regexes

This is a quick example showing how to use regexes to find tri-grams in Shakespeare...well, 570,872 of them, anyway, if we do some basic filtering of non-dialogue.

Though tokenization and n-grams should typically be done using a proper natural language processing framework, it can be done in a jiffy from the command-line, using standard Unix tools and ack, the better-than-grep utility.

What are n-grams?

As Wikipedia says:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

This exercise shows how to build tri-grams from Shakespeare, and it's easier seen than explained, so keep on reading. For practical purposes, n-grams are a useful way to determine statistically common (or rare) phrases in a given block of text, in a more specific way than simple word-counts.
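For instance, the tri-grams of "to be or not to be" are "to be or", "be or not", "or not to", and "not to be".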

You may have seen Google Books' interactive n-gram viewer, which charts how often phrases appear over time across the Google Books corpus.

If you're unfamiliar with n-grams, a great place to start is this book excerpt from Peter Norvig. That excerpt is linked from Norvig's page about n-grams, which contains datasets and other real-world exercises.

Making n-grams from the command-line

n-grams are pretty ubiquitous for language analysis and are a common part of NLP frameworks. So the fun of this walkthrough is seeing how far you can get with just the command-line and standard Unix tooling, which is much quicker for experimenting than jumping into IPython or RStudio.

It's something I just discovered myself after digging around with the ack tool and remembering a basic concept about regex lookaheads.

The ack tool supports full Perl-compatible regexes, and it has an --output flag that lets you print out capture groups:

$ echo "Nov 9, 2014" | ack '(\d{4})' --output 'The year is $1'
The year is 2014
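
You can capture and rearrange more than one group at a time; here's a quick sketch of the same idea (the 'month='/'day=' labels are just made up for illustration):

$ echo "Nov 9, 2014" | ack '(\w+) (\d+),' --output 'month=$1 day=$2'
month=Nov day=9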

But how do we use regexes to create n-grams? We use the zero-width property of lookaheads: the lookahead captures the next two words without consuming them, so the next match can begin at the very next word:

$ echo "do re me fa so la ti do" |
     ack '(\w+) (?=(\w+) (\w+))' --output '$1 $2 $3'

The output:

do re me
re me fa
me fa so
fa so la
so la ti
la ti do     

Looks pretty good!
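
To see why the lookahead matters, compare with a plain three-word capture. Without the zero-width lookahead, each match consumes all three words, so the tri-grams can't overlap and you only get the non-overlapping triples (a minimal sketch for comparison):

$ echo "do re me fa so la ti do" |
     ack '(\w+) (\w+) (\w+)' --output '$1 $2 $3'
do re me
fa so la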

Download the data

mkdir -p 'tempshakespeare' && cd tempshakespeare
curl -s '' \
   | tar xvzf -

The unpacking process creates a tree structure like this:

  ├── README
  ├── comedies
  │   ├── allswellthatendswell
  │   ├── asyoulikeit
  │   ├── comedyoferrors
  │   ├── cymbeline
  │   ├── loveslabourslost
  │   ├── measureforemeasure
  │   ├── merchantofvenice
  │   ├── merrywivesofwindsor
  │   ├── midsummersnightsdream
  │   ├── muchadoaboutnothing
  │   ├── periclesprinceoftyre
  │   ├── tamingoftheshrew
  │   ├── tempest
  │   ├── troilusandcressida
  │   ├── twelfthnight
  │   ├── twogentlemenofverona
  │   └── winterstale
  ├── glossary
  ├── histories
  │   ├── 1kinghenryiv
  │   ├── 1kinghenryvi
  │   ├── 2kinghenryiv
  │   ├── 2kinghenryvi
  │   ├── 3kinghenryvi
  │   ├── kinghenryv
  │   ├── kinghenryviii
  │   ├── kingjohn
  │   ├── kingrichardii
  │   └── kingrichardiii
  ├── poetry
  │   ├── loverscomplaint
  │   ├── rapeoflucrece
  │   ├── sonnets
  │   ├── various
  │   └── venusandadonis
  └── tragedies
      ├── antonyandcleopatra
      ├── coriolanus
      ├── hamlet
      ├── juliuscaesar
      ├── kinglear
      ├── macbeth
      ├── othello
      ├── romeoandjuliet
      ├── timonofathens
      └── titusandronicus
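
A quick way to sanity-check the unpacking is to count the files (run from inside tempshakespeare; the exact count depends on the archive you downloaded):

$ find . -type f | wc -l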

Tokenize and make n-grams

cat */* | 
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" | 
  # tokenize, and use lookahead+capture to perform 0-width matching
  ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
  # sort, then unique count, then reverse sort numerically
  sort | uniq -c | sort -rn

Here are the top results. Because this was a raw-text grep, many of the top tri-grams are character names and act/scene headings (though honestly, I had never heard of Sir Toby Belch):

 297 king henry vi
 247 i pray you
 217 i will not
 188 king henry v
 185 king richard iii
 175 act iv scene
 172 sir toby belch
 160 i do not
 157 i know not
 154 act iii scene
 146 act ii scene
 142 i am a
 140 i am not
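
The same lookahead trick works for any n: drop a group for bi-grams, add one for 4-grams. Here's a bi-gram variant of the pipeline above (same assumptions about the corpus layout), with a head tacked on so only the 20 most frequent pairs print:

cat */* |
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" |
  # tokenize into bi-grams: one captured word, one looked-ahead word
  ack '(\S+) +(?=(\S+))' --output '$1 $2' |
  # sort, then unique count, then reverse sort numerically, keep the top 20
  sort | uniq -c | sort -rn | head -n 20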

We can refine the process by filtering for just text that is dialogue:

$ cat tragedies/hamlet | ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)'

Uh...I'm not even going to try to explain that regex, except that it relies on how dialogue starts either a single tab in from the beginning of a line, or a tab after a speaker's all-caps name, which itself always begins at the start of the line. We also have to ignore the "[Aside]" that sometimes starts a block of dialogue.
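
As a quick illustration, here's that pattern run against a couple of made-up lines in the same tab-separated format (the sample lines are hypothetical, not from the corpus; note the all-caps 'ACT I' line is dropped):

$ printf 'HAMLET\tTo be, or not to be: that is the question.\n\tACT I\n' |
     ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1'
To be, or not to be: that is the question.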

Meh, it's good enough for a command-line exploration...parsing Shakespeare is probably best done in a real scripting environment.

Using that regex pattern to filter out non-dialogue text:

cat */* | 
  # just capture dialogue, aside from '[Aside]'
  # takes advantage of the fact that this text uses tabs to separate dialogue
  # from speaker
  ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1' |
  # translate to lowercase
  tr '[:upper:]' '[:lower:]' |
  # change newlines and tabs to space characters
  tr '\t\n' ' ' |
  # delete all non-letters/spaces/apostrophes/numbers
  sed -E "s/[^a-z0-9 ']+//g" | 
  # tokenize, and use lookahead+capture to perform 0-width matching
  ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
  # sort, then unique count, then reverse sort numerically
  sort | uniq -c | sort -rn

And now we have more pertinent results, rather than character names like "Some Duke's Name". On my laptop, it takes about half a minute to generate, sort, and group the tri-grams. Not bad!

 207 i pray you
 180 i will not
 143 my lord i
 137 i do not
 131 i know not
 115 my good lord
 112 i am not
 107 this is the
 105 and i will
 103 the duke of
 103 i am a
 100 i would not
  95 my lord of
  93 there is no
  91 that i have
  87 it is a
  81 i have a
  80 that i am
  80 good my lord
  79 it is not
  75 my lord and
  73 i thank you
  73 a room in
  72 i will be
  71 it is the
  71 and all the
  68 what's the matter
  68 thou art a
  68 i pray thee
  68 i have done
  66 as i am
  65 if it be
  63 you my lord
  63 what is the
  62 my lord the
  62 and in the
  61 i beseech you

Note: Peter Norvig has a prepped Shakespeare file stripped of all the non-dialogue, which makes the tokenizing easy. But the point of quick tokenizing/n-gramming is being able to do it on any text corpus of your choosing: it's worth getting comfortable with processing raw text if you want to do text analyses specific to your own work and research.
