@ariddell
Created June 7, 2010 22:06
# Identifying character names in The Adventures of David Simple
# source: http://www.munseys.com/diskone/davidsimp.htm
#
# commands (the sed expressions use GNU extensions such as \S, \s, \w and \+):
# ./ner.sh david_simple.txt > david_simple.ner.txt
# sed -e 's/\S\+\/[^P]\w*//g' -e 's/\s\{2,\}/\n/g' -e 's/\/PERSON//g' david_simple.ner.txt | sort | uniq -c | sort -nr | sed 's/^\s\+//' | awk '$1 > 1 { print $1 "\t" substr($0, length($1) + 2) }'
# output: count and name for every PERSON-tagged string that occurs more than once
# note: ner.sh runs Stanford NER http://nlp.stanford.edu/software/CRF-NER.html
238 David
131 Cynthia
96 Camilla
69 Dumont
47 Livia
45 Isabelle
33 Dorimene
29 Daniel
24 Mr. Orgueil
23 Valentine
20 Marquis de Stainville
20 Corinna
18 Mr. Simple
17 Marquis
13 Chevalier Dumont
12 Mind
11 Vieuville
11 Juliè
11 John
10 Mother
10 Brother
9 Story
9 Shakespear
9 Mr. David
8 Pandolph
8 Le Vive
8 CYNTHIA
7 Mr. Spatter
7 Monsieur Le Buisson
7 Chevalier
6 Person
6 Mr. Varnish
6 Mr. Johnson
5 Sacharissa
5 Miss Johnson
5 Madam
5 Le Neuf
4 Shakespeare
4 Mr. Nokes
4 Joy
4 i. e.
3 Woman
3 Mr. David Simple
3 Maxim
3 DUMONT
3 Daughter
3 DANIEL
3 Chearfulness
3 Carpenter
3 Ben Johnson
2 Virgil
2 Stainville
2 Peggy
2 Nanny Johnson
2 Mr. Daniel
2 Mr.
2 Miss Betty Trusty
2 Milton
2 Man
2 Judge
2 Johnson
2 George Barnwell
2 Dryden
2 Don Sebastian
2 Criticks
2 Coquettry
2 Characteristick
2 CAMILLA
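#
# A minimal sketch of how the sed/awk pipeline in the header works, run on two
# made-up tagged lines. The sample text and the function names (sample_input,
# extract_names, count_names) are assumptions for illustration only; they are
# not taken from the novel or from the original run. It assumes GNU sed and the
# slash-tagged token/TAG format that the header's sed expressions expect.

```shell
#!/bin/sh
# Hypothetical NER output: token/TAG pairs separated by single spaces.
sample_input() {
    printf '%s\n' \
        'Mr./PERSON Orgueil/PERSON told/O David/PERSON that/O Cynthia/PERSON left/O London/LOCATION ./O' \
        'David/PERSON thanked/O Mr./PERSON Orgueil/PERSON and/O David/PERSON wept/O ./O'
}

# The three sed expressions from the header (GNU sed syntax):
#   1. delete every token whose tag does not start with P
#   2. turn each run of two or more whitespace characters into a line break
#   3. strip the /PERSON suffix, so adjacent PERSON tokens form one name
extract_names() {
    sed -e 's/\S\+\/[^P]\w*//g' -e 's/\s\{2,\}/\n/g' -e 's/\/PERSON//g'
}

# Count distinct names, most frequent first, keeping names seen more than once.
# Trimming stray spaces and dropping blank lines is an addition in this sketch;
# the header's command does not do it.
count_names() {
    sed -e 's/^ *//' -e 's/ *$//' | grep -v '^$' |
        sort | uniq -c | sort -nr |
        awk '$1 > 1 { name = $0; sub(/^ *[0-9]+ /, "", name); print $1 "\t" name }'
}

sample_input | extract_names | count_names
```

# On this sample the sketch reports David with a count of 3 and Mr. Orgueil
# with a count of 2; Cynthia appears once and is filtered out, as single
# occurrences are in the listing above.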