@hugovk
Last active February 25, 2016 09:10
The Pilgrim's Cutthroats: potential cutthroat compounds found in the complete works of John Bunyan, including The Pilgrim's Progress
Source: https://www.gutenberg.org/ebooks/6049
Clip-promise, "a notorious villain"
Mr. Dam-man, commissioner, trier, high Calvinist, immoral in conduct
Mr. Forget-good, "He could remember nothing but mischief, and to do it with delight."
Mr. Fri-babe, "free-babe"?, commissioner, trier, high Calvinist, immoral in conduct
Lord Hate-good, a judge
Mr. Hate-light, a juror
Hate-reproof, son of Vile-affection and Carnal-lust
Mr. Hold-the-world
Mrs. Know-nothing
Linger-after-lust
Love-flesh, "a very lewd fellow"
Love-gain, "a market town in the county of Coveting"
Mr. Love-lust, a juror
Love-no-good, a townsman
Love-no-light, "governor of Midnight-hold"
Mr. Love-saint, "my friend"
Mrs. Love-the-flesh, "at Madam Wanton's, where we were as merry as the maids"
Mr. Love-to-Mansoul, "that good man"
Pickthank, a witness
rob-shop, "these slithy,[48] rob-shop, pick-pocket men"
Mr. Save-all
Save-self
Scorn-truth, daughter of Vile-affection and Carnal-lust
Slay-good, "a giant that does much annoy the King's highway in these parts"
Slightgod, daughter of Vile-affection and Carnal-lust
Spite-god, "a most blasphemous wretch"
Take-heed
Taste-that-which-is-good, the cook
Doll Tear-sheet
Mr. Tell-true
Want-wit
? Mr. Blind-man, foreman of a jury whose colleagues are cutthroats
? Blood-men, troops: do they blood men, or are they men of blood?
? creep-hedge, Lazarus, "such a scabbed creep-hedge"
? Take-heed-what-you-hear, the trumpeter
? Mrs. Talk-About-The-Right Things, https://en.wikipedia.org/wiki/The_Pilgrim%27s_Progress
These are all the capitalised, hyphenated words in the text, from which the edited list above was mainly whittled down.
python word-finder.py "^[A-Z].*-.+" -pg 6049 > pilgrims-progress-capped-hyphenated.txt
Text has 12,261,364 characters
Split text into words
Text has 38,335 unique words
Found 414 matching words:
AS-IS
Abel-mizraim
Advocate-office
Advocate-to
Advocate.-1
Advocate.-ED
All-pause
All-prayer
All-prevailing
All-seeing
Almighty-they
Almighty.-In
Alms-deeds
Altar-work
Anglo-Norman
Anglo-Saxon
Answer.-He
Anti-Christians
Any-thing
Attorney-General
Ave-marias
BAT'S-EYES
BROKEN-HEARTED
BY-ENDS
Babel-beast
Backward-to-all-but-naught
Bat's-eyes
Bath-rabbim
Battering-rams
Beef-eaters
Bethlehem-Ephratah
Bible-loving
Blind-man
Blood-men
Bloody-man
Broad-way
By-end's
By-ends
By-path
Bye-ends
Bye-path
Byepath-meadow
CATALOGUE-TABLE
CONSTITUTION-SIN
Cain-like
Canaan-which
Captain-general
Carnal-lust
Carnal-security
Castle-gate
Chief-Justice
Christ-abhorring
Christ-advancing
Christ-an
Christ-dishonouring
Christ-its
Christ-less
Christ-then
Christ-to
Christ.-5
Christ.-A
Christ.-An
Christ.-In
Christian-like
Christianity-a
City-New
City-boldly
City-the
Clip-promise
Coffee-houses
Commander-in-chief
Common-place
Common-prayer-book
Corinth-all
Cotton-end
Cumber-ground
DARE-NOT-LIE
DEATH-BED
Dam-man
Damn-me-blades
Dare-not-lie
Dark-land
Desires-awake
Ear-gate
Ebed-melech
Egypt-and
Egypt-only
Eighth.-All
Election-doubteres
En-eglaim
En-gedi
En-hakkore
Ephraim-like
Evil-questioning
Eye-gate
FEEBLE-MIND
FIFTH.-LAST
FIG-TREE
FORE-ORDAINING
Facing-both-ways
Faint-heart
Fair-speech
Faith-doubters
Faith-heart
False-peace
Father.-1
Feeble-Mind
Feeble-mind
Feeble-minds
Feel-gate
Felicity-doubters
Fellow-feeling
Fellow-pilgrims
Fifth.-If
Fifth.-To
Fig-leaves
Fig-tree
Fig-trees
Fine-spun
First.-To
Forget-good
Forgiveness-The
Forty-five
Fourth.-Improve
Fourth.-To
Free-will
Fri-babe
Friend-my
GOOD-WILL
GOSPEL-HOLINESS
GREAT-HEART
GUTENBERG-TM
GUTENBERG-tm
Gentile-believers
Glanville.-ED
Gleaning-grapes
Glory-doubters
God-Man
God-as
God-fearing
God-head
God-its
God-like
God-man
God-mother
God-offending
God-perfect
God-provoking
God-speed
God-tempting
God-thou
God-ward
God.-2
God.-A
God.-Christianity
Godly-fear
Godly-fears
Godly-man
Godly-sincerity
Good-conscience
Good-deed
Good-deeds
Good-hope
Good-will
Gospel-Truths
Gospel-ordinances
Gospel-performances
Gospel-truths
Grace-doubters
Great-grace
Great-heart
Great-heart's
Great-hearts
Gutenberg-tm
HEAD-TO
HEAD.-TO
HEART-WORK
HOLD-THE-WORLD
HOLY-MAN
HOUR-GLASS
Hadad-rimmon
Half-a-crown
Hamon-gog
Hard-heart
Hate-good
Hate-light
Hate-reproof
Hear-well
Heart-castle
Heart-endearing
Heart-work
Heaven-entranced
Heaven-when
Hebrews-traits
Hell-bred
Hell-fire
Hell-gate
High-mind
High-priest
Him-a
Hold-the-world
Holy-man
Human-wisdom
Humble-mind
ILL-FAV
ILL-FAVOURED
ILL-PAUSE
IV.-ED
Ill-pause
Ill-shaped
Ill-will
Inn-keeper
Israel-like
Italic-lettered
Jehovah-its
Jerusalem-and
Jerusalem-descend
Jesus-they
Judgment-seat
Know-nothing
LOOKING-GLASS
LOVE-SAINT
Law-term
Light-mind
Linger-after-lust
Little-ease
Little-faith
Live-loose
Long-suffering
Looking-Glass
Looking-glass
Lord's-day
Lord's-days
Lord-general
Lord-one
Loth-to-stoop
Love-flesh
Love-gain
Love-lust
Love-no-good
Love-no-light
Love-saint
Love-the-flesh
Love-to-Mansoul
MEASURING-REED
MONEY-LOVE
Mammon-how
Man's-invention
Man-soul
Market-cross
Mercy-seat
Metheg-Ammah,9
Midnight-hold
Money-love
Morality-there
Mouth-gate
Much-afraid
Much-afraids
Narrow-grace
New-Year
New-fashioned
New-year
No-good
No-heart
No-life
No-sin
No-truth
Norman-French
Nose-gate
Not-right
O-buts
OVER-MUCH
Object.-But
Object.-If
Object.-My
Object.-What
Objection.-But
Padan-aram
Papist-like
Parliament.-Ed
Past-hope
Pater-nosters
Paul-another
Peace-maker
Pharisees-to
Plain-truth
Prayer-Book
Prayer-book
Prayer.-ED
Privilege.-Is
Privilege.-The
Publican-Jews
Puff-up
Quaker-like
RE-PRODUCE
READY-TO-HALT
ROSE-BUSH
Ramath-Lehi
Ramoth-Gilead
Ramoth-gilead
Ranter-like
Ready-to-halt
Ready-to-halts
Reply.-Well
Resurrection-doubters
Revelations-forgetting
SAVE-ALL
SELF-EXAMINATION
SELF-RIGHTEOUS
SEVENTH-DAY
SEVENTH.-USE
SHALL-COME
SIXTH.-OBJECTIONS
STAND-FAST
Sabbath-breaker
Sabbath-breaking
Sabbath-day
Sabbath-keepers
Salvation-doubters
Save-all
Save-self
Saviour-lies
Say-well
Scorn-truth
Scripture-moment
Second.-There
Second.-To
Self-Denial
Self-conceit
Self-confidence
Self-denial
Self-flatteries
Self-love
Self-righteousness
Self-will
Seventh-day
Seventh.-The
Shall-come
Shall-comes
Short-sighted
Short-wind
Sick-bed
Simple-hearted
Sin-sick
Sion-songs
Sixth.-Our
Sixth.-To
Slay-good
Sleepy-head
Slow-pace
Smooth-man
Snuff-dishes
Son-for
Son-in
Soul.-Truly
Spirit-faith
Spite-god
Stand-fast
Stand-to-lies
State-religion
Strait-Gate
Suffer-long
Sunday-school
Sweet-sin-hold
THIRTY-NINE
TO-DAY
Take-heed
Take-heed-what-you-hear
Tale-bearers
Taste-that-which-is-good
Tear-sheet
Tell-true
Temple-The
Temple-bar
Temple-chains
Third.-Many
Third.-To
Tiglath-pileser
Time-server
To-day
To-elbow
To-morrow
Too-bold
True-hearted
Tubal-Cain
Turn-about
Turn-away
Turn-stile-alley
Twenty-four
Two-tongues
Unthought-of
VINE-TREE
Vain-confidence
Vain-gloriously
Vain-glory
Vain-hope
Valiant-for-the-truth
Valiant-for-truth
Vile-affection
Vocation-doubters
WATER-BAPTISM
WELL-SPRING
Wain-wood
Want-wit
Water-Baptism
Water-baptism
Wet-eyes
Whitsun-ales
Wild-head
Will-be-will
Will-be-will's
Worldly-glory
Worldly-wiseman
Would-live
Source: https://www.gutenberg.org/ebooks/6049
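The raw listing contains entries such as Advocate.-1 and Christ-its because the pattern ^[A-Z].*-.+ only asks for a capital first letter, then a hyphen somewhere later with at least one character after it. A minimal sketch of that filter on a few sample words follows. The first five samples come from the lists in this gist; "Trailing-" is a made-up word added only to show the effect of the final .+ part.

```python
import re

# Pattern from the command line: capital first letter, any characters,
# a hyphen, then at least one more character.
PATTERN = r"^[A-Z].*-.+"

samples = [
    "Clip-promise",  # capitalised and hyphenated
    "Advocate.-1",   # punctuation debris still fits the pattern
    "Christ-its",    # two words joined by a dash in the text
    "Pickthank",     # no hyphen
    "rob-shop",      # lowercase first letter
    "Trailing-",     # hypothetical: nothing after the hyphen
]

matches = [word for word in samples if re.search(PATTERN, word)]
print(matches)
# ['Clip-promise', 'Advocate.-1', 'Christ-its']
```

Pickthank and rob-shop are in the edited list but cannot be produced by this pattern, which fits the note that the edited list was only mainly whittled from this output.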
#!/usr/bin/env python
# encoding: utf-8
"""
Find words matching a pattern in a Project Gutenberg text.
TODO input from text file instead of PG?
"""
from __future__ import print_function
import argparse
import re
import webbrowser
from pprint import pprint


def load_list(the_filename):
    try:
        with open(the_filename, 'r') as f:
            my_list = [
                line.decode('unicode-escape').rstrip(u'\n') for line in f]
    except IOError:
        my_list = []
    return my_list


def cut_verbs(cutthroats):
    """Given a list of cutthroats, find the verb- stems.
    Ignores unhyphenated.
    """
    verbs = set()
    for cutthroat in cutthroats:
        if "-" in cutthroat:
            first_word = cutthroat.split()[0].split(",")[0]
            if "-" in first_word:
                verb = first_word.split("-")[0]
                verbs.add(verb.lower())
                # print(verb, "\t", first_word, "\t", cutthroat)
    return verbs


def text_from_pg(id_number):
    # https://github.com/c-w/Gutenberg
    from gutenberg.acquire import load_etext
    from gutenberg.cleanup import strip_headers
    # text = strip_headers(load_etext(id_number)).strip()
    text = load_etext(id_number).strip()
    return text


def words_from_text(text):
    """Split the text into a set of words"""
    # https://textblob.readthedocs.org/en/dev/
    print("Split text into words")
    from textblob import TextBlob
    blob = TextBlob(text)
    # return set(word.lower() for word in blob.words)
    return set(blob.words)


def open_url(url):
    if not args.no_web:
        webbrowser.open(url, new=2)  # 2 = open in a new tab, if possible


def print_it(text):
    """cmd.exe cannot do Unicode so encode first"""
    print(text.encode('utf-8'))


def commafy(value):
    """Add thousands commas"""
    return "{:,}".format(value)


def summarise(some_set, text):
    if len(some_set):
        print("\nFound", len(some_set), text + ":\n")
        some_set = sorted(some_set)
        print_it("\n".join(some_set))
    else:
        print("\nFound no " + text + ".\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="TODO",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        'pattern',
        # default='not-cutthroats.txt',
        help="Input regex")
    parser.add_argument(
        '-pg', '--gutenberg',
        type=int, default=2701,
        help="ID number of a Project Gutenberg text")
    parser.add_argument(
        '-nw', '--no-web', action='store_true',
        help="Don't open a web browser to show the source file")
    args = parser.parse_args()

    url = "https://www.gutenberg.org/ebooks/" + str(args.gutenberg)
    text = text_from_pg(args.gutenberg)
    print("Text has", commafy(len(text)), "characters")
    # pprint(text)

    words = words_from_text(text)
    print("Text has", commafy(len(words)), "unique words")

    found = set()
    for word in words:
        if word not in found and re.search(args.pattern, word):
            found.add(word)
            # if len(found_unknown) == 1:
            #     open_url(url)

    summarise(found, "matching words")
    print("\nSource: " + url)

# End of file
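The script defines cut_verbs but never calls it from the main block. Below is a minimal sketch of how it might be used. The function body is copied from the script, and the input is a few entries from the edited list at the top of this gist. The choice of entries and the variable names entries and verbs are assumptions made for illustration only.

```python
def cut_verbs(cutthroats):
    """Given a list of cutthroats, find the verb- stems.
    Ignores unhyphenated.
    """
    verbs = set()
    for cutthroat in cutthroats:
        if "-" in cutthroat:
            # Only the first whitespace-separated token is examined
            first_word = cutthroat.split()[0].split(",")[0]
            if "-" in first_word:
                verb = first_word.split("-")[0]
                verbs.add(verb.lower())
    return verbs


# A handful of lines from the edited list (illustrative selection)
entries = [
    'Clip-promise, "a notorious villain"',
    "Mr. Dam-man, commissioner, trier, high Calvinist, immoral in conduct",
    "Love-no-good, a townsman",
    "Pickthank, a witness",
]

verbs = cut_verbs(entries)
print(sorted(verbs))
# ['clip', 'love']
```

As written, the function looks only at the first token of each entry, so a line that starts with a title such as "Mr." contributes no verb. "Dam-man" is skipped here for that reason, and "Pickthank" is skipped because it has no hyphen.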