Last active
February 25, 2016 09:10
The Pilgrim's Cutthroats: potential cutthroat compounds found in the complete works of John Bunyan, including The Pilgrim's Progress
Source: https://www.gutenberg.org/ebooks/6049
Clip-promise, "a notorious villain"
Mr. Dam-man, commissioner, trier, high Calvinist, immoral in conduct
Mr. Forget-good, "He could remember nothing but mischief, and to do it with delight."
Mr. Fri-babe, "free-babe"?, commissioner, trier, high Calvinist, immoral in conduct
Lord Hate-good, a judge
Mr. Hate-light, a juror
Hate-reproof, son of Vile-affection and Carnal-lust
Mr. Hold-the-world
Mrs. Know-nothing
Linger-after-lust
Love-flesh, "a very lewd fellow"
Love-gain, "a market town in the county of Coveting"
Mr. Love-lust, a juror
Love-no-good, a townsman
Love-no-light, "governor of Midnight-hold"
Mr. Love-saint, "my friend"
Mrs. Love-the-flesh, "at Madam Wanton's, where we were as merry as the maids"
Mr. Love-to-Mansoul, "that good man"
Pickthank, a witness
rob-shop, "these slithy,[48] rob-shop, pick-pocket men"
Mr. Save-all
Save-self
Scorn-truth, daughter of Vile-affection and Carnal-lust
Slay-good, "a giant that does much annoy the King's highway in these parts"
Slightgod, daughter of Vile-affection and Carnal-lust
Spite-god, "a most blasphemous wretch"
Take-heed
Taste-that-which-is-good, the cook
Doll Tear-sheet
Mr. Tell-true
Want-wit
? Mr. Blind-man, foreman of a jury whose colleagues are cutthroats
? Blood-men, troops, do they blood men, or men of blood?
? creep-hedge, Lazarus, "such a scabbed creep-hedge"
? Take-heed-what-you-hear, the trumpeter
? Mrs. Talk-About-The-Right Things, https://en.wikipedia.org/wiki/The_Pilgrim%27s_Progress
These are all the hyphenated and capped words in the text, from which the edited list was mainly whittled.
python word-finder.py "^[A-Z].*-.+" -pg 6049 > pilgrims-progress-capped-hyphenated.txt
Text has 12,261,364 characters
Split text into words
Text has 38,335 unique words
Found 414 matching words:
AS-IS
Abel-mizraim
Advocate-office
Advocate-to
Advocate.-1
Advocate.-ED
All-pause
All-prayer
All-prevailing
All-seeing
Almighty-they
Almighty.-In
Alms-deeds
Altar-work
Anglo-Norman
Anglo-Saxon
Answer.-He
Anti-Christians
Any-thing
Attorney-General
Ave-marias
BAT'S-EYES
BROKEN-HEARTED
BY-ENDS
Babel-beast
Backward-to-all-but-naught
Bat's-eyes
Bath-rabbim
Battering-rams
Beef-eaters
Bethlehem-Ephratah
Bible-loving
Blind-man
Blood-men
Bloody-man
Broad-way
By-end's
By-ends
By-path
Bye-ends
Bye-path
Byepath-meadow
CATALOGUE-TABLE
CONSTITUTION-SIN
Cain-like
Canaan-which
Captain-general
Carnal-lust
Carnal-security
Castle-gate
Chief-Justice
Christ-abhorring
Christ-advancing
Christ-an
Christ-dishonouring
Christ-its
Christ-less
Christ-then
Christ-to
Christ.-5
Christ.-A
Christ.-An
Christ.-In
Christian-like
Christianity-a
City-New
City-boldly
City-the
Clip-promise
Coffee-houses
Commander-in-chief
Common-place
Common-prayer-book
Corinth-all
Cotton-end
Cumber-ground
DARE-NOT-LIE
DEATH-BED
Dam-man
Damn-me-blades
Dare-not-lie
Dark-land
Desires-awake
Ear-gate
Ebed-melech
Egypt-and
Egypt-only
Eighth.-All
Election-doubteres
En-eglaim
En-gedi
En-hakkore
Ephraim-like
Evil-questioning
Eye-gate
FEEBLE-MIND
FIFTH.-LAST
FIG-TREE
FORE-ORDAINING
Facing-both-ways
Faint-heart
Fair-speech
Faith-doubters
Faith-heart
False-peace
Father.-1
Feeble-Mind
Feeble-mind
Feeble-minds
Feel-gate
Felicity-doubters
Fellow-feeling
Fellow-pilgrims
Fifth.-If
Fifth.-To
Fig-leaves
Fig-tree
Fig-trees
Fine-spun
First.-To
Forget-good
Forgiveness-The
Forty-five
Fourth.-Improve
Fourth.-To
Free-will
Fri-babe
Friend-my
GOOD-WILL
GOSPEL-HOLINESS
GREAT-HEART
GUTENBERG-TM
GUTENBERG-tm
Gentile-believers
Glanville.-ED
Gleaning-grapes
Glory-doubters
God-Man
God-as
God-fearing
God-head
God-its
God-like
God-man
God-mother
God-offending
God-perfect
God-provoking
God-speed
God-tempting
God-thou
God-ward
God.-2
God.-A
God.-Christianity
Godly-fear
Godly-fears
Godly-man
Godly-sincerity
Good-conscience
Good-deed
Good-deeds
Good-hope
Good-will
Gospel-Truths
Gospel-ordinances
Gospel-performances
Gospel-truths
Grace-doubters
Great-grace
Great-heart
Great-heart's
Great-hearts
Gutenberg-tm
HEAD-TO
HEAD.-TO
HEART-WORK
HOLD-THE-WORLD
HOLY-MAN
HOUR-GLASS
Hadad-rimmon
Half-a-crown
Hamon-gog
Hard-heart
Hate-good
Hate-light
Hate-reproof
Hear-well
Heart-castle
Heart-endearing
Heart-work
Heaven-entranced
Heaven-when
Hebrews-traits
Hell-bred
Hell-fire
Hell-gate
High-mind
High-priest
Him-a
Hold-the-world
Holy-man
Human-wisdom
Humble-mind
ILL-FAV
ILL-FAVOURED
ILL-PAUSE
IV.-ED
Ill-pause
Ill-shaped
Ill-will
Inn-keeper
Israel-like
Italic-lettered
Jehovah-its
Jerusalem-and
Jerusalem-descend
Jesus-they
Judgment-seat
Know-nothing
LOOKING-GLASS
LOVE-SAINT
Law-term
Light-mind
Linger-after-lust
Little-ease
Little-faith
Live-loose
Long-suffering
Looking-Glass
Looking-glass
Lord's-day
Lord's-days
Lord-general
Lord-one
Loth-to-stoop
Love-flesh
Love-gain
Love-lust
Love-no-good
Love-no-light
Love-saint
Love-the-flesh
Love-to-Mansoul
MEASURING-REED
MONEY-LOVE
Mammon-how
Man's-invention
Man-soul
Market-cross
Mercy-seat
Metheg-Ammah,9
Midnight-hold
Money-love
Morality-there
Mouth-gate
Much-afraid
Much-afraids
Narrow-grace
New-Year
New-fashioned
New-year
No-good
No-heart
No-life
No-sin
No-truth
Norman-French
Nose-gate
Not-right
O-buts
OVER-MUCH
Object.-But
Object.-If
Object.-My
Object.-What
Objection.-But
Padan-aram
Papist-like
Parliament.-Ed
Past-hope
Pater-nosters
Paul-another
Peace-maker
Pharisees-to
Plain-truth
Prayer-Book
Prayer-book
Prayer.-ED
Privilege.-Is
Privilege.-The
Publican-Jews
Puff-up
Quaker-like
RE-PRODUCE
READY-TO-HALT
ROSE-BUSH
Ramath-Lehi
Ramoth-Gilead
Ramoth-gilead
Ranter-like
Ready-to-halt
Ready-to-halts
Reply.-Well
Resurrection-doubters
Revelations-forgetting
SAVE-ALL
SELF-EXAMINATION
SELF-RIGHTEOUS
SEVENTH-DAY
SEVENTH.-USE
SHALL-COME
SIXTH.-OBJECTIONS
STAND-FAST
Sabbath-breaker
Sabbath-breaking
Sabbath-day
Sabbath-keepers
Salvation-doubters
Save-all
Save-self
Saviour-lies
Say-well
Scorn-truth
Scripture-moment
Second.-There
Second.-To
Self-Denial
Self-conceit
Self-confidence
Self-denial
Self-flatteries
Self-love
Self-righteousness
Self-will
Seventh-day
Seventh.-The
Shall-come
Shall-comes
Short-sighted
Short-wind
Sick-bed
Simple-hearted
Sin-sick
Sion-songs
Sixth.-Our
Sixth.-To
Slay-good
Sleepy-head
Slow-pace
Smooth-man
Snuff-dishes
Son-for
Son-in
Soul.-Truly
Spirit-faith
Spite-god
Stand-fast
Stand-to-lies
State-religion
Strait-Gate
Suffer-long
Sunday-school
Sweet-sin-hold
THIRTY-NINE
TO-DAY
Take-heed
Take-heed-what-you-hear
Tale-bearers
Taste-that-which-is-good
Tear-sheet
Tell-true
Temple-The
Temple-bar
Temple-chains
Third.-Many
Third.-To
Tiglath-pileser
Time-server
To-day
To-elbow
To-morrow
Too-bold
True-hearted
Tubal-Cain
Turn-about
Turn-away
Turn-stile-alley
Twenty-four
Two-tongues
Unthought-of
VINE-TREE
Vain-confidence
Vain-gloriously
Vain-glory
Vain-hope
Valiant-for-the-truth
Valiant-for-truth
Vile-affection
Vocation-doubters
WATER-BAPTISM
WELL-SPRING
Wain-wood
Want-wit
Water-Baptism
Water-baptism
Wet-eyes
Whitsun-ales
Wild-head
Will-be-will
Will-be-will's
Worldly-glory
Worldly-wiseman
Would-live
Source: https://www.gutenberg.org/ebooks/6049
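As a quick check of what the pattern `^[A-Z].*-.+` accepts, the sketch below applies it with `re.search`, the same call word-finder.py uses, to a few words taken from the lists in this gist. The helper name `matches` and the choice of sample words are illustrative assumptions, not part of the original script.

```python
import re

# Same regex as passed on the word-finder.py command line.
PATTERN = r"^[A-Z].*-.+"


def matches(word, pattern=PATTERN):
    """Return True if the word matches the pattern (hypothetical helper)."""
    return re.search(pattern, word) is not None


# A few words from the lists in this gist, chosen for illustration.
samples = [
    "Clip-promise",  # capital initial, hyphenated
    "Advocate.-ED",  # editorial debris, but still capital + hyphen
    "rob-shop",      # lowercase initial
    "creep-hedge",   # lowercase initial
    "Pickthank",     # no hyphen
    "Slightgod",     # no hyphen
]

found = sorted(w for w in samples if matches(w))
print(found)  # ['Advocate.-ED', 'Clip-promise']
```

This suggests why the edited list was only "mainly" whittled from this output: lowercase compounds such as rob-shop and creep-hedge, and unhyphenated ones such as Pickthank and Slightgod, do not match the pattern and so had to come from elsewhere.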
#!/usr/bin/env python
# encoding: utf-8
"""
Find words matching a pattern in a Project Gutenberg text.
TODO input from text file instead of PG?
"""
from __future__ import print_function
import argparse
import re
import webbrowser
from pprint import pprint


def load_list(the_filename):
    try:
        with open(the_filename, 'r') as f:
            my_list = [
                line.decode('unicode-escape').rstrip(u'\n') for line in f]
    except IOError:
        my_list = []
    return my_list


def cut_verbs(cutthroats):
    """Given a list of cutthroats, find the verb- stems.
    Ignores unhyphenated.
    """
    verbs = set()
    for cutthroat in cutthroats:
        if "-" in cutthroat:
            first_word = cutthroat.split()[0].split(",")[0]
            if "-" in first_word:
                verb = first_word.split("-")[0]
                verbs.add(verb.lower())
                # print(verb, "\t", first_word, "\t", cutthroat)
    return verbs


def text_from_pg(id_number):
    # https://github.com/c-w/Gutenberg
    from gutenberg.acquire import load_etext
    from gutenberg.cleanup import strip_headers
    # text = strip_headers(load_etext(id_number)).strip()
    text = load_etext(id_number).strip()
    return text


def words_from_text(text):
    """Split the text into a set of words"""
    # https://textblob.readthedocs.org/en/dev/
    print("Split text into words")
    from textblob import TextBlob
    blob = TextBlob(text)
    # return set(word.lower() for word in blob.words)
    return set(blob.words)


def open_url(url):
    if not args.no_web:
        webbrowser.open(url, new=2)  # 2 = open in a new tab, if possible


def print_it(text):
    """cmd.exe cannot do Unicode so encode first"""
    print(text.encode('utf-8'))


def commafy(value):
    """Add thousands commas"""
    return "{:,}".format(value)


def summarise(some_set, text):
    if len(some_set):
        print("\nFound", len(some_set), text + ":\n")
        some_set = sorted(some_set)
        print_it("\n".join(some_set))
    else:
        print("\nFound no " + text + ".\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="TODO",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        'pattern',
        # default='not-cutthroats.txt',
        help="Input regex")
    parser.add_argument(
        '-pg', '--gutenberg',
        type=int, default=2701,
        help="ID number of a Project Gutenberg text")
    parser.add_argument(
        '-nw', '--no-web', action='store_true',
        help="Don't open a web browser to show the source file")
    args = parser.parse_args()

    url = "https://www.gutenberg.org/ebooks/" + str(args.gutenberg)

    text = text_from_pg(args.gutenberg)
    print("Text has", commafy(len(text)), "characters")
    # pprint(text)

    words = words_from_text(text)
    print("Text has", commafy(len(words)), "unique words")

    found = set()
    for word in words:
        if word not in found and re.search(args.pattern, word):
            found.add(word)
            # if len(found_unknown) == 1:
            #     open_url(url)

    summarise(found, "matching words")
    print("\nSource: " + url)

# End of file
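The script defines `cut_verbs` but never calls it. A minimal sketch of how it might be applied to entries from the edited cutthroat list is given below; the function body repeats the script's logic so the example runs on its own, and the `entries` sample is an illustrative assumption rather than anything the gist itself feeds in.

```python
def cut_verbs(cutthroats):
    """Same logic as cut_verbs in word-finder.py, repeated so this runs alone."""
    verbs = set()
    for cutthroat in cutthroats:
        if "-" in cutthroat:
            # Only the first whitespace-separated token is inspected.
            first_word = cutthroat.split()[0].split(",")[0]
            if "-" in first_word:
                verbs.add(first_word.split("-")[0].lower())
    return verbs


# Sample entries copied from the edited list (illustrative selection).
entries = [
    'Clip-promise, "a notorious villain"',
    "Save-self",
    "Linger-after-lust",
    "Pickthank, a witness",       # unhyphenated, so ignored
    "Mr. Dam-man, commissioner",  # first token is "Mr.", so skipped
]

stems = sorted(cut_verbs(entries))
print(stems)  # ['clip', 'linger', 'save']
```

As written, the function looks only at the first token of each entry, so names that begin with a title such as "Mr.", "Mrs." or "Lord" contribute no stem. Stripping titles before calling it would be one way to include them, though that is a guess at the intent rather than something the gist does.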