
@jimfingal
Last active August 29, 2015 14:13
A Python port of Darius Kazemi's "Aphorism detection for fun but definitely not profit" -- http://tinysubversions.com/notes/aphorism-detection

Aphorism Detection in Python

After reading Darius Kazemi's post, "Aphorism detection for fun but definitely not profit", I wanted in -- I've done a number of text-focused bots, but none that did anything more advanced than tokenizing text and making use of n-grams with Markov chains. I have some experience with NLP in Python, so I thought it would be fun to port it.

The essence of Darius's algorithm is:

  • Read in Corpus
  • Tokenize corpus into sentences
  • Filter out sentences that match a few basic patterns
    • Sentences must be greater than 20 chars and less than 50
    • Sentences must not contain quotes
    • Sentences must not contain personal pronouns
  • Filter out sentences that don't contain the Part Of Speech pattern [noun] [verb]

Python has a lot of tools for natural language processing so I figured I would use those.

Reading in the corpus is pretty straightforward -- you can use the codecs module from the standard library to read UTF-8 files:

import codecs

with codecs.open("./gutenberg.txt", 'r', 'utf-8') as f:
    data = f.read()
    process_aphorisms(data)

We haven't written process_aphorisms yet, but we'll get around to it.

Oh, hey, let's write the skeleton of that.

def process_aphorisms(data):

    # Tokenize corpus into sentences
    # Filter out sentences that match a few basic patterns
    #    - Sentences must be greater than 20 chars and less than 50
    #    - Sentences must not contain quotes
    #    - Sentences must not contain personal pronouns
    # Filter out sentences that don't contain the POS pattern [noun] [verb]

    pass

So far it does nothing. Let's tokenize! NLTK comes with a bunch of default tokenizers. To use them you have to install the NLTK data packages -- there are fancy ways of installing these into local directories if you, say, want to use NLTK on Heroku and only have reliable access to the local directory, but we won't get into that now.
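
If you haven't installed the data yet, the simplest route is NLTK's built-in downloader; this is a one-liner, assuming you're happy with the default nltk_data location:

import nltk

# Fetches the punkt sentence tokenizer models used below.
nltk.download('punkt')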

Sentence detection is pretty straightforward -- the nltk data package includes a pre-trained tokenizer for English, which we can use like this:

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def tokenize_sentences(text):
    return sent_detector.tokenize(text.strip())
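
A quick sanity check on a made-up string:

print tokenize_sentences("It was the best of times. It was the worst of times.")
# ['It was the best of times.', 'It was the worst of times.']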

Which leads us to:

def process_aphorisms(data):

    # Tokenize corpus into sentences
    sentences = tokenize_sentences(data)
    print "Number of sentences: %s" % len(sentences)

    # Number of sentences: 173100

    # Filter out sentences that match a few basic patterns
    #    - Sentences must be greater than 20 chars and less than 50
    #    - Sentences must not contain quotes
    #    - Sentences must not contain personal pronouns

    # Filter out sentences that don't contain the POS pattern [noun] [verb]

In JavaScript you need a library like Underscore to get basic functional idioms like filter, but Python has nice built-ins for this sort of thing.

Rather than one big if/else block, let's apply a list of predicate functions to each sentence and filter out the sentences for which any predicate returns True.
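
As a sketch, the idea looks something like this, using the predicate functions we're about to write (the final version below just combines them with and):

bad_filters = [contains_quotes, contains_personal_pronouns]

def is_bad(sentence):
    # any() short-circuits at the first predicate that returns True
    return any(f(sentence) for f in bad_filters)

filtered = [sentence for sentence in sentences if not is_bad(sentence)]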

For sentence length, let's make a reusable higher-order function to help us:

def within_sentence_length(min_len, max_len):
    length_filter = lambda x: len(x) > min_len and len(x) < max_len
    return length_filter

Calling that returns a function that tells us, as a boolean, whether a sentence falls within a particular length range.
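
For example:

good_length = within_sentence_length(20, 50)
print good_length("Brevity is the soul of wit.")   # True -- 27 characters
print good_length("Too short.")                    # False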

For quotes, we want there to be no double quotes and at most one single quote, in case there is a contraction such as "don't":

def contains_quotes(sentence):
    return "\"" in sentence or sentence.count("'") > 1

Finally, for personal pronouns, we can use Darius' regex approach, but combine them all into a single compiled expression:

import re 

pp_regex = re.compile(r"""(?P<pronoun>
                            \bI\b|
                            \bmy\b|
                            \bme\b|
                            \bhe\b|
                            \bshe\b|
                            \byou\b|
                            \bhis\b|
                            \bher\b)""", re.VERBOSE | re.IGNORECASE)

def contains_personal_pronouns(sentence):
    return pp_regex.search(sentence)

If the regex looks unfamiliar, the gloss is: (?P<pronoun>...) is a named capturing group, which means we can see what matched later. \b matches word boundaries. The second argument to re.compile is a set of option flags -- re.VERBOSE allows us to write the pattern across multiple lines with comments, and re.IGNORECASE makes it case-insensitive.
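
To see the named group in action:

match = pp_regex.search("But she was not amused.")
if match:
    print match.group('pronoun')
# she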

We can combine these into one function, and use a list comprehension with an if condition to filter down to only the good sentences:

def good_sentence(sentence):
    return within_sentence_length(20, 50)(sentence) and \
            not contains_personal_pronouns(sentence) and \
            not contains_quotes(sentence)

def process_aphorisms(data):

    sentences = tokenize_sentences(data)
    print "Number of sentences: %s" % len(sentences)
    # Number of sentences: 173100

    filtered = [sentence for sentence in sentences if good_sentence(sentence)]
    print "Number after rough filter: %s" % len(filtered)
    # Number after rough filter: 14307

    # Filter out sentences that don't contain the POS pattern [noun] [verb]

Almost there! For the part-of-speech pattern matching, we can make use of NLTK's built-in tagger and write our own grammar to parse out nouns followed by present-tense verbs.

For POS-tagging, we first have to word-tokenize a sentence; basic tagging looks like:

def tokenize_words(text):
    return nltk.tokenize.word_tokenize(text)

def get_pos_tags(sentence):
    tokens = tokenize_words(sentence)
    return nltk.tag.pos_tag(tokens)

We can use the RegexpParser class and a very simple grammar to detect nouns followed by verbs:

grammar = r"""
    SPVBS:
        {<VBP|VBZ>}  # singular present verbs
        
    SPNVP:
        {<NN.*> <SPVBS>}  # Nouns followed by singular present verbs
"""

chunker = nltk.RegexpParser(grammar)

def parse_sentence_tree(sentence):
    return chunker.parse(get_pos_tags(sentence))

Example usage would be:

qbf = "The quick brown fox jumped over the lazy dogs."
spnvp = "Live axle drives are souped."

print get_pos_tags(qbf)
print parse_sentence_tree(qbf)

# [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dogs', 'NNS'), ('.', '.')]
# (S
#  The/DT
#  quick/NN
#  brown/NN
#  fox/NN
#  jumped/VBD
#  over/IN
#  the/DT
#  lazy/NN
#  dogs/NNS
#   ./.)

print get_pos_tags(spnvp)
print parse_sentence_tree(spnvp)

# [('Live', 'JJ'), ('axle', 'NN'), ('drives', 'NNS'), ('are', 'VBP'), ('souped', 'VBN'), ('.', '.')]
# (S Live/JJ axle/NN (SPNVP drives/NNS (SPVBS are/VBP)) souped/VBN ./.)

chunker.parse gives us back a tree object; we can detect whether any subtree is one of our singular-present noun-verb phrases with this helper:

def contains_subtree(sentence, label):
    parsed = parse_sentence_tree(sentence)
    for tree in parsed.subtrees():
        if tree.label() == label:
            return True
    return False
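
Trying it on the two example sentences from above:

print contains_subtree(spnvp, "SPNVP")   # True
print contains_subtree(qbf, "SPNVP")     # False -- 'jumped' is past tense, so no SPVBS chunk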

Putting that all together, our aphorism function looks like:

def process_aphorisms(data):
    
    sentences = tokenize_sentences(data)
    print "Number of sentences: %s" % len(sentences)

    filtered = [sentence for sentence in sentences if good_sentence(sentence)]
    print "Number after rough filter: %s" % len(filtered)

    pos_filtered = filter(lambda sentence: contains_subtree(sentence, "SPNVP"), filtered)

    print "Number after POS filter: %s" % len(pos_filtered)

    print "*" * 80
    for sentence in random.sample(pos_filtered, 50):
        print sentence

Example output with the Gutenberg corpus:

Number of sentences: 173100
Number after rough filter: 14307
Number after POS filter: 1339
********************************************************************************
Death is the highest form of life.
278.--Wanderer, who art thou?
What time is the funeral?
Most American women do.
Silk flash rich stockings white.
ACHILLES CONTENDING WITH THE RIVERS.
In fact, Bunbury is dead.
But Bob Doran shouts out of him.
Those farmers are always grumbling.
The future is yet full of trial and success.
what an aid on Mars's side is seen!
Then justice is not good for much.
what is it to us what the rest do or think?
The combination is a dreadful one.
Every word is so deep, Leopold.
Adele is full of whims at such times.
--T is viceregal lodge.
The line is worth a hundred pages of fustian.
--What sort of a kip is this?
Women have no consideration!
The hearth is desolate.
[Dr. Chasuble looks astounded.]
Upon incertitude, upon unlikelihood.
[Runs back into the house.]
Prayer is allpowerful.
Moment before the next Lessing says.
Faut que jeunesse se passe.
John is practical in the extreme.
Your guardian has a very emotional nature.
what persons and cities are here?
Mrs Norman W. Tupper loves officer Taylor.
The Shepherdsons done the same.
While the kettle is boiling.
Conventionality is not morality.
The truth is always respectable.
A thousand pounds reward.
The scene is the same as in the former.
Please don't touch the cucumber sandwiches.
Everybody is clever nowadays.
And who knows what _may_ happen?
Wanted live man for spirit counter.
Art is always more abstract than we fancy.
The white whale is their demigorgon.
Yet Pelagie does not believe it.
England is in the hands of the jews.
It is fortunate Mary is so good with the baby.
--And our eyes are on Europe, says the citizen.
shall the clouds close again upon thee?
--Grand is no name for it, said Buck Mulligan.
Hector rises from the Stygian shades!

Some duds, but a number of those look like great aphorisms! I think I like it when the sentence starts with a noun-verb phrase. One last enhancement:

grammar = r"""
    SPVBS:
        {<VBP|VBZ>}  # Singular present verbs
        
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    
    SPNVP:
        {<NP> <SPVBS>}
"""

# ...

def sentence_starts_with_label(sentence, label):
    parsed = parse_sentence_tree(sentence)
    subtrees = [tree for tree in parsed.subtrees()]
    return len(subtrees) > 1 and subtrees[1].label() == label

# ...

def process_aphorisms(data):
    
    sentences = tokenize_sentences(data)
    print "Number of sentences: %s" % len(sentences)
    
    filtered = [sentence for sentence in sentences if good_sentence(sentence)]
    print "Number after rough filter: %s" % len(filtered)
    
    pos_filtered = filter(lambda sentence: sentence_starts_with_label(sentence, "SPNVP"), filtered)
    
    print "Number after POS filter: %s" % len(pos_filtered)

    print "*" * 80
    for sentence in random.sample(pos_filtered, 50):
        print sentence

Which gives us:

Number of sentences: 173100
Number after rough filter: 14307
Number after POS filter: 999
********************************************************************************
We must be there before the curtain rises.
Must get those settled really.
And this would not be--circulus vitiosus deus?
O ripen'd joy of womanhood!
Every word is so deep, Leopold.
The dear child is still asleep.
The question is, what did the archbishop find?'
Grave Gladstone sees him level, Bloom for Bloom.
After God Shakespeare has created most.
the time approaches for our departure.
Women whisper eagerly.
What attractions are these beyond any before?
The line is immaterial.
Trade follows the flag.
Horseness is the whatness of allhorse.
Five great motions are peculiar to it.
Good poor brutes they look.
X is Davy's publichouse in upper Leeson street.
The Gaelic league wants something in Irish.
But the sentiment has likewise its moral quality.
But your name is Ernest.
The author arrives at England.
Our poor mother is sadly grieved.
This marriage is quite right.
K is Knockmaroon gate.
Mildew has got into the canvas.
Where the bugger is it?
A soft answer turns away wrath.
_Amor matris:_ subjective and objective genitive.
The Academy is too large and too vulgar.
God knows what poxy bowsy left them off.
For which the art has to consider and provide?
HELMER goes and unlocks the hall door.)
Wonder is it like that.
How many souls perish in its tumult!
Must go back for that lotion.
The Killer is never hunted.
Love loves to love love.
No: the appointment is in London.
Numerous houses are razed to the ground.
The letter is lying there in the box.
But this seldom happens.
All men live enveloped in whale-lines.
Might have made a worse fool of myself however.
A sunburst appears in the northwest.
The twining stresses, two by two.
Onlookers see most of the game.
This business has had a very bad effect upon him.
Everybody is clever nowadays.
For oil lines special gaskets are necessary.

Looks like an improvement! There are a few more things we could do (filter out sentences with question marks, proper names, etc.), but let's end there.
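
If you did want to keep going, those extra filters would slot straight into good_sentence. A minimal sketch of two of them -- my own additions, not part of Darius's algorithm:

def contains_question(sentence):
    return "?" in sentence

def contains_stage_direction(sentence):
    # Catches bracketed editorial noise like "[Runs back into the house.]"
    return "[" in sentence or "]" in sentence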

The full script, for reference:

import nltk
import re
import codecs
import random


# Generate Aphorisms

def process_aphorisms(data):
    sentences = tokenize_sentences(data)
    print "Number of sentences: %s" % len(sentences)

    filtered = [sentence for sentence in sentences if good_sentence(sentence)]
    print "Number after rough filter: %s" % len(filtered)

    # pos_filtered = filter(lambda sentence: contains_subtree(sentence, "SPNVP"), filtered)
    pos_filtered = filter(lambda sentence: sentence_starts_with_label(sentence, "SPNVP"), filtered)
    print "Number after POS filter: %s" % len(pos_filtered)

    print "*" * 80
    for sentence in random.sample(pos_filtered, 50):
        print sentence


# NLP Stuff

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def tokenize_sentences(text):
    return sent_detector.tokenize(text.strip())

def tokenize_words(text):
    return nltk.tokenize.word_tokenize(text)

def get_pos_tags(sentence):
    tokens = tokenize_words(sentence)
    return nltk.tag.pos_tag(tokens)

grammar = r"""
    SPVBS:
        {<VBP|VBZ>}  # Singular present verbs

    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...

    SPNVP:
        {<NP> <SPVBS>}

    APHORISM:
        {<NP> <SPVBS> <NP>}
"""

# grammar = r"""
#     SPVBS:
#         {<VBP|VBZ>}  # singular present verbs
#
#     SPNVP:
#         {<NN.*> <SPVBS>}  # Nouns followed by singular present verbs
# """

chunker = nltk.RegexpParser(grammar)

def parse_sentence_tree(sentence):
    return chunker.parse(get_pos_tags(sentence))

def contains_subtree(sentence, label):
    parsed = parse_sentence_tree(sentence)
    for tree in parsed.subtrees():
        if tree.label() == label:
            return True
    return False

def sentence_starts_with_label(sentence, label):
    parsed = parse_sentence_tree(sentence)
    subtrees = [tree for tree in parsed.subtrees()]
    return len(subtrees) > 1 and subtrees[1].label() == label


# Cleanup Filters

def within_sentence_length(min_len, max_len):
    length_filter = lambda x: len(x) > min_len and len(x) < max_len
    return length_filter

pp_regex = re.compile(r"""(?P<pronoun>
                            \bI\b|
                            \bmy\b|
                            \bme\b|
                            \bhe\b|
                            \bshe\b|
                            \byou\b|
                            \bhis\b|
                            \bher\b)""", re.VERBOSE | re.IGNORECASE)

def contains_personal_pronouns(sentence):
    return pp_regex.search(sentence)

def contains_quotes(sentence):
    return "\"" in sentence or sentence.count("'") > 1

def good_sentence(sentence):
    return within_sentence_length(20, 50)(sentence) and \
            not contains_personal_pronouns(sentence) and \
            not contains_quotes(sentence)

@beaugunderson

Very cool!

Only thing I noticed was this might be clearer than filter/lambda:

def good_sentence(sentence):
    return (within_sentence_length(20, 50)(sentence) and
            not contains_personal_pronouns(sentence) and
            not contains_quotes(sentence))

filtered = [sentence for sentence in sentences if good_sentence(sentence)]

@jimfingal
Author

Thanks for the feedback -- yeah, I was thinking something along those lines too, when it occurred to me that a simple and statement allows the expression to short-circuit if any of the conditions is False, whereas I think the way an all statement would be set up there, all of the functions would have to be evaluated.

I sometimes get over-enamored with the functional approach; this is a good reminder that it's not always the best or clearest. While it would make sense / be more efficient if sentences were a generator, given that it's already a list I think you're right that a simple list comprehension makes things much clearer.
