Skip to content

Instantly share code, notes, and snippets.

@jiffyclub
Created December 20, 2011 02:31
Show Gist options
  • Save jiffyclub/1499949 to your computer and use it in GitHub Desktop.
Save jiffyclub/1499949 to your computer and use it in GitHub Desktop.
Python, grammar based solution for part two of www.ai-class.com Natural Language Processing problem.
"""
This code is for part two of the AI Class optional Natural Language Processing
programming assignment. See youtu.be/KuSg1wcty3s for the complete problem
definition.
We are given a sentence that spans several lines and has been shredded
vertically so that there are 2 characters in each column. The columns are
arranged in random order and it's our job to assemble them into the correct
order.
In this code I use a letter-by-letter grammar model to reconstruct the original.
The first column is identified by looking for a capital letter in the first
column of the first row. Successive columns are added next to the first and
kept if they pass a series of tests:
* No word may have 4 consecutive consonants
* Every complete word must contain a vowel
* Capital letters may not be preceded by another letter.
* Punctuation must be followed by a space, except quotation marks,
which must be either preceded or followed by a space.
* Only 'a', 'A', and 'I' are allowed to occur by themselves.
The above list of rules proved ambiguous so I had to add more arbitrary rules
to get to the final reconstruction, like that no word should end in 'i', or
the string 'lang' should be followed by a 'u' (to get 'language'). These rules
make this code extremely non-general.
This technique proved to be fast but it had no way of recognizing whether it was
constructing English words or just nonsense that matched my rules.
Contrast this method with the dictionary method in gist.github.com/1499952,
which is much slower but able to recognize real English words.
I think it could be effective to combine the grammar and dictionary methods.
Many potential strings could be quickly ruled invalid by grammar rules and
only after passing those rules would a word be subjected to the more
time consuming dictionary tests.
Another potential enhancement to this code would be to make it probabilistic.
Here either a string fits the rules or it doesn't, which is very rigid. In a
more general code you'd want a probabilistic model that rates strings' fitness
based on how often similar strings occur in actual language.
The jumbled sentence is:
|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |
Reconstructed it reads:
Claude Shannon founded information
theory, which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this problem,
with the paper titled "A Mathematical
Theory of Communication," published
in this year.
Author
------
@jiffyclub
git.io/jiffyclub
"""
import string
S = [
'de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on',
'ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h ',
'ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c ',
'he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br',
'wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e ',
' t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er',
'ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm',
'hi| | |in| | | t| | | | |ye| |ar| |s | | |. ']
VOWELS = 'aeiouAEIOU'
CONSONANTS = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'
def find_first_column(l):
"""
The first letter of the first row should be a capital
letter. Find the index of the column that meets this requirement.
(This is using the prior knowledge that only one column will meet
this condition.)
Parameters
----------
l : list of str
First row of paragraph as a list of two character strings.
Returns
-------
ind : int
Index of best first column.
"""
for i,s in enumerate(l):
if s[0] in string.uppercase:
return i
def check_rule_no_four_consonants(s):
"""
Returns False if a word in a string contains 4 successive consonants.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if a word in the string contains 4 successive consonants,
True otherwise.
"""
split = s.split()
for word in split:
num = 0
for c in word:
if c in CONSONANTS:
num += 1
if num >= 3:
return False
else:
num = 0
return True
def check_rule_capital_letters(s):
"""
Capital letters must not be preceded by another letter.
Returns False if that is not the case for a given string.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
Returns False if string violates capital letter rules.
"""
for i,c in enumerate(s[1:],1):
if c in string.uppercase:
if s[i-1] in string.letters:
return False
return True
def check_rule_has_a_vowel(s):
"""
All words in a string must contain a vowel. Returns False if that
is not the case. Words shorter than 2 letters are not tested.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if a word in `s` is all consonants.
"""
split = s.split()
for word in split:
if len(word) < 4 and s.endswith(word):
continue
num = len([c for c in word if c in VOWELS])
if num == 0:
return False
return True
def check_rule_no_bad_endings(s):
"""
Words don't usually end in 'i', return False if a word does.
Words less than 2 letters or is not followed by a space it is not tested.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if a word ends in something strang, True otherwise.
"""
forbidden = 'i'
split = s.split()
for word in split:
if len(word) < 2 or s.endswith(word):
continue
if word[-1] in forbidden:
return False
return True
def check_rule_punctuation(s):
"""
Punctuation must always be followed by a space or a quoation mark.
Quotation marks may be preceded by a space or followed by a space,
but must be adjacent to a space. Return False if that is not the case.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if punctuation rules are broken, True otherwise.
"""
for i, c in enumerate(s[:-1]):
if c in string.punctuation and c != '"' and s[i+1] not in (' "'):
return False
elif c == '"' and not (s[i-1] == ' ' or s[i+1] == ' '):
return False
return True
def check_rule_single_letters(s):
"""
The only letters allowed to be by themselves are 'a', 'A', and 'I'.
Returns False if that's not the case.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if any forbidden single letters are found, True otherwise.
"""
allowed = 'aAI'
split = s.split()
for word in split:
if len(word) == 1 and not s.endswith(word):
if word not in allowed:
return False
return True
def check_rule_words(s):
"""
Certain letter combinations are just likely. For example, if you have
'oul', you can be pretty certain the next letter is 'd'. This function
returns False if a word in the input string violates something like that.
(This is the last rule I implemented, and is a bit arbitrary.)
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
False if nonsensical letter combinations are found.
"""
split = s.split()
for word in split:
if 'oul' in word and not s.endswith('oul'):
if 'ould' not in word:
return False
if word.startswith('whic') and not s.endswith('whic'):
if word != 'which':
return False
if 'aa' in word:
return False
if 'ae' in word:
return False
if word.startswith('lang') and not s.endswith('lang'):
if not word.startswith('langu'):
return False
if 'titlec' in word:
return False
return True
def check_string(s):
"""
Check whether a string meets all of the language rules. Returns
True if all rules are met.
Parameters
----------
s : str
String to check.
Returns
-------
correct : bool
True if all rules are met, False otherwise.
"""
correct = (check_rule_has_a_vowel(s) and
check_rule_capital_letters(s) and
check_rule_no_four_consonants(s) and
check_rule_no_bad_endings(s) and
check_rule_punctuation(s) and
check_rule_single_letters(s) and
check_rule_words(s))
return correct
def check_column(decoded, paragraph, col):
"""
Check whether a given column of the paragraph fits in with what's
already been added to the decoded paragraph.
Parameters
----------
decoded : list of str
List of paragraph lines.
paragraph : list of lists
List of lists containing paragraph lines split into two
character chunks.
col : int
Column to add to `decoded`.
Returns
-------
correct : bool
True if column `col` is consistent with what has already
been decoded, False otherwise.
"""
correct = [check_string(decoded[i] + paragraph[i][col])
for i in range(len(decoded))]
if correct.count(True) != len(decoded):
return False
else:
return True
def add_column(decoded, paragraph, col):
"""
Add a column from the paragraph to the decoded container.
Also removes the column from use.
Parameters
----------
decoded : list of str
List of paragraph lines.
paragraph : list of lists
List of lists containing paragraph lines split into two
character chunks.
col : int
Column to add to `decoded`.
Returns
-------
decoded : list of str
Paragraph lines with column appended.
paragraph : list of lists
Line chunks with column `col` removed.
"""
for i, line in enumerate(decoded):
decoded[i] = line + paragraph[i].pop(col)
return decoded, paragraph
def main():
"""
Start with the column identified by `find_first_column` and then
successively add columns that pass all of the rule tests.
"""
# turn S into list of lists
paragraph = [s.split('|') for s in S]
# make a container for decoded paragraph
decoded = [''] * len(paragraph)
# this will contain the indices of the columns in their corrected
# order.
order = []
# find the first column and remove it from paragraph
order.append(find_first_column(paragraph[0]))
decoded, paragraph = add_column(decoded, paragraph, order[-1])
# now go through columns finding one that goes after the last
# column in order
while True:
matched = False
for col in range(len(paragraph[0])):
if check_column(decoded, paragraph, col):
order.append(col)
decoded, paragraph = add_column(decoded, paragraph, order[-1])
matched = True
break
if not matched:
print decoded
raise StandardError("No column matched your rules!")
# check if we've matched all columns
if len(paragraph[0]) == 0:
break
for line in decoded:
print line
if __name__ == '__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment