Created
December 20, 2011 02:31
-
-
Save jiffyclub/1499949 to your computer and use it in GitHub Desktop.
Python, grammar based solution for part two of www.ai-class.com Natural Language Processing problem.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
This code is for part two of the AI Class optional Natural Language Processing | |
programming assignment. See youtu.be/KuSg1wcty3s for the complete problem | |
definition. | |
We are given a sentence that spans several lines and has been shredded | |
vertically so that there are 2 characters in each column. The columns are | |
arranged in random order and it's our job to assemble them into the correct | |
order. | |
In this code I use a letter-by-letter grammar model to reconstruct the original. | |
The first column is identified by looking for a capital letter in the first | |
column of the first row. Successive columns are added next to the first and | |
kept if they pass a series of tests: | |
* No word may have 4 consecutive consonants | |
* Every complete word must contain a vowel | |
* Capital letters may not be preceded by another letter. | |
* Punctuation must be followed by a space, except quotation marks, | |
which must be either preceded or followed by a space. | |
* Only 'a', 'A', and 'I' are allowed to occur by themselves. | |
The above list of rules proved ambiguous so I had to add more arbitrary rules | |
to get to the final reconstruction, like that no word should end in 'i', or | |
the string 'lang' should be followed by a 'u' (to get 'language'). These rules | |
make this code extremely non-general. | |
This technique proved to be fast but it had no way of recognizing whether it was | |
constructing English words or just nonsense that matched my rules. | |
Contrast this method with the dictionary method in gist.github.com/1499952, | |
which is much slower but able to recognize real English words. | |
I think it could be effective to combine the grammar and dictionary methods. | |
Many potential strings could be quickly ruled invalid by grammar rules and | |
only after passing those rules would a word be subjected to the more | |
time consuming dictionary tests. | |
Another potential enhancement to this code would be to make it probabilistic. | |
Here either a string fits the rules or it doesn't, which is very rigid. In a | |
more general code you'd want a probabilistic model that rates strings' fitness | |
based on how often similar strings occur in actual language. | |
The jumbled sentence is: | |
|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on| | |
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h | | |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c | | |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br| | |
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e | | |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er| | |
|ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm| | |
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. | | |
Reconstructed it reads: | |
Claude Shannon founded information | |
theory, which is the basis of | |
probabilistic language models and | |
of the code breaking methods that | |
you would use to solve this problem, | |
with the paper titled "A Mathematical | |
Theory of Communication," published | |
in this year. | |
Author | |
------ | |
@jiffyclub | |
git.io/jiffyclub | |
""" | |
import string | |
S = [ | |
'de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on', | |
'ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h ', | |
'ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c ', | |
'he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br', | |
'wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e ', | |
' t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er', | |
'ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm', | |
'hi| | |in| | | t| | | | |ye| |ar| |s | | |. '] | |
VOWELS = 'aeiouAEIOU' | |
CONSONANTS = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ' | |
def find_first_column(l): | |
""" | |
The first letter of the first row should be a capital | |
letter. Find the index of the column that meets this requirement. | |
(This is using the prior knowledge that only one column will meet | |
this condition.) | |
Parameters | |
---------- | |
l : list of str | |
First row of paragraph as a list of two character strings. | |
Returns | |
------- | |
ind : int | |
Index of best first column. | |
""" | |
for i,s in enumerate(l): | |
if s[0] in string.uppercase: | |
return i | |
def check_rule_no_four_consonants(s): | |
""" | |
Returns False if a word in a string contains 4 successive consonants. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if a word in the string contains 4 successive consonants, | |
True otherwise. | |
""" | |
split = s.split() | |
for word in split: | |
num = 0 | |
for c in word: | |
if c in CONSONANTS: | |
num += 1 | |
if num >= 3: | |
return False | |
else: | |
num = 0 | |
return True | |
def check_rule_capital_letters(s): | |
""" | |
Capital letters must not be preceded by another letter. | |
Returns False if that is not the case for a given string. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
Returns False if string violates capital letter rules. | |
""" | |
for i,c in enumerate(s[1:],1): | |
if c in string.uppercase: | |
if s[i-1] in string.letters: | |
return False | |
return True | |
def check_rule_has_a_vowel(s): | |
""" | |
All words in a string must contain a vowel. Returns False if that | |
is not the case. Words shorter than 2 letters are not tested. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if a word in `s` is all consonants. | |
""" | |
split = s.split() | |
for word in split: | |
if len(word) < 4 and s.endswith(word): | |
continue | |
num = len([c for c in word if c in VOWELS]) | |
if num == 0: | |
return False | |
return True | |
def check_rule_no_bad_endings(s): | |
""" | |
Words don't usually end in 'i', return False if a word does. | |
Words less than 2 letters or is not followed by a space it is not tested. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if a word ends in something strang, True otherwise. | |
""" | |
forbidden = 'i' | |
split = s.split() | |
for word in split: | |
if len(word) < 2 or s.endswith(word): | |
continue | |
if word[-1] in forbidden: | |
return False | |
return True | |
def check_rule_punctuation(s): | |
""" | |
Punctuation must always be followed by a space or a quoation mark. | |
Quotation marks may be preceded by a space or followed by a space, | |
but must be adjacent to a space. Return False if that is not the case. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if punctuation rules are broken, True otherwise. | |
""" | |
for i, c in enumerate(s[:-1]): | |
if c in string.punctuation and c != '"' and s[i+1] not in (' "'): | |
return False | |
elif c == '"' and not (s[i-1] == ' ' or s[i+1] == ' '): | |
return False | |
return True | |
def check_rule_single_letters(s): | |
""" | |
The only letters allowed to be by themselves are 'a', 'A', and 'I'. | |
Returns False if that's not the case. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if any forbidden single letters are found, True otherwise. | |
""" | |
allowed = 'aAI' | |
split = s.split() | |
for word in split: | |
if len(word) == 1 and not s.endswith(word): | |
if word not in allowed: | |
return False | |
return True | |
def check_rule_words(s): | |
""" | |
Certain letter combinations are just likely. For example, if you have | |
'oul', you can be pretty certain the next letter is 'd'. This function | |
returns False if a word in the input string violates something like that. | |
(This is the last rule I implemented, and is a bit arbitrary.) | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
False if nonsensical letter combinations are found. | |
""" | |
split = s.split() | |
for word in split: | |
if 'oul' in word and not s.endswith('oul'): | |
if 'ould' not in word: | |
return False | |
if word.startswith('whic') and not s.endswith('whic'): | |
if word != 'which': | |
return False | |
if 'aa' in word: | |
return False | |
if 'ae' in word: | |
return False | |
if word.startswith('lang') and not s.endswith('lang'): | |
if not word.startswith('langu'): | |
return False | |
if 'titlec' in word: | |
return False | |
return True | |
def check_string(s): | |
""" | |
Check whether a string meets all of the language rules. Returns | |
True if all rules are met. | |
Parameters | |
---------- | |
s : str | |
String to check. | |
Returns | |
------- | |
correct : bool | |
True if all rules are met, False otherwise. | |
""" | |
correct = (check_rule_has_a_vowel(s) and | |
check_rule_capital_letters(s) and | |
check_rule_no_four_consonants(s) and | |
check_rule_no_bad_endings(s) and | |
check_rule_punctuation(s) and | |
check_rule_single_letters(s) and | |
check_rule_words(s)) | |
return correct | |
def check_column(decoded, paragraph, col): | |
""" | |
Check whether a given column of the paragraph fits in with what's | |
already been added to the decoded paragraph. | |
Parameters | |
---------- | |
decoded : list of str | |
List of paragraph lines. | |
paragraph : list of lists | |
List of lists containing paragraph lines split into two | |
character chunks. | |
col : int | |
Column to add to `decoded`. | |
Returns | |
------- | |
correct : bool | |
True if column `col` is consistent with what has already | |
been decoded, False otherwise. | |
""" | |
correct = [check_string(decoded[i] + paragraph[i][col]) | |
for i in range(len(decoded))] | |
if correct.count(True) != len(decoded): | |
return False | |
else: | |
return True | |
def add_column(decoded, paragraph, col): | |
""" | |
Add a column from the paragraph to the decoded container. | |
Also removes the column from use. | |
Parameters | |
---------- | |
decoded : list of str | |
List of paragraph lines. | |
paragraph : list of lists | |
List of lists containing paragraph lines split into two | |
character chunks. | |
col : int | |
Column to add to `decoded`. | |
Returns | |
------- | |
decoded : list of str | |
Paragraph lines with column appended. | |
paragraph : list of lists | |
Line chunks with column `col` removed. | |
""" | |
for i, line in enumerate(decoded): | |
decoded[i] = line + paragraph[i].pop(col) | |
return decoded, paragraph | |
def main(): | |
""" | |
Start with the column identified by `find_first_column` and then | |
successively add columns that pass all of the rule tests. | |
""" | |
# turn S into list of lists | |
paragraph = [s.split('|') for s in S] | |
# make a container for decoded paragraph | |
decoded = [''] * len(paragraph) | |
# this will contain the indices of the columns in their corrected | |
# order. | |
order = [] | |
# find the first column and remove it from paragraph | |
order.append(find_first_column(paragraph[0])) | |
decoded, paragraph = add_column(decoded, paragraph, order[-1]) | |
# now go through columns finding one that goes after the last | |
# column in order | |
while True: | |
matched = False | |
for col in range(len(paragraph[0])): | |
if check_column(decoded, paragraph, col): | |
order.append(col) | |
decoded, paragraph = add_column(decoded, paragraph, order[-1]) | |
matched = True | |
break | |
if not matched: | |
print decoded | |
raise StandardError("No column matched your rules!") | |
# check if we've matched all columns | |
if len(paragraph[0]) == 0: | |
break | |
for line in decoded: | |
print line | |
if __name__ == '__main__': | |
main() |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment