jiffyclub/aiclass_nlp2-1.py

## aiclass_nlp2-1.py
"""
This code is for part two of the AI Class optional Natural Language Processing
programming assignment. See youtu.be/KuSg1wcty3s for the complete problem
definition.

We are given a sentence that spans several lines and has been shredded
vertically so that there are 2 characters in each column. The columns are
arranged in random order and it's our job to assemble them into the correct
order.

In this code I use a letter-by-letter grammar model to reconstruct the original.
The first column is identified by looking for a capital letter in the first
column of the first row. Successive columns are added next to the first and
kept if they pass a series of tests:

    * No word may have 4 consecutive consonants
    * Every complete word must contain a vowel
    * Capital letters may not be preceded by another letter.
    * Punctuation must be followed by a space, except quotation marks,
      which must be either preceded or followed by a space.
    * Only 'a', 'A', and 'I' are allowed to occur by themselves.

The above list of rules proved ambiguous so I had to add more arbitrary rules
to get to the final reconstruction, like that no word should end in 'i', or
the string 'lang' should be followed by a 'u' (to get 'language'). These rules
make this code extremely non-general.

This technique proved to be fast but it had no way of recognizing whether it was
constructing English words or just nonsense that matched my rules.

Contrast this method with the dictionary method in gist.github.com/1499952,
which is much slower but able to recognize real English words.

I think it could be effective to combine the grammar and dictionary methods.
Many potential strings could be quickly ruled invalid by grammar rules and
only after passing those rules would a word be subjected to the more
time consuming dictionary tests.

Another potential enhancement to this code would be to make it probabilistic.
Here either a string fits the rules or it doesn't, which is very rigid. In a
more general code you'd want a probabilistic model that rates strings' fitness
based on how often similar strings occur in actual language.


The jumbled sentence is:

|de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on|
|ry|  |is|th|is| b|eo|as|  |  |f |wh| o|ic| t|, |  |he|h |
|ab|  |la|pr|od|ge|ob| m|an|  |s |is|el|ti|ng|il|d |ua|c |
|he|  |ea|of|ho| m| t|et|ha|  | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob|  |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|" |io|eo|n,|is|  |bl|f |pu|Co|ic| o|he|at|mm|
|hi|  |  |in|  |  | t|  |  |  |  |ye|  |ar|  |s |  |  |. |

Reconstructed it reads:

Claude Shannon founded information
theory, which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this problem,
with the paper titled "A Mathematical
Theory of Communication," published
in this year.

Author
------
@jiffyclub
git.io/jiffyclub

"""

import string

S = [
'de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on',
'ry|  |is|th|is| b|eo|as|  |  |f |wh| o|ic| t|, |  |he|h ',
'ab|  |la|pr|od|ge|ob| m|an|  |s |is|el|ti|ng|il|d |ua|c ',
'he|  |ea|of|ho| m| t|et|ha|  | t|od|ds|e |ki| c|t |ng|br',
'wo|m,|to|yo|hi|ve|u | t|ob|  |pr|d |s |us| s|ul|le|ol|e ',
' t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er',
'ry|d |un|Th|" |io|eo|n,|is|  |bl|f |pu|Co|ic| o|he|at|mm',
'hi|  |  |in|  |  | t|  |  |  |  |ye|  |ar|  |s |  |  |. ']

VOWELS = 'aeiouAEIOU'
CONSONANTS = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'


def find_first_column(l):
    """
    The first letter of the first row should be a capital
    letter. Find the index of the column that meets this requirement.

    (This is using the prior knowledge that only one column will meet
    this condition.)

    Parameters
    ----------
    l : list of str
        First row of paragraph as a list of two character strings.

    Returns
    -------
    ind : int
        Index of best first column.

    """
    for i,s in enumerate(l):
        if s[0] in string.uppercase:
            return i

def check_rule_no_four_consonants(s):
    """
    Returns False if a word in a string contains 4 successive consonants.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if a word in the string contains 4 successive consonants,
        True otherwise.

    """
    split = s.split()

    for word in split:
        num = 0

        for c in word:
            if c in CONSONANTS:
                num += 1

                if num >= 3:
                    return False

            else:
                num = 0

    return True


def check_rule_capital_letters(s):
    """
    Capital letters must not be preceded by another letter.
    Returns False if that is not the case for a given string.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        Returns False if string violates capital letter rules.

    """
    for i,c in enumerate(s[1:],1):
        if c in string.uppercase:
            if s[i-1] in string.letters:
                return False

    return True


def check_rule_has_a_vowel(s):
    """
    All words in a string must contain a vowel. Returns False if that
    is not the case. Words shorter than 2 letters are not tested.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if a word in `s` is all consonants.

    """
    split = s.split()

    for word in split:
        if len(word) < 4 and s.endswith(word):
            continue

        num = len([c for c in word if c in VOWELS])

        if num == 0:
            return False

    return True


def check_rule_no_bad_endings(s):
    """
    Words don't usually end in 'i', return False if a word does.
    Words less than 2 letters or is not followed by a space it is not tested.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if a word ends in something strang, True otherwise.

    """
    forbidden = 'i'

    split = s.split()

    for word in split:
        if len(word) < 2 or s.endswith(word):
            continue

        if word[-1] in forbidden:
            return False

    return True


def check_rule_punctuation(s):
    """
    Punctuation must always be followed by a space or a quoation mark.
    Quotation marks may be preceded by a space or followed by a space,
    but must be adjacent to a space. Return False if that is not the case.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if punctuation rules are broken, True otherwise.

    """
    for i, c in enumerate(s[:-1]):
        if c in string.punctuation and c != '"' and s[i+1] not in (' "'):
            return False
        elif c == '"' and not (s[i-1] == ' ' or s[i+1] == ' '):
            return False

    return True


def check_rule_single_letters(s):
    """
    The only letters allowed to be by themselves are 'a', 'A', and 'I'.
    Returns False if that's not the case.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if any forbidden single letters are found, True otherwise.

    """
    allowed = 'aAI'

    split = s.split()

    for word in split:
        if len(word) == 1 and not s.endswith(word):
            if word not in allowed:
                return False

    return True


def check_rule_words(s):
    """
    Certain letter combinations are just likely. For example, if you have
    'oul', you can be pretty certain the next letter is 'd'. This function
    returns False if a word in the input string violates something like that.

    (This is the last rule I implemented, and is a bit arbitrary.)

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        False if nonsensical letter combinations are found.

    """
    split = s.split()

    for word in split:
        if 'oul' in word and not s.endswith('oul'):
            if 'ould' not in word:
                return False
        if word.startswith('whic') and not s.endswith('whic'):
            if word != 'which':
                return False
        if 'aa' in word:
            return False
        if 'ae' in word:
            return False
        if word.startswith('lang') and not s.endswith('lang'):
            if not word.startswith('langu'):
                return False
        if 'titlec' in word:
            return False

    return True


def check_string(s):
    """
    Check whether a string meets all of the language rules. Returns
    True if all rules are met.

    Parameters
    ----------
    s : str
        String to check.

    Returns
    -------
    correct : bool
        True if all rules are met, False otherwise.

    """
    correct = (check_rule_has_a_vowel(s) and
                check_rule_capital_letters(s) and
                check_rule_no_four_consonants(s) and
                check_rule_no_bad_endings(s) and
                check_rule_punctuation(s) and
                check_rule_single_letters(s) and
                check_rule_words(s))

    return correct


def check_column(decoded, paragraph, col):
    """
    Check whether a given column of the paragraph fits in with what's
    already been added to the decoded paragraph.

    Parameters
    ----------
    decoded : list of str
        List of paragraph lines.

    paragraph : list of lists
        List of lists containing paragraph lines split into two
        character chunks.

    col : int
        Column to add to `decoded`.

    Returns
    -------
    correct : bool
        True if column `col` is consistent with what has already
        been decoded, False otherwise.

    """
    correct = [check_string(decoded[i] + paragraph[i][col])
                for i in range(len(decoded))]

    if correct.count(True) != len(decoded):
        return False
    else:
        return True


def add_column(decoded, paragraph, col):
    """
    Add a column from the paragraph to the decoded container.
    Also removes the column from use.

    Parameters
    ----------
    decoded : list of str
        List of paragraph lines.

    paragraph : list of lists
        List of lists containing paragraph lines split into two
        character chunks.

    col : int
        Column to add to `decoded`.

    Returns
    -------
    decoded : list of str
        Paragraph lines with column appended.

    paragraph : list of lists
        Line chunks with column `col` removed.

    """
    for i, line in enumerate(decoded):
        decoded[i] = line + paragraph[i].pop(col)

    return decoded, paragraph


def main():
    """
    Start with the column identified by `find_first_column` and then
    successively add columns that pass all of the rule tests.

    """
    # turn S into list of lists
    paragraph = [s.split('|') for s in S]

    # make a container for decoded paragraph
    decoded = [''] * len(paragraph)

    # this will contain the indices of the columns in their corrected
    # order.
    order = []

    # find the first column and remove it from paragraph
    order.append(find_first_column(paragraph[0]))

    decoded, paragraph = add_column(decoded, paragraph, order[-1])

    # now go through columns finding one that goes after the last
    # column in order
    while True:
        matched = False

        for col in range(len(paragraph[0])):
            if check_column(decoded, paragraph, col):
                order.append(col)

                decoded, paragraph = add_column(decoded, paragraph, order[-1])

                matched = True

                break

        if not matched:
            print decoded
            raise StandardError("No column matched your rules!")

        # check if we've matched all columns
        if len(paragraph[0]) == 0:
            break

    for line in decoded:
        print line


if __name__ == '__main__':
    main()
	"""
	This code is for part two of the AI Class optional Natural Language Processing
	programming assignment. See youtu.be/KuSg1wcty3s for the complete problem
	definition.

	We are given a sentence that spans several lines and has been shredded
	vertically so that there are 2 characters in each column. The columns are
	arranged in random order and it's our job to assemble them into the correct
	order.

	In this code I use a letter-by-letter grammar model to reconstruct the original.
	The first column is identified by looking for a capital letter in the first
	column of the first row. Successive columns are added next to the first and
	kept if they pass a series of tests:

	* No word may have 4 consecutive consonants
	* Every complete word must contain a vowel
	* Capital letters may not be preceded by another letter.
	* Punctuation must be followed by a space, except quotation marks,
	which must be either preceded or followed by a space.
	* Only 'a', 'A', and 'I' are allowed to occur by themselves.

	The above list of rules proved ambiguous so I had to add more arbitrary rules
	to get to the final reconstruction, like that no word should end in 'i', or
	the string 'lang' should be followed by a 'u' (to get 'language'). These rules
	make this code extremely non-general.

	This technique proved to be fast but it had no way of recognizing whether it was
	constructing English words or just nonsense that matched my rules.

	Contrast this method with the dictionary method in gist.github.com/1499952,
	which is much slower but able to recognize real English words.

	I think it could be effective to combine the grammar and dictionary methods.
	Many potential strings could be quickly ruled invalid by grammar rules and
	only after passing those rules would a word be subjected to the more
	time consuming dictionary tests.

	Another potential enhancement to this code would be to make it probabilistic.
	Here either a string fits the rules or it doesn't, which is very rigid. In a
	more general code you'd want a probabilistic model that rates strings' fitness
	based on how often similar strings occur in actual language.


	The jumbled sentence is:

	\|de\| \| f\|Cl\|nf\|ed\|au\| i\|ti\| \|ma\|ha\|or\|nn\|ou\| S\|on\|nd\|on\|
	\|ry\| \|is\|th\|is\| b\|eo\|as\| \| \|f \|wh\| o\|ic\| t\|, \| \|he\|h \|
	\|ab\| \|la\|pr\|od\|ge\|ob\| m\|an\| \|s \|is\|el\|ti\|ng\|il\|d \|ua\|c \|
	\|he\| \|ea\|of\|ho\| m\| t\|et\|ha\| \| t\|od\|ds\|e \|ki\| c\|t \|ng\|br\|
	\|wo\|m,\|to\|yo\|hi\|ve\|u \| t\|ob\| \|pr\|d \|s \|us\| s\|ul\|le\|ol\|e \|
	\| t\|ca\| t\|wi\| M\|d \|th\|"A\|ma\|l \|he\| p\|at\|ap\|it\|he\|ti\|le\|er\|
	\|ry\|d \|un\|Th\|" \|io\|eo\|n,\|is\| \|bl\|f \|pu\|Co\|ic\| o\|he\|at\|mm\|
	\|hi\| \| \|in\| \| \| t\| \| \| \| \|ye\| \|ar\| \|s \| \| \|. \|

	Reconstructed it reads:

	Claude Shannon founded information
	theory, which is the basis of
	probabilistic language models and
	of the code breaking methods that
	you would use to solve this problem,
	with the paper titled "A Mathematical
	Theory of Communication," published
	in this year.

	Author
	------
	@jiffyclub
	git.io/jiffyclub

	"""

	import string

	S = [
	'de\| \| f\|Cl\|nf\|ed\|au\| i\|ti\| \|ma\|ha\|or\|nn\|ou\| S\|on\|nd\|on',
	'ry\| \|is\|th\|is\| b\|eo\|as\| \| \|f \|wh\| o\|ic\| t\|, \| \|he\|h ',
	'ab\| \|la\|pr\|od\|ge\|ob\| m\|an\| \|s \|is\|el\|ti\|ng\|il\|d \|ua\|c ',
	'he\| \|ea\|of\|ho\| m\| t\|et\|ha\| \| t\|od\|ds\|e \|ki\| c\|t \|ng\|br',
	'wo\|m,\|to\|yo\|hi\|ve\|u \| t\|ob\| \|pr\|d \|s \|us\| s\|ul\|le\|ol\|e ',
	' t\|ca\| t\|wi\| M\|d \|th\|"A\|ma\|l \|he\| p\|at\|ap\|it\|he\|ti\|le\|er',
	'ry\|d \|un\|Th\|" \|io\|eo\|n,\|is\| \|bl\|f \|pu\|Co\|ic\| o\|he\|at\|mm',
	'hi\| \| \|in\| \| \| t\| \| \| \| \|ye\| \|ar\| \|s \| \| \|. ']

	VOWELS = 'aeiouAEIOU'
	CONSONANTS = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'


	def find_first_column(l):
	"""
	The first letter of the first row should be a capital
	letter. Find the index of the column that meets this requirement.

	(This is using the prior knowledge that only one column will meet
	this condition.)

	Parameters
	----------
	l : list of str
	First row of paragraph as a list of two character strings.

	Returns
	-------
	ind : int
	Index of best first column.

	"""
	for i,s in enumerate(l):
	if s[0] in string.uppercase:
	return i

	def check_rule_no_four_consonants(s):
	"""
	Returns False if a word in a string contains 4 successive consonants.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if a word in the string contains 4 successive consonants,
	True otherwise.

	"""
	split = s.split()

	for word in split:
	num = 0

	for c in word:
	if c in CONSONANTS:
	num += 1

	if num >= 3:
	return False

	else:
	num = 0

	return True


	def check_rule_capital_letters(s):
	"""
	Capital letters must not be preceded by another letter.
	Returns False if that is not the case for a given string.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	Returns False if string violates capital letter rules.

	"""
	for i,c in enumerate(s[1:],1):
	if c in string.uppercase:
	if s[i-1] in string.letters:
	return False

	return True


	def check_rule_has_a_vowel(s):
	"""
	All words in a string must contain a vowel. Returns False if that
	is not the case. Words shorter than 2 letters are not tested.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if a word in `s` is all consonants.

	"""
	split = s.split()

	for word in split:
	if len(word) < 4 and s.endswith(word):
	continue

	num = len([c for c in word if c in VOWELS])

	if num == 0:
	return False

	return True


	def check_rule_no_bad_endings(s):
	"""
	Words don't usually end in 'i', return False if a word does.
	Words less than 2 letters or is not followed by a space it is not tested.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if a word ends in something strang, True otherwise.

	"""
	forbidden = 'i'

	split = s.split()

	for word in split:
	if len(word) < 2 or s.endswith(word):
	continue

	if word[-1] in forbidden:
	return False

	return True


	def check_rule_punctuation(s):
	"""
	Punctuation must always be followed by a space or a quoation mark.
	Quotation marks may be preceded by a space or followed by a space,
	but must be adjacent to a space. Return False if that is not the case.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if punctuation rules are broken, True otherwise.

	"""
	for i, c in enumerate(s[:-1]):
	if c in string.punctuation and c != '"' and s[i+1] not in (' "'):
	return False
	elif c == '"' and not (s[i-1] == ' ' or s[i+1] == ' '):
	return False

	return True


	def check_rule_single_letters(s):
	"""
	The only letters allowed to be by themselves are 'a', 'A', and 'I'.
	Returns False if that's not the case.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if any forbidden single letters are found, True otherwise.

	"""
	allowed = 'aAI'

	split = s.split()

	for word in split:
	if len(word) == 1 and not s.endswith(word):
	if word not in allowed:
	return False

	return True


	def check_rule_words(s):
	"""
	Certain letter combinations are just likely. For example, if you have
	'oul', you can be pretty certain the next letter is 'd'. This function
	returns False if a word in the input string violates something like that.

	(This is the last rule I implemented, and is a bit arbitrary.)

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	False if nonsensical letter combinations are found.

	"""
	split = s.split()

	for word in split:
	if 'oul' in word and not s.endswith('oul'):
	if 'ould' not in word:
	return False
	if word.startswith('whic') and not s.endswith('whic'):
	if word != 'which':
	return False
	if 'aa' in word:
	return False
	if 'ae' in word:
	return False
	if word.startswith('lang') and not s.endswith('lang'):
	if not word.startswith('langu'):
	return False
	if 'titlec' in word:
	return False

	return True


	def check_string(s):
	"""
	Check whether a string meets all of the language rules. Returns
	True if all rules are met.

	Parameters
	----------
	s : str
	String to check.

	Returns
	-------
	correct : bool
	True if all rules are met, False otherwise.

	"""
	correct = (check_rule_has_a_vowel(s) and
	check_rule_capital_letters(s) and
	check_rule_no_four_consonants(s) and
	check_rule_no_bad_endings(s) and
	check_rule_punctuation(s) and
	check_rule_single_letters(s) and
	check_rule_words(s))

	return correct


	def check_column(decoded, paragraph, col):
	"""
	Check whether a given column of the paragraph fits in with what's
	already been added to the decoded paragraph.

	Parameters
	----------
	decoded : list of str
	List of paragraph lines.

	paragraph : list of lists
	List of lists containing paragraph lines split into two
	character chunks.

	col : int
	Column to add to `decoded`.

	Returns
	-------
	correct : bool
	True if column `col` is consistent with what has already
	been decoded, False otherwise.

	"""
	correct = [check_string(decoded[i] + paragraph[i][col])
	for i in range(len(decoded))]

	if correct.count(True) != len(decoded):
	return False
	else:
	return True


	def add_column(decoded, paragraph, col):
	"""
	Add a column from the paragraph to the decoded container.
	Also removes the column from use.

	Parameters
	----------
	decoded : list of str
	List of paragraph lines.

	paragraph : list of lists
	List of lists containing paragraph lines split into two
	character chunks.

	col : int
	Column to add to `decoded`.

	Returns
	-------
	decoded : list of str
	Paragraph lines with column appended.

	paragraph : list of lists
	Line chunks with column `col` removed.

	"""
	for i, line in enumerate(decoded):
	decoded[i] = line + paragraph[i].pop(col)

	return decoded, paragraph


	def main():
	"""
	Start with the column identified by `find_first_column` and then
	successively add columns that pass all of the rule tests.

	"""
	# turn S into list of lists
	paragraph = [s.split('\|') for s in S]

	# make a container for decoded paragraph
	decoded = [''] * len(paragraph)

	# this will contain the indices of the columns in their corrected
	# order.
	order = []

	# find the first column and remove it from paragraph
	order.append(find_first_column(paragraph[0]))

	decoded, paragraph = add_column(decoded, paragraph, order[-1])

	# now go through columns finding one that goes after the last
	# column in order
	while True:
	matched = False

	for col in range(len(paragraph[0])):
	if check_column(decoded, paragraph, col):
	order.append(col)

	decoded, paragraph = add_column(decoded, paragraph, order[-1])

	matched = True

	break

	if not matched:
	print decoded
	raise StandardError("No column matched your rules!")

	# check if we've matched all columns
	if len(paragraph[0]) == 0:
	break

	for line in decoded:
	print line


	if __name__ == '__main__':
	main()