@dmarx
Created November 29, 2014 19:06
Notebook for extracting "I before E except after C" style spelling heuristics from an input vocabulary or corpus
{
"metadata": {
"name": "I before E except after C..."
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# 'I' Before 'E' Except After 'C'\n\nA recent post on Patrick Vennebush's blog *Math Jokes 4 Mathy Folks* asserted that the rule of thumb *\"I before E except after C\"* was \"total bullshit.\" This got me thinking: the \"I before E except after C\" rule was almost certainly developed without any research at all, just from the subjective experience of educators. It's not a *horrible* rule, but surely we can construct better rules of this kind more intelligently (for lack of a better term, I'll be referring to these as \"trigram rules\"). We have the technology, and I have nothing better to do with my night off :)\n\n## Vocabulary trigram rules\n\nTo start with, we're going to need a corpus of words (i.e. a vocabulary) to analyze. I'll be using the wordlist in the NLTK package."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from nltk.corpus import words, brown\nimport string\nfrom collections import Counter, defaultdict\nfrom itertools import combinations\nfrom __future__ import division\n\ncorpus = set(w.lower() for w in words.words())\nprint len(corpus)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "233621\n"
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This analysis will proceed by identifying all of the trigrams in the corpus, and then decomposing each trigram into its starting letter and its trailing bigram. For each bigram in the corpus, we want to count the occurrences of the bigram, and the frequency with which that bigram is preceded by each respective \"starting letter\" across all occurrences of trigrams. My favorite datatype for this sort of operation is a [`defaultdict`](https://docs.python.org/2/library/collections.html#collections.defaultdict) initialized by [`Counter`](https://docs.python.org/2/library/collections.html#collections.Counter)s. What this means is that if I key into my dict for a key that isn't already there, the dict will initialize that key pointing to an empty `Counter` as the value. This way, I can key into my object for each occurrence of a bigram and update the counter corresponding to the associated trigram's starting letter.\n\nYou'll notice that I refer to this data structure as a \"trie,\" although I'm not positive that's technically accurate. At the very least, it's similar to a trie."
},
{
"cell_type": "code",
"collapsed": false,
"input": "trigram_trie = defaultdict(Counter)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
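{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick sanity check (a minimal illustrative example, not part of the original analysis), keying into the `defaultdict` for a bigram that isn't there yet yields an empty `Counter`, which we can then `update` with starting letters:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "demo = defaultdict(Counter)\ndemo['ie'].update(['c'])\ndemo['ie'].update(['c', 'a'])\nprint demo['ie'] # Counter({'c': 2, 'a': 1})\nprint demo['xx'] # Counter(): the key is created on first access",
"language": "python",
"metadata": {},
"outputs": []
},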
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, I'll define a simple function to tokenize an input word into ngrams. I could write a function that only extracts trigrams, but it's easy enough to generalize this to any ngram. I haven't gotten into it yet, but maybe at a later date I'll want to repeat this experiment with 4-grams or something instead of just focusing on trigrams. In any event, having a simple ngram tokenizing function handy certainly can't hurt."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def ngram_tokenize(word, n=3):\n m = len(word)\n grams = []\n for i in range(m-n+1):\n j = i + n\n grams.append(word[i:j])\n return grams",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
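{
"cell_type": "markdown",
"metadata": {},
"source": "For example (a quick illustration, not part of the original analysis), tokenizing the word \"science\":"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print ngram_tokenize('science') # ['sci', 'cie', 'ien', 'enc', 'nce']\nprint ngram_tokenize('science', n=4) # ['scie', 'cien', 'ienc', 'ence']",
"language": "python",
"metadata": {},
"outputs": []
},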
{
"cell_type": "markdown",
"metadata": {},
"source": "Now that we've got all of the necessary machinery in place, let's process our corpus and tally our trigram rules (unigram->bigram relationships)."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Process trigrams as unigram->bigram\nfor word in corpus:\n for trigram in ngram_tokenize(word, n=3):\n start, bigram = trigram[0], trigram[1:]\n trigram_trie[bigram].update([start])\n \n# Count occurrences of each bigram\nstats = Counter()\nfor k,v in trigram_trie.iteritems():\n stats[k] = sum(v.values())",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Just to make my life a little easier, I define a function that prints some simple information for a given rule."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Investigate what makes something a rule\ndef print_rule_stats(rule, trigram_trie=trigram_trie, stats=stats):\n \"\"\"\n Print summary statistics for a trigram rule of the form [bigram, preceding letter]\n \"\"\"\n ab,c = rule\n a,b = ab\n ba = b+a\n\n print ab, stats[ab], '\\t', c+ab, trigram_trie[ab][c]\n print ba, stats[ba], '\\t', c+ba, trigram_trie[ba][c]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Our analysis corroborates the claim that the \"'i' before 'e' except after 'c'\" rule is not effective, at least from the naive perspective of frequency in the vocabulary. (It might still be valid if we considered the occurrence of the trigrams in *written language* rather than just the vocabulary -- since the usage of different words is absolutely not uniform -- but we haven't investigated that yet.) We can contrast this with a better trigram rule to get a sense of what we're looking for."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print \"Weak rule\"\nprint_rule_stats(['ie','c']) \nprint \"\\nStrong rule\"\nprint_rule_stats(['ie','a']) ",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Weak rule\nie 3950 \tcie 256\nei 2607 \tcei 156\n\nStrong rule\nie 3950 \taie 16\nei 2607 \taei 44\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Given a trigram rule of the form \"'a' then 'b' except after 'c',\" the strength of the rule is relative to the extent to which it exhibits the following properties:\n\n* the bigram \"ab\" occurs more frequently in the vocabulary than the bigram \"ba\"\n* the trigram \"cba\" occurs more frequently than the trigram \"cab\"\n\nTo rank rules, we'll need some mechanism for scoring them. A simple method I've found reasonably effective for the purposes of this experiment is to filter out rules where the proportion $count(ab)/(count(ab)+count(ba))$ falls below some threshold $k$, and then use the ratio of trigram occurrences as the score: $count(cba)/count(cab)$. In our best rules, the trigram \"cab\" may not occur in the vocabulary at all, so to mitigate division-by-zero issues, we can Laplace smooth the trigram occurrence ratio by adding 1 to both the numerator and the denominator. Our scoring function then becomes:\n\n$$\nscore([ab,c], k) = \\left\\{\n \\begin{array}{lr}\n \\frac{count(cba)+1}{count(cab)+1} :& \\frac{count(ab)}{count(ab)+count(ba)} > k \\\\\n 0 : otherwise\n \\end{array}\n \\right.\n$$ "
},
{
"cell_type": "code",
"collapsed": false,
"input": "def score_rule(rule, k=1/2):\n \"\"\"\n Score a trigram rule [bigram, preceding letter] by its Laplace-smoothed trigram ratio\n \"\"\"\n ab,c = rule\n a,b = ab\n ba = b+a\n if stats[ab]/(stats[ab] + stats[ba]) > k:\n return (trigram_trie[ba][c]+1)/(trigram_trie[ab][c]+1)\n else:\n return 0",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
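{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick check (an illustrative aside, using the counts printed above), we can score the classic rule directly: with `ie` appearing 3950 times, `ei` 2607, `cie` 256 and `cei` 156, the bigram proportion passes the default threshold, but the smoothed trigram ratio $(156+1)/(256+1) \\approx 0.61$ falls well below 1, marking \"'i' before 'e' except after 'c'\" as a weak rule:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print score_rule(['ie','c']) # ~0.61: a score below 1 means the 'exception' is not borne out",
"language": "python",
"metadata": {},
"outputs": []
},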
{
"cell_type": "markdown",
"metadata": {},
"source": "With the notion of a trigram rule defined and a scoring function in place, we can pipeline the rule identification, scoring and ranking process into a function to allow us to easily identify and rank all trigram rules in a given vocabulary/corpus."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def extract_trigram_rules(corpus, scoring_function=score_rule):\n \"\"\"\n Extracts \"I before E except after C\" style 'trigram rules' from an input corpus,\n ranked by the specified scoring function.\n \"\"\"\n # Process trigrams as unigram->bigram\n trigram_trie = defaultdict(Counter)\n for word in corpus:\n for trigram in ngram_tokenize(word, n=3):\n start, bigram = trigram[0], trigram[1:]\n trigram_trie[bigram].update([start])\n \n # Count occurrences of each bigram\n stats = Counter()\n for k,v in trigram_trie.iteritems():\n stats[k] = sum(v.values())\n \n special_cases = []\n for a, b in combinations(string.lowercase,2):\n comb1 = a+b\n comb2 = b+a\n if stats[comb2] > stats[comb1]:\n comb1, comb2 = comb2, comb1\n for c in trigram_trie[comb2].keys():\n if trigram_trie[comb2][c] > trigram_trie[comb1][c]:\n rule = [comb1, str(c)]\n special_cases.append([scoring_function(rule), rule])\n \n special_cases.sort(reverse=True)\n \n return special_cases, trigram_trie, stats",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Here are the top-ranked trigram rules. The notation `['ab','c']` should be interpreted as *\"`a` before `b` except after `c`.\"*\n\nAlthough the highest ranked rule is *\"P before E, except after C,\"* my favorite is the second one: *\"E before U, except after Q.\"*"
},
{
"cell_type": "code",
"collapsed": false,
"input": "vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x: score_rule(x, k=1/2))\n\nfor score, rule in vocab_rules[:5]:\n print rule\n print_rule_stats(rule, trigram_trie, stats)\n print",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['pe', 'c']\npe 8052 \tcpe 0\nep 5053 \tcep 955\n\n['eu', 'q']\neu 2620 \tqeu 0\nue 1981 \tque 949\n\n['ic', 'i']\nic 26140 \tiic 1\nci 6561 \tici 1830\n\n['te', 'm']\nte 27265 \tmte 2\net 11743 \tmet 2684\n\n['rd', 'n']\nrd 3641 \tnrd 0\ndr 2738 \tndr 808\n\n"
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "If we modify the scoring heuristic to filter on a more aggressive cutoff threshold for the bigram ratio, different rules get highlighted."
},
{
"cell_type": "code",
"collapsed": false,
"input": "vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x: score_rule(x, k=3/4))\n\nfor score, rule in vocab_rules[:5]:\n print rule\n print_rule_stats(rule, trigram_trie, stats)\n print",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['ic', 'i']\nic 26140 \tiic 1\nci 6561 \tici 1830\n\n['ns', 's']\nns 6442 \tsns 0\nsn 1121 \tssn 298\n\n['og', 'a']\nog 7578 \taog 2\ngo 2459 \tago 619\n\n['ex', 'e']\nex 1688 \teex 0\nxe 398 \texe 201\n\n['ng', 'n']\nng 13024 \tnng 1\ngn 1767 \tngn 373\n\n"
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Corpus trigram rules\n\nThe above analysis determined trigram rules that are useful across the English vocabulary. But really, the aim of these rules is to help us with our day-to-day spelling, so it might be more useful to conduct the analysis scoring trigrams based not on their appearance in unique words, but on their actual occurrence in written English. For this purpose, a useful corpus of reasonable size is the Brown corpus."
},
{
"cell_type": "code",
"collapsed": false,
"input": "brown_corpus = [w.lower() for w in brown.words()]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": "print \"Total words:\", len(brown_corpus) # plus a few grammatical tokens\nprint \"Unique words:\", len(set(brown_corpus))\nprint \"\\nMost common tokens:\"\nCounter(brown_corpus).most_common(10)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Total words: 1161192\nUnique words: "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "49815\n\nMost common tokens:\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": "[(u'the', 69971),\n (u',', 58334),\n (u'.', 49346),\n (u'of', 36412),\n (u'and', 28853),\n (u'to', 26158),\n (u'a', 23195),\n (u'in', 21337),\n (u'that', 10594),\n (u'is', 10109)]"
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": "brown_rules, brown_trigrams, brown_stats = extract_trigram_rules(brown_corpus)\n\nfor score, rule in brown_rules[:5]:\n print rule\n print_rule_stats(rule, brown_trigrams, brown_stats)\n print",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['pe', 'c']\npe 12235 \tcpe 0\nep 6038 \tcep 891\n\n['ic', 'i']\nic 24054 \tiic 0\nci 7575 \tici 1592\n\n['te', 'm']\nte 39262 \tmte 2\net 15913 \tmet 1785\n\n['rd', 'n']\nrd 7135 \tnrd 0\ndr 1305 \tndr 377\n\n['iv', 'i']\niv 9448 \tiiv 0\nvi 6956 \tivi 1874\n\n"
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We've established that the \"I before E\" rule doesn't hold up when we just look at the vocabulary, but is the rule salvageable in the context of the distribution of words in written language?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print_rule_stats(['ie','c'], brown_trigrams, brown_stats)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "ie 13275\nei 5677\ncie 1310\ncei 485\n"
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We've been lied to for so long... I don't know what's real anymore."
}
],
"metadata": {}
}
]
}