{
"cells": [
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "%pprint\n\nimport re\nimport random\nfrom collections import defaultdict\n\nimport pandas as pd\nimport nltk\nfrom nltk import word_tokenize\nfrom nltk.corpus import stopwords\nfrom nltk.tokenize import regexp_tokenize",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Pretty printing has been turned OFF\n",
"name": "stdout"
}
]
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**First, load in the file or files below.** First, take a look at your text. An easy way to get started is to first read it in, and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate. You may have to do a bit of work to figure out which will be the \"opening phrase\" that Wolfram Alpha shows. Below, write the code to read in the text and split it into sentences, and then print out the **opening phrase**."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Manually deleted the Project Gutenberg information from each of the 11 files. Then, combined them using the following `bash` script, which replaces tabs, new lines, and carriage returns with spaces."
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "!cat text-collection/jsm-collection.sh",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": "#!/bin/bash\r\n\r\nrm jsm-collection.txt\r\n\r\nFILES=/Users/JS/Code/_INFO256/text-collection/*.txt\r\n\r\nfor f in $FILES\r\ndo\r\n tr -s '\\t\\n\\r' ' ' < $f >> temp.txt\r\ndone\r\n\r\ntr -s '“”' '\"' < temp.txt > temp1.txt\r\ntr -s \"’\" \"'\" < temp1.txt > jsm-collection.txt\r\n\r\nrm temp.txt\r\nrm temp1.txt\r\n",
"name": "stdout"
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "with open ('text-collection/jsm-collection.txt', 'r', encoding='utf-8') as jsm:\n t = jsm.read()\n\n# remove chapter and section headings\nt = re.sub('\\s+', ' ',\n re.sub(r'[A-Z]{2,}', '',\n re.sub('((?<=[A-Z])\\sI | I\\s(?=[A-Z]))', ' ', t)))\n\nsentences = [s.strip() + '.' for s in t.split('.')]\n\nsentences[0]",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "'It seems proper that I should prefix to the following biographical sketch some mention of the reasons which have made me think it desirable that I should leave behind me such a memorial of so uneventful a life as mine.'"
},
"metadata": {},
"execution_count": 4
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Next, tokenize.** Look at the several dozen sentences to see what kind of tokenization issues you'll have. Write a regular expression tokenizer, using the `nltk.regexp_tokenize()` as seen in class, or using something more sophisticated if you prefer, to do a nice job of breaking your text up into words. You may need to make changes to the regex pattern that is given in the book to make it work well for your text collection. \n\n*How you break up the words will have effects down the line for how you can manipulate your text collection. You may want to refine this code later.*"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "start = random.randint(0, len(sentences))\n\nsentences[start : start+10]",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "['I felt, too, that mine was not an interesting, or in any way respectable distress.', 'There was nothing in it to attract sympathy.', 'Advice, if I had known where to seek it, would have been most precious.', 'The words of Macbeth to the physician often occurred to my thoughts.', 'But there was no one on whom I could build the faintest hope of such assistance.', 'My father, to whom it would have been natural to me to have recourse in any practical difficulties, was the last person to whom, in such a case as this, I looked for help.', 'Everything convinced me that he had no knowledge of any such mental state as I was suffering from, and that even if he could be made to understand it, he was not the physician who could heal it.', 'My education, which was wholly his work, had been conducted without any regard to the possibility of its ending in this result; and I saw no use in giving him the pain of thinking that his plans had failed, when the failure was probably irremediable, and, at all events, beyond the power of _his_ remedies.', 'Of other friends, I had at that time none to whom I had any hope of making my condition intelligible.', 'It was, however, abundantly intelligible to myself; and the more I dwelt upon it, the more hopeless it appeared.']"
},
"metadata": {},
"execution_count": 5
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "words = regexp_tokenize(t, pattern=\"\\w+(?:[-']\\w+)*\")",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "len(words)",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "1070892"
},
"metadata": {},
"execution_count": 7
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Compute word counts.** Now compute your frequency distribution using a `FreqDist` over the words. Let's not do lowercasing or stemming yet. You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below and show the most common 20 words that result."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "words_fd = nltk.FreqDist(words)\nwords_fd.most_common(20)",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "[('the', 73040), ('of', 59972), ('to', 30496), ('and', 24763), ('in', 23297), ('a', 20221), ('is', 18764), ('that', 14476), ('which', 14060), ('be', 13108), ('it', 11922), ('as', 10354), ('not', 9830), ('by', 9676), ('or', 8651), ('are', 8076), ('for', 7752), ('on', 5930), ('from', 5699), ('have', 5625)]"
},
"metadata": {},
"execution_count": 8
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Normalize the text.** Now adjust the output by normalizing the text: things you can try include removing stopwords, removing very short words, lowercasing the text, improving the tokenization, and/or doing other adjustments to bring content words higher up in the results. The goal is to dig deeper into the collection to find interesting but relatively frequent words. Show the code for these changes below. "
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "stop_words = stopwords.words('english')\n\nwords_clean = [w.lower() for w in words if w.lower() not in stop_words]",
"execution_count": 9,
"outputs": []
},
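{
"metadata": {},
"cell_type": "markdown",
"source": "Beyond stopword removal, the prompt also suggests dropping very short words; a sketch on a made-up sample list (the length threshold of 3 is an arbitrary choice):"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# drop words shorter than 3 characters; threshold is arbitrary\nsample = ['mr', 'liberty', 'of', 'utility', 'b']\n[w for w in sample if len(w) >= 3]",
"execution_count": null,
"outputs": []
},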
{
"metadata": {},
"cell_type": "markdown",
"source": "**Show adjusted word counts.** Show the most frequent 20 words that result from these adjustments."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "wc_fd = nltk.FreqDist(words_clean)\nwc_fd.most_common(20)",
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "[('one', 4729), ('would', 3702), ('may', 3159), ('general', 2010), ('must', 1824), ('even', 1766), ('two', 1666), ('case', 1555), ('much', 1473), ('every', 1470), ('laws', 1466), ('therefore', 1337), ('first', 1305), ('made', 1293), ('without', 1280), ('nature', 1277), ('things', 1273), ('great', 1264), ('capital', 1261), ('law', 1251)]"
},
"metadata": {},
"execution_count": 10
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Creating a table.**\nPython provides an easy way to line columns up in a table. You can specify a width for a string such as %6s, producing a string that is padded to width 6. It is right-justified by default, but a minus sign in front of it switches it to left-justified, so -3d% means left justify an integer with width 3. *AND* if you don't know the width in advance, you can make it a variable by using an asterisk rather than a number before the '\\*s%' or the '-\\*d%'. Check out this example (this is just fyi):"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "print('%-16s' % 'Info type', '%10s' % 'Value')\nprint('%-16s' % 'number of words', '%10d' % len(words_clean))",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value\nnumber of words 505793\n",
"name": "stdout"
}
]
},
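{
"metadata": {},
"cell_type": "markdown",
"source": "The variable-width form mentioned above, where an asterisk takes the field width from an argument, looks like this (the widths and values here are illustrative only):"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# '*' consumes one extra argument as the field width\nwidth = 16\nprint('%-*s %*d' % (width, 'number of words', 10, 505793))",
"execution_count": null,
"outputs": []
},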
{
"metadata": {},
"cell_type": "markdown",
"source": "**Word Properties Table** Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed). Make a table that prints out:\n1. number of words\n2. number of unique words\n3. average word length\n4. longest word\n\nYou can make your table look prettier than the example I showed above if you like!\n\nYou can decide for yourself if you want to eliminate punctuation and function words (stop words) or not. It's your collection! \n"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "words_n = len(words_clean)\nwords_unique = len(set(words_clean))\n\nwords_lengths = defaultdict(int)\nfor w in set(words_clean):\n words_lengths[w] = len(w)\nwords_mean_len = sum(words_lengths.values()) / float(len(words_lengths))\n\nwords_longest = ''.join([k for k, v in words_lengths.items() if v == max(words_lengths.values())])",
"execution_count": 12,
"outputs": []
},
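{
"metadata": {},
"cell_type": "markdown",
"source": "One caveat: `''.join` runs any *tied* longest words together into a single string (it happens not to matter here, since this collection has a unique longest word). A tie-safe sketch, shown on a hypothetical sample list:"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "# hypothetical sample; words_clean could be used instead\nsample = ['law', 'nature', 'capital', 'general']\nlongest = max(map(len, sample))\nsorted(w for w in set(sample) if len(w) == longest)",
"execution_count": null,
"outputs": []
},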
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "print('%-32s' % 'Info type', '%16s' % 'Value')\nprint('%-32s' % 'number of words', '%16d' % words_n)\nprint('%-32s' % 'number of unique words', '%16d' % words_unique)\nprint('%-32s' % 'average word length', '%16.2f' % round(words_mean_len, 2))\nprint('%-16s' % 'longest word', '%32s' % words_longest)",
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value\nnumber of words 505793\nnumber of unique words 25399\naverage word length 8.15\nlongest word mathematico-astronomical\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Most Frequent Words List.** Next is the most frequent words list. This table shows the percent of the total as well as the most frequent words, so compute this number as well. "
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "mf_words_words = [k for k, v in wc_fd.most_common(20)]\nmf_words_count = [v for k, v in wc_fd.most_common(20)]",
"execution_count": 14,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "df = pd.DataFrame({'word' : mf_words_words, 'frequency' : mf_words_count, 'total' : [words_n] * 20})\ndf['decimal'] = df['frequency'] / df['total']\ndf = df[['word', 'frequency', 'decimal']]\ndf",
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " word frequency decimal\n0 one 4729 0.009350\n1 would 3702 0.007319\n2 may 3159 0.006246\n3 general 2010 0.003974\n4 must 1824 0.003606\n5 even 1766 0.003492\n6 two 1666 0.003294\n7 case 1555 0.003074\n8 much 1473 0.002912\n9 every 1470 0.002906\n10 laws 1466 0.002898\n11 therefore 1337 0.002643\n12 first 1305 0.002580\n13 made 1293 0.002556\n14 without 1280 0.002531\n15 nature 1277 0.002525\n16 things 1273 0.002517\n17 great 1264 0.002499\n18 capital 1261 0.002493\n19 law 1251 0.002473",
"text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>word</th>\n <th>frequency</th>\n <th>decimal</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>one</td>\n <td>4729</td>\n <td>0.009350</td>\n </tr>\n <tr>\n <th>1</th>\n <td>would</td>\n <td>3702</td>\n <td>0.007319</td>\n </tr>\n <tr>\n <th>2</th>\n <td>may</td>\n <td>3159</td>\n <td>0.006246</td>\n </tr>\n <tr>\n <th>3</th>\n <td>general</td>\n <td>2010</td>\n <td>0.003974</td>\n </tr>\n <tr>\n <th>4</th>\n <td>must</td>\n <td>1824</td>\n <td>0.003606</td>\n </tr>\n <tr>\n <th>5</th>\n <td>even</td>\n <td>1766</td>\n <td>0.003492</td>\n </tr>\n <tr>\n <th>6</th>\n <td>two</td>\n <td>1666</td>\n <td>0.003294</td>\n </tr>\n <tr>\n <th>7</th>\n <td>case</td>\n <td>1555</td>\n <td>0.003074</td>\n </tr>\n <tr>\n <th>8</th>\n <td>much</td>\n <td>1473</td>\n <td>0.002912</td>\n </tr>\n <tr>\n <th>9</th>\n <td>every</td>\n <td>1470</td>\n <td>0.002906</td>\n </tr>\n <tr>\n <th>10</th>\n <td>laws</td>\n <td>1466</td>\n <td>0.002898</td>\n </tr>\n <tr>\n <th>11</th>\n <td>therefore</td>\n <td>1337</td>\n <td>0.002643</td>\n </tr>\n <tr>\n <th>12</th>\n <td>first</td>\n <td>1305</td>\n <td>0.002580</td>\n </tr>\n <tr>\n <th>13</th>\n <td>made</td>\n <td>1293</td>\n <td>0.002556</td>\n </tr>\n <tr>\n <th>14</th>\n <td>without</td>\n <td>1280</td>\n <td>0.002531</td>\n </tr>\n <tr>\n <th>15</th>\n <td>nature</td>\n <td>1277</td>\n <td>0.002525</td>\n </tr>\n <tr>\n <th>16</th>\n <td>things</td>\n <td>1273</td>\n <td>0.002517</td>\n </tr>\n <tr>\n <th>17</th>\n <td>great</td>\n <td>1264</td>\n <td>0.002499</td>\n </tr>\n <tr>\n <th>18</th>\n <td>capital</td>\n <td>1261</td>\n <td>0.002493</td>\n </tr>\n <tr>\n <th>19</th>\n <td>law</td>\n <td>1251</td>\n <td>0.002473</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 15
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Most Frequent Capitalized Words List** We haven't lower-cased the text so you should be able to compute this. Don't worry about whether capitalization comes from proper nouns, start of sentences, or elsewhere. You need to make a different FreqDist to do this one. Write the code here for the new FreqDist and the List itself. Show the list here."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "words_capitalized = [w for w in words if w.lower() not in stop_words]\nwords_capitalized = [w for w in words_capitalized if w[0].isupper()]\nwns_fd = nltk.FreqDist(words_capitalized)\nwns_fd.most_common(20)",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "[('England', 646), ('Mr', 609), ('B', 572), ('States', 547), ('United', 486), ('M', 465), ('C', 297), ('Method', 275), ('English', 274), ('Comte', 270), ('Political', 264), ('Economy', 236), ('Dr', 217), ('Thus', 216), ('Germany', 190), ('Even', 175), ('American', 170), ('Parliament', 163), ('Government', 150), ('France', 150)]"
},
"metadata": {},
"execution_count": 16
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Sentence Properties Table** This summarizes number of sentences and average sentence length in words and characters (you decide if you want to include stopwords/punctuation or not). Print those out in a table here."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "sentence_lens_char = []\nsentence_lens_word = []\nfor s in sentences:\n sentence_lens_char.append(len(s))\n sentence_lens_word.append(len(s.split()))\n\nsentences_n = len(sentences)\nsentences_mean_char = sum(sentence_lens_char) / float(len(sentence_lens_char))\nsentences_mean_word = sum(sentence_lens_word) / float(len(sentence_lens_word))\n\nprint('%-32s' % 'Info type', '%16s' % 'Value')\nprint('%-32s' % 'number of sentences', '%16d' % sentences_n)\nprint('%-32s' % 'average sentence length: characters', '%13.2f' % round(sentences_mean_char, 2))\nprint('%-32s' % 'average sentence length: words', '%16.2f' % round(sentences_mean_word, 2))",
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value\nnumber of sentences 34431\naverage sentence length: characters 181.68\naverage sentence length: words 31.11\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Reflect on the Output** (Write a brief paragraph below answering these questions.) What does it tell you about your collection? What does it fail to tell you? How does your collection perhaps differ from others?"
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "markdown",
"source": "The first thing I noticed was that more than half of the 1,070,892 words are stop words. In looking at the 20 most frequent \"cleaned\" words (lowercase and removing stop words), the words \"case,\" \"laws,\" \"capital\", and \"law\"&mdash;words most likely associated with government&mdash;appear most frequently. Mill was, after all, a political economist, among other things. Also interesting is that he uses \"therefore\" quite frequently. Having an average word length of over 8 characters seems higher than would be typical (at least in contemporary times). The same is true for the average sentence length, which makes sense based on the time period of these writings."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Compare to Another Collection** Now do the same analysis on another collection in NLTK. \nIf your collection is a book, you can compare against another book. Or you can contrast against an entirely different collection (Brown corpus, presidential inaugural addresses, etc) to see the difference.\nThe list of collections is here: http://www.nltk.org/nltk_data/\nReflect on the similarities to or differences from your text collection.\n"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "from nltk.book import *",
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": "*** Introductory Examples for the NLTK Book ***\nLoading text1, ..., text9 and sent1, ..., sent9\nType the name of the text or sentence to view it.\nType: 'texts()' or 'sents()' to list the materials.\ntext1: Moby Dick by Herman Melville 1851\ntext2: Sense and Sensibility by Jane Austen 1811\ntext3: The Book of Genesis\ntext4: Inaugural Address Corpus\ntext5: Chat Corpus\ntext6: Monty Python and the Holy Grail\ntext7: Wall Street Journal\ntext8: Personals Corpus\ntext9: The Man Who Was Thursday by G . K . Chesterton 1908\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "iac_words_clean = [w.lower() for w in text4 if w.lower() not in stop_words]",
"execution_count": 19,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "iac_fd = nltk.FreqDist(iac_words_clean)\niac_fd.most_common(20)",
"execution_count": 20,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "[(',', 6840), ('.', 4676), ('government', 593), ('people', 563), (';', 544), ('us', 455), ('upon', 369), ('--', 363), ('must', 346), ('may', 334), ('great', 331), ('world', 329), ('states', 329), ('shall', 314), ('country', 302), ('nation', 302), ('every', 285), ('-', 280), ('peace', 252), ('one', 243)]"
},
"metadata": {},
"execution_count": 20
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "words_n = len(iac_words_clean)\nwords_unique = len(set(iac_words_clean))\n\nwords_lengths = defaultdict(int)\nfor w in set(iac_words_clean):\n words_lengths[w] = len(w)\nwords_mean_len = sum(words_lengths.values()) / float(len(words_lengths))\n\nwords_longest = ''.join([k for k, v in words_lengths.items() if v == max(words_lengths.values())])",
"execution_count": 21,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "print('%-32s' % 'Info type', '%64s' % 'Value')\nprint('%-32s' % 'number of words', '%64d' % words_n)\nprint('%-32s' % 'number of unique words', '%64d' % words_unique)\nprint('%-32s' % 'average word length', '%64.2f' % round(words_mean_len, 2))\nprint('%-28s' % 'longest word', '%64s' % words_longest)",
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value\nnumber of words 76213\nnumber of unique words 8944\naverage word length 7.82\nlongest word contradistinctionantiphilosophistsmisrepresentationinstrumentalities\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "len('antiphilosophistscontradistinctioninstrumentalitiesmisrepresentation')",
"execution_count": 23,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "68"
},
"metadata": {},
"execution_count": 23
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Punctuation in the Inaugural Address Corpus (IAC) is not removed; it is in the John Stuart Mill Corpus (JSMC). The average word length in the IAC is slightly lower than in the JSMC. However, this could be due to the presence of \"antiphilosophistscontradistinctioninstrumentalitiesmisrepresentation,\" which is 68 characters long. The IAC is smaller than the JSMC. Also the ratio of unique words to words in the IAC is higher than in the JSMC. The IAC is, understandably, more geared toward government and politics."
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3",
"mimetype": "text/x-python",
"file_extension": ".py",
"version": "3.4.2",
"codemirror_mode": {
"name": "ipython",
"version": 3
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}