jerielizabeth/PreProcess-SDA-Periodicals-Corpus.ipynb

## PreProcess-SDA-Periodicals-Corpus.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "# Overview\n\nThe project for this notebook is to stream in all of the periodical docs and preprocess them for topic modeling. \n\nThe code example that I am using is from http://radimrehurek.com/topic_modeling_tutorial/1%20-%20Streamed%20Corpora.html. \n\nThis notebook relies on NLTK and Gensim for parsing the text in the corpus files.\n\nThe corpus I am working with is composed of the pages of the periodicals published by the Seventh-day Adventist church during the years 1849 to 1920 and published in one of the following regions: Lake Union; Pacific Union; Southern Union; or Columbia Union. Each periodical has been split into pages, with each page containing approximate 1000 words, making them the ideal size for topic modeling (citation). \n\nThe corpus size is 196,762 documents.\n\n## Preprocessing steps\n\n1. Address problem of line-endings with the following regex: `re.sub(r'(\\w+)[-]\\s(\\w+)', r'\\1\\2', text)`\n2. Convert to lowercase and tokenize"
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "# import and setup modules we'll be using in this notebook\nimport os\nimport sys\nimport re\nimport tarfile\nimport itertools\nimport logging\n\nimport nltk\nfrom nltk.collocations import TrigramCollocationFinder\nfrom nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures\n\nimport gensim\nfrom gensim.parsing.preprocessing import STOPWORDS",
      "execution_count": 1,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "# http://stackoverflow.com/questions/35936086/\nlogger = logging.getLogger()\nlogger.setLevel(logging.DEBUG)\n\n# Create STDERR handler\nhandler = logging.StreamHandler(sys.stderr)\n# ch.setLevel(logging.DEBUG)\n\n# Create formatter and add it to the handler\nformatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')\nhandler.setFormatter(formatter)\n\n# Set STDERR handler as the only handler \nlogger.handlers = [handler]",
      "execution_count": 2,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "def process_page(page):\n    \"\"\"\n    Preprocess a single periodical page, returning the result as\n    a unicode string.\n    \"\"\"\n    content = gensim.utils.to_unicode(page, 'utf8').strip()\n\n    \"\"\"\n    Cleans up the special characters in the text to those we would expect in the corpus.\n    Leaves punctuation, which may result in additional noise. \n    Removes all accented characters. There is a higher rate of messy OCR reporting \n    accented characters than use in this corpus of languages other than English. \n    This approach removes from study questions of non-English language use, but\n    significantly reduces OCR noise.\n    \"\"\"\n    content = re.sub(r\"[^a-zA-Z0-9\\n\\.\\s\\'\\\"\\!\\,\\-\\;\\(\\)\\:\\?\\$\\%\\&]\", \" \", content)\n    \n    return(content)",
      "execution_count": 3,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "def iter_SDA_Periodicals(fname, log_every=None):\n    \"\"\"\n    Yield plain text of each periodical page, as a unicode string.\n\n    The pages are read from the directory `corpus-text-splits` on disk.\n    (e.g. `/Users/jeriwieringa/Dissertation/text/corpus-text-splits/`)\n\n    \"\"\"\n    extracted = 0\n    with tarfile.open(fname, 'r:gz') as tf:\n        for file_number, file_info in enumerate(tf):\n            if file_info.isfile():\n                if log_every and extracted % log_every == 0:\n                    logging.info(\"extracting file #%i: %s\" % (extracted, file_info.name))\n                content = tf.extractfile(file_info).read()\n                yield process_page(content)\n                extracted += 1",
      "execution_count": 4,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "corpus = '/mnt/volume-sfo2-01-part1/dissertation-texts/2016-11-16-SDA-Corpus-Preliminary-Cleaning.tar.gz'",
      "execution_count": 5,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "message_stream = iter_SDA_Periodicals(corpus, log_every=2)",
      "execution_count": 6,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "print(list(itertools.islice(message_stream, 2)))",
      "execution_count": 7,
      "outputs": [
        {
          "name": "stderr",
          "text": "root - INFO - extracting file #0: 2016-11-16-corpus-with-preliminary-cleaning/PHJ18920301-V07-03-page3.txt\n",
          "output_type": "stream"
        },
        {
          "name": "stdout",
          "text": "[' \\nNERVOUSNESS.\\nBY W. H. MAXSON, M. D.\\n\" DocToR, I am so nervous I can scarcely live. Can I get well? and how long will it take ?\" These and similar questions are heard on every side in a busy practice, and they are pertinent questions, from the fact that a large portion of the human race in this day and age of the world have been profligate in the expenditure of the vital forces, and, like the prodigal son, would now like to escape the consequences of a misspent life. Life is the dearest earthly boon we possess, and when the young and middle-aged groan over the effects of wasted nerve force, it is, indeed,\\nunfortunate that, from a physical standpoint, we find no \"father\\'s house\" where we can don the purple robe, kill the fatted calf, and \" make merry,\" no royal road to suffering humanity whereby health and physical ease can be obtained and the suffering ones sit down under the parental roof to enjoy the fat of the land.\\nNature\\'s laws are just,, not merciful and retribution is sure.\\nBy spiritual relationship and divine mercy we may escape the penalty of our own iniquities, but by physical laws each sufferer pays for the transgression, even to the utmost farthing.\\nHow much am I in debt to nature\\'s laws? am I willing to pay it? and how long will it take? is a physiological problem upon the solution of which depends our happiness and usefulness, and the individual who begins to solve the question early in life will evade much pain and sorrow, as well as save time and money for the upbuilding and furtherance of some noble aim in life.\\nNo doubt the Lord, when he created man in his own image, breathed into his nostrils that degree of life that placed man, his crowning work, above every other created creature, and well able to stand at the head of all the \" dominion \" given him, ruler supreme in his physical relationship over the beasts of the field, the fowls of the air, and every creeping or living creature upon the earth.\\nIn his supremacy he must have enjoyed perfect immunity from pain or disease, by virtue of having possessed a degree of life or vital force adequate to repel and keep the citadel of his being free from the baneful effects of climatic\\nchanges, dangers from the beasts of the field, as well as all other adverse influences, able also to withstand the inroads of germ life, which we now realize is the mortal enemy of the human race. Consequently the degree of vitality allotted to every human being, less the effects of heredity, will place the individual, if properly . developed and fostered, above every disease to which flesh is heir. Quite likely nations of the past have come upon the stage of action and passed away the individuals of which have never experienced the pangs of painful dyspepsia or the depression of nervous prostration. Neither is it uncommon in the nineteenth century to meet here and there one in or past the prime of life who, by virtue of great vitality, has never experienced a sick day in all his life. And yet in no age has there been so high a rate of mortality as in the present age, in no race such vitiation of the vital force, with all its dreaded consequences, as in the present race.\\nIt is a well-established fact that most of the diseases incident to germ life, many of which carry a high rate of mortality, belong mainly to the nineteenth century.\\nThe reason of this is apparent when we study the condition and vital status of the people of today, and become acquainted with the baneful practices that from generation to generation have eaten like a canker at the heart life and morals of the human race. Vital force may be divided into three parts. This we can reasonably infer from the fact that the capacity of the lungs is so divided; and as there is perfect co-adaptation in the system, we are led to believe it was originally designed that one-third of the vitality was to be used in the ordinary functions of the body pertaining to the elaboration of food with which to sustain the body, and one-third to be expended in the various avocations and lines of work to which we are called, and the other one-third guarded as a \"reserve nerve force,\" with which the system retrenches itself in times of crisis, protracted illness, shock, or fright.\\nThus we can see that when the vitality is recklessly squandered by pernicious habits of living, the system is deprived of the very force it so much needs with which to meet the exigencies of life, and in the absence of which life\\'s bark is often wrecked in some of the fierce storms of life.\\nHowever, if only the \"reserve nerve force\"\\nEDITORIAL. 67\\'', 'rt\\nR0 T\\nL\\nN E\\n, N\\n1\\nChicago arid Florida Special\\nAM: (\\n- \\'\"1) ( 1,\\'ill\\'t\\nA.  \\n \\n1\\n6i\\nADVERTISEMENTS.\\n P,   ;\\'1\\n\\'\\nIA\\nUEEN &CRESCENT ROUTE\\nAND S SOUTHERN RAILWAY 0\\nR T\\nThrough Pullman Service from Chicago, Cleveland, Detroit, Louisville L\\nON AND AFTER JANUARY 11, 1904 SOLID PULLMAN TRAIN FROM CINCINNATI\\nto Jacksonville and Saint Augustine.\\nFlorida Limited\\nSolid Train, Cincinnati to Jacksonville N and St. Augustine, with through Pullman service from Chicago. E\\nALSO PULLMAN SERVICE BETWEEN\\nCincinnati, Asheville, Savannah, Charleston, Atlanta, Birmingham, New Orleans and Texas Points.\\n5 r.\\nDining and Observation Cars on all through Trains. Write for rates and information\\nW. A. Garrett, G. M.\\nW. C. Rinearson, G. P. A.\\nCincinnati\\nNEW ORLEANS\\nIn replying to advertisements please mention GOOD HEALTH.']\n",
          "output_type": "stream"
        }
      ]
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "class SDA_Periodicals(object):\n    def __init__(self, fname):\n        self.fname = fname\n\n    def __iter__(self):\n        for text in iter_SDA_Periodicals(self.fname):\n            # tokenize each message; simply lowercase & match alphabetic chars, for now\n            yield list(gensim.utils.tokenize(text, lower=True))\n\n",
      "execution_count": 8,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "def best_ngrams(words, top_n=1000, min_freq=100):\n    \"\"\"\n    Extract `top_n` most salient collocations (bigrams and trigrams),\n    from a stream of words. Ignore collocations with frequency\n    lower than `min_freq`.\n\n    This fnc uses NLTK for the collocation detection itself -- not very scalable!\n\n    Return the detected ngrams as compiled regular expressions, for their faster\n    detection later on.\n\n    \"\"\"\n    tcf = TrigramCollocationFinder.from_words(words)\n    tcf.apply_freq_filter(min_freq)\n    trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]\n    logging.info(\"%i trigrams found: %s...\" % (len(trigrams), trigrams[:20]))\n\n    bcf = tcf.bigram_finder()\n    bcf.apply_freq_filter(min_freq)\n    bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]\n    logging.info(\"%i bigrams found: %s...\" % (len(bigrams), bigrams[:20]))\n\n    pat_gram2 = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)\n    pat_gram3 = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)\n\n    return pat_gram2, pat_gram3",
      "execution_count": 9,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "from gensim.parsing.preprocessing import STOPWORDS\n\nclass SDAPeriodicals_Collocations(object):\n    def __init__(self, fname):\n        self.fname = fname\n        logging.info(\"collecting ngrams from %s\" % self.fname)\n        # generator of documents; one element = list of words\n        documents = (self.split_words(text) for text in iter_SDA_Periodicals(self.fname, log_every=1000))\n        # generator: concatenate (chain) all words into a single sequence, lazily\n        words = itertools.chain.from_iterable(documents)\n        self.bigrams, self.trigrams = best_ngrams(words)\n\n    def split_words(self, text, stopwords=STOPWORDS):\n        \"\"\"\n        Break text into a list of single words. Ignore any token that falls into\n        the `stopwords` set.\n\n        \"\"\"\n        return [word\n                for word in gensim.utils.tokenize(text, lower=True) if word not in STOPWORDS and len(word) > 3\n               ]\n\n    def tokenize(self, message):\n        \"\"\"\n        Break text (string) into a list of Unicode tokens.\n        \n        The resulting tokens can be longer phrases (collocations) too,\n        e.g. `new_york`, `real_estate` etc.\n\n        \"\"\"\n        text = u' '.join(self.split_words(message))\n        text = re.sub(self.trigrams, lambda match: match.group(0).replace(u' ', u'_'), text)\n        text = re.sub(self.bigrams, lambda match: match.group(0).replace(u' ', u'_'), text)\n        return text.split()\n\n    def __iter__(self):\n        for message in iter_SDA_Periodicals(self.fname):\n            yield self.tokenize(message)\n",
      "execution_count": 10,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "collocations_corpus = SDAPeriodicals_Collocations(corpus)",
      "execution_count": 11,
      "outputs": [
        {
          "name": "stderr",
          "text": "root - INFO - collecting ngrams from /mnt/volume-sfo2-01-part1/dissertation-texts/2016-11-16-SDA-Corpus-Preliminary-Cleaning.tar.gz\nroot - INFO - extracting file #0: 2016-11-16-corpus-with-preliminary-cleaning/PHJ18920301-V07-03-page3.txt\nroot - INFO - extracting file #1000: 2016-11-16-corpus-with-preliminary-cleaning/RH18721022-V40-19-page5.txt\nroot - INFO - extracting file #2000: 2016-11-16-corpus-with-preliminary-cleaning/LB19111101-V14-11-page25.txt\nroot - INFO - extracting file #3000: 2016-11-16-corpus-with-preliminary-cleaning/HR19051101-V40-11-page52.txt\nroot - INFO - extracting file #4000: 2016-11-16-corpus-with-preliminary-cleaning/RH18860202-V63-05-page13.txt\nroot - INFO - extracting file #5000: 2016-11-16-corpus-with-preliminary-cleaning/RH19090923-V86-38-page6.txt\nroot - INFO - extracting file #6000: 2016-11-16-corpus-with-preliminary-cleaning/RH19191030-V96-44-page26.txt\nroot - INFO - extracting file #7000: 2016-11-16-corpus-with-preliminary-cleaning/RH19180117-V95-03-page11.txt\nroot - INFO - extracting file #8000: 2016-11-16-corpus-with-preliminary-cleaning/RH18730204-V41-08-page8.txt\nroot - INFO - extracting file #9000: 2016-11-16-corpus-with-preliminary-cleaning/LH19150901-V30-09-page17.txt\nroot - INFO - extracting file #10000: 2016-11-16-corpus-with-preliminary-cleaning/RH18810322-V57-12-page11.txt\nroot - INFO - extracting file #11000: 2016-11-16-corpus-with-preliminary-cleaning/CE19100201-V01-03-page8.txt\nroot - INFO - extracting file #12000: 2016-11-16-corpus-with-preliminary-cleaning/ST18810901-V07-33-page3.txt\nroot - INFO - extracting file #13000: 2016-11-16-corpus-with-preliminary-cleaning/HM18940201-V06-02-page20.txt\nroot - INFO - extracting file #14000: 2016-11-16-corpus-with-preliminary-cleaning/RH19140604-V91-23-page24.txt\nroot - INFO - extracting file #15000: 2016-11-16-corpus-with-preliminary-cleaning/PUR19160406-V15-35-page7.txt\nroot - INFO - extracting file #16000: 2016-11-16-corpus-with-preliminary-cleaning/RH19171122-V94-47-page2.txt\nroot - INFO - extracting file #17000: 2016-11-16-corpus-with-preliminary-cleaning/AmSn18990518-V14-20-page12.txt\nroot - INFO - extracting file #18000: 2016-11-16-corpus-with-preliminary-cleaning/RH18601113-V16-26-page3.txt\nroot - INFO - extracting file #19000: 2016-11-16-corpus-with-preliminary-cleaning/HR18920701-V27-07-page8.txt\nroot - INFO - extracting file #20000: 2016-11-16-corpus-with-preliminary-cleaning/HR18840501-V19-05-page32.txt\nroot - INFO - extracting file #21000: 2016-11-16-corpus-with-preliminary-cleaning/YI19130204-V61-05-page32.txt\nroot - INFO - extracting file #22000: 2016-11-16-corpus-with-preliminary-cleaning/RH18921129-V69-47-page6.txt\nroot - INFO - extracting file #23000: 2016-11-16-corpus-with-preliminary-cleaning/CUV19050920-V09-36-page3.txt\nroot - INFO - extracting file #24000: 2016-11-16-corpus-with-preliminary-cleaning/CUV19201007-V25-40-page2.txt\nroot - INFO - extracting file #25000: 2016-11-16-corpus-with-preliminary-cleaning/AmSn18970707-V12-27-page6.txt\nroot - INFO - extracting file #26000: 2016-11-16-corpus-with-preliminary-cleaning/CUV19140408-V19-15-page1.txt\nroot - INFO - extracting file #27000: 2016-11-16-corpus-with-preliminary-cleaning/HM18950101-V07-01a-page2.txt\nroot - INFO - extracting file #28000: 2016-11-16-corpus-with-preliminary-cleaning/RH19130911-V90-37-page7.txt\nroot - INFO - extracting file #29000: 2016-11-16-corpus-with-preliminary-cleaning/LUH19191210-V11-50-page9.txt\nroot - INFO - extracting file #30000: 2016-11-16-corpus-with-preliminary-cleaning/YI19190204-V67-05-page12.txt\nroot - INFO - extracting file #31000: 2016-11-16-corpus-with-preliminary-cleaning/HR18960701-V31-07-page42.txt\nroot - INFO - extracting file #32000: 2016-11-16-corpus-with-preliminary-cleaning/CUV19150211-V20-06-page6.txt\nroot - INFO - extracting file #33000: 2016-11-16-corpus-with-preliminary-cleaning/YI19020717-V50-29-page7.txt\nroot - INFO - extracting file #34000: 2016-11-16-corpus-with-preliminary-cleaning/HR18690601-V03-12-page15.txt\nroot - INFO - extracting file #35000: 2016-11-16-corpus-with-preliminary-cleaning/GCB19130525-V07-08-page4.txt\nroot - INFO - extracting file #36000: 2016-11-16-corpus-with-preliminary-cleaning/RH18861214-V63-49-page8.txt\nroot - INFO - extracting file #37000: 2016-11-16-corpus-with-preliminary-cleaning/RH18961208-V73-49-page8.txt\nroot - INFO - extracting file #38000: 2016-11-16-corpus-with-preliminary-cleaning/YI18850318-V33-09-page2.txt\nroot - INFO - extracting file #39000: 2016-11-16-corpus-with-preliminary-cleaning/LH19181101-V33-11-page4.txt\nroot - INFO - extracting file #40000: 2016-11-16-corpus-with-preliminary-cleaning/RH19060628-V83-26-page21.txt\nroot - INFO - extracting file #41000: 2016-11-16-corpus-with-preliminary-cleaning/SOL19021101-V17-12-page1.txt\nroot - INFO - extracting file #42000: 2016-11-16-corpus-with-preliminary-cleaning/RH18910120-V68-03-page12.txt\nroot - INFO - extracting file #43000: 2016-11-16-corpus-with-preliminary-cleaning/RH18870719-V64-29-page12.txt\nroot - INFO - extracting file #44000: 2016-11-16-corpus-with-preliminary-cleaning/LibM19160401-V11-02-page51.txt\nroot - INFO - extracting file #45000: 2016-11-16-corpus-with-preliminary-cleaning/RH18750617-V45-25-page8.txt\nroot - INFO - extracting file #46000: 2016-11-16-corpus-with-preliminary-cleaning/HR19031201-V38-12-page65.txt\nroot - INFO - extracting file #47000: 2016-11-16-corpus-with-preliminary-cleaning/ST18860527-V12-20-page4.txt\nroot - INFO - extracting file #48000: 2016-11-16-corpus-with-preliminary-cleaning/ST19060404-V32-14-page8.txt\nroot - INFO - extracting file #49000: 2016-11-16-corpus-with-preliminary-cleaning/RH18830417-V60-16-page14.txt\nroot - INFO - extracting file #50000: 2016-11-16-corpus-with-preliminary-cleaning/IR19030819-V09-17-page2.txt\nroot - INFO - extracting file #51000: 2016-11-16-corpus-with-preliminary-cleaning/GH19020319-V04-11-page7.txt\nroot - INFO - extracting file #52000: 2016-11-16-corpus-with-preliminary-cleaning/SOL19000927-V15-38-page5.txt\nroot - INFO - extracting file #53000: 2016-11-16-corpus-with-preliminary-cleaning/CUV19111129-V16-47-page3.txt\nroot - INFO - extracting file #54000: 2016-11-16-corpus-with-preliminary-cleaning/LB19100501-V13-05-page42.txt\nroot - INFO - extracting file #55000: 2016-11-16-corpus-with-preliminary-cleaning/SUW19121128-V06-48-page3.txt\nroot - INFO - extracting file #56000: 2016-11-16-corpus-with-preliminary-cleaning/ST19050920-V31-38-page2.txt\n",
          "output_type": "stream"
        },
        {
          "traceback": [
            "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[1;31mMemoryError\u001b[0m                               Traceback (most recent call last)",
            "\u001b[1;32m<ipython-input-11-e52faec9bedf>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mcollocations_corpus\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mSDAPeriodicals_Collocations\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcorpus\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
            "\u001b[1;32m<ipython-input-10-ca5c28a03323>\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, fname)\u001b[0m\n\u001b[0;32m      9\u001b[0m         \u001b[1;31m# generator: concatenate (chain) all words into a single sequence, lazily\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     10\u001b[0m         \u001b[0mwords\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mitertools\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchain\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfrom_iterable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdocuments\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 11\u001b[1;33m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mbigrams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtrigrams\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mbest_ngrams\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     12\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     13\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0msplit_words\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtext\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstopwords\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mSTOPWORDS\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
            "\u001b[1;32m<ipython-input-9-c2a06b0fa427>\u001b[0m in \u001b[0;36mbest_ngrams\u001b[1;34m(words, top_n, min_freq)\u001b[0m\n\u001b[0;32m     11\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     12\u001b[0m     \"\"\"\n\u001b[1;32m---> 13\u001b[1;33m     \u001b[0mtcf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mTrigramCollocationFinder\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfrom_words\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     14\u001b[0m     \u001b[0mtcf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply_freq_filter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmin_freq\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     15\u001b[0m     \u001b[0mtrigrams\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;34m' '\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mw\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mw\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mtcf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mnbest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mTrigramAssocMeasures\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mchi_sq\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtop_n\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
            "\u001b[1;32m/home/jerielizabeth/miniconda3/envs/dissertation/lib/python3.5/site-packages/nltk/collocations.py\u001b[0m in \u001b[0;36mfrom_words\u001b[1;34m(cls, words, window_size)\u001b[0m\n\u001b[0;32m    226\u001b[0m                     \u001b[1;32mcontinue\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    227\u001b[0m                 \u001b[0mwildfd\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mw1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mw3\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 228\u001b[1;33m                 \u001b[0mtfd\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mw1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mw2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mw3\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    229\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mcls\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwfd\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbfd\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mwildfd\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtfd\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    230\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
            "\u001b[1;31mMemoryError\u001b[0m: "
          ],
          "evalue": "",
          "ename": "MemoryError",
          "output_type": "error"
        }
      ]
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": false
      },
      "cell_type": "code",
      "source": "print(list(itertools.islice(collocations_corpus, 4)))",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": true,
        "trusted": false
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python [default]",
      "language": "python"
    },
    "language_info": {
      "file_extension": ".py",
      "version": "3.5.2",
      "pygments_lexer": "ipython3",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python"
    },
    "anaconda-cloud": {},
    "gist": {
      "id": "",
      "data": {
        "description": "Attempt to identify bigrams",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}