@rjweiss
Last active December 28, 2015 23:09
cleaning text
{
"metadata": {
"name": "cleaningtext.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleaning text\n",
"\n",
"Here, we're going to go over some basic text cleaning steps in Python."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"raw_docs = [\"Here are some very simple basic sentences.\",\n",
"\"They won't be very interesting, I'm afraid.\",\n",
"\"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data.\"]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenizing text into bags of words\n",
"\n",
"NLTK makes it easy to convert documents-as-strings into word-vectors, a process called **tokenizing.** "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.tokenize import word_tokenize\n",
"\n",
"tokenized_docs = [word_tokenize(doc) for doc in raw_docs]\n",
"print tokenized_docs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', \"n't\", 'be', 'very', 'interesting', ',', 'I', \"'m\", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]\n"
]
}
],
"prompt_number": 2
},
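{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK also ships a sentence-level tokenizer, `sent_tokenize`, if you want sentences instead of words (a quick aside; it relies on the `punkt` model, which you may need to download first with `nltk.download()`):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.tokenize import sent_tokenize\n",
"\n",
"# join the toy documents into one blob, then split it back into sentences\n",
"print sent_tokenize(' '.join(raw_docs))"
],
"language": "python",
"metadata": {},
"outputs": []
},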
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing punctuation\n",
"\n",
"Punctuation can help with tokenizers, but once you've done that, there's no reason to keep it around. There are tons of ways to remove punctuation. Since we have already learned regex, how would we do this?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import re\n",
"import string\n",
"regex = re.compile('[%s]' % re.escape(string.punctuation)) #see documentation here: http://docs.python.org/2/library/string.html\n",
"\n",
"tokenized_docs_no_punctuation = []\n",
"\n",
"for review in tokenized_docs:\n",
" \n",
" new_review = []\n",
" for token in review: \n",
" new_token = regex.sub(u'', token)\n",
" if not new_token == u'':\n",
" new_review.append(new_token)\n",
" \n",
" tokenized_docs_no_punctuation.append(new_review)\n",
" \n",
"print tokenized_docs_no_punctuation"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', u'nt', 'be', 'very', 'interesting', 'I', u'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', u'learn', 'how', 'basic', 'text', 'cleaning', u'works', 'on', u'very', u'simple', 'data']]\n"
]
}
],
"prompt_number": 3
},
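{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, regex isn't the only way. For plain byte strings in Python 2, `str.translate` can delete characters directly (a minimal sketch; note this form doesn't work on `unicode` objects):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import string\n",
"\n",
"# translate(None, deletechars) deletes every character in deletechars\n",
"print \"Here are some sentences!?\".translate(None, string.punctuation)"
],
"language": "python",
"metadata": {},
"outputs": []
},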
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleaning text of stopwords\n",
"\n",
"There are some really basic words that just don't matter. NLTK comes with a list of them for many languages."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.corpus import stopwords\n",
"\n",
"tokenized_docs_no_stopwords = []\n",
"for doc in tokenized_docs_no_punctuation:\n",
" new_term_vector = []\n",
" for word in doc:\n",
" if not word in stopwords.words('english'):\n",
" new_term_vector.append(word)\n",
" tokenized_docs_no_stopwords.append(new_term_vector)\n",
" \n",
"print tokenized_docs_no_stopwords"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[['Here', 'simple', 'basic', 'sentences'], ['They', 'wo', u'nt', 'interesting', 'I', u'm', 'afraid'], ['The', 'point', 'examples', u'learn', 'basic', 'text', 'cleaning', u'works', u'simple', 'data']]\n"
]
}
],
"prompt_number": 4
},
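{
"cell_type": "markdown",
"metadata": {},
"source": [
"One practical note: `stopwords.words('english')` returns a list, and the loop above scans it once per token. If your corpus is large, build a `set` once so that membership checks are fast (a sketch that produces the same result):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# a set gives O(1) membership tests, versus scanning a list every time\n",
"stopword_set = set(stopwords.words('english'))\n",
"\n",
"tokenized_docs_no_stopwords = [[word for word in doc if word not in stopword_set]\n",
"                               for doc in tokenized_docs_no_punctuation]\n",
"print tokenized_docs_no_stopwords"
],
"language": "python",
"metadata": {},
"outputs": []
},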
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stemming and Lemmatizing\n",
"\n",
"If you have taken linguistics, you may be familiar with morphology. This is the belief that words have a root form. If you want to get to the basic term meaning of the word, you can try applying a stemmer or lemmatizer. Here are three very popular methods ready to go right out of the NLTK box. It's up to you to see which one you want to use."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from nltk.stem.porter import PorterStemmer\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"\n",
"porter = PorterStemmer()\n",
"snowball = SnowballStemmer('english')\n",
"wordnet = WordNetLemmatizer()\n",
"\n",
"preprocessed_docs = []\n",
"\n",
"for doc in tokenized_docs_no_stopwords:\n",
" final_doc = []\n",
" for word in doc:\n",
" final_doc.append(porter.stem(word))\n",
" #final_doc.append(snowball.stem(word))\n",
" #final_doc.append(wordnet.lemmatize(word)) #note that lemmatize() can also takes part of speech as an argument!\n",
" preprocessed_docs.append(final_doc)\n",
"\n",
"print preprocessed_docs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[['Here', 'simpl', 'basic', 'sentenc'], ['They', 'wo', u'nt', 'interest', 'I', u'm', 'afraid'], ['The', 'point', 'exampl', u'learn', 'basic', 'text', 'clean', u'work', u'simpl', 'data']]\n"
]
}
],
"prompt_number": 5
},
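{
"cell_type": "markdown",
"metadata": {},
"source": [
"Following up on the comment above: `lemmatize()` treats words as nouns by default, so passing a part of speech changes the result (a quick sketch):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# with no part of speech, the WordNet lemmatizer assumes a noun\n",
"print wordnet.lemmatize('works')         # -> 'work'\n",
"print wordnet.lemmatize('running')       # 'running' is a valid noun, so unchanged\n",
"print wordnet.lemmatize('running', 'v')  # as a verb -> 'run'"
],
"language": "python",
"metadata": {},
"outputs": []
},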
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Remember how we made a list of review texts in Text 1?\n",
"\n",
"### Exercise 1\n",
"Create a new list of `review_texts` called `clean_reviews` that are:\n",
"\n",
"- tokenized\n",
"- free of punctuation\n",
"- free of stopwords\n",
"- stemmed or lemmatized"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"import csv\n",
"\n",
"#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text2/extra/')\n",
"\n",
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n",
" amazon_reader = csv.DictReader(csvfile, delimiter=',')\n",
" amazon_reviews = [row['review_text'] for row in amazon_reader]\n",
" \n",
" #your code here!!!"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
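{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get stuck, here is one possible sketch of a solution, reusing `word_tokenize`, `regex`, the stopword list, and `porter` from the cells above (your version may differ!):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# one possible pipeline: tokenize, strip punctuation, drop stopwords, stem\n",
"stopword_set = set(stopwords.words('english'))\n",
"\n",
"clean_reviews = []\n",
"for review_text in amazon_reviews:\n",
"    tokens = word_tokenize(review_text)\n",
"    tokens = [regex.sub(u'', token) for token in tokens]\n",
"    tokens = [porter.stem(token) for token in tokens\n",
"              if token != u'' and token not in stopword_set]\n",
"    clean_reviews.append(tokens)\n",
"\n",
"print clean_reviews[0]"
],
"language": "python",
"metadata": {},
"outputs": []
},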
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing HTML entities and tags\n",
"\n",
"Recall that HTML entities are an artifact from the pre-Unicode era. Browsers know to render HTML entities a certain way on the page, but we don't need them anymore. \n",
"\n",
"Here's some code that will do this for you (function courtesy of the author of lxml)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import re, htmlentitydefs\n",
"\n",
"##\n",
"# Removes HTML or XML character references and entities from a text string.\n",
"#\n",
"# @param text The HTML (or XML) source text.\n",
"# @return The plain text, as a Unicode string, if necessary.\n",
"# AUTHOR: Fredrik Lundh\n",
"\n",
"def unescape(text):\n",
" def fixup(m):\n",
" text = m.group(0)\n",
" if text[:2] == \"&#\":\n",
" # character reference\n",
" try:\n",
" if text[:3] == \"&#x\":\n",
" return unichr(int(text[3:-1], 16))\n",
" else:\n",
" return unichr(int(text[2:-1]))\n",
" except ValueError:\n",
" pass\n",
" else:\n",
" # named entity\n",
" try:\n",
" text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])\n",
" except KeyError:\n",
" pass\n",
" return text # leave as is\n",
" return re.sub(\"&#?\\w+;\", fixup, text)\n",
"\n",
"test_string =\"<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)\"\n",
"\n",
"print test_string\n",
"print unescape(test_string)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)\n",
"<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the \"Chicken Soup for the Soul\" series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)\n"
]
}
],
"prompt_number": 7
},
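{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function also handles numeric character references, decimal and hexadecimal alike (a quick check on a made-up string):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# &#34; (decimal) and &#x22; (hex) are both references for the double quote\n",
"print unescape('say &#34;hi&#34; and &#x22;bye&#x22;')"
],
"language": "python",
"metadata": {},
"outputs": []
},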
{
"cell_type": "code",
"collapsed": false,
"input": [
"import nltk\n",
"\n",
"nltk.clean_html(unescape(test_string.decode('utf8'))) #notice that it returns unicode!"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 8,
"text": [
"u'While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don\\'t like the \"Chicken Soup for the Soul\" series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)'"
]
}
],
"prompt_number": 8
}
],
"metadata": {}
}
]
}