{
"metadata": {
"name": "",
"signature": "sha256:85162be468c0cc9e55a1b8fe7a54747fb9f0b8729794af6f6768138f9722de04"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"An introduction to NLTK - processing raw text and basic analysis"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Working with raw text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some corpora have already been marked up for use with NLTK, but you're often going to want to work with your own texts. So how to we load them in and prepare them for use with NLTK? We're going to start by looking at some plain text (.txt) files of speeches and press releases from the Malcolm Fraser archive, held by the University of Melbourne. We'll look at some of the advantages and disadvantages of using NLTK, and problems of data wrangling. You can check out the Fraser Archive here: http://www.unimelb.edu.au/malcolmfraser/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First of all, let's load in our text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Via file management, open and inspect one file. What do you see? Are there any potential problems?"
]
},
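{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather take a peek from inside the notebook, something like the cell below should work (a minimal sketch, assuming the 'UMA_Fraser_Radio_Talks' folder sits in the same directory as this notebook)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#print the first 300 characters of one file to spot potential problems\n",
"#(assumes the corpus folder sits alongside this notebook)\n",
"sample = open('UMA_Fraser_Radio_Talks/UDS2013680-100-full.txt').read()\n",
"print sample[:300]"
],
"language": "python",
"metadata": {},
"outputs": []
},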
{
"cell_type": "code",
"collapsed": false,
"input": [
"from __future__ import division\n",
"import nltk, re, pprint\n",
"import os\n",
"#import tokenizers\n",
"from nltk import word_tokenize \n",
"from nltk.text import Text"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"nltk.data.path.append('/home/researcher/nltk_data/')\n",
"nltk.download(\"book\", download_dir='/home/researcher/nltk_data/')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, run the above import statements. You'll need these to import and process raw text.\n",
"Now that we've got our texts, let's have a look at what is in the file directory."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#access items in the directory 'UMA_Fraser_Radio_Talks' and view the first 3\n",
"os.listdir('UMA_Fraser_Radio_Talks')[:3]"
],
"language": "python",
"metadata": {},
"outputs": []
},
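{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also worth checking how many files we have to work with. A simple count (note that this will include any non-text files that happen to be in the folder):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#count the number of files in the directory\n",
"len(os.listdir('UMA_Fraser_Radio_Talks'))"
],
"language": "python",
"metadata": {},
"outputs": []
},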
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Basic text analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll read in one speech and tokenize it. This means breaking it up into words for analysis"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#open a file and call the content 'speech'\n",
"speech = open('UMA_Fraser_Radio_Talks/UDS2013680-100-full.txt').read()\n",
"#tokenize the speech and call the result 'vocab'\n",
"vocab = word_tokenize(speech)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(vocab)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(set(vocab))\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"vocab.count('South')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(vocab)/len(set(vocab))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"V = set(vocab)\n",
"long_words = [word for word in V if len(word) > 12]\n",
"sorted(long_words)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To perform more complex operations, we'll need to use a different tokenizer"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sent_vocab = Text(word_tokenize(speech))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sent_vocab.concordance('wool')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sent_vocab.collocations()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#build a table of the 15 most common words in the text\n",
"from nltk.probability import FreqDist\n",
"fdist1 = FreqDist(sent_vocab)\n",
"fdist1.tabulate(15)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#graph the 20 most common words in the text\n",
"%matplotlib inline\n",
"fdist1.plot(20, cumulative=True)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fdist1.max()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"100.0*fdist1.freq('Portland')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"vocab[:20]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(set(word.lower() for word in vocab))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"len(set(word.lower() for word in vocab if word.isalpha()))"
],
"language": "python",
"metadata": {},
"outputs": []
},
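{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same isalpha() test can give us a cleaned-up, lowercased word list, which is often a better basis for counting than the raw tokens. A minimal sketch:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#build a lowercased word list with punctuation and numbers filtered out\n",
"clean_words = [word.lower() for word in vocab if word.isalpha()]\n",
"#view the first 10 cleaned words\n",
"clean_words[:10]"
],
"language": "python",
"metadata": {},
"outputs": []
},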
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Exploring further: splitting up text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've had a look at one file, but the real strength of NLTK is to be able to explore large bodies of text. \n",
"When we manually inspected the first file, we saw that it contained a metadata section, before the body of the text. We can ask Python to show us the start of the file. For analysing the text, it is useful to split the metadata section off, so that we can interrogate it separately but also so that it won't distort our results when we analyse the text."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#view the first 100 characters of the first file\n",
"open('UMA_Fraser_Radio_Talks/' + os.listdir('UMA_Fraser_Radio_Talks')[0]).read()[:100]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#open the first file, read it and then split it into two parts, metadata and body\n",
"data = open('UMA_Fraser_Radio_Talks/' + os.listdir('UMA_Fraser_Radio_Talks')[0]).read().split(\"<!--end metadata-->\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#view the first part\n",
"data[0]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#split into lines, add '*' to the start of each line\n",
"for line in data[0].split('\\r\\n'):\n",
" print '*', line"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#get rid of any line that starts with '<'\n",
"for line in data[0].split('\\r\\n'):\n",
" if line[0] == '<':\n",
" continue\n",
" print '*', line"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#skip empty lines and any line that starts with '<'\n",
"for line in data[0].split('\\r\\n'):\n",
" if not line: \n",
" continue\n",
" if line[0] == '<':\n",
" continue\n",
" print '*', line"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#split the metadata items on ':' so that we can interrogate each one\n",
"for line in data[0].split('\\r\\n'):\n",
" if not line: \n",
" continue\n",
" if line[0] == '<':\n",
" continue\n",
" element = line.split(':') \n",
" print '*', element"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#actually, only split on the first colon\n",
"for line in data[0].split('\\r\\n'):\n",
" if not line: \n",
" continue\n",
" if line[0] == '<':\n",
" continue\n",
" element = line.split(':', 1) \n",
" print '*', element"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Build a dictionary and define a function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've now split up the elements of the metadata, but we want to be able to interrogate it so that we can start to find out something about the collection of files. To do that, we need to build a dictionary."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"metadata = {}\n",
"for line in data[0].split('\\r\\n'):\n",
" if not line: \n",
" continue\n",
" if line[0] == '<':\n",
" continue\n",
" element = line.split(':', 1) \n",
" metadata[element[0]] = element[-1]\n",
"print metadata"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#look up the date\n",
"print metadata['Date']"
],
"language": "python",
"metadata": {},
"outputs": []
},
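{
"cell_type": "markdown",
"metadata": {},
"source": [
"A square-bracket lookup raises a KeyError if the key is missing. Since not every file necessarily records every metadata field, the dictionary's get() method, which returns a default value instead, can be a safer way to look things up. (Note: 'Title' below is just an illustrative key that may or may not exist in this metadata.)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#get() returns the value if the key exists...\n",
"print metadata.get('Date', 'no date recorded')\n",
"#...and the default if it doesn't ('Title' is a hypothetical key here)\n",
"print metadata.get('Title', 'no title recorded')"
],
"language": "python",
"metadata": {},
"outputs": []
},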
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a function means that we can perform an operation multiple times without having to type out all the code every time. There are over 700 files in our directory, so by defining a function and running it over all the files in our directory, we can then interrogate the collection and learn something about it. Creating a function also means that we can be sure that the exactly the same thing is happening each time"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#open the first file, read it and then split it into two parts, metadata and body\n",
"data = open('UMA_Fraser_Radio_Talks/UDS2013680-100-full.txt')\n",
"data = data.read().split(\"<!--end metadata-->\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#define a function that breaks up the metadata for each file and gets rid of the whitespace at the start of each element\n",
"def parse_metadata(text):\n",
" metadata = {}\n",
" for line in text.split('\\r\\n'):\n",
" if not line: \n",
" continue\n",
" if line[0] == '<':\n",
" continue\n",
" element = line.split(':', 1) \n",
" metadata[element[0]] = element[-1].strip(' ')\n",
" return metadata"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"parse_metadata(data[0])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Putting it together: exploring multiple files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we're confident that the function works, let's find out a bit about the corpus. As a start, it would be useful to know which years the texts are from. Are they evenly distributed over time? A graph will tell us!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#import conditional frequency distribution\n",
"from nltk.probability import ConditionalFreqDist\n",
"cfdist = ConditionalFreqDist()\n",
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" #split text of file on 'end metadata'\n",
" text = text.split(\"<!--end metadata-->\")\n",
" #parse metadata using previously defined function \"parse_metadata\"\n",
" metadata = parse_metadata(text[0])\n",
" #skip all speeches for which there is no exact date\n",
" if metadata['Date'][0] == 'c':\n",
" continue\n",
" #build a frequency distribution graph by year, that is, take the final bit of the 'Date' string after '/'\n",
" cfdist['count'][metadata['Date'].split('/')[-1]] += 1\n",
"cfdist.plot()\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cfdistA = ConditionalFreqDist()\n",
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" #split text of file on 'end metadata'\n",
" text = text.split(\"<!--end metadata-->\")\n",
" #parse metadata using previously defined function \"parse_metadata\"\n",
" metadata = parse_metadata(text[0])\n",
" date = metadata['Date']\n",
" if date[0] == 'c':\n",
" year = date[1:]\n",
" elif date[0] != 'c':\n",
" year = date.split('/')[-1]\n",
" if year:\n",
" cfdistA['count'][year] += 1\n",
"cfdistA.plot()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cfdist2 = ConditionalFreqDist()\n",
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" #split text of file on 'end metadata'\n",
" text = text.split(\"<!--end metadata-->\")\n",
" #parse metadata using previously defined function \"parse_metadata\"\n",
" metadata = parse_metadata(text[0])\n",
" #skip all speeches for which there is no exact date\n",
" if metadata['Date'][0] == 'c':\n",
" continue\n",
" #build a frequency distribution graph by 'Description'\n",
" cfdist2['count'][metadata['Description']] += 1\n",
"cfdist2.plot()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Previously, we tokenized the text of a file so that we could conduct some analysis. Let's now tokenize just the body of the file, not the metadata. As an exersize, let's see how the modal verbs 'must', 'should' and 'will' occur in the text."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#tokenize the body of the text so that we can start to analyse it\n",
"tokens = word_tokenize(data[1])\n",
"tokens.count('should')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each file, tokenize the body then count how often 'must', 'will' and 'should' occurs in each"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" #split text of file on 'end metadata'\n",
" text = text.split(\"<!--end metadata-->\")\n",
" #parse metadata using previously defined function \"parse_metadata\"\n",
" metadata = parse_metadata(text[0])\n",
" #skip all speeches for which there is no exact date\n",
" if metadata['Date'][0] == 'c':\n",
" continue\n",
" #tokenise the text of the speech\n",
" tokens = word_tokenize(text[1].decode('ISO-8859-1'))\n",
" #show the date of each speech count how often 'should' and 'must' are used in each\n",
" print metadata['Date'], ',', tokens.count('should'), ',', tokens.count('must'), ',', tokens.count('will')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And graph that"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cfdist3 = ConditionalFreqDist()\n",
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" text = text.split('<!--end metadata-->')\n",
" metadata = parse_metadata(text[0])\n",
" date = metadata['Date']\n",
" if date[0] == 'c':\n",
" year = date[1:]\n",
" elif date[0] != 'c':\n",
" year = date.split('/')[-1]\n",
" if year == '1966':\n",
" continue\n",
" tokens = word_tokenize(text[1].decode('ISO-8859-1'))\n",
" cfdist3['should'][year] += tokens.count('should')\n",
" cfdist3['will'][year] += tokens.count('will')\n",
" cfdist3['must'][year] += tokens.count('must')\n",
"cfdist3.plot()"
],
"language": "python",
"metadata": {},
"outputs": []
},
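{
"cell_type": "markdown",
"metadata": {},
"source": [
"Raw counts are skewed by how much text we have for each year. Dividing each count by the total number of tokens in the speech gives a relative frequency instead. This is why we imported division from __future__ at the start: it makes '/' return a fraction rather than a rounded-down whole number."
]
},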
{
"cell_type": "code",
"collapsed": false,
"input": [
"cfdist3 = ConditionalFreqDist()\n",
"for filename in os.listdir('UMA_Fraser_Radio_Talks'):\n",
" text = open('UMA_Fraser_Radio_Talks/' + filename).read()\n",
" text = text.split('<!--end metadata-->')\n",
" metadata = parse_metadata(text[0])\n",
" date = metadata['Date']\n",
" if date[0] == 'c':\n",
" year = date[1:]\n",
" elif date[0] != 'c':\n",
" year = date.split('/')[-1]\n",
" if year == '1966':\n",
" continue\n",
" tokens = word_tokenize(text[1].decode('ISO-8859-1'))\n",
" if len(tokens) == 0:\n",
" continue\n",
" cfdist3['should'][year] += tokens.count('should') / len(tokens)\n",
" cfdist3['will'][year] += tokens.count('will') / len(tokens)\n",
" cfdist3['must'][year] += tokens.count('must') / len(tokens)\n",
"cfdist3.plot()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}