Python spell-checker for a Twitter stream

This is a simple Python program that streams tweets from two locations (London and Exeter, in our example) and compares which one has the greater number of spelling mistakes.

Set-up used:

*Ubuntu 11.04 Natty AMD64
*Python 2.7.3
*Python re library (standard library)
*Python nltk 2.0 library, plus the required NumPy and PyYAML (for NLP tasks)
*Python tweetstream 1.1.1 library (for Twitter manipulation)
*Python enchant (PyEnchant) 1.6 library (for spelling verification)

Installation Instructions:


  • Python and the Python packaging tools: from a command prompt, run:

sudo apt-get install python python-pip python-setuptools


  • NLTK, NumPy and PyYAML libraries: from a command prompt, run:

sudo pip install -U numpy
sudo pip install -U pyyaml nltk
Test the installation: run python, then type import nltk.

  • Tweetstream 1.1.1: download at
http://pypi.python.org/pypi/tweetstream/

then decompress the file, enter the directory and run:
sudo python setup.py install

  • PyEnchant 1.6: download at
http://packages.python.org/pyenchant/

then decompress the file, enter the directory and run:
sudo python setup.py install
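
To verify the whole set-up at once, the following quick check (a minimal sketch) should run without an ImportError:

# sanity_check.py -- confirms that all required libraries are importable
import re
import nltk
import enchant
import tweetstream
print "nltk version:", nltk.__version__
print "All libraries imported successfully."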

Assumptions:

Geospatial:
We use the ‘locations’ parameter of the Twitter streaming API to determine the location of a tweet. This choice centralises the dependency on external services into a single API, reducing possible problems. An area is given by a bounding box determined by two points:

(i) a south-west corner: given by a lon-lat pair (NOTE: the order of the coordinates is the inverse of Google Maps, which uses lat-lon)

(ii) a north-east corner: given by another lon-lat pair (same ordering caveat)
The resulting area is encoded as a list containing the 4 values in the order [SW-lon, SW-lat, NE-lon, NE-lat] (see the sketch below).

London’s south-west corner has been set between Feltham and Croydon; lon = -0.32, lat = 51.4
London’s north-east corner has been set between Enfield and Hornchurch; lon = 0.132, lat = 51.65

Exeter’s south-west corner has been set between Ide and Alphington; lon = -3.55, lat = 50.70
Exeter’s north-east corner has been set between Stoke Hill and Whipton; lon = -3.48, lat = 50.73
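
Putting a box together, here is how the London area above is encoded for the streaming API (a minimal sketch; londonarea is the variable used in the program listed further down):

# one comma-separated "SW-lon,SW-lat,NE-lon,NE-lat" string inside a list,
# the form expected by tweetstream.FilterStream's locations argument
londonarea = ["-0.32,51.4,0.132,51.65"]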

Linguistic Assumptions:
The Twitter API gives plenty of information about the user, one piece being the user’s language setting. London hosts many more language communities than Exeter does. I found a strong correlation between users whose language parameter is not English and the production of tweets in a language other than English (which should not count towards English misspellings). As a result of filtering on this parameter, gathering the tweets is slower, but the data is cleaner.

Analysis assumptions:

In addition to filtering by the language parameter, our analysis of spelling is done in two stages.

Stage one: words starting with ‘@’ or ‘#’ (justified by the requirements of the exercise) and sequences of punctuation symbols are replaced by white space. This is justified by the fact that the tokenizer used at a later stage preserves trailing punctuation (e.g. “New York.” is parsed as “New” and “York.”) but is much better at tokenizing contractions (e.g. “can’t” becomes “can” and “ ’t ”). An illustration follows.
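
For example (a minimal sketch; the sample tweet is made up):

import re
# strips @-mentions, hashtags and runs of punctuation, as described above
pattern = r'(?:[#@][a-zA-Z0-9_]*)|(?:[?.!;,-]+)'
print re.sub(pattern, ' ', u"@bob can't wait!!! #excited New York.")
# prints something like:  can't wait   New York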

Stage two: we tokenize the tweets using nltk’s PunktWordTokenizer, which handles contracted forms much better. The contracted forms themselves (e.g. ’d, ’s, ’m) are not words according to the Enchant dictionary, but they can be added to a dictionary instance, hence counting as words. A sketch of this stage follows.
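
A minimal sketch of stage two (the exact tokens may vary with the NLTK version):

import nltk
import enchant
tokens = nltk.PunktWordTokenizer().tokenize(u"can't wait New York")
# expected: [u'can', u"'t", u'wait', u'New', u'York']
d = enchant.Dict("en_GB")
d.add_to_session(u"'t")  # register the contracted form so it counts as a word
print [d.check(t) for t in tokens]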

It’s important to observe that we use a rate (number of spelling mistakes / total number of words) as the benchmark, so the data is normalised: more words would inevitably contain more mistakes in absolute terms. For example, 120 mistakes in 1,000 words (rate 0.12) is better than 30 mistakes in 200 words (rate 0.15).

Running:

The problem was broken down into two subtasks: (i) streaming data and providing a bag of tweets, and (ii) checking for spelling mistakes and keeping track of them.

First, we stream the data using the Twitter API, clean it up a bit, tokenize it and return a bag of tweet words: a list of lists representing the tokens inside each tweet string. This is done by the getBagOfTweets function, which takes a username, password, area, limit and language as arguments and returns a list containing lists of tokens. This procedure is kept in a separate file (twiterwords2.py, matching the import below) for modularisation purposes.
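
The intended interface is roughly as follows (a sketch; the credentials, area and tweet count are placeholders):

import twiterwords2
bag = twiterwords2.getBagOfTweets("myuser", "mypassword", ["-0.32,51.4,0.132,51.65"], 5, "en")
# bag is a list of token lists, e.g. [[u'Hello', u'London'], [u'can', u"'t", u'wait']]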

Second, I defined a Spelling class whose objects keep track of the number of correct words, spelling mistakes, total number of words checked, and so on. The constructor requires a list containing lists of tokens, so that a Spelling object can be instantiated. For modularisation purposes the class is kept in a separate file (wordcheck2.py). The checkWordLists method takes each list representing the tokens in a tweet and checks every token against the dictionary (returning a boolean True or False). If True, we increment the number of correct words; if not, we increment the number of spelling mistakes. In either case, we increment the total number of words.
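
A minimal usage sketch of the class (the token list is made up):

import wordcheck2
sp = wordcheck2.Spelling("London", "en_GB", [[u"helo", u"world"]])
sp.checkWordLists()
print sp.nonwords, sp.correctwords, sp.totalwords, sp.rate  # -> 1.0 1.0 2.0 0.5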

Finally, I defined a main file (twiterwars2.py) that imports the two modules above (and the libraries they use). The main asks for the number of tweets the user would like to stream; this number is used as the limit for the data stream. We call the getBagOfTweets function, which returns the lists of tokens in a list, and then use this list to calculate the spelling mistakes. Finally, we divide the number of mistakes by the total number of words, which gives us a rate: the bigger the rate, the more mistakes.

# twiterwars2.py -- the main program
# This imports the two project modules in order to run the program.
# Each module imports its own libraries; please look in the module files for further instructions.
import twiterwords2
import wordcheck2

# We use the 'locations' parameter of the Twitter streaming API to determine the location of a tweet.
# Areas are given by a bounding box determined by two points:
# (i) a south-west corner: a lon-lat pair (NOTE: the order is the inverse of Google Maps)
# (ii) a north-east corner: another lon-lat pair (same ordering caveat)
# The resulting area is encoded as a "SW-lon,SW-lat,NE-lon,NE-lat" string inside a list.
# Geospatial assumptions:
# London's south-west corner has been set between Feltham and Croydon; lon = -0.32, lat = 51.4
# London's north-east corner has been set between Enfield and Hornchurch; lon = 0.132, lat = 51.65
londonarea = ["-0.32,51.4,0.132,51.65"]
# Exeter's south-west corner has been set between Ide and Alphington
# Exeter's north-east corner has been set between Stoke Hill and Whipton
exeterarea = ["-3.55,50.70,-3.48,50.73"]  # both longitudes are negative (west of Greenwich)
# Linguistic assumptions:
# The Twitter API gives plenty of information about the user, including the user's language setting.
# London hosts many more language communities than Exeter, and there is a strong correlation
# between users whose language parameter is not English and tweets written in a language other
# than English (which should not count towards English misspellings).
# We therefore store the language parameter used to filter the tweets in a variable,
# which is passed to our bag-of-tweets function. This can be changed if needed.
languageparam = "en"  # the Twitter user-language parameter
# We also use a dictionary tag for checking word spelling (needed by PyEnchant);
# the Enchant tag for British English is "en_GB" ("en_UK" is not a valid tag).
dictparam = "en_GB"

if __name__ == '__main__':
    print """Welcome to Twitter Wars!
This program aims to find out whether the population of Exeter or London is better at spelling.
Please enter the number of tweets that you want us to stream.
For testing purposes, you should use a low number (e.g. 5).
"""
    username = str(raw_input("Please enter your Twitter username: "))
    password = str(raw_input("Please enter your Twitter password: "))
    # raw_input + int avoids Python 2's eval-based input()
    tweets2stream = int(raw_input("Please enter the number of tweets to stream, using digits: "))
    print """Thank you, we are now streaming tweets from London.
Loading London Tweets ----------------------"""
    # This initialises the stream and gathers the London tokens
    londonwords = twiterwords2.getBagOfTweets(username, password, londonarea, tweets2stream, languageparam)
    #print londonwords  # uncomment for debugging purposes
    print """Thank you, we are now streaming tweets from Exeter. This might take a little longer.
Loading Exeter Tweets ----------------------"""
    exeterwords = twiterwords2.getBagOfTweets(username, password, exeterarea, tweets2stream, languageparam)
    #print exeterwords  # uncomment for debugging purposes
    print "Thank you, we are now moving to the analysis phase."
    # In the analysis phase, we initialise two Spelling objects
    londonspelling = wordcheck2.Spelling("London", dictparam, londonwords)
    exeterspelling = wordcheck2.Spelling("Exeter", dictparam, exeterwords)
    # This analyses the words gathered previously
    londonspelling.checkWordLists()
    exeterspelling.checkWordLists()
    print "Analysis results:"
    # Compares the two Spelling objects and prints the results, naming the "winner"
    londonspelling.compareSpelling(exeterspelling)
    # The statements below clear the authentication details
    username = ""
    password = ""

# twiterwords2.py -- streams tweets and returns them as a bag of token lists
import re
import nltk
import tweetstream

def getBagOfTweets(u, p, a, lim, lang):
    # takes a username, password, area, limit and a language as arguments
    # first level of analysis: removes @-mentions, hashtags and runs of punctuation
    pattern = r'(?:[#@][a-zA-Z0-9_]*)|(?:[?.!;,-]+)'
    counter = 0       # counts the tweets streamed so far
    tweets = []       # the cleaned-up tweet strings
    bagoftweets = []  # one token list per tweet
    try:
        with tweetstream.FilterStream(u, p, locations=a) as stream:  # initialises the stream of tweets
            for tweet in stream:
                if counter < lim:  # compares the tweet counter against the limit stipulated by the user
                    if tweet["user"]["lang"] == lang:  # keeps only tweets whose author matches the stipulated language
                        # tweet["text"] is already unicode (decoded from JSON), so no .decode() is needed
                        tweets.append(re.sub(pattern, ' ', tweet["text"]))
                        print counter + 1, "tweets streamed so far"
                        counter = counter + 1
                else:
                    break
    except tweetstream.ConnectionError, e:  # in case the connection failed for some reason
        print "Disconnected from Twitter. Reason:", e.reason
    for streamedtweet in tweets:
        # tokenize each cleaned tweet and add its token list to the bag
        bagoftweets.append(nltk.PunktWordTokenizer().tokenize(streamedtweet))
    return bagoftweets

# wordcheck2.py -- the Spelling class keeps spelling statistics for a bag of tweets
import enchant

class Spelling(object):
    def __init__(self, name, dictionary, wordlists):
        self.name = name
        self.dictionary = enchant.Dict(dictionary)
        self.wordlists = wordlists  # a list of token lists, one per tweet
        # counters are floats so that the rate division below is not integer division
        self.totalwords = 0.0
        self.nonwords = 0.0
        self.correctwords = 0.0
        self.rate = 0.0

    def checkWordLists(self):
        # Registers the contracted word forms produced by the tokenizer as words.
        # Dict.add() takes a single word (and persists it to the user's personal
        # word list), so each form is added individually, for this session only.
        for form in ["'t", "'s", "'d", "'ll", "'m"]:
            self.dictionary.add_to_session(form)
        for l in self.wordlists:
            for word in l:
                try:
                    if self.dictionary.check(word):
                        self.correctwords = self.correctwords + 1
                    else:
                        self.nonwords = self.nonwords + 1
                    self.totalwords = self.totalwords + 1
                except Exception:
                    print "Sorry, we couldn't check this word (it is probably empty or badly encoded)"
        if self.totalwords > 0:
            self.rate = self.nonwords / self.totalwords
        else:
            print "Warning: the total number of words is 0, something funny happened!"

    def compareSpelling(self, o):
        try:
            print "Number of spelling mistakes :", self.name, ":", self.nonwords, ",", o.name, ":", o.nonwords
            print "Number of correct words     :", self.name, ":", self.correctwords, ",", o.name, ":", o.correctwords
            print "Total number of words       :", self.name, ":", self.totalwords, ",", o.name, ":", o.totalwords
            print "Rate of error (misspellings / total words):", self.name, ":", self.rate, ",", o.name, ":", o.rate
            if self.rate == o.rate:
                print "It was a tie"
            elif self.rate > o.rate:
                print self.name, "has the most misspellings"
            else:
                print o.name, "has the most misspellings"
        except AttributeError:
            print "You are probably not making a comparison with another Spelling object"