This is a simple Python program that streams tweets from two locations, London and Exeter in our example, and compares which one has the higher rate of spelling mistakes.
1 – Set-up used:
*Ubuntu 11.04 Natty AMD64
*Python 2.7.3
*python re library
*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tasks)
*python tweetstream 1.1.1 library (For Twitter manipulation)
*python enchant 1.6 library (For spelling verification)
Installation Instructions:
- Python and the Python packaging tools (pip, setuptools): from a command prompt run:
sudo apt-get install python python-pip python-setuptools
- NLTK, NumPy, PyYAML libraries: from a command prompt run:
sudo pip install -U numpy
sudo pip install -U pyyaml nltk
Test installation: run python then type import nltk
*tweetstream 1.1.1
Download at:
http://pypi.python.org/pypi/tweetstream/
decompress the file, enter the directory and run:
sudo python setup.py install
*Python Enchant:
Download at:
http://packages.python.org/pyenchant/
decompress the file, enter the directory and run:
sudo python setup.py install
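To verify the whole set-up, a quick check from the Python prompt (this only confirms that the libraries import; en_GB is assumed here as the dictionary tag):
import re, nltk, enchant, tweetstream
print enchant.dict_exists("en_GB")   # True if an English dictionary is available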
Assumptions:
Geospatial:
We use the ‘locations’ parameter of the Twitter streaming API to determine the location of a tweet. This choice centralizes the dependency on external services into a single API, reducing possible problems. An area is specified by a bounding box defined by two points:
(i) a south-west corner: given as a lon-lat pair (NOTE: the order of the coordinates is the inverse of Google Maps)
(ii) a north-east corner: given as another lon-lat pair (NOTE: the order of the coordinates is the inverse of Google Maps)
The resulting area encoding is a list containing the 4 values, e.g. [SW-lon, SW-lat, NE-lon, NE-lat] (see the example after the corner list below).
London’s south-west corner has been set between Feltham and Croydon; lon = -0.32, lat = 51.4
London’s north-east corner has been set between Enfield and Hornchurch; lon = 0.132, lat = 51.65
Exeter’s south-west corner has been set between Ide and Alphington
Exeter’s north-east corner has been set between Stoke Hill and Whipton
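As an example, the London box from the corners above would be encoded as follows (Exeter’s box is built the same way from its two corners; its exact coordinates are omitted here):
# Bounding box in the order expected by the streaming API: [SW-lon, SW-lat, NE-lon, NE-lat]
LONDON = ["-0.32", "51.4", "0.132", "51.65"]   # given as strings, on the assumption that they end up in a comma-joined query parameter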
Linguistic Assumptions:
The Twitter API gives plenty of information about the user, one piece being the user’s language. London hosts many more language communities than Exeter, for example. I found a strong correlation between users whose language parameter is not English and the production of tweets in a language other than English (which should not count towards English misspellings). As a result, the process of gathering tweets is slower, but the data is cleaner.
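A minimal sketch of this filter, assuming tweetstream’s FilterStream interface and the standard 'user'/'lang' fields of a tweet dictionary (the function name is illustrative; the real code lives in tweetwords2.py):
import tweetstream

def stream_english_tweets(username, password, area, limit):
    # Keep only tweets whose author declares English as their language.
    texts = []
    with tweetstream.FilterStream(username, password, locations=area) as stream:
        for tweet in stream:
            if tweet.get("user", {}).get("lang") == "en" and "text" in tweet:
                texts.append(tweet["text"])
            if len(texts) >= limit:
                break
    return texts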
Analysis assumptions:
In addition to filtering by the language parameter, our spelling analysis is done in two stages.
Stage one: words starting with ‘@’ or ‘#’ are replaced by white space (justified by the requirements of the exercise), as are sequences of punctuation symbols (justified by the fact that the tokenizer used in the next stage preserves trailing punctuation, e.g. “New York.” is parsed as “New” and “York.”, but is much better at tokenizing contractions, e.g. “can’t” becomes “can” and “ ’t ”).
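A sketch of the stage-one clean-up with the re library (the exact patterns in tweetwords2.py may differ; apostrophes are kept so contractions survive for stage two):
import re

def clean_tweet(text):
    text = re.sub(r"[@#]\S+", " ", text)      # drop @mentions and #hashtags
    text = re.sub(r"[^\w\s']+", " ", text)    # replace runs of punctuation with spaces, keep apostrophes
    return text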
Stage two: we tokenize the tweets using the nltk.PunktWordTokenizer() method, which handles contracted forms much better. The forms themselves (e.g. ’d, ’s, ’m) are not words according to the Enchant dictionary, but they can be added to an instance of the dictionary, so they count as words.
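Sketched with NLTK and pyenchant, assuming a British English dictionary (en_GB) and an illustrative list of contracted forms:
import nltk, enchant

tokenizer = nltk.PunktWordTokenizer()
dictionary = enchant.Dict("en_GB")
for form in ["'d", "'s", "'m", "'t", "'re", "'ll", "'ve"]:
    dictionary.add_to_session(form)           # treat contracted forms as words for this dictionary instance

print tokenizer.tokenize("I can't believe it's snowing")
# roughly: ['I', 'can', "'t", 'believe', 'it', "'s", 'snowing']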
It is important to observe that we use a rate (number of spelling mistakes / total number of words) as the benchmark; this normalises the data, since a larger number of words would inevitably contain more mistakes.
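The benchmark is then simply the ratio, e.g.:
rate = float(mistakes) / total_words    # spelling mistakes per word, comparable across cities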
Running:
The problem was broken down into two subtasks, (i) streaming data and providing a bag of tweets, and (ii) checking for spelling mistakes and keeping track of them:
First, we stream the data using the Twitter API, clean it up a bit, tokenize it, and return a bag of tweet words: a list of lists representing the tokens inside each tweet string. This is done by the getBagOfTweets function, which takes a username, password, area, limit, and language as arguments and returns a list containing lists of tokens. This procedure is kept in a separate file (tweetwords2.py) for modularisation purposes.
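A hypothetical call, using the London box from the geospatial section (the credentials are placeholders):
from tweetwords2 import getBagOfTweets

london_bag = getBagOfTweets("username", "password", ["-0.32", "51.4", "0.132", "51.65"], 100, "en")
# london_bag is a list of token lists, one per tweet, e.g. [['lovely', 'day', 'out'], ...]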
Second, I defined a Spelling class whose objects contain the number of correct words, spelling mistakes, total number of words checked, and so on. The constructor of this object requires a list containing lists of tokens, so that a Spelling object can be instantiated. For modularisation purposes the class was kept in a separate file (wordcheck2.py). The checkWordsLists method takes the lists representing the tokens in each tweet and checks whether each token is in the dictionary (returning a boolean True or False). If True, we increment the number of correct words; if not, we increment the number of spelling mistakes. In either case, we increment the total number of words.
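A sketch of what wordcheck2.py looks like under these assumptions (the attribute names are illustrative):
import enchant

class Spelling(object):
    def __init__(self, bagOfTweets):
        self.dictionary = enchant.Dict("en_GB")
        for form in ["'d", "'s", "'m", "'t"]:          # contracted forms counted as words
            self.dictionary.add_to_session(form)
        self.correctWords = 0
        self.spellingMistakes = 0
        self.totalWords = 0
        self.checkWordsLists(bagOfTweets)

    def checkWordsLists(self, bagOfTweets):
        for tokens in bagOfTweets:                     # one token list per tweet
            for word in tokens:
                if self.dictionary.check(word):
                    self.correctWords += 1
                else:
                    self.spellingMistakes += 1
                self.totalWords += 1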
Finally, I defined a main file (tweetwards2.py) that loads the two Python files (and the libraries they call). The main asks for the number of tweets the user would like to stream; this number is used as the limit for the data stream. We call the getBagOfTweets function, which returns the lists of tokens in a list, and then use this list to calculate the spelling mistakes. Finally, we divide the number of mistakes by the total number of words, which gives us a rate: the higher the rate, the more mistakes.
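Putting it together, the main script roughly does the following (the module names follow the text; the Exeter box, the credentials and the attribute names are placeholders from the sketches above):
from tweetwords2 import getBagOfTweets
from wordcheck2 import Spelling

LONDON = ["-0.32", "51.4", "0.132", "51.65"]   # from the geospatial section
EXETER = []                                    # fill in Exeter's [SW-lon, SW-lat, NE-lon, NE-lat] corners here

limit = int(raw_input("How many tweets per city? "))

for name, area in [("London", LONDON), ("Exeter", EXETER)]:
    bag = getBagOfTweets("username", "password", area, limit, "en")
    spelling = Spelling(bag)
    rate = float(spelling.spellingMistakes) / spelling.totalWords
    print "%s: %d mistakes / %d words = %.3f" % (name, spelling.spellingMistakes, spelling.totalWords, rate)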