This is a simple Python program that streams tweets from two locations, London and Exeter in our example, and compares which one has the higher rate of spelling mistakes.
1 – Set-up used:
*Ubuntu 11.04 Natty AMD64
*Python 2.7.3
*python re library
*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tasks)
*python tweetstream 1.1.1 library (For Twitter manipulation)
*python enchant 1.6 library (For spelling verification)
Installation Instructions:
- Python and the Python packaging tools (pip, setuptools): from a command prompt run:
sudo apt-get install python python-pip python-setuptools
- NLTK, NumPy, PyYAML libraries: from a command prompt run:
sudo pip install -U numpy
sudo pip install -U pyyaml nltk
Test installation: run python then type import nltk
*tweetstream 1.1.1
Download at:
http://pypi.python.org/pypi/tweetstream/
decompress the file, enter the directory and run:
sudo python setup.py install
*Python Enchant:
Download at:
http://packages.python.org/pyenchant/
decompress the file, enter the directory and run:
sudo python setup.py install
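To verify the whole set-up, a quick check from the Python prompt (this only confirms that the libraries import; en_GB is assumed here as the dictionary tag):
import re, nltk, enchant, tweetstream
print enchant.dict_exists("en_GB")   # True if an English dictionary is available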
Assumptions:
Geospatial:
We use the ‘locations’ parameter of the Twitter streaming API to determine the location of a tweet. This choice centralizes the dependency on external services into a single API, reducing possible problems. An area is specified by a bounding box defined by two points:
(i) a south-west corner: given as a lon-lat pair (NOTE: the order of the coordinates is the inverse of Google Maps)
(ii) a north-east corner: given as another lon-lat pair (NOTE: the order of the coordinates is the inverse of Google Maps)
The resulting area encoding is a list containing the 4 values, e.g. [SW-lon, SW-lat, NE-lon, NE-lat] (see the example after the corner list below).
London’s south-west corner has been set between Feltham and Croydon; lon = -0.32, lat = 51.4
London’s north-east corner has been set between Enfield and Hornchurch; lon = 0.132, lat = 51.65
Exeter’s south-west corner has been set between Ide and Alphington
Exeter’s north-east corner has been set between Stoke Hill and Whipton
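As an example, the London box from the corners above would be encoded as follows (Exeter’s box is built the same way from its two corners; its exact coordinates are omitted here):
# Bounding box in the order expected by the streaming API: [SW-lon, SW-lat, NE-lon, NE-lat]
LONDON = ["-0.32", "51.4", "0.132", "51.65"]   # given as strings, on the assumption that they end up in a comma-joined query parameter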
Linguistic Assumptions:
The Twitter API gives plenty of information about the user, one piece being the user’s language. London hosts many more language communities than Exeter, for example. I found a strong correlation between users whose language parameter is not English and the production of tweets in a language other than English (which should not count towards English misspellings). As a result, the process of gathering tweets is slower, but the data is cleaner.
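A minimal sketch of this filter, assuming tweetstream’s FilterStream interface and the standard 'user'/'lang' fields of a tweet dictionary (the function name is illustrative; the real code lives in tweetwords2.py):
import tweetstream

def stream_english_tweets(username, password, area, limit):
    # Keep only tweets whose author declares English as their language.
    texts = []
    with tweetstream.FilterStream(username, password, locations=area) as stream:
        for tweet in stream:
            if tweet.get("user", {}).get("lang") == "en" and "text" in tweet:
                texts.append(tweet["text"])
            if len(texts) >= limit:
                break
    return texts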
Analysis assumptions:
In addition to filtering by the language parameter, our spelling analysis is done in two stages.
Stage one: words starting with ‘@’ or ‘#’ are replaced by white space (justified by the requirements of the exercise), as are sequences of punctuation symbols (justified by the fact that the tokenizer used in the next stage preserves trailing punctuation, e.g. “New York.” is parsed as “New” and “York.”, but is much better at tokenizing contractions, e.g. “can’t” becomes “can” and “ ’t ”).
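A sketch of the stage-one clean-up with the re library (the exact patterns in tweetwords2.py may differ; apostrophes are kept so contractions survive for stage two):
import re

def clean_tweet(text):
    text = re.sub(r"[@#]\S+", " ", text)      # drop @mentions and #hashtags
    text = re.sub(r"[^\w\s']+", " ", text)    # replace runs of punctuation with spaces, keep apostrophes
    return text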
Stage two: we tokenize the tweets using the nltk.PunktWordTokenizer() method, which handles contracted forms much better. The forms themselves (e.g. ’d, ’s, ’m) are not words according to the Enchant dictionary, but they can be added to an instance of the dictionary, so they count as words.
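Sketched with NLTK and pyenchant, assuming a British English dictionary (en_GB) and an illustrative list of contracted forms:
import nltk, enchant

tokenizer = nltk.PunktWordTokenizer()
dictionary = enchant.Dict("en_GB")
for form in ["'d", "'s", "'m", "'t", "'re", "'ll", "'ve"]:
    dictionary.add_to_session(form)           # treat contracted forms as words for this dictionary instance

print tokenizer.tokenize("I can't believe it's snowing")
# roughly: ['I', 'can', "'t", 'believe', 'it', "'s", 'snowing']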
It is important to observe that we use a rate (number of spelling mistakes / total number of words) as the benchmark; this normalises the data, since a larger number of words would inevitably contain more mistakes.
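The benchmark is then simply the ratio, e.g.:
rate = float(mistakes) / total_words    # spelling mistakes per word, comparable across cities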
Running:
The problem was broken down into two subtasks, (i) streaming data and providing a bag of tweets, and (ii) checking for spelling mistakes and keeping track of them:
First, we stream the data using the Twitter API, clean it up a bit, tokenize it, and return a bag of tweet words: a list of lists representing the tokens inside each tweet string. This is done by the getBagOfTweets function, which takes a username, password, area, limit, and language as arguments and returns a list containing lists of tokens. This procedure is kept in a separate file (tweetwords2.py) for modularisation purposes.
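A hypothetical call, using the London box from the geospatial section (the credentials are placeholders):
from tweetwords2 import getBagOfTweets

london_bag = getBagOfTweets("username", "password", ["-0.32", "51.4", "0.132", "51.65"], 100, "en")
# london_bag is a list of token lists, one per tweet, e.g. [['lovely', 'day', 'out'], ...]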
Second, I defined a Spelling class whose objects contain the number of correct words, spelling mistakes, total number of words checked, and so on. The constructor of this object requires a list containing lists of tokens, so that a Spelling object can be instantiated. For modularisation purposes the class was kept in a separate file (wordcheck2.py). The checkWordsLists method takes the lists representing the tokens in each tweet and checks whether each token is in the dictionary (returning a boolean True or False). If True, we increment the number of correct words; if not, we increment the number of spelling mistakes. In either case, we increment the total number of words.
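A sketch of what wordcheck2.py looks like under these assumptions (the attribute names are illustrative):
import enchant

class Spelling(object):
    def __init__(self, bagOfTweets):
        self.dictionary = enchant.Dict("en_GB")
        for form in ["'d", "'s", "'m", "'t"]:          # contracted forms counted as words
            self.dictionary.add_to_session(form)
        self.correctWords = 0
        self.spellingMistakes = 0
        self.totalWords = 0
        self.checkWordsLists(bagOfTweets)

    def checkWordsLists(self, bagOfTweets):
        for tokens in bagOfTweets:                     # one token list per tweet
            for word in tokens:
                if self.dictionary.check(word):
                    self.correctWords += 1
                else:
                    self.spellingMistakes += 1
                self.totalWords += 1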
Finally, I defined a main file (tweetwards2.py) that loads the two Python files (and the libraries they call). The main asks for the number of tweets the user would like to stream; this number is used as the limit for the data stream. We call the getBagOfTweets function, which returns the lists of tokens in a list, and then use this list to calculate the spelling mistakes. Finally, we divide the number of mistakes by the total number of words, which gives us a rate: the higher the rate, the more mistakes.
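Putting it together, the main script roughly does the following (the module names follow the text; the Exeter box, the credentials and the attribute names are placeholders from the sketches above):
from tweetwords2 import getBagOfTweets
from wordcheck2 import Spelling

LONDON = ["-0.32", "51.4", "0.132", "51.65"]   # from the geospatial section
EXETER = []                                    # fill in Exeter's [SW-lon, SW-lat, NE-lon, NE-lat] corners here

limit = int(raw_input("How many tweets per city? "))

for name, area in [("London", LONDON), ("Exeter", EXETER)]:
    bag = getBagOfTweets("username", "password", area, limit, "en")
    spelling = Spelling(bag)
    rate = float(spelling.spellingMistakes) / spelling.totalWords
    print "%s: %d mistakes / %d words = %.3f" % (name, spelling.spellingMistakes, spelling.totalWords, rate)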