Last active
August 29, 2015 13:57
import os, json
from nltk import FreqDist, word_tokenize
from nltk.stem import WordNetLemmatizer
from scipy.stats import scoreatpercentile

#read the stopword file (one word per line)
#do stopwords have to be in the same file as the json files?
READ = 'r'
stopwords = set(line.strip() for line in open('../data/stopwords', READ))

#lemmatizer
lmtzr = WordNetLemmatizer()

#get the names of the JSON files in a list
json_dir = r'C:\Users\Carrie0731\Desktop\JSON FILES'  #probably should modify this so everyone can open it on his/her computer
json_list = [os.path.join(json_dir, f) for f in os.listdir(json_dir)]

#lemmatize each word, dropping stopwords
sanitize = lambda word: lmtzr.lemmatize(word) if word not in stopwords else ''
tokens = [sanitize(word) for filename in json_list
                         for tweet in json.load(open(filename, READ))
                         for word in word_tokenize(tweet['text'])]

freq = FreqDist(token for token in tokens if token)
cutoff = scoreatpercentile(list(freq.values()), 15)
vocab = [word for word, f in freq.items() if f > cutoff]  #items() returns the tuple (key, value), in this case (word, frequency)
As to the comment about how much of the FreqDist to use: on one hand, any truncation of the FreqDist decreases the accuracy of P(token = t). On the other hand, our estimates of the frequency of occurrence of rare words are inaccurate, so it's not clear that including them helps us better estimate P(token = t).
My inclination is to use a cutoff established from the empirical cumulative distribution function (reference: http://en.wikipedia.org/wiki/Empirical_distribution_function). In Python, you don't have to estimate the full function. If you use the standard 85% cutoff, use the SciPy function
scoreatpercentile
to find the word frequency at the 15th percentile, then include all words with frequencies greater than that. A benefit of using the empirical cumulative distribution function is that we make no assumption about the distribution of word frequencies. Btw, the 85th-percentile rule arises because, if we assume the frequencies are normally distributed, the 85th percentile of the CDF corresponds to a signal-to-noise ratio of 5-to-1, which is itself an old rule of thumb in electrical engineering.
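To make the percentile-cutoff idea concrete, here is a minimal sketch using made-up token counts (the words and numbers are hypothetical, standing in for `freq.values()` from the script above): `scoreatpercentile` gives the frequency at the 15th percentile of the empirical distribution, and keeping everything strictly above it retains roughly the top 85% of the ECDF.

```python
from collections import Counter
from scipy.stats import scoreatpercentile

# Hypothetical token counts standing in for the FreqDist in the script
counts = Counter({'the': 50, 'cat': 12, 'sat': 9, 'mat': 7,
                  'hat': 3, 'bat': 2, 'rat': 1, 'gnat': 1})

# Frequency at the 15th percentile of the empirical distribution
cutoff = scoreatpercentile(list(counts.values()), 15)

# Keep only words whose frequency exceeds the cutoff;
# here the two singleton words 'rat' and 'gnat' are dropped
vocab = [word for word, f in counts.items() if f > cutoff]
```

Note that no distributional assumption enters anywhere: the cutoff is read directly off the sorted counts, which is exactly the appeal of the ECDF approach.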