
@mac389
Last active August 29, 2015 13:57
import nltk, os, json, csv
from nltk import FreqDist
from nltk.stem.wordnet import WordNetLemmatizer
from scipy.stats import scoreatpercentile
#read the stopword file into a set (faster membership tests) and strip the trailing newlines
#do stopwords have to be in the same file as the json files?
READ = 'rb'
stopwords = set(word.strip() for word in open('../data/stopwords',READ).readlines())
#lemmatizer
lmtzr = WordNetLemmatizer()
#get the names of the files in a list
json_dir = r'C:\Users\Carrie0731\Desktop\JSON FILES' #probably should modify this so everyone can open it on his/her computer
json_list = [os.path.join(json_dir,filename) for filename in os.listdir(json_dir)]
#lemmatize each word unless it is a stopword, in which case drop it
sanitize = lambda word: lmtzr.lemmatize(word) if word not in stopwords else ''
tokens = [sanitize(word) for filename in json_list
                         for tweet in json.load(open(filename,READ))
                         for word in tweet['text'].split()]
freq = FreqDist(token for token in tokens if token) #count words, skipping the empty strings left by sanitize
cutoff = scoreatpercentile(list(freq.values()),15)
vocab = [word for word,f in freq.items() if f > cutoff] #items() returns (key,value) tuples, here (word, frequency)

mac389 commented Mar 15, 2014

In Python, testing membership in a set is faster than testing membership in a list. (Reference: http://stackoverflow.com/questions/7110276/faster-membership-testing-in-python-than-set)

You may have to remove newlines from the stopwords file after reading it.
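
A minimal sketch of both points, assuming the stopwords file at ../data/stopwords has one word per line (the example word is arbitrary):

# Read the stopwords into a set: strip() drops the trailing newlines,
# and membership tests on a set are O(1) on average versus O(n) on a list.
with open('../data/stopwords') as f:
    stopwords = set(line.strip() for line in f)

is_stop = 'the' in stopwords   # fast membership test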


mac389 commented Mar 15, 2014

As to the comment about how much of the FreqDist to use: on one hand, any truncation of the FreqDist decreases the accuracy of P(token=t). On the other hand, our estimates of the frequency of occurrence of rare words are inaccurate, so it's not clear that including them helps us better estimate P(token=t).

My inclination is to use a cutoff established from the empirical cumulative distribution function (Reference: http://en.wikipedia.org/wiki/Empirical_distribution_function). In Python, you don't have to estimate the full function: if you use the standard 85% cutoff, use the SciPy function scoreatpercentile to find the word frequency at the 15th percentile and include all words with frequencies greater than that.
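
A minimal, self-contained sketch of that recipe (the counts below are made up just to show the mechanics; in the gist, freq plays the role of counts):

from scipy.stats import scoreatpercentile

# Hypothetical word counts standing in for the FreqDist built from the tweets.
counts = {'patient': 40, 'pain': 25, 'doctor': 12, 'rx': 3, 'typo': 1}

# Word frequency at the 15th percentile of the empirical distribution of counts.
cutoff = scoreatpercentile(list(counts.values()), 15)

# Keep every word above the cutoff -- roughly the top 85% of the distribution.
vocab = [word for word, f in counts.items() if f > cutoff]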

A benefit of using the empirical cumulative distribution function is that we make no assumption about the distribution of word frequencies. Btw, the 85th percentile figure arises because, if we assume the frequencies are normally distributed, the 85th percentile of the CDF corresponds to a signal-to-noise ratio of 5-to-1, which is itself an old rule of thumb in electrical engineering.
