@kachok
Created May 29, 2012 15:55
Pickling of words from Spanish tweets
import codecs
import pickle

# Input: one "<frequency> <word>" pair per line, UTF-8 encoded.
vocab_path = "/Users/dkachaev/repos/hltcoe/tweets-es/data/oov.vocab"
pickle_path = "/Users/dkachaev/repos/hltcoe/tweets-es/data/tweets_es_vocabulary.pickle"

vocab = {}
with codecs.open(vocab_path, "r", "utf-8") as lines:
    for line in lines:
        try:
            freq, word = line.strip().split(" ")
            # print(word, " - ", freq)
            # Context - "" <- need text of original tweet where word occurred, or 3 tweets ["tweet1", "tweet2", "tweet3"]
            vocab[word] = {"frequency": int(freq), "context": [""]}
        except ValueError:
            # Wrong number of fields, or the frequency is not an integer.
            print("skipping line")

# Pickle output is binary data, so open the file in "wb" mode.
with open(pickle_path, "wb") as f:
    pickle.dump(vocab, f)
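
The comment about "context" in the script leaves the list empty for now. One possible way to fill it is sketched below, assuming the original tweets are available as a list of strings and that a plain whitespace split is an acceptable tokenizer. The function name `attach_contexts`, the `max_contexts` parameter, and the sample data are all hypothetical and not part of the original script.

```python
def attach_contexts(vocab, tweets, max_contexts=3):
    """Fill each entry's "context" list with up to max_contexts tweets
    that contain the word.

    Assumption: tokens are obtained with a plain whitespace split.
    """
    # Drop the placeholder [""] lists before collecting real contexts.
    for entry in vocab.values():
        entry["context"] = []
    for tweet in tweets:
        # set() so a word repeated inside one tweet adds that tweet only once.
        for token in set(tweet.split()):
            entry = vocab.get(token)
            if entry is not None and len(entry["context"]) < max_contexts:
                entry["context"].append(tweet)
    return vocab


# Hypothetical sample data, not taken from the original corpus.
sample_vocab = {
    "jajaja": {"frequency": 2, "context": [""]},
    "xq": {"frequency": 1, "context": [""]},
}
sample_tweets = ["jajaja que bueno", "xq no vienes", "jajaja jajaja"]
attach_contexts(sample_vocab, sample_tweets)
```

Capping the list at three tweets follows the note in the script's comment. A real pipeline would probably need tokenization that matches however oov.vocab was produced.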