@iandioch
Created November 26, 2015 05:43
Grabs just the tweet text from the sentiment140 tweet corpus and discards the other fields
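# Loads the CSV corpus file linked from http://help.sentiment140.com/for-students and outputs just the tweet text.
with open("testdata.manual.2009.06.14.csv", "r") as infile:
    lines = infile.readlines()

with open("testdata.manual.2009.06.14.clean.txt", "w") as outfile:
    for line in lines:
        # The tweet text starts at the sixth field; rejoin on "," in case the tweet itself contains commas.
        bits = line.rstrip("\r\n").split(",")
        text = ",".join(bits[5:])
        # Drop the first and last character (the double quotes wrapping the field).
        text = text[1:-1]
        # Unescape HTML entities; "&amp;" goes last so "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")
        print(text)
        outfile.write(text + "\n")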
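Splitting on "," by hand only works while none of the first five fields contains a comma. A possible alternative, sketched here under the assumption that every row has six quoted columns with the tweet text last, is to let Python 3's standard `csv` module do the parsing and `html.unescape` handle the entities. The function name `extract_tweet_texts` and the sample rows are hypothetical and not part of the original gist; `html.unescape` also decodes entities beyond the three the script replaces.

```python
import csv
import html
import io


def extract_tweet_texts(csv_file):
    """Yield the unescaped sixth column (assumed to be the tweet text) of each row."""
    for row in csv.reader(csv_file):
        if len(row) < 6:
            continue  # skip blank or malformed rows
        yield html.unescape(row[5])


# Made-up rows in the assumed six-column layout (not real corpus data).
sample = io.StringIO(
    '"4","1","Mon May 11 03:17:40 UTC 2009","q","user_a","Tom &amp; Jerry, again"\n'
    '"0","2","Mon May 11 03:18:03 UTC 2009","q","user_b","1 &lt; 2"\n'
)
texts = list(extract_tweet_texts(sample))
for text in texts:
    print(text)
```

To run it against the real corpus, pass an open file object for the CSV in place of `sample` and write each yielded string to the output file.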