@neilkod
Created June 3, 2012 14:58
Strip entities (URLs, hashtags, usernames) from a tweet
Note: tweets are in JSON format, one per line, read from STDIN.
For each entity in the tweet's entities, grab the start and end positions. Because entities can appear in any order, put each (start, end) pair on a list. After collecting all of them, sort the list in descending order and cut each span out of the tweet text, working from the end of the string backwards so that the positions of the earlier entities stay valid.
I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.
#!/usr/bin/env python
import json
import sys


def strip_items(text, start_pos, end_pos):
    # Remove the characters in [start_pos, end_pos) from text.
    return text[:start_pos] + text[end_pos:]


for itm in sys.stdin:
    line = itm.strip()
    data = json.loads(line)
    txt = data['text']
    tostrip = []
    print('-' * 50)
    print('original tweet: %s' % txt)
    entities = data['entities']
    # entities maps each type (urls, hashtags, user_mentions, ...) to a list
    for k, v in entities.items():
        for ent in v:
            try:
                (start_pos, end_pos) = ent['indices']
                tostrip.append((start_pos, end_pos))
            except KeyError:
                # this entity has no indices; report it and move on
                print('error here')
    # cut from the end of the text backwards so earlier indices stay valid
    for x in sorted(tostrip, reverse=True):
        txt = strip_items(txt, *x)
    print('modified tweet: %s' % txt)
print('-' * 50)
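To see why the spans are cut in descending order, here is a minimal sketch that runs the same idea on a single hand-built tweet. The tweet dict and the helper names `strip_spans` and `strip_spans_forward` are made up for illustration and are not part of the script above. The dict only mimics the `text` and `entities` fields the script reads.

```python
def strip_spans(text, spans):
    """Remove each [start, end) span, working from right to left."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


def strip_spans_forward(text, spans):
    """Same cuts, left to right: offsets go stale after the first cut."""
    for start, end in sorted(spans):
        text = text[:start] + text[end:]
    return text


# Hypothetical tweet, hand-built for illustration (not real API output).
tweet = {
    "text": "hi @bob see #python",
    "entities": {
        "user_mentions": [{"screen_name": "bob", "indices": [3, 7]}],
        "hashtags": [{"text": "python", "indices": [12, 19]}],
    },
}

# Collect every (start, end) pair, whatever the entity type.
spans = [
    tuple(ent["indices"])
    for ents in tweet["entities"].values()
    for ent in ents
    if "indices" in ent
]

print(repr(strip_spans(tweet["text"], spans)))          # 'hi  see '
print(repr(strip_spans_forward(tweet["text"], spans)))  # 'hi  see #pyt'
```

Cutting left to right shortens the string by four characters at the first step, so the hashtag's offsets no longer line up and part of it survives. Cutting right to left never touches text in front of the spans that are still pending.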
##### sample output
--------------------------------------------------
original tweet: RT @Aepul_Drama: RT @azlanR: #NowListening; Drama Band - Cerita Dia
modified tweet: RT : RT : ; Drama Band - Cerita Dia
--------------------------------------------------
original tweet: hooolis, me duele el ojo
modified tweet: hooolis, me duele el ojo
--------------------------------------------------
original tweet: gonna listen to some Adele and fall asleep now :) http://t.co/Ie1JfC8e
modified tweet: gonna listen to some Adele and fall asleep now :)
--------------------------------------------------
original tweet: Diseño de cuartos de baño Interiorismoonlinenet Vic: http://t.co/ZjOsbRie mailto:ventas@interiorismoonline... http://t.co/tSC6epL5
modified tweet: Diseño de cuartos de baño Interiorismoonlinenet Vic: mailto:ventas@interiorismoonline...
--------------------------------------------------
original tweet: Sans contenir, au contraire des cinq autres, le moindre signe ou allusion religieuse, Rocky 4 est de loin le plus mystique. #PolitiqueEtFoi
modified tweet: Sans contenir, au contraire des cinq autres, le moindre signe ou allusion religieuse, Rocky 4 est de loin le plus mystique.
--------------------------------------------------
original tweet: “@__LickMyChucks All About Dem COWBOYS”
modified tweet: “ All About Dem COWBOYS”
--------------------------------------------------
original tweet: #imsickof people wanting people to be 'real' but not being able to handle the truth !
modified tweet: people wanting people to be 'real' but not being able to handle the truth !