@edsu
Created January 18, 2013 18:19
Get unshortened URLs out of large batches of Twitter JSON data.
#!/usr/bin/env python
"""
Feed this program line-oriented JSON tweet data (as received from the API)
on STDIN and get unshortened URLs mentioned in the tweets on STDOUT.
This module looks up multiple URLs at once using the multiprocessing
library. Change CONCURRENCY to use more or fewer processes; it defaults to 10.
"""

import json
import fileinput
import multiprocessing

import requests

CONCURRENCY = 10

# Cache of URLs already resolved. Note that under multiprocessing each
# worker process gets its own copy, so this only avoids repeat lookups
# within a single worker, not across the whole pool.
seen = {}


def unshorten(url):
    if url in seen:
        return seen[url]
    new_url = url
    try:
        # requests follows redirects by default; r.url is the final URL
        r = requests.get(url)
        if r.status_code == 200:
            new_url = r.url
    except requests.RequestException:
        pass  # oh well
    seen[url] = new_url
    return new_url


def urls():
    for line in fileinput.input():
        tweet = json.loads(line)
        for url in tweet["entities"]["urls"]:
            yield url["expanded_url"]


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=CONCURRENCY)
    for url in pool.imap_unordered(unshorten, urls()):
        print(url)
    pool.close()
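For reference, a minimal sketch of the input shape the script expects: one JSON object per line, with `entities.urls[*].expanded_url` present. The tweet below is made up for illustration; it only shows what the `urls()` generator pulls out of each line, with no network access involved.

```python
import json

# A made-up, minimal tweet in the shape the script reads from STDIN.
line = json.dumps({
    "id": 1,
    "text": "example tweet",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc", "expanded_url": "http://bit.ly/xyz"}
        ]
    }
})

# The same extraction the urls() generator performs per line:
tweet = json.loads(line)
extracted = [u["expanded_url"] for u in tweet["entities"]["urls"]]
print(extracted)  # → ['http://bit.ly/xyz']
```

Each `expanded_url` is then handed to `unshorten()`, which follows any remaining redirect chain (t.co expansion often still points at a bit.ly-style shortener) to reach the final URL.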