@fragglet
Created September 4, 2018 02:50
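"""Extracts media URLs from a tweet archive."""
import glob
import json

for filename in glob.glob("tweets/*.js"):
    with open(filename, "r", encoding="utf-8") as f:
        data = f.read()
    # Each file is a JavaScript assignment ("<variable> = <JSON>");
    # drop everything up to the first "=" so only the JSON remains.
    _, data = data.split("=", 1)
    tweets = json.loads(data)
    for tweet in tweets:
        medias = tweet.get("entities", {}).get("media", [])
        for m in medias:
            url = m.get("media_url_https", "")
            # For images hosted on pbs.twimg.com, the ":orig" suffix
            # requests the original-size file.
            if "pbs.twimg.com" in url:
                url += ":orig"
            if url:
                print(url)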