@duner
Last active April 28, 2022 19:48
Twitter Archive to JSON

If you download your personal Twitter archive, you don't quite get the data as JSON, but as a series of .js files, one for each month. (These are meant to replicate the Twitter API responses for the front-end part of the downloadable archive.)

The data in those files is far richer than the CSV data. To use it for some analysis or an app, just run this script.

Run sh ./twitter-archive-to-json.sh in the directory that contains the /tweets folder from the archive download (the last step needs jq), and you'll get two files:

  • tweets.json: a JSON list of all the Tweet objects
  • tweets_dict.json: a JSON dictionary where each Tweet's key is its id_str

You'll also get a /json-tweets directory which has the individual JSON files for each month of tweets.
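
To illustrate why the monthly .js files are not quite JSON, here is a minimal sketch of the header-stripping step on a single file. It assumes the first line of each file is a JavaScript assignment and that everything after it is a JSON array; the sample content and the helper name strip_js_header are made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn one monthly archive file into plain JSON by dropping its first line.
# Assumption: line 1 is a JavaScript assignment and the rest is a JSON array.

# strip_js_header is a hypothetical helper name, not part of the script below.
strip_js_header() {
  # tail -n +2 prints the file starting at its second line.
  tail -n +2 "$1"
}

# Demo on a throwaway sample file (made-up data, assumed header format).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Grailbird.data.tweets_2015_01 =
[ {
  "id_str" : "1",
  "text" : "hello"
} ]
EOF

strip_js_header "$sample"   # prints the JSON array without the assignment line
rm -f "$sample"
```

If a file's first line also carried the start of the array, this approach would cut real data, so it is worth checking one file by eye first.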

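#!/usr/bin/env bash
# Convert the monthly tweets/*.js files of a Twitter archive into JSON.
# Requires jq for the final step.

# -p: do not fail if the directories are left over from an earlier run.
mkdir -p json-tweets .tmp-json-tweets

# Start both output files from scratch.
: > tweets.json
: > .tmp-tweets.json

echo "Processing Tweet.js files..."
for f in tweets/*.js; do
  # Drop the first line (the JavaScript assignment); the rest is a JSON array.
  # "${f%.js}" turns tweets/NAME.js into tweets/NAME, so this writes json-tweets/NAME.json.
  tail -n +2 "$f" > json-"${f%.js}".json
done

echo "Creating tweets.json..."
echo "[ {" >> .tmp-tweets.json
for f in json-tweets/*.json; do
  # Strip the first and last lines (the array's opening "[ {" and closing "} ]"),
  # then append a "}, {" separator so the months can be concatenated.
  tail -n +2 "$f" | sed '$d' > .tmp-"$f"
  echo "}, {" >> .tmp-"$f"
  cat .tmp-"$f" >> .tmp-tweets.json
  rm .tmp-"$f"
done
rmdir .tmp-json-tweets

# Replace the trailing "}, {" separator with the closing "} ]".
sed '$d' .tmp-tweets.json > tweets.json
echo "} ]" >> tweets.json
rm .tmp-tweets.json

# Build a dictionary keyed by each Tweet's id_str.
jq 'map({"key": .id_str | tostring, "value": .}) | from_entries' tweets.json > tweets_dict.json
echo "DONE"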
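
The bracket stitching in the second loop can also be expressed with jq alone. The following is a minimal sketch of that alternative, not the method used above. It assumes every file in json-tweets/ already holds a valid JSON array of Tweet objects; merge_months and index_by_id are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch of a jq-only merge, assuming each input file is a JSON array of Tweets.

# merge_months FILE... : print one JSON array containing the Tweets of all files.
merge_months() {
  # -s (--slurp) reads all inputs into a single array of arrays;
  # `add` then concatenates those arrays into one.
  jq -s 'add' "$@"
}

# index_by_id FILE : print a dictionary keyed by each Tweet's id_str.
index_by_id() {
  jq 'map({key: .id_str, value: .}) | from_entries' "$1"
}

# Possible usage (not executed here):
#   merge_months json-tweets/*.json > tweets.json
#   index_by_id tweets.json > tweets_dict.json
```

The advantage of this form is that it does not depend on where the "[ {" and "} ]" lines fall in each file, only on each file being valid JSON.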

almereyda commented Mar 15, 2021

Run from within the archive's data directory, a batch job that creates JSON copies of the archive's .js files could look like:

rsync -I --backup --suffix='.json' --backup-dir='json' --exclude='manifest.js' ./*.js ./
sed -i -r 's/^window.*\ \=\ (.*)$/\1/' json/*

You can then dig into your data at will:

jq '.[] | .tweet | select(.entities.urls != []) | .entities | .urls | map(.expanded_url) | .[]' tweet.js.json | cut -d'/' -f3 | sed 's/\"//g' | sort | uniq -c | sort -g

Please note that this updates the modification times of the *.js files from the ones provided by the archive to the moment the command is run. This is due to the -I (--ignore-times) switch, which makes rsync copy every file over itself.
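
To see what the sed expression above does in isolation, here is a small sketch that applies the same substitution to lines on standard input. The sample input, including the variable name on its first line, is an assumption made for illustration, and strip_window_prefix is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: remove a leading "window... = " assignment from each line on stdin,
# keeping whatever follows it. Lines without that prefix pass through unchanged.

# strip_window_prefix is a hypothetical helper name used only for this illustration.
strip_window_prefix() {
  # -E is the portable spelling of the -r (extended regex) switch used above.
  sed -E 's/^window.* = (.*)$/\1/'
}

# Made-up sample input; the assignment on the first line is an assumed format.
printf 'window.YTD.tweet.part0 = [\n  {\n    "tweet" : { "id_str" : "1" }\n  }\n]\n' \
  | strip_window_prefix
```

Unlike the in-place sed -i call, this variant leaves the source files untouched and writes the result to standard output.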

Adapted from https://unix.stackexchange.com/a/527037/79223
