download twitter data and reduce size
1. log in to twitter and visit https://twitter.com/settings/download_your_data
click the "Request archive" button, enter your password and let twitter prepare the data for you
2. wait a few days
3. visit https://twitter.com/settings/your_twitter_data/data
right-click the "Download archive" button and click "Copy Link" to get the url: https://ton.twitter.com/i/ton/data/archives/<uid>/twitter-2021-04-14-.....zip
(or: with "Firefox Developer Tools" open, paste the https://ton.twitter.com/... url into a new tab, press Enter to start the download and then cancel it; the "Developer Tools/Network" panel will then show the cancelled request's url)
click "Download archive" on the download page and keep this download running (so the download session stays alive),
then in firefox use "Web Developer/Network/right-click the request/Copy/Copy as cURL" to get a curl command line (with the cookie/session id etc.),
and append " -LOC -" at its end (-L follow redirects, -O keep the remote file name, -C - resume if the transfer breaks)
4. run the curl command on your vps (google colab, github codespace or huggingface docker jupyterlab) to get the twitter-...zip file; its size may be over 3GB (about 10GB for 30000 tweets), see the example below
now we can cancel the still-running download on the download page (the vps has taken over the same download session)
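a rough sketch of what the finished command looks like; the url comes from step 3 and the header values are placeholders, the real "Copy as cURL" output will contain many more headers:
curl 'https://ton.twitter.com/i/ton/data/archives/<uid>/twitter-2021-04-14-.....zip' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0' \
-H 'Cookie: ...' \
-LOC -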
5. unzip twitter-*.zip -d tweets >/dev/null 2>&1 //unzipping the 10GB zip file needs roughly another 10GB (10GB+10GB=20GB) and may fill up the VPS disk, so use google colab, github codespace or huggingface docker jupyterlab
(github codespaces: "/workspaces" has about 18GB free and "/tmp" about 112GB, so put the zip in the first and unzip into the second, see the sketch below)
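a minimal sketch of step 5 on a github codespace; the mount points and free-space figures above are assumptions, so check them with df first:
df -h /workspaces /tmp                                           # confirm where the free space actually is
mkdir -p /tmp/tweets
unzip /workspaces/twitter-*.zip -d /tmp/tweets >/dev/null 2>&1   # unzip from the small disk into the big one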
#5. sudo apt install archivemount //"fuse-zip" may show the error "unable to open ZIP file: bad file name (two slashes): assets//", so use "archivemount" instead
#6. mkdir tweets ; archivemount twitter-*.zip ./tweets //"archivemount" also fails; it seems the twitter zip file is badly formatted
7. rm -rf tweets/data/tweets_media ; rm -rf tweets/data/deleted_tweets_media //previously: rm -rf tweets/data/tweet_media
8. rm tweets/data/ad-*.js
9. rm -rf tweets/assets/images/twemoji
10. cd tweets/ && zip -rq $OLDPWD/tweets.zip . && cd - //(umount ./tweets instead if archivemount was used)
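a quick size check after repacking (a sketch; assumes steps 7-10 were run from the directory that now contains tweets.zip):
du -sh tweets          # the pruned directory tree
ls -lh tweets.zip      # the repacked archive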
11. #scp from the VPS to download tweets.zip (example after this list); from the huggingface docker jupyterlab it can be downloaded via the jupyterlab left sidebar
12. cp tweets.zip /bak_dir/tweets.2024-01.zip; mv tweets tweets.bak ; mkdir tweets ; unzip tweets.zip -d tweets
the final tweets.zip file may be less than 15MB
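a hedged example for step 11; the user name, host and remote path are placeholders for your own VPS:
scp user@your-vps:/path/to/tweets.zip ./tweets.zip   # replace user, your-vps and the path with your own values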
================================================================
# Request archive
curl 'https://twitter.com/i/api/1.1/account/user_twitter_data' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0' \
-H 'Accept: */*' \
-H 'Accept-Language: en-US,en;q=0.5' \
-H 'Referer: https://twitter.com/settings/download_your_data' \
-H 'content-type: application/json' \
-H 'x-twitter-auth-type: OAuth2Session' \
-H 'x-twitter-client-language: en' \
-H 'x-twitter-active-user: yes' \
-H 'x-csrf-token: ...' \
-H 'Origin: https://twitter.com' \
-H 'authorization: Bearer ...' \
-H 'Connection: keep-alive' \
-H 'Cookie: ...' \
--data-raw '' \
--compressed
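note: the x-csrf-token, authorization Bearer and Cookie values are elided above; they must come from your own logged-in browser session, e.g. via the same "Copy as cURL" trick described in step 3.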