@wuyongzheng
Created November 12, 2017 14:04
Twitter user timeline crawler without authentication
$ bash crawl.sh
pos 0
pos 923679343724838912
pos 916681099698364416
pos 913192547035488257
...
pos 695457113452126208
pos 695457113452126208
$ ls
crawl-0.json
crawl-753078410369437696.json
crawl-793631421369757697.json
...
#!/bin/bash
# Change this to the screen name of the account to crawl.
user=SMRT_Singapore
while true ; do
    if [ -f crawl-0.json ] ; then
        # Resume from the smallest min_position found in the pages saved so far.
        pos=$(cat crawl-*.json | tr ',' '\n' | grep 'min_position.:.[0-9]' | sed -e 's/.*:"//' -e 's/".*//' | sort -n | head -n 1)
        url="https://twitter.com/i/profiles/show/$user/timeline/tweets?include_available_features=1&include_entities=1&max_position=$pos&reset_error_state=false"
    else
        # First request: no max_position; the page is saved as crawl-0.json.
        pos=0
        url="https://twitter.com/i/profiles/show/$user/timeline/tweets?include_available_features=1&include_entities=1&reset_error_state=false"
    fi
    echo "pos $pos"
    # Stop once the position repeats, i.e. no older page was returned.
    if [ -f "crawl-$pos.json" ] ; then break ; fi
    wget -q -O "crawl-$pos.json" "$url"
done
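
The position extraction in the script can be tried on its own, without any network access. The sketch below wraps the same tr/grep/sed/sort/head pipeline in a function named extract_min_position (a hypothetical name, not part of the original script) and feeds it two made-up JSON lines. The JSON shape is an assumption inferred from the grep pattern, and the "placeholder" field is invented purely for illustration; only the two position values are taken from the sample run above.

```shell
#!/bin/bash
# Hypothetical helper: print the smallest min_position value read from stdin.
# It uses the same pipeline as the crawler script.
extract_min_position() {
    tr ',' '\n' \
        | grep 'min_position.:.[0-9]' \
        | sed -e 's/.*:"//' -e 's/".*//' \
        | sort -n \
        | head -n 1
}

# Made-up sample input (assumed shape, not real Twitter output).
sample='{"min_position":"923679343724838912","placeholder":true}
{"min_position":"916681099698364416","placeholder":true}'

printf '%s\n' "$sample" | extract_min_position
# prints 916681099698364416
```

Because the smallest value across all saved files is chosen each time, the crawl walks backwards through the timeline, and it can be interrupted and restarted without losing its place.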