Skip to content

Instantly share code, notes, and snippets.

@aaronhoffman
Last active June 23, 2021 12:33
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save aaronhoffman/1f753c660d7364bb594a36af350b227c to your computer and use it in GitHub Desktop.
Save aaronhoffman/1f753c660d7364bb594a36af350b227c to your computer and use it in GitHub Desktop.
hacker news firebase download
#!/bin/bash
# generate file containing all URIs to execute to retrieve data from hacker news firebase API
# api docs: https://github.com/HackerNews/API
echo generating file hn-uri.txt
URICOUNT=10000000
echo file will contain $URICOUNT lines
# get current maxid from https://hacker-news.firebaseio.com/v0/maxitem.json
MAXID=12739717
echo max id $MAXID
let "MINID=$MAXID-$URICOUNT"
echo min id $MINID
# for 10MM rows, this took 10 minutes on my machine and produced a 560MB file
for ((x=$MINID; x<$MAXID; x++))
do
# append a uri for the current id to the file
echo "https://hacker-news.firebaseio.com/v0/item/$x.json" >> hn-uri.txt
if ! ((x % 10000)); then
echo current id $x
fi
done
# note: on my macbook, for 10MM uris, this took just over 30 hours to complete
cat hn-uri.txt | xargs -P 100 -n 100 wget --quiet
@aaronhoffman
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment