@gousiosg
Last active November 8, 2023 05:20
Restoring the GHTorrent MongoDB database

This is a collection of scripts to restore a full GHTorrent MongoDB database from the dumps available at http://ghtorrent-downloads.ewi.tudelft.nl.

To do the restore:

  1. Open a MongoDB terminal and run the createCollections.js script to create the necessary collections. You can set block_compressor to either snappy or zlib to compress your databases. I use none here because I compress at the filesystem level.

  2. Run restore-cummulative-dumps.sh to restore the cumulative dumps. Expect this to take 3-4 days.

  3. Run restore-daily-dumps.sh to restore all daily dumps. To restore a single daily dump, run restore-daily-dump.sh with a date argument (yyyy-mm-dd).
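
If you would rather compress at the MongoDB level, the collection-creation statements can be generated instead of edited by hand. The following is a minimal sketch, assuming bash; the function name `gen_create_collections` is hypothetical and not part of this gist, while the collection names and the statement format are copied from createCollections.js.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original scripts): print one
# db.createCollection() statement per GHTorrent collection, using the
# WiredTiger block compressor given as the first argument.
gen_create_collections() {
  local compressor="${1:-none}"   # none, snappy or zlib; defaults to none
  local col
  for col in commit_comments commits events followers forks geo_cache \
             issue_comments issue_events issues org_members \
             pull_request_comments pull_requests repo_collaborators \
             repo_labels repos topics users watchers; do
    printf "db.createCollection(\"%s\", {storageEngine:{wiredTiger:{configString:'block_compressor=%s'}}} );\n" \
      "$col" "$compressor"
  done
}
```

Redirecting the output, for example `gen_create_collections snappy > createCollections.js`, should yield a file in the same format as the one below, with only the compressor changed.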

db.createCollection("commit_comments", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("commits", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("events", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("followers", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("forks", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("geo_cache", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("issue_comments", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("issue_events", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("issues", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("org_members", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("pull_request_comments", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("pull_requests", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("repo_collaborators", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("repo_labels", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("repos", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("topics", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("users", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
db.createCollection("watchers", {storageEngine:{wiredTiger:{configString:'block_compressor=none'}}} );
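#!/usr/bin/env bash
#
# Restore cumulative GHTorrent MongoDB dumps (per collection)
#
# (c) 2018 Georgios Gousios <gousiosg@gmail.com>
#
# List the dump files from the directory index, then stream each archive
# straight into mongorestore without storing it on disk.
curl -s http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/ |
cut -f2 -d'"' |
grep 'gz$' |
while read -r dump; do
  # The collection name is the part of the file name before the first '-'
  col=$(echo "$dump" | cut -f1 -d'-')
  echo "$dump"
  curl -s "http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/$dump" |
  tar -xzvO -f - "dump/github/$col.bson" |
  mongorestore -d github -c "$col" - 2>&1
done | tee large-dumps.log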
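
The loop above relies on each dump file name starting with the collection name followed by a '-'. A small sketch of that step in isolation, assuming bash; the function name `collection_of` and the sample file name are hypothetical and only illustrate the naming pattern.

```shell
#!/usr/bin/env bash
# Derive the collection name from a cumulative dump file name: everything
# before the first '-' (same result as: echo "$dump" | cut -f1 -d'-').
collection_of() {
  printf '%s\n' "${1%%-*}"
}

# Hypothetical file name, used only to show the split
collection_of commit_comments-dump.tar.gz   # prints commit_comments
```

A collection name that itself contained a '-' would be cut short by this rule; none of the collections created by createCollections.js do.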
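#!/usr/bin/env bash
#
# Restore a daily GHTorrent MongoDB backup
#
# (c) 2018 Georgios Gousios <gousiosg@gmail.com>
#
if [ -z "$1" ]; then
  echo "usage: $0 <yyyy-mm-dd>"
  exit 1
fi
# Reject arguments that date cannot parse (--date is a GNU date option)
if ! date +'%Y-%m-%d' --date="$1"; then
  exit 1
fi
# Download and unpack the dump into a scratch directory named after the date
mkdir -p "$1"
curl -s "http://ghtorrent-downloads.ewi.tudelft.nl/mongo-daily/mongo-dump-$1.tar.gz" |
tar -xzv -f - -C "$1" |
xargs -I {} echo "$1": {}
# Without arguments, mongorestore restores from the dump/ directory
# in the current directory
cd "$1" || exit 1
mongorestore 2>&1 | xargs -I {} echo "$1": {}
cd ..
rm -Rf "$1"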
#!/usr/bin/env bash
#
# Restore all daily GHTorrent MongoDB dumps, two at a time
#
curl -s http://ghtorrent-downloads.ewi.tudelft.nl/mongo-daily/ |
cut -f 2 -d'"' |
grep mongo-dump |
cut -f 3,4,5 -d'-' |
cut -f 1 -d '.' |
xargs -P 2 -I {} ./restore-daily-dump.sh {} 2>&1 |
tee restore-daily-dumps.log
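
The two cut calls in the script above turn a daily dump file name into the date argument that restore-daily-dump.sh expects. A minimal sketch of that step on its own, assuming bash; the function name `dump_date` and the sample date are hypothetical, while the mongo-dump-<date>.tar.gz pattern comes from restore-daily-dump.sh.

```shell
#!/usr/bin/env bash
# Extract the yyyy-mm-dd part of a daily dump file name with the same
# cut pipeline the restore script uses.
dump_date() {
  printf '%s\n' "$1" |
  cut -f 3,4,5 -d'-' |   # drop the "mongo-dump-" prefix
  cut -f 1 -d'.'         # drop the ".tar.gz" suffix
}

dump_date mongo-dump-2015-12-01.tar.gz   # prints 2015-12-01
```

Running this on a few names from the directory index is a cheap way to check the date list before starting a restore.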