@dioxias
Created January 20, 2023 00:11
This bash script downloads a whole subreddit wiki. Thanks to u/ThePixelHunter (https://www.reddit.com/user/ThePixelHunter/) for the original script, adapted here to work on Windows machines by adding one line that strips carriage returns from each subreddit name.
#!/bin/bash
# Requires: bash coreutils curl jq
set -x  # Trace every command as it runs; remove this line for quieter output.
USER_AGENT='superagent/1.0'
EXPORTDIR="exports"
while read -r line; do
  SUBREDDIT="$line"
  # Strip carriage returns so a list saved with Windows line endings still works.
  SUBREDDIT=$(printf '%s' "$SUBREDDIT" | tr -d '\r')
  [ -n "$SUBREDDIT" ] || continue  # Skip blank lines.
  while read -r page; do
    # A nested page name such as "a/b" keeps its slash in the request URL,
    # but is saved on disk as "a--b".
    PAGE="${page//\//--}"
    OUTDIR="./$EXPORTDIR/$SUBREDDIT/wiki"
    mkdir -p "$OUTDIR"
    curl -s --user-agent "$USER_AGENT" "https://www.reddit.com/r/$SUBREDDIT/wiki/$page.json" > "$OUTDIR/$PAGE.json"
    # Print the page name followed by curl's exit status (0 means the transfer succeeded).
    printf '%s/wiki/%s %s\n' "$SUBREDDIT" "$PAGE" "$?"
    # Extract the Markdown source from the saved API response.
    jq -r '.data.content_md' "$OUTDIR/$PAGE.json" > "$OUTDIR/$PAGE.md"
  done < <(curl -sS --user-agent "$USER_AGENT" "https://www.reddit.com/r/$SUBREDDIT/wiki/pages.json" | jq -r '.data | .[]')
done < subreddits.list
dioxias commented Jan 20, 2023

Basic idea:

The Reddit API lets you fetch a subreddit's complete list of wiki pages: https://www.reddit.com/r/DataHoarder/wiki/pages.json
And you can fetch each page as a JSON object which includes both the rendered HTML and the original Markdown source: https://www.reddit.com/r/DataHoarder/wiki/zfs.json
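As a rough sketch of those two requests, the helpers below build the endpoint URLs and pipe the responses through jq. The function names are hypothetical and only for illustration; the URLs, the user agent string, and the jq filters are the ones the script uses. The two fetch functions need network access, curl, and jq, so they are defined here but not called.

```shell
USER_AGENT='superagent/1.0'

# Hypothetical helpers that build the two endpoint URLs.
wiki_index_url() { printf 'https://www.reddit.com/r/%s/wiki/pages.json' "$1"; }
wiki_page_url()  { printf 'https://www.reddit.com/r/%s/wiki/%s.json' "$1" "$2"; }

# List every wiki page name of a subreddit (one name per line).
list_wiki_pages() {
  curl -sS --user-agent "$USER_AGENT" "$(wiki_index_url "$1")" | jq -r '.data | .[]'
}

# Print the Markdown source of a single wiki page.
wiki_page_markdown() {
  curl -sS --user-agent "$USER_AGENT" "$(wiki_page_url "$1" "$2")" | jq -r '.data.content_md'
}

# Example (not run here): wiki_page_markdown DataHoarder zfs
```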

Guide

The script exports each subreddit's wiki into its own directory tree, writing a JSON file (raw API response) and a Markdown file (raw wiki page text) for each wiki page.
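To make the layout concrete, here is a small sketch that prints the two file paths for a given subreddit and page. The helper name `export_paths` is hypothetical; the path pattern and the replacement of "/" with "--" in page names follow the script.

```shell
EXPORTDIR="exports"

# Hypothetical helper: print the JSON and Markdown paths used for one wiki page.
export_paths() {
  local subreddit="$1" page="$2"
  local name="${page//\//--}"   # "a/b" becomes "a--b"
  printf '%s\n' \
    "./$EXPORTDIR/$subreddit/wiki/$name.json" \
    "./$EXPORTDIR/$subreddit/wiki/$name.md"
}

export_paths DataHoarder zfs
# ./exports/DataHoarder/wiki/zfs.json
# ./exports/DataHoarder/wiki/zfs.md
```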

The script expects a subreddits.list file in the format:

DataHoarder
Homelab
DHExchange

...separated by unix newlines \n, not Windows newlines \r\n.

There's no error handling, and you'll definitely hit the API rate limits at some point.
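One possible way to soften this, not part of the original script, is to retry failed requests with an increasing pause. The sketch below assumes hypothetical helpers named `retry_with_backoff` and `fetch_json`; it relies on curl's `--fail` option so that HTTP error responses (a rate-limit reply, for example) produce a non-zero exit status. The retry count and delay are illustrative values, not known limits of the API.

```shell
# Hypothetical helper: run a command up to MAX_TRIES times,
# doubling the pause (in seconds) after every failed attempt.
retry_with_backoff() {
  local max_tries="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      return 1   # Give up after the last allowed attempt.
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  return 0
}

# Hypothetical helper: download URL to FILE, failing on HTTP error responses.
fetch_json() {
  local url="$1" out="$2"
  curl -sS --fail --user-agent "${USER_AGENT:-superagent/1.0}" -o "$out" "$url"
}

# Example (not run here):
# retry_with_backoff 5 2 fetch_json "https://www.reddit.com/r/DataHoarder/wiki/pages.json" pages.json
```

Keeping the retry logic separate from the download means the same wrapper can guard both the page-list request and the per-page requests.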

Edit: to make it work on Windows with git-bash or WSL (Windows Subsystem for Linux), this line strips carriage returns from each subreddit name:

SUBREDDIT=$(printf '%s' "$SUBREDDIT" | tr -d '\r')
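A quick illustration of why the carriage return matters: a line read from a file with Windows line endings carries an invisible trailing `\r`, which would otherwise end up in the request URL and in the directory name. The helper name `strip_cr` is hypothetical and used only for this sketch.

```shell
# Hypothetical helper: remove every carriage return from one line of text.
strip_cr() {
  printf '%s' "$1" | tr -d '\r'
}

# A line read from a file with Windows line endings keeps a trailing \r.
raw=$'DataHoarder\r'
clean=$(strip_cr "$raw")
echo "${#raw} ${#clean}"   # prints "12 11"
```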
