@unhammer
Last active December 25, 2022 18:34
Turn Wikipedia dumps into plaintext using pandoc
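#!/bin/bash
# Convert the wikitext of each <text> element in a MediaWiki XML dump (read
# from stdin) to OUTFORMAT with pandoc, writing the result to stdout.
set -e -u
declare OUTFORMAT=plain
declare out=    # temp dir; global so the EXIT trap can still see it
trap '[[ -z $out ]] || rm -rf "$out"' EXIT

run () {
    local text=false l t
    local -i i=0
    out=$(mktemp -d -t pandir.XXXXXXXXXXX)
    # IFS= keeps leading whitespace, which is significant in wikitext.
    while IFS= read -r l || [[ -n $l ]]; do
        # Skip empty self-closing elements such as <text bytes="0" />.
        if [[ $l =~ \<text[^\>]*/\> ]]; then
            continue
        fi
        if [[ $l =~ \<text[^\>]*\> ]]; then
            text=true
            (( ++i ))
        fi
        if $text; then
            # Strip only the opening and closing tags (shortest match), so an
            # article that fits on one line keeps its body.
            t="${l#*<text*>}"
            printf '%s\n' "${t%</text>*}" >>"${out}/$i.wiki"
        fi
        if $text && [[ $l =~ \</text\> ]]; then
            text=false
            pandoc -f mediawiki -t "${OUTFORMAT}" "${out}/$i.wiki" && rm -f "${out}/$i.wiki"
        fi
    done
}
run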
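The script hands each article to pandoc exactly as it appears in the dump. Dump text is XML-escaped, so wikitext markup such as <ref> arrives as &lt;ref&gt;. The following is a minimal sketch of a filter that could decode those entities before conversion. The name decode_entities is hypothetical and not part of the gist, and the list of four entities is an assumption about what the dump contains. Whether decoding improves the output for a given pandoc version has not been checked here.

```bash
# Hypothetical helper, not part of the gist: decode the four XML entities
# assumed to occur in dump text. &amp; is handled last so that an escaped
# entity such as "&amp;lt;" becomes "&lt;" rather than "<".
decode_entities () {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e 's/&amp;/\&/g'
}
```

One possible way to try it is to replace the pandoc call in the script with decode_entities < "${out}/$i.wiki" | pandoc -f mediawiki -t "${OUTFORMAT}", since pandoc reads standard input when no file is given.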
# make the script executable:
chmod +x mwdump-to-pandoc
# Get the newest release of pandoc from https://github.com/jgm/pandoc/releases/latest , e.g.:
wget https://github.com/jgm/pandoc/releases/download/1.16.0.2/pandoc-1.16.0.2-1-amd64.deb
# install it (Debian/Ubuntu):
sudo dpkg -i pandoc-1.16.0.2-1-amd64.deb
# grab a wikipedia dump, e.g.
wget https://dumps.wikimedia.org/dawiki/20160111/dawiki-20160111-pages-articles.xml.bz2
# run conversion:
bzcat dawiki-20160111-pages-articles.xml.bz2 | ./mwdump-to-pandoc | xz - > dawiki-20160111-pandoc.txt.xz
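A full dump takes a long time to convert, so it can help to try the pipeline on a tiny input first. The sketch below prints a simplified stand-in for a pages-articles dump. The function name make_sample_dump and both articles are made up for illustration, and the XML is reduced to the elements the script cares about; a real dump carries more metadata per page.

```bash
# Hypothetical helper, not part of the gist: print a tiny, simplified
# stand-in for a pages-articles dump containing two made-up articles.
make_sample_dump () {
    cat <<'EOF'
<mediawiki>
  <page>
    <title>Eksempel</title>
    <revision>
      <text xml:space="preserve">'''Eksempel''' is a [[test]] article.

== Section ==
* first item
* second item</text>
    </revision>
  </page>
  <page>
    <title>Kort</title>
    <revision>
      <text xml:space="preserve">A one-line article.</text>
    </revision>
  </page>
</mediawiki>
EOF
}
```

Running make_sample_dump | ./mwdump-to-pandoc should then print the converted text of both articles. The second article opens and closes its text element on the same line, which is a case worth checking in the output.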