@unhammer unhammer/mwdump-to-pandoc
Last active May 16, 2017

Turn Wikipedia dumps into plaintext using pandoc
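#!/bin/bash
# Read a MediaWiki XML dump on stdin and convert the contents of each <text>
# element with pandoc, one article at a time.
declare OUTFORMAT=plain
declare out=""  # global so the EXIT trap can still expand it after run() returns
set -e -u
run () {
    local text=false
    local -i i=0
    local l
    out=$(mktemp -d -t pandir.XXXXXXXXXXX)
    trap 'rm -rf "${out}"' EXIT
    # IFS= preserves leading whitespace, which is significant in wikitext.
    while IFS= read -r l; do
        # Skip empty self-closing elements such as <text xml:space="preserve" />.
        if [[ $l =~ \<text[^\>]*/\> ]]; then
            continue
        fi
        if [[ $l =~ \<text[^\>]*\> ]]; then
            text=true
            (( ++i ))
            # Shortest-prefix removal: strip through the opening tag only, so
            # single-line articles (<text ...>content</text>) keep their content.
            l="${l#*<text*>}"
        fi
        if $text; then
            # Write the line without the closing tag, if it has one.
            printf '%s\n' "${l%%</text>*}" >>"${out}/$i.wiki"
        fi
        if [[ $l == *"</text>"* ]]; then
            text=false
            pandoc -f mediawiki -t "${OUTFORMAT}" "${out}/$i.wiki" && rm -f "${out}/$i.wiki"
        fi
    done
}
run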
chmod +x mwdump-to-pandoc
# Get the newest release of pandoc from https://github.com/jgm/pandoc/releases/latest , e.g.:
wget https://github.com/jgm/pandoc/releases/download/1.16.0.2/pandoc-1.16.0.2-1-amd64.deb
# install the downloaded package (Debian/Ubuntu):
sudo dpkg -i pandoc-1.16.0.2-1-amd64.deb
# grab a wikipedia dump, e.g.
wget https://dumps.wikimedia.org/dawiki/20160111/dawiki-20160111-pages-articles.xml.bz2
# run conversion:
bzcat dawiki-20160111-pages-articles.xml.bz2 | ./mwdump-to-pandoc | xz - > dawiki-20160111-pandoc.txt.xz
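
Before running the conversion on a full dump, it can help to check the tag-stripping step on a tiny input without pandoc. The following is a minimal sketch, not part of the gist. The function name `extract_wikitext` and the sample fragment are made up for illustration. It assumes the dump keeps `<text ...>` and `</text>` as raw tags while escaping `<` and `>` inside article content.

```shell
#!/bin/bash
# Hypothetical helper, not part of the gist: print the wikitext found between
# <text ...> and </text>, reading the dump line by line from stdin.
extract_wikitext () {
    local in_text=false line
    local open_re='<text[^>]*>'      # opening tag, with or without attributes
    local empty_re='<text[^>]*/>'    # assumption: empty articles are self-closing
    while IFS= read -r line; do
        if [[ $line =~ $empty_re ]]; then
            continue
        fi
        if [[ $line =~ $open_re ]]; then
            in_text=true
            line=${line#*<text*>}    # shortest match: strip through the opening tag
        fi
        if $in_text; then
            printf '%s\n' "${line%%</text>*}"
        fi
        if [[ $line == *"</text>"* ]]; then
            in_text=false
        fi
    done
}

# Made-up two-page fragment in the shape of a pages-articles dump.
sample='<page>
  <title>A</title>
  <text xml:space="preserve">#REDIRECT [[B]]</text>
</page>
<page>
  <title>B</title>
  <text xml:space="preserve">First line
 indented line</text>
</page>'

printf '%s\n' "$sample" | extract_wikitext
```

The output should contain only the three wikitext lines, with the leading space of the indented line intact. Piping that output to `pandoc -f mediawiki -t plain` is then a quick way to check the pandoc side separately.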