@mosuka
Created March 16, 2019 07:43
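# download wikipedia dump (create ~/tmp first so curl -o has somewhere to write)
# if this dated dump is no longer hosted, substitute a newer dump date in the URL
mkdir -p ~/tmp
curl -o ~/tmp/enwiki-20190101-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20190101/enwiki-20190101-pages-articles.xml.bz2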
# clone wikiextractor
git clone git@github.com:attardi/wikiextractor.git
# parse wikipedia dump
cd wikiextractor
./WikiExtractor.py -o ~/tmp/enwiki --json ~/tmp/enwiki-20190101-pages-articles.xml.bz2
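# sanity-check the extracted output (a minimal sketch, not part of the original steps)
# Assumption: with --json, WikiExtractor writes files named wiki_NN under
# subdirectories of the output directory, one JSON object per line.
# count_articles and first_record are hypothetical helper names introduced here.

```shell
# Count extracted articles: one line per article across all wiki_* files.
count_articles() {
  find "$1" -type f -name 'wiki_*' -exec cat {} + | wc -l | tr -d ' '
}

# Print the first record of the first output file for a quick visual check.
first_record() {
  _first=$(find "$1" -type f -name 'wiki_*' | sort | head -n 1)
  if [ -n "$_first" ]; then
    head -n 1 "$_first"
  fi
}

# Only run against the real output once extraction has produced it.
if [ -d "$HOME/tmp/enwiki" ]; then
  count_articles "$HOME/tmp/enwiki"
  first_record "$HOME/tmp/enwiki"
fi
```

# If the count is 0, the extraction likely failed or the output layout differs from the assumption above.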