Created November 11, 2017 15:36

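CLI script for extracting plain text out of a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, named like LANGwiki-YYYYMMDD-pages-articles.xml.bz2 or LANGwiki-latest-pages-articles.xml.bz2, where LANG is the wiki language code and YYYYMMDD is the dump date. These files can be large (e.g. 14 GB).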

It streams through all the XML articles using multiple cores (#cores - 1, by default), decompressing on the fly and extracting plain text article sections from each article.
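The streaming decompression and the default worker count described above can be sketched with the standard library alone. This is only an illustration, not the script's actual implementation: `default_workers` and `iter_decompressed_lines` are hypothetical names, and the real script parses XML pages rather than yielding raw lines.

```python
import bz2
import multiprocessing


def default_workers():
    """Return #cores - 1, but never fewer than one worker."""
    return max(1, multiprocessing.cpu_count() - 1)


def iter_decompressed_lines(path):
    """Yield decoded lines from a .bz2 file, decompressing on the fly.

    bz2.open reads the archive incrementally, so the whole dump never
    has to fit in memory at once.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            yield line.rstrip("\n")
```

Because the generator yields one line at a time, a consumer can start working on the first pages while the rest of the dump is still compressed on disk.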

For each extracted article, it prints the title, the section names and the plain-text section contents in JSON Lines format (one JSON object per line).
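To make the output format concrete, the sketch below serializes one record as a single JSON line and parses it back. The field names `title`, `section_titles` and `section_texts` are the ones the script's output uses; the article content and the helper name `to_json_line` are made up for illustration.

```python
import json


def to_json_line(title, sections):
    """Serialize one article as a single JSON line.

    `sections` is a list of (section_title, section_text) pairs.
    """
    record = {
        "title": title,
        "section_titles": [name for name, _ in sections],
        "section_texts": [text for _, text in sections],
    }
    return json.dumps(record)


# Made-up article, used only to show the shape of one output line.
line = to_json_line(
    "Example article",
    [("Introduction", "Some intro text."), ("History", "Some history text.")],
)
article = json.loads(line)
```

The two lists are parallel: the i-th entry of `section_titles` names the i-th entry of `section_texts`, which is why they can be walked together with `zip`.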


python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

Processing the entire English Wikipedia dump takes about 2 hours (roughly 2.5 million articles per hour on an 8-core Intel Xeon E3-1275 @ 3.60GHz).

You can then read the created output (~6.1 GB gzipped) with:

import json

import smart_open

# iterate over the plain text data we just created
# (older smart_open releases expose the same reader as smart_open.smart_open)
with smart_open.open('enwiki-latest.json.gz', 'rb') as infile:
    for line in infile:
        # decode each JSON line into a Python dictionary object
        article = json.loads(line)
        # each article has a "title" and parallel lists "section_titles" and "section_texts"
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
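If smart_open is not installed, the same file can be read with the standard library. The following is a minimal sketch that assumes the output was written with a `.gz` suffix; `iter_articles` and `section_lengths` are hypothetical helper names, not part of gensim.

```python
import gzip
import json


def iter_articles(path):
    """Yield one dict per line of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as infile:
        for line in infile:
            yield json.loads(line)


def section_lengths(article):
    """Map each section title to the character count of its text."""
    return {
        title: len(text)
        for title, text in zip(article["section_titles"], article["section_texts"])
    }
```

A typical use would be `for article in iter_articles('enwiki-latest.json.gz'): print(article['title'], section_lengths(article))`, which streams the file one article at a time.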

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to MediaWiki database dump (read-only).
  -o OUTPUT, --output OUTPUT
                        Path to output file (stdout if not specified). If ends
                        in .gz or .bz2, the output file will be automatically
                        compressed (recommended!).
  -w WORKERS, --workers WORKERS
                        Number of parallel workers for multi-core systems.
                        Default: 7.
  -m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER
                        Ignore articles with fewer characters than this
                        (article stubs). Default: 200.
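The options listed above can be approximated with `argparse`. This is a reconstruction from the help text, not the script's source: `build_parser` is a hypothetical name, treating `--file` as required is an assumption, and the worker default is computed as #cores - 1 (which would print as 7 on a machine reporting 8 cores).

```python
import argparse
import multiprocessing


def build_parser():
    """Build a parser mirroring the options listed in the help text."""
    # Assumption: the default worker count is #cores - 1, at least 1.
    default_workers = max(1, multiprocessing.cpu_count() - 1)
    parser = argparse.ArgumentParser(
        description="Extract plain text sections from a MediaWiki dump."
    )
    # Assumption: the input path is mandatory.
    parser.add_argument("-f", "--file", required=True,
                        help="Path to MediaWiki database dump (read-only).")
    parser.add_argument("-o", "--output",
                        help="Path to output file (stdout if not specified).")
    parser.add_argument("-w", "--workers", type=int, default=default_workers,
                        help="Number of parallel workers. Default: %(default)s.")
    parser.add_argument("-m", "--min-article-character", type=int, default=200,
                        help="Ignore articles with fewer characters than this.")
    return parser


args = build_parser().parse_args(
    ["-f", "enwiki-latest-pages-articles.xml.bz2", "-o", "enwiki-latest.json.gz"]
)
```

After parsing, `args.min_article_character` holds the stub threshold and `args.workers` the pool size, so both can be overridden from the command line with `-m` and `-w`.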
