CLI script for extracting plain text out of a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, named like &lt;LANG&gt;wiki-&lt;YYYYMMDD&gt;-pages-articles.xml.bz2 or &lt;LANG&gt;wiki-latest-pages-articles.xml.bz2 (e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
It streams through all the XML articles using multiple cores (#cores - 1 by default), decompressing on the fly and extracting the plain-text sections of each article.
For each extracted article, it prints its title, section titles and plain-text section contents, in json-lines format.
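For illustration, each output line is a single JSON object with `title`, `section_titles` and `section_texts` fields (the field names are used in the reading example below; the concrete values here are made up):

```json
{"title": "Anarchism", "section_titles": ["Introduction", "Etymology"], "section_texts": ["Anarchism is a political philosophy ...", "The word anarchism is composed from ..."]}
```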
Convert a Wikipedia dump to json-lines format:

```bash
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
```
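Since the output goes to stdout when `-o` is not given (see the options below), an equivalent sketch is to compress the output externally:

```bash
# stream json-lines to stdout and compress with gzip
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest.json.gz
```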
Processing the entire English Wikipedia dump takes about 2 hours (roughly 2.5 million articles per hour on an 8-core Intel Xeon E3-1275 @ 3.60GHz).
You can then read the created output (~6.1 GB gzipped) with:
```python
import json

from smart_open import smart_open  # handles on-the-fly decompression of .gz / .bz2 files

# iterate over the plain text data we just created
for line in smart_open('enwiki-latest.json.gz'):
    # decode each JSON line into a Python dictionary object
    article = json.loads(line)

    # each article has a "title" and a list of "section_titles" and "section_texts"
    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)
```
Optional arguments:

- `-h, --help`: show this help message and exit.
- `-f FILE, --file FILE`: Path to the MediaWiki database dump (read-only).
- `-o OUTPUT, --output OUTPUT`: Path to the output file (stdout if not specified). If it ends in .gz or .bz2, the output file will be automatically compressed (recommended!).
- `-w WORKERS, --workers WORKERS`: Number of parallel workers for multi-core systems. Default: 7.
- `-m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER`: Ignore articles with fewer characters than this (article stubs). Default: 200.
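For example (an illustrative invocation; the worker count and character threshold here are arbitrary):

```bash
# use 4 worker processes and skip stub articles shorter than 500 characters
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 \
    -o enwiki-latest.json.gz -w 4 -m 500
```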