piskvorky/segment_wiki.md

## segment_wiki.md

      
          
CLI script for extracting plain text out of a raw Wikipedia dump. The dump is an `.xml.bz2` file provided by MediaWiki, named like `<LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2` or `<LANG>wiki-latest-pages-articles.xml.bz2` (e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
It streams through all the XML articles using multiple cores (#cores - 1, by default), decompressing on the fly and extracting plain text article sections from each article.
For each extracted article, it prints the title, section names and plain text section contents in JSON Lines format (one JSON object per line).
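To make the "decompressing on the fly" idea concrete, here is a minimal single-process sketch. It is not the script's actual implementation: the helper names `local_name` and `iter_pages`, the tiny in-memory sample dump, and the `default_workers` variable are all assumptions made for illustration, and the real script additionally spreads the per-article work across worker processes.

```python
import bz2
import io
import multiprocessing
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, wiki_markup) pairs one page at a time (hypothetical helper)."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the memory held by the page just processed


# Worker count as described in the text: all cores but one (at least one).
default_workers = max(1, multiprocessing.cpu_count() - 1)

# Tiny made-up stand-in for a real dump, compressed with bz2 in memory.
sample_xml = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Anarchism</title><revision><text>Intro text.</text></revision></page>
  <page><title>Autism</title><revision><text>Another intro.</text></revision></page>
</mediawiki>"""
compressed = io.BytesIO(bz2.compress(sample_xml))

# bz2.open decompresses lazily, so the parser never needs the whole file in memory.
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))

print(pages)
```

With a real dump you would pass the path of the `.xml.bz2` file to `bz2.open` instead of the in-memory buffer.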
Examples

```bash
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
```
Processing the entire English Wikipedia dump takes about 2 hours (roughly 2.5 million articles per hour on an 8-core Intel Xeon E3-1275 @ 3.60GHz).
You can then read the created output (~6.1 GB gzipped) with:
```python
import json

import smart_open  # smart_open.open transparently decompresses .gz and .bz2 files

# iterate over the plain text data we just created
with smart_open.open('enwiki-latest.json.gz', 'rb') as fin:
    for line in fin:
        # decode each JSON line into a Python dictionary object
        article = json.loads(line)

        # each article has a "title" and a list of "section_titles" and "section_texts".
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
```
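As a further illustration of the record layout, the sketch below writes two made-up articles as gzipped JSON lines into an in-memory buffer and reads them back, pairing each section title with its text. The records, the `"Introduction"` section name, and the `introductions` dictionary are assumptions for the example, and it uses only the standard library `gzip` module so it runs without the real output file.

```python
import gzip
import io
import json

# Two hypothetical records with the keys described above.
records = [
    {
        "title": "Anarchism",
        "section_titles": ["Introduction", "History"],
        "section_texts": ["Anarchism is a political philosophy.", "Early history."],
    },
    {
        "title": "Autism",
        "section_titles": ["Introduction"],
        "section_texts": ["Autism is a developmental condition."],
    },
]

# Write them as gzipped JSON lines (in-memory stand-in for enwiki-latest.json.gz).
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as fout:
    for record in records:
        fout.write(json.dumps(record) + "\n")

# Read the records back line by line and keep only the first-named section.
buffer.seek(0)
introductions = {}
with gzip.open(buffer, "rt", encoding="utf-8") as fin:
    for line in fin:
        article = json.loads(line)
        sections = dict(zip(article["section_titles"], article["section_texts"]))
        introductions[article["title"]] = sections.get("Introduction", "")

print(introductions)
```

Because every line is an independent JSON object, the file can be processed one article at a time without loading the whole output into memory.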
```
optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to MediaWiki database dump (read-only).
  -o OUTPUT, --output OUTPUT
                        Path to output file (stdout if not specified). If ends in .gz or .bz2, the output file will be automatically compressed (recommended!).
  -w WORKERS, --workers WORKERS
                        Number of parallel workers for multi-core systems. Default: 7.
  -m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER
                        Ignore articles with fewer characters than this (article stubs). Default: 200.
```
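For readers who want to see how options like these fit together, here is a rough `argparse` sketch that mirrors the help text above. It is an illustration only, not the script's source: marking `-f` as required, computing the worker default as cores minus one (which would give 7 on an 8-core machine), and the `compress_output` variable are assumptions.

```python
import argparse
import multiprocessing

# Assumed default: all cores but one, never fewer than one worker.
default_workers = max(1, multiprocessing.cpu_count() - 1)

parser = argparse.ArgumentParser(
    description="Illustrative sketch of the segment_wiki command-line options."
)
parser.add_argument(
    "-f", "--file", required=True,
    help="Path to MediaWiki database dump (read-only).",
)
parser.add_argument(
    "-o", "--output", default=None,
    help="Path to output file (stdout if not specified).",
)
parser.add_argument(
    "-w", "--workers", type=int, default=default_workers,
    help="Number of parallel workers for multi-core systems.",
)
parser.add_argument(
    "-m", "--min-article-character", type=int, default=200,
    help="Ignore articles with fewer characters than this (article stubs).",
)

# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = parser.parse_args(
    ["-f", "enwiki-latest-pages-articles.xml.bz2", "-o", "enwiki-latest.json.gz", "-w", "4"]
)

# The help text says output is compressed when the path ends in .gz or .bz2.
compress_output = args.output is not None and args.output.endswith((".gz", ".bz2"))

print(args.workers, args.min_article_character, compress_output)
```

Here `-w 4` overrides the core-based default, while `-m` falls back to 200 because it was not given.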