Wikimedia article extractions

STEP 1: Use the WikiExtractor tool to generate <doc>-based XML from the Wikipedia dump.


  • Get the latest copy of the articles from the Wikipedia downloads page:
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  • Run WikiExtractor over the dump; -c compresses each output chunk with bzip2, -b 250K caps the chunk size, and -o sets the output directory:
> bzcat enwiki-latest-pages-articles.xml.bz2 | WikiExtractor.py -cb 250K -o extracted

  • To combine all of the extracted text into a single file, run:
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
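
  • For a quick sanity check, peek at the top of the combined file:
> head -n 5 wiki_parsed.xml

Each article should come out wrapped in a <doc> element along these lines (the id, url, and title below are purely illustrative, and the exact attributes depend on the WikiExtractor version):

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy ...
</doc>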

STEP 2: Prepare the parsed XML for extraction.


  • Remove any leftover incomplete tags (a few broken or incomplete HTML tags may be passed through unconverted by the WikiExtractor script).
  • The command below also creates a backup of the original as wiki_parsed.xml.bak; to skip the backup, pass -i with an empty argument (-i'').
> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml
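
  • As a quick illustration (the sample line below is made up), the expression strips stray tags such as <br> and <b> while leaving the <doc> wrappers intact:
> echo '<br><doc id="1" title="Test"><b>bold</b> text</doc>' | sed -e 's/<[^d][^o>]*[^c>]*>//g'

This should print: <doc id="1" title="Test">bold text</doc>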

STEP 3: Split the parsed XML into per-article files.


> mkdir -p splitted
> awk 'BEGIN { j = 1; system("mkdir -p splitted/sub" j) }
  # Every article begins with a <doc ...> tag: start a new output file there.
  /<doc/ {
    ++i
    x = "F" i ".xml"
    # Roll over to a fresh sub-directory every 1995 articles so that no
    # single directory ends up holding too many files.
    if (i % 1995 == 0) {
      ++j
      system("mkdir -p splitted/sub" j)
    }
  }
  # Append the current line to the current article file (skipping anything
  # before the first <doc> tag) and close it so awk does not run out of
  # open file descriptors.
  x != "" {
    print >> ("splitted/sub" j "/" x)
    close("splitted/sub" j "/" x)
  }' wiki_parsed.xml
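
  • Optionally, verify the split by counting the generated per-article files and listing the sub-directories (the exact counts will depend on the dump):
> find splitted -name 'F*.xml' | wc -l
> ls splitted | head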

STEP 4: Run the rake task to extract the data and dump it into the database.

Assuming the split docs are in the 'splitted' directory:

> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'

All of the articles should now be populated in the wiki_articles table. Phew, this is much easier once you figure out the proper tools.
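
As a final check (assuming this is a Rails app and the wiki_articles table is backed by a model named WikiArticle, which is only a guess based on the table name), a quick row count confirms the import:

> bundle exec rails runner 'puts WikiArticle.count'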
