- Get the latest copy of the articles from the Wikipedia download page.
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
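- (Optional) Verify the download before spending hours processing it. Assuming the dump directory still publishes an md5sums file alongside the archive (it does for current dumps), fetch it and compare the hash of the downloaded file against the matching entry:
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-md5sums.txt
> md5sum enwiki-latest-pages-articles.xml.bz2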
- Use Medialab's Wikipedia Extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download the Python script and save it as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to extract the articles, stripping the Wikipedia markup and wrapping each article in <doc> XML nodes (a sample of the output is shown below).
- This might take some time depending upon the processing capacity of your computer.
> bzcat enwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -cb 250K -o extracted
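- Each article ends up wrapped in a doc element inside the (bzip2-compressed, because of -c) files under the extracted directory. The output looks roughly like this; the attribute values below are illustrative placeholders:

    <doc id="..." url="..." title="Article title">
    Article title

    Plain-text body of the article, with the wiki markup stripped.
    </doc>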
- In order to combine all of the extracted text into a single file, run:
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
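- As a quick sanity check, count the doc elements in the combined file; the number should be close to the article count of the dump (scanning the whole file may take a while):
> grep -c '<doc' wiki_parsed.xml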
- Remove any leftover incomplete tags (a few missing or malformed HTML tags may be passed through unconverted by the Python script above).
- The command below also keeps a backup of the original as wiki_parsed.xml.bak; if you want to skip the backup, pass -i with an empty '' argument instead of -i.bak.
> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml
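- To see whether any stray tags survived the cleanup, list the most frequent non-doc tags that remain (again a full-file scan):
> grep -o '<[^>]*>' wiki_parsed.xml | grep -v -e '^<doc' -e '^</doc>' | sort | uniq -c | sort -rn | head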
- Split the combined file into one file per article, grouped into sub-directories of 1995 files each; the awk script below creates splitted/sub1, splitted/sub2, ... and writes each doc element to its own F<n>.xml file.
> mkdir -p splitted
> awk 'BEGIN { j = 1; system("mkdir -p splitted/sub" j) }
/<doc/ {
    # close the previous article file and start a new one
    if (x) close("splitted/sub" j "/" x)
    x = "F" ++i ".xml"
    # start a new sub-directory every 1995 articles
    if (i % 1995 == 0) { ++j; system("mkdir -p splitted/sub" j) }
}
{ print >> ("splitted/sub" j "/" x) }' wiki_parsed.xml
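- Optionally, confirm that the number of split files matches the doc count from earlier:
> find splitted -type f -name '*.xml' | wc -l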
- Assuming the split docs are now in the 'splitted' directory, load them into the database:
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
And we should have all the articles populated into the wiki_articles table. Phew, this was easy once we figured out the proper tools.
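- For reference, here is a minimal sketch of what such a rake task might look like. This is an assumption, not the project's actual implementation; it presumes a WikiArticle ActiveRecord model with title and body columns:

    # lib/tasks/wiki_extractor.rake -- hypothetical sketch, not the project's actual task
    namespace :wiki_extractor do
      desc 'Load split <doc> files into the wiki_articles table'
      task :extract_to_db, [:dir] => :environment do |_t, args|
        Dir.glob(File.join(args[:dir], '**', '*.xml')).each do |path|
          text  = File.read(path)
          # pull the title out of the <doc ...> attributes, then strip the doc tags
          title = text[/<doc[^>]*title="([^"]*)"/, 1]
          body  = text.sub(/<doc[^>]*>\n?/, '').sub(%r{</doc>}, '').strip
          WikiArticle.create!(title: title, body: body)
        end
      end
    end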