- Get the latest copy of the articles from the Wikipedia download page.
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
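- (Optional) Verify the download before spending hours processing it. Assuming the dump directory still publishes an md5sums file alongside the archive (it does for current dumps), fetch it and compare the hash of the downloaded file against the matching entry:
> wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-md5sums.txt
> md5sum enwiki-latest-pages-articles.xml.bz2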
- Use Medialab's Wikipedia Extractor tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
- Download the Python script and save it as WikiExtractor.py: http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
- Run the Python script to extract the articles, stripping the Wikipedia markup and wrapping each article in <doc> XML nodes (a sample of the output is shown below).
- This might take some time depending upon the processing capacity of your computer.
> bzcat enwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -cb 250K -o extracted
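- Each article ends up wrapped in a doc element inside the (bzip2-compressed, because of -c) files under the extracted directory. The output looks roughly like this; the attribute values below are illustrative placeholders:

    <doc id="..." url="..." title="Article title">
    Article title

    Plain-text body of the article, with the wiki markup stripped.
    </doc>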
- In order to combine all of the extracted text into a single file, run:
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > wiki_parsed.xml
> rm -rf extracted
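- As a quick sanity check, count the doc elements in the combined file; the number should be close to the article count of the dump (scanning the whole file may take a while):
> grep -c '<doc' wiki_parsed.xml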
- Remove any leftover incomplete tags (a few missing or malformed HTML tags may be passed through unconverted by the Python script above).
- The command below also keeps a backup of the original as wiki_parsed.xml.bak; if you want to skip the backup, pass -i with an empty '' argument instead of -i.bak.
> sed -i.bak -e 's/<[^d][^o>]*[^c>]*>//g' wiki_parsed.xml
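- To see whether any stray tags survived the cleanup, list the most frequent non-doc tags that remain (again a full-file scan):
> grep -o '<[^>]*>' wiki_parsed.xml | grep -v -e '^<doc' -e '^</doc>' | sort | uniq -c | sort -rn | head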
- Split the combined file into one file per article, grouped into sub-directories of 1995 files each; the awk script below creates splitted/sub1, splitted/sub2, ... and writes each doc element to its own F<n>.xml file.
> mkdir -p splitted
> awk 'BEGIN { j = 1; system("mkdir -p splitted/sub" j) }
/<doc/ {
    # close the previous article file and start a new one
    if (x) close("splitted/sub" j "/" x)
    x = "F" ++i ".xml"
    # start a new sub-directory every 1995 articles
    if (i % 1995 == 0) { ++j; system("mkdir -p splitted/sub" j) }
}
{ print >> ("splitted/sub" j "/" x) }' wiki_parsed.xml
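- Optionally, confirm that the number of split files matches the doc count from earlier:
> find splitted -type f -name '*.xml' | wc -l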
- Assuming the split docs are now in the 'splitted' directory, load them into the database:
> bundle exec rake 'wiki_extractor:extract_to_db[splitted]'
And we should have all the articles populated into the wiki_articles table. Phew, this was easy once we figured out the proper tools.
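- For reference, here is a minimal sketch of what such a rake task might look like. This is an assumption, not the project's actual implementation; it presumes a WikiArticle ActiveRecord model with title and body columns:

    # lib/tasks/wiki_extractor.rake -- hypothetical sketch, not the project's actual task
    namespace :wiki_extractor do
      desc 'Load split <doc> files into the wiki_articles table'
      task :extract_to_db, [:dir] => :environment do |_t, args|
        Dir.glob(File.join(args[:dir], '**', '*.xml')).each do |path|
          text  = File.read(path)
          # pull the title out of the <doc ...> attributes, then strip the doc tags
          title = text[/<doc[^>]*title="([^"]*)"/, 1]
          body  = text.sub(/<doc[^>]*>\n?/, '').sub(%r{</doc>}, '').strip
          WikiArticle.create!(title: title, body: body)
        end
      end
    end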