@atomic77 · Created March 18, 2017 20:16
Loading Wikimedia dumps into Elasticsearch

Wikipedia uses Elasticsearch in production for full-text search (via the CirrusSearch extension), having moved off a homegrown Lucene-based tool. CirrusSearch dumps, already formatted for the Elasticsearch bulk API, are published for all the various wikis - much easier to work with than the SQL and XML dumps!
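The dumps are published under https://dumps.wikimedia.org/other/cirrussearch/, one directory per run. A minimal sketch of fetching the file used below, assuming the 20170313 run is still online (older runs eventually get rotated out):

# Download the CirrusSearch content dump for Italian Wikinews
wget https://dumps.wikimedia.org/other/cirrussearch/20170313/itwikinews-20170313-cirrussearch-content.json.gz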

Tested with Italian Wikinews - everything seems to be loaded into a single page document type. Not entirely sure what the timestamp field is, but it looks like the last time the page was changed.
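To poke at this yourself once the import below has run, the stock mapping and search endpoints are enough (itwiki_content is the index name used in the import command):

# Show the mapping of the page type, including the timestamp field
curl -s 'http://localhost:9200/itwiki_content/_mapping?pretty'
# Pull one document to eyeball the fields
curl -s 'http://localhost:9200/itwiki_content/_search?size=1&pretty'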

docker run --name elasticsearch -d -p 9200:9200 -p 9300:9300 elasticsearch
docker run --link elasticsearch:elasticsearch -p 5601:5601 -d kibana
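# Sanity check before importing (assumes default ports and no auth):
# Elasticsearch should answer with its JSON version banner
curl -s http://localhost:9200/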
# The dump alternates a bulk action line with a document line, so feed
# parallel two-line records (-L 2), 2000 records per bulk request (-N 2000)
zcat itwikinews-20170313-cirrussearch-content.json.gz | parallel --pipe -L 2 -N 2000 'curl -s http://localhost:9200/itwiki_content/_bulk --data-binary @- > /dev/null'
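Since the bulk responses are piped to /dev/null, any rejected documents go unnoticed; a document count at the end at least confirms the load landed (standard _count endpoint, same index as above):

# How many pages made it into the index?
curl -s 'http://localhost:9200/itwiki_content/_count?pretty'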