@gose
Created June 13, 2020 14:45
require 'elastic-app-search'
require 'json'
require 'zlib' # provides Zlib::GzipReader, used below
client = Elastic::AppSearch::Client.new(
  api_key: 'private-my-key',
  api_endpoint: 'https://my-endpoint/api/as/v1/'
)
engine_name = 'wikipedia'
documents = []
id = nil
i = 0
# Open the dump in binary mode and decompress it as a stream.
gzfile = File.open("/mnt/data/enwiki-20200518-cirrussearch-content.json.gz", 'rb')
data = Zlib::GzipReader.new(gzfile)
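# The dump alternates between an action line ({"index": {...}}) and the
# document line it describes, so remember the id until the next line.
data.each_line do |line|
  parsed = JSON.parse(line)
  if parsed['index'] && parsed['index']['_type'] == "page" && parsed['index']['_id']
    id = parsed['index']['_id']
    next
  end
  i += 1
  doc = {}
  if parsed['title'].nil?
    puts "Skipping line #{i} with id #{id} since the TITLE is empty."
    next
  end
  doc['id'] = id
  doc['title'] = parsed['title']
  doc['timestamp'] = parsed['timestamp']
  doc['create_timestamp'] = parsed['create_timestamp']
  doc['incoming_links'] = parsed['incoming_links']
  doc['category'] = parsed['category']
  doc['text'] = parsed['text']
  doc['text_bytes'] = parsed['text_bytes']
  doc['content_model'] = parsed['content_model']
  doc['heading'] = parsed['heading']
  doc['opening_text'] = parsed['opening_text']
  doc['popularity_score'] = parsed['popularity_score']
  doc['url'] = "https://en.wikipedia.org/wiki/#{doc['title'].tr(' ', '_')}"
  documents << doc
  # Flush on batch size rather than on the line counter: a skipped line on a
  # multiple of 100 would otherwise let the batch grow past 100 documents.
  if documents.size >= 100
    client.index_documents(engine_name, documents)
    puts "Uploaded #{documents.size} documents (#{i} document lines read)"
    documents = []
  end
end
# Upload whatever is left in the final, partial batch.
unless documents.empty?
  client.index_documents(engine_name, documents)
  puts "Uploaded #{documents.size} documents (#{i} document lines read)"
end
data.close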
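
The loop above relies on the dump alternating between an action line and the document line it describes. A minimal sketch of that pairing, run against a tiny in-memory gzip sample instead of the real dump, might look like the following. The helper name `each_cirrus_document` and the sample records are hypothetical and are not part of the original script; no App Search calls are made.

```ruby
require 'json'
require 'stringio'
require 'zlib'

# Hypothetical helper (not in the original script): walks a line-oriented
# dump and yields [id, document] for every document line, using the id
# taken from the action line that precedes it.
def each_cirrus_document(io)
  return enum_for(:each_cirrus_document, io) unless block_given?

  id = nil
  io.each_line do |line|
    parsed = JSON.parse(line)
    if parsed['index']
      id = parsed['index']['_id'] # action line: remember the id
    else
      yield id, parsed            # document line: pair it with that id
    end
  end
end

# Made-up sample data in the same two-line-per-page layout.
sample = [
  { 'index' => { '_type' => 'page', '_id' => '1' } },
  { 'title' => 'Example A', 'text_bytes' => 10 },
  { 'index' => { '_type' => 'page', '_id' => '2' } },
  { 'title' => 'Example B', 'text_bytes' => 20 }
].map(&:to_json).join("\n") + "\n"

# Compress the sample in memory and read it back through GzipReader,
# the same way the script reads the file on disk.
reader = Zlib::GzipReader.new(StringIO.new(Zlib.gzip(sample)))
pairs = each_cirrus_document(reader).to_a

pairs.each { |id, doc| puts "#{id}: #{doc['title']}" }
```

Keeping the pairing in its own method makes it possible to check the parsing on a few lines before pointing the script at the full dump.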