@gose
Last active June 29, 2020 17:44
require 'elastic-workplace-search'
require 'json'
require 'zlib' # Zlib::GzipReader is used below to stream the dump

# Configure the gem before creating the client.
Elastic::WorkplaceSearch.access_token = 'my-access-token'
Elastic::WorkplaceSearch.endpoint = 'https://my-endpoint.ent-search.us-central1.gcp.cloud.es.io/api/ws/v1'
client = Elastic::WorkplaceSearch::Client.new
content_source_key = 'my-source-key'
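documents = [] # current batch awaiting upload
id = nil       # _id taken from the most recent action line
i = 0          # number of document lines seen so far

# https://dumps.wikimedia.org/other/cirrussearch/20200518/enwiki-20200518-cirrussearch-content.json.gz
gzfile = File.open("enwiki-20200518-cirrussearch-content.json.gz", "rb")
data = Zlib::GzipReader.new(gzfile)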
# The dump alternates an action line ({"index": {...}}) with a document line.
data.each_line do |line|
  parsed = JSON.parse(line)

  # Action line: remember the page id for the document line that follows.
  if parsed['index'] && parsed['index']['_type'] == "page" && parsed['index']['_id']
    id = parsed['index']['_id']
    next
  end

  i += 1
  if parsed['title'].nil?
    puts "Skipping line #{i} with id #{id} since the TITLE is empty."
    next
  end

  doc = {}
  doc['id'] = id
  doc['title'] = parsed['title']
  doc['timestamp'] = parsed['timestamp']
  doc['create_timestamp'] = parsed['create_timestamp']
  doc['incoming_links'] = parsed['incoming_links']
  doc['category'] = parsed['category']
  doc['text'] = parsed['text']
  doc['text_bytes'] = parsed['text_bytes']
  doc['content_model'] = parsed['content_model']
  doc['heading'] = parsed['heading']
  doc['opening_text'] = parsed['opening_text']
  doc['popularity_score'] = parsed['popularity_score']
  doc['url'] = "https://en.wikipedia.org/wiki/#{doc['title'].gsub(/ /, '_')}"
  documents << doc

  # Flush on batch size rather than on i, so a skipped line cannot let a
  # batch grow past 100 documents.
  if documents.size >= 100
    begin
      client.index_documents(content_source_key, documents)
      puts "Uploaded batch (#{i} documents processed)"
    rescue Elastic::WorkplaceSearch::ClientException => e
      puts e
    end
    documents = []
  end
end

# Upload whatever is left in the final partial batch.
unless documents.empty?
  begin
    client.index_documents(content_source_key, documents)
    puts "Uploaded final batch (#{i} documents processed)"
  rescue Elastic::WorkplaceSearch::ClientException => e
    puts e
  end
end
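
The script pairs each action line with the document line that follows it and uploads in batches of 100. As a rough sketch of that pairing and batching on its own, using only the standard library: `each_document_batch` is a hypothetical helper name (not part of the Workplace Search gem), and unlike the script it keeps every field of the parsed document instead of a selected subset.

```ruby
require 'json'

# Hypothetical helper: pairs each action line with the document line after it
# and yields arrays of at most `batch_size` documents.
def each_document_batch(lines, batch_size = 100)
  return enum_for(:each_document_batch, lines, batch_size) unless block_given?

  batch = []
  id = nil
  lines.each do |line|
    parsed = JSON.parse(line)
    if parsed['index'] && parsed['index']['_type'] == 'page' && parsed['index']['_id']
      id = parsed['index']['_id'] # action line: remember the id
      next
    end
    next if parsed['title'].nil? # same skip rule as the script

    batch << parsed.merge('id' => id)
    if batch.size >= batch_size
      yield batch
      batch = []
    end
  end
  yield batch unless batch.empty? # flush the final partial batch
end
```

Under these assumptions, the gzip reader could be passed in directly, for example `each_document_batch(data) { |batch| client.index_documents(content_source_key, batch) }`, since the helper only needs an object that responds to `each` and yields one line at a time.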
gose commented Jun 29, 2020

Increasing the timeout fixed this:

client = Elastic::WorkplaceSearch::Client.new(overall_timeout: 300)
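
If a longer timeout alone is not enough, a failed batch could be retried instead of dropped. This is only a sketch: `with_retries` is a hypothetical helper, and which exception classes the client raises on a timeout is an assumption to check against the gem version in use.

```ruby
# Hypothetical helper: runs the block, retrying up to `attempts` times on the
# given error classes, sleeping with exponential backoff between tries.
def with_retries(attempts: 3, base_delay: 1.0, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts # out of attempts: re-raise the last error
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end
```

A call such as `with_retries(on: [Elastic::WorkplaceSearch::ClientException]) { client.index_documents(content_source_key, documents) }` would wrap the upload, assuming that exception class is the one raised for the failure being retried.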
