@gose
Last active June 29, 2020 17:44
require 'elastic-workplace-search'
require 'json'
require 'zlib'

Elastic::WorkplaceSearch.access_token = 'my-access-token'
Elastic::WorkplaceSearch.endpoint = 'https://my-endpoint.ent-search.us-central1.gcp.cloud.es.io/api/ws/v1'
client = Elastic::WorkplaceSearch::Client.new

content_source_key = 'my-source-key'
documents = []
id = nil
i = 0

# Dump: https://dumps.wikimedia.org/other/cirrussearch/20200518/enwiki-20200518-cirrussearch-content.json.gz
gzfile = File.open("enwiki-20200518-cirrussearch-content.json.gz")
data = Zlib::GzipReader.new(gzfile)

data.each_line do |line|
  parsed = JSON.parse(line)

  # The dump alternates an action line (carrying the _id) with a source line.
  if parsed['index'] && parsed['index']['_type'] == "page" && parsed['index']['_id']
    id = parsed['index']['_id']
    next
  end

  i += 1
  if parsed['title'].nil?
    puts "Skipping line #{i} with id #{id} since the TITLE is empty."
    next
  end

  doc = {}
  doc['id'] = id
  doc['title'] = parsed['title']
  doc['timestamp'] = parsed['timestamp']
  doc['create_timestamp'] = parsed['create_timestamp']
  doc['incoming_links'] = parsed['incoming_links']
  doc['category'] = parsed['category']
  doc['text'] = parsed['text']
  doc['text_bytes'] = parsed['text_bytes']
  doc['content_model'] = parsed['content_model']
  doc['heading'] = parsed['heading']
  doc['opening_text'] = parsed['opening_text']
  doc['popularity_score'] = parsed['popularity_score']
  doc['url'] = "https://en.wikipedia.org/wiki/#{doc['title'].gsub(/ /, '_')}"
  documents << doc

  # Upload in batches of 100.
  if i % 100 == 0
    begin
      document_receipts = client.index_documents(content_source_key, documents)
      puts "Uploaded #{i} documents"
    rescue Elastic::WorkplaceSearch::ClientException => e
      puts e
    end
    documents = []
  end
end

# Flush any documents left over in the final partial batch.
client.index_documents(content_source_key, documents) unless documents.empty?
gose commented Jun 28, 2020

I'm trying to track down what's causing this timeout. It doesn't fail at a consistent point: in this run I ingested ~1,154,000 documents before it timed out, while the previous run timed out after only ~74,000 documents.

I'm ingesting an English Wikipedia export into Workplace Search. The dump contains ~6M documents (see the comment in the code for the source).
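For reference, the CirrusSearch dump is in Elasticsearch bulk format: each document is two consecutive JSON lines, an action line carrying the `_id` and a source line carrying the fields, which is why the script stashes the id and `next`s on action lines. A minimal sketch of parsing one pair (the field values below are illustrative, not taken from the actual dump):

```ruby
require 'json'

# Two consecutive lines from a CirrusSearch bulk dump (values are made up).
action_line = '{"index":{"_type":"page","_id":"12345"}}'
source_line = '{"title":"Ruby (programming language)","text":"Ruby is ..."}'

action = JSON.parse(action_line)
source = JSON.parse(source_line)

id    = action['index']['_id']
title = source['title']
url   = "https://en.wikipedia.org/wiki/#{title.gsub(/ /, '_')}"
puts "#{id}: #{url}"
```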

Screenshots of the cluster size are attached.

$ ruby ws-index.rb 
Uploaded 100 documents
Uploaded 200 documents
Uploaded 300 documents
...
Uploaded 1153800 documents
Uploaded 1153900 documents
Uploaded 1154000 documents
Traceback (most recent call last):
	21: from ws-index.rb:17:in `<main>'
	20: from ws-index.rb:17:in `each_line'
	19: from ws-index.rb:52:in `block in <main>'
	18: from /var/lib/gems/2.7.0/gems/elastic-workplace-search-0.4.1/lib/elastic/workplace-search/client/content_source_documents.rb:18:in `index_documents'
	17: from /var/lib/gems/2.7.0/gems/elastic-workplace-search-0.4.1/lib/elastic/workplace-search/client/content_source_documents.rb:36:in `async_create_or_update_documents'
	16: from /var/lib/gems/2.7.0/gems/elastic-workplace-search-0.4.1/lib/elastic/workplace-search/request.rb:17:in `post'
	15: from /var/lib/gems/2.7.0/gems/elastic-workplace-search-0.4.1/lib/elastic/workplace-search/request.rb:32:in `request'
	14: from /usr/lib/ruby/2.7.0/timeout.rb:110:in `timeout'
	13: from /var/lib/gems/2.7.0/gems/elastic-workplace-search-0.4.1/lib/elastic/workplace-search/request.rb:57:in `block in request'
	12: from /usr/lib/ruby/2.7.0/net/http.rb:1483:in `request'
	11: from /usr/lib/ruby/2.7.0/net/http.rb:933:in `start'
	10: from /usr/lib/ruby/2.7.0/net/http.rb:1485:in `block in request'
	 9: from /usr/lib/ruby/2.7.0/net/http.rb:1492:in `request'
	 8: from /usr/lib/ruby/2.7.0/net/http.rb:1519:in `transport_request'
	 7: from /usr/lib/ruby/2.7.0/net/http.rb:1519:in `catch'
	 6: from /usr/lib/ruby/2.7.0/net/http.rb:1528:in `block in transport_request'
	 5: from /usr/lib/ruby/2.7.0/net/http/response.rb:31:in `read_new'
	 4: from /usr/lib/ruby/2.7.0/net/http/response.rb:42:in `read_status_line'
	 3: from /usr/lib/ruby/2.7.0/net/protocol.rb:201:in `readline'
	 2: from /usr/lib/ruby/2.7.0/net/protocol.rb:191:in `readuntil'
	 1: from /usr/lib/ruby/2.7.0/net/protocol.rb:217:in `rbuf_fill'
/usr/lib/ruby/2.7.0/net/protocol.rb:217:in `wait_readable': execution expired (Timeout::Error)


gose commented Jun 29, 2020

Increasing the timeout fixed this:

client = Elastic::WorkplaceSearch::Client.new(overall_timeout: 300)
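For a multi-day bulk load like this, it may also help to retry transient timeouts with backoff rather than let one slow request kill the run. A generic sketch (the `with_retries` helper and its parameters are my own, not part of the gem):

```ruby
require 'timeout'

# Retry the block up to `attempts` times on Timeout::Error,
# sleeping base_delay, 2*base_delay, 4*base_delay, ... between tries.
def with_retries(attempts: 3, base_delay: 1)
  tries = 0
  begin
    yield
  rescue Timeout::Error
    tries += 1
    raise if tries >= attempts
    sleep(base_delay * 2**(tries - 1))
    retry
  end
end

# Demo with a stub that fails twice, then succeeds:
calls = 0
result = with_retries(attempts: 3, base_delay: 0) do
  calls += 1
  raise Timeout::Error if calls < 3
  :ok
end
```

The batch upload in the script would then become `with_retries { client.index_documents(content_source_key, documents) }`.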
