@hmldd
Last active August 8, 2024 23:41
Example of Elasticsearch scrolling using Python client
# coding:utf-8
from elasticsearch import Elasticsearch
import json
# Define config
host = "127.0.0.1"
port = 9200
timeout = 1000
index = "index"
doc_type = "type"
size = 1000
body = {}
# Init Elasticsearch instance
es = Elasticsearch(
    [
        {
            'host': host,
            'port': port
        }
    ],
    timeout=timeout
)
# Process hits here
def process_hits(hits):
    for item in hits:
        print(json.dumps(item, indent=2))
# Check index exists
if not es.indices.exists(index=index):
    print("Index " + index + " does not exist")
    exit()
# Init scroll by search
data = es.search(
    index=index,
    doc_type=doc_type,
    scroll='2m',
    size=size,
    body=body
)
# Get the scroll ID
sid = data['_scroll_id']
scroll_size = len(data['hits']['hits'])
while scroll_size > 0:
    print("Scrolling...")
    # Process the current batch of hits before fetching the next one
    process_hits(data['hits']['hits'])
    data = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = data['_scroll_id']
    # Get the number of hits returned by the last scroll
    scroll_size = len(data['hits']['hits'])
# Release the scroll context once every batch has been processed
es.clear_scroll(scroll_id=sid)
@rmatte

rmatte commented Jun 21, 2022

This is a nice example, but it's missing this at the very end after the while loop finishes...

es.clear_scroll(scroll_id=sid)

If you don't do that, the scroll context stays open. Open contexts can build up over time until they reach the global server-side limit, at which point other queries that rely on scroll can start failing across the board until the open scroll IDs eventually time out.
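
One way to make sure the cleanup also happens when processing fails partway through is to put the loop in a try/finally. This is only a sketch, assuming a client that exposes the same search, scroll and clear_scroll calls used in the gist. The function name iter_scroll_hits is made up for illustration, and doc_type is left out on the assumption that a recent cluster no longer needs it.

```python
def iter_scroll_hits(es, index, body=None, size=1000, scroll="2m"):
    """Yield every hit for a query and always release the scroll context.

    `es` is assumed to behave like the client in the gist; the function
    name and defaults are hypothetical.
    """
    sid = None
    try:
        data = es.search(index=index, body=body or {}, scroll=scroll, size=size)
        sid = data["_scroll_id"]
        hits = data["hits"]["hits"]
        while hits:
            for hit in hits:
                yield hit
            data = es.scroll(scroll_id=sid, scroll=scroll)
            sid = data["_scroll_id"]
            hits = data["hits"]["hits"]
    finally:
        # Runs when the generator is exhausted, closed early, or an
        # exception propagates out of the loop.
        if sid is not None:
            es.clear_scroll(scroll_id=sid)
```

With this shape the caller just writes a for loop over iter_scroll_hits(es, "index"), and breaking out early or hitting an error still clears the scroll.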

@hmldd
Author

hmldd commented Jul 6, 2022

Thank you for your feedback, I have updated the gist.

@abdullah-alnahas

Thanks for the gist.
I also found an answer on Stack Overflow about the helper function scan (elasticsearch.helpers.scan), which abstracts away the scroll logic.
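
For anyone curious what that looks like, here is a rough sketch, assuming the elasticsearch package is installed and the index name matches the gist. The wrapper iter_sources and its scan_fn parameter are hypothetical; scan_fn only exists so the function can be tried without a live cluster.

```python
def iter_sources(es, index, query=None, scan_fn=None):
    """Yield the _source of every matching document via the scan helper.

    `scan_fn` is a hypothetical hook for testing; by default the real
    elasticsearch.helpers.scan is used.
    """
    if scan_fn is None:
        # Imported lazily so the sketch can be exercised without the package
        from elasticsearch.helpers import scan as scan_fn
    query = query or {"query": {"match_all": {}}}
    # scan runs the initial search and the follow-up scroll calls itself
    for hit in scan_fn(es, index=index, query=query, scroll="2m", size=1000):
        yield hit["_source"]
```

Usage would be something like: for doc in iter_sources(es, "index"): print(doc)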

@ooplease

ooplease commented Aug 8, 2024

> Thanks for the gist. I also found an answer on Stack Overflow about the helper function scan, which abstracts away the scroll logic.

Does scan just start wherever the server says the scroll_id is? My concern is that scan just yields an arbitrary-length generator full of hits, so if your script terminates unexpectedly you could gracefully dump to JSON but you might have no idea where exactly you were in the scroll when you wanted to resume.
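
As far as I can tell, scan issues a fresh search each time it is called, so it does not pick up an earlier scroll. If resuming matters, one alternative (a different technique from scroll, not something the gist does) is search_after with a deterministic sort, saving the last sort values as you go. A rough sketch under several assumptions: the documents have a unique sortable field to pass as sort_field, the checkpoint file path is yours to choose, and the name iter_resumable is made up.

```python
import json
import os


def iter_resumable(es, index, sort_field, checkpoint_path, size=1000):
    """Yield hits in sort order, resuming after the last checkpointed batch.

    Hypothetical helper: `sort_field` is assumed to be unique per document,
    and `checkpoint_path` is any writable file path.
    """
    search_after = None
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            search_after = json.load(f)
    while True:
        body = {
            "size": size,
            "query": {"match_all": {}},
            "sort": [{sort_field: "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = es.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # Checkpoint only after the whole batch was consumed, so a crash
        # mid-batch replays that batch instead of skipping documents.
        search_after = hits[-1]["sort"]
        with open(checkpoint_path, "w") as f:
            json.dump(search_after, f)
```

The trade-off is that a restart may reprocess up to one batch, so whatever consumes the hits should tolerate duplicates.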
