@hmldd
Last active Aug 6, 2021
Example of Elasticsearch scrolling using Python client
# coding:utf-8
from elasticsearch import Elasticsearch
import json

# Define config
host = "127.0.0.1"
port = 9200
timeout = 1000
index = "index"
doc_type = "type"
size = 1000
body = {}

# Init Elasticsearch instance
es = Elasticsearch(
    [
        {
            'host': host,
            'port': port
        }
    ],
    timeout=timeout
)

# Process hits here
def process_hits(hits):
    for item in hits:
        print(json.dumps(item, indent=2))

# Check index exists
if not es.indices.exists(index=index):
    print("Index " + index + " does not exist")
    exit()

# Init scroll by search
data = es.search(
    index=index,
    doc_type=doc_type,
    scroll='2m',
    size=size,
    body=body
)

# Get the scroll ID
sid = data['_scroll_id']
scroll_size = len(data['hits']['hits'])

while scroll_size > 0:
    print("Scrolling...")

    # Before scroll, process current batch of hits
    process_hits(data['hits']['hits'])

    data = es.scroll(scroll_id=sid, scroll='2m')

    # Update the scroll ID
    sid = data['_scroll_id']

    # Get the number of results returned in the last scroll
    scroll_size = len(data['hits']['hits'])
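One cleanup step the gist omits: the server keeps each scroll context alive until the `scroll='2m'` keep-alive expires, so it is polite to free it explicitly once the loop finishes. A minimal sketch, assuming the `es` client and final `sid` from the script above (`finish_scroll` is a helper name introduced here, not part of the gist):

```python
def finish_scroll(es, sid):
    """Free the server-side scroll context explicitly.

    Without this, the context lingers until the '2m' keep-alive
    expires. clear_scroll is part of the official Python client;
    a failure here is harmless, so it is swallowed.
    """
    try:
        es.clear_scroll(scroll_id=sid)
    except Exception:
        pass
```

Usage would be a single `finish_scroll(es, sid)` after the while loop.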
@lhzw commented Nov 27, 2018

Thanks for sharing.

@vikash423q commented Dec 4, 2018

Thanks

@stone2014 commented Dec 20, 2018

Thanks!

@chkpcs commented Feb 14, 2019

Thanks

@houhashv commented Mar 13, 2019

It does not work for me; I'm getting this error:

Connected to pydev debugger (build 183.5429.31)
GET http://aaelk:9200/_search/scroll?scroll=2m [status:503 request:0.047s]
GET http://aaelk:9200/_search/scroll?scroll=2m [status:503 request:0.031s]
GET http://aaelk:9200/_search/scroll?scroll=2m [status:503 request:0.031s]
GET http://aaelk:9200/_search/scroll?scroll=2m [status:503 request:0.001s]
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.4\helpers\pydev\pydevd.py", line 1741, in <module>
main()
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.4\helpers\pydev\pydevd.py", line 1735, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.4\helpers\pydev\pydevd.py", line 1135, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/yossih/PycharmProjects/zap/elastic.py", line 60, in <module>
data = es.scroll(scroll_id=sid, scroll='2m')
File "C:\Users\yossih\AppData\Local\Continuum\anaconda3\lib\site-packages\elasticsearch\client\utils.py", line 76, in wrapped
return func(*args, params=params, **kwargs)
File "C:\Users\yossih\AppData\Local\Continuum\anaconda3\lib\site-packages\elasticsearch\client\__init__.py", line 1016, in scroll
params=params, body=body)
File "C:\Users\yossih\AppData\Local\Continuum\anaconda3\lib\site-packages\elasticsearch\transport.py", line 318, in perform_request
status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
File "C:\Users\yossih\AppData\Local\Continuum\anaconda3\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 186, in perform_request
self._raise_error(response.status, raw_data)
File "C:\Users\yossih\AppData\Local\Continuum\anaconda3\lib\site-packages\elasticsearch\connection\base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(503, '{"_scroll_id":"DnF1ZXJ5VGhlbkZldGNoCgAAAAAAZBRSFmtmNjJjcGctVFJTbVBYZXd6VDlDRUEAAAAAAGQUURZrZjYyY3BnLVRSU21QWGV3elQ5Q0VBAAAAAABkFFkWa2Y2MmNwZy1UUlNtUFhld3pUOUNFQQAAAAAAZBRVFmtmNjJjcGctVFJTbVBYZXd6VDlDRUEAAAAAAGQUWhZrZjYyY3BnLVRSU21QWGV3elQ5Q0VBAAAAAABkFFYWa2Y2MmNwZy1UUlNtUFhld3pUOUNFQQAAAAAAZBRXFmtmNjJjcGctVFJTbVBYZXd6VDlDRUEAAAAAAGQUUxZrZjYyY3BnLVRSU21QWGV3elQ5Q0VBAAAAAABkFFgWa2Y2MmNwZy1UUlNtUFhld3pUOUNFQQAAAAAAZBRUFmtmNjJjcGctVFJTbVBYZXd6VDlDRUE=","took":1,"timed_out":false,"_shards":{"total":10,"successful":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}')

@robertbrooker commented May 15, 2019

You can remove the 1st call to process_hits if you put the 2nd call to process_hits before es.scroll
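robertbrooker's restructuring can be sketched as a loop with a single process_hits call per batch, placed before es.scroll. This is a sketch only: `scroll_all` is a helper name introduced here, and it assumes `es`, `process_hits`, and the initial `data` response from the gist above:

```python
def scroll_all(es, data, process_hits, scroll='2m'):
    # Process each batch exactly once, then fetch the next page.
    # The loop ends when a page comes back with no hits.
    sid = data['_scroll_id']
    while data['hits']['hits']:
        process_hits(data['hits']['hits'])
        data = es.scroll(scroll_id=sid, scroll=scroll)
        sid = data['_scroll_id']
```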

@nschmeller commented Jul 11, 2019

This helped me out a lot! I needed to change port to 80 and doc_type is deprecated, but otherwise it's golden.
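For newer clients where `doc_type` is deprecated, the initial search can be issued without it. A sketch under that assumption (`start_scroll` is a name introduced here; the port-80 change only applies to setups like nschmeller's where the cluster sits behind a proxy):

```python
def start_scroll(es, index, body, size=1000, scroll='2m'):
    # Same initial request as the gist, minus the deprecated doc_type.
    return es.search(index=index, scroll=scroll, size=size, body=body)
```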

@hmldd (Owner) commented Aug 29, 2019

You can remove the 1st call to process_hits if you put the 2nd call to process_hits before es.scroll

Good idea!

@hmldd (Owner) commented Aug 29, 2019

TransportError

Try checking whether the index is already present or not.

@rflume commented Sep 22, 2019

Thank you! :)

@gauravkoradiya commented Oct 21, 2019

thanks. insightful.

@YushengFeng01 commented Dec 13, 2019

Thanks, it's really helpful!

@mincong-h commented Jan 19, 2020

Thank you @hmldd. Inspired by your example, I created a Java version here: https://mincong.io/2020/01/19/elasticsearch-scroll-api/

@pvalois commented Feb 2, 2020

Thanks for sharing, really useful.

@s50600822 commented Feb 22, 2020

It doesn't stop, really, it doesn't.

@timberswift commented Mar 4, 2020

It costs so much time when the search returns a large number of results.

@hmldd (Owner) commented Mar 9, 2020

It costs so much time when the search returns a large number of results.

Scrolling is not meant for searching; maybe you need a query instead, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

@alterego808 commented Mar 11, 2020

Doesn't work for me; I'm getting the error "Unexpected keyword argument 'scroll' in method call" on this line:

data = es.search(
    index=index,
    doc_type=doc_type,
    scroll='2m',
    size=size,
    body=body
)

The elasticsearch package for Python is already installed.

@dzhitomirsky commented Mar 24, 2020

Thanks man.

@diegosoaresslvp commented Apr 30, 2020

thx.

@Just-blue commented Jun 10, 2020

amazing!

@Soufiane-Fartit commented Jul 8, 2020

you saved my life, thanks

@xuyuntian commented Aug 18, 2020

nice! thx

@wajika commented Jul 8, 2021

I have a requirement. In process_hits, instead of using print, what "containers" can be used to store the data (if I use a list or dict to store 10,000+ records, is there a big performance overhead)? I need all the data because I need to compare and aggregate it.

@hmldd (Owner) commented Jul 23, 2021

I have a requirement. In process_hits, instead of using print, what "containers" can be used to store the data (if I use a list or dict to store 10,000+ records, is there a big performance overhead)? I need all the data because I need to compare and aggregate it.

It depends on the size of each record; 10,000+ records is a piece of cake for a modern computer.
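For wajika's question, a plain list is usually fine: appending hit dicts only stores references, so the overhead beyond the documents themselves is small. A sketch of a collecting process_hits that drops into the gist's loop (`all_hits` is a name introduced here):

```python
all_hits = []

def process_hits(hits):
    # list.extend stores references to the hit dicts, so 10,000+
    # hits cost little beyond the memory of the documents themselves.
    all_hits.extend(hits)

# After the scroll loop, compare/aggregate over the collected hits,
# e.g. pull out just the documents:
# sources = [hit['_source'] for hit in all_hits]
```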

@roschel commented Aug 6, 2021

Very helpful. Thank you!
