Minimal Working example of Elasticsearch scrolling using Python client
# Initialize the scroll
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',  # 'scan' was deprecated in ES 2.1 and removed in 5.0
    size=1000,
    body={
        # Your query's body
    })
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# Start scrolling
while scroll_size > 0:
    print("Scrolling...")
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
    print("scroll size: " + str(scroll_size))
    # Do something with the obtained page
@timnugent

commented Jul 15, 2015

Very handy, thanks for this :)

@evasilenko-light

commented Jul 17, 2015

Thank you, really useful

@cestinger

commented Aug 3, 2015

+1 - thanks so much!

@tbolis

commented Aug 19, 2015

+1 Very helpful.
You need to adjust it a bit to handle the case where there is no data at all,

so check the scroll size before continuing:

scroll_size = page['hits']['total']
if scroll_size == 0:
    return 'something'

sid = page['_scroll_id']
...

@muelli

commented Nov 18, 2015

mind you that elasticsearch.helpers.scan exists.

@soggychips

commented Jan 22, 2016

+1 Thanks! Way faster than elasticsearch.helpers.scan...

Do you know why the page results are 5x the "size" parameter?

Cheers

@UnderGreen

commented Jan 28, 2016

@soggychips
From: https://www.elastic.co/guide/en/elasticsearch/guide/1.x/scan-scroll.html
When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch.
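That arithmetic is easy to sanity-check; a minimal sketch (the function name is mine, not from any API):

```python
def max_batch_size(size, primary_shards):
    # With search_type='scan', `size` is applied per shard, so one
    # batch can return up to size * primary_shards documents.
    return size * primary_shards

# size=1000 against an index with 5 primary shards (the old default)
# explains pages of roughly 5x the requested size.
print(max_batch_size(1000, 5))
```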

@cnwarden

commented Mar 7, 2016

+1 helpful

@chishaku

commented Apr 5, 2016

+1 Thank you

@sporty

commented May 25, 2016

+1 Thank you!

@naoko

commented Jun 26, 2016

Thank you, very helpful.
Alternatively, the helper function returns a generator, which can be efficient:

import elasticsearch.helpers

elasticsearch.helpers.scan(
    es,
    query={ <your-query> },
    index=<index>,
    doc_type=<doc-type>,
)
@bridgesra

commented Sep 23, 2016

It doesn't work for me. I get a RequestError: TransportError(400, u'search_phase_execution_exception')

For example:


In [106]:  res = client.search(index = index, doc_type = doc_type, body = q)
In [107]: len(res["hits"]["hits"]) ## how many did it return? 
Out[107]: 10000
In [108]: res['hits']['total'] ## how many are there total? 
Out[108]: 16670920

OK, so I'd like to see all the results using the scroll API. Following your example:

In[109]:    res = client.search(index = index, doc_type = doc_type, body = q, scroll = '1m', search_type = 'scan', size = 10000) 
---------------------------------------------------------------------------
RequestError                              Traceback (most recent call last)
<ipython-input-105-26bd30048c65> in <module>()
----> 1 res = client.search(index = index, doc_type = doc_type, body = q, scroll = '1m', search_type = 'scan', size = 10000)

/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.pyc in _wrapped(*args, **kwargs)
     67                 if p in kwargs:
     68                     params[p] = kwargs.pop(p)
---> 69             return func(*args, params=params, **kwargs)
     70         return _wrapped
     71     return _wrapper

/usr/local/lib/python2.7/site-packages/elasticsearch/client/__init__.pyc in search(self, index, doc_type, body, params)
    529             index = '_all'
    530         _, data = self.transport.perform_request('GET', _make_path(index,
--> 531             doc_type, '_search'), params=params, body=body)
    532         return data
    533

/usr/local/lib/python2.7/site-packages/elasticsearch/transport.pyc in perform_request(self, method, url, params, body)
    305
    306             try:
--> 307                 status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
    308
    309             except TransportError as e:

/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.pyc in perform_request(self, method, url, params, body, timeout, ignore)
     91         if not (200 <= response.status < 300) and response.status not in ignore:
     92             self.log_request_fail(method, url, body, duration, response.status)
---> 93             self._raise_error(response.status, raw_data)
     94
     95         self.log_request_success(method, full_url, url, body, response.status,

/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.pyc in _raise_error(self, status_code, raw_data)
    103             pass
    104
--> 105         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
    106
    107

RequestError: TransportError(400, u'search_phase_execution_exception')

Any ideas?

@bridgesra

commented Sep 23, 2016

OK, I got it. Two steps to fix the above problem:

  1. pip install --upgrade elasticsearch.
  2. The error I then got was about an aggregation that's not allowed. It turns out my query body had "aggs": {} in it. Remove the empty agg and it no longer throws the error.
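A quick guard against that second failure, assuming the query body is a plain dict (the body below is illustrative, not from the thread):

```python
# Hypothetical query body; the point is stripping the empty aggs block
# that made older clients return a 400 with search_type='scan'.
body = {"query": {"match_all": {}}, "aggs": {}}

if not body.get("aggs"):
    body.pop("aggs", None)  # drop the empty agg before searching
```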
@sriharichava

commented Nov 14, 2016

Helped me a lot, Thanks.

@andypern

commented Dec 28, 2016

Thanks for making a simple example, very useful.

For others who use this example, keep in mind that the initial es.search not only returns the first scroll_id that you'll use for scrolling, but also contains hits that you'll want to process before initiating your first scroll. For most people this is probably obvious, but for the 'challenged' (like me), be sure to do something like:


page = es.search(
.....
    })

sid = page['_scroll_id']
scroll_size = page['hits']['total']

# before you scroll, process your current batch of hits
for hit in page['hits']['hits']:
    do_stuff

# Start scrolling
while (scroll_size > 0):
...


@sebbASF

commented Jan 5, 2017

If search_type = scan, the first response contains no hits - see [1] which says:

"A scanning scroll request differs from a standard scroll request in four ways:
...
The response of the initial search request will not contain any results in the hits array. The first results will be returned by the first scroll request.
..."

There are other differences when the scan option is used, see [1].

Note that search_type = scan was deprecated in 2.1 and removed in ES 5.0; it now causes a parsing error.
So if you are using ES 5.x, the first set of results will always appear in the initial scroll response.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.0/search-request-scroll.html#scroll-scan

@sebbASF

commented Jan 5, 2017

Note that there are two counts involved:

page['hits']['total'] - this is the total docs that match the query, and does not change between scroll responses
len(page['hits']['hits']) - this is the number of hits actually included on the page.

The initial code sets scroll_size to the total initially, and then resets it to the page hits later.
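One way to keep the two counts straight is to loop on the per-page count and carry the total along; a sketch (function name mine) assuming any client with the same search()/scroll() shape as elasticsearch-py:

```python
def scroll_pages(es, index, body, scroll='2m'):
    """Yield (total, hits) for each page; stop when a page comes back empty."""
    page = es.search(index=index, body=body, scroll=scroll)
    total = page['hits']['total']          # total matches: constant across pages
    while len(page['hits']['hits']) > 0:   # hits on this page: 0 ends the loop
        yield total, page['hits']['hits']
        page = es.scroll(scroll_id=page['_scroll_id'], scroll=scroll)
```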

@mikhilmohanan123

commented Mar 14, 2017

I have implemented this code with a scroll size of 1000, and after querying Elasticsearch, even though the total hits are around 1600, it skips the first page and loops through the remaining pages. Is there something I'm missing?

@tjeubaoit

commented Apr 3, 2017

You missed all the hits in the first page.

@gyli

commented May 25, 2017

I am surprised that so many people are still handling ES scroll manually. Take a look at the elasticsearch-dsl package: http://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html#hits

@ihor-nahuliak

commented Jul 19, 2017

Thank you! It works.

@sarojdongol

commented Sep 2, 2017

Thank you! it worked like a charm.

@henrikno

commented Sep 27, 2017

You should also clear the scroll when done to free memory in Elasticsearch. Otherwise it will keep the memory until the scroll timeout.
E.g. es.clear_scroll(body={'scroll_id': [sid]}, ignore=(404, ))
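Putting that cleanup in a finally block frees the context even if processing fails partway; a sketch (function name mine) built on the same client calls:

```python
def scroll_all(es, index, body, scroll='2m'):
    """Collect every hit, then free the server-side scroll context."""
    page = es.search(index=index, body=body, scroll=scroll)
    sid = page['_scroll_id']
    hits = []
    try:
        while page['hits']['hits']:
            hits.extend(page['hits']['hits'])
            page = es.scroll(scroll_id=sid, scroll=scroll)
            sid = page['_scroll_id']
    finally:
        # Free the scroll context instead of holding memory until the timeout.
        es.clear_scroll(body={'scroll_id': [sid]}, ignore=(404,))
    return hits
```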

@NinaSalimi

commented Oct 21, 2017

Very helpful! thanks

@alanwds

commented Jan 5, 2018

+1 Works like a charm. Thank you!

@jjjbushjjj

commented Jan 18, 2018

+100 Thank you!

@hardikgw

commented Feb 2, 2018

In 6.x there is no need to call es.clear_scroll; the default is clear_scroll=True.

@krishthotem

commented Feb 20, 2018

I have something like this in my Python program:

response = requests.get('https://example.com/_search?q=@version:2&scroll=2m')

data = json.loads(response.text)

sid = data['_scroll_id']
scroll_size = data['hits']['total']

# Until this line it works as expected

while (scroll_size > 0):
    data = {"scroll" : "2m", "scroll_id" : sid}
    response = requests.post('https://example.com/_search/scroll', data=data)
    data = json.loads(response.text)
    # do something
    sid = data['_scroll_id']

# But here at the requests.post() line, I get the error: {'message': 'Not Found', 'code': 404}

Any thoughts on what I am doing wrong and what should be changed? Thanks!

@evrycollin

commented Apr 4, 2018

Working with a Python generator

Works with ES >= 5.

1. Utility generator method

import math

def scroll(index, doc_type, query_body, page_size=100, debug=False, scroll='2m'):
    page = es.search(index=index, doc_type=doc_type, scroll=scroll, size=page_size, body=query_body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    total_pages = math.ceil(scroll_size / page_size)
    page_counter = 0
    if debug:
        print('Total items : {}'.format(scroll_size))
        print('Total pages : {}'.format(total_pages))
    # Start scrolling
    while scroll_size > 0:
        # Get the number of results returned by the last scroll
        scroll_size = len(page['hits']['hits'])
        if scroll_size > 0:
            if debug:
                print('> Scrolling page {} : {} items'.format(page_counter, scroll_size))
            yield total_pages, page_counter, scroll_size, page
        # Get the next page
        page = es.scroll(scroll_id=sid, scroll=scroll)
        page_counter += 1
        # Update the scroll ID
        sid = page['_scroll_id']

Usage :

index = 'cases_*'
doc_type = 'detail'
query = { "query": { "match_all": {} }, "_source": ['caseId'] }
page_size = 1000

for total_pages, page_counter, page_items, page_data in scroll(index, doc_type, query, page_size=page_size):
    print('total_pages={}, page_counter={}, page_items={}'.format(total_pages, page_counter, page_items))
    # do what you need with page_data
@mikej165

commented Apr 6, 2018

+1 Saved me a lot of time. Thanks!

@abdulwahid24

commented Apr 13, 2018

Thank you @evrycollin +1

@AnalystNidhi

commented Apr 24, 2018

In my project I need to fetch more than 10k documents, so I used the Elasticsearch scroll API with Python. Here is my sample code:

url = 'http://hostname:portname/_search/scroll'
scroll_url = 'http://hostname:portname/_search?scroll=2m'
query = {"query": {"bool": {"must": [{"match_all": {}}, {"range": {"@timestamp": {"gt": "now-24h", "lt": "now-1h", "time_zone": "-06:00"}}}], "must_not": [], "should": []}}, "from": 0, "size": 10, "sort": [], "aggs": {}}
response = requests.post(scroll_url, json=query).json()
sid = response['_scroll_id']
hits = response['hits']
total = hits['total']
while total > 0:
    scroll_query = json.dumps({"scroll": "2m", "scroll_id": sid})
    response1 = requests.post(url, data=scroll_query).json()
    sid = response1['_scroll_id']
    hits = response1['hits']
    total = len(hits['hits'])
    for each in hits['hits']:
        pass  # process each hit

The scroll worked exactly the way I wanted, but later I was informed that because of this scroll the Elasticsearch schema got corrupted and the indexes were recreated.

Is it true that scroll modifies the ES structure, or is something wrong with my code? Please let me know.

@rain1024

commented Jun 22, 2018

+1 helpful

@NguyenHauHN

commented Jul 12, 2018

Thank you so much!

@tade0726

commented Jul 19, 2018

@muelli Thanks, you are brilliant; I hope more people check out this API!

@sbrb

commented Aug 8, 2018

ES 6.3. This example makes my Elasticsearch service crash when trying to scroll 110k documents with size=10000, somewhere between the 5th and 7th iteration.

systemctl status elasticsearch

 elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-08-08 20:58:10 EEST; 21s ago
     Docs: http://www.elastic.co
  Process: 5860 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=127)
 Main PID: 5860 (code=exited, status=127)

Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112)
Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.cli.Command.main(Command.java:90)
Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Aug 08 20:57:18 myhost elasticsearch[5860]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:85)
Aug 08 20:57:18 myhost elasticsearch[5860]: 2018-08-08 20:57:18,490 main ERROR Null object returned for RollingFile in Appenders.
Aug 08 20:57:18 myhost elasticsearch[5860]: 2018-08-08 20:57:18,491 main ERROR Unable to locate appender "rolling" for logger config "root"
Aug 08 20:58:10 myhost systemd[1]: elasticsearch.service: Main process exited, code=exited, status=127/n/a
Aug 08 20:58:10 myhost systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

No logs in /var/log/elasticsearch/elasticsearch.log

@venkz

commented Sep 28, 2018

(Quoting @andypern's comment above about processing the first batch of hits before scrolling.)

Excellent! Important point to keep in mind 👍

@sibblegp

commented Oct 15, 2018

This is extremely slow for me. I used elasticsearch.helpers.scan instead and not only did it not crash my server, but it was much faster.

@tomaszhlawiczka

commented Oct 22, 2018

This is extremely slow for me. I used elasticsearch.helpers.scan instead and not only did it not crash my server, but it was much faster.

@sibblegp please see: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/breaking_50_search_changes.html#_literal_search_type_scan_literal_removed

Scroll requests sorted by _doc have been optimized to more efficiently resume from where the previous request stopped, so this will have the same performance characteristics as the former scan search type.
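In practice that means adding "sort": ["_doc"] to the query body; a minimal illustrative body (index name and sizes are placeholders):

```python
# Sorting by _doc skips scoring entirely, which is what restores
# the performance of the old scan search type on ES 5+.
body = {
    "query": {"match_all": {}},
    "sort": ["_doc"],
}
# Then scroll as usual, e.g.:
# page = es.search(index='yourIndex', body=body, scroll='2m', size=1000)
```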

@vadirajjahagirdar

commented Oct 23, 2018

Thanks a lot!!

@whtsead213

commented Nov 20, 2018

nice

@croepke

commented Dec 8, 2018

Many thanks! Very handy!

@akras-apixio

commented Dec 21, 2018

Warning: this code has a bug. It will throw away the first search result (aka the first 1000 items). A co-worker of mine copy-pasted this, causing us to waste a few hours.

@mybluedog24

commented Mar 8, 2019

This code doesn't work anymore in ES 6.4. I found another solution here: https://stackoverflow.com/questions/28537547/how-to-correctly-check-for-scroll-end

response = es.search(
    index='index_name',
    body=<your query here>,
    scroll='10m'
)
scroll_id = response['_scroll_id']

while len(response['hits']['hits']):
    # process results
    print([item["_id"] for item in response["hits"]["hits"]])
    response = es.scroll(scroll_id=scroll_id, scroll='10m')

Process the result right at the beginning of the while loop to avoid missing the first search result.

@feydan

commented Apr 9, 2019

The scroll id can change: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn’t always change — in any case, only the most recently received _scroll_id should be used.

Here is a simplified version that will work if the scroll id changes

response = es.search(
    index='index_name',
    body=<your query here>,
    scroll='10m'
)

while len(response['hits']['hits']):
    # process results
    print([item["_id"] for item in response["hits"]["hits"]])
    response = es.scroll(scroll_id=response['_scroll_id'], scroll='10m')