Efficient Django queryset iterator (works by dividing the queryset into chunks). Taken from https://djangosnippets.org/snippets/1949/
import gc


def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator does not support ordered query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Fetch the next chunk of rows whose pk is greater than the last one seen.
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()

@bgits bgits commented Aug 8, 2016

Why call gc.collect()? Doesn't Python handle garbage collection on its own, and wouldn't this likely interfere with automatic collection and hurt performance?
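
For context on this question, here is a small sketch of when an explicit `gc.collect()` can make a difference. It assumes CPython, where objects are freed by reference counting as soon as the last reference goes away and the cyclic collector only matters for reference cycles. The `Row` class and both helper functions are hypothetical stand-ins, not part of the snippet, and the sketch says nothing about whether Django model instances actually form cycles.

```python
import gc
import weakref


class Row:
    """Hypothetical stand-in for a model instance."""


def freed_without_collect():
    # With no reference cycle, CPython frees the object as soon as the
    # last reference is dropped; no gc.collect() is involved.
    row = Row()
    ref = weakref.ref(row)
    del row
    return ref() is None


def cycle_needs_collect():
    # Two objects that reference each other survive `del` until the
    # cyclic collector runs. Automatic collection is switched off here
    # only to make the demonstration deterministic.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        a, b = Row(), Row()
        a.other, b.other = b, a
        ref = weakref.ref(a)
        del a, b
        alive_before = ref() is not None
        gc.collect()
        alive_after = ref() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()
```

If the rows in a chunk do not take part in reference cycles, the explicit call should free nothing extra and only adds a collection pass per chunk.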

@belyak belyak commented Mar 27, 2017

Your code does not work correctly with an empty queryset.
The version below does:

import gc

from django.core.exceptions import ObjectDoesNotExist


def queryset_iterator(queryset, chunk_size=1000):
    """
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunk_size (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator does not support ordered query sets.
    """
    try:
        last_pk = queryset.order_by('-pk')[:1].get().pk
    except ObjectDoesNotExist:
        return

    pk = 0
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunk_size]:
            pk = row.pk
            yield row
        gc.collect()
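
The chunking technique used in both versions above (fetch rows with a key greater than the last one seen, ordered by that key, limited to the chunk size) can be tried without a Django project. The following is a minimal sketch using the standard-library `sqlite3` module. The `item` table, `iter_rows_in_chunks`, and `build_demo_db` are hypothetical names chosen for illustration. Unlike the versions above, it stops when a query returns no rows, so it needs no separate "last pk" query and does not assume that keys start above 0.

```python
import sqlite3


def iter_rows_in_chunks(conn, chunk_size=1000):
    """Yield (id, name) rows from the hypothetical `item` table in id
    order, fetching at most chunk_size rows per query."""
    last_id = None
    while True:
        if last_id is None:
            # First chunk: no lower bound, so negative ids are included.
            cursor = conn.execute(
                "SELECT id, name FROM item ORDER BY id LIMIT ?",
                (chunk_size,),
            )
        else:
            # Later chunks: continue strictly after the last id seen.
            cursor = conn.execute(
                "SELECT id, name FROM item WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            )
        rows = cursor.fetchall()
        if not rows:
            # An empty chunk means the table is exhausted (or was empty).
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]


def build_demo_db(ids):
    """Create an in-memory database holding one row per id in `ids`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO item (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in ids],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db([-2, -1, 0, 1, 2, 3, 4])
    for row in iter_rows_in_chunks(demo, chunk_size=3):
        print(row)
```

In Django terms, each query in the loop corresponds to `queryset.filter(pk__gt=pk)[:chunk_size]` on a queryset ordered by `pk`.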

@FadiBakoura FadiBakoura commented Feb 26, 2020

The approach in this link is more efficient and stable.
