Last active December 1, 2023 06:42
Django - remove duplicate objects where there is more than one field to compare
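from django.db.models import Count, Max
# MyModel is a placeholder for your own model class.
unique_fields = ['field_1', 'field_2']
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()  # clear default ordering so it cannot affect the grouping
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)
for duplicate in duplicates:
    # Delete every row in the group except the one with the highest id.
    MyModel.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()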

Thanks ! Very helpful snippet

jpmcpe commented Sep 26, 2018


stasius12 commented Nov 28, 2018

Why does this work? I checked, and it does, but why? You are counting ids? I thought each object has only one 'id'.
Could you please explain it to me?

Nice and elegant. Thank you!

@stasius12 we're filtering a queryset which (potentially) contains multiple objects. Since ids are unique per object, a count of ids gives you the number of objects matching the query.

edkohler commented Sep 4, 2019

Thanks for posting this. It helped me clean up a nagging duplicate issue. I made one change for my use: I switched to Min rather than Max, so the row excluded from deletion is the one with the lowest id. This was a better fit for me since I may edit some corresponding fields before a dupe is added to the database, so I'd rather stick with the original. If anyone else is interested in this approach, swapping out Max for Min and max_id for min_id does the trick.
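
To make the effect of the Min/Max choice concrete, here is a minimal plain-Python sketch that needs no Django. It only mimics the selection logic on a list of dicts. The helper name ids_to_delete and the sample rows are hypothetical, and "newest" and "oldest" assume ids are assigned in increasing order.

```python
from collections import defaultdict

def ids_to_delete(rows, unique_fields, keep=max):
    """Return the ids of duplicate rows, sparing one id per group.

    keep=max spares the highest id (the Max variant);
    keep=min spares the lowest id (the Min variant).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in unique_fields)
        groups[key].append(row["id"])

    doomed = []
    for ids in groups.values():
        if len(ids) > 1:  # only groups that actually contain duplicates
            survivor = keep(ids)
            doomed.extend(i for i in ids if i != survivor)
    return sorted(doomed)

rows = [
    {"id": 1, "field_1": "a", "field_2": "x"},
    {"id": 2, "field_1": "a", "field_2": "x"},
    {"id": 3, "field_1": "b", "field_2": "y"},
    {"id": 4, "field_1": "a", "field_2": "x"},
]
print(ids_to_delete(rows, ["field_1", "field_2"], keep=max))  # [1, 2]
print(ids_to_delete(rows, ["field_1", "field_2"], keep=min))  # [2, 4]
```

With keep=max the row with id 4 survives; with keep=min the original row with id 1 survives.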

patrik7 commented Apr 23, 2020

Amazing script, thanks a lot!

@stasius12 this works because adding annotate over values does a group_by, check the mentioned link!
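
For anyone who wants to see the grouping without a Django project, here is a small sketch using Python's built-in sqlite3 module. The SQL is only an approximation of what values() followed by annotate() and filter() produces; the exact statement Django emits may differ. The table name mymodel and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mymodel (id INTEGER PRIMARY KEY, field_1 TEXT, field_2 TEXT)"
)
conn.executemany(
    "INSERT INTO mymodel (field_1, field_2) VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x")],
)

# Approximate SQL shape of:
#   MyModel.objects.values('field_1', 'field_2').order_by()
#       .annotate(max_id=Max('id'), count_id=Count('id'))
#       .filter(count_id__gt=1)
groups = conn.execute(
    """
    SELECT field_1, field_2, MAX(id) AS max_id, COUNT(id) AS count_id
    FROM mymodel
    GROUP BY field_1, field_2
    HAVING COUNT(id) > 1
    """
).fetchall()
print(groups)  # [('a', 'x', 4, 3)]
```

Each result row describes one group of duplicates, so COUNT(id) is the number of rows in that group, not a count within a single object.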

Super. Thanks a lot!

Thanks 👍

@victorono Neat. Got a challenge though. Perhaps an eye on this could help clarify?

>>> duplicate_books = Book.objects.values('title').annotate(title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicates:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()

However, I get:

    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)

Helpful code snippet! 👍

vhtkrk commented Jul 28, 2022

When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated: you get table_size/2 items in the duplicates QuerySet after the first query and then need to run a separate delete for each of them, one by one.

It gave a really good starting point though. This is what I ended up with; it does everything in one query. Running time went from hours to minutes on a large table.

from django.db import connection

def remove_duplicates(model, unique_fields):
    # Note: UNNEST, ARRAY_REMOVE and ARRAY_AGG are PostgreSQL functions.
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
    DELETE FROM {model._meta.db_table}
    WHERE id IN (
        SELECT
            UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
        FROM (
            SELECT
                {fields},
                MAX(t.id) AS max_id,
                ARRAY_AGG(t.id) AS dupe_ids
            FROM
                {model._meta.db_table} t
            GROUP BY
                {fields}
            HAVING
                COUNT(t.id) > 1
        ) a
    )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)

remove_duplicates(MyModel, ['field_1', 'field_2'])
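
The array functions in that query are PostgreSQL-specific. As a rough illustration of the same "one statement" idea on another backend, here is a sketch against SQLite using Python's built-in sqlite3 module. It uses a different and simpler formulation (NOT IN over a GROUP BY subquery), not the query above, and makes no claim about performance on large tables. The function name remove_duplicates_sqlite, the table name and the sample rows are hypothetical.

```python
import sqlite3

def remove_duplicates_sqlite(conn, table, unique_fields):
    """Delete all but the highest-id row of each group in one statement.

    table and unique_fields are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    cols = ", ".join(f'"{f}"' for f in unique_fields)
    sql = f'''
        DELETE FROM "{table}"
        WHERE id NOT IN (
            SELECT MAX(id) FROM "{table}" GROUP BY {cols}
        )
    '''
    cur = conn.execute(sql)
    return cur.rowcount  # number of rows deleted

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mymodel (id INTEGER PRIMARY KEY, field_1 TEXT, field_2 TEXT)"
)
conn.executemany(
    "INSERT INTO mymodel (field_1, field_2) VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x")],
)

deleted = remove_duplicates_sqlite(conn, "mymodel", ["field_1", "field_2"])
remaining = conn.execute(
    "SELECT id, field_1, field_2 FROM mymodel ORDER BY id"
).fetchall()
print(deleted, remaining)  # 2 [(3, 'b', 'y'), (4, 'a', 'x')]
```

Rows that have no duplicate are their own group maximum, so they are left alone.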

Thanks a lot! Excellent approach to improving execution performance.

Great starting point. I tried to reduce the number of queries without using raw SQL.

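from django.db.models import Count, Min, Q

def remove_duplicates_from_table(model, lookup_fields):
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )

    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    min_ids_list = duplicates.values_list('min_id', flat=True)

    if fields_lookup:
        # One DELETE for all duplicate groups, sparing the lowest id in each.
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()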

I ended up using a Q object to avoid making a SELECT query on each iteration while looping over the duplicates list.

Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
