
@victorono
Last active April 26, 2024 17:57
Django - remove duplicate objects where there is more than one field to compare
from django.db.models import Count, Max

unique_fields = ['field_1', 'field_2']

duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['max_id'])
        .delete()
    )
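For readers following along, the keep-the-newest logic of the two querysets above can be sketched in plain Python; the rows, ids, and field values below are invented for illustration:

```python
from collections import defaultdict

def ids_to_delete(rows, unique_fields):
    """Group rows by the unique fields; collect every id except the max per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in unique_fields)
        groups[key].append(row['id'])
    doomed = []
    for ids in groups.values():
        if len(ids) > 1:               # mirrors .filter(count_id__gt=1)
            keep = max(ids)            # mirrors max_id=Max('id')
            doomed.extend(i for i in ids if i != keep)
    return sorted(doomed)

rows = [
    {'id': 1, 'field_1': 'a', 'field_2': 'x'},
    {'id': 2, 'field_1': 'a', 'field_2': 'x'},   # duplicate of id=1
    {'id': 3, 'field_1': 'b', 'field_2': 'y'},
]
print(ids_to_delete(rows, ['field_1', 'field_2']))  # -> [1]
```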
@inmate37

Thanks! Very helpful snippet.

@jpmcpe

jpmcpe commented Sep 26, 2018

genius!

@stasius12

stasius12 commented Nov 28, 2018

Why does this work? I checked, and it does, but why? You are counting ids? I thought each object has only one 'id'.
Could you please explain it to me?

@pdvorchik

Nice and elegant. Thank you!

@stasius12 we're filtering a queryset which (potentially) contains multiple objects. Since ids are unique per object a count of ids will give you the number of objects matching the query.

@edkohler

edkohler commented Sep 4, 2019

Thanks for posting this. It helped me clean up a nagging duplicate issue. I made one change for my use: I switched to Min rather than Max for the dupe to exclude from deletion. This was a better fit for me since I may edit some corresponding fields before a dupe is added to the database, so I'd rather keep the original. If anyone else is interested in this approach, swapping out Max for Min and max_id for min_id does the trick.
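The Max-vs-Min choice described here is just a question of which end of each duplicate group's id range survives; a quick plain-Python illustration (the ids below are made up):

```python
dupe_ids = [4, 9, 17]   # ids sharing the same unique_fields values

# Original gist: keep the newest row (Max), delete the rest.
keep_newest = max(dupe_ids)
delete_for_max = [i for i in dupe_ids if i != keep_newest]   # [4, 9]

# The Min variant: keep the original (oldest) row, delete the rest.
keep_oldest = min(dupe_ids)
delete_for_min = [i for i in dupe_ids if i != keep_oldest]   # [9, 17]

print(delete_for_max, delete_for_min)
```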

@patrik7

patrik7 commented Apr 23, 2020

Amazing script, thanks a lot!

@KarthikNayak

https://docs.djangoproject.com/en/3.0/topics/db/aggregation/#order-of-annotate-and-values-clauses

@stasius12 this works because adding annotate over values does a group_by, check the mentioned link!
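The GROUP BY that values() followed by annotate() expresses can be reproduced directly with sqlite3 to see why count_id counts rows per group. The table and data below are a stand-in for MyModel, not Django's actual generated SQL:

```python
import sqlite3

# A throwaway in-memory table standing in for MyModel (names are illustrative).
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE mymodel (id INTEGER PRIMARY KEY, field_1 TEXT, field_2 TEXT)')
con.executemany('INSERT INTO mymodel VALUES (?, ?, ?)', [
    (1, 'a', 'x'), (2, 'a', 'x'), (3, 'b', 'y'),
])

# Roughly what .values(*unique_fields).annotate(max_id=Max('id'),
# count_id=Count('id')).filter(count_id__gt=1) asks the database to do:
rows = con.execute('''
    SELECT field_1, field_2, MAX(id) AS max_id, COUNT(id) AS count_id
    FROM mymodel
    GROUP BY field_1, field_2
    HAVING COUNT(id) > 1
''').fetchall()
print(rows)  # -> [('a', 'x', 2, 2)]
```

Each result row is one duplicate *group*, so count_id is the number of objects sharing those field values, not a property of a single id.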

@sandeep7410

👍

@cliftonavil

Super, thanks a lot!

@Thutmose3

Amazing!!

@gokhanyildiz9535

Thanks 👍

@MarvinKweyu

@victorono Neat. Got a challenge though. Perhaps an eye on this could help clarify?

>>> duplicate_books = Book.objects.values('title').annotate(title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicates:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()

However, I get:

    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)

@mallamsripooja

Helpful code snippet! 👍

@vhtkrk

vhtkrk commented Jul 28, 2022

When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated, so that you have table_size/2 items in the duplicates QuerySet after the first query and then need to do the delete for each of those one by one.

It gave a really good starting point though. This is what I ended up with, does it all in one query. Running time went from hours to minutes on a large table.

from django.db import connection

def remove_duplicates(model, unique_fields):
    """Delete all but the highest-id row per duplicate group in one query.

    Note: UNNEST, ARRAY_REMOVE, and ARRAY_AGG are PostgreSQL-specific.
    """
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    sql = f"""
    DELETE FROM {model._meta.db_table}
    WHERE id IN (
        SELECT
            UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
        FROM (
            SELECT
                {fields},
                MAX(t.id) AS max_id,
                ARRAY_AGG(t.id) AS dupe_ids
            FROM
                {model._meta.db_table} t
            GROUP BY
                {fields}
            HAVING
                COUNT(t.id) > 1
        ) a
    )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)


remove_duplicates(MyModel, ['field_1', 'field_2'])

@victorono
Author

Thanks a lot! An excellent optimization of the execution performance.

@ahmed-zubair-1998

Great starting point. Tried to reduce the number of queries without using raw SQL.

from django.db.models import Count, Min, Q

def remove_duplicates_from_table(model, lookup_fields):
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )

    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    min_ids_list = duplicates.values_list('min_id', flat=True)

    if fields_lookup:
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()

Ended up using Q object to avoid making select query in each iteration while looping over the duplicates list.
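A plain-Python analogue of that OR-ed Q lookup, with invented sample data, shows the predicate the single query evaluates per row:

```python
def matches_any(row, lookups):
    """True if row matches any of the field-value dicts (like Q(**a) | Q(**b) | ...)."""
    return any(all(row[k] == v for k, v in lu.items()) for lu in lookups)

# One dict per duplicate group, as produced by duplicates.values(*lookup_fields).
lookups = [{'field_1': 'a', 'field_2': 'x'}, {'field_1': 'b', 'field_2': 'y'}]

row = {'id': 7, 'field_1': 'b', 'field_2': 'y'}
print(matches_any(row, lookups))  # -> True
```

Rows passing this predicate, minus the ids in min_ids_list, are what the single delete() removes.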

@DArmstrong87

Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
