@victorono
Last active April 26, 2024 17:57
Django - remove duplicate objects where there is more than one field to compare
from django.db.models import Count, Max

unique_fields = ['field_1', 'field_2']

# Group rows by the unique fields; each result row is one group of
# duplicates, carrying the highest id in the group and the group size.
duplicates = (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(max_id=Max('id'), count_id=Count('id'))
    .filter(count_id__gt=1)
)

# For each group, keep the row with the highest id and delete the rest.
for duplicate in duplicates:
    (
        MyModel.objects
        .filter(**{x: duplicate[x] for x in unique_fields})
        .exclude(id=duplicate['max_id'])
        .delete()
    )
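
A quick sanity check (a sketch reusing the queryset above): after the loop, re-running the same grouped query should find no group with more than one row.

# Re-evaluate the aggregation; no duplicate groups should remain.
assert not (
    MyModel.objects.values(*unique_fields)
    .order_by()
    .annotate(count_id=Count('id'))
    .filter(count_id__gt=1)
    .exists()
)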
@patrik7

patrik7 commented Apr 23, 2020

Amazing script, thanks a lot!

@KarthikNayak

https://docs.djangoproject.com/en/3.0/topics/db/aggregation/#order-of-annotate-and-values-clauses

@stasius12 this works because adding annotate() after values() results in a GROUP BY on the values() fields; check the link above!
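
To see the GROUP BY for yourself (a minimal sketch, assuming the MyModel and fields from the gist above), print the compiled SQL of the queryset:

from django.db.models import Count

qs = (
    MyModel.objects.values('field_1', 'field_2')
    .order_by()
    .annotate(count_id=Count('id'))
)
# The compiled statement ends with GROUP BY "field_1", "field_2",
# which is why each result row represents one group of duplicates.
print(qs.query)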

@sandeep7410

👍

@cliftonavil

Super, thanks a lot!

@Thutmose3

Amazing!!

@gokhanyildiz9535

Thanks 👍

@MarvinKweyu

@victorono Neat. I've hit a snag though; perhaps an extra pair of eyes could help clarify?

>>> unique_fields = ['title']
>>> duplicate_books = Book.objects.values(*unique_fields).annotate(max_id=Max('id'), title_count=Count('title')).filter(title_count__gt=1)
>>> for duplicate in duplicate_books:
...     Book.objects.filter(**{x: duplicate[x] for x in unique_fields}).exclude(id=duplicate['max_id']).delete()

However, I get:

    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f32e8c94a30>: Failed to establish a new connection: [Errno 111] Connection refused)

@mallamsripooja
Copy link

Helpful code snippet! 👍

@vhtkrk
Copy link

vhtkrk commented Jul 28, 2022

When you have lots of duplicates this becomes very slow. Imagine a table where every row is duplicated, so that you have table_size/2 items in the duplicates QuerySet after the first query and then need to do the delete for each of those one by one.

It was a really good starting point though. Here is what I ended up with; it does everything in one query. Running time went from hours to minutes on a large table.

from django.db import connection

def remove_duplicates(model, unique_fields):
    # Note: ARRAY_AGG, ARRAY_REMOVE and UNNEST are PostgreSQL-specific.
    fields = ', '.join(f't."{f}"' for f in unique_fields)
    # For each group of duplicates, keep the row with the highest id
    # and delete the rest, all in a single statement.
    sql = f"""
    DELETE FROM {model._meta.db_table}
    WHERE id IN (
        SELECT
            UNNEST(ARRAY_REMOVE(dupe_ids, max_id))
        FROM (
            SELECT
                {fields},
                MAX(t.id) AS max_id,
                ARRAY_AGG(t.id) AS dupe_ids
            FROM
                {model._meta.db_table} t
            GROUP BY
                {fields}
            HAVING
                COUNT(t.id) > 1
        ) a
    )
    """
    with connection.cursor() as cursor:
        cursor.execute(sql)


remove_duplicates(MyModel, ['field_1', 'field_2'])

@victorono
Author

Thanks a lot! Excellent optimization to improve execution performance.

@ahmed-zubair-1998

Great starting point. Tried to reduce the number of queries without using raw SQL.

from django.db.models import Count, Min, Q

def remove_duplicates_from_table(model, lookup_fields):
    # One grouped query finds the duplicate groups and the lowest id in each.
    duplicates = (
        model.objects.values(*lookup_fields)
        .order_by()
        .annotate(min_id=Min('id'), count_id=Count('id'))
        .filter(count_id__gt=1)
    )

    # OR together one Q per duplicate group instead of querying per group.
    fields_lookup = Q()
    duplicate_fields_values = duplicates.values(*lookup_fields)
    for val in duplicate_fields_values:
        fields_lookup |= Q(**val)
    min_ids_list = duplicates.values_list('min_id', flat=True)

    if fields_lookup:
        model.objects.filter(fields_lookup).exclude(id__in=min_ids_list).delete()

I ended up using a Q object to avoid making a SELECT query on each iteration while looping over the duplicates list.
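
For reference, a hypothetical call site (the Book model and field names are illustrative, not from this thread):

# Keep the row with the lowest id in each (title, author) group;
# everything else goes in one final delete call.
remove_duplicates_from_table(Book, ['title', 'author'])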

@DArmstrong87

Thank you so much! This is EXACTLY what I needed to fix an issue with one of my migrations taking forever.
