Надо было удалить дубликаты и сделать уникальность по двум полям. Коллеги написали миграцию, запстили её на stagings - всё отработало, запустили на production - работает, час работает, два работает....
Пошли посмотрели что происходит:
wgleaguenet=> SELECT * from pg_stat_activity ;
А там запросики вида
SELECT * FROM "player_application"
WHERE ("player_application"."realm" = 'RU'
AND "player_application"."estb_id" = NNNNNN )
ORDER BY "player_application"."id" DESC OFFSET 1
Код миграции, на удаление...
for realm in realm_list:
bad_applications = (orm['player.Application']
.objects.filter(realm=realm).values('estb_id')
.annotate(Count('estb_id')).order_by()
.filter(estb_id__count__gt=1))
for app in bad_applications:
app_estb_id = app['estb_id']
duplicated_applications = (
orm['player.Application'].objects
.filter(realm=realm, estb_id=app_estb_id)
.order_by('-id'))
remove_app_ids = [i.id for i in duplicated_applications[1:]]
(orm['player.Application']
.objects.filter(realm=realm, id__in=remove_app_ids)
.delete()
)
Кому, что не нравится? Циклы - 4 региона по 300 000 записей, 1 200 000 запросов- вуххху! В идеале сделать перебор по регионам и всё.
Вот этот запрос у нас выпоняется чаще всего
duplicated_applications = (orm['player.Application'].objects
.filter(realm=realm, estb_id=app_estb_id)
.order_by('-id'))
remove_app_ids = [i.id for i in duplicated_applications[1:]]
Явно, что надо убрать перебор по estb_id-ам. Но нам всеравно нужно получить список дубликатов.
Для того чтобы сделать вот это
orm['player.Application'].
objects.filter(realm=realm, id__in=remove_app_ids).
delete()
Надо разабраться, что происходит в выборке remove_app_ids.
- мы выбираем все дубли для "realm, estb_id", которые имеют дубли
- оставляем только самую последюю запись
А зачем мы выбираем дубли дважды? Что если попробовать выбрать дубли сразу!
SELECT realm, estb_id, COUNT(estb_id)
FROM player_application
GROUP BY realm, estb_id HAVING COUNT(estb_id) > 1;
А может сразу можем и "посчитать" последнюю запись? Чтобы удалить все дубликаты, кроме последних. HAVING сразу убираем, нам он больше ненужен, т.к. могут быть записи без дубликатов, они просто будут единичными и мы их не будем удалять.
SELECT realm, estb_id, MAX(id)
FROM player_application
GROUP BY realm, estb_id;
Хочется чтобы выглядело так в конечном варианте.
DELET FROM player_application
WHERE realm = %s
AND id NOT IN (SELECT realm, estb_id, MAX(id)
FROM player_application GROUP BY realm, estb_id);
Давайте посмотрим во что нам это стенет.
EXPLAIN ANALYZE
SELECT realm, estb_id, MAX(id)
FROM player_application
WHERE realm='RU'
GROUP BY realm, estb_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=112446.45..124472.50 rows=498044 width=15) (actual time=7663.251..12321.322 rows=658481 loops=1)
-> Sort (cost=112446.45..114207.85 rows=704561 width=15) (actual time=7663.103..9766.382 rows=702915 loops=1)
Sort Key: realm, estb_id
Sort Method: external merge Disk: 20616kB
-> Seq Scan on player_application (cost=0.00..19924.15 rows=704561 width=15) (actual time=0.012..1474.219 rows=702915 loops=1)
Filter: ((realm)::text = 'RU'::text)
Rows Removed by Filter: 109977
Total runtime: 13486.108 ms
Самое плохое Sort, Sort Key, Seq Scan on player_application. GROUP BY по полям без индексов - это плохо, надо добавить.
CREATE INDEX player_application_realm_estb_id__index ON player_application (realm, estb_id);
Пересчитываем.
EXPLAIN ANALYZE
SELECT realm, estb_id, MAX(id)
FROM player_application
WHERE realm='RU'
GROUP BY realm, estb_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.102..4253.202 rows=658481 loops=1)
-> Index Scan using player_application_realm_estb_id_compound on player_application (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.088..1639.585 rows=702915 loops=1)
Index Cond: ((realm)::text = 'RU'::text)
Total runtime: 5421.231 ms
Сделано!
На самом деле 13 секунд или 5 в рамках одноразовой операции не критично, и это делать необязательно, но если у вас данных еще больше, тогда оптимизируем всё!
Давайте же удалим дубликаты! Запускаем, и ... думает, пошли на обед... думает. От блин!
DELETE FROM player_application
WHERE realm='RU'
AND id NOT IN (SELECT T.id
FROM (SELECT realm, estb_id, MAX(id) as id
FROM player_application
WHERE realm='RU'
GROUP BY realm, estb_id) as T);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on player_application (cost=0.00..26202445903.48 rows=352280 width=6)
-> Index Scan using player_application_realm on player_application (cost=0.00..26202445903.48 rows=352280 width=6)
Index Cond: ((realm)::text = 'RU'::text)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..73134.26 rows=498044 width=4)
-> Subquery Scan on t (cost=0.00..68698.04 rows=498044 width=4)
-> GroupAggregate (cost=0.00..63717.60 rows=498044 width=15)
-> Index Scan using player_application_realm_estb_id_compound on player_application (cost=0.00..53452.95 rows=704561 width=15)
Index Cond: ((realm)::text = 'RU'::text)
(10 rows)
ОХ ТЫЖ БЛИН!
SELECT id
FROM player_application
WHERE realm='RU'
AND id NOT IN (SELECT T.id
FROM (SELECT realm, estb_id, MAX(id) as id
FROM player_application where realm='RU'
GROUP BY realm, estb_id) as T);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using player_application_realm on player_application (cost=0.00..26202445903.48 rows=352280 width=4)
Index Cond: ((realm)::text = 'RU'::text)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..73134.26 rows=498044 width=4)
-> Subquery Scan on t (cost=0.00..68698.04 rows=498044 width=4)
-> GroupAggregate (cost=0.00..63717.60 rows=498044 width=15)
-> Index Scan using player_application_realm_estb_id_compound on player_application (cost=0.00..53452.95 rows=704561 width=15)
Index Cond: ((realm)::text = 'RU'::text)
(9 rows)
СВЯТАЯ КОРОВА!
Надо оптимизировать! У нас ну дохрена записей получается и поиск через NOT IN - очень очень плох - Нужен JOIN (=
SELECT player_application.id
FROM player_application
FULL OUTER JOIN (SELECT realm, estb_id, MAX(id) as id
FROM player_application
WHERE realm='RU'
GROUP BY realm, estb_id) as T ON t.id = player_application.id
WHERE player_application.realm = 'RU'
AND t.id IS NULL;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Anti Join (cost=129449.73..176377.27 rows=352280 width=8) (actual time=10615.369..21029.185 rows=44434 loops=1)
Merge Cond: (public.player_application.id = t.id)
-> Index Scan using player_application_pkey on player_application (cost=0.00..37114.10 rows=704561 width=4) (actual time=0.101..2177.743 rows=702915 loops=1)
Filter: ((realm)::text = 'RU'::text)
Rows Removed by Filter: 109977
-> Materialize (cost=129449.73..131939.95 rows=498044 width=4) (actual time=10569.717..15430.285 rows=658481 loops=1)
-> Sort (cost=129449.73..130694.84 rows=498044 width=4) (actual time=10569.711..12331.187 rows=658481 loops=1)
Sort Key: t.id
Sort Method: external merge Disk: 9024kB
-> Subquery Scan on t (cost=0.00..68698.04 rows=498044 width=4) (actual time=0.064..8563.492 rows=658481 loops=1)
-> GroupAggregate (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.058..5508.390 rows=658481 loops=1)
-> Index Scan using player_application_realm_estb_id_compound on player_application (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.043..2138.019 rows=702915 loops=1)
Index Cond: ((realm)::text = 'RU'::text)
Total runtime: 21131.383 ms
Уже лучше.
WITH appliations_to_delete as (
SELECT player_application.id as app_id
FROM player_application
FULL OUTER JOIN (SELECT realm, estb_id, MAX(id) as id
FROM player_application
WHERE realm='RU'
GROUP BY realm, estb_id) as T ON t.id = player_application.id
WHERE player_application.realm = 'RU'
AND t.id IS NULL
)
DELETE FROM player_application
USING appliations_to_delete
WHERE appliations_to_delete.app_id = player_application.id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on player_application (cost=208400.34..229958.49 rows=352280 width=34) (actual time=19732.897..19732.897 rows=0 loops=1)
CTE appliations_to_delete
-> Merge Anti Join (cost=129449.73..176377.27 rows=352280 width=4) (actual time=8166.996..15969.613 rows=44434 loops=1)
Merge Cond: (public.player_application.id = t.id)
-> Index Scan using player_application_pkey on player_application (cost=0.00..37114.10 rows=704561 width=4) (actual time=0.066..1627.652 rows=702915 loops=1)
Filter: ((realm)::text = 'RU'::text)
Rows Removed by Filter: 109977
-> Materialize (cost=129449.73..131939.95 rows=498044 width=4) (actual time=8132.022..11773.520 rows=658481 loops=1)
-> Sort (cost=129449.73..130694.84 rows=498044 width=4) (actual time=8131.979..9445.551 rows=658481 loops=1)
Sort Key: t.id
Sort Method: external merge Disk: 9024kB
-> Subquery Scan on t (cost=0.00..68698.04 rows=498044 width=4) (actual time=0.040..6576.660 rows=658481 loops=1)
-> GroupAggregate (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.036..4249.104 rows=658481 loops=1)
-> Index Scan using player_application_realm_estb_id_compound on player_application (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.027..1643.514 rows=702915 loops=1)
Index Cond: ((realm)::text = 'RU'::text)
-> Hash Join (cost=32023.07..53581.22 rows=352280 width=34) (actual time=11349.150..19480.719 rows=44434 loops=1)
Hash Cond: (appliations_to_delete.id = public.player_application.id)
-> CTE Scan on appliations_to_delete (cost=0.00..7045.60 rows=352280 width=32) (actual time=8167.010..16138.956 rows=44434 loops=1)
-> Hash (cost=17891.92..17891.92 rows=812892 width=10) (actual time=3082.443..3082.443 rows=812892 loops=1)
Buckets: 4096 Batches: 64 Memory Usage: 560kB
-> Seq Scan on player_application (cost=0.00..17891.92 rows=812892 width=10) (actual time=0.005..1520.998 rows=812892 loops=1)
Total runtime: 19735.940 ms
С 3 часов, до 20 секунд... Неплохо.