Skip to content

Instantly share code, notes, and snippets.

@srusskih
Last active August 29, 2015 14:25
Show Gist options
  • Save srusskih/ef441b4b58a59ced4435 to your computer and use it in GitHub Desktop.
Save srusskih/ef441b4b58a59ced4435 to your computer and use it in GitHub Desktop.

Надо было удалить дубликаты и сделать уникальность по двум полям. Коллеги написали миграцию, запстили её на stagings - всё отработало, запустили на production - работает, час работает, два работает....

Пошли посмотрели что происходит:

wgleaguenet=> SELECT * from pg_stat_activity ;

А там запросики вида

SELECT * FROM "player_application" 
WHERE ("player_application"."realm" = 'RU'  
       AND "player_application"."estb_id" = NNNNNN )
ORDER BY "player_application"."id" DESC OFFSET 1

Код миграции, на удаление...

for realm in realm_list:
    bad_applications = (orm['player.Application']
        .objects.filter(realm=realm).values('estb_id')
        .annotate(Count('estb_id')).order_by()
        .filter(estb_id__count__gt=1))

    for app in bad_applications:
        app_estb_id = app['estb_id']

        duplicated_applications = (
            orm['player.Application'].objects
			.filter(realm=realm, estb_id=app_estb_id)
			.order_by('-id'))

        remove_app_ids = [i.id for i in duplicated_applications[1:]]
        (orm['player.Application']
            .objects.filter(realm=realm, id__in=remove_app_ids)
            .delete()
        )

Кому, что не нравится? Циклы - 4 региона по 300 000 записей, 1 200 000 запросов- вуххху! В идеале сделать перебор по регионам и всё.

Вот этот запрос у нас выпоняется чаще всего

duplicated_applications = (orm['player.Application'].objects
							.filter(realm=realm, estb_id=app_estb_id)
							.order_by('-id'))

remove_app_ids = [i.id for i in duplicated_applications[1:]]

Явно, что надо убрать перебор по estb_id-ам. Но нам всеравно нужно получить список дубликатов.

Для того чтобы сделать вот это

orm['player.Application'].
objects.filter(realm=realm, id__in=remove_app_ids).
delete()

Надо разабраться, что происходит в выборке remove_app_ids.

  1. мы выбираем все дубли для "realm, estb_id", которые имеют дубли
  2. оставляем только самую последюю запись

А зачем мы выбираем дубли дважды? Что если попробовать выбрать дубли сразу!

SELECT realm, estb_id, COUNT(estb_id) 
FROM player_application 
GROUP BY realm, estb_id HAVING COUNT(estb_id) > 1;

А может сразу можем и "посчитать" последнюю запись? Чтобы удалить все дубликаты, кроме последних. HAVING сразу убираем, нам он больше ненужен, т.к. могут быть записи без дубликатов, они просто будут единичными и мы их не будем удалять.

SELECT realm, estb_id, MAX(id) 
FROM player_application 
GROUP BY realm, estb_id;

Хочется чтобы выглядело так в конечном варианте.

DELET FROM player_application 
WHERE realm = %s 
AND id NOT IN (SELECT realm, estb_id, MAX(id) 
FROM player_application GROUP BY realm, estb_id);

Давайте посмотрим во что нам это стенет.

EXPLAIN ANALYZE 
SELECT realm, estb_id, MAX(id) 
FROM player_application 
WHERE realm='RU' 
GROUP BY realm, estb_id;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=112446.45..124472.50 rows=498044 width=15) (actual time=7663.251..12321.322 rows=658481 loops=1)
   ->  Sort  (cost=112446.45..114207.85 rows=704561 width=15) (actual time=7663.103..9766.382 rows=702915 loops=1)
         Sort Key: realm, estb_id
         Sort Method: external merge  Disk: 20616kB
         ->  Seq Scan on player_application  (cost=0.00..19924.15 rows=704561 width=15) (actual time=0.012..1474.219 rows=702915 loops=1)
               Filter: ((realm)::text = 'RU'::text)
               Rows Removed by Filter: 109977
 Total runtime: 13486.108 ms

Самое плохое Sort, Sort Key, Seq Scan on player_application. GROUP BY по полям без индексов - это плохо, надо добавить.

CREATE INDEX player_application_realm_estb_id__index ON player_application (realm, estb_id);

Пересчитываем.

EXPLAIN ANALYZE 
SELECT realm, estb_id, MAX(id) 
FROM player_application
WHERE realm='RU' 
GROUP BY realm, estb_id;
                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.102..4253.202 rows=658481 loops=1)
   ->  Index Scan using player_application_realm_estb_id_compound on player_application  (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.088..1639.585 rows=702915 loops=1)
         Index Cond: ((realm)::text = 'RU'::text)
 Total runtime: 5421.231 ms

Сделано!

На самом деле 13 секунд или 5 в рамках одноразовой операции не критично, и это делать необязательно, но если у вас данных еще больше, тогда оптимизируем всё!

Давайте же удалим дубликаты! Запускаем, и ... думает, пошли на обед... думает. От блин!

DELETE FROM player_application 
WHERE realm='RU' 
   AND id NOT IN (SELECT T.id
				  FROM (SELECT realm, estb_id, MAX(id) as id 
						FROM player_application 
						WHERE realm='RU' 
						GROUP BY realm, estb_id) as T);
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on player_application  (cost=0.00..26202445903.48 rows=352280 width=6)
   ->  Index Scan using player_application_realm on player_application  (cost=0.00..26202445903.48 rows=352280 width=6)
         Index Cond: ((realm)::text = 'RU'::text)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..73134.26 rows=498044 width=4)
                 ->  Subquery Scan on t  (cost=0.00..68698.04 rows=498044 width=4)
                       ->  GroupAggregate  (cost=0.00..63717.60 rows=498044 width=15)
                             ->  Index Scan using player_application_realm_estb_id_compound on player_application  (cost=0.00..53452.95 rows=704561 width=15)
                                   Index Cond: ((realm)::text = 'RU'::text)
(10 rows)

ОХ ТЫЖ БЛИН!

SELECT id 
FROM player_application 
WHERE realm='RU' 
AND id NOT IN (SELECT T.id 
			   FROM (SELECT realm, estb_id, MAX(id) as id 
					 FROM player_application where realm='RU' 
					 GROUP BY realm, estb_id) as T);
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using player_application_realm on player_application  (cost=0.00..26202445903.48 rows=352280 width=4)
   Index Cond: ((realm)::text = 'RU'::text)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..73134.26 rows=498044 width=4)
           ->  Subquery Scan on t  (cost=0.00..68698.04 rows=498044 width=4)
                 ->  GroupAggregate  (cost=0.00..63717.60 rows=498044 width=15)
                       ->  Index Scan using player_application_realm_estb_id_compound on player_application  (cost=0.00..53452.95 rows=704561 width=15)
                             Index Cond: ((realm)::text = 'RU'::text)
(9 rows)

СВЯТАЯ КОРОВА!

Надо оптимизировать! У нас ну дохрена записей получается и поиск через NOT IN - очень очень плох - Нужен JOIN (=

SELECT player_application.id
FROM player_application 
FULL OUTER JOIN (SELECT realm, estb_id, MAX(id) as id 
                 FROM player_application 
                 WHERE realm='RU'
                 GROUP BY realm, estb_id) as T ON t.id = player_application.id 
WHERE player_application.realm = 'RU' 
  AND t.id IS NULL;

                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=129449.73..176377.27 rows=352280 width=8) (actual time=10615.369..21029.185 rows=44434 loops=1)
   Merge Cond: (public.player_application.id = t.id)
   ->  Index Scan using player_application_pkey on player_application  (cost=0.00..37114.10 rows=704561 width=4) (actual time=0.101..2177.743 rows=702915 loops=1)
         Filter: ((realm)::text = 'RU'::text)
         Rows Removed by Filter: 109977
   ->  Materialize  (cost=129449.73..131939.95 rows=498044 width=4) (actual time=10569.717..15430.285 rows=658481 loops=1)
         ->  Sort  (cost=129449.73..130694.84 rows=498044 width=4) (actual time=10569.711..12331.187 rows=658481 loops=1)
               Sort Key: t.id
               Sort Method: external merge  Disk: 9024kB
               ->  Subquery Scan on t  (cost=0.00..68698.04 rows=498044 width=4) (actual time=0.064..8563.492 rows=658481 loops=1)
                     ->  GroupAggregate  (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.058..5508.390 rows=658481 loops=1)
                           ->  Index Scan using player_application_realm_estb_id_compound on player_application  (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.043..2138.019 rows=702915 loops=1)
                                 Index Cond: ((realm)::text = 'RU'::text)
 Total runtime: 21131.383 ms

Уже лучше.

WITH appliations_to_delete as (
SELECT player_application.id as app_id
FROM player_application 
FULL OUTER JOIN (SELECT realm, estb_id, MAX(id) as id 
                 FROM player_application 
                 WHERE realm='RU'
                 GROUP BY realm, estb_id) as T ON t.id = player_application.id 
WHERE player_application.realm = 'RU' 
  AND t.id IS NULL
)

DELETE FROM player_application 
USING appliations_to_delete
WHERE appliations_to_delete.app_id = player_application.id;

                                                                                                      QUERY PLAN                                                                                                      
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on player_application  (cost=208400.34..229958.49 rows=352280 width=34) (actual time=19732.897..19732.897 rows=0 loops=1)
   CTE appliations_to_delete
     ->  Merge Anti Join  (cost=129449.73..176377.27 rows=352280 width=4) (actual time=8166.996..15969.613 rows=44434 loops=1)
           Merge Cond: (public.player_application.id = t.id)
           ->  Index Scan using player_application_pkey on player_application  (cost=0.00..37114.10 rows=704561 width=4) (actual time=0.066..1627.652 rows=702915 loops=1)
                 Filter: ((realm)::text = 'RU'::text)
                 Rows Removed by Filter: 109977
           ->  Materialize  (cost=129449.73..131939.95 rows=498044 width=4) (actual time=8132.022..11773.520 rows=658481 loops=1)
                 ->  Sort  (cost=129449.73..130694.84 rows=498044 width=4) (actual time=8131.979..9445.551 rows=658481 loops=1)
                       Sort Key: t.id
                       Sort Method: external merge  Disk: 9024kB
                       ->  Subquery Scan on t  (cost=0.00..68698.04 rows=498044 width=4) (actual time=0.040..6576.660 rows=658481 loops=1)
                             ->  GroupAggregate  (cost=0.00..63717.60 rows=498044 width=15) (actual time=0.036..4249.104 rows=658481 loops=1)
                                   ->  Index Scan using player_application_realm_estb_id_compound on player_application  (cost=0.00..53452.95 rows=704561 width=15) (actual time=0.027..1643.514 rows=702915 loops=1)
                                         Index Cond: ((realm)::text = 'RU'::text)
   ->  Hash Join  (cost=32023.07..53581.22 rows=352280 width=34) (actual time=11349.150..19480.719 rows=44434 loops=1)
         Hash Cond: (appliations_to_delete.id = public.player_application.id)
         ->  CTE Scan on appliations_to_delete  (cost=0.00..7045.60 rows=352280 width=32) (actual time=8167.010..16138.956 rows=44434 loops=1)
         ->  Hash  (cost=17891.92..17891.92 rows=812892 width=10) (actual time=3082.443..3082.443 rows=812892 loops=1)
               Buckets: 4096  Batches: 64  Memory Usage: 560kB
               ->  Seq Scan on player_application  (cost=0.00..17891.92 rows=812892 width=10) (actual time=0.005..1520.998 rows=812892 loops=1)
 Total runtime: 19735.940 ms

С 3 часов, до 20 секунд... Неплохо.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment