@keithmgould
Last active February 29, 2016 15:59

OK. In an effort to learn this stuff (on a Saturday night, of course), I also built the tables and indexes and created fake data.

Here is the big realization: the join is expensive and pointless. It is not needed for the heavy lifting (the counting and sorting). Split the work into two queries (or a subquery).

First, let's do it the original way:

# explain ANALYZE SELECT races.id, races.title, count(participants.id) AS participant_count
FROM races
   INNER JOIN participants ON races.id=participants.race_id
GROUP BY races.id, races.title
ORDER BY participant_count DESC
LIMIT 5;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=42872.35..42872.37 rows=5 width=18) (actual time=851.796..851.797 rows=5 loops=1)
   ->  Sort  (cost=42872.35..42884.85 rows=5000 width=18) (actual time=851.794..851.794 rows=5 loops=1)
         Sort Key: (count(participants.id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=42739.31..42789.31 rows=5000 width=18) (actual time=851.393..851.593 rows=1000 loops=1)
               Group Key: races.id, races.title
               ->  Hash Join  (cost=149.50..35244.52 rows=999305 width=18) (actual time=6.069..511.539 rows=999199 loops=1)
                     Hash Cond: (participants.race_id = races.id)
                     ->  Seq Scan on participants  (cost=0.00..16358.05 rows=999305 width=8) (actual time=0.013..134.292 rows=999199 loops=1)
                     ->  Hash  (cost=87.00..87.00 rows=5000 width=14) (actual time=6.040..6.040 rows=5000 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 229kB
                           ->  Seq Scan on races  (cost=0.00..87.00 rows=5000 width=14) (actual time=0.010..2.221 rows=5000 loops=1)
 Planning time: 0.448 ms
 Execution time: 852.004 ms

OK. Now, the first thing we can do is make the hash smaller. If we remove races.title from the query, things speed up a little bit:

# explain ANALYZE SELECT races.id, count(participants.race_id) AS participant_count
FROM races
  INNER JOIN participants ON races.id=participants.race_id
GROUP BY races.id
ORDER BY participant_count DESC
LIMIT 5;
                                                                 QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=40374.09..40374.10 rows=5 width=8) (actual time=709.217..709.217 rows=5 loops=1)
  ->  Sort  (cost=40374.09..40386.59 rows=5000 width=8) (actual time=709.215..709.215 rows=5 loops=1)
        Sort Key: (count(participants.race_id))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=40241.04..40291.04 rows=5000 width=8) (actual time=708.870..709.046 rows=1000 loops=1)
              Group Key: races.id
              ->  Hash Join  (cost=149.50..35244.52 rows=999305 width=8) (actual time=4.004..476.482 rows=999199 loops=1)
                    Hash Cond: (participants.race_id = races.id)
                    ->  Seq Scan on participants  (cost=0.00..16358.05 rows=999305 width=4) (actual time=0.012..129.991 rows=999199 loops=1)
                    ->  Hash  (cost=87.00..87.00 rows=5000 width=4) (actual time=3.975..3.975 rows=5000 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 176kB
                          ->  Seq Scan on races  (cost=0.00..87.00 rows=5000 width=4) (actual time=0.009..1.659 rows=5000 loops=1)
Planning time: 0.447 ms
Execution time: 709.316 ms
(14 rows)

But that's not good enough. Let's do only what we have to in order to get the expensive work done. That means we don't need the races table:

# explain ANALYZE SELECT participants.race_id, count(participants.race_id) AS participant_count
FROM participants
GROUP BY race_id
ORDER BY participant_count DESC
LIMIT 5;
                                                              QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=21381.05..21381.06 rows=5 width=4) (actual time=380.555..380.556 rows=5 loops=1)
  ->  Sort  (cost=21381.05..21383.54 rows=995 width=4) (actual time=380.553..380.553 rows=5 loops=1)
        Sort Key: (count(race_id))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=21354.58..21364.53 rows=995 width=4) (actual time=380.223..380.404 rows=1000 loops=1)
              Group Key: race_id
              ->  Seq Scan on participants  (cost=0.00..16358.05 rows=999305 width=4) (actual time=0.015..119.637 rows=999199 loops=1)
Planning time: 0.139 ms
Execution time: 380.620 ms
(9 rows)

From here we can do a select races.title from races..., which will cost next to nothing since it only has to fetch five rows by primary key. Or, if you HAD to use just one query, you could fold the heavy-lifting results into the races lookup as a subquery.
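A minimal sketch of the two-query follow-up lookup, assuming the five winning race_ids from the heavy-lifting query were 1 through 5 (hypothetical values for illustration):

SELECT id, title
FROM races
WHERE id IN (1, 2, 3, 4, 5); -- ids come from the heavy-lifting query above

And a sketch of the one-query version: push the LIMIT into a subquery so the join only ever touches five rows (the alias top_races is just an illustrative name):

SELECT races.id, races.title, top_races.participant_count
FROM races
  INNER JOIN (
    SELECT race_id, count(race_id) AS participant_count
    FROM participants
    GROUP BY race_id
    ORDER BY participant_count DESC
    LIMIT 5
  ) AS top_races ON races.id = top_races.race_id
ORDER BY top_races.participant_count DESC;

Because the LIMIT happens inside the subquery, the outer join and sort only handle five rows instead of one per race.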

Conclusion: we went from 852 ms to 380 ms, better than a 2x speedup, with about 1 million rows in participants. I'll let the sim generator run overnight to get to 5 million and see what the delta is then. I bet it will be bigger.


PS: Here are the table details:

# \d races;
                                    Table "public.races"
  Column   |            Type             |                     Modifiers                      
------------+-----------------------------+----------------------------------------------------
id         | integer                     | not null default nextval('races_id_seq'::regclass)
title      | character varying           | 
created_at | timestamp without time zone | 
updated_at | timestamp without time zone | 
Indexes:
   "races_pkey" PRIMARY KEY, btree (id)
   "index_races_on_title" btree (title)

# \d participants;
                                    Table "public.participants"
  Column   |            Type             |                         Modifiers                         
------------+-----------------------------+-----------------------------------------------------------
id         | integer                     | not null default nextval('participants_id_seq'::regclass)
race_id    | integer                     | not null
created_at | timestamp without time zone | 
updated_at | timestamp without time zone | 
Indexes:
   "participants_pkey" PRIMARY KEY, btree (id)
   "index_participants_on_race_id" btree (race_id)

# select count(0) from races;
count 
-------
 5000
(1 row)

# select count(0) from participants;
count  
--------
999199
(1 row)

@TheDudeWithTheThing:

SELECT races.id, races.title, participant_counts.count as participant_count
FROM races, (  select race_id, count(*) as count from participants group by race_id) as participant_counts
where races.id = participant_counts.race_id
ORDER BY participant_count DESC
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=55745.21..55755.21 rows=3999 width=30) (actual time=620.160..620.375 rows=4000 loops=1)
   Sort Key: (count(*))
   Sort Method: quicksort  Memory: 409kB
   ->  Hash Join  (cost=55356.00..55505.96 rows=3999 width=30) (actual time=617.210..619.124 rows=4000 loops=1)
         Hash Cond: (participants.race_id = races.id)
         ->  HashAggregate  (cost=55221.00..55260.99 rows=3999 width=4) (actual time=615.026..615.489 rows=4000 loops=1)
               ->  Seq Scan on participants  (cost=0.00..45221.00 rows=2000000 width=4) (actual time=0.036..210.820 rows=2000000 loops=1)
         ->  Hash  (cost=85.00..85.00 rows=4000 width=22) (actual time=2.174..2.174 rows=4000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 214kB
               ->  Seq Scan on races  (cost=0.00..85.00 rows=4000 width=22) (actual time=0.007..0.960 rows=4000 loops=1)
 Total runtime: 620.621 ms
(11 rows)
