Approach to speed up GROUP BY on house prices

How to Speed up UK Prices GROUP BY

Credit to Vadim Punski for this approach.

Note: timings here are from a Postgres instance hosted on a MacBook Pro (16-inch, 2021), not the Supabase free tier.

Original query from blog:

psql -c "\timing" -c "SELECT
	extract(year from date) as year,
	round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year"
Timing is on.
 year | price
------+--------
 1995 |  59004
 1996 |  63913
…
 2021 | 310626
 2022 | 298977
(28 rows)

Time: 2368.416 ms (00:02.368)
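
The plan below is the output of running the same query under EXPLAIN ANALYZE. The exact invocation isn't shown in the gist, so this is a sketch of how it might look:

psql -c "EXPLAIN ANALYZE
SELECT
	extract(year from date) as year,
	round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year"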


 Finalize GroupAggregate  (cost=658169.56..659903.43 rows=6647 width=64) (actual time=4228.708..4285.444 rows=28 loops=1)
   Group Key: (EXTRACT(year FROM date))
   ->  Gather Merge  (cost=658169.56..659720.63 rows=13294 width=64) (actual time=4228.626..4285.316 rows=84 loops=1)
     	Workers Planned: 2
     	Workers Launched: 2
     	->  Sort  (cost=657169.53..657186.15 rows=6647 width=64) (actual time=4117.437..4117.445 rows=28 loops=3)
           	Sort Key: (EXTRACT(year FROM date))
           	Sort Method: quicksort  Memory: 27kB
           	Worker 0:  Sort Method: quicksort  Memory: 27kB
           	Worker 1:  Sort Method: quicksort  Memory: 27kB
           	->  Partial HashAggregate  (cost=656664.41..656747.50 rows=6647 width=64) (actual time=4117.150..4117.259 rows=28 loops=3)
                 	Group Key: EXTRACT(year FROM date)
                 	Batches: 1  Memory Usage: 217kB
                 	Worker 0:  Batches: 1  Memory Usage: 217kB
                 	Worker 1:  Batches: 1  Memory Usage: 217kB
                 	->  Parallel Seq Scan on uk_price_paid  (cost=0.00..646308.10 rows=2071262 width=36) (actual time=221.438..3860.130 rows=1660388 loops=3)
                       	Filter: (type = 'flat'::bpchar)
                       	Rows Removed by Filter: 7584601
 Planning Time: 3.900 ms
 JIT:
   Functions: 30
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 17.644 ms, Inlining 378.192 ms, Optimization 156.844 ms, Emission 125.381 ms, Total 678.062 ms

So it looks like the index is not used - the planner falls back to a parallel sequential scan. The query planner relies on table statistics, which can be shown using the query below:

SELECT relname,relpages,reltuples FROM pg_class WHERE relname='uk_price_paid';
    relname    | relpages |   reltuples
---------------+----------+---------------
 uk_price_paid |  1414908 | 3.4859372e+07
(1 row)

So for ~34.9M rows we have ~1.4M data blocks of 8KB each.
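
The same arithmetic can be checked directly against pg_class - a small sketch, assuming the default 8KB block size:

SELECT round(reltuples / relpages)             AS rows_per_block,
       pg_size_pretty(relpages::bigint * 8192) AS heap_size  -- 8192 = default block size
FROM pg_class
WHERE relname = 'uk_price_paid';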

The cardinality of the type column is very low - just 5 values. With ~6.4M flats out of ~35M rows, flats account for roughly 1/6th of the total dataset:

select type, count(1) from uk_price_paid group by type;
     type      |  count
---------------+----------
 detached      |  7991206
 flat          |  6389688
 other         |   561737
 semi-detached |  9557371
 terraced      | 10544964
(5 rows)

Because type='flat' rows are distributed across all data blocks, and we have a total of 1,414,908 blocks, there are roughly 34.9M / 1.4M ≈ 25 rows per block. With roughly 1/6th of rows being flats, the probability that any given block contains at least one flat is very high (about 1 - (5/6)^25 ≈ 99%), so even an index on type would not let us skip many blocks.
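
As a back-of-the-envelope check (an illustrative calculation, assuming ~25 rows per block and a uniform ~1/6 flat ratio), the probability can be computed directly:

-- probability that a block of ~25 rows contains at least one 'flat' row
SELECT round(1 - power(5.0 / 6.0, 25), 3) AS prob_block_has_flat;  -- ≈ 0.990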

We could try to force index usage with the enable_seqscan=false setting, but this delivers no real improvement. Feel free to test!
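
A sketch of how that test might look, assuming an index covering the type column already exists:

SET enable_seqscan = off;  -- discourage sequential scans for this session only

EXPLAIN ANALYZE
SELECT
	extract(year from date) as year,
	round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year;

RESET enable_seqscan;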

There is, however, a trick we can use here to speed this query up.

We can build a compound index including an extracted year column (as the query planner doesn't relate extract(year from date) with the actual year value we use in grouping), i.e.

alter table uk_price_paid add column price_year int;
update uk_price_paid  set price_year = extract(year from date);
CREATE INDEX ON uk_price_paid (type, price_year, price); 

Unfortunately the UPDATE command won't complete on the Supabase free tier as it takes > 2 minutes to execute - feel free to try it on a local instance, or re-insert the data with the price_year column already populated.
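
One possible workaround (not from the original gist) is to backfill the column in smaller batches, so that no single statement runs long enough to hit the timeout - a rough sketch:

-- hypothetical batched backfill: re-run until it reports UPDATE 0
UPDATE uk_price_paid
SET price_year = extract(year from date)
WHERE ctid IN (
	SELECT ctid
	FROM uk_price_paid
	WHERE price_year IS NULL
	LIMIT 100000
);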

The resulting query is considerably faster, as the compound index contains all of the data the query needs.

SELECT
 price_year,
 round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY price_year
ORDER BY price_year;
 price_year | price
------------+--------
       1995 |  59004
       1996 |  63913
       ...
       2021 | 310626
       2022 | 298977
(28 rows)

Time: 226.319 ms

All of the data is now available from the index, meaning the table rows don't need to be fetched (an index-only scan), i.e.

-------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.59..92921.62 rows=28 width=36)
   Group Key: price_year
   ->  Gather Merge  (cost=1000.59..92920.92 rows=56 width=36)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.56..91914.43 rows=28 width=36)
               Group Key: price_year
               ->  Parallel Index Only Scan using uk_price_paid_type_price_year_price_idx on uk_price_paid  (cost=0.56..81203.61 rows=2142109 width=8)
                     Index Cond: (type = 'flat'::bpchar)
(8 rows)
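
One caveat worth adding (not covered in the gist): index-only scans rely on the visibility map, so after the bulk UPDATE a VACUUM is typically needed before Postgres can actually skip the heap fetches:

-- refresh the visibility map and statistics after the backfill
VACUUM (ANALYZE) uk_price_paid;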

Of course, this methodology is not generally feasible: creating indexes for every permutation of fields used by various queries results in a huge index storage footprint and hurts insert performance. We are effectively simulating a column-oriented access pattern - not something we recommend you do on your production Postgres instances!

Further details here
