Credit to Vadim Punski for this approach.
Note: timing here is on a Postgres instance hosted on a MacBook Pro (16-inch, 2021), not the Supabase free tier.
Original query from blog:
psql -c "\timing" -c "SELECT
extract(year from date) as year,
round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year"
Timing is on.
year | price
------+--------
1995 | 59004
1996 | 63913
…
2021 | 310626
2022 | 298977
(28 rows)
Time: 2368.416 ms (00:02.368)
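The execution plan below comes from an EXPLAIN ANALYZE run of the same query (a separate run, which is why its timings differ from the \timing output above) - presumably produced with something like:
EXPLAIN ANALYZE
SELECT
extract(year from date) as year,
round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year;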
Finalize GroupAggregate (cost=658169.56..659903.43 rows=6647 width=64) (actual time=4228.708..4285.444 rows=28 loops=1)
Group Key: (EXTRACT(year FROM date))
-> Gather Merge (cost=658169.56..659720.63 rows=13294 width=64) (actual time=4228.626..4285.316 rows=84 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=657169.53..657186.15 rows=6647 width=64) (actual time=4117.437..4117.445 rows=28 loops=3)
Sort Key: (EXTRACT(year FROM date))
Sort Method: quicksort Memory: 27kB
Worker 0: Sort Method: quicksort Memory: 27kB
Worker 1: Sort Method: quicksort Memory: 27kB
-> Partial HashAggregate (cost=656664.41..656747.50 rows=6647 width=64) (actual time=4117.150..4117.259 rows=28 loops=3)
Group Key: EXTRACT(year FROM date)
Batches: 1 Memory Usage: 217kB
Worker 0: Batches: 1 Memory Usage: 217kB
Worker 1: Batches: 1 Memory Usage: 217kB
-> Parallel Seq Scan on uk_price_paid (cost=0.00..646308.10 rows=2071262 width=36) (actual time=221.438..3860.130 rows=1660388 loops=3)
Filter: (type = 'flat'::bpchar)
Rows Removed by Filter: 7584601
Planning Time: 3.900 ms
JIT:
Functions: 30
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 17.644 ms, Inlining 378.192 ms, Optimization 156.844 ms, Emission 125.381 ms, Total 678.062 ms
So it looks like the index is not used - the planner opts for a parallel sequential scan instead. The query planner relies on table statistics, which can be inspected with the query below:
SELECT relname,relpages,reltuples FROM pg_class WHERE relname='uk_price_paid';
relname | relpages | reltuples
---------------+----------+---------------
uk_price_paid | 1414908 | 3.4859372e+07
(1 row)
So for ~34M rows we have ~1.4M data blocks of 8 KB each.
The cardinality of the type column is very low - only 5 distinct values - and with ~6.4M flats out of ~34M rows, flats account for roughly 1/6th of the total dataset:
select type, count(1) from uk_price_paid group by type;
type | count
---------------+----------
detached | 7991206
flat | 6389688
other | 561737
semi-detached | 9557371
terraced | 10544964
(5 rows)
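The planner's own statistics tell the same story - the column's most common values and their frequencies can be read from pg_stats (a sketch; the output depends on when ANALYZE last ran):
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'uk_price_paid' AND attname = 'type';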
Because type='flat' rows are spread across all of the data blocks, and we have a total of 1,414,908 blocks, there are roughly 34M / 1.4M ≈ 24 rows per block. That means the probability of finding at least one flat row in any single block is very high (each row has a ~1/6 chance of being a flat, and there are 24 rows per block).
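To make this concrete, a quick back-of-the-envelope estimate: P(a block contains at least one flat row) = 1 - (5/6)^24 ≈ 0.99, so even an index on type would still end up touching nearly every block of the table.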
We could try to force index usage with the enable_seqscan=false setting, but this delivers no real improvement. Feel free to test!
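If you want to try it yourself, something like this in a psql session should do it (a sketch - the setting only affects the current session):
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT
extract(year from date) as year,
round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY year
ORDER BY year;
RESET enable_seqscan;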
Note there is a trick we can use here to speed this query up.
We can build a compound index that includes a materialized year column (since the query planner doesn't relate extract(year from date) to the year value we group by), i.e.
alter table uk_price_paid add column price_year int;
update uk_price_paid set price_year = extract(year from date);
CREATE INDEX ON uk_price_paid (type, price_year, price);
Unfortunately the update command won't complete on the Supabase free tier as it takes > 2 mins to execute - feel free to try it on a local instance, or re-insert the data with the price_year column added.
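As an aside, on Postgres 12 and later a stored generated column could likely replace the add column + update pair above and keep price_year in sync on future inserts - a sketch (it still rewrites the table, so the same timing caveat applies):
alter table uk_price_paid add column price_year int generated always as (extract(year from date)::int) stored;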
The resulting query is considerably faster as it ensures all of the required data is in the index.
SELECT
price_year,
round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'
GROUP BY price_year
ORDER BY price_year;
price_year | price
------------+--------
1995 | 59004
1996 | 63913
...
2021 | 310626
2022 | 298977
(28 rows)
Time: 226.319 ms
The data is now all available from the index, meaning the table rows don't need to be fetched at all (an index-only scan), i.e.
Finalize GroupAggregate (cost=1000.59..92921.62 rows=28 width=36)
Group Key: price_year
-> Gather Merge (cost=1000.59..92920.92 rows=56 width=36)
Workers Planned: 2
-> Partial GroupAggregate (cost=0.56..91914.43 rows=28 width=36)
Group Key: price_year
-> Parallel Index Only Scan using uk_price_paid_type_price_year_price_idx on uk_price_paid (cost=0.56..81203.61 rows=2142109 width=8)
Index Cond: (type = 'flat'::bpchar)
(8 rows)
Of course this approach doesn't generalise: creating indexes for every permutation of fields used by various queries would result in a huge index storage footprint and would also hurt insert performance. We are effectively simulating a column-oriented access pattern - not something we recommend you do on your production Postgres instances!
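To get a feel for that footprint, you can check the size of the new index, e.g. (a sketch, using the index name reported in the plan above):
SELECT pg_size_pretty(pg_relation_size('uk_price_paid_type_price_year_price_idx'));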
Further details here