PG 11 Partitioning Optimization Features

The PostgreSQL 11 release contains features that improve the performance of DML operations on partitioned tables by enhancing the planner and the executor to use partition metadata more effectively. Those features include a new implementation of partition pruning, execution-time partition pruning (also known as dynamic pruning), and partition-wise join and aggregation.

Improved partition pruning

Partition pruning is the ability to skip scanning partitions that cannot contain the data requested by a query. Such queries contain conditions on a table's partition key columns, and the server can filter out partitions by comparing the values in those conditions directly against the partition bounds.

Consider the following simplified example:

create table orders (id int, order_date date) partition by range (order_date);
create table orders_nov18 partition of orders for values from ('2018-11-01') to ('2018-12-01');
create table orders_dec18 partition of orders for values from ('2018-12-01') to ('2019-01-01');
create table orders_jan19 partition of orders for values from ('2019-01-01') to ('2019-02-01');

select count(*) from orders where order_date = '2018-11-05';

The query shown above would need to look at only the orders_nov18 partition, because the rows of other partitions would not contain the value 2018-11-05 in their order_date column. Partition pruning makes sure that the other partitions are not scanned.

Older PostgreSQL versions (including 10) don't compare the values directly with the partition bounds. Instead, they rely on a related optimization feature of PostgreSQL called constraint exclusion. Constraint exclusion compares the query's conditions with each partition's constraint (shaped like a CHECK constraint) and marks partitions whose constraint contradicts the query conditions as excluded. For example, the partition constraint of orders_dec18 can be seen in psql as follows:

\d+ orders_dec18
                                 Table "public.orders_dec18"
   Column   │  Type   │ Collation │ Nullable │ Default │ Storage │ Stats target │ Description 
────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────────────┼─────────────
 id         │ integer │           │          │         │ plain   │              │ 
 order_date │ date    │           │          │         │ plain   │              │ 
Partition of: orders FOR VALUES FROM ('2018-12-01') TO ('2019-01-01')
Partition constraint: ((order_date IS NOT NULL) AND (order_date >= '2018-12-01'::date) AND (order_date < '2019-01-01'::date))

It's clear that the query's WHERE condition order_date = '2018-11-05' contradicts the partition constraints of orders_dec18 and orders_jan19, so they're both excluded from the query plan, as shown below:

[figure: constraint-exclusion]
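For illustration, the effect can be seen directly with EXPLAIN; the exact plan shape depends on the version and on statistics, but the excluded partitions simply do not appear:

-- On PostgreSQL 10, constraint exclusion leaves only orders_nov18 under the
-- Append node; orders_dec18 and orders_jan19 do not appear in the plan.
explain (costs off) select count(*) from orders where order_date = '2018-11-05';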

You can find more details about constraint exclusion and how it relates to partitioning in the documentation:

https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION

As mentioned in the documentation, constraint exclusion needs to consider each partition individually and is itself computationally expensive, so it does not scale with the number of partitions.

PostgreSQL 11 implements a new mechanism that analyzes the query's conditions (including table join conditions, about which more below), extracts the parameters that are compared with the partitioning columns, and determines which partition bounds satisfy the query by comparing the parameter values directly against them. For example, in this case it finds the smallest range bound that is greater than the value being searched for (2018-11-05) by looking it up in the sorted array of upper bounds of all partitions using binary search.

[figure: partition-pruning]

Since partition bounds are stored using an efficient internal representation (a sorted array of the bounds' native values), the algorithm that searches them is also quite efficient. In particular, the new algorithm is O(log N) in the number of partitions thanks to binary search, so it scales much better. Overall, the new mechanism has less overhead than the old method, which leads to a significant improvement in the latency of this optimization.
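The new behavior is governed by the enable_partition_pruning setting added in PostgreSQL 11 (on by default), which also gives a quick way to compare plans with and without pruning; a rough sketch:

-- Pruning is on by default in PostgreSQL 11; only orders_nov18 should be scanned.
explain (costs off) select count(*) from orders where order_date = '2018-11-05';

-- Turning the new pruning off should bring all partitions back into the plan.
set enable_partition_pruning = off;
explain (costs off) select count(*) from orders where order_date = '2018-11-05';
set enable_partition_pruning = on;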

Execution-time partition pruning

Earlier versions of PostgreSQL can't perform pruning if the values of the pruning parameters are not specified directly in the query text, or if the values are handed directly to the executor without replanning the query, that is, if the query is executed using a cached plan. For example, when using prepared statements, an application may not provide the actual parameter values up front. In that case, the planner doesn't know the values of the parameters, so it cannot perform partition pruning. When values are eventually provided in the parameter binding phase, they are handed directly to the executor, but older PostgreSQL servers cannot perform partition pruning in the executor.

PostgreSQL 11 implements a new feature whereby the planner recognizes such unbound parameters while analyzing the query and creates a plan containing the information the executor needs to perform partition pruning, and the executor itself can now perform that pruning.

There are two types of execution-time parameters, which are described below. A query may contain either type of parameter, or both. The following examples illustrate where execution-time pruning is applicable.

create table foo (a int);
insert into foo values (2), (3);

create table p (a int) partition by list (a);
create index on p (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table p3 partition of p for values in (3);
  • Parameters whose value is fixed for the whole execution

If a query contains this type of parameter, pruning can occur before executing the plan (more precisely, when initializing the execution state of the plan), and if pruning is successful, it is shown as Subplans Removed: in the EXPLAIN output.

Example:

prepare example_query as select * from p where a = $1;

In the above query, $1 is this type of parameter. Note that no value has been assigned to it yet; values are assigned when the named prepared statement is run with execute.

explain (costs off, timing off, analyze) execute example_query (1);
                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────
 Append
   Subplans Removed: 2
   ->  Bitmap Heap Scan on p1
         Recheck Cond: (a = $1)
         ->  Bitmap Index Scan on p1_a_idx
               Index Cond: (a = $1)
 Planning Time: 0.019 ms
 Execution Time: 0.163 ms
(8 rows)

Here, parameter $1 is assigned the value 1, and it's fixed for the entire execution of that query. So pruning occurs using that value when initializing the execution state of the Append node. Since p2 and p3 cannot contain the value 1, they are pruned, hence Subplans Removed: 2 in the above output.

(Note that dynamic pruning is only required if the planner created a generic plan, which ignores the values of the parameters provided by execute. Currently, there is no way for a user to control whether a generic plan is produced, so you will need to wait until the planner switches to a generic plan based on its internal cost criteria to see dynamic pruning in action.)
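As a rough illustration (the exact switchover is an internal heuristic), simply repeating the execution is usually enough to get a generic plan, at which point $1 and Subplans Removed: show up in the output as above:

-- The planner considers a generic plan only after the statement has been
-- executed a few times (typically five) with custom plans; repeating the
-- execution should eventually make dynamic pruning visible.
execute example_query (1);
execute example_query (1);
execute example_query (1);
execute example_query (1);
execute example_query (1);
explain (costs off, timing off, analyze) execute example_query (1);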

  • Parameters whose value may change during execution

If a query contains this type of parameter, pruning occurs every time its value changes during execution of the query. If a partition ends up being pruned on every such occasion, its subplan is shown as (never executed) in the EXPLAIN output.

Example:

select * from foo f where exists (select * from p where a = f.a);

Here f.a is the parameter for scanning p. The above query is actually implemented using a join between foo and p. If p is the inner table, dynamic pruning can be applied using the a = f.a condition, where f.a is the parameter. Pruning is performed for every value of f.a, that is, for every row of foo.

explain (costs off, timing off, analyze) select * from foo f where exists
(select * from p where a = f.a);
                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────
 Nested Loop Semi Join (actual rows=0 loops=1)
   ->  Seq Scan on foo f (actual rows=2 loops=1)
   ->  Append (actual rows=0 loops=2)
         ->  Bitmap Heap Scan on p1 (never executed)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p1_a_idx (never executed)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p2 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p2_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p3 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
 Planning Time: 3.293 ms
 Execution Time: 0.397 ms
(17 rows)

Here p1's subplan is never executed, because foo contains only 2 and 3.

  • Example with both types of dynamic pruning in the same query

It's also possible for a query to contain both types of parameters, such as in the following example.

deallocate example_query;  -- drop the earlier prepared statement so the name can be reused
prepare example_query as
select * from foo f where exists (select * from p where a = f.a and a <> $1);

The above query contains both types of parameters, due to the conditions a = f.a and a <> $1.

explain (costs off, timing off, analyze) execute example_query (2);
                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────
 Nested Loop Semi Join (actual rows=0 loops=1)
   ->  Seq Scan on foo f (actual rows=2 loops=1)
   ->  Append (actual rows=0 loops=2)
         Subplans Removed: 1
         ->  Bitmap Heap Scan on p1 (never executed)
               Recheck Cond: (a = f.a)
               Filter: (a <> $1)
               ->  Bitmap Index Scan on p1_a_idx (never executed)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p3 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               Filter: (a <> $1)
               ->  Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
 Planning Time: 4.189 ms
 Execution Time: 0.331 ms
(16 rows)

Note that there are both Subplans Removed: and (never executed) in this case:

Subplans Removed: appears because of the parameter $1. When the statement is executed with the value 2, partition p2's subplan is removed, because the condition becomes a <> 2.

(never executed) is shown for p1 because the parameter f.a never takes the value 1 (foo contains only 2 and 3), so p1's subplan is never executed. The only partition that ends up getting scanned is p3.
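For comparison, binding a different value should prune a different subplan; a quick sketch, assuming the generic plan is still in use:

-- Binding 3 instead of 2 should remove p3's subplan (a <> 3), leave p1 as
-- (never executed), and make p2 the only partition that gets scanned.
explain (costs off, timing off, analyze) execute example_query (3);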

Partition-wise join and aggregation

Join and aggregation are widely used operations in complex query workloads, and their efficiency can severely affect performance when the underlying datasets are large. PostgreSQL 11 implements two techniques to create more efficient plans when the tables being joined or aggregated are partitioned, which is a common way of scaling large datasets anyway.

  • Partitionwise join

If two partitioned tables are joined on their common partition key using an equality join condition, it makes sense to join the pairs of partitions that hold matching subsets of the data, instead of joining the tables as a whole. Consider the following example, where the orders table used earlier in this article is recreated with hash partitioning on its id column and accompanied by an order_items table partitioned the same way.

drop table orders;  -- replace the earlier range-partitioned definition
create table orders (id int, order_date date) partition by hash (id);
create table orders_0 partition of orders for values with (modulus 3, remainder 0);
create table orders_1 partition of orders for values with (modulus 3, remainder 1);
create table orders_2 partition of orders for values with (modulus 3, remainder 2);

create table order_items (id int, content text) partition by hash (id);
create table order_items_0 partition of order_items for values with (modulus 3, remainder 0);
create table order_items_1 partition of order_items for values with (modulus 3, remainder 1);
create table order_items_2 partition of order_items for values with (modulus 3, remainder 2);

select count(*) from orders inner join order_items on orders.id = order_items.id;

For the above query, older PostgreSQL versions would first combine the outputs of the partitions of orders and of order_items, respectively, and then join the combined results using the condition orders.id = order_items.id, as shown below:

[figure: pwjoin-before]
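The shape can also be inspected with EXPLAIN; the join method chosen varies with statistics, but the key point is the single join sitting above two Append nodes:

-- Without partitionwise join, a single join node sits on top of an Append of
-- all orders partitions and an Append of all order_items partitions; the join
-- method itself (hash, merge, nested loop) depends on statistics.
explain (costs off) select count(*)
from orders inner join order_items on orders.id = order_items.id;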

However, given how partitioning works, it should be clear in this case that rows of the partition orders_0 can never match rows of the other side's partitions order_items_1 and order_items_2. So it makes sense to join orders_0 only to order_items_0, and similarly orders_1 only to order_items_1 and orders_2 only to order_items_2, as shown below:

[figure: pwjoin-after]

PostgreSQL 11 can create a plan of this shape using a newly implemented planner mechanism called partitionwise join. The main advantage of being able to create such a plan is that the planner can select a better join algorithm for individual pairs of partitions than it can for the whole-table join (statistics on individual, smaller partitions are more accurate than statistics on the whole table), resulting in an overall better plan. A major limitation currently is that both partitioned tables must have exactly matching partition bounds. Also, planning takes O(N) time in the number of partitions, and the current implementation consumes a significant amount of CPU and memory for each partition pair. Due to these inefficiencies, partitionwise join is disabled by default. Users who are willing to tolerate the longer planning time and pay the memory cost in return for a better plan can enable it using the enable_partitionwise_join GUC parameter:

https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN
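To try it out, the setting can be turned on for the session before re-running EXPLAIN; roughly:

-- With partitionwise join enabled and the partition bounds matching exactly,
-- the Append should now sit above three per-partition joins
-- (orders_0/order_items_0, orders_1/order_items_1, orders_2/order_items_2)
-- instead of below a single join.
set enable_partitionwise_join = on;
explain (costs off) select count(*)
from orders inner join order_items on orders.id = order_items.id;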

  • Partitionwise aggregation

Aggregation is another operation that benefits from being divided into smaller independent units. For example, an aggregation step applied to a partitioned table can be divided into one aggregation step per partition, with the outputs combined afterwards if necessary. Consider the following query:

select count(*) from orders;

Older PostgreSQL versions would scan all partitions of orders through a single Append and count over their combined output, so the plan would look like:

explain (costs off, verbose) select count(*) from orders;
                 QUERY PLAN                  
─────────────────────────────────────────────
 Aggregate
   Output: count(*)
   ->  Append
         ->  Seq Scan on public.orders_0
         ->  Seq Scan on public.orders_1
         ->  Seq Scan on public.orders_2
(6 rows)

With PostgreSQL 11, it looks like this:

explain (costs off, verbose) select count(*) from orders;
                    QUERY PLAN                     
───────────────────────────────────────────────────
 Finalize Aggregate
   Output: count(*)
   ->  Append
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_0
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_1
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_2
(12 rows)

So, it would first calculate the partial counts for individual partitions and finalize the count for the whole query once all partitions have been processed.

If the aggregation is grouped and the grouping key matches the partition key, then each partition's aggregation output is the final output for a given grouping key, because the same key cannot be present in more than one partition. So, PostgreSQL 11 can generate a plan like this:

explain (costs off, verbose) select id, count(*) from orders group by id;
                 QUERY PLAN                  
─────────────────────────────────────────────
 Append
   ->  HashAggregate
         Output: orders_0.id, count(*)
         Group Key: orders_0.id
         ->  Seq Scan on public.orders_0
               Output: orders_0.id
   ->  HashAggregate
         Output: orders_1.id, count(*)
         Group Key: orders_1.id
         ->  Seq Scan on public.orders_1
               Output: orders_1.id
   ->  HashAggregate
         Output: orders_2.id, count(*)
         Group Key: orders_2.id
         ->  Seq Scan on public.orders_2
               Output: orders_2.id
(16 rows)

Similar to joins, the main advantage of the planner's ability to apply a given aggregation to individual partitions is that it can make better decisions about the algorithm to use. For example, it can prefer hash-based aggregation over sort-based aggregation, because the former is cheaper but is only considered when the input is small enough for the hash table to fit in memory, which is more likely for individual partitions than for the whole table.

The disadvantage is the same as with partitionwise join: planning takes O(N) time in the number of partitions and can consume excessive amounts of CPU and memory. So it is disabled by default. Users who are willing to tolerate that in return for a better plan can enable it using the enable_partitionwise_aggregate GUC parameter:

https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-AGGREGATE
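The plans shown above assume the setting has been enabled for the session, for example:

-- Partitionwise aggregation is off by default in PostgreSQL 11; enable it to
-- get the Partial/Finalize Aggregate and per-partition HashAggregate plans above.
set enable_partitionwise_aggregate = on;
explain (costs off, verbose) select count(*) from orders;
explain (costs off, verbose) select id, count(*) from orders group by id;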

Summary

To summarize, PostgreSQL 11 takes an important step toward improving the efficiency of query execution on partitioned data by implementing optimization techniques based on partitioning metadata. More work will be needed to make these techniques scale with the number of partitions, but this is a good start.
