PostgreSQL 11 release contains features to improve the performance of DML operations on partitioned tables by enhancing the planner and the executor to use partition metadata more effectively. Those features include a new implementation of partition pruning, execution-time partition pruning (also known as dynamic pruning), partition-wise join and aggregation.
Partition pruning is the ability to skip scanning of partitions that would not contain data requested by queries. Such queries would contain conditions on a table's partition key columns. Server can filter partitions by comparing the values of the other parameter of such conditions to partition bounds.
Consider the following simplified example:
create table orders (id int, order_date date) partition by range (order_date);
create table orders_nov18 partition of orders for values from ('2018-11-01') to ('2018-12-01');
create table orders_dec18 partition of orders for values from ('2018-12-01') to ('2019-01-01');
create table orders_jan19 partition of orders for values from ('2019-01-01') to ('2019-02-01');
select count(*) from orders where order_date = '2018-11-05';
The query shown above would need to look at only the orders_nov18
partition,
because the rows of other partitions would not contain the value 2018-11-05
in
their order_date
column. Partition pruning makes sure that the other
partitions are not scanned.
Older PostgreSQL versions (including 10) don't compare the values directly
with the partition bounds. Instead, it relies on a related optimization feature
of PostgreSQL called constraint exclusion. Constraint exclusion compares
query's conditions with partition's constraint (shaped like CHECK
constraints)
to mark partitions whose constraint contradicts the query conditions as excluded.
For example, the partition constraint of orders_dec18
can be seen in psql
as
follows:
\d+ orders_dec18
Table "public.orders_dec18"
Column │ Type │ Collation │ Nullable │ Default │ Storage │ Stats target │ Description
────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────────────┼─────────────
id │ integer │ │ │ │ plain │ │
order_date │ date │ │ │ │ plain │ │
Partition of: orders FOR VALUES FROM ('2018-12-01') TO ('2019-01-01')
Partition constraint: ((order_date IS NOT NULL) AND (order_date >= '2018-12-01'::date) AND (order_date < '2019-01-01'::date))
It's clear that the query's WHERE
condition order_date = 2018-11-05
contradicts partition constraints of orders_dec18
and orders_jan19
, so they're
both excluded from the query plan as shown below:
You can find more details about constraint exclusion and how it relates to partitioning in the documentation:
https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION
As mentioned in the documentation, constraint exclusion needs to consider each partition individually and is itself computationally expensive, so it does not scale with the number of partitions.
PostgreSQL 11 implements a new mechanism that analyzes query's conditions
(including table join conditions, about which more below), extracts parameters
that are compared with the partitioning column from such conditions, and
determines partition bounds that satisfy the query by comparing the parameter
values directly with them. For example, in this case, it will find the value
of the smallest range bound that is greater than the value being searched
(2018-11-05
) by looking it up in the sorted array of upper bounds of all
partitions using a binary search.
Since partition bounds are stored using an efficient internal representation (a sorted array of native values of bounds), the algorithm that searches those values is also quite efficient. In particular, the new algorithm is O(log N) because of using binary search, so it can scale better with the number of partitions. Overall, the new mechanism has less overhead compared to the old method, so it leads to significant improvement in the latency of this optimization.
Earlier versions of PostgreSQL can't perform pruning if the values of the pruning parameters are not specified directly in the query text or if the values are handed directly to the executor without replanning a query, that is, if the query is executed using a cached plan. For example, when using prepared statements, an application may not provide the actual values of parameters initially. In that case, the planner doesn't know the values of the parameters, so it cannot perform partition pruning. When values are eventually provided in the parameter binding phase, they will be directly handed to the executor, but the older PostgreSQL servers cannot perform partition pruning in the executor.
PostgreSQL 11 implements a new feature whereby the planner will recognize such unbound parameters by analyzing the query and create a plan containing the information using which executor can perform partition pruning, and the executor itself can now perform pruning.
There are two types of execution-time parameters, which are described below. A query may contain any combination of parameters, type 1 or 2 or both. Following examples will illustrate where execution-time pruning is applicable.
create table foo (a int);
insert into foo values (2), (3);
create table p (a int) partition by list (a);
create index on p (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table p3 partition of p for values in (3);
- Parameters whose value is fixed for the whole execution
If query contains this type of parameter, then pruning can occur before
executing the plan (actually, when initializing the execution state of the plan)
and if pruning is successful, it is shown using Subplans Removed:
in the
EXPLAIN
output.
Example:
prepare example_query as select * from p where a = $1;
In the above query, $1
is this type of parameter. Note that no
values have been assigned to it yet. Values are assigned using
execute
operation on the above named prepared statement.
explain (costs off, timing off, analyze) execute example_query (1);
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────
Append
Subplans Removed: 2
-> Bitmap Heap Scan on p1
Recheck Cond: (a = $1)
-> Bitmap Index Scan on p1_a_idx
Index Cond: (a = $1)
Planning Time: 0.019 ms
Execution Time: 0.163 ms
(8 rows)
Here, parameter $1
is assigned the value of 1
and it's fixed for the
entire execution of that query. So, pruning occurs using that value
when initializing the execution state of the Append
node. Since
the p2
and p3
won't contain the values 1
, they're pruned and hence
Subplan Removed: 2
in the above output.
(Note that dynamic pruning would only be required if planner created a
generic plan which ignores the values of parameters provided by execute
.
Currently, there is no way for a user to control whether the generic plan
will be produced, so you will need to wait until planner switches to a
generic plan based on its internal cost criteria to see dynamic pruning
in action.)
- Parameters whose value may change during execution
If a query contains this type of parameter, then pruning occurs every time
its value changes during execution of the query. If a partition is pruned
every time, then its subplan is showed as (never executed)
in the
EXPLAIN
output.
Example:
select * from foo f where exists (select * from p where a = f.a);
Here f.a
is the parameter for scanning p
. The above query is actually
implemented using a join between foo
and p
. If p
is the inner table, then
we can use dynamic pruning using a = f.a
condition where f.a
is the
parameter. Pruning is executed for every value of f.a
, that is, for all
rows of foo
.
explain (costs off, timing off, analyze) select * from foo f where exists
(select * from p where a = f.a);
QUERY PLAN
─────────────────────────────────────────────────────────────────────────
Nested Loop Semi Join (actual rows=0 loops=1)
-> Seq Scan on foo f (actual rows=2 loops=1)
-> Append (actual rows=0 loops=2)
-> Bitmap Heap Scan on p1 (never executed)
Recheck Cond: (a = f.a)
-> Bitmap Index Scan on p1_a_idx (never executed)
Index Cond: (a = f.a)
-> Bitmap Heap Scan on p2 (actual rows=0 loops=1)
Recheck Cond: (a = f.a)
-> Bitmap Index Scan on p2_a_idx (actual rows=0 loops=1)
Index Cond: (a = f.a)
-> Bitmap Heap Scan on p3 (actual rows=0 loops=1)
Recheck Cond: (a = f.a)
-> Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
Index Cond: (a = f.a)
Planning Time: 3.293 ms
Execution Time: 0.397 ms
(17 rows)
Here p1
's subplan is never executed, because foo
contains only 2
and 3
.
- Example with both types of dynamic pruning in the same query
It's also possible for a query to contain both types of parameters, such as in the following example.
prepare example_query as
select * from foo f where exists (select * from p where a = f.a and a <> $1);
Above query contains both types of parameters due to conditions a = f.a
and
a <> $1
.
explain (costs off, timing off, analyze) execute example_query (2);
QUERY PLAN
─────────────────────────────────────────────────────────────────────────
Nested Loop Semi Join (actual rows=0 loops=1)
-> Seq Scan on foo f (actual rows=2 loops=1)
-> Append (actual rows=0 loops=2)
Subplans Removed: 1
-> Bitmap Heap Scan on p1 (never executed)
Recheck Cond: (a = f.a)
Filter: (a <> $1)
-> Bitmap Index Scan on p1_a_idx (never executed)
Index Cond: (a = f.a)
-> Bitmap Heap Scan on p3 (actual rows=0 loops=1)
Recheck Cond: (a = f.a)
Filter: (a <> $1)
-> Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
Index Cond: (a = f.a)
Planning Time: 4.189 ms
Execution Time: 0.331 ms
(16 rows)
Note that there are both Sublans Removed:
and (never executed)
in this
case.:
Sublans Removed:
exists because of the parameter $1
. When it is
executed with value 2
, partition p2
's subplan is removed, because the
condition becomes a <> 2
.
(never executed)
is shown for p1
, because the parameter f.a
never becomes
1
, because foo
only contain 2
and 3
, so p1
's subplan is never executed.
So, the only partition that's ends up getting scanned is p3
.
Join and aggregation are widely used operations in complex query workloads and their efficiency can severely affect the performance when the underlying datasets are large. PostgreSQL 11 implements two techniques to create efficient plans when the tables being joined and aggregated are partitioned tables, which should be a common scaling technique for large datasets anyway.
- Partitionwise join
If two partitioned tables are joined on their common partition key using an
equality join condition, it makes sense to join their partitions which respectively
contain same set of data, instead of joining them in whole. Consider the following
example, where the orders
table used earlier in this article is now accompanied by
an order_items
table and both are HASH
-partitioned on their id
column.
create table orders (id int, order_date date) partition by hash (id);
create table orders_0 partition of orders for values with (modulus 3, remainder 0);
create table orders_1 partition of orders for values with (modulus 3, remainder 1);
create table orders_2 partition of orders for values with (modulus 3, remainder 2);
create table order_items (id int, content text) partition by hash (id);
create table order_items_0 partition of order_items for values with (modulus 3, remainder 0);
create table order_items_1 partition of order_items for values with (modulus 3, remainder 1);
create table order_items_2 partition of order_items for values with (modulus 3, remainder 2);
select count(*) from orders inner join order_items on orders.id = order_items.id
For the above query, older PostgreSQL versions would first combine the outputs of
the partitions of orders
and order_items
, respectively, followed by joining
them using condition orders.id = order_items.id
, as shown below:
However, given how partitioning works, it should be clear in this case that
any rows of the partition orders_0
won't match the rows of partitions on
the other side order_items_1
and order_items_2
. So, it makes sense to
join orders_0
to only order_items_0
, and similarly orders_1
to only
order_items_1
and orders_2
to only order_items_2
, as shown below:
PostgreSQL 11 can create a plan of this shape using newly implemented
mechanism in the planner called partitionwise join. The main advantage
of being able to create a plan like this is that the planner can select
a better join algorithm for individual pairs of partitions than it can for
the whole table join (due to better accuracy of statistics on individual
smaller partitions than on the whole table), resulting in an overall
better plan. A major limitation currently is that both partitioned tables
must have exactly matching partition bounds. Also, planning takes
O(N) time in the number of partitions and the current implementation
consumes severe amounts of CPU and memory for each partition pair.
Due to these inefficiencies, partitionwise join mechanism is disabled
by default. Users who're willing to tolerate the long planning time and
pay the memory cost for a better plan in return can enable it using
the enable_partitionwise_join
GUC parameter:
https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN
- Partitionwise aggregation
Aggregation is another operation for which it helps to divide it into smaller independent units. For example, an aggregation step applied to a partitioned table can be divided into multiple aggregation steps, one for each of its partitions and their outputs combined if necessary. Consider the following query:
select count(*) from orders;
Older PostgreSQL versions would first scan all partitions of orders
and
only then it would start counting, so the plan would look like:
explain (costs off, verbose) select count(*) from orders;
QUERY PLAN
─────────────────────────────────────────────
Aggregate
Output: count(*)
-> Append
-> Seq Scan on public.orders_0
-> Seq Scan on public.orders_dec18
-> Seq Scan on public.orders_jan19
(6 rows)
With PostgreSQL 11, it looks like this:
explain (costs off, verbose) select count(*) from orders;
QUERY PLAN
───────────────────────────────────────────────────
Finalize Aggregate
Output: count(*)
-> Append
-> Partial Aggregate
Output: PARTIAL count(*)
-> Seq Scan on public.orders_0
-> Partial Aggregate
Output: PARTIAL count(*)
-> Seq Scan on public.orders_dec18
-> Partial Aggregate
Output: PARTIAL count(*)
-> Seq Scan on public.orders_jan19
(12 rows)
So, it would first calculate the partial counts for individual partitions and finalize the count for the whole query once all partitions have been processed.
If the aggregation is grouped and the grouping key matches the partition key, then each partition's aggregation output is the final output for a given grouping key, because the same key cannot be present in more than one partition. So, PostgreSQL 11 can generate a plan like this:
explain (costs off, verbose) select id, count(*) from orders group by id;
QUERY PLAN
─────────────────────────────────────────────
Append
-> HashAggregate
Output: orders_0.id, count(*)
Group Key: orders_0.id
-> Seq Scan on public.orders_0
Output: orders_0.id
-> HashAggregate
Output: orders_dec18.id, count(*)
Group Key: orders_dec18.id
-> Seq Scan on public.orders_dec18
Output: orders_dec18.id
-> HashAggregate
Output: orders_jan19.id, count(*)
Group Key: orders_jan19.id
-> Seq Scan on public.orders_jan19
Output: orders_jan19.id
(16 rows)
Similar to joins, the main advantage of planner's ability to apply a given aggregation operation to individual partitions is that it can make better decisions regarding the algorithm to use for aggregation. For example, it it can prefer hash-based aggregation over sort-based aggregation, because the former is cheaper but is only considered for smaller input sizes.
The disadvantage is same as the partitionwise join technique, that is,
planning takes O(N) time in the number of partitions and can consume excess
amounts of CPU and memory. So, it is disabled by default. Users who're
willing to tolerate that in return for a better plan can enable it using
the enable_partitionwise_aggregate
GUC parameter.
https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-AGGREGATE
To summarize, PostgreSQL 11 takes an important step in the direction of improving the efficiency of query execution on partitioned data by implementing optimization techniques based on partitioning metadata. More work will need to be done to ensure that these techniques are scalable in terms of the number of partitions, but this seems like a a good start.