amitlan/pg11-partition-opt.md

## pg11-partition-opt.md

      
    Raw
  

              pg11-partition-opt.md
            
          
    PostgreSQL 11 release contains features to improve the performance of DML
operations on partitioned tables by enhancing the planner and the executor
to use partition metadata more effectively.  Those features include a new
implementation of partition pruning, execution-time partition pruning (also
known as dynamic pruning), partition-wise join and aggregation.
Improved partition pruning

Partition pruning is the ability to skip scanning of partitions that would
not contain data requested by queries. Such queries would contain conditions
on a table's partition key columns.  Server can filter partitions by comparing
the values of the other parameter of such conditions to partition bounds.
Consider the following simplified example:
create table orders (id int, order_date date) partition by range (order_date);
create table orders_nov18 partition of orders for values from ('2018-11-01') to ('2018-12-01');
create table orders_dec18 partition of orders for values from ('2018-12-01') to ('2019-01-01');
create table orders_jan19 partition of orders for values from ('2019-01-01') to ('2019-02-01');

select count(*) from orders where order_date = '2018-11-05';

The query shown above would need to look at only the orders_nov18 partition,
because the rows of other partitions would not contain the value 2018-11-05 in
their order_date column.  Partition pruning makes sure that the other
partitions are not scanned.
Older PostgreSQL versions (including 10) don't compare the values directly
with the partition bounds. Instead, it relies on a related optimization feature
of PostgreSQL called constraint exclusion.  Constraint exclusion compares
query's conditions with partition's constraint (shaped like CHECK constraints)
to mark partitions whose constraint contradicts the query conditions as excluded.
For example, the partition constraint of orders_dec18 can be seen in psql as
follows:
\d+ orders_dec18
                                 Table "public.orders_dec18"
   Column   │  Type   │ Collation │ Nullable │ Default │ Storage │ Stats target │ Description 
────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────────────┼─────────────
 id         │ integer │           │          │         │ plain   │              │ 
 order_date │ date    │           │          │         │ plain   │              │ 
Partition of: orders FOR VALUES FROM ('2018-12-01') TO ('2019-01-01')
Partition constraint: ((order_date IS NOT NULL) AND (order_date >= '2018-12-01'::date) AND (order_date < '2019-01-01'::date))

It's clear that the query's WHERE condition order_date = 2018-11-05
contradicts partition constraints of orders_dec18 and orders_jan19, so they're
both excluded from the query plan as shown below:

You can find more details about constraint exclusion and how it relates to
partitioning in the documentation:
https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-CONSTRAINT-EXCLUSION
As mentioned in the documentation, constraint exclusion needs to consider each
partition individually and is itself computationally expensive, so it does not
scale with the number of partitions.
PostgreSQL 11 implements a new mechanism that analyzes query's conditions
(including table join conditions, about which more below), extracts parameters
that are compared with the partitioning column from such conditions, and
determines partition bounds that satisfy the query by comparing the parameter
values directly with them.  For example, in this case, it will find the value
of the smallest range bound that is greater than the value being searched
(2018-11-05) by looking it up in the sorted array of upper bounds of all
partitions using a binary search.

Since partition bounds are stored using an efficient internal
representation (a sorted array of native values of bounds), the algorithm
that searches those values is also quite efficient.  In particular, the
new algorithm is O(log N) because of using binary search, so it can scale
better with the number of partitions. Overall, the new mechanism has less
overhead compared to the old method, so it leads to significant improvement
in the latency of this optimization.
Execution-time partition pruning

Earlier versions of PostgreSQL can't perform pruning if the values of the
pruning parameters are not specified directly in the query text or if the
values are handed directly to the executor without replanning a query, that
is, if the query is executed using a cached plan.  For example, when using
prepared statements, an application may not provide the actual values of
parameters initially.  In that case, the planner doesn't know the values
of the parameters, so it cannot perform partition pruning.  When values are
eventually provided in the parameter binding phase, they will be directly
handed to the executor, but the older PostgreSQL servers cannot perform partition
pruning in the executor.
PostgreSQL 11 implements a new feature whereby the planner will recognize such
unbound parameters by analyzing the query and create a plan containing the
information using which executor can perform partition pruning, and the
executor itself can now perform pruning.
There are two types of execution-time parameters, which are described below.
A query may contain any combination of parameters, type 1 or 2 or both.
Following examples will illustrate where execution-time pruning is applicable.
create table foo (a int);
insert into foo values (2), (3);

create table p (a int) partition by list (a);
create index on p (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table p3 partition of p for values in (3);


Parameters whose value is fixed for the whole execution

If query contains this type of parameter, then pruning can occur before
executing the plan (actually, when initializing the execution state of the plan)
and if pruning is successful, it is shown using Subplans Removed:  in the
EXPLAIN output.
Example:
prepare example_query as select * from p where a = $1;

In the above query, $1 is this type of parameter.  Note that no
values have been assigned to it yet.  Values are assigned using
execute operation on the above named prepared statement.
explain (costs off, timing off, analyze) execute example_query (1);
                                     QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────
 Append
   Subplans Removed: 2
   ->  Bitmap Heap Scan on p1
         Recheck Cond: (a = $1)
         ->  Bitmap Index Scan on p1_a_idx
               Index Cond: (a = $1)
 Planning Time: 0.019 ms
 Execution Time: 0.163 ms
(8 rows)

Here, parameter $1 is assigned the value of 1 and it's fixed for the
entire execution of that query.  So, pruning occurs using that value
when initializing the execution state of the Append node.  Since
the p2 and p3 won't contain the values 1, they're pruned and hence
Subplan Removed: 2 in the above output.
(Note that dynamic pruning would only be required if planner created a
generic plan which ignores the values of parameters provided by execute.
Currently, there is no way for a user to control whether the generic plan
will be produced, so you will need to wait until planner switches to a
generic plan based on its internal cost criteria to see dynamic pruning
in action.)

Parameters whose value may change during execution

If a query contains this type of parameter, then pruning occurs every time
its value changes during execution of the query.  If a partition is pruned
every time, then its subplan is showed as (never executed) in the
EXPLAIN output.
Example:
select * from foo f where exists (select * from p where a = f.a);

Here f.a is the parameter for scanning p.  The above query is actually
implemented using a join between foo and p.  If p is the inner table, then
we can use dynamic pruning using a = f.a condition where f.a is the
parameter.  Pruning is executed for every value of f.a, that is, for all
rows of foo.
explain (costs off, timing off, analyze) select * from foo f where exists
(select * from p where a = f.a);
                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────
 Nested Loop Semi Join (actual rows=0 loops=1)
   ->  Seq Scan on foo f (actual rows=2 loops=1)
   ->  Append (actual rows=0 loops=2)
         ->  Bitmap Heap Scan on p1 (never executed)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p1_a_idx (never executed)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p2 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p2_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p3 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               ->  Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
 Planning Time: 3.293 ms
 Execution Time: 0.397 ms
(17 rows)

Here p1's subplan is never executed, because foo contains only 2 and 3.

Example with both types of dynamic pruning in the same query

It's also possible for a query to contain both types of parameters, such
as in the following example.
prepare example_query as
select * from foo f where exists (select * from p where a = f.a and a <> $1);

Above query contains both types of parameters due to conditions a = f.a and
a <> $1.
explain (costs off, timing off, analyze) execute example_query (2);
                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────
 Nested Loop Semi Join (actual rows=0 loops=1)
   ->  Seq Scan on foo f (actual rows=2 loops=1)
   ->  Append (actual rows=0 loops=2)
         Subplans Removed: 1
         ->  Bitmap Heap Scan on p1 (never executed)
               Recheck Cond: (a = f.a)
               Filter: (a <> $1)
               ->  Bitmap Index Scan on p1_a_idx (never executed)
                     Index Cond: (a = f.a)
         ->  Bitmap Heap Scan on p3 (actual rows=0 loops=1)
               Recheck Cond: (a = f.a)
               Filter: (a <> $1)
               ->  Bitmap Index Scan on p3_a_idx (actual rows=0 loops=1)
                     Index Cond: (a = f.a)
 Planning Time: 4.189 ms
 Execution Time: 0.331 ms
(16 rows)

Note that there are both Sublans Removed:  and (never executed) in this
case.:
Sublans Removed:  exists because of the parameter $1.  When it is
executed with value 2, partition p2's subplan is removed, because the
condition becomes a <> 2.
(never executed) is shown for p1, because the parameter f.a never becomes
1, because foo only contain 2 and 3, so p1's subplan is never executed.
So, the only partition that's ends up getting scanned is p3.
Partition-wise join and aggregation

Join and aggregation are widely used operations in complex query workloads and
their efficiency can severely affect the performance when the underlying datasets
are large.  PostgreSQL 11 implements two techniques to create efficient plans when
the tables being joined and aggregated are partitioned tables, which should be a
common scaling technique for large datasets anyway.

Partitionwise join

If two partitioned tables are joined on their common partition key using an
equality join condition, it makes sense to join their partitions which respectively
contain same set of data, instead of joining them in whole.  Consider the following
example, where the orders table used earlier in this article is now accompanied by
an order_items table and both are HASH-partitioned on their id column.
create table orders (id int, order_date date) partition by hash (id);
create table orders_0 partition of orders for values with (modulus 3, remainder 0);
create table orders_1 partition of orders for values with (modulus 3, remainder 1);
create table orders_2 partition of orders for values with (modulus 3, remainder 2);

create table order_items (id int, content text) partition by hash (id);
create table order_items_0 partition of order_items for values with (modulus 3, remainder 0);
create table order_items_1 partition of order_items for values with (modulus 3, remainder 1);
create table order_items_2 partition of order_items for values with (modulus 3, remainder 2);

select count(*) from orders inner join order_items on orders.id = order_items.id

For the above query, older PostgreSQL versions would first combine the outputs of
the partitions of orders and order_items, respectively, followed by joining
them using condition orders.id = order_items.id, as shown below:

However, given how partitioning works, it should be clear in this case that
any rows of the partition orders_0 won't match the rows of partitions on
the other side order_items_1 and order_items_2.  So, it makes sense to
join orders_0 to only order_items_0, and similarly orders_1 to only
order_items_1 and orders_2 to only order_items_2, as shown below:

PostgreSQL 11 can create a plan of this shape using newly implemented
mechanism in the planner called partitionwise join.  The main advantage
of being able to create a plan like this is that the planner can select
a better join algorithm for individual pairs of partitions than it can for
the whole table join (due to better accuracy of statistics on individual
smaller partitions than on the whole table), resulting in an overall
better plan.  A major limitation currently is that both partitioned tables
must have exactly matching partition bounds.  Also, planning takes
O(N) time in the number of partitions and the current implementation
consumes severe amounts of CPU and memory for each partition pair.
Due to these inefficiencies, partitionwise join mechanism is disabled
by default.  Users who're willing to tolerate the long planning time and
pay the memory cost for a better plan in return can enable it using
the enable_partitionwise_join GUC parameter:
https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN

Partitionwise aggregation

Aggregation is another operation for which it helps to divide it into
smaller independent units.  For example, an aggregation step applied to a
partitioned table can be divided into multiple aggregation steps, one for
each of its partitions and their outputs combined if necessary.  Consider
the following query:
select count(*) from orders;

Older PostgreSQL versions would first scan all partitions of orders and
only then it would start counting, so the plan would look like:
explain (costs off, verbose) select count(*) from orders;
                 QUERY PLAN                  
─────────────────────────────────────────────
 Aggregate
   Output: count(*)
   ->  Append
         ->  Seq Scan on public.orders_0
         ->  Seq Scan on public.orders_dec18
         ->  Seq Scan on public.orders_jan19
(6 rows)

With PostgreSQL 11, it looks like this:
explain (costs off, verbose) select count(*) from orders;
                    QUERY PLAN                     
───────────────────────────────────────────────────
 Finalize Aggregate
   Output: count(*)
   ->  Append
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_0
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_dec18
         ->  Partial Aggregate
               Output: PARTIAL count(*)
               ->  Seq Scan on public.orders_jan19
(12 rows)

So, it would first calculate the partial counts for individual partitions
and finalize the count for the whole query once all partitions have been
processed.
If the aggregation is grouped and the grouping key matches the partition
key, then each partition's aggregation output is the final output for a
given grouping key, because the same key cannot be present in more than
one partition.  So, PostgreSQL 11 can generate a plan like this:
explain (costs off, verbose) select id, count(*) from orders group by id;
                 QUERY PLAN                  
─────────────────────────────────────────────
 Append
   ->  HashAggregate
         Output: orders_0.id, count(*)
         Group Key: orders_0.id
         ->  Seq Scan on public.orders_0
               Output: orders_0.id
   ->  HashAggregate
         Output: orders_dec18.id, count(*)
         Group Key: orders_dec18.id
         ->  Seq Scan on public.orders_dec18
               Output: orders_dec18.id
   ->  HashAggregate
         Output: orders_jan19.id, count(*)
         Group Key: orders_jan19.id
         ->  Seq Scan on public.orders_jan19
               Output: orders_jan19.id
(16 rows)

Similar to joins, the main advantage of planner's ability to apply a given
aggregation operation to individual partitions is that it can make better
decisions regarding the algorithm to use for aggregation.  For example, it
it can prefer hash-based aggregation over sort-based aggregation, because
the former is cheaper but is only considered for smaller input sizes.
The disadvantage is same as the partitionwise join technique, that is,
planning takes O(N) time in the number of partitions and can consume excess
amounts of CPU and memory.  So, it is disabled by default.  Users who're
willing to tolerate that in return for a better plan can enable it using
the enable_partitionwise_aggregate GUC parameter.
https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-AGGREGATE
Summary

To summarize, PostgreSQL 11 takes an important step in the direction of
improving the efficiency of query execution on partitioned data by
implementing optimization techniques based on partitioning metadata.
More work will need to be done to ensure that these techniques are
scalable in terms of the number of partitions, but this seems like a
a good start.