Proposal: TiDB Collocated Optimization (TiDB Hackathon 2022)

Abstract

Collocated optimization is considered an essential optimization for analytical processing in distributed databases, and it is unfortunately lacking in current TiDB. We introduce a new kind of index, namely the "Redistributed Index", to enable collocated optimization in TiDB. With this redistributed index and the consequent collocated optimization, TiDB and its MPP (Massively Parallel Processing) engine TiFlash are able to achieve significant performance improvements.

Our approach enhances the analytical processing ability of TiDB while keeping its HTAP nature (real-time and simple) intact, in an innovative, systematic and non-intrusive way.

Background

Collocated optimization optimizes the query plan by removing data exchange (also known as data shuffle) operators when the data being operated on is already distributed in a certain pattern, i.e., "collocated". Data exchange is costly because it involves every node transferring huge amounts of data to every other node over the network. As a result, the processing easily hits the network bottleneck and suffers lower CPU utilization due to more frequent I/O requests and worse data locality. Therefore, collocated optimization plays an essential role in analytical processing and is widely adopted by most OLAP databases.

For example, given a collocated setup for table t1 and t2 on nodes N1 and N2 as shown in the following figure:

collocated

Both t1 and t2 are distributed by column c1. That is, the rows from t1 and t2 that have the same c1 value reside on the same node.

So for a query with aggregation using c1 as key:

select count(*) from t1 group by c1;

Each node is able to aggregate each c1 group locally, rather than exchanging t1 data with other nodes based on c1 values.

And for another query joining t1 and t2 using t1.c1 and t2.c1:

select count(*) from t1 join t2 on t1.c1 = t2.c1;

Each node is able to join t1 and t2 locally, rather than exchanging t1 and t2 data with other nodes based on t1.c1 and t2.c1 values.

Note that the correctness of this optimization relies on the fact that both t1 and t2 are distributed by column c1. However, this kind of distribution is not always possible for TiDB. As an HTAP database, TiDB restricts data distribution to be by primary key to better serve transactional workloads. This unfortunate restriction effectively kills the chance of collocated optimization in TiDB/TiFlash. As a result, TiDB/TiFlash performs much worse than competitors in scenarios where collocated optimization comes into play. This has become a non-negligible reason that users prefer other products over TiDB.

Rationale

Given that the primary key distribution of data is so critical for transactional workloads that we can't change it, we are instead looking for some sort of "replica". The first thing that comes to mind is the existing "region replica", which replicates data in the unit of a region and synchronizes it using Raft. The problem with this replica is that it obeys the same primary key distribution as the main replica. Suppose we were to detach a region's logical metadata from its physical data. We could then maintain the illusion of the logical region while redistributing only the physical data by another column, making collocated optimization possible. But as a result, a mapping from the logical metadata to the physical data becomes inevitable. This mapping is N-to-N and very likely spans nodes, which makes it both complicated and impractical. The second idea is to invent a new kind of replica for which we are free to specify any distribution. The problem is that it is difficult for the synchronization of this replica to retain both consistency and freshness at the same time. (Think of an ETL process.)

By rethinking the above two ideas, we see that what we are looking for is a new kind of replica for which the following two preconditions hold:

  1. The distribution of the replica must be independent of the primary key;
  2. The consistency and freshness of the replica must be retained.

Ring a bell? It's essentially very similar to something that has long existed in TiDB: the secondary index. A secondary index is itself a replica of the main table, i.e., it has its own regions. The distribution of the secondary index data is independent of the primary key, i.e., it is by the index column(s). The ACID properties of the secondary index are retained by nature, i.e., through transactions.

So if we extend this index by adding some new properties, we end up with this new kind of index:

  1. Like a secondary index, its data is stored independently of the main table;
  2. Like a clustered index, it contains the complete row;
  3. Its consistency and freshness are guaranteed by transactions;
  4. Its data is distributed across multiple nodes by the index column(s), which may be non-PK and/or non-unique, independently of the primary key;
  5. Multiple indexes from different tables could be collocated if they use the same distribution rule.

We name it "Redistributed Index". You can get an intuitive impression from the following figure.

redistributed_index

Redistributed index effectively enables collocated optimization in TiDB. What's more interesting is that it can leverage most of the existing index machinery, keeping the intrusiveness and engineering effort as small as possible.

Design

Index Specifics

DDL

A redistributed index is operated on using the same index DDL statements, except for an extra keyword REDISTRIBUTED that distinguishes it from other kinds of indexes. For example, one can specify one or more unique or non-unique redistributed indexes in a CREATE TABLE statement:

CREATE TABLE t(
  id INT PRIMARY KEY,
  c1 INT REDISTRIBUTED UNIQUE KEY,
  c2 INT REDISTRIBUTED KEY,
  c3 BIGINT,
  ...);

In the above case, one unique redistributed index on column c1 and one non-unique redistributed index on column c2 are created.

Alternatively, one can create redistributed indexes after table creation using the ALTER TABLE ... ADD INDEX statement:

ALTER TABLE t ADD REDISTRIBUTED INDEX (c3);

Other redistributed index operations like renaming and deletion follow the original index syntax.
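
For completeness, a minimal sketch assuming the standard MySQL-compatible index DDL carries over unchanged (the index name idx_c3 is hypothetical):

ALTER TABLE t RENAME INDEX idx_c3 TO idx_c3_new; -- renaming keeps the index data and its distribution intact
DROP INDEX idx_c3_new ON t; -- dropping removes the index data like any other index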

Data Representation

The index data is stored in the underlying KV store, i.e., TiKV, so we have to specify the index data representation in KV format. As mentioned earlier, a redistributed index has properties of both a secondary index and a clustered index. The KV key representation is the same as that of a secondary index: it contains the index column(s), by which the index data is distributed. The KV value representation is the same as that of a clustered index: it contains the entire row, from which we can scan all the columns.

Specifically, for a unique redistributed index, the KV representation is:

  • Key: t{table_id}_i{index_id}_{index_column(s)}
  • Value: {column_1}{column_2}...

For a non-unique redistributed index, the KV representation is:

  • Key: t{table_id}_i{index_id}_{index_column(s)}_{primary_key}
  • Value: {column_1}{column_2}...
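
For illustration, assume the table t from the DDL section has table_id 1, its unique redistributed index on c1 has index_id 1, and its non-unique redistributed index on c2 has index_id 2 (these IDs are made up). A row (id: 7, c1: 104, c2: 5, c3: 2022) would then produce, roughly:

  • Unique index entry on c1: key t1_i1_104, value: the full row (7, 104, 5, 2022)
  • Non-unique index entry on c2: key t1_i2_5_7, value: the full row (7, 104, 5, 2022), where the primary key 7 is appended to the key to keep it unique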

Consistency and Freshness

Like any other kind of index, redistributed index data is automatically synced with the main table. Any manipulation of the table data, i.e., insertion, update or deletion, is reflected to the index data synchronously within the same transaction. The ACID properties of the index data are retained, so consistency and freshness are guaranteed by nature. Moreover, all of the above comes at the cost of almost zero intrusiveness and engineering effort, which we consider a major advantage of this approach.
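
As a minimal sketch using the table t defined in the DDL section (the row values are made up), updating an indexed column rewrites the row and the affected redistributed index entries within one transaction:

BEGIN;
-- Rewrites the row in t, the entry of the redistributed index on c1 (whose key changes),
-- and the entry of the redistributed index on c2 (whose value embeds c1), all in this transaction.
UPDATE t SET c1 = 105 WHERE id = 7;
COMMIT;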

Data Distribution

The essence of collocation is that data with the same index value is located together. With the data representation of the redistributed index described earlier, the following figure shows, as an example, a valid collocation of redistributed index 1 (on column c1) of table 1 (hence the prefix t1_i1_ in the keys) in terms of regions.

redistributed_index_distribution

This collocation example is valid because all the data with the same index value is located on the same node. However, regions may be split, merged or moved. Suppose a new record (PK: 9, c1: 103, c2: bar, c3: 2022) is inserted; it goes into region 3. This insertion effectively triggers region 3 to split, as shown in the following figure.

split

If the newly split region 4 remains on node 1, then everything is fine and the collocation still holds. But if it gets moved to node 2, the collocation breaks: one of the rows with c1 = 103 slips to another node.

The problem is that region split, merge and movement do not always respect collocation. To address this problem, we introduce affinity-based scheduling using PD's placement rules. This scheduling assures that the collocation boundary is always respected during region split, merge and movement. For example, it may choose to split region 3 at the key t1_i1_104_6 from the very beginning, and then move region 4 anywhere. Or alternatively, it may choose to later split region 4 at the key t1_i1_104_6 (say, into region 5), pin region 4 to the same node as region 3, and move region 5 anywhere. Both strategies retain collocation at the level of a single table. This single-table-level collocation allows collocated optimization to eliminate the exchange operator of an aggregation. We'll come back to this later.
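
To make the first strategy concrete (region boundaries other than the split point are elided, since they depend on the figure):

  • Region 3: [..., t1_i1_104_6), which keeps every row with c1 = 103, including the newly inserted t1_i1_103_9, together;
  • Region 4: [t1_i1_104_6, ...), which starts at a collocation boundary and can therefore be moved to any node.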

Collocation Group

Beyond the single-table level, collocation is sometimes required across multiple tables in order to apply collocated optimization to them, such as eliminating the exchange operators on both sides of a join. To specify such a relationship, we introduce the "Collocation Group". A collocation group is specified by users to include multiple redistributed indexes that must further be collocated with each other, rather than only having single-table-level collocation. The following figure shows an example: index 1 of table 1 and index 1 of table 2 are in the same collocation group a, while index 1 of table 3 dangles outside the group.

collocation_group

A collocation group can be operated by a set of specialized DDL statements.

Creating a collocation group a including redistributed indexes t1_i1 and t2_i1:

CREATE COLLOCATION GROUP a(t1_i1, t2_i1);

Adding redistributed indexes t3_i1 and t4_i1 into existing collocation group a:

ADD INTO COLLOCATION GROUP a(t3_i1, t4_i1);

Specifying collocation group a when creating table t with redistributed index on column c1:

CREATE TABLE t(...
  c1 INT REDISTRIBUTED KEY GROUP a,
  ...);

Other collocation group operations like renaming and deletion are similar to those of other database entities; we don't list them here.

A collocation group leverages the aforementioned affinity-based scheduling further. We extend the set of placement rules to assure that the collocation boundaries of all the associated tables are respected during region split, merge and movement. It also leaves scheduling flexibility for indexes outside the group, like index 1 of table 3 in the figure above.

TiFlash Replica of Redistributed Index

All the redistributed index data must make its way to TiFlash so that collocated optimization can actually take effect. The following figure shows how the pieces cooperate to make that happen.

tiflash_replica

Note that we only need to focus on the red parts, where modifications are required; the remaining parts are long-existing mechanisms. When users create a TiFlash replica for table 1, TiDB ingests placement rules to add learner replicas of the regions of index 1 into TiFlash, as well as the rules for the aforementioned collocation requirement. Then, through TiFlash's schema syncing, TiFlash prepares a physical columnar table for table 1's index 1. This is done by creating the physical columnar table for index 1 in TiFlash, using table 1's schema, atomically along with the physical table for table 1 itself. When index data is synced to TiFlash, TiFlash recognizes the _i portion in the data key and routes it to the corresponding physical columnar table for index 1.

Collocated Optimization

Finally, this long journey has come to our ultimate goal - collocated optimization. The following figure shows an example of collocated optimization applied to an aggregation select count(*) from t group by c1, assuming table t has a redistributed index i on column c1.

collocated_opt_agg
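
As a rough textual sketch of the two candidate plan shapes in the figure (operator names and shapes are illustrative, not exact TiDB plan output):

Without the redistributed index:
  Aggregation (group by c1)
    Exchange (hash partition by c1)  -- every node shuffles t's data by c1
      TableScan (t)

With the redistributed index i on c1:
  Aggregation (group by c1)
    IndexScan (t, i)  -- data is already distributed by c1, so no Exchange is needed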

Similarly, the following figure shows an example of collocated optimization applied to a join select * from t1 join t2 on t1.c1 = t2.c2, assuming table t1 has a redistributed index i1 on column c1 and table t2 does not. One thing special about this case is that the Exchange operator on top of t2's TableScan needs to be modified to be aware of the distribution rule of index i1.

collocated_join_one_side

Last, as a special case of join, the following figure shows an example of collocated optimization applied to a join select * from t1 join t2 on t1.c1 = t2.c1, assuming tables t1 and t2 have redistributed indexes i1 and i2 on their respective c1 columns and the two indexes are in the same collocation group.

collocated_opt_join
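
Again as an illustrative sketch (not exact plan output), with i1 and i2 in the same collocation group both Exchange operators can be dropped:

Without redistributed indexes:
  Join (t1.c1 = t2.c1)
    Exchange (hash partition by t1.c1)
      TableScan (t1)
    Exchange (hash partition by t2.c1)
      TableScan (t2)

With collocated indexes i1 and i2:
  Join (t1.c1 = t2.c1)
    IndexScan (t1, i1)  -- both inputs are already collocated by c1
    IndexScan (t2, i2)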

This optimization is done by the following steps:

  1. When generating the data source sub-plan, besides the TableScan, emit an IndexScan candidate for each redistributed index associated with the target table;
  2. When generating the aggregation sub-plan, check whether the distribution properties are satisfied by the data source plan, i.e., whether the data is distributed by the group key(s), to decide whether to insert an Exchange operator. For the candidate with the TableScan, an Exchange operator is inserted, whereas the candidate with the IndexScan on the group key column(s) remains as is;
  3. When generating the join sub-plan:
  • First check whether the distribution properties are satisfied by both sides' data source plans, i.e., whether both are distributed by the join key(s) and belong to the same collocation group, to decide whether to insert Exchange operators. For the candidate with the TableScans, Exchange operators are inserted, whereas the candidate with the IndexScans on the join key column(s) remains as is, i.e., no Exchanges;
  • If the two IndexScans are not in the same collocation group, simply discard this candidate, leaving the remaining one-side-index candidates to be optimized as described next;
  • If there is an IndexScan on only one side, when inserting the Exchange operator for the other side, specify that it uses the redistributed-index-aware exchanging method;
  4. The candidate with the IndexScans wins the competition because of its lower cost, and thus becomes the optimized plan.

Benchmark Result

We ran all 22 TPC-H queries at sf=100 on 4 nodes, and achieved a 71.17% performance improvement by simply adding some redistributed indexes.

TPCH 100 on 4 Nodes

Future Work

  • Reduce the impact on transactional workloads: As a kind of index, a redistributed index has its own overhead: more CPU/memory/disk usage and larger RTT. Though this overhead is considered a reasonable trade-off of using indexes and is often quite acceptable for scenarios like ODS, a redistributed index does impact extremely latency/throughput-sensitive transactional workloads. We will try to alleviate this overhead by:
    • Leveraging the strong scalability of TiKV to reduce the latency/throughput impact;
    • More flexible table/index storage specification: for example, we may store table data only on TiKV and redistributed index data only on TiFlash;
    • More flexible index definition: only the necessary columns need to be included in the index value (KV value).
  • Exploit the "Secondary Clustered Index": A redistributed index is essentially a secondary clustered index with extra distribution guarantees. The "secondary clustered index" itself could be useful enough in some scenarios. For example, we see certain customers complaining about the performance of going back to the main table after an index lookup, where a secondary clustered index could come to their aid.
  • Further exploit collocation group: The aforementioned affinity-based scheduling has the potential to support more kinds of collocation, including table/partition-level primary keys, regular indexes, etc. We don't cover it in this proposal because it is largely orthogonal to this work. But a more advanced collocation group may enable other efficient optimizations in TiDB, so we shall study it further.