zanmato1984/tidb_2020_hackathon_rfc.md

## tidb_2020_hackathon_rfc.md

      
    Raw
  

              tidb_2020_hackathon_rfc.md
            
          
Proposal: GPU Accelerated TiDB For Analytical Queries


Authors: Mods Team (Sun Ruoxi, Xu Fei)
Last updated: 2021/01/11
Discussion at: TiDB 2020 Hackathon RFC

Abstract

Empower TiDB using GPU acceleration techniques to improve performance of CPU-intensive analytical query processing, such as joins, aggregations, etc.
Other than scaling out, which TiDB is already capable of, the ability of scaling up by embracing new generation hardware will be extended.

Background

As an HTAP database, TiDB is able to process analytical queries efficiently. At certain points, however, CPU becomes the bottleneck of processing queries, such as join, aggregation, etc. on large amounts of data.
On the other hand, GPU is rapidly gaining popularity in areas of scientific computing, AI, data processing, etc. It out-performs CPU by orders of magnitude in such areas. There are also GPU-accelerated databases emerging, such as Brytlyt, BlazingDB, Kinetica, OmniSci, and SQream.
Now is the time to explore how GPU can accelerate TiDB for processing CPU-intensive queries.

Proposal


As the above figure shows, the architecture of the proposed approach contains three components:

GPU Relational Algebra Engine (GRAE): programs GPU to execute database operations in a relational algebra fashion
Plan Translator (PT): translates TiDB query plan into a GRAE plan
Execution Adapter (EA): adapts TiDB execution model to the execution of the GRAE plan with the actual data properly fed into

GPU Relational Algebra Engine

GRAE mainly implements database operations, such as relational algebra operators, expressions, etc., by leveraging the existing GPU data processing and programming primitives, such as cuDF, thrust, and CUDA.
The essential consideration of making it an individual component is that there is a programming language boundary between TiDB (in Go) and GPU programming (in C/C++/CUDA), which is difficult to cross. Therefore GRAE is implemented using C++/CUDA. Beyond that, GRAE is built as a library rather than a service so that there is no overhead of RPC and network transmission. In order to finally cross the programming language boundary between Go and C++/CUDA, a group of simplified C APIs are exported for EA to call via Cgo.
Data is fed into GRAE via the C APIs. It is assumed by GRAE that data is resident in host memory and of the Arrow layout. Data processed by TiDB operators is organized as Chunks, which are memory resident and of Arrow-compatible layout, so it can be fed into GRAE in a zero-copy way.
Plan Translater

PT translates the optimized TiDB query plan, i.e., with optimal join orders and columns properly pruned, etc., into a GRAE plan that is later actually executed by EA.
PT is implemented within TiDB using Go, specifically as another pass after regular optimizer. The translated GRAE plan reflects how the query should be executed by GPU, i.e., when to execute which operator in GRAE, how much data from which table to feed into the current operator in GRAE, etc.
For a given TiDB query plan, PT translates the operators that are implemented by GRAE into GRAE operators, which are organized into a GRAE plan. And the rest operators, i.e., unimplemented by GRAE, are left unchanged. As a result, the final query plan is a fused one containing both the CPU portion, i.e., operators that are not translated by PT, and the GPU portion, i.e., the GRAE plan.
Execution Adapter

EA adapts TiDB execution model, i.e., the organization of various executors in TiDB, to the execution of the aforementioned GRAE plan.
EA is implemented within TiDB using Go, specifically as an individual executor like any other kinds of executors for various TiDB legacy operators. It traverses the GRAE plan to execute GRAE operators by actually calling GRAE APIs via Cgo.
During the query execution, the interoperation among EA and other executors is in original TiDB manner, i.e., EA pulls data from downstream executors, processes it in GRAE, and pushes data to upstream executors. Within EA, data in-and-out GRAE is all through parameters of GRAE API calls.


Implementation

A subset of TiDB operators and expressions are chosen to implement using CUDA, so that the test set chosen in Testing Plan can be fully executed on GPU. These operators and expressions include:

Join
Aggregate
Filter
Project
Arithmetic expressions
Logical expressions
Date related expressions
Etc.


Testing Plan

TPCH is used to benchmark the performance improvement of GPU against TiDB’s legacy CPU implementation. Rather than the complete 22 queries of TPCH, only the most CPU-intensive ones, such as q5, q9, q17 and q18, are chosen due to engineering effort and the limited time. Besides, some customized simple queries are made based on the TPCH data set to show a more straightforward result of the performance gain for a specific kind of operator.
As disk IO takes a large amount of the whole execution time, the performance improvement of GPU against CPU will be greatly amortized. TiDB’s Coprocessor cache is used as a workaround to bypass disk IO. In this way, the Coprocessor cache acts as a memory store, making TiDB a “hypothetical” in-memory database.

Open Issues

The following topics are left open:


Though this project mainly focuses on GPU accelerated TiDB, the core component, i.e., GRAE, is designed in a way that, it is potentially to be easily adopted by other components of TiDB ecosystem, such as TiKV, and TiFlash, by either FFI of Rust or native C++.


As mentioned in Testing Plan, this project assumes TiDB a “hypothetical” in-memory database with some workarounds. It is hoped that someday TiDB will evolve a true memory store which will be a natural fit for GPU.


This project uses CUDA as the GPU programming model due to it’s programmer friendly and rich in documentation. This limits the hardware choice, i.e., only NVIDIA cards can be used. Meanwhile other GPU programming models exist, such as OpenCL, and have a more flexible hardware requirement.


Other GPU acceleration techniques exist, such as GPUDirect to bypass the host memory between network/disk and GPU memory, and multi-GPU to multiplex the acceleration.