idoqo/GSOC '21 Project Report.md

## GSOC '21 Project Report.md

      
    Raw
  

              GSOC '21 Project Report.md
            
          
    Introduction

The overall goal of the project is to add support for storing histogram statistics on disk in JSON format. As described in the related Jira page, storing histograms in comes with some benefits like:

Allow DBAs to examine (and modify) the histograms. For instance, they can take advantage of existing JSON_* functions like JSON_LENGTH and JSON_EXTRACT to examine the generated histograms.
Improve the histogram precision especially for VARCHAR data types.
Provide a base for other histogram formats such as most-common-values (MCV).

NB: Below, "binary histograms" is used to refer to both single and double precision histograms (i.e SINGLE_PREC_HB and DOUBLE_PREC_HB).
Relevant Links


MDEV-21130 - Tracking JIRA page for this project's issue.
MDEV-26125 - Sample bug demonstrating a limitation in the current binary histogram.
EITS - MariaDB Histogram statistics documentation
PR-1854 - Combined Pull Request link to MariaDB/server

Highlighted Milestones

The project was done in milestones. Asides reducing the scope of each step, the milestones approach also helped me clarify and get familiar with the overall MariaDB codebase, and development workflows to improve productivity.
The milestones are tracked on the JIRA page, but here are some of the highlights.
Add "JSON" as recognized histogram type

My first task was to make the MariaDB server recognize "JSON" as a valid histogram type (i.e making JSON a valid value for the histogram_type system variable).
Some of the related commits are:

Prepare JSON as valid histogram_type
[Add json statistics test and change histogram column type to blob](Prepare JSON as valid histogram_type )

Write JSON objects build histogram

Once we can set the histogram_type to JSON, the next step is to attempt writing JSON objects to the statistics table. This involved setting up a new histogram builder that creates JSON objects instead of bytes (as in the binary histogram).
It was a bit involved as I had to spend some time getting familiar with how JSON arrays are built currently in the code-base.
Commits from this milestone include:

Rough base for json histogram builder
Store bucket bounds and extend test cases for JSON histogram
Fix garbage null values at end of json array elements

Parse the JSON data into a structure that allows lookups

Writing JSON is one part, and reading/parsing them is the other. For JSON histograms, we needed to be able to read the JSON objects from statistics table into an in-memory data structure. In the end, we ended up using the KeyTuple format.
The KeyTuple format is a tuple consisting of both key and value where the length of the key part is dependent on the data type only (and independent of the value that is stored). The format is more completely documented in opt_range.cc.
Commits related to parsing the JSON data include:

Extract json parser functions from class
[Add parser to read JSON array (of histograms) into string vector](Prepare JSON as valid histogram_type )

Error on invalid histograms

Since the histograms are stored in JSON format, it makes sense for users/DBAs to be able to modify the generated statistics. In cases where the histogram is invalidated after being modified (or for any other reason), we need to be able to report the correct error. This milestone added a new error (ER_JSON_HISTOGRAM_PARSE_FAILED) and tests to catch such cases.
The related commit is at:

[Report parse error when parsing JSON histogram fails](Prepare JSON as valid histogram_type )

Implement range and point selectivities

Histogram estimates are computed using range selectivity (when ranges are given to the select statement e.g col between val1 and val2) or point selectivity (when a single point range is provided e.g col=val1). The goal of this milestone was to implement both of these selectivities for JSON histograms.
Some commits related to this milestone include:

Use binary search to compute range selectivity
Replace range_selectivity methods for Histograms and add tests
Refactor histogram point selectivity
Implement point selectivity for JSON histograms

Tests and Checks

Cases that were checked to inspect the behaviours of the generated histograms include:

Re-using the existing statistics test (since a new histogram type should still behave nicely with existing statistics).
Randomized population table based on population sizes of countries in the world e.g below shows the optimizer output for binary vs JSON histograms

---Binary Histograms:---

MariaDB [pop_test]> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.002 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

---JSON Histogram:---
MariaDB [pop_test]> set histogram_type=json;
Query OK, 0 rows affected (0.000 sec)

MariaDB [pop_test]> analyze table generated_pop persistent for all;
+------------------------+---------+----------+-----------------------------------------+
| Table                  | Op      | Msg_type | Msg_text                                |
+------------------------+---------+----------+-----------------------------------------+
| pop_test.generated_pop | analyze | status   | Engine-independent statistics collected |
| pop_test.generated_pop | analyze | status   | OK                                      |
+------------------------+---------+----------+-----------------------------------------+
2 rows in set (6.719 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |     0.78 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.18 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

MariaDB [pop_test]>

The details are further described in MDEV-26125.

More specific scenarios e.g point selectivity cases like col=VAL,  cases where histogram value is a valid JSON but it can't be parsed, etc. The complete test file and its result are in statistics_json.test and statistics_json.result respectively.

Future Work

Some cases to be ironed out include:

Values that occur relatively rarely in the column can "disappear" in the generated histogram, causing them to be estimated wrongly. For example, if a value appears $X times in a table (where $X=n_rows_in_table / rows_in_bucket - 1). In this case, the value doesn't show up in the histogram, causing the estimate provided by the histogram to be wildly off.
Histogram estimates doesn't work for columns of BIT data type.
Fix existing memory leak as reported by mtr.
Support for other histogram types such as Most-Common-Value histograms (e.g https://jira.mariadb.org/browse/MDEV-21131)

Conclusion

I've always wondered about how databases work under the hood, and I'm happy that I got the chance to do that through the 2021 Summer of Code.
I'm hoping to continue working on the feature till it's released, and contribute to the wider MariaDB community as time goes on.
I'm also thankful for Sergey (for his help, and answering questions even when they weren't related to the project).