The overall goal of the project is to add support for storing histogram statistics on disk in JSON format. As described in the related Jira page, storing histograms in comes with some benefits like:
- Allow DBAs to examine (and modify) the histograms. For instance, they can take advantage of existing JSON_* functions like JSON_LENGTH and JSON_EXTRACT to examine the generated histograms.
- Improve the histogram precision especially for VARCHAR data types.
- Provide a base for other histogram formats such as most-common-values (MCV).
NB: Below, "binary histograms" is used to refer to both single and double precision histograms (i.e SINGLE_PREC_HB and DOUBLE_PREC_HB).
- MDEV-21130 - Tracking JIRA page for this project's issue.
- MDEV-26125 - Sample bug demonstrating a limitation in the current binary histogram.
- EITS - MariaDB Histogram statistics documentation
- PR-1854 - Combined Pull Request link to MariaDB/server
The project was done in milestones. Asides reducing the scope of each step, the milestones approach also helped me clarify and get familiar with the overall MariaDB codebase, and development workflows to improve productivity.
The milestones are tracked on the JIRA page, but here are some of the highlights.
My first task was to make the MariaDB server recognize "JSON" as a valid histogram type (i.e making JSON
a valid value for the histogram_type
system variable).
Some of the related commits are:
- Prepare JSON as valid histogram_type
- [Add json statistics test and change histogram column type to blob](Prepare JSON as valid histogram_type )
Once we can set the histogram_type
to JSON, the next step is to attempt writing JSON objects to the statistics table. This involved setting up a new histogram builder that creates JSON objects instead of bytes (as in the binary histogram).
It was a bit involved as I had to spend some time getting familiar with how JSON arrays are built currently in the code-base.
Commits from this milestone include:
- Rough base for json histogram builder
- Store bucket bounds and extend test cases for JSON histogram
- Fix garbage null values at end of json array elements
Writing JSON is one part, and reading/parsing them is the other. For JSON histograms, we needed to be able to read the JSON objects from statistics table into an in-memory data structure. In the end, we ended up using the KeyTuple
format.
The KeyTuple
format is a tuple consisting of both key and value where the length of the key part is dependent on the data type only (and independent of the value that is stored). The format is more completely documented in opt_range.cc
.
Commits related to parsing the JSON data include:
- Extract json parser functions from class
- [Add parser to read JSON array (of histograms) into string vector](Prepare JSON as valid histogram_type )
Since the histograms are stored in JSON format, it makes sense for users/DBAs to be able to modify the generated statistics. In cases where the histogram is invalidated after being modified (or for any other reason), we need to be able to report the correct error. This milestone added a new error (ER_JSON_HISTOGRAM_PARSE_FAILED
) and tests to catch such cases.
The related commit is at:
- [Report parse error when parsing JSON histogram fails](Prepare JSON as valid histogram_type )
Histogram estimates are computed using range selectivity (when ranges are given to the select statement e.g col between val1 and val2
) or point selectivity (when a single point range is provided e.g col=val1
). The goal of this milestone was to implement both of these selectivities for JSON histograms.
Some commits related to this milestone include:
- Use binary search to compute range selectivity
- Replace range_selectivity methods for Histograms and add tests
- Refactor histogram point selectivity
- Implement point selectivity for JSON histograms
Cases that were checked to inspect the behaviours of the generated histograms include:
- Re-using the existing statistics test (since a new histogram type should still behave nicely with existing statistics).
- Randomized population table based on population sizes of countries in the world e.g below shows the optimizer output for binary vs JSON histograms
---Binary Histograms:---
MariaDB [pop_test]> explain extended select * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | generated_pop | ALL | NULL | NULL | NULL | NULL | 1000000 | 21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.002 sec)
MariaDB [pop_test]> explain extended select * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | generated_pop | ALL | NULL | NULL | NULL | NULL | 1000000 | 21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)
---JSON Histogram:---
MariaDB [pop_test]> set histogram_type=json;
Query OK, 0 rows affected (0.000 sec)
MariaDB [pop_test]> analyze table generated_pop persistent for all;
+------------------------+---------+----------+-----------------------------------------+
| Table | Op | Msg_type | Msg_text |
+------------------------+---------+----------+-----------------------------------------+
| pop_test.generated_pop | analyze | status | Engine-independent statistics collected |
| pop_test.generated_pop | analyze | status | OK |
+------------------------+---------+----------+-----------------------------------------+
2 rows in set (6.719 sec)
MariaDB [pop_test]> explain extended select * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | generated_pop | ALL | NULL | NULL | NULL | NULL | 1000000 | 0.78 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)
MariaDB [pop_test]> explain extended select * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | generated_pop | ALL | NULL | NULL | NULL | NULL | 1000000 | 21.18 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)
MariaDB [pop_test]>
The details are further described in MDEV-26125.
- More specific scenarios e.g point selectivity cases like
col=VAL
, cases wherehistogram
value is a valid JSON but it can't be parsed, etc. The complete test file and its result are instatistics_json.test
andstatistics_json.result
respectively.
Some cases to be ironed out include:
- Values that occur relatively rarely in the column can "disappear" in the generated histogram, causing them to be estimated wrongly. For example, if a value appears
$X
times in a table (where$X=n_rows_in_table / rows_in_bucket - 1
). In this case, the value doesn't show up in the histogram, causing the estimate provided by the histogram to be wildly off. - Histogram estimates doesn't work for columns of
BIT
data type. - Fix existing memory leak as reported by
mtr
. - Support for other histogram types such as Most-Common-Value histograms (e.g https://jira.mariadb.org/browse/MDEV-21131)
I've always wondered about how databases work under the hood, and I'm happy that I got the chance to do that through the 2021 Summer of Code.
I'm hoping to continue working on the feature till it's released, and contribute to the wider MariaDB community as time goes on.
I'm also thankful for Sergey (for his help, and answering questions even when they weren't related to the project).