Skip to content

Instantly share code, notes, and snippets.

@idoqo
Last active September 7, 2022 09:26
Show Gist options
  • Save idoqo/ac520943e53f64034beaed4258b62ba5 to your computer and use it in GitHub Desktop.
Save idoqo/ac520943e53f64034beaed4258b62ba5 to your computer and use it in GitHub Desktop.

Introduction

The overall goal of the project is to add support for storing histogram statistics on disk in JSON format. As described in the related Jira page, storing histograms in comes with some benefits like:

  • Allow DBAs to examine (and modify) the histograms. For instance, they can take advantage of existing JSON_* functions like JSON_LENGTH and JSON_EXTRACT to examine the generated histograms.
  • Improve the histogram precision especially for VARCHAR data types.
  • Provide a base for other histogram formats such as most-common-values (MCV).

NB: Below, "binary histograms" is used to refer to both single and double precision histograms (i.e SINGLE_PREC_HB and DOUBLE_PREC_HB).

Relevant Links

  • MDEV-21130 - Tracking JIRA page for this project's issue.
  • MDEV-26125 - Sample bug demonstrating a limitation in the current binary histogram.
  • EITS - MariaDB Histogram statistics documentation
  • PR-1854 - Combined Pull Request link to MariaDB/server

Highlighted Milestones

The project was done in milestones. Asides reducing the scope of each step, the milestones approach also helped me clarify and get familiar with the overall MariaDB codebase, and development workflows to improve productivity.

The milestones are tracked on the JIRA page, but here are some of the highlights.

Add "JSON" as recognized histogram type

My first task was to make the MariaDB server recognize "JSON" as a valid histogram type (i.e making JSON a valid value for the histogram_type system variable).

Some of the related commits are:

Write JSON objects build histogram

Once we can set the histogram_type to JSON, the next step is to attempt writing JSON objects to the statistics table. This involved setting up a new histogram builder that creates JSON objects instead of bytes (as in the binary histogram).

It was a bit involved as I had to spend some time getting familiar with how JSON arrays are built currently in the code-base.

Commits from this milestone include:

Parse the JSON data into a structure that allows lookups

Writing JSON is one part, and reading/parsing them is the other. For JSON histograms, we needed to be able to read the JSON objects from statistics table into an in-memory data structure. In the end, we ended up using the KeyTuple format.

The KeyTuple format is a tuple consisting of both key and value where the length of the key part is dependent on the data type only (and independent of the value that is stored). The format is more completely documented in opt_range.cc.

Commits related to parsing the JSON data include:

Error on invalid histograms

Since the histograms are stored in JSON format, it makes sense for users/DBAs to be able to modify the generated statistics. In cases where the histogram is invalidated after being modified (or for any other reason), we need to be able to report the correct error. This milestone added a new error (ER_JSON_HISTOGRAM_PARSE_FAILED) and tests to catch such cases.

The related commit is at:

  • [Report parse error when parsing JSON histogram fails](Prepare JSON as valid histogram_type )

Implement range and point selectivities

Histogram estimates are computed using range selectivity (when ranges are given to the select statement e.g col between val1 and val2) or point selectivity (when a single point range is provided e.g col=val1). The goal of this milestone was to implement both of these selectivities for JSON histograms.

Some commits related to this milestone include:

Tests and Checks

Cases that were checked to inspect the behaviours of the generated histograms include:

  • Re-using the existing statistics test (since a new histogram type should still behave nicely with existing statistics).
  • Randomized population table based on population sizes of countries in the world e.g below shows the optimizer output for binary vs JSON histograms

---Binary Histograms:---


MariaDB [pop_test]> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.002 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

---JSON Histogram:---

MariaDB [pop_test]> set histogram_type=json;
Query OK, 0 rows affected (0.000 sec)

MariaDB [pop_test]> analyze table generated_pop persistent for all;
+------------------------+---------+----------+-----------------------------------------+
| Table                  | Op      | Msg_type | Msg_text                                |
+------------------------+---------+----------+-----------------------------------------+
| pop_test.generated_pop | analyze | status   | Engine-independent statistics collected |
| pop_test.generated_pop | analyze | status   | OK                                      |
+------------------------+---------+----------+-----------------------------------------+
2 rows in set (6.719 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |     0.78 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

MariaDB [pop_test]> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.18 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)

MariaDB [pop_test]>

The details are further described in MDEV-26125.

  • More specific scenarios e.g point selectivity cases like col=VAL, cases where histogram value is a valid JSON but it can't be parsed, etc. The complete test file and its result are in statistics_json.test and statistics_json.result respectively.

Future Work

Some cases to be ironed out include:

  • Values that occur relatively rarely in the column can "disappear" in the generated histogram, causing them to be estimated wrongly. For example, if a value appears $X times in a table (where $X=n_rows_in_table / rows_in_bucket - 1). In this case, the value doesn't show up in the histogram, causing the estimate provided by the histogram to be wildly off.
  • Histogram estimates doesn't work for columns of BIT data type.
  • Fix existing memory leak as reported by mtr.
  • Support for other histogram types such as Most-Common-Value histograms (e.g https://jira.mariadb.org/browse/MDEV-21131)

Conclusion

I've always wondered about how databases work under the hood, and I'm happy that I got the chance to do that through the 2021 Summer of Code.

I'm hoping to continue working on the feature till it's released, and contribute to the wider MariaDB community as time goes on.

I'm also thankful for Sergey (for his help, and answering questions even when they weren't related to the project).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment