@uds5501
Last active January 6, 2024 20:56
mariadb EXPLAIN rough notes

5th Jan 2024

I will be updating this gist as I dig deeper into the optimization planner. My research is strictly scoped to SELECT queries for now.

Running an EXPLAIN query produces output like this:

MariaDB [sample_db]> explain select * from sample_table;
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+
| id   | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra |
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+
|    1 | SIMPLE      | sample_table | ALL  | NULL          | NULL | NULL    | NULL | 4    |       |
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+

The entry point for any SELECT query is defined in sql_select.cc:

/**
  An entry point to single-unit select (a select without UNION).

  @param thd                  thread handler
  @param rref_pointer_array   a reference to ref_pointer_array of
                              the top-level select_lex for this query
  @param tables               list of all tables used in this query.
                              The tables have been pre-opened.
  @param fields               list of items in SELECT list of the top-level
                              select
                              e.g. SELECT a, b, c FROM t1 will have Item_field
                              for a, b and c in this list.
  @param conds                top level item of an expression representing
                              WHERE clause of the top level select
  @param og_num               total number of ORDER BY and GROUP BY clauses
                              arguments
  @param order                linked list of ORDER BY arguments
  @param group                linked list of GROUP BY arguments
  @param having               top level item of HAVING expression
  @param proc_param           list of PROCEDUREs
  @param select_options       select options (BIG_RESULT, etc)
  @param result               an instance of result set handling class.
                              This object is responsible for sending result
                              set rows to the client or inserting them
                              into a table.
  @param select_lex           the only SELECT_LEX of this query
  @param unit                 top-level UNIT of this query
                              UNIT is an artificial object created by the
                              parser for every SELECT clause.
                              e.g.
                              SELECT * FROM t1 WHERE a1 IN (SELECT * FROM t2)
                              has 2 unions.

  @retval
    FALSE  success
  @retval
    TRUE   an error
*/

bool
mysql_select(THD *thd, TABLE_LIST *tables, List<Item> &fields, COND *conds,
             uint og_num, ORDER *order, ORDER *group, Item *having,
             ORDER *proc_param, ulonglong select_options, select_result *result,
             SELECT_LEX_UNIT *unit, SELECT_LEX *select_lex)

On first invocation, it sets up linkage, determines whether the table is a derived table, creates a JOIN data structure (not sure why yet), and then prepares the query. Once the query is prepared, it works on optimizing that JOIN structure.

The actual query optimization happens in greedy_search() and static enum_best_search best_extension_by_limited_search(), where the former is a greedy search (sherlock...) and the latter is an exhaustive search up to a certain depth. Will be digging deeper into the exhaustive search tomorrow.
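To make that division of labour concrete, here is a toy sketch of the idea (my own illustration, not MariaDB's actual code): the depth-limited exhaustive part scores every ordering of the remaining tables up to `search_depth` levels, and the greedy loop commits only the first table of the best extension before repeating from the new prefix. The cost model here (sum of intermediate result sizes) and all identifiers are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Illustrative stand-in, not the engine's real structures.
struct Table { const char *name; double rows; };

// Depth-limited exhaustive search: try every ordering of the remaining
// tables up to `depth` levels, return the cheapest extension cost, and
// record which table the best plan picks first. Cost of joining a table
// is modeled as the size of the intermediate result it produces.
static double best_extension(const std::vector<Table> &tables,
                             std::vector<bool> &used,
                             double prefix_rows, unsigned depth,
                             int *first_pick) {
    if (depth == 0) return 0.0;
    double best = std::numeric_limits<double>::infinity();
    bool any = false;
    for (std::size_t i = 0; i < tables.size(); ++i) {
        if (used[i]) continue;
        any = true;
        used[i] = true;
        double step = prefix_rows * tables[i].rows;  // intermediate size
        double cost = step + best_extension(tables, used, step,
                                            depth - 1, nullptr);
        used[i] = false;
        if (cost < best) {
            best = cost;
            if (first_pick) *first_pick = (int)i;
        }
    }
    return any ? best : 0.0;
}

// Greedy search: at each step run the limited exhaustive search, but
// commit only its first choice, then repeat from the extended prefix.
std::vector<const char *> greedy_join_order(std::vector<Table> tables,
                                            unsigned search_depth) {
    std::vector<bool> used(tables.size(), false);
    std::vector<const char *> order;
    double prefix_rows = 1.0;
    for (std::size_t step = 0; step < tables.size(); ++step) {
        int pick = -1;
        best_extension(tables, used, prefix_rows, search_depth, &pick);
        used[pick] = true;
        prefix_rows *= tables[pick].rows;
        order.push_back(tables[pick].name);
    }
    return order;
}
```

The point of the split is cost control: a full exhaustive search over n tables is O(n!), while the greedy loop bounds the work to roughly n searches of depth `search_depth`.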

6th Jan 2024

Reading the QEP implementation was a rather humbling exercise and I admit, I can't really come up with something more elegant than what they already have (and I could grasp maybe 10% of it).

The overall idea about running a SELECT query is:

  1. Initializing a JOIN struct (or a UNION struct, depending on the query).
  2. Prepare the join (gather the required statistics like table, column and index stats, which in turn are generally pre-computed; you'll read more about them later in the blog).
  3. We try and optimize this JOIN
    1. The first layer of optimization (JOIN::optimize_inner()) is performed, which roughly does the following:
      1. Merges all the derived tables into this layer of select.
      2. Transforms the IN predicates into subqueries.
      3. Converts subqueries to semi-joins wherever applicable. See convert_join_subqueries_to_semijoins() to understand what I mean here.
      4. Creates a bitmap of tables used from the SELECT list. (They loooooove bitmaps.)
      5. Creates an EXPLAIN structure if one doesn't already exist (it is built anyway and kept until the JOIN is dropped, even if you never explicitly run an EXPLAIN query).
      6. Does some optimizations for nested joins & constant subqueries.
      7. Evaluates all the EQ conditions in the WHERE and HAVING clauses.
      8. Transforms the IN queries into EQ queries and optimizes all the conditions.
      9. ....... a stretch where I did not understand what the planner was doing for quite a while.
      10. Calls make_join_statistics(), which calculates the best possible join structure.
      11. FINALLY MARKS THE STATE AS OPTIMIZATION_PHASE_1_DONE! (PS: by default this isn't required; it will set OPTIMIZATION_DONE instead.)
    2. The second layer of optimization is triggered if the state is set to OPTIMIZATION_PHASE_1_DONE.
  4. We finally execute the planned JOIN.
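The phase bookkeeping from steps 3.1.11 and 3.2 can be sketched as a tiny state machine. The enum value names mirror the notes above; the struct, methods, and flag are purely my illustration of the described flow, not MariaDB's real types.

```cpp
#include <cassert>

// Phase names mirror the notes above; everything else is illustrative.
enum OptState { NOT_OPTIMIZED, OPTIMIZATION_PHASE_1_DONE, OPTIMIZATION_DONE };

struct ToyJoin {
    OptState state = NOT_OPTIMIZED;
    bool two_phase = false;  // whether a separate second phase is requested

    // Roughly JOIN::optimize_inner(): merge derived tables, rewrite
    // subqueries, gather join statistics, then mark the phase. By
    // default it jumps straight to OPTIMIZATION_DONE.
    void optimize_inner() {
        state = two_phase ? OPTIMIZATION_PHASE_1_DONE : OPTIMIZATION_DONE;
    }

    // The second layer only runs when phase 1 finished explicitly.
    void optimize_stage2() {
        if (state == OPTIMIZATION_PHASE_1_DONE)
            state = OPTIMIZATION_DONE;
    }

    bool ready_to_execute() const { return state == OPTIMIZATION_DONE; }
};
```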

However, after reading this internal blog about conditional selectivity (how many rows of a table do we estimate a join / condition evaluation will return?), my eyes darted towards the statistics elements of the engine with a new goal - messing around with histograms, which I'll explain tomorrow.
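To show what selectivity estimation from a histogram looks like, here is a toy equi-height histogram: each bucket holds roughly the same number of rows, so the estimated selectivity of `col <= x` is simply the fraction of buckets whose upper bound is at or below x. This mimics the idea behind MariaDB's histogram-based statistics only; the struct and its layout are assumptions for illustration, not the engine's actual format.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy equi-height histogram: bounds[i] is the upper bound of bucket i,
// and every bucket holds roughly the same number of rows.
struct EquiHeightHistogram {
    std::vector<double> bounds;  // ascending bucket upper bounds

    // Build from a column sample using `buckets` equal-count buckets.
    static EquiHeightHistogram build(std::vector<double> values,
                                     std::size_t buckets) {
        std::sort(values.begin(), values.end());
        EquiHeightHistogram h;
        for (std::size_t b = 1; b <= buckets; ++b) {
            // Index of the last value falling into bucket b.
            std::size_t idx = b * values.size() / buckets - 1;
            h.bounds.push_back(values[idx]);
        }
        return h;
    }

    // Estimated selectivity of (col <= x): each bucket contributes
    // 1/buckets of the rows, so count the bounds at or below x.
    double selectivity_le(double x) const {
        std::size_t full =
            std::upper_bound(bounds.begin(), bounds.end(), x) - bounds.begin();
        return (double)full / bounds.size();
    }
};
```

A real implementation would also interpolate inside the partially covered bucket instead of rounding down to whole buckets, but the principle is the same: the histogram turns a condition into an estimated row fraction that feeds the join cost model.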

7th Jan 2024

First, why are histograms even present in a database? Amazing question, and thankfully MariaDB has an answer here -> Histogram Based Statistics

And how do you play with them? The basics are covered here -> DECODE_HISTOGRAM

Will be posting a separate thread about hacking and playing with histograms.

Appendix

Will add something here if I find anything insightful that isn't already present in the docs :)
