Retour Overview Query Optimization

SQL Optimization — Retour database, events overview

@distantnative asked me (well, anyone) to optimize a SQL query which generates an overview of Retour events.

#Programming/SQL

Original Query

with recursive dates as (
    select "2019-10-01" as date
    union all
    select date(date, "+1 day") from dates where date < "2020-02-29"
)
SELECT
    dates.date,
    COUNT(redirect) AS redirected,
    COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
    COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
    dates
LEFT JOIN
    records
ON
    strftime("%Y-%m-%d", dates.date) = strftime("%Y-%m-%d", records.date)
WHERE
    strftime("%s", dates.date) >= strftime("%s", "2019-10-01")
AND
    strftime("%s", dates.date) <= strftime("%s", "2020-02-29")
GROUP BY
    strftime("%Y-%m-%d", dates.date)
ORDER BY
    strftime("%Y-%m-%d", dates.date);

With the following demo database: retour.sqlite

Analyzing the Query

The best tool/command for optimizing SQLite queries is definitely explain query plan: just prepend it to the query you want to optimize.
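As a minimal sketch of the usage (any statement works; here a simple grouped count over the records table from the demo database):

explain query plan
select date, count(path) from records group by date;

Running it on the original query above, we get the following output: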

id	parent	notused		detail
2	0	0		CO-ROUTINE 2
5	2	0		SETUP
6	5	0		SCAN CONSTANT ROW
17	2	0		RECURSIVE STEP
18	17	0		SCAN TABLE dates
34	0	0		SCAN SUBQUERY 2
43	0	0		SCAN TABLE records
54	0	0		USE TEMP B-TREE FOR GROUP BY

Looking at the query plan, we can see there’s a lot of scanning (SCAN TABLE is basically “read every value on disk” and thus the operation to look out for when optimizing) and an expensive recursive step.

Looking further at the query itself, the majority of the function calls (strftime) and the temp table generation are there only to get the date portion of the event time.
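As an aside, that date extraction is just a truncation of the stored timestamp string; a quick sketch of what strftime does there (assuming the timestamps are stored in a format strftime understands, as the original query implies):

select strftime('%Y-%m-%d', '2020-02-29 14:59:00');
-- returns: 2020-02-29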

Adding a date_cache

The first try at optimizing the query, then, should be to create a new table with the same records but an additional column (tentatively named date_cache), which will hold just the date portion of the time. I've used TablePlus for this, but the generated query is this:

CREATE TABLE "records_with_cache" (
    "id" integer,
    "date" text NOT NULL,
    "date_cache" varchar(10) NOT NULL,
    "path" text NOT NULL,
    "referrer" text,
    "redirect" text,
    "wasResolved" integer,
    PRIMARY KEY (id)
);

Then, we can copy all records and generate the cached date value by running the following query:

insert into records_with_cache (id, date, date_cache, path, referrer, redirect, wasResolved)
select id, date, strftime("%Y-%m-%d", date), path, referrer, redirect, wasResolved
from records;

Thus, we can remove from the original SQL query:

  • the generation of the dates between the start date and the end date
  • the left join against the temporary recursive dates table
  • all the strftime function calls

The result looks like this and runs in ~13ms:

SELECT
    date_cache,
    COUNT(redirect) AS redirected,
    COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
    COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
    records_with_cache
WHERE
    date_cache >= "2019-10-01"
AND
    date_cache <= "2020-02-29"
GROUP BY
    date_cache
ORDER BY
    date_cache;

Further optimization

If we run explain query plan again on the new query, we get the following output:

id	parent	notused	detail
7	0	0	SCAN TABLE records_with_cache
13	0	0	USE TEMP B-TREE FOR GROUP BY

Looking at the query and at the results, we can plainly see that both WHERE and GROUP BY use the date_cache column, so we add an index on it to prevent the full table scan and the TEMP B-TREE step, which is the second most expensive operation.

CREATE INDEX "date_cache" ON "records_with_cache" ("date_cache");

And then the query plan gives us:

8 0 0 SEARCH TABLE records_with_cache USING INDEX date_cache (date_cache>? AND date_cache<?)

So: no full disk read and no temporary b-tree for the GROUP BY. The resulting speed for the same query is ~2ms.
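To reproduce timings like these locally, the sqlite3 command-line shell has a .timer dot-command (a side note: shown here with a simplified version of the query, and the exact numbers will of course differ per machine):

.timer on
-- the shell now prints real/user/sys run time after each statement
SELECT date_cache, COUNT(path) FROM records_with_cache GROUP BY date_cache;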

Empty values

It’s at this point that I realized that the new query doesn’t generate empty values (i.e. dates that haven’t happened yet, or on which nothing happened). One solution would be to generate those in PHP and do an array_merge with the formatted date as the key (so the rows from the database overwrite the empty placeholders, while the dates without any records keep their empty values); the other is to bring the pre-generated dates table back and keep it on the left side of the join, joining records_with_cache into it, because SQLite has neither RIGHT nor FULL joins.

with recursive dates as (
    select "2019-10-01" as date
    union all
    select date(date, "+1 day") from dates where date < "2020-02-29"
)
SELECT
    dates.date,
    COUNT(redirect) AS redirected,
    COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
    COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
    dates
LEFT JOIN
    records_with_cache
ON
    dates.date = records_with_cache.date_cache
WHERE
    dates.date >= "2019-10-01"
AND
    dates.date <= "2020-02-29"
GROUP BY
    dates.date
ORDER BY
    dates.date;

This again runs in about 13ms, which is sad (compared to 2ms!), but still much better than the original ~400ms.

Note to remember

If we end up using the union/join in SQL method to generate the series of dates, consider dropping the index on the records_with_cache table: it slows the query down to about 25ms, but every insert/update/delete gets slightly faster (on my local machine, by about 10µs, from ~90µs to ~80µs). This might or might not be negligible in your case, but remember that the insert runs every time the plugin activates (a client hits a nonexistent or redirected page).
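If we do go that route, removing the index is a one-liner (using the index name created above):

DROP INDEX "date_cache";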
