@distantnative asked me (well, anyone) to optimize a SQL query which generates an overview of Retour events.
#Programming/SQL
with recursive dates as (
select "2019-10-01" as date
union all
select date(date, "+1 day") from dates where date < "2020-02-29"
)
SELECT
dates.date,
COUNT(redirect) AS redirected,
COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
dates
LEFT JOIN
records
ON
strftime("%Y-%m-%d", dates.date) = strftime("%Y-%m-%d", records.date)
WHERE
strftime("%s", dates.date) >= strftime("%s", "2019-10-01")
AND
strftime("%s", dates.date) <= strftime("%s", "2020-02-29")
GROUP BY
strftime("%Y-%m-%d", dates.date)
ORDER BY
strftime("%Y-%m-%d", dates.date);
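As a quick sanity check of what the recursive CTE contributes, here's a minimal Python sketch (stdlib sqlite3; date range taken from the query above) that runs just the date-generating part on its own:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Run only the recursive CTE: it emits one row per day,
# starting at 2019-10-01 and stopping once 2020-02-29 is reached.
rows = con.execute("""
    with recursive dates as (
        select '2019-10-01' as date
        union all
        select date(date, '+1 day') from dates where date < '2020-02-29'
    )
    select date from dates
""").fetchall()

print(len(rows))  # 152 days: Oct (31) + Nov (30) + Dec (31) + Jan (31) + Feb (29)
print(rows[0][0], rows[-1][0])  # 2019-10-01 2020-02-29
```

Every one of those 152 rows has to be produced and joined on every run of the report, which is what the query plan below flags as the recursive step.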
With the following demo database: retour.sqlite
The best tool/command to optimize SQLite queries is definitely explain query plan
(just prepend it to the query you want to optimize). Running it with the original query, we get the following output:
id parent notused detail
2 0 0 CO-ROUTINE 2
5 2 0 SETUP
6 5 0 SCAN CONSTANT ROW
17 2 0 RECURSIVE STEP
18 17 0 SCAN TABLE dates
34 0 0 SCAN SUBQUERY 2
43 0 0 SCAN TABLE records
54 0 0 USE TEMP B-TREE FOR GROUP BY
Looking at the query plan, we can see there’s a lot of scanning (SCAN TABLE
basically means “read every value on disk” and is thus the operation you’re looking for when optimizing) and an expensive recursive step.
Looking further at the query itself, the majority of the function calls (strftime) and the temp table generation are there to extract the date portion of the event time.
The first attempt at optimizing the query, then, is to create a new table with the same records but an additional column (tentatively named date_cache
), which holds just the date portion of the time. I used TablePlus for this, but the generated query is this:
CREATE TABLE "records_with_cache" (
"id" integer,
"date" text NOT NULL,
"date_cache" varchar(10) NOT NULL,
"path" text NOT NULL,
"referrer" text,
"redirect" text,
"wasResolved" integer,
PRIMARY KEY (id)
);
Then, we can copy all records and generate the cached date value by running the following query (listing the columns explicitly, since a bare select * maps columns by position and date_cache is the third column of the new table, not the last):
insert into records_with_cache (id, date, date_cache, path, referrer, redirect, wasResolved)
select id, date, strftime("%Y-%m-%d", date), path, referrer, redirect, wasResolved from records
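The copy step can be checked end to end with a Python sketch; the records schema below is inferred from the columns the queries use, so the real table may differ:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE records (
        id integer PRIMARY KEY,
        date text NOT NULL,
        path text NOT NULL,
        referrer text,
        redirect text,
        wasResolved integer
    );
    CREATE TABLE records_with_cache (
        id integer PRIMARY KEY,
        date text NOT NULL,
        date_cache varchar(10) NOT NULL,
        path text NOT NULL,
        referrer text,
        redirect text,
        wasResolved integer
    );
    INSERT INTO records (date, path, redirect, wasResolved)
    VALUES ('2019-10-05 13:37:00', '/old-page', '/new-page', NULL);
""")

# Copy every record, deriving date_cache from the full timestamp.
# Columns are listed explicitly so the new column cannot land in the
# wrong position when the two tables' column orders differ.
con.execute("""
    insert into records_with_cache (id, date, date_cache, path, referrer, redirect, wasResolved)
    select id, date, strftime('%Y-%m-%d', date), path, referrer, redirect, wasResolved
    from records
""")

print(con.execute("select date_cache from records_with_cache").fetchone()[0])
# 2019-10-05
```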
Thus, we can remove the following from the original SQL query:
- generation of the dates between start date and end date
- the left join between the temp recursive table
- all the strftime function calls
The result looks like this and runs in ~13ms:
SELECT
date_cache,
COUNT(redirect) AS redirected,
COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
records_with_cache
WHERE
date_cache >= "2019-10-01"
AND
date_cache <= "2020-02-29"
GROUP BY
date_cache
ORDER BY
date_cache;
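The counting trick in the SELECT relies on COUNT(expr) only counting non-NULL values: COUNT(wasResolved + redirect) counts rows where both columns are set, because NULL plus anything is NULL. A Python sketch with four made-up events on one day shows the three buckets:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE records_with_cache (
        id integer PRIMARY KEY,
        date text NOT NULL,
        date_cache varchar(10) NOT NULL,
        path text NOT NULL,
        referrer text,
        redirect text,
        wasResolved integer
    );
    -- Four hypothetical events on one day:
    --   a plain redirect, a resolved 404, an unresolved 404,
    --   and a redirect that was also marked resolved.
    INSERT INTO records_with_cache (date, date_cache, path, redirect, wasResolved) VALUES
        ('2019-10-05 10:00:00', '2019-10-05', '/a', '/new-a', NULL),
        ('2019-10-05 11:00:00', '2019-10-05', '/b', NULL,     1),
        ('2019-10-05 12:00:00', '2019-10-05', '/c', NULL,     NULL),
        ('2019-10-05 13:00:00', '2019-10-05', '/d', '/new-d', 1);
""")

row = con.execute("""
    SELECT
        date_cache,
        COUNT(redirect) AS redirected,
        COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
        COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
    FROM records_with_cache
    GROUP BY date_cache
""").fetchone()

print(row)  # ('2019-10-05', 2, 1, 1): the resolved redirect counts as "redirected"
```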
If we run explain query plan
again, we get the following output on the new query:
id parent notused detail
7 0 0 SCAN TABLE records_with_cache
13 0 0 USE TEMP B-TREE FOR GROUP BY
Looking at the query and at the results, we can plainly see that both WHERE
and GROUP BY
use the date_cache
column, so we add an index on it to prevent the full disk read and the TEMP B-TREE
step, the second most expensive operation.
CREATE INDEX "date_cache" ON "records_with_cache" ("date_cache");
And then the query plan gives us:
8 0 0 SEARCH TABLE records_with_cache USING INDEX date_cache (date_cache>? AND date_cache<?)
So, no full disk read and no temporary b-tree for the group by. The resulting speed for the same query set is 2ms.
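You can watch the plan flip from SCAN to SEARCH without leaving Python; a sketch against an empty stand-in table (the exact wording of the plan details varies between SQLite versions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE records_with_cache (
        id integer PRIMARY KEY,
        date text NOT NULL,
        date_cache varchar(10) NOT NULL,
        path text NOT NULL,
        referrer text,
        redirect text,
        wasResolved integer
    )
""")

query = """
    SELECT date_cache, COUNT(path)
    FROM records_with_cache
    WHERE date_cache >= '2019-10-01' AND date_cache <= '2020-02-29'
    GROUP BY date_cache
"""

def plan(con, q):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail).
    return [row[3] for row in con.execute("explain query plan " + q)]

before = plan(con, query)  # a SCAN of the whole table, plus a temp b-tree
con.execute('CREATE INDEX "date_cache" ON "records_with_cache" ("date_cache")')
after = plan(con, query)   # a SEARCH using the new index

print(before)
print(after)
```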
It’s at this point that I realized that the new query doesn’t generate empty values (i.e. dates that haven’t happened yet, or on which nothing happened). One solution would be to generate those in PHP and do an array_merge
with the formatted date as the key (so the rows from the database overwrite the empty values while the untouched empty values stay intact); the other is to bring the pre-generated date table back and flip the order of the joins, so that we join records_with_cache
into the generated dates table, since SQLite has neither right nor full joins.
with recursive dates as (
select "2019-10-01" as date
union all
select date(date, "+1 day") from dates where date < "2020-02-29"
)
SELECT
dates.date,
COUNT(redirect) AS redirected,
COUNT(wasResolved) - COUNT(wasResolved + redirect) AS resolved,
COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect) AS failed
FROM
dates
LEFT JOIN
records_with_cache
ON
dates.date = records_with_cache.date_cache
WHERE
dates.date >= "2019-10-01"
AND
dates.date <= "2020-02-29"
GROUP BY
dates.date
ORDER BY
dates.date;
This again runs in about 13ms, which is sad (compared to 2ms!), but still much better than the original query’s ~400ms.
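For completeness, the first alternative, generating the empty days in application code, looks roughly like this Python sketch (standing in for the PHP array_merge version; the shape of the per-day entries is an assumption):

```python
import sqlite3
from datetime import date, timedelta

def empty_days(start, end):
    # One zeroed entry per day, keyed by ISO date. The PHP version
    # builds the same structure and merges the DB rows over it
    # with array_merge.
    days, d = {}, start
    while d <= end:
        days[d.isoformat()] = {"redirected": 0, "resolved": 0, "failed": 0}
        d += timedelta(days=1)
    return days

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records_with_cache"
            " (date_cache text, path text, redirect text, wasResolved integer)")
con.execute("INSERT INTO records_with_cache VALUES ('2019-10-05', '/a', '/new-a', NULL)")

stats = empty_days(date(2019, 10, 1), date(2020, 2, 29))
# DB rows overwrite the zeroed placeholders; days without events stay zeroed.
for day, redirected, resolved, failed in con.execute("""
    SELECT date_cache,
           COUNT(redirect),
           COUNT(wasResolved) - COUNT(wasResolved + redirect),
           COUNT(path) - COUNT(wasResolved + redirect) - COUNT(redirect)
    FROM records_with_cache GROUP BY date_cache
"""):
    stats[day] = {"redirected": redirected, "resolved": resolved, "failed": failed}

print(len(stats), stats["2019-10-05"])
```

This keeps the fast 2ms indexed query and moves the empty-day bookkeeping out of SQL entirely.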
If we end up using the union/join-in-SQL method to generate the series of dates, consider dropping the index on the records_with_cache
table: the read query slows to about 25ms, but every insert/update/delete gets slightly faster. On my local machine it's about 10µs (from ~90µs to ~80µs). This might or might not be negligible in your case, but you have to remember that the insert runs every time the plugin activates (a client hits a nonexistent or redirected page).
Additional: Interesting resources to learn SQL from: