Skip to content

Instantly share code, notes, and snippets.

@tomsing1
Created December 31, 2022 22:50
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tomsing1/304762fce66353310b254e8b22094a6e to your computer and use it in GitHub Desktop.
Save tomsing1/304762fce66353310b254e8b22094a6e to your computer and use it in GitHub Desktop.
Experimenting with JSON fields in a SQLite database
---
title: "Experimenting with SQLite"
editor_options:
chunk_output_type: inline
---
```{r}
library(glue)
library(RSQLite)
library(jsonlite)
```
The steps below are inspired by
[this tutorial](https://tirkarthi.github.io/programming/2022/02/26/sqlite-json-improvements.html)
```{r}
con <- dbConnect(RSQLite::SQLite(), ":memory:")
```
```{r}
if (dbExistsTable(con, "user")) {
dbRemoveTable(con, "user")
}
sql <- glue_sql(
"CREATE TABLE IF NOT EXISTS user (
id integer PRIMARY KEY,
name text,
interests json,
created timestamp default current_timestamp not null
);", .con = con)
DBI::dbExecute(con, sql)
```
```{r}
interests <- list(
John = list(
likes = c("skating", "reading", "swimming"),
dislikes = c("cooking")
),
Kate = list(
likes = c("reading", "swimming"),
dislikes = c("skating")
),
Jim = list(
likes = c("reading", "swimming"),
dislikes = c("cooking")
)
)
for (id in names(interests)) {
dbWithTransaction(con, {
payload <- toJSON(interests[[id]], autounbox = TRUE)
sql <- glue_sql(
"INSERT INTO user (id, name, interests) VALUES(null, {id}, {payload});",
.con = con)
DBI::dbExecute(con, sql)
})
}
```
We can use SQLite's
[-> and ->> JSON operators](https://www.sqlite.org/json1.html#jptr)
for extracting subcomponents of a JSON column.
### Left operands
Both operators take a JSON string as their *left* operand.
### Right operands
On the right side, you can use ither a
[PATH expression](https://dev.mysql.com/worklog/task/?id=8607)
(similar to MySQL) or an object field label or array index (like Postgres).
### Return values
- `->`: returns a JSON representation of the selected subcomponent or `NULL` if
that subcomponent does not exist.
- `->>`: returns an SQL TEXT, INTEGER, REAL, or NULL value that represents the
selected subcomponent, or NULL if the subcomponent does not exist.
Because `->` always returns a JSON representation, with can use it to chain
operations.
In a first example using `PATH expression` syntax, we specify the `likes`
attribute in the `interests` JSON column with the `$.likes` path expression and
chain it to `->>` to return the **first** element of the array. (We could also
use negative indexing, e.g. `$[#-1]` to access the last element of the array.)
```{sql connection=con}
select id, name from user
where interests->'$.likes'->>'$[0]' = 'skating';
```
Alternatively, we can execute the same query using a Postgres-like syntax [^1].
```{sql connection=con}
select id, name from user
where interests->'likes'->> 0 = 'skating';
```
## Filtering JSON columns
::: {.callout-note}
All of the methods shown below (e.g. using `json_each` and `json_tree`) scan
through the full table, e.g. they don't usee an index to make the search more efficient.
:::
If we have information about the structure of the JSON object, then we can
unpack its elements and to focus on specific fields. (See
[this tutorial](https://database.guide/sqlite-json_tree/)
for some nice examples.)
For example, we might want to return all users like `skating`, e.g. whose
`likes` JSON field *contains* this value, e.g. regardless of the position of the value within the array.
### json_each
The `json_each` *table-valued* function walks along the specified JSON field and returns a new table with row for each value. The returned table includes e.g.
`key` and `value` columns we can use to filter.
```{sql connection=con}
SELECT DISTINCT user.id, name, interests->'likes' as likes
FROM user, json_each(interests, '$.likes')
WHERE json_each.value LIKE '%swimming%'
```
### json_tree
The `json_tree` function steps through the full JSON object, e.g. it also
descends into the two sub-arrays `likes` and `dislikes`.
```{sql connection=con}
SELECT name, key, value
FROM user, json_tree(interests)
```
We can define the root for the tree by providing a second argument, e.g. to
focus only on the `likes` sub-array:
```{sql connection=con}
SELECT name, key, value
FROM user, json_tree(interests, '$.likes')
```
To focus only on the node leafs of the tree, we can retain only `atomic` fields:
```{sql connection=con}
SELECT name, key, value
FROM user, json_tree(interests, '$.likes')
WHERE json_tree.atom IS NOT NULL
```
Now that we have row for each value of the `likes` sub-array, we can filter the
table:
```{sql connection=con}
SELECT name, key, value
FROM user, json_tree(interests, '$.likes')
WHERE json_tree.atom IS NOT NULL AND json_tree.value LIKE '%skating%'
```
::: {.callout-warning}
Both the `json_each` and `json_tree` function scan the entire `interests`
column of the table, e.g. they don't use an index. I have not been able to
find documentation on how to add an (expression) index to speed up the search,
so for large datasets the queries shown above might be slow.
:::
## Casting JSON to text
Another way to filter JSON columns is to simply cast them into text, e.g. a JSON
string, and perform a text comparison. (Note that we are including the double
quotes in the search term to avoid unexpected partial matches.)
```{sql connection=con}
SELECT id, name, CAST(interests->'likes' AS TEXT)
FROM user
WHERE CAST(interests->'likes' AS TEXT) LIKE '%"skating"%'
```
Alternatively, we can also extract the TEXT representation of the `likes`
sub-array with the `json_extract` function:
```{sql connection=con}
SELECT id, name, json_extract(interests, '$.likes') as likes
FROM user
WHERE json_extract(interests, '$.likes') LIKE '%"skating"%'
```
### Indexing
::: {.callout-warning}
Using an index for fuzzy matching with the `LIKE` operator is only supported
for patterns of the format `XXX%`, e.g. a leading constant value before any
wildcards (as outlined
[here](https://stackoverflow.com/questions/57329285/how-to-properly-implement-indexing-for-use-in-variable-like-statement-with-sqlit)).
It is also necessary to make sure the search is case insensitive by defining
the casted TEXT explicitely as such with the `COLLATE NOCASE` clause.
:::
```{sql connection=con, include = FALSE}
DROP INDEX IF EXISTS dislikes_idx;
```
Without an index, the table is scanned entirely:
```{sql connection=con}
EXPLAIN QUERY PLAN
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes
FROM user
WHERE dislikes = '["cooking"]'
```
To speed up future queries, we can add an index based on the same
*expression* we are planning to use in the query:
```{sql connection=con}
CREATE INDEX dislikes_idx
ON user(CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE);
```
When the equality (`=`) filter is returned using the `likes_idx` index, speeding
up the search.
```{sql connection=con}
EXPLAIN QUERY PLAN
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes
FROM user
WHERE dislikes = '["cooking"]'
```
Similarly, when we use a pattern with a wildcard at the end, the index is used
as well:
```{sql connection=con}
EXPLAIN QUERY PLAN
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes
FROM user
WHERE dislikes LIKE 'cooking"%'
```
But a wildcard at the start of the term triggers a full scan of the table.
```{sql connection=con}
EXPLAIN QUERY PLAN
SELECT name, CAST(interests->'dislikes' AS TEXT) COLLATE NOCASE as dislikes
FROM user
WHERE dislikes LIKE '%cooking'
```
That's a severe limitation because it precludes matching patterns that are _not_
at the start of the TEXT string.
There might be other ways to implement a
[full text search](https://www.sqlite.org/fts5.html)
in SQLite, but I haven't discovered these features, yet.
## Cleanup
As always, we disconnect from the database when we are done.
```{r}
dbDisconnect(con)
```
[^1]: If the right operand is a text label `X`, then it is interpreted as the
JSON path `$.X`. If the right operand is an integer value `N`, then it is
interpreted as the JSON path `$[N]`.
<details>
<summary>
SessionInfo
</summary>
```{r}
#| echo: false
sessioninfo::session_info()
```
</details>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment