@seancolsen
Last active July 14, 2023 14:58
Querydown, explained for Mathesar devs


Hello Mathesar Core Team!

I want to introduce you to my hobby project: Querydown!

Querydown is a domain specific programming language that generates SQL SELECT queries. It's like "markdown for SQL".

The code is here: https://github.com/seancolsen/querydown

This Gist explains Querydown using Mathesar's Library Management sample data, and it also links out to the (still nascent) Querydown documentation for more detail.



Why I've been building Querydown

I actually began designing Querydown about 10 years ago when I was administering a large CiviCRM installation. CiviCRM has some pretty sophisticated searching and reporting functionality, but nonetheless my users still regularly had data questions which CiviCRM couldn't answer. For example:

  • Show me the people who have signed up for this certain kind of event at least twice in the past three years but not yet this year.

  • For each of the past 5 fiscal years (not calendar years), give me a breakdown of the total revenue we've received, per campaign type.

  • For our big annual event coming up this year, I want to judge our performance in registering event attendees so far. Analyze the previous 5 annual events, producing charts which show the cumulative number of registered attendees vs the relative time leading up to the event. Overlay these accumulation curves on the same chart, aligned by their event date, so I can see where we are on the curve this year compared to prior years.

For many years I've been helping people use various CRM-like products which go to great lengths to build powerful search and reporting functionality, but I've seen those GUI-based searching and reporting tools consistently fall short of users' real-world needs. The GUI tools usually end up falling into an uncanny valley where they are the worst of both worlds — too complex to appease a novice user, and too limiting to satisfy a power user.

For my CiviCRM users, I found myself writing a lot of ad-hoc SQL to meet their needs, often piping the results into other tools to generate pretty graphs. Being somewhat cavalier, I usually ran my SQL against the live DB. Several times I forgot to include an ON clause with a join, causing an enormous cartesian product that would freeze the production server. Oops! I learned that SQL is powerful, but also dangerous — even for seemingly innocuous SELECT queries! For this reason, I would be wary of allowing end users to run SQL on a live database, even if they claim to be proficient. And with SQL being so verbose and finicky, I would never want to recommend that a novice attempt to use it to answer business questions like the above.

I think it would be awesome if we could give users a safe, user-friendly query language to answer complex questions without going through a DBA or BI analyst. I want a query language with low barriers to entry, so that non-technical users can perform simple queries without feeling like they need to "learn" something — and a high ceiling, so that power users can eventually learn enough to write almost all the SELECT queries they would otherwise write in SQL (but much more easily)! As far as I can tell, there is no project with this goal. Nothing like it exists! The closest thing is PRQL, but it's still designed for engineers and analysts — not users. From a user's perspective, PRQL is remarkably similar to SQL.

Many products already have some sort of bespoke user-facing query language masquerading as "advanced search" (for example: Spotify, Wikipedia, and GitHub), but these languages are usually too simplistic to give users the power and flexibility they need to answer more meaningful questions. For instance, I recently wanted GitHub to show me tickets in which both core team members and GSoC contributors were involved. I can search with involves:seancolsen, but when I add another involves term it gets combined with an "or" condition instead of an "and" condition. It's frustrating that I can't control that logic! Simple Querydown code looks a lot like those user-facing advanced search languages, but its additional features make it way more powerful!

The Querydown playground


You can run these queries in your browser by doing the following:

  1. Open the Querydown playground.

  2. Copy this library_schema.json and paste it into the schema.json tab within the playground.

    This passes metadata to the Querydown compiler so that it knows about the tables, columns, and relationships present in the database. The compiler uses this metadata for things like identifier lookups and joins. By default, the playground loads metadata for a sample issue tracker schema which I've been using for development because it has some more complex topology. All the examples in this gist will use our Mathesar library schema though.

  3. Copy Querydown code from any of the examples below, and paste it into the playground to see the SQL it generates.

    Disclaimer: The compiler does not yet report very useful errors, but that will be easy to change later.

  4. Copy the generated SQL and run it against our sample data to see the data it produces!

Syntax basics

A simple query

Find books titled "Anna Karenina" and published after 2000.

#books title:"Anna Karenina" publication_year:>2000

🚀 SQL output:

SELECT
  "Books".*
FROM "Books"
WHERE
  "Books"."Title" = 'Anna Karenina' AND
  "Books"."Publication Year" > 2000;

Each query begins by setting the base table (which is Books in the example above). A query always has one and only one base table (similar to Mathesar's Data Explorer). Table names can occur in several places throughout the query and are always prefixed with a # sigil.

Flexible identifiers

If you write SQL, you need to quote the Publication Year column with double quotes (not single!) and also remember to quote columns like Title and year.

Querydown is more forgiving. In the example above, publication_year compiled to "Publication Year". How?

  • If a table or column in Querydown isn't found exactly as specified, then the compiler attempts to find a unique match with a flexible strategy comparing only lowercased ASCII letters and numbers. This means that foo_bar will resolve to Foo Bar, but only if it doesn't also resolve to any other identifiers like foobar. Further, column names like year are never a problem because there are no keywords (and function names are always clear to the parser from other syntax).

  • Tables and columns can also be quoted with backticks (e.g. `Publication Year`) to handle ambiguity if absolutely necessary.
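To make this concrete, here's a rough Python sketch of how such a flexible matching strategy could work. This is my own illustration of the behavior described above, not the actual compiler code:

```python
import re

def normalize(name):
    # Compare only lowercased ASCII letters and numbers.
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()

def resolve(requested, candidates):
    # An exact match always wins.
    if requested in candidates:
        return requested
    # Otherwise, fall back to the normalized comparison, but only
    # accept it when exactly one candidate matches.
    matches = [c for c in candidates if normalize(c) == normalize(requested)]
    return matches[0] if len(matches) == 1 else None

print(resolve("publication_year", ["id", "Title", "Publication Year"]))
# → Publication Year
print(resolve("foo_bar", ["Foo Bar", "foobar"]))
# → None (ambiguous: both candidates normalize to "foobar")
```

Note how the last lookup fails: foo_bar would match Foo Bar, but since foobar normalizes to the same string, the reference is ambiguous and backtick quoting would be needed.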

Values

Conditions

Basic conditions

  • After the base table, a query can have a space-separated list of conditions to filter the result rows. These conditions are conjoined via a logical AND.
  • A condition can be any expression, but conditions are typically comparison expressions. For example, the expression a:1 compiles roughly to "a" = 1, meaning it evaluates to TRUE when the value of column a is equal to 1.

Comparison operators

See a table of all comparison operators.

AND vs OR

Curly brackets { } enclose a set of AND conditions. Square brackets [ ] enclose a set of OR conditions. Sets of conditions can be nested.

Find items that were acquired since 2023 for at least $10 or acquired since 2022 for at least $20:

#items [
  {acquisition_price:>=10 acquisition_date:>=@2023-01-01}
  {acquisition_price:>=20 acquisition_date:>=@2022-01-01}
]
🚀 Show SQL output
SELECT
  "Items".*
FROM "Items"
WHERE
  ("Items"."Acquisition Price" >= 10 AND
  "Items"."Acquisition Date" >= DATE '2023-01-01' OR "Items"."Acquisition Price" >= 20 AND
  "Items"."Acquisition Date" >= DATE '2022-01-01');

-- NOTE from Sean: SQL formatting hasn't been a priority thus far,
-- which is why this one looks a bit ugly.

(If no brackets are present at the top level, then a set of AND conditions is inferred.)

Comparison expansion

Comparison operators can use .. on either side to "expand" the comparison into a set of expressions.

Find books with titles that include "color" or "colour".

#books title:~..["color" "colour"]

Desugars to: #books [title:~"color" title:~"colour"]

🚀 Show SQL output
SELECT
  "Books".*
FROM "Books"
WHERE
  ("Books"."Title" ~* 'color' OR "Books"."Title" ~* 'colour');
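As an illustration of the desugaring step (a hypothetical sketch, not the compiler's real internals), expansion just turns one comparison into a bracketed set of simple ones:

```python
def expand_comparison(column, operator, values):
    # title:~..["color" "colour"]  →  [title:~"color" title:~"colour"]
    comparisons = " ".join(f'{column}:{operator}"{v}"' for v in values)
    return f"[{comparisons}]"

print(expand_comparison("title", "~", ["color", "colour"]))
# → [title:~"color" title:~"colour"]
```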

Ranges

Range literals can be specified using the .. syntax.

Find books published between 1900 and 1999 (inclusive).

#books publication_year:1900..1999
🚀 Show SQL output
SELECT
  "Books".*
FROM "Books"
WHERE
  "Books"."Publication Year" >= 1900 AND
  "Books"."Publication Year" <= 1999;

Result columns

Specifying result columns

Result columns are specified by prefixing expressions with $.

Show the id, title, and publication year for all books

#books $id $title $publication_year
🚀 Show SQL output
SELECT
  "Books"."id",
  "Books"."Title",
  "Books"."Publication Year"
FROM "Books";

Result columns must come after conditions.

Aliasing

Aliasing is done via ->.

Show the id, title, and publication year for all books, while aliasing Publication Year to year

#books $id $title $publication_year->year
🚀 Show SQL output
SELECT
  "Books"."id",
  "Books"."Title",
  "Books"."Publication Year" AS "year"
FROM "Books";

Sorting

Sorting is done via flags appended to the result column with a backslash.

Show the id, title, and year for all books, while sorting the results by year in descending order

#books $id $title $publication_year->year \sd
🚀 Show SQL output
SELECT
  "Books"."id",
  "Books"."Title",
  "Books"."Publication Year" AS "year"
FROM "Books"
ORDER BY
  "year" DESC NULLS LAST;

Column globs

You can use * to show all the columns in the base table or a related table, similar to SQL.

Show all columns in the books table, plus all author columns, plus the publisher's name.

#books $* $author.* $publisher.name
🚀 Show SQL output
SELECT
  "Books"."id",
  "Books"."Title",
  "Books"."Publication Year",
  "Books"."Media",
  "Books"."Page Count",
  "Books"."LC Classification",
  "Books"."ISBN",
  "Books"."Dewey Decimal",
  "Books"."Dewey Wording",
  "Books"."Author",
  "Books"."Publisher",
  "Authors"."id",
  "Authors"."First Name",
  "Authors"."Last Name",
  "Authors"."Website",
  "Publishers"."Name"
FROM "Books"
LEFT JOIN "Publishers" ON
  "Books"."Publisher" = "Publishers"."id"
LEFT JOIN "Authors" ON
  "Books"."Author" = "Authors"."id";

Glob control

For more control over the query, you can follow * with a parenthetical block containing flagged expressions to alter sorting and such.

Show all columns in the books table, but hide the Dewey Wording column and sort the results by isbn.

#books $*(dewey_wording \h isbn \s)
🚀 Show SQL output
SELECT
  "Books"."id",
  "Books"."Title",
  "Books"."Publication Year",
  "Books"."Media",
  "Books"."Page Count",
  "Books"."LC Classification",
  "Books"."ISBN",
  "Books"."Dewey Decimal",
  "Books"."Author",
  "Books"."Publisher"
FROM "Books"
ORDER BY
  "Books"."ISBN" ASC NULLS LAST;

Computations and functions

Expressions support basic arithmetic operators. Named functions are applied via pipe syntax.

Find the most overdue checkouts, and compute a $1.50/day late fee for each.

#checkouts check_in_time:@null $* $due_date|age|days|ceil*1.5->late_fee\sd
🚀 Show SQL output
SELECT
  "Checkouts"."id",
  "Checkouts"."Item",
  "Checkouts"."Patron",
  "Checkouts"."Checkout Time",
  "Checkouts"."Due Date",
  "Checkouts"."Check In Time",
  CEIL(EXTRACT(epoch FROM NOW() - "Checkouts"."Due Date") / 86400) * 1.5 AS "late_fee"
FROM "Checkouts"
WHERE
  "Checkouts"."Check In Time" IS NULL
ORDER BY
  "late_fee" DESC NULLS LAST;
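To show what that pipe chain computes, this Python sketch mirrors the arithmetic in the generated SQL (CEIL of the days overdue, times $1.50). It's an illustration I wrote for this gist, not part of Querydown itself:

```python
import math
from datetime import datetime, timezone

def late_fee(due_date, now, rate_per_day=1.5):
    # Mirrors CEIL(EXTRACT(epoch FROM NOW() - due_date) / 86400) * 1.5:
    # any partial day overdue counts as a whole day.
    seconds_overdue = (now - due_date).total_seconds()
    return math.ceil(seconds_overdue / 86400) * rate_per_day

now = datetime(2023, 7, 14, 12, 0, tzinfo=timezone.utc)
due = datetime(2023, 7, 10, 0, 0, tzinfo=timezone.utc)
print(late_fee(due, now))  # 4.5 days overdue rounds up to 5 days → 7.5
```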

Referencing single related records

When a column links to another table, the . character can be used after the column to refer to columns in the related table.

Find checkouts of copies of books from publishers with names containing "oxford":

#checkouts item.book.publisher.name:~"oxford"

👇 Look how much SQL you'd need to write to do this!!

🚀 Show SQL output
SELECT
  "Checkouts".*
FROM "Checkouts"
LEFT JOIN "Items" ON
  "Checkouts"."Item" = "Items"."id"
LEFT JOIN "Books" ON
  "Items"."Book" = "Books"."id"
LEFT JOIN "Publishers" ON
  "Books"."Publisher" = "Publishers"."id"
WHERE
  "Publishers"."Name" ~* 'oxford';

Referencing multiple related records

😎 Here's where Querydown starts to get a lot more ergonomic than SQL!

Mandatory aggregation

Data from one-to-many relationships is always aggregated with respect to the base table. This means the query will never have more results than are present in the base table.

This behavior limits the queries that are possible with Querydown, but it simplifies the problem space a lot!

Aggregate counts

Tables related through one-to-many relationships can be referenced as expressions to obtain the count of related records.

Show all publishers, along with the total number of books that each has published:

#publishers $* $#books
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk",
      count(*) AS "v1"
    FROM "Books"
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers"."id",
  "Publishers"."Name",
  "cte0"."v1"
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk";

Specifying an aggregate function

You can also pipe specific columns in the related table into aggregate functions via %.

Show the year of each publisher's first publication:

#publishers $* $#books.publication_year%min
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk",
      min("Books"."Publication Year") AS "v1"
    FROM "Books"
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers"."id",
  "Publishers"."Name",
  "cte0"."v1"
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk";

Transitive relationships

Data related through transitive relationships can (sometimes) be referenced via the target table alone, without specifying any intermediate tables.

Show each publisher's most recent checkout:

#publishers $* $#checkouts.checkout_time%max
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk",
      max("Checkouts"."Checkout Time") AS "v1"
    FROM "Books"
    JOIN "Items" ON
      "Books"."id" = "Items"."Book"
    JOIN "Checkouts" ON
      "Items"."id" = "Checkouts"."Item"
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers"."id",
  "Publishers"."Name",
  "cte0"."v1"
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk";

Here, Publishers is only related to Checkouts transitively — that is, the schema has no direct relationships between the two tables. This query still works though. The compiler traverses the relationship graph looking for a unique, non-intersecting path from Publishers to Checkouts, and it essentially expands the code above to become:

#publishers $* $#books.#items.#checkouts.checkout_time%max

Full paths (like above) can be used in cases where there is ambiguity or intersection.
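Here's a minimal sketch of that traversal, assuming a breadth-first search over a hand-written relationship graph. The real compiler also rules out intersecting paths, which this illustration ignores:

```python
from collections import deque

# A hand-written relationship graph for illustration; each edge is a
# direct foreign-key relationship in the library schema.
EDGES = {
    "publishers": ["books"],
    "books": ["publishers", "authors", "items"],
    "authors": ["books"],
    "items": ["books", "checkouts"],
    "checkouts": ["items", "patrons"],
    "patrons": ["checkouts"],
}

def find_unique_path(start, target):
    # Breadth-first search. If two distinct shortest paths reach the
    # target, the reference is ambiguous and we give up.
    found = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if found and len(path) > len(found[0]):
            break  # we're past the shortest depth
        if path[-1] == target:
            found.append(path)
            continue
        for nxt in EDGES[path[-1]]:
            if nxt not in path:
                queue.append(path + [nxt])
    return found[0] if len(found) == 1 else None

print(find_unique_path("publishers", "checkouts"))
# → ['publishers', 'books', 'items', 'checkouts']
```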

"Has some" and "has none" conditions

You can use the ++ and -- shorthand syntax to construct more readable conditions based on the presence or absence of related records.

Find publishers with at least one book:

#publishers ++#books
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk"
    FROM "Books"
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers".*
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk"
WHERE
  "cte0"."pk" IS NOT NULL;

Find publishers with no books:

#publishers --#books
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk"
    FROM "Books"
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers".*
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk"
WHERE
  "cte0"."pk" IS NULL;

Conditions to filter aggregate data

You can add a condition block after any aggregated table.

For each publisher, show the number of related checkouts from the past year:

#publishers $* $#checkouts{checkout_time|ago:<@1Y}
🚀 Show SQL output
WITH
  "cte0" AS (
    SELECT
      "Books"."Publisher" AS "pk",
      count(*) AS "v1"
    FROM "Books"
    JOIN "Items" ON
      "Books"."id" = "Items"."Book"
    JOIN "Checkouts" ON
      "Items"."id" = "Checkouts"."Item"
    WHERE
      NOW() - "Checkouts"."Checkout Time" < make_interval(years => 1)
    GROUP BY "Books"."Publisher"
  )
SELECT
  "Publishers"."id",
  "Publishers"."Name",
  "cte0"."v1"
FROM "Publishers"
LEFT JOIN "cte0" ON
  "Publishers"."id" = "cte0"."pk";

Future work

I've already been designing lots more features. Here are some neat library queries we'd be able to run with a bit more work:

  • With custom grouping columns:

    Over the past year, give a month-by-month report which shows the percentage of books checked out in that month that were later returned on time. Ignore checkouts which are not yet due.

    #checkouts checkout_time:>=@1Y|ago due_date:<@now
    $checkout_time|year_month \g
    $(in_date:<=due_date)%percent_true
    
  • With user-defined constants, computed columns, and case expressions:

    Show the patrons who currently have the highest total late fee

    @fee_per_day_late = 1.5
    #checkouts.days_late = ? in_date!@null~0 ~~due_date|age|days|max(0)|ceil
    #checkouts.late_fee = @fee_per_day_late * days_late
    #patrons.late_fee = #checkouts.late_fee%sum
    #patrons $* $late_fee \sd
    
  • With window functions:

    Find patrons who visit the library on a very regular basis. Show the ones who have the lowest maximum interval between sequential checkouts over the past 6 months.

    @start_point = @6M|ago
    #checkouts.prev_time = %%(patron \p checkout_time \s)%lag(checkout_time)
    #checkouts.days_since_prev = (checkout_time-prev_time|else(@start_point))|days
    #patrons.days_since_final = #checkouts.checkout_date%max|age|days
    #patrons.max_interval = #checkouts{checkout_date:>@start_point}.days_since_prev%max
    #patrons $* $max_interval|max(days_since_final) \s
    

I have various other smaller features in mind which should be relatively straightforward to implement. And here are some larger projects I've been imagining for the more distant future:

  • A module system
  • A language server
  • Code formatting
  • Static type-checking
  • Support for other SQL dialects like SQLite and MySQL
  • Ability to formulate result columns via user-defined JSON
  • More language bindings
  • Tools to assist with DDL changes and migrations
  • Language-level support for queries against the schema structure, to allow user-driven introspection
  • A GUI editor to write/edit Querydown using Blockly or something like it.
  • An AI language model to generate Querydown code from natural language prompts.

Why I think "mandatory aggregation" is important

In our early Data Explorer design discussions, I expressed an unpopular opinion that all joined data should be aggregated against the base table. I didn't want the Data Explorer to allow users to obtain a result set which has more rows than are present in the base table. We did not go that direction with the Data Explorer, but I did go that direction with Querydown. I'd like to take you on a journey to explain why...

  1. Books

    Let's take a look at the books that our sample data has from Hannu Rajaniemi.

    select
      id as book_id,
      "Title" as title
    from "Books"
    where "Author" = 320;

    book_id  title
    924      The Causal Angel
    1182     The Quantum Thief
    850      Summerland
  2. Items

    Now let's take a look at all the copies of all those books that we have.

    select
      i.id as item_id,
      i."Barcode" as barcode,
      i."Acquisition Price" as price,
      b.id as book_id,
      b."Title" as title
    from "Items" i 
    left join "Books" b on i."Book" = b.id
    where b."Author" = 320
    order by title;

    item_id  barcode              price  book_id  title
    1106     DE6E-AB05-BB24-EED3  7.29   850      Summerland
    1170     CC9F-CFA5-38A1-578D  13.21  924      The Causal Angel
    1171     6F2E-74C3-0394-D074  5.82   924      The Causal Angel
    1280     3A7B-D97D-B33B-EBCE  17.40  1182     The Quantum Thief

    Here we see that we have two copies of "The Causal Angel". Cool.

  3. Book Genres

    For me to demonstrate the problem I'm talking about, we need a schema with one table that has multiple one-to-many relationships. Our schema has no such tables, so we'll extend our schema by adding genres to each book.

    drop table if exists "Book Genres";
    drop table if exists "Genres";
    create table "Genres" (
      "id" serial not null,
      "Name" text,
      constraint "Genres_pkey" primary key ("id")
    );
    create table "Book Genres" (
      "id" serial not null,
      "Book" integer,
      "Genre" integer,
      constraint "Book Genres_pkey" primary key ("id"),
      constraint "Book Genres_book_fkey"
        foreign key ("Book") references "Books" ("id"),
      constraint "Book Genres_genre_fkey"
        foreign key ("Genre") references "Genres" ("id")
    );

    Now each book has multiple items and also multiple genres. Great!

    We'll also add some genre data to Hannu Rajaniemi's books:

    insert into "Genres" ("id", "Name") values
      (1, 'Science fiction'),
      (2, 'Cyberpunk'),
      (3, 'Fantasy'),
      (4, 'Historical Fiction'),
      (5, 'Space Opera');
    insert into "Book Genres" ("Book", "Genre") values
      (850, 1),
      (850, 3),
      (850, 4),
      (924, 1),
      (924, 2),
      (924, 4),
      (1182, 1),
      (1182, 2),
      (1182, 5);
  4. Listing all the genres for each book

    We can use aggregations for this as follows:

    select
      b.id as book_id,
      b."Title" as title,
      array_agg(distinct g."Name") as genres
    from "Books" b
    left join "Book Genres" bg on bg."Book" = b.id
    left join "Genres" g on g.id = bg."Genre"
    where b."Author" = 320
    group by b.id;

    book_id  title              genres
    850      Summerland         {Fantasy,"Historical Fiction","Science fiction"}
    924      The Causal Angel   {Cyberpunk,"Historical Fiction","Science fiction"}
    1182     The Quantum Thief  {Cyberpunk,"Science fiction","Space Opera"}
  5. Summing the total items price for each book

    We can use aggregations for this as follows:

    select
      b.id as book_id,
      b."Title" as title,
      sum(i."Acquisition Price") as sum_items_price
    from "Books" b
    left join "Items" i on i."Book" = b.id
    where b."Author" = 320
    group by b.id;

    book_id  title              sum_items_price
    850      Summerland         7.29
    924      The Causal Angel   19.03
    1182     The Quantum Thief  17.40

    Notice the price of 19.03. That's the sum of 13.21 and 5.82, the individual costs of the two copies of that book. Excellent!

  6. Trying to combine both aggregate columns into the same query

    How can we show one table of books that has both of those aggregate columns?

    A naive approach would be to combine the SQL we used to make each query:

    select
      b.id as book_id,
      b."Title" as title,
      array_agg(distinct g."Name") as genres,
      sum(i."Acquisition Price") as sum_items_price
    from "Books" b
    left join "Book Genres" bg on bg."Book" = b.id
    left join "Genres" g on g.id = bg."Genre"
    left join "Items" i on i."Book" = b.id
    where b."Author" = 320
    group by b.id;

    book_id  title              genres                                              sum_items_price
    850      Summerland         {Fantasy,"Historical Fiction","Science fiction"}    21.87
    924      The Causal Angel   {Cyberpunk,"Historical Fiction","Science fiction"}  57.09
    1182     The Quantum Thief  {Cyberpunk,"Science fiction","Space Opera"}         52.20

    😲 Oh no! What happened? All the numbers are different!! That price that was 19.03 has now tripled to 57.09!

    You'll get this same behavior if you try to do this query in the Data Explorer.


    In my experience, this is a very common type of query, and the way SQL encourages us to perform aggregations can be a real footgun! The aggregate constituents will intermingle with each other and bork everything up! The worst part is: if you don't have some expectations of the values, you may not even notice that anything is amiss!

  7. CTEs

    As far as I know, if you want to perform this aggregation properly, you need to use CTEs or temporary tables, like so:

    WITH
      "cte0" AS (
        SELECT
          "Book Genres"."Book" AS "pk",
          string_agg("Genres"."Name", ', ') AS "v1"
        FROM "Book Genres"
        JOIN "Genres" ON
          "Book Genres"."Genre" = "Genres"."id"
        GROUP BY "Book Genres"."Book"
      ),
      "cte1" AS (
        SELECT
          "Items"."Book" AS "pk",
          sum("Items"."Acquisition Price") AS "v1"
        FROM "Items"
        GROUP BY "Items"."Book"
      )
    SELECT
      "Books"."id",
      "Books"."Title",
      "cte0"."v1" AS "genres",
      "cte1"."v1" AS "sum_items_price"
    FROM "Books"
    LEFT JOIN "cte0" ON
      "Books"."id" = "cte0"."pk"
    LEFT JOIN "cte1" ON
      "Books"."id" = "cte1"."pk"
    WHERE
      "Books"."Author" = 320;

    id    Title              genres                                          sum_items_price
    850   Summerland         Science fiction, Fantasy, Historical Fiction    7.29
    924   The Causal Angel   Science fiction, Cyberpunk, Historical Fiction  19.03
    1182  The Quantum Thief  Science fiction, Cyberpunk, Space Opera         17.40

    That gives us the numbers we expect, but golly that's a lot of SQL to write!!

    Fortunately, I didn't have to write that SQL — I used the following Querydown code to generate it!

    #books author:320 $id $title
    $#genres.name%list -> genres
    $#items.acquisition_price%sum -> sum_items_price
    

    Caveats:

    • If you want to run the above query in the playground, you'll need to use this modified schema.json that has the extra tables we added.
    • Querydown doesn't have an array aggregation function yet, so I used the list aggregate function, which performs string aggregation.
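If you'd like to reproduce the inflation from step 6 in miniature, here's a self-contained demo using Python's built-in sqlite3 module. The table and column names are simplified stand-ins for the real schema:

```python
import sqlite3

# Miniature reproduction: "The Causal Angel" (book 924) has two items
# and three genres.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table items (book int, price real);
    insert into items values (924, 13.21), (924, 5.82);
    create table book_genres (book int, genre text);
    insert into book_genres values
      (924, 'Science fiction'), (924, 'Cyberpunk'), (924, 'Historical Fiction');
""")

# The naive combined join repeats each item row once per genre,
# inflating the sum by a factor of three.
naive = db.execute("""
    select sum(i.price) from items i
    join book_genres g on g.book = i.book
""").fetchone()[0]
print(round(naive, 2))  # 57.09 instead of 19.03

# Aggregating each relationship in isolation (as the generated CTEs do)
# gives the correct total.
correct = db.execute("""
    with item_totals as (
      select book, sum(price) as total from items group by book
    )
    select total from item_totals where book = 924
""").fetchone()[0]
print(round(correct, 2))  # 19.03
```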

Querydown aggregates all joined data with respect to the base table — and it does so in isolation, so that you don't end up with problems like the above. This "mandatory aggregation" behavior is a fundamental design decision of the language and a significant divergence from SQL. While it simplifies the problem space, it also means Querydown is more limited than SQL.

For example, you cannot use Querydown to list pairs of books with the same title:

select
  a.id,
  b.id,
  a."Title"
from "Books" a
join "Books" b on b."Title" = a."Title" and a.id < b.id;

This is impossible because each row in the result set corresponds to a pair of books, and there is no table with rows like that.

However, once I've implemented grouping and pipelines you will be able to do something kind of similar. The query below lists book titles for which more than one book exists, and it shows an aggregated list of all the book ids which share that same title:

#books $title\g $id%list $%count->n ~~~ n:>1

This behavior still fits within Querydown's limitations because you start with books and then pare the result set down to fewer rows (as opposed to SQL joins which produce more rows in the results).

Q&A

Why does the syntax have so many symbols? Won't it be hard to learn and hard to read?

As software engineers, I think we're all accustomed to placing a high value on code readability because we're used to a process of collaboratively building software which requires us to read much more code than we write. In that setting, readability is very important, but Querydown is not designed for that setting. Rather, Querydown is designed for smaller, more self-contained, and bespoke code snippets which people write frequently but read infrequently. In that setting the priorities are shifted slightly.

Given those priorities, I chose to design Querydown without keywords. Avoiding keywords might make Querydown feel like regular expressions, and indeed I would say that Querydown shares with regex an emphasis on writability over readability. But there's also a more important reason why I didn't want keywords hanging out in there...

For newcomers, I want the language to "get out of the way". What I mean is: I want people to be able to learn a few of the comparison operators and then use Querydown to their heart's content without ever being surprised by what it did or why it gave them an error. Keywords make this harder because they can collide with column names. Keywords certainly help readability, but they can also sneak up on users who don't yet have a comprehensive understanding of the language. I considered prefixing column names with their own sigil, but that felt like too much divergence from the patterns established by "advanced search" languages. I would be open to reconsidering it though.

To continue a comparison with regex, I would say that the regex language "gets in your way" a lot! Before you can reliably construct regex patterns without getting surprised, you need to at least be aware of all the special characters, otherwise you may end up writing a special character while intending to match that value literally. By contrast, Pomsky offers a neat alternative. It's a DSL that compiles to regex, and it requires all literals to be quoted. The Querydown language design is more like Pomsky, and less like regex.

Given the no-keyword rule, my approach to subsequent language design choices has been to make common things relatively easy and succinct while making less common things relatively harder and more verbose. With some aspects of the language, I feel rather attached to the current design (e.g. { } for AND, [ ] for OR), while other aspects of the design still seem questionable to me (e.g. ++ and -- syntactic sugar).

Would non-technical users actually be able to use Querydown?

For simple queries, yes. I would expect them to use it kind of like the Spotify Advanced Search and similar tools aimed at non-technical users.

For complex queries, no. I would not expect non-technical users to be patient enough with Querydown to attempt answering more complex data questions with it.

That said, I'd like to expand our thinking on this topic a bit...

Our discussions around user personas sometimes end up dichotomizing users into either "technical" or "non-technical". I'd like to suggest a middle category which I'll call "slightly technical". For example:

  • A non-technical user might use spreadsheet formulas with arithmetic and functions like SUM, but probably won't get too much fancier than that. Over the years, I've supported dozens (maybe over a hundred, depending on how you count) of users with their spreadsheet needs, and I've seen this to be the case with the vast majority of them.

  • A slightly technical user would be comfortable with more complex spreadsheet formulas like VLOOKUP and might also use pivot tables.

  • A technical user might use SQL or Python to analyze the data instead.

With Querydown, my aim is to make simple queries accessible to non-technical users, and to make more complex queries accessible to slightly technical users.

How would non-technical users perform more complex queries with Querydown?

Basically, they wouldn't. Not directly, anyway. But I have a lot more to say about this...

I'd like to delve deeper into the inherent problems of non-technical users performing more complex queries. I would say Mathesar's answer to this use case has been the Data Explorer. I've expressed concern about this product direction since the start of the Data Explorer, and I'll expand on that here.

Let's look at a counter-example: spreadsheet pivot tables. Pivot tables are immensely powerful! They also behave in a near identical manner across all spreadsheet products, making them tremendously accessible to non-technical users. Yet, with all the people I have supported in using spreadsheets, I've never encountered a non-technical user who is comfortable with pivot tables. I had a co-worker who took a months-long professional development course on spreadsheets, and the curriculum covered pivot tables last. Even after that she was still hesitant to use them in cases where they fit perfectly. I have led workshops on spreadsheets where I have tried to teach pivot tables to people. I've tutored colleagues one-on-one, trying to teach them. I've built shared spreadsheets with pivot tables, trying to get other people to use them. In the end, non-technical users just don't want to engage with pivot tables because they are too confusing conceptually.

What I find fascinating is that the UI for building a pivot table does not require any code — it's all drag and drop. And yet, non-technical users seem to generally be more comfortable working with formulas than they are with pivot tables. I think there are two reasons for this discrepancy:

  • The fundamental concept is simpler with formulas: I enter a formula into a cell; I get a value in the cell!
  • You can start with very simple formulas which don't seem that intimidating at first, whereas creating even a simple pivot table demands more conceptual understanding and presents more distractions as users try to make sense of the complex UI.

To get back to the question... What if a non-technical user wants to perform a more complex query with Querydown? These are the options I see:

  • Build a complex query by consulting with a slightly technical user.

    Most non-technical users don't care enough to understand the fundamental concepts necessary to formulate complex queries. They just want quick answers to their questions so that they can move on with their other work that they actually care about.

    With spreadsheets, I see this "phone a friend" pattern all the time. A few weeks ago my room-mate Ricki showed me a spreadsheet of theirs, explaining that their brother had helped them with some of the more complex formulas (VLOOKUP). Then Ricki later inserted a column and made some adjustments to those formulas to change the VLOOKUP index. Having a text-only representation of the formula makes this communication and re-adjustment process so much easier than it would otherwise be if we all had to configure our custom spreadsheet logic through a series of forms and fields. We can communicate about the code (and changes to it) in much simpler terms than we can when discussing some elaborate UI.

    I don't think any amount of effort in pursuit of creating "easy to use" GUI search and reporting tools will ever satisfy non-technical users. The problem space just contains too much intrinsic complexity. That's okay! We should embrace this. Most non-technical users want someone else to just do it for them. The more we can improve the experience for slightly technical users, then the more we end up improving the experience for non-technical users, vicariously.

  • Build a complex query by becoming a slightly technical user.

    Of course, many non-technical users do have the patience and interest to learn more. Spreadsheets can be really great for these users because they're like little sandboxes that curious people can use to experiment.

    I was once tutoring a colleague on spreadsheet formulas and was surprised to see that she didn't understand that * meant multiplication. (After all, high school math teaches us that multiplication is × or ·.) Years later, she became a Certified Financial Planner and I think that her learning journey with spreadsheets proved to be a useful on-ramp towards more complex concepts.

    On-ramps are crucial. This is why I've designed Querydown to be really simple for really simple queries. This philosophy encourages people to start small and learn as they go.

    It's also worth mentioning that I've had my eye on eventually building a language server for Querydown throughout most of my language design process. A Querydown language server will allow application developers to easily implement awesome intellisense for the in-app Querydown editor. It would provide context-aware code completion for sooo much stuff. The language does have a lot of sigils. They might look a bit ugly, but they do allow the language to play very nicely with intellisense! All of this helps people learn!

  • Build a complex query by using a visual programming language.

    Maybe you saw this HN post from last year which discusses Scratch, the visual programming language for kids. It's awesome to see how easy it is for people to build complex things (e.g. Scratch inside Scratch) with this visual interface. Under the hood, it uses Blockly, a JS editor for building visual programming languages. I'm dreaming about eventually building a blockly-like interface to write Querydown. For many people, that would be less intimidating than the raw Querydown code. And if the interface could also sync with the Querydown code bidirectionally, it would be an awesome way to learn Querydown and make those "on-ramps" even smoother!

  • Build a complex query by using an AI model.

    This is really the future in my opinion. I think we're at the dawn of a user-interface renaissance, where we're about to see more and more user interfaces moving towards natural language. Already, one of our competitors, Outerbase, is touting "EZQL" as their flagship feature. It's a language model which generates SQL. But I think a language model to generate Querydown would be even better because users would have an easier time inspecting and adjusting the output.

Would integrating Querydown into Mathesar be feasible?

  • I see a number of interesting technical challenges in potentially integrating Querydown into Mathesar. For example:

    • Dealing with names vs oids/attnums
    • DDL changes vis-à-vis saved queries
    • Record summaries
    • Display options, cell interactivity, and other cell features
    • Cell provenance, to support updating cells and deleting rows
    • Supporting outline-style row grouping
    • Integrating a Rust library into the front end and backend

    I've been ruminating on all of these challenges for many months and I have a lot of neat ideas on how to address them, some of which I think have the potential to offer architectural improvements over our current approaches. I'll spare you the details here, but I just wanted to mention this in an effort to preempt some of the inevitable skepticism about the viability of Querydown within Mathesar. We can wade further into the weeds as necessary.

  • A simple way to integrate Querydown into Mathesar would be to build a self-contained area of the app for users to write, run, and save Querydown queries. This would be entirely separate from the Data Explorer and would be somewhat straightforward to implement.

  • At the other end of the spectrum, I could imagine a world in which many of Mathesar's features are actually built on top of Querydown, like the Data Explorer, the Table view, filtering, sorting, formulas, display options, record summaries, and visualizations. I have been forming a rough vision for this approach which I'd like to elaborate upon if there is enough excitement about Querydown among the team.

Why did you write it in Rust?

I won't try to argue that Rust is the best language for everything, but when it comes to implementing a DSL, I'm convinced that Rust suits the problem better than any other language currently does. Its focus on performance and correctness is an excellent combination for this sort of problem. Also, low-level Rust code plays fairly well with higher-level code in many other languages, which makes Rust a good choice for writing a library that many devs might want to integrate into a wide variety of codebases. For example, the PRQL project (again, the closest thing to Querydown that already exists) is written in Rust and has language bindings for Python, JavaScript, C, C#, Elixir, Java, and PHP.

  • Based on my research, I'm fairly confident that calling Rust code from Python will be straightforward with PyO3, though I've not yet worked on this sort of integration.
  • Calling Rust code from JavaScript is relatively straightforward with WASM, and is demonstrated in the Querydown playground. For compiling to WASM, Rust is by far the best-supported low-level language, which I think is probably due to Rust being one of the only modern languages without a garbage collector. (Compiling to WASM with Golang, for example, yields larger binaries because the GC runtime must be included in the binary.) This is not to suggest that we'd need the front end to actually run the Querydown compiler. The reason I'm interested in JS support is that I envision eventually writing a language server that could run in a browser to manipulate Querydown code, e.g. by using tower-lsp.
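To make the PyO3 idea a bit more concrete, here's a minimal sketch of what a Python binding could look like. This is an assumption-heavy illustration, not the actual Querydown API: the function name `compile_to_sql`, the module name, and the placeholder body are all hypothetical, and it requires the `pyo3` crate plus a `maturin` build rather than running standalone.

```rust
// Cargo.toml would need something like:
//   pyo3 = { version = "0.21", features = ["extension-module"] }
use pyo3::prelude::*;

/// Hypothetical entry point: compile Querydown source to SQL.
/// The real compiler's API and error handling may look quite different.
#[pyfunction]
fn compile_to_sql(source: &str) -> PyResult<String> {
    // Placeholder body; a real binding would call into the
    // Querydown compiler crate here and surface its errors as
    // Python exceptions via PyResult.
    Ok(format!("-- SQL compiled from: {source}"))
}

/// Module definition. Building this with `maturin` produces an
/// extension module that Python can `import querydown`.
#[pymodule]
fn querydown(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(compile_to_sql, m)?)?;
    Ok(())
}
```

From the Django side, usage would then be an ordinary import (`import querydown; querydown.compile_to_sql(qd_source)`), which is part of why this integration path seems low-friction to me.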

As inspired by Markdown

I'll close by reiterating the analogy with Markdown. You can get started writing Markdown by making text bold and italic. Then you can learn as you go — eventually going quite deep by using Markdown to generate some fairly complex HTML. But Markdown will never be able to generate all HTML. Markdown is deliberately limited for the sake of simplicity. Querydown follows this same design philosophy and has a similar target audience.
